<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; url</title>
	<atom:link href="http://iregex.org/blog/tag/url/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Sun, 27 Jun 2010 04:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>
	<item>
		<title>[Translation] Detecting URLs in a Block of Text</title>
		<link>http://iregex.org/blog/translate-detecting-urls-in-a-block-of-text.html</link>
		<comments>http://iregex.org/blog/translate-detecting-urls-in-a-block-of-text.html#comments</comments>
		<pubDate>Fri, 07 Nov 2008 08:11:42 +0000</pubDate>
		<dc:creator>rex</dc:creator>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[translation]]></category>
		<category><![CDATA[url]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=36</guid>
		<description><![CDATA[The original author is Jan Goyvaerts (Regex Guru), and the original page is Detecting URLs in a Block of Text. Translator: rex; translator's blog: http://iregex.org. rex's note: URL is short for Uniform Resource Locator (wiki), known in Chinese as 统一资源定位符 (Baike)... ]]></description>
			<content:encoded><![CDATA[<h5>Original author: Jan Goyvaerts (<a href="http://www.regex-guru.info">Regex Guru</a>); original post: <a href="http://www.regex-guru.info/2008/11/detecting-urls-in-a-block-of-text/">Detecting URLs in a Block of Text</a>.<br />
Translator: rex; translator's blog: <a href="http://iregex.org">http://iregex.org</a>.</h5>
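<p>rex's note: URL is short for Uniform Resource Locator (<a target="_blank" href="http://en.wikipedia.org/wiki/URL">wiki</a>), known in Chinese as <strong>统一资源定位符</strong> (<a target="_blank" href="http://baike.baidu.com/view/1496.htm">Baike</a>). It is explained as follows: every web page on the Internet has a unique name that identifies it, usually called its URL address. Such an address can point to the local disk or to a computer on a local network, but most often it points to a site on the Internet. Put simply, a URL is a Web address, colloquially called a &#8220;网址&#8221; (web address) in Chinese.</p>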
<p><span id="more-36"></span></p>
<p>In his blog post <a target="_blank" href="http://www.codinghorror.com/blog/archives/001181.html">The Problem with URLs</a>, Jeff Atwood points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.</p>
<p>The short answer is that it <b>can&#8217;t be done</b>. Pretty much <b>any character is valid in URLs</b>. The very simplistic <tt class="regex">\bhttp://\S+</tt> fails to differentiate between punctuation that&#8217;s part of the URL and punctuation used to quote the URL. It also fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I&#8217;ve encountered quite a few web sites that use them over the years. Finally, it forgets other protocols, such as https.</p>
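<p>The three failure modes of the simplistic pattern can be reproduced with a minimal Python sketch. The sample sentences and the helper name <tt class="string">find_naive</tt> are illustrative assumptions, not part of the original article.</p>

```python
import re

# The simplistic pattern quoted in the article.
NAIVE_URL = re.compile(r"\bhttp://\S+")


def find_naive(text):
    """Return every substring matched by the simplistic pattern."""
    return NAIVE_URL.findall(text)


# Sentence punctuation after the URL is swallowed into the match.
print(find_naive("See http://example.com/page, then (http://example.org)."))

# A URL that contains a space is cut off at the space.
print(find_naive("Get http://example.com/my file.txt now"))

# Other schemes, such as https, are not found at all.
print(find_naive("Secure: https://example.com/"))
```

<p>The first call returns matches that still carry the trailing comma and the closing parenthesis with the full stop, which is exactly the punctuation problem described above.</p>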
<p>In <a href="http://www.regexbuddy.com/library.html">RegexBuddy&#8217;s library</a>, you&#8217;ll find this regex if you look up &#8220;URL: Find in full text&#8221;:</p>
<p><tt class="regex">\b(https?|ftp|file)://[-A-Z0-9+&amp;@#/%?=~_|!:,.;]*[A-Z0-9+&amp;@#/%=~_|]</tt> (case insensitive)</p>
<p>Like every other regex for extracting URLs, it&#8217;s <b>not perfect</b>. The key benefit of this regex is that it uses a separate character class for the last character in the URL, which allows fewer punctuation characters than the character class for the other characters in the URL. It excludes punctuation that is unlikely to occur at the end of the URL and more likely to belong to the sentence the URL is quoted in. It does not allow parentheses at all.</p>
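<p>As a rough illustration of how the separate final character class behaves, here is a minimal Python sketch. The sample sentence and the helper name <tt class="string">find_urls</tt> are assumptions made for this example; the pattern itself is the one quoted above, compiled with <tt class="string">re.IGNORECASE</tt>.</p>

```python
import re

# "URL: Find in full text" pattern from the article, case insensitive.
URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)


def find_urls(text):
    """Return the whole match (group 0), not just the captured scheme."""
    return [m.group(0) for m in URL_IN_TEXT.finditer(text)]


sample = (
    "Visit http://www.example.com/path?x=1, or "
    "(http://en.wikipedia.org/wiki/Regex_(disambiguation))."
)
print(find_urls(sample))
```

<p>The trailing comma is left out of the first match, but the second match stops before the opening parenthesis, because this pattern does not allow parentheses anywhere in the URL.</p>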
<p>In <a target="_blank" href="http://www.editpadpro.com/cscs.html">EditPad Pro&#8217;s syntax coloring schemes</a>, which are fully editable and entirely based on regular expressions, you&#8217;ll often find this regex:</p>
<p><tt class="regex">\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&amp;@#/%=~_|$?!:,.]*[A-Z0-9+&amp;@#/%=~_|$]</tt></p>
<p>(case insensitive)</p>
<p>The main difference from the previous regex is that this one matches URLs such as <tt class="string">www.regex-guru.info</tt> <b>without the <tt class="string">http://</tt> protocol</b>. People often type URLs that way in their documents and messages, because most browsers accept them that way too.</p>
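<p>A short Python sketch of the second regex, showing that it picks up host names typed without a scheme. The sample text and the helper name <tt class="string">find_urls</tt> are illustrative assumptions.</p>

```python
import re

# Second regex from the article: the scheme is optional when the text
# starts with "www." or "ftp.".
URL_OR_WWW = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def find_urls(text):
    """Return all matches; the pattern has no capturing groups."""
    return URL_OR_WWW.findall(text)


print(find_urls("Try www.regex-guru.info, or ftp.example.com/pub/file.zip!"))
```

<p>Both addresses are found, and the comma and the exclamation mark that follow them stay outside the matches.</p>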
<p>EditPad&#8217;s built-in &#8220;clickable URLs&#8221; syntax highlighting uses this regex:</p>
<p><tt class="regex">\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&amp;@#/%?=~_|$!:,.;]*[-A-Z0-9+&amp;@#/%=~_|$]</tt></p>
<p><tt class="regex">&#160;&#160; | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)</tt></p>
<p><tt class="regex">|&quot;(?:(?:https?|ftp|file)://|www\.|ftp\.)[^&quot;\r\n]+&quot;?</tt></p>
<p><tt class="regex">|&#39;(?:(?:https?|ftp|file)://|www\.|ftp\.)[^&#39;\r\n]+&#39;?</tt> (free-spacing, case insensitive)</p>
<p>This long regex adds three alternatives to the previous regex. It adds the ability to match <b>email addresses</b>, with or without <tt class="string">mailto:</tt>, and it matches <b>URLs between single or double quotes</b>. When the URL is quoted, it allows all characters in the URL except line breaks and the delimiting quote. This way, any URL with weird punctuation can be highlighted correctly by placing it between a pair of quote characters. Because this regex is used to highlight text as you type, the closing quotes are optional: until you type the closing quote, the highlighting runs to the end of the line. Remove the question marks after the quote characters if you will use this regex to extract URLs.</p>
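<p>For extraction rather than highlighting, the quoted alternatives can be tried on their own with the closing quote made mandatory. The following Python sketch does only that; the names <tt class="string">PREFIX</tt>, <tt class="string">QUOTED_URL</tt> and <tt class="string">extract_quoted</tt> and the sample text are assumptions for illustration, and the email and unquoted alternatives are deliberately left out.</p>

```python
import re

# Shared prefix used by both quoted alternatives in the article's regex.
PREFIX = r"(?:(?:https?|ftp|file)://|www\.|ftp\.)"

# Quoted alternatives only, with the question mark after each closing
# quote removed so that the closing quote is required.
QUOTED_URL = re.compile(
    r'"' + PREFIX + r'[^"\r\n]+"' + "|" + r"'" + PREFIX + r"[^'\r\n]+'",
    re.IGNORECASE,
)


def extract_quoted(text):
    """Return quoted URLs with the surrounding quote characters stripped."""
    return [m.group(0)[1:-1] for m in QUOTED_URL.finditer(text)]


print(extract_quoted(
    'Open "http://example.com/my file (v2).txt" or \'www.example.org/a,b!\' today.'
))
```

<p>Inside the quotes, spaces, parentheses and trailing punctuation all survive, because only a line break or the delimiting quote ends the match.</p>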
<p>So how about Jeff&#8217;s problem?</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>I couldn&#8217;t come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.</p>
</blockquote>
<p>That&#8217;s not too hard, if we add the restriction that we only allow unnested pairs of parentheses in URLs. Using the second regex in this article as the starting point, <b>add an alternative for a pair of parentheses to both character classes</b> in that regex:</p>
<p><tt class="regex">\b(?:(?:https?|ftp|file)://|www\.|ftp\.) </tt><br />
<tt class="regex">&#160; (?:\([-A-Z0-9+&amp;@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&amp;@#/%=~_|$?!:,.])* </tt><br />
<tt class="regex">&#160; (?:\([-A-Z0-9+&amp;@#/%=~_|$?!:,.]*\)|[A-Z0-9+&amp;@#/%=~_|$])</tt> (free-spacing, case insensitive)</p>
<p>This regex allows the same set of characters in the middle of the URL, mixed with zero or more sequences of those characters between parentheses. It allows the URL to end with the same reduced set of characters, or with a final run between parentheses. Because we require the opening parenthesis to be in the URL, we don&#8217;t have to do anything complicated to check whether any closing parenthesis we encounter is part of the URL.</p>
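<p>The parenthesis handling can be checked with a minimal Python sketch. It assumes that Python&#8217;s <tt class="string">re.VERBOSE</tt> flag is an acceptable stand-in for the free-spacing mode mentioned above (in Python, <tt class="string">#</tt> inside a character class stays literal in that mode). The sample sentences and the helper name <tt class="string">find_urls</tt> are made up for this example.</p>

```python
import re

# The parenthesis-aware regex from the article, written in free-spacing
# form. Every "#" below sits inside a character class, so re.VERBOSE
# treats it as a literal character and not as a comment marker.
URL_WITH_PARENS = re.compile(
    r"""
    \b(?:(?:https?|ftp|file)://|www\.|ftp\.)
      (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
      (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_urls(text):
    """Return all matches; the pattern has no capturing groups."""
    return URL_WITH_PARENS.findall(text)


WIKI = "http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)"

# URL that legitimately ends in a parenthesized run.
print(find_urls("See " + WIKI + " for details."))

# URL that the writer enclosed in parentheses.
print(find_urls("(more at http://example.com/page)"))

# Both at once: only the parenthesis opened inside the URL is kept.
print(find_urls("(" + WIKI + ")"))
```

<p>In the third call the match ends after the first closing parenthesis, since the second one has no opening partner inside the URL.</p>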
<p>It&#8217;s important to note that, in order to allow any number of pairs of parentheses in the middle of the regex, I <b>moved the star</b> from the character class to the group it is now in. I did <b>not add another star</b> to the group. A double-star combination like <tt class="regex">(a|b*)*</tt> is a sure-fire recipe for <a target="_blank" href="http://www.regular-expressions.info/catastrophic.html">catastrophic backtracking</a>.</p>
<p>All the regexes in this article will be included in RegexBuddy&#8217;s library with the next free minor update. The current version is 3.2.0.</p>
<p>rex's note: in RegexBuddy, press Alt+7 to open the built-in regex library. Type URL and press Enter, and the results look like this:</p>
<p>&#160; <br /><a title="我爱正则表达式" target="_blank" href="http://iregex.org"><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i37.tinypic.com/20sj0j7.jpg" /></a> </p>
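<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Translator's afterword:</h3>
<p>When I explore on my own, I walk with excitement; when someone leads the way, I walk steadily.</p>
<p>Zhuafan (抓饭) originally used regular expressions to extract Fanfou messages, but all of that text followed a well-defined format: XML. Illegal characters are of course unavoidable, and they cause errors when the data is displayed as XML. Embedded in HTML, however, they display without any trouble.</p>
<p>Fanfou, Jiwai, and Zuosa can all parse the URLs in a submitted message perfectly and then, on their own initiative, rewrite them into their own format. For example, Fanfou converts an entered <a href="http://regex.me">http://regex.me</a> into <a title="http://fanfou.com/linkto/aHR0cDovL3JlZ2V4Lm1l" href="http://fanfou.com/linkto/aHR0cDovL3JlZ2V4Lm1l">http://fanfou.com/linkto/aHR0cDovL3JlZ2V4Lm1l</a>, and appending a Chinese character or two directly after <a href="http://regex.me">http://regex.me</a> does not change the result. Twitter is not as capable. Entering &#8220;<a href="http://regex.me">http://regex.me</a>正则表达式交流论坛&#8221; (the URL followed by the Chinese for &#8220;regular expression discussion forum&#8221;) in Twitter produces <a href="http://tinyurl.com/5azr6y">http://tinyurl.com/5azr6y</a>. Click to expand it and the URL becomes <a title="http://www.regex.xn--me-y82c39klqi9nmf5umndl14f76g8ol/" href="http://www.regex.me">http://www.regex.me正则表达式交流论坛/</a>, which is clearly a wrong URL. The cause is that Twitter did not correctly extract the URL from the text.</p>
<p><a target="_blank" href="http://fanfou.com/statuses/xWerhzyROOU">撕烤者</a> once said: &#8220;One advantage of Chinese is that we can easily tell it apart from code.&#8221; I quite agree. The difference between Chinese and English is not only obvious to the naked eye; a program can tell them apart effortlessly as well. The distinct character encodings make it easier to parse the HTML correctly and find where the URL ends. Unfortunately, Twitter handles this very poorly.</p>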
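<p>The point about Chinese text following a URL can be sketched in Python. This only illustrates the idea and is not how Fanfou or Twitter actually work: the simplistic <tt class="regex">\bhttp://\S+</tt> runs straight into the Chinese characters, while a pattern built from ASCII-only character classes (here the second regex from the translated article) stops where the Chinese text begins. The variable names are assumptions for this example.</p>

```python
import re

# Simplistic pattern: any run of non-whitespace counts as part of the URL.
NAIVE = re.compile(r"\bhttp://\S+")

# Second regex from the article: only ASCII URL characters are allowed.
ASCII_CLASS = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

# The message from the afterword: a URL immediately followed by Chinese.
message = "http://regex.me正则表达式交流论坛"

naive_matches = NAIVE.findall(message)
ascii_matches = ASCII_CLASS.findall(message)

print(naive_matches)  # the Chinese text is swallowed into the match
print(ascii_matches)  # the match ends where the Chinese text starts
```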
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/translate-detecting-urls-in-a-block-of-text.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Python regular expression links</title>
		<link>http://iregex.org/blog/python-regular-expression.html</link>
		<comments>http://iregex.org/blog/python-regular-expression.html#comments</comments>
		<pubDate>Mon, 26 May 2008 15:43:24 +0000</pubDate>
		<dc:creator>rex</dc:creator>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[url]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=12</guid>
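		<description><![CDATA[I've recently become hooked on Python and can't stop praising its triple-quoted strings. The UTF-8 string problem that had long plagued me in Perl was solved cleanly in Python. I mean the encoding problem when sending direct messages in the fanfou application I've been writing. Calling the Fanfou API to send to Fanfou... ]]></description>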
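			<content:encoded><![CDATA[<p>I've recently become hooked on Python and can't stop praising its triple-quoted strings. The UTF-8 string problem that had long plagued me in Perl was solved cleanly in Python. I mean the encoding problem when sending direct messages in the <a href="http://code.google.com/p/fanfoufans/" target="_blank">fanfou application</a> I've been writing. Sending ordinary messages through the Fanfou API works fine, because it accepts both UTF-8 and GB2312; direct messages, however, must be encoded as UTF-8. The most telling test case is sending the two characters &#8220;<font color="#ff0084">联通</font>&#8221;.</p>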
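<p>The post does not say why &#8220;联通&#8221; is such a telling test case. One likely reason, offered here as an assumption, is that its GB2312 bytes happen to have the bit pattern of two-byte UTF-8 sequences, so lenient encoding detectors misread them, while a strict UTF-8 decoder rejects them. A minimal Python 3 sketch (the helper <tt class="string">is_valid_utf8</tt> is hypothetical and has nothing to do with the Fanfou API):</p>

```python
# The two characters from the post, written as escapes: 联 (U+8054), 通 (U+901A).
text = "\u8054\u901a"

gb_bytes = text.encode("gb2312")   # what a GB2312 client would send
utf8_bytes = text.encode("utf-8")  # what a UTF-8-only endpoint expects


def is_valid_utf8(data):
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(gb_bytes.hex(), is_valid_utf8(gb_bytes))
print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
```

<p>Encoding explicitly with <tt class="string">.encode("utf-8")</tt> before sending is the kind of fix the paragraph above alludes to.</p>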
<p>Enough small talk; on to the topic at hand, regular expressions in Python. I recommend two links:</p>
<ol>
<li><a target="_blank" href="http://wiki.ubuntu.org.cn/Python%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%93%8D%E4%BD%9C%E6%8C%87%E5%8D%97">Python Regular Expression HOWTO (Python正则表达式操作指南)</a>, written by A.M. Kuchling (amk@amk.ca), translated by FireHare, and published on <a target="_blank" href="http://forum.ubuntu.org.cn">ubuntu.org.cn</a>. This document works well as a reference manual.</li>
<li><a target="_blank" href="http://www.woodpecker.org.cn/diveintopython/regular_expressions/index.html">The regular expressions chapter of Dive Into Python, online</a>, published on <a target="_blank" href="http://www.woodpecker.org.cn"><span class="trans">啄木鸟</span></a> (Woodpecker). It explains the subject in an accessible way, starts from examples, and is well suited to self-study.</li>
</ol>
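<p>The book <i>Mastering Regular Expressions</i> devotes an entire chapter each to Perl and PHP, but has no separate chapter for Python. Fortunately, once you know the general shape of regular expressions, all that remains is looking up syntax, and the first link above is enough for that.</p>
<p>Here is one more quote from <i>Mastering Regular Expressions</i>:</p>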
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>I <b><i><span class="docEmphasis">thought</span> </i></b>I knew regular expressions until I read  <i><span class="docEmphasis">Mastering Regular Expressions. <b>Now</b></span></i><b> </b>I do.</p></blockquote>
<p>I long to reach the point of having <strong>mastered</strong> regular expressions.</p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-regular-expression.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
