[Translation] Extracting Valid URLs from Text
2008.11.07 4:11 pm
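The original author is Jan Goyvaerts (Regex Guru); the original page is Detecting URLs in a Block of Text.
Translator: rex; translator's blog: http://iregex.org.
rex's note: URL is short for Uniform Resource Locator (wiki), called 统一资源定位符 in Chinese (Baike). The explanation: every web page on the Internet has a unique name that identifies it, usually called its URL address. Such an address can point to the local disk or to a computer on a local network, but most often it points to a site on the Internet. Put simply, a URL is a web address.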
In his blog post The Problem with URLs, Jeff Atwood points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.
The short answer is that it can’t be done. Pretty much any character is valid in URLs. The very simplistic \bhttp://\S+ fails to differentiate between punctuation that’s part of the URL and punctuation used to quote the URL. It also fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I’ve encountered quite a few web sites that use them over the years. Finally, it forgets other protocols, such as https.
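To make these failures concrete, here is a minimal Python sketch that runs the simplistic regex over a sample sentence. The sample text and the example.com URLs are assumptions made up for illustration; they are not from the original article.

```python
import re

# The simplistic pattern from the text: "http://" followed by any run of non-whitespace.
NAIVE = re.compile(r"\bhttp://\S+")

# Hypothetical sample text (not from the article).
text = (
    "See http://example.com/page.html, or (http://example.com/faq). "
    "Also https://example.com/secure"
)

matches = NAIVE.findall(text)
for m in matches:
    print(m)
# The first match keeps the trailing comma, the second keeps ")." and
# the https URL is not matched at all.
```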
In RegexBuddy’s library, you’ll find this regex if you look up “URL: Find in full text”:
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|] (case insensitive)
Like every other regex for extracting URLs, it’s not perfect. The key benefit of this regex is that it uses a separate character class for the last character in the URL, which allows fewer punctuation characters than the character class for the other characters in the URL. It excludes punctuation that is unlikely to occur at the end of the URL and more likely to be part of the sentence the URL is quoted in. It does not allow parentheses at all.
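The effect of the separate final character class can be sketched in Python as follows. The pattern is the library regex quoted above; the sample text and example.com URLs are assumptions for illustration only.

```python
import re

# The "URL: Find in full text" regex from the text, case insensitive.
LIBRARY_URL = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)

# Hypothetical sample text (not from the article).
text = (
    "See http://example.com/page.html, or (http://example.com/faq). "
    "Also https://example.com/secure?x=1!"
)

# group(0) is the whole match; findall() would return only the captured protocol.
found = [m.group(0) for m in LIBRARY_URL.finditer(text)]
for url in found:
    print(url)
```

The trailing comma, the closing parenthesis with its period, and the final exclamation mark are all left out, because none of them is allowed as the last character.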
In EditPad Pro’s syntax coloring schemes, which are fully editable and entirely based on regular expressions, you’ll often find this regex:
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
(case insensitive)
The main difference from the previous regex is that this one also matches URLs such as www.regex-guru.info that lack the http:// protocol. People often type URLs that way in their documents and messages, because most browsers accept them that way too.
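A short Python sketch of this behavior, using the regex above. The sample sentence and the ftp.example.com host are assumptions for illustration.

```python
import re

# The EditPad Pro coloring-scheme regex from the text, case insensitive.
EDITPAD_URL = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

# Hypothetical sample text (not from the article).
text = "Visit www.regex-guru.info, or ftp.example.com/pub."

bare = EDITPAD_URL.findall(text)  # no capturing groups, so whole matches
print(bare)
```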
EditPad’s built-in “clickable URLs” syntax highlighting uses this regex:
\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
| ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"?
|'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'? (free-spacing, case insensitive)
This long regex adds three alternatives to the previous regex. It adds the ability to match email addresses, with or without mailto:, and it matches URLs between single or double quotes. When the URL is quoted, it allows all characters in the URL except line breaks and the delimiting quote. This way, any URL with weird punctuation can be highlighted correctly by placing it between a pair of quote characters. Because this regex is used to highlight text as you type, the closing quotes are optional: the highlighting runs to the end of the line until you type the closing quote. Remove the question marks after the quote characters if you will use this regex to extract URLs.
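As a rough illustration of the extraction variant, the Python sketch below uses the regex with the question marks after the closing quotes removed, so a quoted URL only matches when its closing quote is present. The sample text, the address joe@example.com, and the example.com URL are assumptions for illustration. Python's re.VERBOSE stands in for free-spacing mode here.

```python
import re

# Extraction variant: closing quotes are required (no "?" after them).
EXTRACT = re.compile(r"""
    \b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
      | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
    |"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"
    |'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'
""", re.IGNORECASE | re.VERBOSE)

# Hypothetical sample text (not from the article).
text = 'Mail joe@example.com or open "http://example.com/a b(c).html" today.'

# Take the whole match, then strip the delimiting quotes from quoted URLs.
raw = [m.group(0) for m in EXTRACT.finditer(text)]
cleaned = [s.strip("\"'") for s in raw]
print(cleaned)
```

The quoted alternative keeps the space and the parentheses because everything up to the closing quote is accepted.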
So how about Jeff’s problem?
I couldn’t come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.
That’s not too hard, if we add the restriction that we only allow unnested pairs of parentheses in URLs. Using the second regex in this article as the starting point, add an alternative for a pair of parentheses to both character classes in that regex:
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$]) (free-spacing, case insensitive)
This regex allows the same set of characters in the middle of the URL, mixed with zero or more sequences of those characters between parentheses. It allows the URL to end with the same reduced set of characters, or with a final run between parentheses. Because we require the opening parenthesis to be in the URL, we don’t have to do anything complicated to check whether any closing parenthesis we encounter is part of the URL or not.
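Here is a small Python sketch of the two cases Jeff describes: a URL that ends in a parenthesized part, and a URL that the writer wrapped in parentheses. The regex is the one above; the sample sentence and its URLs are assumptions for illustration.

```python
import re

# The parenthesis-aware regex from the text (free-spacing, case insensitive).
PAREN_URL = re.compile(r"""
    \b(?:(?:https?|ftp|file)://|www\.|ftp\.)
    (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
    (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
""", re.IGNORECASE | re.VERBOSE)

# Hypothetical sample text (not from the article).
text = (
    "Regex (see http://en.wikipedia.org/wiki/Regex_(disambiguation)) "
    "and (http://example.com/faq)."
)

urls = PAREN_URL.findall(text)
for u in urls:
    print(u)
```

In the first URL the pair "(disambiguation)" is kept because its opening parenthesis is inside the URL. The enclosing parentheses around both URLs are left out because their opening parentheses come before the URL starts.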
It’s important to observe that, in order to allow any number of pairs of parentheses in the middle of the regex, I moved the star from the character class to the group it is now in. I did not add another star to the group. A double-star combination like (a|b*)* is a sure-fire recipe for catastrophic backtracking.
All the regexes in this article will be included in RegexBuddy’s library with the next free minor update. The current version is 3.2.0.
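rex's note: In RegexBuddy, press Alt+7 to open the built-in regex library. Type URL and press Enter; the results are as follows:
Translator's afterword:
When I explore on my own, I walk with excitement; when someone leads the way, I walk steadily.
抓饭 originally used regular expressions to extract Fanfou messages, but all of that text followed a well-defined format: XML. Illegal characters are of course unavoidable, and they cause errors when the text is displayed as XML. Embedded in HTML, however, they display without any trouble.
Fanfou, Jiwai, and Zuosa all parse the URLs in submitted messages perfectly and take it upon themselves to convert them to their own format. For example, Fanfou converts the input http://regex.me into http://fanfou.com/linkto/aHR0cDovL3JlZ2V4Lm1l, and adding a Chinese character or two right after http://regex.me does not affect the conversion. Twitter is not as capable. Entering “http://regex.me正则表达式交流论坛” (the URL followed directly by the Chinese for “regex discussion forum”) in Twitter produces http://tinyurl.com/5azr6y; expanded, the URL becomes http://www.regex.me正则表达式交流论坛/, which is clearly a wrong URL. The cause is that Twitter did not parse the URL out of the text correctly.
撕烤者 once said: “One advantage of Chinese is that we can easily tell it apart from code.” I quite agree. The difference between Chinese and English is not only obvious to the naked eye; a program can tell them apart effortlessly too. The different character ranges make it easier to parse HTML code correctly and to find the point where a URL ends. Unfortunately, Twitter handles this really poorly.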
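The point about telling Chinese text apart from the URL can be sketched in Python with the second regex from the article, whose character classes contain only ASCII characters. This is only an illustration under assumptions, not how Fanfou or Twitter actually work. The second sample string (with 访问, "visit", in front of the URL) is made up. The re.ASCII flag is my addition: it makes \b treat Chinese characters as non-word characters, so a URL that directly follows Chinese text can still start a match.

```python
import re

# Second regex from the article; re.ASCII is an added assumption so that
# \b sees a boundary between a Chinese character and the "h" of "http".
URL_ASCII = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE | re.ASCII,
)

samples = [
    "http://regex.me正则表达式交流论坛",      # the input quoted in the afterword
    "访问http://regex.me正则表达式交流论坛",  # hypothetical: Chinese before the URL too
]

extracted = [URL_ASCII.search(s).group(0) for s in samples]
print(extracted)
```

Because no Chinese character is in either character class, the match stops at the last ASCII character of the URL.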


