<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; rex</title>
	<atom:link href="http://iregex.org/blog/author/admin/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Sun, 27 Jun 2010 04:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>	
	<item>
		<title>Notes on Matching Chinese with Regular Expressions in Python</title>
		<link>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html</link>
		<comments>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html#comments</comments>
		<pubDate>Sun, 27 Jun 2010 03:50:41 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[cjk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=129</guid>
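		<description><![CDATA[A summary of lessons learned while matching Chinese text with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python. From a string-processing point of view, Chinese text is less tidy and regular than English, and there is no way around that. This post draws on material from the web and my own... ]]></description>
			<content:encoded><![CDATA[<p>A summary of lessons learned while matching Chinese text with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python.</p>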
<p><span id="more-129"></span></p>
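<p>From a string-processing point of view, Chinese text is less tidy and regular than English, and there is no way around that. This post draws on material from the web and my own experience to give a short summary, using Python as the example language. Additions and corrections are welcome.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A Few Tips</h3>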
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Use the <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function to see the raw representation of a string. This helps when writing a regular expression.
            </li>
<li>Python's <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span></span></code> module has two similar functions: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span>.<span style="color: black;">match</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. Both match in exactly the same way and differ only in where they start. <code class="codecolorer python default"><span class="python">match</span></code> tries only at the beginning of the string and gives up if that fails, whereas <code class="codecolorer python default"><span class="python">search</span></code> persistently tries every possible position in the string until it either finds a match or runs out of string and fails. If you understand how <code class="codecolorer python default"><span class="python">match</span></code> behaves (it is faster in some cases), feel free to use it; if you are unsure, <code class="codecolorer python default"><span class="python">search</span></code> is usually the function you want.</li>
<li>To find every match in a body of text and get the results back as a list, use <code class="codecolorer python default"><span class="python">findall<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. The sample program later in this post shows an example.</li>
<li>In <code class="codecolorer python default"><span class="python">utf8</span></code>, each commonly used Chinese character takes up three bytes, so the regex for one character is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#91;</span>\x80-\xff<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#125;</span></span></code>. You probably know this already.</li>
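<li>In <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, a Chinese character is written as <code class="codecolorer python default"><span class="python">\uXXXX</span></code>. Once you know the code point range of a script, you can match text written in it, which makes it easy to pick out one language from multilingual text. For Japanese, however, whose text mixes Chinese characters with hiragana and katakana, the results may be somewhat off.</li>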
<li>Several ranges can sit side by side in one character class. For example, hiragana, katakana, and Chinese characters can be combined as <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span></span></code> to define exactly the text you want to match.</li>
<li>When matching Chinese, the regular expression and the target string must be of the same type. This is crucial. Either both are the default <code class="codecolorer python default"><span class="python">utf8</span></code> byte strings, in which case you need do nothing extra, or both are <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, in which case the regex must be written with the <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;&quot;</span></span></code> prefix.</li>
<li>A <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string can be defined like this: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">string</span>=u<span style="color: #483d8b;">&quot;我爱正则表达式&quot;</span></span></code>. If a string is not <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, convert it with the <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function. If you know the encoding of the source string, use the form <code class="codecolorer python default"><span class="python">newstr=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>oldstring, original_coding_name<span style="color: black;">&#41;</span></span></code>. On Linux this is commonly <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>, <span style="color: #483d8b;">&quot;utf8&quot;</span><span style="color: black;">&#41;</span></span></code>, while on Windows the encoding would probably be <code class="codecolorer python default"><span class="python">cp936</span></code>, though I have not tested that.</li>
</ul>
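<p>The tips above about byte strings versus <code>unicode</code> and about <code>match</code> versus <code>search</code> can be tried in a short script. This is only a sketch, and it assumes Python 3 rather than the Python 2 used in the rest of this post: there <code>str</code> plays the role of <code>unicode</code> and <code>bytes</code> the role of a utf8 byte string. All variable names are made up for the illustration.</p>

```python
import re

text = "regex: 正则表达式"      # str (what Python 2 calls unicode)
data = text.encode("utf-8")    # bytes (a utf8 byte string)

# In utf8 each of these Chinese characters occupies three bytes,
# so a three-byte class matches one character at a time.
chunks = re.findall(rb"[\x80-\xff]{3}", data)
print([c.decode("utf-8") for c in chunks])

# A str pattern on a str target can use a code point range instead.
words = re.findall(r"[\u4e00-\u9fa5]+", text)
print(words)

# match() only tries position 0; search() scans the whole string.
at_start = re.match(r"[\u4e00-\u9fa5]+", text)    # None: text starts with ASCII
anywhere = re.search(r"[\u4e00-\u9fa5]+", text)
print(at_start, anywhere.group())

# Pattern and target must be the same type, otherwise re raises TypeError.
try:
    re.findall(r"[\u4e00-\u9fa5]+", data)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

<p>The last step is the one that most often goes wrong in practice: a text pattern applied to bytes fails outright in Python 3, whereas Python 2 tends to return no match without raising an error.</p>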
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Sample Program</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;py_utf8_unicode.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-06-27 09:11</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> findPart<span style="color: black;">&#40;</span>regex, text, name<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; res=<span style="color: #dc143c;">re</span>.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>regex, text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;There are %d %s parts:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>res<span style="color: black;">&#41;</span>, name<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>,r<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><br />
<br />
<span style="color: #808080; font-style: italic;">#sample is utf8 by default.</span><br />
sample=<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'en: Regular expression is a powerful tool for manipulating text.<br />
zh: 正则表达式是一种很有用的处理文本的工具。<br />
jp: 正規表現は非常に役に立つツールテキストを操作することです。<br />
jp-char: あアいイうウえエおオ<br />
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.<br />
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃<br />
'</span><span style="color: #483d8b;">''</span><br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw utf8 string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#find the non-ascii chars:</span><br />
findPart<span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]+&quot;</span>,sample,<span style="color: #483d8b;">&quot;non-ascii&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#convert the utf8 to unicode</span><br />
usample=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sample,<span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw unicode string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>usample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#get the parts for each language:</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode chinese&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>ac00-<span style="color: #000099; font-weight: bold;">\u</span>d7ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode korean&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese katakana&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese hiragana&quot;</span><span style="color: black;">&#41;</span> <br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3000-<span style="color: #000099; font-weight: bold;">\u</span>303f<span style="color: #000099; font-weight: bold;">\u</span>fb00-<span style="color: #000099; font-weight: bold;">\u</span>fffd]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode cjk Punctuation&quot;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>The output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">the raw utf8 string is:<br />
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'<br />
<br />
There are 14 non-ascii parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具。<br />
&nbsp; &nbsp; 正規表現は非常に役に立つツールテキストを操作することです。<br />
&nbsp; &nbsp; あアいイうウえエおオ<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
&nbsp; &nbsp; 。？！、，；：“<br />
&nbsp; &nbsp; ”‘<br />
&nbsp; &nbsp; ’——……·－·《》〈〉！￥％＆＊＃<br />
<br />
the raw unicode string is:<br />
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'<br />
<br />
There are 6 unicode chinese parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具<br />
&nbsp; &nbsp; 正規表現<br />
&nbsp; &nbsp; 非常<br />
&nbsp; &nbsp; 役<br />
&nbsp; &nbsp; 立<br />
&nbsp; &nbsp; 操作<br />
<br />
There are 8 unicode korean parts:<br />
<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
<br />
There are 6 unicode japanese katakana parts:<br />
<br />
&nbsp; &nbsp; ツールテキスト<br />
&nbsp; &nbsp; ア<br />
&nbsp; &nbsp; イ<br />
&nbsp; &nbsp; ウ<br />
&nbsp; &nbsp; エ<br />
&nbsp; &nbsp; オ<br />
<br />
There are 11 unicode japanese hiragana parts:<br />
<br />
&nbsp; &nbsp; は<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; つ<br />
&nbsp; &nbsp; を<br />
&nbsp; &nbsp; することです<br />
&nbsp; &nbsp; あ<br />
&nbsp; &nbsp; い<br />
&nbsp; &nbsp; う<br />
&nbsp; &nbsp; え<br />
&nbsp; &nbsp; お<br />
<br />
There are 5 unicode cjk Punctuation parts:<br />
<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。？！、，；：<br />
&nbsp; &nbsp; －<br />
&nbsp; &nbsp; 《》〈〉！￥％＆＊＃</div></td></tr></tbody></table></div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Regex Notes</title>
		<link>http://iregex.org/blog/regex-note-20100621.html</link>
		<comments>http://iregex.org/blog/regex-note-20100621.html#comments</comments>
		<pubDate>Mon, 21 Jun 2010 15:04:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[pos]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=128</guid>
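		<description><![CDATA[Three notes, posted here. Case-Insensitive First Letters For a while, when writing regular expressions to match Drug keywords, I often wrote expressions like /viagra&#124;cialis&#124;anti-ed/ . To make them look tidier I would sort the keywords; to... ]]></description>
			<content:encoded><![CDATA[<p>Three notes, posted here.</p>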
<p><span id="more-128"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Case-Insensitive First Letters</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>For a while, when writing regular expressions to match <code class="codecolorer text default"><span class="text">Drug</span></code> keywords, I often wrote expressions like <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">|</span>cialis<span style="color: #339933;">|</span>anti<span style="color: #339933;">-</span>ed<span style="color: #339933;">/</span></span></code> . To make them look tidier I would sort the keywords; for speed I would use <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span>Vv<span style="color: #009900;">&#93;</span>iagra<span style="color: #339933;">/</span></span></code> rather than <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">/</span>i</span></code> , so that only the necessary part is matched case-insensitively. More precisely, I needed case-insensitive matching on the first letter of each word.</p>
<p>I wrote the following function to do the conversion in bulk.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#convert regex to sorted list, then provide both lower/upper case for the first letter of each word</span><br />
<span style="color: #666666; font-style: italic;">#luf means lower upper first</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> luf<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; split the regex with the delimiter |</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@arr</span><span style="color: #339933;">=</span><span style="color: #000066;">sort</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\|/</span><span style="color: #339933;">,</span><span style="color: #000066;">shift</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; provide both the upper and lower case for the &nbsp;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; first leffer of each word </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #0000ff;">\b</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>zA<span style="color: #339933;">-</span>Z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\l</span><span style="color: #0000ff;">$1</span><span style="color: #0000ff;">\u</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">/</span>g<span style="color: #339933;">;</span><span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; join the keyword to a regex again</span><br />
&nbsp; &nbsp; <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'|'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000066;">print</span> luf <span style="color: #ff0000;">&quot;sex pill|viagra|cialis|anti-ed&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; the output is:[aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Controlling Where the Next Global Match Starts</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>I remember jyf once asked me how to control where a match starts. Well, now I can answer that question. Perl provides the pos function, which can adjust where the next match starts during a <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>g</span></code> global match. For example:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>The matches are successive pairs of letters, that is, <code class="codecolorer text default"><span class="text">ab, cd, ef</span></code></p>
<p>You can use pos($_) to reposition where the next match starts, for example:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">pos</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">--;</span> &nbsp;<span style="color: #666666; font-style: italic;">#pos($_)++;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Output:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">pos($_)--: &nbsp;ab, bc, cd, de, ef, fg.<br />
pos($_)++: &nbsp;ab, de.</div></td></tr></tbody></table></div>
<p>See the <a href="http://perldoc.perl.org/functions/pos.html" title="我爱正则表达式" target="_blank">pos</a> section of the Perl documentation for details.</p>
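<p>Python's <code>re</code> module has no direct counterpart to pos, but the <code>search</code> method of a compiled pattern takes a start position, so a similar effect can be sketched as below. The function <code>pairs</code> and its <code>step</code> parameter are hypothetical names for this illustration, written for Python 3.</p>

```python
import re

def pairs(text, step=0):
    """Collect two-character matches, shifting the restart point by
    `step` after each match, much like adjusting pos($_) in a /g loop."""
    if step not in (-1, 0, 1):
        # A step of -2 or lower would restart at or before the previous
        # match and loop forever; larger steps are left out for brevity.
        raise ValueError("step must be -1, 0 or 1")
    pattern = re.compile(r"..")
    found = []
    pos = 0
    while True:
        mo = pattern.search(text, pos)
        if mo is None:
            break
        found.append(mo.group())
        pos = mo.end() + step   # where the next search begins
    return found

print(pairs("abcdefg"))       # ['ab', 'cd', 'ef']
print(pairs("abcdefg", -1))   # ['ab', 'bc', 'cd', 'de', 'ef', 'fg']
print(pairs("abcdefg", 1))    # ['ab', 'de']
```

<p>The three calls reproduce the three Perl results shown above: the plain loop, pos($_)-- and pos($_)++.</p>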
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Hashes and Regex Substitution</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
Chapter 3 of "effective-perl-2e" has this example (see the code below), which escapes special characters.</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">my %ent = ( '&amp;' =&gt; 'amp', '&lt;' =&gt; 'lt', '&gt;' =&gt; 'gt' );<br />
$html =~ s/([&amp;&lt;&gt;])/&amp;$ent{$1};/g;</div></td></tr></tbody></table></div>
<p>This example is very clever. It makes flexible use of the hash data structure, taking the text to be replaced as the key and the corresponding replacement as the value. Whenever there is a match, the matched text is captured and used as a key to look up the value, which is then used in the substitution. It shows the efficiency of a high-level language.</p>
<p>But can Perl code like this be ported to Python? Python also supports regular expressions and hashes (called dictionaries in Python), but it does not seem to support anything this fancy inside a replacement (interpolating variables into the replacement string).</p>
<p>Consulting the Python documentation (run python in a shell, then import re, then help(re)) turns up this:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sub(pattern, repl, string, count=0)<br />
&nbsp; &nbsp; Return the string obtained by replacing the leftmost<br />
&nbsp; &nbsp; non-overlapping occurrences of the pattern in string by the<br />
&nbsp; &nbsp; replacement repl. &nbsp;repl can be either a string or a callable;<br />
&nbsp; &nbsp; if a string, backslash escapes in it are processed. &nbsp;If it is<br />
&nbsp; &nbsp; a callable, it's passed the match object and must return<br />
&nbsp; &nbsp; a replacement string to be used.</div></td></tr></tbody></table></div>
<p>So Python, like PHP, accepts a callable as the replacement. The callable receives a single argument, the match object. That makes the problem simple:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<br />
ent=<span style="color: black;">&#123;</span><span style="color: #483d8b;">'&lt;'</span>:<span style="color: #483d8b;">&quot;lt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&gt;'</span>:<span style="color: #483d8b;">&quot;gt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&amp;'</span>:<span style="color: #483d8b;">&quot;amp&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> rep<span style="color: black;">&#40;</span>mo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&amp;&quot;</span> + ent<span style="color: black;">&#91;</span>mo.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> + <span style="color: #483d8b;">&quot;;&quot;</span><br />
<br />
html=<span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;([&amp;&lt;&gt;])&quot;</span>,rep, html<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
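<p>The key to a Python replacement callback is that its argument is a match object. Once that is clear, check the manual to see what methods and attributes the match object offers and use them as needed; that is enough to write flexible, efficient regex replacements in Python.</p>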
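<p>As a small illustration of what the match object offers (a sketch of my own, not part of the Perl original; the function name <code>describe</code> and the sample string are made up), the callback below uses <code>group()</code>, <code>start()</code> and <code>end()</code> to annotate every run of digits with its position:</p>

```python
import re

def describe(mo):
    # mo is the match object that re.sub passes in for each match
    return "[{}@{}-{}]".format(mo.group(0), mo.start(), mo.end())

# Every run of digits is replaced by whatever the callback returns.
result = re.sub(r"\d+", describe, "a12b345")
print(result)
```

<p>The same pattern works with any function of one argument that returns a string, including a lambda.</p>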
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-note-20100621.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Superor's regular expression video tutorials</title>
		<link>http://iregex.org/blog/regex-tutorial-by-superor.html</link>
		<comments>http://iregex.org/blog/regex-tutorial-by-superor.html#comments</comments>
		<pubDate>Sun, 20 Jun 2010 01:35:06 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=127</guid>
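		<description><![CDATA[While browsing CU I came across Superor's "Exploring the World of Perl (updated to episode 40): Perl teaching videos" (by a Chinese author, in Chinese), five episodes of which cover regular expressions. I liked them, so I am posting them here. I previously posted the regular expression videos by Yu Sheng (余晟), but... ]]></description>
			<content:encoded><![CDATA[<p>While browsing CU I came across Superor's <a href="http://bbs.chinaunix.net/thread-1707137-1-2.html">Exploring the World of Perl (updated to episode 40): Perl teaching videos</a> (by a Chinese author, in Chinese), five episodes of which cover regular expressions. I liked them, so I am posting them here.<span id="more-127"></span></p>
<p>I previously posted the regular expression videos by Yu Sheng (余晟), but for various reasons beyond anyone's control the copies uploaded to the major hosting sites have gradually become unreachable. I originally got them from my old friend 牛腩粉; the thread is <a href="http://tieba.baidu.com/f?kz=464065073">here</a>, so you can leave a message and try your luck.</p>
<p>Superor's videos are not limited to regular expressions; they are a systematic Perl course, and I have simply excerpted the regex part. Superor has said that when he learned Perl he had plenty of books on hand, but instead of reading them cover to cover he read, in full, whichever part he needed at the moment. Regular expressions can be learned the same way.</p>
<p>The videos are streamed online and the quality is good, although ads are inserted.</p>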
<p>Episode 20: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3748" target="_blank">http://www.boobooke.com/v/bbk3748</a></p>
<p>Episode 21: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3749" target="_blank">http://www.boobooke.com/v/bbk3749</a></p>
<p>Episode 22: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3750" target="_blank">http://www.boobooke.com/v/bbk3750</a></p>
<p>Episode 23: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3751" target="_blank">http://www.boobooke.com/v/bbk3751</a></p>
<p>Episode 24: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3752" target="_blank">http://www.boobooke.com/v/bbk3752</a> </p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-tutorial-by-superor.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A short discussion of "exclusion matching"</title>
		<link>http://iregex.org/blog/negate-match.html</link>
		<comments>http://iregex.org/blog/negate-match.html#comments</comments>
		<pubDate>Mon, 24 May 2010 08:46:29 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[问答]]></category>
		<category><![CDATA[exclude]]></category>
		<category><![CDATA[lookaround]]></category>
		<category><![CDATA[negate]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=122</guid>
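		<description><![CDATA[A reader, cfc4n, asked a regex question about (?!). After answering it, I summarized how to match the absence of something in Perl and am posting the summary here. The problem. Problem description: given the following text, how can a regular expression match only the items that contain no color option... ]]></description>
			<content:encoded><![CDATA[<p>A reader, cfc4n, asked a regex question about (?!). After answering it, I summarized how to match the absence of something in Perl and am posting the summary here.<span id="more-122"></span></p>
<h2 style="background-color: rgb(153, 204, 0); border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); font-size: 21px; line-height: 35px; padding-top: 3px; text-indent: 6px;">The problem</h2>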
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Problem description</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
    Given the following text, how can a regular expression match only <b>the items that contain no color option</b>?</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&lt;item&gt;<br />
&nbsp; &nbsp; color:red;<br />
&lt;/item&gt;<br />
&lt;item&gt;<br />
&nbsp; &nbsp; size:12;<br />
&nbsp; &nbsp; number:45;<br />
&nbsp; &nbsp; type:good;<br />
&lt;/item&gt;</div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A typical wrong answer</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Beginners tend to offer this wrong answer: <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.*?&lt;/</span>item<span style="color: #339933;">&gt;</span></span></code>. The intent is right: the match is wanted only when color does not occur in the target string. In practice, though, this regex does not do what you hoped; it matches every <code class="codecolorer text default"><span class="text">&lt;item&gt;...&lt;/item&gt;</span></code>. Why?</p>
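<p>A quick check in Python hints at the reason (a sketch only; the sample text is the one from the problem description, and the variable names are mine). The lookahead is tested at a single position, and right after the opening tag the next characters are a newline and spaces rather than color, so the assertion passes and the lazy quantifiers simply carry on to the closing tag:</p>

```python
import re

text = """<item>
    color:red;
</item>
<item>
    size:12;
    number:45;
    type:good;
</item>"""

# The naive pattern: (?!color) is checked once, wherever the first .*? stops.
naive = re.compile(r"<item>.*?(?!color).*?</item>", re.S)
matches = naive.findall(text)

# Both items are returned, including the one that contains color.
print(len(matches))
print("color" in matches[0])
```

<p>The re.S flag plays the role of Perl's s modifier here, letting the dot match newlines.</p>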
</blockquote>
</blockquote>
<h2 style="background-color: rgb(153, 204, 0); border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); font-size: 21px; line-height: 35px; padding-top: 3px; text-indent: 6px;">Exclusion matching in Perl</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The simplest exclusion match</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Matching is <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">=~</span></span></code>, so not matching is naturally <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">!~</span></span></code>. That reminds me: in regex notation, every construct written with <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">=</span></span></code> has a counterpart in which <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">!</span></span></code> replaces it to express the opposite meaning, for example <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?=</span><span style="color: #009900;">&#41;</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span><span style="color: #009900;">&#41;</span></span></code>, <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;=</span><span style="color: #009900;">&#41;</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!</span><span style="color: #009900;">&#41;</span></span></code>, <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">=~</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">!~</span></span></code>.</p>
<p>Back to the topic, here is an example. To test whether a string contains good, use <code class="codecolorer perl default"><span class="perl"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$string</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/good/</span><span style="color: #009900;">&#41;</span></span></code>; the condition is true if <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">$string</span></span></code> contains good and false otherwise.</p>
<p>To test whether a string does <b>not</b> contain good, use <code class="codecolorer perl default"><span class="perl"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$string</span> <span style="color: #339933;">!~</span> <span style="color: #009966; font-style: italic;">/good/</span><span style="color: #009900;">&#41;</span></span></code>; the condition is true if <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">$string</span></span></code> has no good in it and false otherwise.</p>
<p>This kind of test suits searching a large string for a simple pattern and then branching two ways on the result, one or the other. It is quick and neat, but for more complicated conditions it becomes cumbersome.</p>
<p>The problem posed at the start can of course be solved this way: first find every <code class="codecolorer text default"><span class="text">&lt;item&gt;...&lt;/item&gt;</span></code>, then check each one for a color entry:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=</span><span style="color: #cc0000; font-style: italic;">&lt;&lt;END;<br />
&lt;item&gt;<br />
&nbsp; &nbsp; color:red;<br />
&lt;/item&gt;<br />
&lt;item&gt;<br />
&nbsp; &nbsp; size:12;<br />
&nbsp; &nbsp; number:45;<br />
&nbsp; &nbsp; type:good;<br />
&lt;/item&gt;<br />
END</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@result</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #339933;">!</span><span style="color: #009999;">&lt;item&gt;</span><span style="color: #339933;">.*?&lt;/</span>item<span style="color: #339933;">&gt;!</span>sg<span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$item</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$item</span> <span style="color: #339933;">!~</span> <span style="color: #009966; font-style: italic;">/color/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;$item&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>The output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&lt;item&gt;<br />
&nbsp; &nbsp; size:12;<br />
&nbsp; &nbsp; number:45;<br />
&nbsp; &nbsp; type:good;<br />
&lt;/item&gt;</div></td></tr></tbody></table></div>
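<p>That works, but it always casts the net wide, collecting every candidate first and then ruling them out one by one. Can we state up front that what we want is <strong>an item without color</strong>? <span style="color:#ff008c">Exclusion matching</span> exists for exactly this.</p>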
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Exclusion matching</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>I admit that I coined the term "exclusion matching". The usual names are "negative assertion" or "negative lookaround". Those names describe the matching process, while mine describes the result. Concretely, it means using <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!...</span><span style="color: #009900;">&#41;</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!...</span><span style="color: #009900;">&#41;</span></span></code> as auxiliary conditions to simplify a regex and find the wanted matches quickly.</p>
<p>The two are used in much the same way: both assert that some pattern <span style="color:#ff008c">does not occur</span> at the current position. The difference is that <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!...</span><span style="color: #009900;">&#41;</span></span></code> looks to the right of the current position, while <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!</span><span style="color: #009900;">&#41;</span></span></code> looks to the left.</p>
<p>I strongly recommend the tutorials translated by <a href="http://anrs.sacredfir.com/" target="_blank" title="我爱正则表达式">Anrs</a>: <a href="http://anrs.sacredfir.com/archives/295" target="_blank" title="我爱正则表达式">Lookaround, part 1</a> and <a href="http://anrs.sacredfir.com/archives/338" target="_blank" title="我爱正则表达式">Lookaround, part 2</a>. Reading both carefully and fully understanding lookaround will strengthen your regex skills. The rest of this post assumes you understand the concept.</p>
<p>An aside. Since "left" and "right" are vivid and easy to grasp, why do we never see "look-left" and "look-right", only the harder "lookahead" and "lookbehind", or "forward" and "backward"? <a href="https://twitter.com/kwl_01_skz/status/14069944812" target="_blank" title="我爱正则表达式">撕烤者</a> has asked the same question. My guess is that the terms accommodate users of right-to-left scripts such as Arabic. In any case, calling the direction from <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">^</span></span></code> to <code class="codecolorer perl default"><span class="perl">$</span></code> "ahead" can never be wrong.</p>
<p>The job of a lookaround is to describe the pattern to the left or right of the current position and so help decide whether the regex matches. It only describes and consumes no characters; it only assists the decision and is not normally used on its own. In this respect it behaves just like <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">^</span></span></code> and <code class="codecolorer perl default"><span class="perl">$</span></code>.</p>
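<p>The "consumes no characters" point can be made visible with a short Python sketch (my own example strings; only the lookahead form is shown). The lookahead is used on its own in the second call purely to show that it has zero width:</p>

```python
import re

# The lookahead rejects the first "foo" (followed by "bar") and accepts the second.
m = re.search(r"foo(?!bar)", "foobar foobaz")
span = (m.start(), m.end())  # the match covers "foo" only, not the "baz" after it

# A pattern made of nothing but a lookahead matches an empty string:
# the replacement is inserted at that position and no character is removed.
marked = re.sub(r"(?=bar)", "-", "foobar")

print(span, m.group(0), marked)
```

<p>The text inspected by the lookahead stays available to whatever part of the regex comes next, which is what makes it usable as a side condition.</p>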
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">An example</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p><strong>Example. </strong>There are now many addresses resembling fanfou.com. How would you write a regex that matches a domain containing fanfou whose TLD is not .com?</p>
<p><strong>Answer: </strong><code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #0000ff;">\bfanfou</span>\<span style="color: #339933;">.</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>com<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4</span><span style="color: #009900;">&#125;</span><span style="color: #0000ff;">\b</span><span style="color: #339933;">/</span>i</span></code>. Breaking this regex down:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>it starts with <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">\b</span></span></code> to pin down a word boundary;</li>
<li>the main domain name fanfou is required;</li>
<li><code class="codecolorer perl default"><span class="perl">\<span style="color: #339933;">.</span></span></code> matches a literal dot; do not use the dot metacharacter here;</li>
<li><code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>com<span style="color: #009900;">&#41;</span></span></code> means that at this point (immediately to the right of <code class="codecolorer text default"><span class="text">fanfou.</span></code>) the three consecutive characters com must not appear;</li>
<li><code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4</span><span style="color: #009900;">&#125;</span></span></code> means 2 to 4 Latin letters, because a TLD is at least 2 letters long (such as .au, .us) and, among the common ones, up to 4 (such as .info, .asia); longer TLDs such as .museum would need a larger upper bound;</li>
<li>the right-hand boundary matters just as much, otherwise the {2,4} before it would be wasted;</li>
<li>the i modifier makes the match case-insensitive, which is one of the properties of domain names.</li>
</ul>
</blockquote>
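<p>The pattern can be tried against a few names in Python (a sketch; every sample domain other than fanfou.com is made up for illustration, and re.I stands in for Perl's i modifier):</p>

```python
import re

pattern = re.compile(r"\bfanfou\.(?!com)[a-z]{2,4}\b", re.I)

# Hypothetical sample names, chosen to exercise each part of the pattern.
samples = ["fanfou.com", "fanfou.net", "FanFou.info", "fanfou.co", "fanfou.community"]
results = {name: bool(pattern.search(name)) for name in samples}

for name, matched in results.items():
    print(name, matched)
```

<p>Note that <code>(?!com)</code> rejects anything that merely begins with com after the dot, not only .com itself. If only the exact TLD should be excluded, a variant such as <code>(?!com\b)</code> is one way to say so.</p>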
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Back to the problem</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
        Let us build the regex step by step from the requirements.</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>The regex matches the <code class="codecolorer text default"><span class="text">&lt;item&gt;...&lt;/item&gt;</span></code> structure, so it starts with <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span></span></code>.</li>
<li>The hard part is that color must not appear between <code class="codecolorer text default"><span class="text">&lt;item&gt;</span></code> and <code class="codecolorer text default"><span class="text">&lt;/item&gt;</span></code>. Because <code class="codecolorer text default"><span class="text">color</span></code> could sit anywhere inside the structure, we must require that color does not begin at any position inside it. One such position is written <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span></span></code>. Repeating it one or more times gives <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span></span></code>. Note a small trap here: do not write <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.+</span></span></code>, because that only requires that color not begin at the leftmost position and places no constraint on the rest. Writing <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span></span></code> guarantees that color begins at no position.</li>
<li>The regex is now <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+?&lt;/</span>item<span style="color: #339933;">&gt;</span></span></code>. To save resources, the parentheses are usually written as a non-capturing group <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:...</span><span style="color: #009900;">&#41;</span></span></code>; to make the dot match newlines, either specify the s modifier or replace the dot metacharacter with <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span>\<span style="color: #000066;">s</span><span style="color: #0000ff;">\S</span><span style="color: #009900;">&#93;</span></span></code>. We keep the dot here. The regex becomes <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+?&lt;/</span>item<span style="color: #339933;">&gt;</span></span></code>.</li>
</ul>
</blockquote>
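<p>The finished pattern can be checked against the sample text in Python (a sketch; re.S stands in for Perl's s modifier, and the variable names are mine):</p>

```python
import re

text = """<item>
    color:red;
</item>
<item>
    size:12;
    number:45;
    type:good;
</item>"""

# Each character between the tags is consumed only if "color" does not start there.
pattern = re.compile(r"<item>(?:(?!color).)+?</item>", re.S)
matches = pattern.findall(text)

for item in matches:
    print(item)
```

<p>Only the second item is returned: an attempt starting at the first opening tag cannot get past the position where color begins, so the engine moves on to the next opening tag.</p>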
</blockquote>
</blockquote>
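<p>On the whole, lookaround is more abstract than the basic metacharacters. Once you understand and master it, though, you will find it very useful for precise matching and replacing. I hope the analysis above helps. If you have a similar problem, feel free to ask.</p>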
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/negate-match.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Building your own regex helper program</title>
		<link>http://iregex.org/blog/diy-regexbuddy.html</link>
		<comments>http://iregex.org/blog/diy-regexbuddy.html#comments</comments>
		<pubDate>Wed, 12 May 2010 05:32:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[cgi]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[regexbuddy]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=115</guid>
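		<description><![CDATA[RegexBuddy is actually quite good, and I have always used it. Much could be written about how to use it and why, and this site has covered it before; still, there are reasons not to use it, which is one reason for this post. I put some thought and a little time into it and have already built... ]]></description>
			<content:encoded><![CDATA[<p>RegexBuddy is actually quite good, and I have always used it. Much could be written about how to use it and why, and this site has covered it before; still, there are reasons not to use it, which is one reason for this post. I put some thought and a little time into it and have already built a prototype. I am publishing my approach here to discuss it with you.</p>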
<p><span id="more-115"></span></p>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Motivation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Why I stopped using RegexBuddy</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
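<li>It is paid software and not cheap: $39.95. A Google search may turn up a pleasant surprise.</li>
<li>It runs only on Windows. On Ubuntu I do install wine, solely to run RegexBuddy.</li>
<li>RegexBuddy cannot be used on the Mac. I recently started working on a Mac and do not want to maintain a separate environment just for Windows software, so RegexBuddy and I seem destined to part ways. I searched <a href="http://search.macupdate.com/search.php?keywords=regex&#038;os=mac" title="我爱正则表达式|打造自己的正则表达式助手程序">here</a> and <a href="http://www.apple.com/search/?q=regex&#038;sec=downloads" title="我爱正则表达式|打造自己的正则表达式助手程序">here</a>; the few tools I found are unremarkable, most supporting only a plain flavor such as JavaScript, with no support for multiple languages or options. RegexBuddy's excellent performance has made me extremely picky about regex helpers, and ordinary tools no longer impress me.</li>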
</ul>
<p>With no ready-made solution, I started thinking about how to build one myself.</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">My ideal regex helper</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
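<li>Like RegexBuddy, it should support the following:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>Regex flavors of multiple languages: at least Perl, Python, PHP, and JavaScript. I rarely use .NET (only when answering other people's questions, which does not count), so it can be ignored;</li>
<li>matching, replacing, and splitting;</li>
<li>code snippet generation, which matters a lot. I do not memorize things a computer can do for me unless I use them often, and what I use often gradually becomes muscle memory anyway.</li>
</ol>
</blockquote>
<li>Beyond that, it should ideally:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>Run on all the common platforms, by which I mean Win/Lin/Mac.</li>
<li>Support each language natively. Frankly, I suspect RegexBuddy still uses Perl 5.8-style regexes; many of the new and handy features of 5.10 are not yet supported in it. The likely reason is that RegexBuddy's author built the regex engines for Perl and the other languages from scratch, so they differ from the latest releases in details and versions. Speaking of languages, I recall a suggestion from Yu Sheng (余晟): when thinking about a regex problem, set aside the language and version at first and keep a unified syntax in mind. I agree, but I also think that once you know the seemingly universal basics you should learn the details of the flavor you use most, and how it differs from the others, to avoid half-right assumptions. But I digress.</li>
<li>Be open source, legitimate, and free. When we introduce regexes to other people we need a presentable tool to show them, don't we? I do not insist on free, though; good software deserves some reward.</li>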
</ol>
</blockquote>
</ol>
<p>The question is where to find such software. And if it cannot be found, how would one go about building it from scratch?</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">How my thinking evolved</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
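<li>Implement it in Objective-C. That idea was soon abandoned. I do need to learn Obj-C, but I cannot wait that long: I use software like RegexBuddy every day, and learning a language first would be like digging a well only once I am thirsty. Besides, code developed for the Mac would at least need to be compiled separately for Win/Lin. And if I used Obj-C, what about the regex engine? Write one from scratch? According to xiaofei, building a usable regex engine takes an excellent team half a year. Of course, Obj-C can also call existing modules, which led to my current approach.</li>
<li>Build it as a web application: the front end takes the user's input, the back end uses CGI to call the native regex engines on the server (perl, python), and the results of matching or replacing are shown in the front end. The biggest advantage is that each language is 100% native; with a network connection it works from any browser, and without one it still works on localhost, and faster. JavaScript and PHP do not need CGI at all, since the browser and the web server can run them directly.</li>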
</ul>
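<p>For the python side of the second idea, the back end could be as small as the following sketch. It is only an illustration under my own assumptions: the function name <code>run_regex</code> and the action strings are hypothetical, and the CGI plumbing (reading parameters, printing the response) is left out.</p>

```python
import re

def run_regex(action, pattern, text, replacement=""):
    """Apply one action using Python's own re module (hypothetical helper)."""
    compiled = re.compile(pattern)
    if action == "match":
        # Return the text of every non-overlapping match.
        return [m.group(0) for m in compiled.finditer(text)]
    if action == "replace":
        return compiled.sub(replacement, text)
    if action == "split":
        return compiled.split(text)
    raise ValueError("unknown action: " + action)

found = run_regex("match", r"\d+", "a1b22c333")
replaced = run_regex("replace", r"\d+", "a1b22c333", "#")
pieces = run_regex("split", r"\d+", "a1b22c333")
print(found, replaced, pieces)
```

<p>Because the work is done by the language's own engine, the result is by construction what a script in that language would produce, which is the point of this approach.</p>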
<p>                             I chose the second approach and set about implementing it.
                        </p></blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Current progress</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
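<li>I have sketched a simple interface with HTML+jQuery and implemented the CGI program for perl 5.10; it can match, replace, and split.</li>
<li>Not yet implemented: automatic generation of code snippets; versions for the other languages.</li>
<li>For my own purposes it is basically usable already. I am eating my own dog food right now, improving it as I use it. Releasing it for everyone, however, will take a long stretch of feature work and interface polish.</li>
<li>See the screenshot at the end of the post.</li>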
</ul>
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Perl CGI code and brief notes</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br />48<br />49<br />50<br />51<br />52<br />53<br />54<br />55<br />56<br />57<br />58<br />59<br />60<br />61<br />62<br />63<br />64<br />65<br />66<br />67<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #000000; font-weight: bold;">use</span> CGI<span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> cl <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> h_color<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$color</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$counter</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">?</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&lt;span style='background-color:$color'&gt;&quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$a</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;&lt;/span&gt;&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">=</span>CGI<span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">die</span> <span style="color: #ff0000;">&quot;$!&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">header</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">-</span>type<span style="color: #339933;">=&gt;</span><span style="color: #ff0000;">&quot;text/html; charset=UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;regex&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;">#quit immediately if no $regex input</span><br />
<span style="color: #000066;">die</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;text&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$mode</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;mode&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;space&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$action</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;action&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\s+//g</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;match&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'$text =~ s@$regex'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@&amp;h_color($&amp;)'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@eg'</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$mode</span><span style="color: #339933;">.</span><span style="color: #ff0000;">';'</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br /&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$text</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$replace</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\$</span>text =~ s:<span style="color: #000099; font-weight: bold;">\$</span>regex:$replace:g;&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #ff0000;">&quot;$code&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br/&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;&lt;pre&gt;$text&lt;/pre&gt;&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;split&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#@result=split(m@$regex@mode, $text);</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=split(m@$regex@'</span><span style="color: #339933;">.</span> <span style="color: #0000ff;">$mode</span> <span style="color: #339933;">.</span> <span style="color: #ff0000;">', $text);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=grep /\S/, @result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'my $count=@result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;font color=\&quot;#ff008c\&quot;&gt;$count&lt;/font&gt; record(s) returned:&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;ol&gt;&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;li&gt;&quot;.&amp;h_color($_).&quot;&lt;/li&gt;&quot; foreach (@result);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;/ol&gt;&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</blockquote>
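<p>The code is... fairly concise. It mainly receives the parameters and does some light processing, then generates code dynamically according to the requested action (match/replace/split), runs it with eval, and returns the output. The match and split actions also include a small highlighting feature. Thanks to Perl's compact, efficient syntax, the implementation does not get long. Thanks to <a href="http://twitter.com/cnhacktnt">cnhacktnt</a> for the help.</p>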
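<p>As a rough illustration of the same match/replace/split dispatch, here is a minimal JavaScript sketch. It is not a port of the Perl script: <code>makeHighlighter</code> and <code>runRegex</code> are hypothetical names, and the sketch builds a <code>RegExp</code> object at run time instead of assembling source code for <code>eval</code>.</p>

```javascript
// Hypothetical helper, not part of the original Perl script: wraps a
// matched fragment in a <span>, alternating between two background colors.
function makeHighlighter() {
  let counter = 0;
  return (fragment) => {
    const color = counter++ % 2 === 0 ? "#ff0" : "#0ff";
    return `<span style='background-color:${color}'>${fragment}</span>`;
  };
}

// Dispatch on the requested action, building the RegExp at run time.
// `mode` carries extra flags such as "i" or "m"; "g" is always added once.
function runRegex(action, pattern, mode, text, replacement = "") {
  const flags = "g" + mode.replace(/g/g, "");
  const regex = new RegExp(pattern, flags);
  const highlight = makeHighlighter();
  switch (action) {
    case "match":
      // Wrap every match in a colored <span>.
      return text.replace(regex, (m) => highlight(m));
    case "replace":
      return text.replace(regex, replacement);
    case "split":
      // Drop empty or whitespace-only pieces, like the grep /\S/ step.
      return text
        .split(regex)
        .filter((part) => part !== undefined && /\S/.test(part));
    default:
      throw new Error(`unknown action: ${action}`);
  }
}
```

<p>For example, <code>runRegex("split", "\\s*,\\s*", "", "a , b,, c")</code> returns the three non-empty pieces <code>a</code>, <code>b</code> and <code>c</code>.</p>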
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Screenshots</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match_cn.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/replace.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/split_cn.png" border="0" alt="Photobucket"></a></li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/diy-regexbuddy.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Recursive Regular Expressions in PHP</title>
		<link>http://iregex.org/blog/recursive-regex-in-php.html</link>
		<comments>http://iregex.org/blog/recursive-regex-in-php.html#comments</comments>
		<pubDate>Mon, 10 May 2010 04:57:28 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[翻译]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[recursive]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=84</guid>
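		<description><![CDATA[An earlier post translated an article on recursive regular expressions in Perl. In fact, the regex flavors of quite a few languages support recursion, PHP among them, which is what this post covers. Although the regexes used most often at work are quite &#8220;regular&#8221;, using only the most basic syn... ]]></description>
			<content:encoded><![CDATA[<p>An earlier post translated an article on <a title="我爱正则表达式|Perl语言中的递归正则表达式" target="_blank" href="http://iregex.org/blog/recursive-regular-expressions.html">recursive regular expressions in Perl</a>. In fact, the regex flavors of quite a few languages support recursion, PHP among them, which is what this post covers. The regexes used most often at work are quite &#8220;regular&#8221;: basic syntax alone solves more than 85% of problems, and using plain regexes well to solve complex problems is a skill in its own right. Still, the more advanced syntax has its place, and some jobs cannot be done without it. Besides, part of the fun of learning regexes lies in <font color="#ff008c">trying out every possibility and satisfying one's endless curiosity</font>.</p>
<p><a href="http://iregex.org/blog/recursive-regex-in-php.html" title="我爱正则表达式 | PHP中的递归正则">This post</a> is adapted from the article <A HREF="http://www.skdevelopment.com/php-regular-expressions.php">Finer points of PHP regular expressions</A>. Its analysis is careful and methodical, and it is worth reading. That article systematically covers the common features of PHP regular expressions; I have translated and adapted only the part on recursion. </p>
<p><span id="more-84"></span></p>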
<h2 style="background-color: rgb(153, 204, 0); line-height: 35px; border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); text-indent: 6px; font-size: 21px;">Main text</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Example</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
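<p>When do you need a recursive regular expression? When some pattern occurs recursively in the string being matched. The classic example is handling nested parentheses, shown below.</p>
<p>Suppose your text contains correctly paired, nested parentheses, nested to an arbitrary depth, and you want to capture such a parenthesized group.</p>
<p>Spoiler: the standard answer looks like this:</p>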
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span><br />
<span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;some text (a(b(c)d)e) more text&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">preg_match</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/\(([^()]+|(?R))*\)/&quot;</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;&lt;pre&gt;&quot;</span><span style="color: #339933;">;</span> <span style="color: #990000;">print_r</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;&lt;/pre&gt;&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
<p>Its output is:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #990000;">Array</span><br />
<span style="color: #009900;">&#40;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#40;</span>a<span style="color: #009900;">&#40;</span>b<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#41;</span>d<span style="color: #009900;">&#41;</span>e<span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=&gt;</span> e &nbsp; &nbsp;<br />
<span style="color: #009900;">&#41;</span></div></td></tr></tbody></table></div>
<p>As you can see, the text we want has been captured in <code class="codecolorer php default"><span class="php"><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span></span></code>.</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">How it works</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Now let's look at how it works.</p>
<p>The key element in the regular expression above is <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>?R<span style="color: #009900;">&#41;</span></span></code>. <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>?R<span style="color: #009900;">&#41;</span></span></code> stands, recursively, for the entire regular expression that contains it. On each iteration, the regex engine in effect replaces <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>?R<span style="color: #009900;">&#41;</span></span></code> with &#8220;<code class="codecolorer php default"><span class="php">\<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>^<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+|</span><span style="color: #009900;">&#40;</span>?R<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">*</span>\<span style="color: #009900;">&#41;</span></span></code>&#8221;.</p>
<p>So, for the example above, the regular expression is equivalent to:</p>
<p>            <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;/\(([^()]+|\(([^()]+|\(([^()]+)*\))*\))*\)/&quot;</span></span></code></p>
<p>But the expression above only handles parentheses nested three levels deep. For nesting of unknown depth, you need this regex instead:</p>
<p>            <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;/\(([^()]+|(?R))*\)/&quot;</span></span></code> </p>
<p>It not only matches nesting of unlimited depth, it is also much shorter: powerful and concise at once.</p>
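<p>To make the fixed-depth limitation concrete, here is a small sketch in JavaScript, whose built-in <code>RegExp</code> has no <code>(?R)</code>, so the recursion has to be expanded by hand. <code>nestedParens</code> is a hypothetical helper; unlike the expression quoted above, it uses non-capturing groups and <code>[^()]*</code> at the innermost level.</p>

```javascript
// Hypothetical helper: builds a pattern matching balanced parentheses
// nested up to `depth` levels, by expanding the recursion by hand.
function nestedParens(depth) {
  // Innermost level: a pair of parentheses with no parentheses inside.
  let pattern = "\\([^()]*\\)";
  for (let level = 1; level < depth; level++) {
    // Each extra level allows plain text or a group one level shallower.
    pattern = "\\((?:[^()]+|" + pattern + ")*\\)";
  }
  return new RegExp(pattern);
}

const text = "some text (a(b(c)d)e) more text";
console.log(nestedParens(3).exec(text)[0]); // "(a(b(c)d)e)"
console.log(nestedParens(2).exec(text)[0]); // "(b(c)d)"
```

<p>With a depth of 2 the outermost group cannot be matched, so the first match is the inner <code>(b(c)d)</code>. The depth has to be chosen in advance, and that is exactly the limitation recursion removes.</p>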
<p>Now let's look closely at how <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;/\(([^()]+|(?R))*\)/&quot;</span></span></code> matches <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;(a(b(c)d)e)&quot;</span></span></code>:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li><code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;(c)&quot;</span></span></code> is matched by the regex <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;\(([^()]+)*\)&quot;</span></span></code>. Note that <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#41;</span></span></code> is a miniature of the whole recursion, small but complete, so it exercises the entire regular expression.<br />In other words, in the next step <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#41;</span></span></code> can be matched by <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>?R<span style="color: #009900;">&#41;</span></span></code>.</li>
<li><code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>b<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#41;</span>d<span style="color: #009900;">&#41;</span></span></code> is matched as follows:
<ol>
<li><code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;\(&quot;</span></span></code> matches <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;(&quot;</span></span></code>;</li>
<li><code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;[^()]+&quot;</span></span></code> matches <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;b&quot;</span></span></code>;</li>
<li><code class="codecolorer php default"><span class="php">&nbsp;<span style="color: #009900;">&#40;</span>?R<span style="color: #009900;">&#41;</span></span></code> matches <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;(c)&quot;</span></span></code>;</li>
<li><code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;[^()]+&quot;</span></span></code> matches <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;d&quot;</span></span></code>;</li>
<li><code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;\)&quot;</span></span></code> matches <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">&quot;)&quot;</span></span></code>.</li>
</ol>
</li>
</ol>
<p>Given the matching process above, it is easy to see why the second array element, <code class="codecolorer php default"><span class="php"><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span></span></code>, equals <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">'e'</span></span></code>. The substring <code class="codecolorer php default"><span class="php"><span style="color: #0000ff;">'e'</span></span></code> is captured in the last iteration of the group. During matching, <font color="#ff008c">only the result of the last capture is kept in the array</font>.</p>
<blockquote  style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>rex's note: you can try this yourself. Match the string <code class="codecolorer php default"><span class="php">abc123xyz890</span></code> against the regex <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>z<span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">9</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span></span></code> and see what ends up in <code class="codecolorer php default"><span class="php">$<span style="color: #cc66cc;">1</span></span></code>. Note that the result does not conflict with the leftmost-longest principle.</p></blockquote>
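<p>The same "last capture wins" behavior can be checked quickly in JavaScript, which treats a repeated capturing group the same way in this case. This is only a sketch of the experiment suggested in the note, not PHP code:</p>

```javascript
// The group ([a-z]+[0-9]+) runs twice: first on "abc123", then on "xyz890".
const match = /([a-z]+[0-9]+)+/.exec("abc123xyz890");

console.log(match[0]); // "abc123xyz890" (the whole match)
console.log(match[1]); // "xyz890" (only the last iteration is kept)
```

<p>The overall match still covers the entire string; only the contents of the capture group are overwritten on each iteration.</p>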
</blockquote>
<p>         If all we need is <code class="codecolorer php default"><span class="php"><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span></span></code>, we can do this:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;some text (a(b(c)d)e) more text&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">preg_match</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/\((?:[^()]+|(?R))*\)/&quot;</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;&lt;pre&gt;&quot;</span><span style="color: #339933;">;</span> <span style="color: #990000;">print_r</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;&lt;/pre&gt;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
<p>The overall match is the same, without the extra capture:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br /></div></td><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #990000;">Array</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#40;</span><br />
&nbsp; &nbsp; &nbsp;<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#40;</span>a<span style="color: #009900;">&#40;</span>b<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#41;</span>d<span style="color: #009900;">&#41;</span>e<span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#41;</span></div></td></tr></tbody></table></div>
<p>The only change is that the capturing parentheses <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span></span></code> became non-capturing parentheses <code class="codecolorer php default"><span class="php"><span style="color: #009900;">&#40;</span>?<span style="color: #339933;">:</span><span style="color: #009900;">&#41;</span></span></code>.</p>
<p>This can be refined further:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;some text (a(b(c)d)e) more text&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">preg_match</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/\((?&gt;[^()]+|(?R))*\)/&quot;</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;&lt;pre&gt;&quot;</span><span style="color: #339933;">;</span> <span style="color: #990000;">print_r</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;&lt;/pre&gt;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
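<p>Here we use what is called a once-only subpattern (rex's note: Yu Sheng's Chinese translation of Mastering Regular Expressions, 3rd edition, calls this &#8220;固化分组&#8221;, i.e. atomic grouping; see that book). The PHP manual also recommends using this construct whenever conditions allow, to speed up the regular expression.</p>
<p>Once-only subpatterns are simple, so they are not covered in detail here. If you are interested, see the official PHP manual. If you want to study Perl Compatible Regular Expressions (PCRE) in depth, see the links at the end of this post.</p>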
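<p>For a quick feel of what "once-only" means, here is a hedged JavaScript sketch. JavaScript's built-in <code>RegExp</code> has no <code>(?&gt;...)</code> syntax, so the sketch emulates it with a lookahead plus a backreference, a common workaround rather than anything taken from the PHP manual:</p>

```javascript
// Ordinary group: after "bc" leaves no "c" to match, the engine backtracks
// into the group and retries with the shorter alternative "b".
const backtracking = /a(?:bc|b)c/;

// Emulated once-only group: the lookahead captures what the group would
// match, and \1 consumes exactly that text. Once the lookahead succeeds,
// the engine does not go back inside it to try "b".
const onceOnly = /a(?=(bc|b))\1c/;

console.log(backtracking.test("abc")); // true
console.log(onceOnly.test("abc"));     // false
console.log(onceOnly.test("abcc"));    // true
```

<p>Giving up those retries is what can make a failing match finish sooner, at the price of rejecting strings that only match through backtracking.</p>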
</blockquote>
</blockquote>
<h2  style="background-color: rgb(153, 204, 0); line-height: 35px; border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); text-indent: 6px; font-size: 21px;">Links mentioned</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Original article: <a href="http://www.skdevelopment.com/php-regular-expressions.php">Finer points of PHP regular expressions</a></li>
<li>Perl Compatible Regular Expressions (PCRE): <a href="http://www.pcre.org">official site</a>, <a href="http://pcre.org/pcre.txt">documentation</a></li>
<li><a href="http://us3.php.net/manual/en/reference.pcre.pattern.syntax.php">PCRE pattern syntax documentation on the official PHP site</a></li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/recursive-regex-in-php.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Move Complete</title>
		<link>http://iregex.org/blog/moved.html</link>
		<comments>http://iregex.org/blog/moved.html#comments</comments>
		<pubDate>Tue, 04 May 2010 01:57:32 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[杂项]]></category>
		<category><![CDATA[host]]></category>
		<category><![CDATA[regexbuddy]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=86</guid>
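		<description><![CDATA[The old hosting (homezz.com) expired, so I took the opportunity to switch to alwaysdata.com in France. The move went smoothly: I used scp to send the data straight over from the old host, set up the domain's DNS and the data files, and the new host was live. Also, how is the new site's speed? Is it acceptable?... ]]></description>
			<content:encoded><![CDATA[<p>The old hosting (homezz.com) expired, so I took the opportunity to switch to alwaysdata.com in France. The move went smoothly: I used <a href="http://en.wikipedia.org/wiki/Secure_copy">scp</a> to send the data straight over from the old host, set up the domain's DNS and the data files, and the new host was live.</p>
<p>Also, how is the new site's speed? Is it acceptable?</p>
<p>I am currently working on a post about &#8220;how to DIY a regular expression tool like RegexBuddy&#8221;. Stay tuned.</p>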
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/moved.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Moving over the May Day Holiday</title>
		<link>http://iregex.org/blog/moving.html</link>
		<comments>http://iregex.org/blog/moving.html#comments</comments>
		<pubDate>Thu, 29 Apr 2010 10:53:30 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[杂项]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=85</guid>
		<description><![CDATA[This site will move to a new host during the May Day holiday... ]]></description>
			<content:encoded><![CDATA[<p>This site will move to a new host during the May Day holiday.</p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/moving.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Removing Comments with Regular Expressions</title>
		<link>http://iregex.org/blog/uncomment-program-with-regex.html</link>
		<comments>http://iregex.org/blog/uncomment-program-with-regex.html#comments</comments>
		<pubDate>Sat, 03 Apr 2010 09:51:56 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[问答]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[negative lookaround]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=83</guid>
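		<description><![CDATA[Problem. The following is excerpted from a reader's email: Difficulties. JavaScript does not let the dot match newlines, so multi-line matching cannot be done directly; to handle a // that is not preceded by http:, the natural tool is a negative lookbehind: &#40;?&#60;!http:&#41;\/\/. Unfortunately, JavaScript does not sup... ]]></description>
			<content:encoded><![CDATA[<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Problem</h2>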
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The following is excerpted from a reader's email: </h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
<a href="http://iregex.org/blog/uncomment-program-with-regex.html" target="_blank" title="javascript正则中的否定前瞻"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100402104810-1.png" border="0" alt="javascript正则中的否定前瞻"></a>
</p></blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">难点</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>JavaScript's dot does not match newline characters (there was no dot-all mode at the time of writing), so multi-line matching <strong>cannot be done directly</strong>.</li>
<li>To handle a <code class="codecolorer text default"><span class="text">//</span></code> that is not preceded by <code class="codecolorer text default"><span class="text">http:</span></code>, the natural tool is negative lookbehind: <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!</span>http<span style="color: #339933;">:</span><span style="color: #009900;">&#41;</span>\<span style="color: #339933;">/</span>\<span style="color: #339933;">/</span></span></code>. Unfortunately, JavaScript did not support lookbehind at the time of writing (it was only standardized later, in ECMAScript 2018).<br /><a href="http://iregex.org/blog/uncomment-program-with-regex.html" target="_blank" title="Negative lookbehind in JavaScript regular expressions"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100401091312.png" border="0" alt="Negative lookbehind in JavaScript regular expressions"></a></li>
</ol>
</blockquote>
<p><span id="more-83"></span></p>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Approach</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Multi-line matching</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>I have covered this before: the key is to use <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\S</span>\<span style="color: #000066;">s</span><span style="color: #009900;">&#93;</span></span></code> to emulate a dot that matches newlines. The original post is here: <a href="http://iregex.org/blog/diy-match-all-mode-dot.html">DIY a match-everything wildcard</a>. Building on that, the following JavaScript removes multi-line comments:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-style: italic;">//to uncomment C-style multiple line comment</span><br />
<span style="color: #003366; font-weight: bold;">function</span> uncomment_multi<span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp;<span style="color: #000066; font-weight: bold;">return</span> str.<span style="color: #660066;">replace</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\/\*[\S\s]*?\*\//g</span><span style="color: #339933;">,</span> <span style="color: #3366CC;">&quot;&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
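<p>To see why the character class matters, here is a small sketch that runs the same replacement twice, once with a plain dot and once with <code>[\S\s]</code>. The sample string and the variable names are made up for illustration.</p>

```javascript
// A sample spanning three lines, with one block comment across two of them.
const source = "var a = 1; /* first line\nsecond line */\nvar b = 2;";

// The dot stops at "\n", so the opening /* never reaches its closing */.
const withDot = source.replace(/\/\*.*?\*\//g, "");

// [\S\s] means "whitespace or non-whitespace", i.e. any character at all.
const withClass = source.replace(/\/\*[\S\s]*?\*\//g, "");

console.log(withDot === source);          // true: nothing was removed
console.log(JSON.stringify(withClass));   // "var a = 1; \nvar b = 2;"
```

<p>The lazy quantifier <code>*?</code> is what keeps two separate block comments from being merged into one match.</p>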
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Single-line comments: a JavaScript implementation (incomplete)</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Single-line comments are not as simple as they look. If you think <code class="codecolorer javascript default"><span class="javascript">str.<span style="color: #660066;">replace</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\/\/.*$/gm</span><span style="color: #339933;">,</span> <span style="color: #3366CC;">&quot;&quot;</span><span style="color: #009900;">&#41;</span></span></code> is all it takes, you must be sure that the text being processed is of the simplest kind, such as:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #003366; font-weight: bold;">var</span> pig<span style="color: #339933;">=</span><span style="color: #3366CC;">&quot;ase&quot;</span><span style="color: #339933;">;</span> <span style="color: #006600; font-style: italic;">//this is a comment.</span></div></td></tr></tbody></table></div>
<p>In practice this does not work. Real programs are full of lines like these:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #003366; font-weight: bold;">var</span> url<span style="color: #339933;">=</span><span style="color: #3366CC;">&quot;http://iregex.org&quot;</span><span style="color: #339933;">;</span> <span style="color: #006600; font-style: italic;">//this is my site.</span><br />
<span style="color: #003366; font-weight: bold;">var</span> url<span style="color: #339933;">=</span><span style="color: #3366CC;">&quot;//not real comment here http://iregex.org&quot;</span><span style="color: #339933;">;</span> <span style="color: #006600; font-style: italic;">//this is my site.</span></div></td></tr></tbody></table></div>
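<p>A quick sketch of what the naive replacement does to such lines. <code>stripNaive</code> is a hypothetical helper name, not part of the original post.</p>

```javascript
// Naive approach: delete everything from the first "//" to the end of each line.
function stripNaive(text) {
  return text.replace(/\/\/.*$/gm, "");
}

const plain = 'var pig="ase"; //this is a comment.';
const withUrl = 'var url="http://iregex.org"; //this is my site.';

console.log(JSON.stringify(stripNaive(plain)));   // "var pig=\"ase\"; "
console.log(JSON.stringify(stripNaive(withUrl))); // "var url=\"http:"
```

<p>The first line comes out fine; the second loses most of its string literal, because the <code>//</code> inside the URL is found first.</p>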
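<p>I tried writing a JavaScript function that simulates negative lookbehind. It handles the <code class="codecolorer text default"><span class="text">http://</span></code> case, but it is not pretty to look at, and it cannot cope with double slashes inside quotes. I was quite disappointed by how bare JavaScript's regex feature set is, so I turned to Perl to finish the job. First, here is the JavaScript function I wrote to remove a single-line comment:</p>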
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br /></div></td><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #003366; font-weight: bold;">function</span> uncomment_single<span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> result<span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> single<span style="color: #339933;">=</span><span style="color: #003366; font-weight: bold;">new</span> RegExp<span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;<span style="color: #000099; font-weight: bold;">\/</span><span style="color: #000099; font-weight: bold;">\/</span>.&quot;</span><span style="color: #339933;">,</span><span style="color: #3366CC;">&quot;ig&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> start<span style="color: #339933;">=</span><span style="color: #CC0000;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>result<span style="color: #339933;">=</span>single.<span style="color: #660066;">exec</span><span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">!=</span><span style="color: #003366; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> part<span style="color: #339933;">=</span>str.<span style="color: #660066;">slice</span><span style="color: #009900;">&#40;</span>start<span style="color: #339933;">,</span>result.<span style="color: #660066;">index</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> negLeft<span style="color: #339933;">=</span><span style="color: #003366; font-weight: bold;">new</span> RegExp<span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;http:$&quot;</span><span style="color: #339933;">,</span><span style="color: #3366CC;">&quot;i&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span> negLeft.<span style="color: #660066;">test</span><span style="color: #009900;">&#40;</span>part<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">return</span> str.<span style="color: #660066;">slice</span><span style="color: #009900;">&#40;</span><span style="color: #CC0000;">0</span><span style="color: #339933;">,</span>result.<span style="color: #660066;">index</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; start<span style="color: #339933;">=</span>result.<span style="color: #660066;">index</span><span style="color: #339933;">+</span>result<span style="color: #009900;">&#91;</span><span style="color: #CC0000;">0</span><span style="color: #009900;">&#93;</span>.<span style="color: #660066;">length</span><span style="color: #339933;">-</span><span style="color: #CC0000;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">return</span> str<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
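<p>For comparison, engines that implement ECMAScript 2018 accept lookbehind directly, so the loop above can be approximated by a single replacement. This is only a sketch under that assumption; <code>stripWithLookbehind</code> is a hypothetical name, and like the function above it works on one line and still mistakes a <code>//</code> inside quotes for a comment.</p>

```javascript
// Assumes an engine with ES2018 lookbehind support (for example a recent Node.js).
function stripWithLookbehind(line) {
  // Remove from the first "//" that is not directly preceded by "http:".
  return line.replace(/(?<!http:)\/\/.*$/i, "");
}

const ok = 'var url="http://iregex.org"; //this is my site.';
const broken = 'var url="//not real comment here http://iregex.org"; //this is my site.';

console.log(JSON.stringify(stripWithLookbehind(ok)));     // "var url=\"http://iregex.org\"; "
console.log(JSON.stringify(stripWithLookbehind(broken))); // "var url=\"" (the known limitation)
```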
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Perl version: approach and source code (relatively complete)</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Test text</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
Now that the mighty Perl is on the table, the toy examples above can step aside. I will use the following, relatively complex text to test my program:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&lt;!DOCTYPE h/tml PUBLIC &quot;-//W3C//DTD XHTML\&quot; 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt; sdfasdf//real comment here//&quot;</div></td></tr></tbody></table></div>
</blockquote>
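<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A careful look at what makes a single-line comment</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
Analyzing the problem correctly is the prerequisite for writing a sound and efficient program. By observation, single-line comments have the following properties:</p>
<ol>
<li>A double slash inside quotes (single or double) is not a comment.</li>
<li>Quotes come in pairs, and a quote escaped with a backslash between the two does not end the string. For example, in <code class="codecolorer text default"><span class="text">&quot;hello \&quot; //world&quot;</span></code> the <code class="codecolorer text default"><span class="text">//world</span></code> part must not be treated as a comment.</li>
<li>A run of characters that are neither quotes nor slashes is not a comment either, and in particular a single slash is not a comment. Why must this part exclude slashes as well as quotes? Because <code class="codecolorer text default"><span class="text">[^'&quot;]+</span></code> could wrongly swallow the comment in something like <code class="codecolorer text default"><span class="text">abcde//real comment &quot;quoted string in comment&quot;</span></code>, so we arrive at <code class="codecolorer text default"><span class="text">[^'&quot;/]+</span></code>. And because the pattern must still get past a lone slash, as in <code class="codecolorer text default"><span class="text">abcde/real comment &quot;quoted string in comment&quot;</span></code>, we add the rule that a single slash is not a comment. The resulting regex is <code class="codecolorer text default"><span class="text">[^'&quot;/]|(?&lt;!/)/(?!/)</span></code>.</li>
<li>Apart from the cases above, everything from a double slash to the end of the line is a comment. Because this relies on the notion of <strong>end of line</strong>, the regex must run in multi-line mode, in which <code class="codecolorer text default"><span class="text">^$</span></code> match at the start and end of each line. This is written as <code class="codecolorer text default"><span class="text">//m</span></code>.</li>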
</ol>
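<p>The four rules can also be read as a small state machine. The following is a minimal sketch of that reading, not the regex-based solution this post builds; <code>findLineComment</code> is a hypothetical name, it handles a single line, and for simplicity it skips any backslash-escaped character inside quotes rather than only an escaped quote.</p>

```javascript
// Returns the index where the comment starts on a single line, or -1 if there is none.
function findLineComment(line) {
  let quote = null; // the quote character we are currently inside, if any
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (quote !== null) {
      if (ch === "\\") {
        i++;               // rule 2: skip the escaped character
      } else if (ch === quote) {
        quote = null;      // matching closing quote
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch;          // rule 1: entering a quoted string
    } else if (ch === "/" && line[i + 1] === "/") {
      return i;            // rule 4: first // outside quotes
    }
    // rule 3: anything else, including a lone slash, is ordinary text
  }
  return -1;
}

const sample = '<!DOCTYPE h/tml PUBLIC "-//W3C//DTD XHTML\\" 1.0 Transitional//EN" ' +
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> sdfasdf//real comment here//"';

console.log(sample.slice(findLineComment(sample))); // //real comment here//"
```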
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The regex implementation</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<span style="color: #0000ff;">$str</span> <span style="color: #339933;">=</span> <span style="color: #cc0000; font-style: italic;">&lt;&lt;'EOF';<br />
&lt;!DOCTYPE h/tml PUBLIC &quot;-//W3C//DTD XHTML\&quot; 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt; sdfasdf//real comment here//&quot; <br />
EOF</span><br />
<span style="color: #666666; font-style: italic;">#print $str;</span><br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$str</span><span style="color: #339933;">=~</span> <br />
&nbsp; &nbsp; m<span style="color: #339933;">%</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">'&quot;/]|<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?&lt;!/)/(?!/)|<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?&lt;quote&gt;['</span><span style="color: #ff0000;">&quot;])<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?:<span style="color: #000099; font-weight: bold;">\\</span> <span style="color: #000099; font-weight: bold;">\g</span>{quote}|<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?!<span style="color: #000099; font-weight: bold;">\g</span>{quote}).)*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000099; font-weight: bold;">\g</span>{quote}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?&lt;comment&gt;//.*)<br />
&nbsp; &nbsp; &nbsp; &nbsp; $<br />
&nbsp; &nbsp; %xm) <br />
&nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; print $+{comment}; <br />
}</span></div></td></tr></tbody></table></div>
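<p>The same pattern can be ported to JavaScript on engines that implement ECMAScript 2018 (named groups and lookbehind). This is a sketch under that assumption, not something the original program depends on. JavaScript has no x mode, so the pattern is assembled from commented <code>String.raw</code> pieces; <code>lineComment</code> and <code>findComment</code> are hypothetical names.</p>

```javascript
// Assemble the pattern piece by piece, as a stand-in for Perl's /x layout.
const lineComment = new RegExp(
  [
    String.raw`^`,
    String.raw`(?:`,
    String.raw`[^'"/]`,                          // ordinary text
    String.raw`|(?<!/)/(?!/)`,                    // a lone slash
    String.raw`|(?<quote>['"])`,                  // opening quote
    String.raw`(?:\\\k<quote>|(?!\k<quote>).)*`,  // escaped quote, or any non-quote character
    String.raw`\k<quote>`,                        // matching closing quote
    String.raw`)*`,
    String.raw`(?<comment>//.*)`,                 // the comment itself
    String.raw`$`,
  ].join(""),
  "m"
);

// Returns the comment text, or null when the text has no comment.
function findComment(text) {
  const m = lineComment.exec(text);
  return m ? m.groups.comment : null;
}

const testText = '<!DOCTYPE h/tml PUBLIC "-//W3C//DTD XHTML\\" 1.0 Transitional//EN" ' +
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> sdfasdf//real comment here//"';

console.log(findComment(testText)); // //real comment here//"
```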
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A few additional notes</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>The program needs Perl 5.10 or later, because it uses named captures such as <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?</span><span style="color: #009999;">&lt;quote&gt;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'&quot;])</span></span></code>, a fairly advanced feature. It can still be done without 5.10 by using numbered captures; the result is just less readable.</li>
<li>After a successful match, the named captures are stored in the hash <code class="codecolorer text default"><span class="text">%+</span></code>, so they are easy to retrieve with something like <code class="codecolorer text default"><span class="text">print $+{comment}</span></code>.</li>
<li>The x modifier is set so that whitespace and line breaks can be added to give the regex a visible structure. For a complex regular expression, not using x mode is very unwise.</li>
<li>A heredoc is used so that single and double quotes can both appear in the string without fuss. Personally I find it less convenient than Python's triple quotes.</li>
</ul>
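<p>To illustrate the first point, here is what the pattern looks like with numbered captures only. It is shown in JavaScript rather than Perl, and it still assumes an engine with lookbehind support; <code>commentOf</code> is a hypothetical name. Group 1 is the quote character and group 2 is the comment.</p>

```javascript
// Same structure as the named version, but the reader has to count parentheses:
// \1 refers back to the opening quote, and the comment ends up in group 2.
const numbered = /^(?:[^'"\/]|(?<!\/)\/(?!\/)|(['"])(?:\\\1|(?!\1).)*\1)*(\/\/.*)$/m;

// Returns the comment text, or null when there is none.
function commentOf(text) {
  const m = numbered.exec(text);
  return m ? m[2] : null;
}

console.log(commentOf('say("hello \\" //world"); //greeting')); // //greeting
```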
</blockquote>
</blockquote>
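<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Summary</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
From a regular expression standpoint, JavaScript is really too weak, although my own shallow JavaScript skills are partly to blame. Perl's regex support, by contrast, is powerful and thorough. The implementation above should cover the vast majority of comment cases. If your testing turns up a bug, or you run into an even nastier string, feel free to leave a comment.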
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/uncomment-program-with-regex.html/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>[Link] Searching text files using regular expressions</title>
		<link>http://iregex.org/blog/search-text-file-with-regex.html</link>
		<comments>http://iregex.org/blog/search-text-file-with-regex.html#comments</comments>
		<pubDate>Sat, 27 Mar 2010 09:15:46 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[url ibm newbie]]></category>

		<guid isPermaLink="false">http://iregex.org/blog/%e9%93%be%e6%8e%a5%e4%bd%bf%e7%94%a8%e6%ad%a3%e5%88%99%e8%a1%a8%e8%be%be%e5%bc%8f%e6%90%9c%e7%b4%a2%e6%96%87%e6%9c%ac%e6%96%87%e4%bb%b6.html</guid>
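		<description><![CDATA[The IBM library has published a new article: Learn Linux, 101: Search text files using regular expressions: finding a needle in the sea. Worth a read for regex newcomers and veterans alike. The subtitle, "finding a needle in the sea", is a nice touch... ]]></description>
			<content:encoded><![CDATA[<p>The IBM library has published a new article:</p>
<ul>
<li><a href="http://www.ibm.com/developerworks/cn/linux/l-lpic1-v3-103-7/?ca=drs-tp4608">Learn Linux, 101: Search text files using regular expressions: finding a needle in the sea</a></li>
</ul>
<p>It is worth a read for newcomers to and veterans of <a href="http://iregex.org" title="我爱正则表达式">regular expressions</a> alike. The subtitle, “finding a needle in the sea”, is a nice touch.</p>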
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/search-text-file-with-regex.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
