<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; perl</title>
	<atom:link href="http://iregex.org/blog/tag/perl/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Sun, 27 Jun 2010 04:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>
	<item>
		<title>Regex Notes</title>
		<link>http://iregex.org/blog/regex-note-20100621.html</link>
		<comments>http://iregex.org/blog/regex-note-20100621.html#comments</comments>
		<pubDate>Mon, 21 Jun 2010 15:04:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[杂项]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[pos]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=128</guid>
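		<description><![CDATA[Three notes, posted here. Case-insensitive matching for the first letter only: for a while, when writing regular expressions to match Drug keywords, I kept producing expressions such as /viagra&#124;cialis&#124;anti-ed/. To make them tidier, I sorted the keywords; to... ]]></description>
			<content:encoded><![CDATA[<p>Three notes, posted here.</p>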
<p><span id="more-128"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Case-insensitive matching for the first letter only</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>For a while, when writing regular expressions to match <code class="codecolorer text default"><span class="text">Drug</span></code> keywords, I kept producing expressions such as <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">|</span>cialis<span style="color: #339933;">|</span>anti<span style="color: #339933;">-</span>ed<span style="color: #339933;">/</span></span></code>. To make them tidier, I sorted the keywords; to make them faster, I used <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span>Vv<span style="color: #009900;">&#93;</span>iagra<span style="color: #339933;">/</span></span></code> rather than <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">/</span>i</span></code>, so that only the part that needs it is matched case-insensitively. More precisely, I needed case-insensitive matching for the first letter of each word.</p>
<p>I wrote the following function to do the conversion in bulk.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#convert regex to sorted list, then provide both lower/upper case for the first letter of each word</span><br />
<span style="color: #666666; font-style: italic;">#luf means lower upper first</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> luf<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; split the regex with the delimiter |</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@arr</span><span style="color: #339933;">=</span><span style="color: #000066;">sort</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\|/</span><span style="color: #339933;">,</span><span style="color: #000066;">shift</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; provide both the upper and lower case for the &nbsp;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; first leffer of each word </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #0000ff;">\b</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>zA<span style="color: #339933;">-</span>Z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\l</span><span style="color: #0000ff;">$1</span><span style="color: #0000ff;">\u</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">/</span>g<span style="color: #339933;">;</span><span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; join the keyword to a regex again</span><br />
&nbsp; &nbsp; <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'|'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000066;">print</span> luf <span style="color: #ff0000;">&quot;sex pill|viagra|cialis|anti-ed&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; the output is:[aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Controlling where the next global match starts</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
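<p>I remember jyf once asked me how to control where a match starts. Well, now I can answer that question. Perl provides the pos function, which can adjust where the next match starts during a <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>g</span></code> global match. For example:</p>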
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
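<p>Each iteration matches the next two letters, namely <code class="codecolorer text default"><span class="text">ab, cd, ef</span></code> (print emits them back to back, without separators).</p>
<p>You can assign to pos($_) to reposition where the next match starts, for example:</p>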
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">pos</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">--;</span> &nbsp;<span style="color: #666666; font-style: italic;">#pos($_)++;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Output:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">pos($_)--: &nbsp;ab, bc, cd, de, ef, fg.<br />
pos($_)++: &nbsp;ab, de.</div></td></tr></tbody></table></div>
<p>See the <a href="http://perldoc.perl.org/functions/pos.html" title="我爱正则表达式" target="_blank">pos</a> entry in the Perl documentation for details.</p>
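<p>Python has no assignable <code>pos()</code>, but a compiled pattern's <code>search</code> method accepts a start position, so a similar effect can be sketched as follows. The function name <code>scan</code> and its <code>shift</code> parameter are assumptions made for this illustration, not part of any library.</p>
<pre>
```python
import re


def scan(text, shift=0):
    """Collect two-character matches, moving the next start position by shift after each one."""
    pair = re.compile(r"..")
    pos = 0
    found = []
    while True:
        # Pattern.search(string, pos) starts looking at index pos.
        mo = pair.search(text, pos)
        if mo is None:
            break
        found.append(mo.group())
        # shift=-1 mimics pos($_)--, shift=+1 mimics pos($_)++.
        pos = mo.end() + shift
    return found


print(scan("abcdefg"))      # plain /../g behaviour
print(scan("abcdefg", -1))  # step back one character each time
print(scan("abcdefg", +1))  # skip one extra character each time
```
</pre>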
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Hashes and regex substitution</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
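Chapter 3 of Effective Perl Programming, 2nd edition (effective-perl-2e) has the following example (see the code below), which escapes special characters.</p>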
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">my %ent = ( '&amp;' =&gt; 'amp', '&lt;' =&gt; 'lt', '&gt;' =&gt; 'gt' );<br />
$html =~ s/([&amp;&lt;&gt;])/&amp;$ent{$1};/g;</div></td></tr></tbody></table></div>
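<p>This example is very clever. It makes flexible use of the hash data structure: each piece of text to be replaced is a key, and its replacement is the corresponding value. Whenever the pattern matches, the matched character is captured, used as a key to look up the value, and the value is interpolated into the replacement. It shows how productive a high-level language can be.</p>
<p>Can this Perl code be ported to Python, though? Python also supports regular expressions and hashes (called dictionaries in Python), but it does not seem to allow anything fancy in the replacement, such as interpolating variables into the replacement string.</p>
<p>The Python documentation (run python in a shell, then import re, then help(re)) says:</p>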
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sub(pattern, repl, string, count=0)<br />
&nbsp; &nbsp; Return the string obtained by replacing the leftmost<br />
&nbsp; &nbsp; non-overlapping occurrences of the pattern in string by the<br />
&nbsp; &nbsp; replacement repl. &nbsp;repl can be either a string or a callable;<br />
&nbsp; &nbsp; if a string, backslash escapes in it are processed. &nbsp;If it is<br />
&nbsp; &nbsp; a callable, it's passed the match object and must return<br />
&nbsp; &nbsp; a replacement string to be used.</div></td></tr></tbody></table></div>
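<p>It turns out that Python, like PHP, accepts a callable (a callback function) as the replacement. The function receives a single argument, the match object. With that, the problem becomes simple:</p>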
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<br />
ent=<span style="color: black;">&#123;</span><span style="color: #483d8b;">'&lt;'</span>:<span style="color: #483d8b;">&quot;lt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&gt;'</span>:<span style="color: #483d8b;">&quot;gt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&amp;'</span>:<span style="color: #483d8b;">&quot;amp&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> rep<span style="color: black;">&#40;</span>mo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> ent<span style="color: black;">&#91;</span>mo.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span><br />
<br />
html=<span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;([&amp;&lt;&gt;])&quot;</span>,rep, html<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
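<p>The key to a Python replacement callback is that its argument is a match object. Once that is clear, look up in the manual which attributes and methods such objects offer and use them as needed, and you can write flexible, efficient Python regex substitutions.</p>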
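<p>As a small illustration of using more of the match object, the sketch below combines <code>group()</code> and <code>start()</code> inside a callback. The sample data (<code>prices</code>) and the function name <code>describe</code> are hypothetical and chosen only for this example.</p>
<pre>
```python
import re

# Hypothetical lookup table, not taken from the original post.
prices = {"apple": 3, "pear": 5}


def describe(mo):
    # mo.group(0) is the whole matched text; mo.start() is its offset
    # in the original string.
    word = mo.group(0)
    return "%s@%d=%d" % (word, mo.start(), prices[word])


result = re.sub(r"apple|pear", describe, "apple pear apple")
print(result)
```
</pre>
<p>Other match object methods, such as <code>end()</code>, <code>span()</code>, and <code>groupdict()</code>, can be used in the callback in the same way.</p>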
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-note-20100621.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A Short Discussion of “Exclusion Matching”</title>
		<link>http://iregex.org/blog/negate-match.html</link>
		<comments>http://iregex.org/blog/negate-match.html#comments</comments>
		<pubDate>Mon, 24 May 2010 08:46:29 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[问答]]></category>
		<category><![CDATA[exclude]]></category>
		<category><![CDATA[lookaround]]></category>
		<category><![CDATA[negate]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=122</guid>
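		<description><![CDATA[A reader, cfc4n, asked a regular expression question about (?!). After answering, I summarized how to match the absence of an element in Perl and am posting it here. The problem. Problem description: given the following text, how can a regular expression match the items that contain no color option... ]]></description>
			<content:encoded><![CDATA[<p>A reader, cfc4n, asked a regular expression question about (?!). After answering, I summarized how to match the absence of an element in Perl and am posting it here.<span id="more-122"></span></p>
<h2 style="background-color: rgb(153, 204, 0); border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); font-size: 21px; line-height: 35px; padding-top: 3px; text-indent: 6px;">The problem</h2>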
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Problem description</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
    Given the following text, how can a regular expression match the <b>items that contain no color option</b>?</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&lt;item&gt;<br />
&nbsp; &nbsp; color:red;<br />
&lt;/item&gt;<br />
&lt;item&gt;<br />
&nbsp; &nbsp; size:12;<br />
&nbsp; &nbsp; number:45;<br />
&nbsp; &nbsp; type:good;<br />
&lt;/item&gt;</div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A typical wrong answer</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
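<p>Beginners tend to offer this wrong answer: <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.*?&lt;/</span>item<span style="color: #339933;">&gt;</span></span></code>. The starting point is right: the match is wanted only when color does not appear in the target string. In practice, though, this regular expression does not do what you want; it matches every <code class="codecolorer text default"><span class="text">&lt;item&gt;...&lt;/item&gt;</span></code>. Why? Because the lookahead is tested at a single position only, and the engine is free to pick any position where color does not immediately follow, after which the second lazy dot can still consume color.</p>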
</blockquote>
</blockquote>
<h2 style="background-color: rgb(153, 204, 0); border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); font-size: 21px; line-height: 35px; padding-top: 3px; text-indent: 6px;">Exclusion matching in Perl</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The simplest exclusion match</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Matching is <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">=~</span></span></code>, so not matching is naturally <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">!~</span></span></code>. This reminds me that every regex-related symbol written with <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">=</span></span></code> has a counterpart that uses <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">!</span></span></code> instead to express the opposite meaning: for example <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?=</span><span style="color: #009900;">&#41;</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span><span style="color: #009900;">&#41;</span></span></code>, <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;=</span><span style="color: #009900;">&#41;</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!</span><span style="color: #009900;">&#41;</span></span></code>, <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">=~</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">!~</span></span></code>.</p>
<p>Back to the topic, with an example. To test whether a string contains good, use <code class="codecolorer perl default"><span class="perl"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$string</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/good/</span><span style="color: #009900;">&#41;</span></span></code>: the condition is true if <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">$string</span></span></code> contains good and false otherwise.</p>
<p>To test whether a string does <b>not</b> contain good, use <code class="codecolorer perl default"><span class="perl"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$string</span> <span style="color: #339933;">!~</span> <span style="color: #009966; font-style: italic;">/good/</span><span style="color: #009900;">&#41;</span></span></code>: the condition is true if <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">$string</span></span></code> does not contain good and false otherwise.</p>
<p>This kind of test suits searching a large string for a simple pattern and then branching one of two ways on the result. It is quick and direct, but it becomes cumbersome for more complex conditions.</p>
<p>The problem posed at the start of this article can of course be solved this way: first find every <code class="codecolorer text default"><span class="text">&lt;item&gt;...&lt;/item&gt;</span></code>, then check each one for a color entry:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=</span><span style="color: #cc0000; font-style: italic;">&lt;&lt;END;<br />
&lt;item&gt;<br />
&nbsp; &nbsp; color:red;<br />
&lt;/item&gt;<br />
&lt;item&gt;<br />
&nbsp; &nbsp; size:12;<br />
&nbsp; &nbsp; number:45;<br />
&nbsp; &nbsp; type:good;<br />
&lt;/item&gt;<br />
END</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@result</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #339933;">!</span><span style="color: #009999;">&lt;item&gt;</span><span style="color: #339933;">.*?&lt;/</span>item<span style="color: #339933;">&gt;!</span>sg<span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$item</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$item</span> <span style="color: #339933;">!~</span> <span style="color: #009966; font-style: italic;">/color/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;$item&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>The output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&lt;item&gt;<br />
&nbsp; &nbsp; size:12;<br />
&nbsp; &nbsp; number:45;<br />
&nbsp; &nbsp; type:good;<br />
&lt;/item&gt;</div></td></tr></tbody></table></div>
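<p>This works, but it always collects every candidate first, erring on the side of catching too much, and then filters them one by one. Can we state up front that what we want is <strong>items that contain no color</strong>? <span style="color:#ff008c">Exclusion matching</span> exists for exactly this.</p>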
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Exclusion matching</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
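<p>Apologies: “exclusion matching” is a term I made up. Other names are “negative assertion” and “negative lookaround”. Those two describe the matching process, whereas my name describes the result. Concretely, it means using <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!...</span><span style="color: #009900;">&#41;</span></span></code> and <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!...</span><span style="color: #009900;">&#41;</span></span></code> as auxiliary conditions to simplify a regular expression and find the wanted matches quickly and conveniently.</p>
<p>The two constructs are used in similar ways: both assert that some pattern <span style="color:#ff008c">does not appear</span> at the current position. The difference is that <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!...</span><span style="color: #009900;">&#41;</span></span></code> looks to the right of the current position, while <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!...</span><span style="color: #009900;">&#41;</span></span></code> looks to the left.</p>
<p>I warmly recommend the tutorials translated by <a href="http://anrs.sacredfir.com/" target="_blank" title="我爱正则表达式">Anrs</a>: <a href="http://anrs.sacredfir.com/archives/295" target="_blank" title="我爱正则表达式">Lookaround, Part 1</a> and <a href="http://anrs.sacredfir.com/archives/338" target="_blank" title="我爱正则表达式">Lookaround, Part 2</a>. Reading these two articles carefully and fully understanding lookaround will improve your regular expression skills. The rest of this post assumes you already understand the concept.</p>
<p>An aside. Since “left” and “right” are both vivid and easy to understand, why do we never see “look left” and “look right”, only harder terms such as “lookahead and lookbehind” or “forward and backward”? <a href="https://twitter.com/kwl_01_skz/status/14069944812" target="_blank" title="我爱正则表达式">撕烤者</a> asked the same question. My guess is that it accommodates users of right-to-left scripts such as Arabic. In any case, calling the direction from <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">^</span></span></code> to <code class="codecolorer perl default"><span class="perl">$</span></code> “forward” is never wrong.</p>
<p>The job of lookaround is to describe the pattern to the left or right of the current position and thereby help decide whether the regular expression matches. It only describes and consumes no characters; it only assists the decision and is not used on its own. In this respect it is just like <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">^</span></span></code> and <code class="codecolorer perl default"><span class="perl">$</span></code>.</p>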
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">An example</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p><strong>Example. </strong>There are now many sites with addresses similar to fanfou.com. How do you write a regular expression that matches domain names containing fanfou whose TLD is not .com?</p>
<p><strong>Answer: </strong><code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #0000ff;">\bfanfou</span>\<span style="color: #339933;">.</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>com<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4</span><span style="color: #009900;">&#125;</span><span style="color: #0000ff;">\b</span><span style="color: #339933;">/</span>i</span></code>. Breaking this regular expression down:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>It starts with <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">\b</span></span></code> to pin down the word boundary;</li>
<li>the main domain name fanfou is required;</li>
<li><code class="codecolorer perl default"><span class="perl">\<span style="color: #339933;">.</span></span></code> matches a literal dot; do not use the dot metacharacter here;</li>
<li><code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>com<span style="color: #009900;">&#41;</span></span></code> asserts that the three consecutive characters com must not appear here, that is, immediately to the right of <code class="codecolorer text default"><span class="text">fanfou.</span></code>;</li>
<li><code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4</span><span style="color: #009900;">&#125;</span></span></code> matches 2 to 4 Latin letters, because a TLD is at least 2 letters long (e.g. .au, .us) and can be 4 letters long (e.g. .info, .asia); longer TLDs such as .museum are not covered by this pattern;</li>
<li>the right-hand boundary matters just as much; without it, the {2,4} above would be wasted;</li>
<li>the i modifier makes the match case-insensitive, which is one of the properties of domain names.</li>
</ul>
</blockquote>
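<p>The breakdown above can be tried out with a short Python sketch. The host names in the list are made-up test inputs, and the variable names are assumptions for illustration; the pattern itself is the one from the answer.</p>
<pre>
```python
import re

# Same pattern as in the answer; re.IGNORECASE plays the role of Perl's /i.
not_dot_com = re.compile(r"\bfanfou\.(?!com)[a-z]{2,4}\b", re.IGNORECASE)

# Hypothetical inputs used only to exercise the pattern.
hosts = ["fanfou.com", "fanfou.net", "FANFOU.org", "fanfou.info", "fanfou.co"]

matched = [host for host in hosts if not_dot_com.search(host)]
print(matched)
```
</pre>
<p>Note that <code>(?!com)</code> rejects anything that begins with com right after the dot, not only the exact TLD .com.</p>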
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Back to the original problem</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
        Let us build the regular expression step by step from the requirements.</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>The regular expression matches the <code class="codecolorer text default"><span class="text">&lt;item&gt;...&lt;/item&gt;</span></code> structure, so it starts with <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span></span></code>.</li>
<li>The hard part is that color must not appear between <code class="codecolorer text default"><span class="text">&lt;item&gt;</span></code> and <code class="codecolorer text default"><span class="text">&lt;/item&gt;</span></code>. Since <code class="codecolorer text default"><span class="text">color</span></code> could sit at any point inside the structure, we must require that the word color does not start at any point within it. One such point is written <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span></span></code>. Repeating it one or more times gives <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span></span></code>. There is a small trap here: do not write <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.+</span></span></code>, which only states that color must not start at the leftmost point and leaves the rest unconstrained. Writing <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span></span></code> guarantees that color does not start at any point.</li>
<li>The regular expression is now <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+?&lt;/</span>item<span style="color: #339933;">&gt;</span></span></code>. To save resources, the parentheses are usually written as a non-capturing group <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:...</span><span style="color: #009900;">&#41;</span></span></code>; to make the dot match newlines, either specify the s modifier or replace the dot metacharacter with <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span>\<span style="color: #000066;">s</span><span style="color: #0000ff;">\S</span><span style="color: #009900;">&#93;</span></span></code>. Here we keep the dot and rely on the s modifier. The regular expression becomes <code class="codecolorer perl default"><span class="perl"><span style="color: #009999;">&lt;item&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?!</span>color<span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+?&lt;/</span>item<span style="color: #339933;">&gt;</span></span></code>.</li>
</ul>
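<p>The difference between the two spellings in the list above can be checked with a short JavaScript sketch. The sample string is made up for illustration, and <code>[\s\S]</code> is used in place of the dot so that an item may span several lines.</p>

```javascript
// Tempered pattern from the list above; [\s\S] stands in for the dot so that
// an item may span several lines.
const tempered = /<item>(?:(?!color)[\s\S])+?<\/item>/g;

// The trap: the lookahead is checked once, right after <item>, and nowhere else.
const trap = /<item>(?!color)[\s\S]+?<\/item>/g;

// Made-up sample data for illustration.
const sample = [
  "<item>red apple</item>",
  "<item>the color is red</item>",
  "<item>plain</item>",
].join("\n");

const temperedHits = sample.match(tempered);
const trapHits = sample.match(trap);

console.log(temperedHits); // the two items that do not contain "color"
console.log(trapHits);     // all three items, including the one with "color"
```

<p>The tempered pattern skips the middle item because the lookahead is re-checked before every character it consumes, while the trap pattern checks it only once.</p>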
</blockquote>
</blockquote>
</blockquote>
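<p>Overall, lookaround is more abstract than the basic metacharacters. Once you understand and master it, though, you will find it very useful for precise matching and replacement. I hope the analysis above helps. If you have a similar problem, feel free to ask.</p>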
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/negate-match.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Building Your Own Regex Helper Program</title>
		<link>http://iregex.org/blog/diy-regexbuddy.html</link>
		<comments>http://iregex.org/blog/diy-regexbuddy.html#comments</comments>
		<pubDate>Wed, 12 May 2010 05:32:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[cgi]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[regexbuddy]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=115</guid>
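		<description><![CDATA[RegexBuddy is actually quite good, and I have always used it. A lot could be written about how to use it and what it does well, and this site has covered it before. Still, there are reasons not to use it, and they are part of why I wrote this article. After some thought and a little time, I have already built... ]]></description>
			<content:encoded><![CDATA[<p>RegexBuddy is actually quite good, and I have always used it. A lot could be written about how to use it and what it does well, and this site has covered it before. Still, there are reasons not to use it, and they are part of why I wrote this article. After some thought and a little time, I have a working prototype. I am publishing my approach here so we can discuss it.</p>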
<p><span id="more-115"></span></p>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Motivation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Why I stopped using RegexBuddy</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
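<li>It is commercial software and not cheap: $39.95. A Google search may turn up a pleasant surprise.</li>
<li>It runs only on Windows. On Ubuntu I do install Wine, solely to run RegexBuddy.</li>
<li>RegexBuddy does not run on the Mac. I recently started working on a Mac and no longer want to maintain a separate environment just for Windows software, so RegexBuddy and I seem about to part ways. I searched <a href="http://search.macupdate.com/search.php?keywords=regex&#038;os=mac" title="我爱正则表达式|打造自己的正则表达式助手程序">here</a> and <a href="http://www.apple.com/search/?q=regex&#038;sec=downloads" title="我爱正则表达式|打造自己的正则表达式助手程序">here</a>, but found only a handful of tools, none of them impressive: most support only a plain regex flavor such as JavaScript's and lack multi-language and multi-option support. RegexBuddy's excellent performance has made me very picky about regex helper tools, and ordinary software rarely measures up.</li>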
</ul>
<p>With no ready-made solution, I started thinking about how to build one myself.</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">My ideal regex helper tool</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>Like RegexBuddy, it should support the following:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
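<li>Multiple regex flavors: at least Perl, Python, PHP, and JavaScript. I rarely use .NET (only when answering other people's questions, which doesn't count), so it can be ignored;</li>
<li>Matching, replacing, and splitting;</li>
<li>Code snippet generation. This matters a lot: I don't memorize things the computer can do for me unless I use them often, and what I use often gradually becomes muscle memory anyway.</li>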
</ol>
</blockquote>
<li>Beyond that, ideally it would also:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
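<li>Run on all common platforms, by which I mean Windows, Linux, and Mac.</li>
<li>Support each language natively. Frankly, I suspect RegexBuddy still uses Perl 5.8-style regexes; many of the novel and useful features in 5.10 are not yet supported in it. The likely reason is that RegexBuddy's author built the Perl and other regex engines from scratch, so they differ from the latest releases in details and versions. Speaking of languages, I recall a piece of advice from Yu Sheng (余晟): when thinking about a regex problem, don't start from a particular language or version; keep a unified syntax in mind. I agree, but I also think that once you know the seemingly universal basics, you should clearly understand the syntax details of the flavor you use most and how it differs from the others, so that you don't end up with something that only looks right. But I digress.</li>
<li>Be open source, legitimately licensed, and free. When we introduce regexes to other people, we need a presentable tool to show them. I don't insist on free, though: good software deserves to be rewarded.</li>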
</ol>
</blockquote>
</ol>
<p>The question is where to find such software. And if it cannot be found, how should I go about building it from scratch?</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">我的思路历程</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
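<li>Implement it in Objective-C. That idea did not last long. Objective-C is certainly worth learning, but I can't wait: I use tools like RegexBuddy every day, and learning a language first would be digging a well only after getting thirsty. Even after developing for the Mac, I would still have to build separately for Windows and Linux. And with Objective-C, what about the regex engine? Write one from scratch? xiaofei says that a good regex engine takes an excellent team half a year. Of course, Objective-C can also call existing modules, which led to my current approach.</li>
<li>Build it as a web application: the front end takes the user's input, the back end uses CGI to call the native regex engines on the server (Perl, Python), and the match or replacement result is displayed in the front end. Its biggest advantage is that each language is 100% native. It works anywhere there is a network and a browser, and even offline it can run on localhost, which is faster still. JavaScript and PHP need no CGI at all: each can be evaluated directly by its own engine.</li>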
</ul>
<p>I chose the second option and set about implementing it.
                        </p></blockquote>
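<p>As a rough sketch of the front-end half of this design, the function below packs the form fields into a URL-encoded request body. The field names (<code>regex</code>, <code>text</code>, <code>mode</code>, <code>space</code>, <code>action</code>, <code>replace</code>) are the ones read by the Perl CGI script later in this post; the endpoint path and the function name are assumptions made for illustration, not part of the original tool.</p>

```javascript
// Hypothetical endpoint; the article does not say where the CGI script lives.
const ENDPOINT = "/cgi-bin/regex.cgi";

// Encode the form fields that the Perl CGI script reads.
// buildRegexRequest is a made-up helper name for this sketch.
function buildRegexRequest({ regex, text, action, mode = "", space = "", replace = "" }) {
  const params = new URLSearchParams({ regex, text, mode, space, action });
  if (action === "replace") {
    // Only the replace action uses the replacement string.
    params.set("replace", replace);
  }
  return { url: ENDPOINT, method: "POST", body: params.toString() };
}

const request = buildRegexRequest({
  regex: "\\d+",
  text: "a1 b22",
  action: "match",
  mode: "i",
});
console.log(request.body);
```

<p>The resulting body could then be posted with jQuery's <code>$.post</code> or with <code>fetch</code>, and the HTML returned by the script inserted into the page.</p>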
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Current progress</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
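<li>I have built a simple interface with HTML and jQuery and implemented a CGI program for Perl 5.10 that can match, replace, and split.</li>
<li>Not yet implemented: automatic code snippet generation; versions for other languages.</li>
<li>For my own purposes it is basically usable already. I am eating my own dog food right now, improving it as I use it. Releasing it for everyone, however, will still take a long stretch of feature work and interface polish.</li>
<li>Screenshots are at the end of the article.</li>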
</ul>
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Perl CGI code and brief notes</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #000000; font-weight: bold;">use</span> CGI<span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> cl <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> h_color<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$color</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$counter</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">?</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&lt;span style='background-color:$color'&gt;&quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$a</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;&lt;/span&gt;&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">=</span>CGI<span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">die</span> <span style="color: #ff0000;">&quot;$!&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">header</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">-</span>type<span style="color: #339933;">=&gt;</span><span style="color: #ff0000;">&quot;text/html; charset=UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;regex&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;">#quit immediately if no $regex input</span><br />
<span style="color: #000066;">die</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;text&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$mode</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;mode&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;space&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$action</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;action&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\s+//g</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;match&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'$text =~ s@$regex'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@&amp;h_color($&amp;)'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@eg'</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$mode</span><span style="color: #339933;">.</span><span style="color: #ff0000;">';'</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br /&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$text</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$replace</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\$</span>text =~ s:<span style="color: #000099; font-weight: bold;">\$</span>regex:$replace:g;&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #ff0000;">&quot;$code&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br/&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;&lt;pre&gt;$text&lt;/pre&gt;&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;split&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#@result=split(m@$regex@mode, $text);</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=split(m@$regex@'</span><span style="color: #339933;">.</span> <span style="color: #0000ff;">$mode</span> <span style="color: #339933;">.</span> <span style="color: #ff0000;">', $text);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=grep /\S/, @result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'my $count=@result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;font color=\&quot;#ff008c\&quot;&gt;$count&lt;/font&gt; record(s) returned:&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;ol&gt;&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;li&gt;&quot;.&amp;h_color($_).&quot;&lt;/li&gt;&quot; foreach (@result);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;/ol&gt;&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</blockquote>
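<p>The code is fairly concise. It reads the parameters and does some light processing, generates code dynamically according to the requested action (match/replace/split), then runs it with eval and returns the output. The match and split branches also add a small highlighting feature. Thanks to Perl's compact, efficient syntax, the implementation stays short. Thanks to <a href="http://twitter.com/cnhacktnt">cnhacktnt</a> for the help.</p>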
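<p>For comparison, here is a minimal sketch of the match and split actions in JavaScript. It is not the author's Perl: it builds the pattern with the <code>RegExp</code> constructor rather than assembling a code string and calling eval, so user input is never executed as code. The function names are hypothetical; the alternating colours follow <code>h_color</code> in the script above.</p>

```javascript
// Alternate between the two background colours used by h_color in the Perl script.
function makeHighlighter() {
  let counter = 1;
  return (s) => {
    counter *= -1;
    const color = counter < 0 ? "#ff0" : "#0ff";
    return "<span style='background-color:" + color + "'>" + s + "</span>";
  };
}

// "match" action: wrap every match in a highlighted span.
function highlightMatches(text, pattern, flags = "") {
  const highlight = makeHighlighter();
  // The g flag is required so that all matches are replaced, not just the first.
  const re = new RegExp(pattern, flags.includes("g") ? flags : flags + "g");
  return text.replace(re, (m) => highlight(m));
}

// "split" action: split on the pattern and keep only pieces that contain
// at least one non-whitespace character, like the grep /\S/ in the Perl script.
function splitNonBlank(text, pattern, flags = "") {
  return text
    .split(new RegExp(pattern, flags))
    .filter((piece) => piece !== undefined && /\S/.test(piece));
}

const html = highlightMatches("a1 b22", "\\d+");
const parts = splitNonBlank("a, b,, c", ",\\s*");
console.log(html);
console.log(parts);
```

<p>Like the original script, this sketch does not HTML-escape the input text, so it is only suitable for trusted input.</p>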
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Screenshots</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match_cn.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/replace.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/split_cn.png" border="0" alt="Photobucket"></a></li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/diy-regexbuddy.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Removing Comments with Regular Expressions</title>
		<link>http://iregex.org/blog/uncomment-program-with-regex.html</link>
		<comments>http://iregex.org/blog/uncomment-program-with-regex.html#comments</comments>
		<pubDate>Sat, 03 Apr 2010 09:51:56 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Q&A]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[negative lookaround]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=83</guid>
		<description><![CDATA[Problem. The following is excerpted from a reader's email: Difficulties. JavaScript has no mode in which the dot matches newlines, so it cannot directly match across lines; to handle a // that is not preceded by http:, the natural tool is negative lookbehind: &#40;?&#60;!http:&#41;\/\/. Unfortunately JavaScript does not... ]]></description>
			<content:encoded><![CDATA[<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Problem</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The following is excerpted from a reader's email: </h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
<a href="http://iregex.org/blog/uncomment-program-with-regex.html" target="_blank" title="javascript正则中的否定前瞻"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100402104810-1.png" border="0" alt="javascript正则中的否定前瞻"></a>
</p></blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Difficulties</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>JavaScript has no mode in which the dot matches newlines, so it <strong>cannot directly</strong> match across lines; </li>
<li>To handle a <code class="codecolorer text default"><span class="text">//</span></code> that is not preceded by <code class="codecolorer text default"><span class="text">http:</span></code>, the natural tool is negative lookbehind: <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!</span>http<span style="color: #339933;">:</span><span style="color: #009900;">&#41;</span>\<span style="color: #339933;">/</span>\<span style="color: #339933;">/</span></span></code>. Unfortunately JavaScript does not support it.<br /><a href="http://iregex.org/blog/uncomment-program-with-regex.html" target="_blank" title="javascript正则中的否定前瞻"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100401091312.png" border="0" alt="javascript正则中的否定前瞻"></a></li>
</ol>
</blockquote>
<p><span id="more-83"></span></p>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Approach</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Multi-line matching</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>I have covered this before: the key is to use <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\S</span>\<span style="color: #000066;">s</span><span style="color: #009900;">&#93;</span></span></code> to emulate a dot that matches newlines. The original post is here: "<a href="http://iregex.org/blog/diy-match-all-mode-dot.html">DIY Universal Wildcard</a>". Based on that, the following JavaScript removes multi-line comments:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-style: italic;">//to uncomment C-style multiple line comment</span><br />
<span style="color: #003366; font-weight: bold;">function</span> uncomment_multi<span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp;<span style="color: #000066; font-weight: bold;">return</span> str.<span style="color: #660066;">replace</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\/\*[\S\s]*?\*\//g</span><span style="color: #339933;">,</span> <span style="color: #3366CC;">&quot;&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
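<p>A quick check of why the character class matters. The sketch below repeats the function above so that it runs on its own, and adds a variant with a plain dot for comparison; the variant and the sample source string are made up for illustration.</p>

```javascript
// Same function as above, repeated so that this example runs on its own.
function uncomment_multi(str) {
  return str.replace(/\/\*[\S\s]*?\*\//g, "");
}

// Variant with a plain dot. With no flag that changes its meaning,
// the dot does not match line terminators.
function uncomment_multi_dot(str) {
  return str.replace(/\/\*.*?\*\//g, "");
}

// Made-up input: one comment spanning two lines, one comment on a single line.
const source = "var a = 1; /* first\n second line */\nvar b = 2; /* inline */";

const cleaned = uncomment_multi(source);
const dotOnly = uncomment_multi_dot(source);

console.log(JSON.stringify(cleaned)); // both comments removed
console.log(JSON.stringify(dotOnly)); // the two-line comment is still there
```

<p>The dot-based variant removes only the comment that fits on one line, which is exactly the limitation the character class works around.</p>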
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Single-line comments in JavaScript (incomplete)</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Single-line comments are not as simple as they seem. If you think <code class="codecolorer javascript default"><span class="javascript">str.<span style="color: #660066;">replace</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\/\/.*$/gm</span><span style="color: #339933;">,</span> <span style="color: #3366CC;">&quot;&quot;</span><span style="color: #009900;">&#41;</span></span></code> is enough, then the text being processed must be of the simplest kind, such as:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #003366; font-weight: bold;">var</span> pig<span style="color: #339933;">=</span><span style="color: #3366CC;">&quot;ase&quot;</span><span style="color: #339933;">;</span> <span style="color: #006600; font-style: italic;">//this is a comment.</span></div></td></tr></tbody></table></div>
<p>In practice this does not work. Real programs are full of lines like these:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #003366; font-weight: bold;">var</span> url<span style="color: #339933;">=</span><span style="color: #3366CC;">&quot;http://iregex.org&quot;</span><span style="color: #339933;">;</span> <span style="color: #006600; font-style: italic;">//this is my site.</span><br />
<span style="color: #003366; font-weight: bold;">var</span> url<span style="color: #339933;">=</span><span style="color: #3366CC;">&quot;//not real comment here http://iregex.org&quot;</span><span style="color: #339933;">;</span> <span style="color: #006600; font-style: italic;">//this is my site.</span></div></td></tr></tbody></table></div>
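<p>To see concretely what goes wrong, the sketch below applies the naive "delete from the first double slash to the end of the line" replacement to the two lines above. The function name <code>stripNaive</code> is made up for this illustration.</p>

```javascript
// The naive approach: delete everything from the first // to the end of each line.
function stripNaive(str) {
  return str.replace(/\/\/.*$/gm, "");
}

// The two example lines from the text.
const lines = [
  'var url="http://iregex.org"; //this is my site.',
  'var url="//not real comment here http://iregex.org"; //this is my site.',
];

const stripped = lines.map(stripNaive);
console.log(stripped[0]); // the URL is cut off right after "http:"
console.log(stripped[1]); // the whole string literal is gone
```

<p>In both cases the first double slash on the line is not the start of a comment, so the replacement destroys code instead of removing the comment.</p>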
<p>I tried writing a JavaScript function that emulates negative lookbehind. It handles the <code class="codecolorer text default"><span class="text">http://</span></code> case, but it is not pretty, and it cannot handle a double slash inside quotes. I am quite disappointed by how few features JavaScript's regex support offers, so I turned to Perl for this task. First, here is my JavaScript function for removing single-line comments:</p>
<div class="codecolorer-container javascript mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="javascript codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #003366; font-weight: bold;">function</span> uncomment_single<span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> result<span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> single<span style="color: #339933;">=</span><span style="color: #003366; font-weight: bold;">new</span> RegExp<span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;<span style="color: #000099; font-weight: bold;">\/</span><span style="color: #000099; font-weight: bold;">\/</span>.&quot;</span><span style="color: #339933;">,</span><span style="color: #3366CC;">&quot;ig&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> start<span style="color: #339933;">=</span><span style="color: #CC0000;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>result<span style="color: #339933;">=</span>single.<span style="color: #660066;">exec</span><span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">!=</span><span style="color: #003366; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> part<span style="color: #339933;">=</span>str.<span style="color: #660066;">slice</span><span style="color: #009900;">&#40;</span>start<span style="color: #339933;">,</span>result.<span style="color: #660066;">index</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #003366; font-weight: bold;">var</span> negLeft<span style="color: #339933;">=</span><span style="color: #003366; font-weight: bold;">new</span> RegExp<span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;http:$&quot;</span><span style="color: #339933;">,</span><span style="color: #3366CC;">&quot;i&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span> negLeft.<span style="color: #660066;">test</span><span style="color: #009900;">&#40;</span>part<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">return</span> str.<span style="color: #660066;">slice</span><span style="color: #009900;">&#40;</span><span style="color: #CC0000;">0</span><span style="color: #339933;">,</span>result.<span style="color: #660066;">index</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; start<span style="color: #339933;">=</span>result.<span style="color: #660066;">index</span><span style="color: #339933;">+</span>result<span style="color: #009900;">&#91;</span><span style="color: #CC0000;">0</span><span style="color: #009900;">&#93;</span>.<span style="color: #660066;">length</span><span style="color: #339933;">-</span><span style="color: #CC0000;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066; font-weight: bold;">return</span> str<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Perl version: approach and source code for stripping comments (relatively complete)</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Test text</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
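Now that the mighty Perl has been brought out, the toy examples above can step aside. I will use the following, rather more complicated, text to test my program:</p>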
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&lt;!DOCTYPE h/tml PUBLIC &quot;-//W3C//DTD XHTML\&quot; 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt; sdfasdf//real comment here//&quot;</div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Carefully analyzing the characteristics of single-line comments</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
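Analyzing these characteristics correctly is the prerequisite for writing a sound and efficient program. Observation shows that single-line comments have the following characteristics:</p>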
<ol>
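<li>A double slash inside quotes (single or double) is not a comment.</li>
<li>Quotes come in pairs, and a quote escaped with a backslash between two quotes does not end the string. For example, in <code class="codecolorer text default"><span class="text">&quot;hello \&quot; //world&quot;</span></code>, the <code class="codecolorer text default"><span class="text">//world</span></code> part must not be treated as a comment.</li>
<li>A run of consecutive characters that are neither quotes nor slashes is not a comment either. Note in particular that a single slash is not a comment. Why must this part exclude slashes as well as quotes? Because <code class="codecolorer text default"><span class="text">[^'&quot;]+</span></code> could wrongly swallow text such as <code class="codecolorer text default"><span class="text">abcde//real comment &quot;quoted string in comment&quot;</span></code>, so we tighten the condition to <code class="codecolorer text default"><span class="text">[^'&quot;/]+</span></code>; and because text such as <code class="codecolorer text default"><span class="text">abcde/real comment &quot;quoted string in comment&quot;</span></code> must still be handled correctly, we add the rule that a single slash is not a comment. The resulting regex is <code class="codecolorer text default"><span class="text">[^'&quot;/]|(?&lt;!/)/(?!/)</span></code>.</li>
<li>Apart from the above, everything from a double slash to the end of the line is a comment. Because we rely on the notion of <strong>end of line</strong>, the regex must specify multi-line mode, in which <code class="codecolorer text default"><span class="text">^$</span></code> match at the start and end of each line. This is written with the <code class="codecolorer text default"><span class="text">//m</span></code> modifier.</li>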
</ol>
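<p>As a small check of the third rule, the sketch below uses that sub-pattern to consume the longest comment-free prefix of a line. It is an illustration only, not part of the original program; the variable name <code>$plain</code> and the two sample strings are made up.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

# One "plain" character: anything that is neither a quote nor a slash,
# or a slash that has no other slash directly before or after it.
my $plain = qr{ [^'"/] | (?&lt;!/) / (?!/) }x;

for my $text ('abcde/real text', 'abcde//real comment') {
    # Consume as many plain characters as possible from the start of the line.
    my ($prefix) = $text =~ /^((?:$plain)*)/;
    print "$text =&gt; [$prefix]\n";
}
</code></pre>
<p>The first string contains only a lone slash, so the whole string is consumed; in the second, consumption should stop right before the double slash, leaving <code>abcde</code>.</p>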
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">正则实现</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<span style="color: #0000ff;">$str</span> <span style="color: #339933;">=</span> <span style="color: #cc0000; font-style: italic;">&lt;&lt;'EOF';<br />
&lt;!DOCTYPE h/tml PUBLIC &quot;-//W3C//DTD XHTML\&quot; 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt; sdfasdf//real comment here//&quot; <br />
EOF</span><br />
<span style="color: #666666; font-style: italic;">#print $str;</span><br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$str</span><span style="color: #339933;">=~</span> <br />
&nbsp; &nbsp; m<span style="color: #339933;">%</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">'&quot;/]|<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?&lt;!/)/(?!/)|<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?&lt;quote&gt;['</span><span style="color: #ff0000;">&quot;])<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?:<span style="color: #000099; font-weight: bold;">\\</span> <span style="color: #000099; font-weight: bold;">\g</span>{quote}|<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?!<span style="color: #000099; font-weight: bold;">\g</span>{quote}).)*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000099; font-weight: bold;">\g</span>{quote}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (?&lt;comment&gt;//.*)<br />
&nbsp; &nbsp; &nbsp; &nbsp; $<br />
&nbsp; &nbsp; %xm) <br />
&nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; print $+{comment}; <br />
}</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A few additional notes</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
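<li>This program only runs on Perl 5.10 or later, because it uses relatively advanced features such as the named capture <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?</span><span style="color: #009999;">&lt;quote&gt;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'&quot;])</span></span></code>. Of course, it can be done without 5.10: we can simply use numbered captures, which are just less readable.</li>
<li>After a successful match, the named captures are all stored in the hash <code class="codecolorer text default"><span class="text">%+</span></code>. They can be accessed conveniently with, for example, <code class="codecolorer text default"><span class="text">print $+{comment}</span></code>.</li>
<li>The x modifier is specified so that whitespace and line breaks can be added, giving the regex a visible structure. In fact, for a complex regex, not using x mode is very unwise.</li>
<li>A heredoc is used so that single and double quotes can be written conveniently inside the string. Personally I find it less convenient than Python's triple quotes.</li>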
</ul>
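<p>For readers without named captures, a numbered-capture variant might look like the sketch below. It is not the original program: the helper <code>strip_comment</code> and the sample lines are invented for illustration, and the alternation inside the quoted string is deliberately tightened (a backslash is only consumed together with the character it escapes) so that backtracking cannot end a string at an escaped quote when the line contains no real comment.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

# Same overall structure as the named-capture pattern, numbered groups only.
my $re = qr{
    ^
    (                                  # group 1: the code before the comment
        (?:
            [^'"/]                     # a plain character
          | (?&lt;!/) / (?!/)             # a lone slash
          | (['"])                     # group 2: the opening quote
            (?: \\ . | (?!\2) [^\\] )* # an escaped character, or a non-quote
            \2                         # the matching closing quote
        )*
    )
    (//.*)                             # group 3: the comment itself
    $
}xm;

# Hypothetical helper: return one line with its trailing comment removed.
sub strip_comment {
    my ($line) = @_;
    $line =~ s/$re/$1/;
    return $line;
}

my @samples = (
    q{var url = "http://example.com"; // trailing comment},
    q{var s = 'it\'s // not a comment';},
    q{x = a / b; // division, then a comment},
);
print strip_comment($_), "\n" for @samples;
</code></pre>
<p>The first and third lines should lose their comments (the space before the comment is kept), while the second line should come back unchanged.</p>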
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Summary</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
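As far as regular expressions are concerned, JavaScript is really too weak; admittedly, that also has to do with my own shallow JavaScript skills. Perl's support for regular expressions is powerful and wholehearted. The implementation above should cover the vast majority of comment cases. If you find a bug while testing, or run into an even nastier string, you are welcome to leave a comment.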
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/uncomment-program-with-regex.html/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Rethinking how to extract regular expressions from plain text</title>
		<link>http://iregex.org/blog/text-2-regular-expressions-again.html</link>
		<comments>http://iregex.org/blog/text-2-regular-expressions-again.html#comments</comments>
		<pubDate>Mon, 08 Mar 2010 18:32:19 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[array]]></category>
		<category><![CDATA[engine]]></category>
		<category><![CDATA[hash]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[recursive]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=79</guid>
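		<description><![CDATA[Note from rex: After finishing the previous article, I kept thinking about how to actually implement inducing a regular expression from plain text. I took many detours and learned a good deal along the way. For example, the complex data structures, anonymous hashes and arrays, and references in the Perl panther book... ]]></description>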
			<content:encoded><![CDATA[<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Note from rex:</h3>
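<p>After finishing the <a id="vt:b" title="A personal use case: from literal strings to regexes" href="http://iregex.org/blog/literal-text-to-regex.html"><span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">previous article</span></span></a><span class="Apple-style-span" style="font-style: normal;">, I kept thinking about how to actually implement inducing a regular expression from plain text. I took many detours and learned a good deal along the way: the complex data structures, anonymous hashes and arrays, and references covered in the Perl panther book; the construction of state machines in the purple dragon book; and graph theory from data structures. All of it turned out to be useful. I also learned to use </span><a id="mk58" title="graphviz" href="http://www.graphviz.org/"><span class="Apple-style-span" style="font-style: normal;">graphviz</span></a><span class="Apple-style-span" style="font-style: normal;">. It used to seem mysterious to me, but once I tried it I found it quite intuitive. The figures in this article were drawn with the </span><a id="a1la" title="online version of graphviz" href="http://graph.gafol.net/create"><span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">online version of graphviz</span></span></a><span class="Apple-style-span" style="font-style: normal;">.</span></p>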
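<p>Besides the graph-based implementation described in this article, I also implemented another, very simple approach based on keywords. The procedure is: read the text line by line and split each line on \s+ (or similar) into an array; intersect all the arrays (using a hash) to obtain the keywords that occur in every line; then, from left to right, build the rough regex ^(.*?)keyword1(.*?)keyword2&#8230;keywordN(.*?)$ and match it against each line, which yields a hash of hashes or an array of arrays; transpose that and write the entries of each column side by side to obtain a regex of the form ^(option1|option2&#8230;)keyword1(option..)&#8230;$. Finally, as a check, test the generated regex against every line.</p>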
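<p>Once this is done at the word level, split the text into individual letters and recursively process the <span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">(option1|option2&#8230;) parts. Working at the word level first and at the letter level second finds the largest repeated pieces first; the coarse pass and the fine pass follow the same idea and differ only in granularity.</span></span></p>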
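<p>The word-level pass of this keyword approach could be sketched roughly as follows. This is not the author's script: the sample lines are borrowed from the problem statement later in the article, the helper names <code>esc</code> and <code>alternation</code> are invented, and the sketch assumes every keyword occurs once per line and in the same order in all lines.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

my @lines = ('this is a red fox', 'this is a blue firefox',
             'this is a pig',     'a red fox');

# Escape regex metacharacters, but keep plain spaces readable.
sub esc {
    my ($text) = @_;
    (my $quoted = quotemeta $text) =~ s/\\ / /g;
    return $quoted;
}

# Turn the texts seen in one gap into (?:a|b|c), or (?:a|b)? if the gap
# was empty in at least one line.
sub alternation {
    my @options  = @_;
    my @nonempty = grep { length } @options;
    return '' unless @nonempty;
    my $group = '(?:' . join('|', map { esc($_) } @nonempty) . ')';
    return @nonempty &lt; @options ? "$group?" : $group;
}

# Step 1: count in how many lines each word occurs.
my %seen_in;
for my $line (@lines) {
    my %uniq = map { $_ =&gt; 1 } split /\s+/, $line;
    $seen_in{$_}++ for keys %uniq;
}

# Step 2: keywords are the words present in every line, in first-line order.
my %taken;
my @keywords = grep { $seen_in{$_} == @lines &amp;&amp; !$taken{$_}++ }
               split /\s+/, $lines[0];

# Step 3: the rough skeleton ^(.*?)keyword1(.*?)keyword2...(.*?)$
my $skeleton = '^(.*?)'
             . join('(.*?)', map('\b' . quotemeta($_) . '\b', @keywords))
             . '(.*?)$';

# Step 4: match every line and collect the distinct text found in each gap.
my @slots;
for my $line (@lines) {
    my @gaps = $line =~ /$skeleton/
        or die "skeleton does not match '$line'\n";
    for my $i (0 .. $#gaps) {
        next if $slots[$i]{seen}{ $gaps[$i] }++;
        push @{ $slots[$i]{order} }, $gaps[$i];
    }
}

# Step 5: interleave the gap alternations with the keywords.
my $regex = '^';
for my $i (0 .. $#slots) {
    $regex .= alternation(@{ $slots[$i]{order} });
    $regex .= esc($keywords[$i]) if $i &lt; @keywords;
}
$regex .= '$';

# Step 6: verify the generated regex against every line.
for my $line (@lines) {
    printf "%-4s %s\n", ($line =~ /$regex/ ? 'ok' : 'FAIL'), $line;
}
print "$regex\n";
</code></pre>
<p>For these four lines the only shared keyword is <code>a</code>, so the sketch should end up with a pattern equivalent to <code>^(?:this is )?a(?: red fox| blue firefox| pig)$</code>. The letter-level refinement described above is not attempted here.</p>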
<p><span class="Apple-style-span" style="color: #474747;"> Beginners, proceed with care; experts, please share your advice. My approach is not necessarily correct or optimal.</span></p></blockquote>
<p><span id="more-79"></span></p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">The problem</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>Given a text file text.txt with the following content:</div>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div>this is a red fox</div>
<div>this is a blue firefox</div>
<div>this is a pig</div>
<div>a red fox</div>
</blockquote>
<div>Write a program that, based on the text, automatically constructs a (reasonably sensible) regular expression that matches <strong>every line</strong> of the file.</div>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">The reference regex</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Two extreme solutions are unacceptable:</p>
<ol>
<li><span class="Apple-style-span" style="color: #ff00ff;">^.*$</span></li>
<li><span class="Apple-style-span" style="color: #ff00ff;">^(this is a red fox|this is a blue firefox|this is a pig|a red fox)$</span></li>
</ol>
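<p>The first is too broad and the second too narrow. Too broad, and it lets everything through, matching any text whatsoever; too narrow, and it is rigid and inflexible. A good regular expression is derived from the sample text (it extracts the pattern from the samples) and also goes beyond it (it matches other text that follows the same pattern). What to match and what to exclude should both follow definite rules; as the saying goes, "a gentleman knows what to do and what not to do" (I seem to have digressed :)).</p>
<div>So what would a reasonably reliable regular expression look like? For the example above, it could be:</div>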
<div><span class="Apple-style-span" style="color: #ff00ff;">^(this is )?a (red fox|blue firefox|pig)$</span></div>
<div>Now let us work toward this reference answer.</div>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Approach</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>Any complex circuit diagram can be decomposed into three simple relationships: series, parallel, and short circuit. The same holds for regular expressions.</div>
<p>Since a single regex has to match all of the text, this regex (call it <span class="Apple-style-span" style="color: #ff00ff;">$re</span>) must also match the first line.</p>
<p>The first line is this is a red fox, so the set of strings matched by <span class="Apple-style-span" style="color: #ff00ff;">^this is a red fox$</span> should be a (proper) subset of what <span class="Apple-style-span" style="color: #ff00ff;">$re</span> matches. Its path is <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a-&gt;red-&gt;fox-&gt;$</span>. All the nodes are connected in series, arranged from left to right.</p>
<p>The diagram is as follows (click to view at full size; the same applies below):</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001329.png" target="_blank"><img style="margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001329.png" border="0" alt="Photobucket" /></a></p>
<p>Likewise, the second line should also be a subset of <span class="Apple-style-span" style="color: #ff00ff;">$re</span>. However, since the path <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span> already exists, a branch appears at a: <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;blue-&gt;firefox-&gt;$</span>.</p>
<p>Adding this path to the diagram gives:</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001747.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001747.png" border="0" alt="Photobucket" /></a></p>
<p>Clearly, these two parallel branches start at a and end at $, and can be combined with |.</p>
<p>Let us summarize the rule:</p>
<div><span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;"><strong><span class="Apple-style-span"><span class="Apple-style-span" style="background-color: #6fa8dc;">Parallel</span></span></strong>: if A-&gt;B-&gt;C exists and A-&gt;D-&gt;C also exists, then B and D are in parallel. That is, the start point is the same, the end point is the same, and each branch has at least one node between the start point and the end point. Parallel branches are written inside parentheses, separated by |. <span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;">For example, A-&gt;B-&gt;C and A-&gt;D-&gt;C can be expressed as the regex A(B|D)C.</span></span></span></span></div>
<div><span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;">Why stress that there must be at least one node? I will keep you in suspense for now; please read on.</span></span></div>
<p>Next comes this is a pig. Likewise, we only need to add the branch <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;pig-&gt;$</span> to the existing diagram. The diagram now looks like this:</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309002851.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309002851.png" border="0" alt="Photobucket" /></a></p>
<div>
The last line is a red fox. It looks complicated, but it merely adds one new path between <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;a</span>; the existing path <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;red-&gt;fox-&gt;$</span> can be reused. The complete diagram is now:</div>
<p><a style="color: #ed1e24; margin-left: 0px; margin-right: 0px; text-decoration: none;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" border="0" alt="Photobucket" /></a></p>
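<p>Observe that a new situation has now appeared: the two paths <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;a</span> and <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span> exist at the same time. Thinking back to circuit diagrams from middle-school physics, we can call this a "short circuit": an unobstructed channel has been added between the nodes ^ and a of the line <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span>. It bypasses this and is, and therefore makes the path <span class="Apple-style-span" style="color: #ff00ff;">this-&gt;is</span> <strong>optional</strong>. Let us summarize the rule again:</p>
<p>If there is a path A-&gt;B-&gt;&#8230;-&gt;C-&gt;D and also a path A-&gt;D, we say that there is a short circuit between A and D. B-&gt;&#8230;-&gt;C can then be written as (B-&gt;&#8230;-&gt;C)? (the parentheses enclose the short-circuited part, and the question mark marks it as bypassed).</p>
<div>Between two vertices A and D there is at most one short circuit, but there may be one or more parallel branches.</div>
<div>With the analysis complete, we obtain this regex:</div>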
<div><span class="Apple-style-span" style="color: #ff00ff;">^(this is )?a (red fox|blue firefox|pig)$</span></div>
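<p>This is why the parallel rule above insists on at least one node between the two end points: a branch with no intermediate node is a short circuit, not a parallel branch.</p>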
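<p>As a quick sanity check, the short sketch below tests the derived regex, and a recursively refined variant of it, against the four sample lines. The hash name <code>%candidate</code> and its labels are only illustrative.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

my @lines = ('this is a red fox', 'this is a blue firefox',
             'this is a pig',     'a red fox');

my %candidate = (
    'short circuit + parallel' =&gt; qr/^(this is )?a (red fox|blue firefox|pig)$/,
    'refined recursively'      =&gt; qr/^(this is )?a ((red |blue fire)fox|pig)$/,
);

for my $name (sort keys %candidate) {
    # Collect the sample lines this candidate fails to match.
    my @missed = grep { $_ !~ $candidate{$name} } @lines;
    printf "%-26s %s\n", $name,
        @missed ? "misses: @missed" : 'matches all four lines';
}
</code></pre>
<p>Both candidates should report that they match all four lines. Note that they also accept combinations that never appeared in the samples, such as a pig, which is the kind of generalization the article asks for.</p>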
<div>
<p>If we want to refine the result further, we can <strong><span class="Apple-style-span" style="color: #ff00ff;">recursively</span></strong> apply the same analysis to the <span class="Apple-style-span" style="color: #ff00ff;">red fox|blue firefox|pig</span> part, which yields <span class="Apple-style-span" style="color: #ff00ff;"> (red |blue fire)fox|pig</span>.</p></blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Implementation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>With the approach settled, the programming is straightforward. In Perl one could certainly use a fairly concise hash to represent the chain of links between nodes:</div>
<div>For example:</div>
<div>my $hash;</div>
<div style="margin-left: 0px; margin-right: 0px;">$$hash{&quot;^&quot;}{&quot;this&quot;}{&quot;is&quot;}{&quot;a&quot;}{&quot;red&quot;}{&quot;fox&quot;}{&quot;\$&quot;}=&quot;&quot;;</div>
<div>$$hash{&quot;^&quot;}{&quot;this&quot;}{&quot;is&quot;}{&quot;a&quot;}{&quot;blue&quot;}{&quot;firefox&quot;}{&quot;\$&quot;}=&quot;&quot;;</div>
<p>&#8230;</p>
<div>However, adding, removing, and modifying nodes is all a hassle. (I was lost in the hash maze for a long time before I climbed out.)</div>
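<p>To make the nested-hash idea concrete, here is a minimal sketch that builds such a structure with a loop instead of spelling out every key. The names <code>%trie</code> and <code>dump_trie</code> are made up for this illustration, and the <code>//=</code> operator assumes Perl 5.10 or later.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

my @lines = ('this is a red fox', 'this is a blue firefox',
             'this is a pig',     'a red fox');

# Walk each line token by token, creating missing levels on the way down.
my %trie;
for my $line (@lines) {
    my $node = \%trie;
    $node = $node-&gt;{$_} //= {} for '^', split(/\s+/, $line), '$';
}

# Print the nested hash as an indented tree.
sub dump_trie {
    my ($node, $depth) = @_;
    for my $token (sort keys %$node) {
        print '  ' x $depth, $token, "\n";
        dump_trie($node-&gt;{$token}, $depth + 1);
    }
}
dump_trie(\%trie, 0);
</code></pre>
<p>The printed tree shows one reason this representation gets awkward: shared prefixes are merged, but the tail red, fox, $ is stored twice, once under this, is, a and once under the a that directly follows ^. Merging such shared tails is exactly the kind of node surgery that is painful to do on nested hashes.</p>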
<div>I took some time to brush up on <strong>directed graphs</strong>, and I think the problem can be simplified as follows.</div>
<p><a style="color: #ed1e24; margin-left: 0px; margin-right: 0px; text-decoration: none;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" target="_blank"><img style="margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" border="0" alt="Photobucket" /></a></p>
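<p>The figure above is in fact a directed graph. We only need to record the set of all vertices and the set of all edges, and then work out the relationships between the paths; printing the result gives the answer.</p>
<div>The vertex set is:</div>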
<div><span class="Apple-style-span" style="color: #ff00ff;">(^, this, is, a, red, fox, blue, firefox, pig, $);</span></div>
<div>The edge set is:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">(^-&gt;this, this-&gt;is,&#8230;)</span></div>
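<p>Both sets can be built in a single pass while reading the lines of the text file. That part is not complicated. The key is establishing the relationships.</p>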
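<p>Building the two sets in one pass might look like the following sketch. The variable names are illustrative, and the sample lines are held in an array rather than read from text.txt. It assumes, as the article does, that a word is the same vertex wherever it appears.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

my @lines = ('this is a red fox', 'this is a blue firefox',
             'this is a pig',     'a red fox');

# Vertices and edges are kept in order of first appearance;
# the two hashes only serve to skip duplicates.
my (@vertices, %is_vertex, @edges, %is_edge);
for my $line (@lines) {
    my @path = ('^', split(/\s+/, $line), '$');
    for my $i (0 .. $#path) {
        push @vertices, $path[$i] unless $is_vertex{ $path[$i] }++;
        next if $i == $#path;    # the last node has no outgoing edge
        my $edge = join '-&gt;', @path[ $i, $i + 1 ];
        push @edges, $edge unless $is_edge{$edge}++;
    }
}

print 'vertices: (', join(', ', @vertices), ")\n";
print 'edges:    (', join(', ', @edges),    ")\n";
</code></pre>
<p>For the four sample lines this should give ten vertices and twelve edges, the last edge added being the direct link from ^ to a.</p>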
<div>To summarize once more:</div>
<ul>
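<li>The N branches leaving a vertex A must eventually converge at some point M. (Sometimes they all meet at the same point and sometimes they do not; the example in this article is the simplest case, so we may assume they converge at a single point.)</li>
<li>The length of each of these N paths is measured by the number of nodes it passes through. In the figure above, for example, between ^ and a the upper path has length 2 and the lower path has length 0.</li>
<li>The shortest branch determines the relationship among the N branches.</li>
<li>Between any two points there can be at most one edge of length 0.</li>
<li>If an edge of length 0 exists, the other branches at the same level are short-circuited.</li>
<li>The remaining N-1 branches, whose length is not 0, are in parallel with one another.</li>
<li>The whole graph starts at ^ and ends at $.</li>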
</ul>
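<p>Since the actual program is omitted, the following is only one possible sketch of these rules, restricted to the simple case discussed here: an acyclic graph whose branches always rejoin at a node that every path passes through. The function <code>paths</code>, the notion of "checkpoints", and the way spaces are placed are all assumptions of this sketch, not the author's design.</p>
<pre><code class="perl">#!/usr/bin/perl
use strict;
use warnings;

my @lines = ('this is a red fox', 'this is a blue firefox',
             'this is a pig',     'a red fox');

# Directed graph as an adjacency list: node -&gt; { successor =&gt; 1 }.
my %next;
for my $line (@lines) {
    my @path = ('^', split(/\s+/, $line), '$');
    $next{ $path[$_] }{ $path[ $_ + 1 ] } = 1 for 0 .. $#path - 1;
}

# All paths from $from to $to, each an array ref that includes both ends.
# Assumes the graph has no cycles.
sub paths {
    my ($from, $to) = @_;
    return [$to] if $from eq $to;
    return map { [ $from, @$_ ] }
           map { paths($_, $to) } sort keys %{ $next{$from} || {} };
}

# Checkpoints: the nodes that lie on every path from ^ to $.
my @all = paths('^', '$');
my %on;
for my $path (@all) { $on{$_}++ for @$path }
my @checkpoints = grep { $on{$_} == @all } @{ $all[0] };

my $regex = '^';
for my $i (0 .. $#checkpoints - 1) {
    my ($from, $to) = @checkpoints[ $i, $i + 1 ];
    my $first = $i == 0;       # this segment starts at ^
    my $last  = $to eq '$';    # this segment ends at $

    # Interior words of each branch; an empty string is a length-0 branch.
    my @branches;
    for my $path (paths($from, $to)) {
        my @inner = @$path[ 1 .. $#$path - 1 ];
        push @branches, join ' ', map { quotemeta($_) } @inner;
    }

    my @nonempty = grep { length } @branches;
    if (@nonempty) {
        # Words before the first checkpoint carry a trailing space,
        # words after it carry a leading space.
        my @alts  = map { $first ? ($last ? $_ : "$_ ") : " $_" } @nonempty;
        my $group = '(?:' . join('|', @alts) . ')';       # parallel branches
        $group .= '?' if @nonempty &lt; @branches;           # short circuit
        $regex .= $group;
    }
    $regex .= ($first ? '' : ' ') . quotemeta($to) unless $last;
}
$regex .= '$';

print "$regex\n";
printf "%-4s %s\n", (/$regex/ ? 'ok' : 'FAIL'), $_ for @lines;
</code></pre>
<p>For the sample lines the checkpoints should be ^, a and $, and the printed pattern should be equivalent to the reference regex, with the alternatives in sorted order: <code>^(?:this is )?a(?: blue firefox| pig| red fox)$</code>. Graphs whose branches rejoin at different points would need a more careful analysis than this.</p>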
<div>Each of these conditions and checks can be turned into a function. The actual program is omitted here.</div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/text-2-regular-expressions-again.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A personal use case: from literal strings to regexes</title>
		<link>http://iregex.org/blog/literal-text-to-regex.html</link>
		<comments>http://iregex.org/blog/literal-text-to-regex.html#comments</comments>
		<pubDate>Wed, 10 Feb 2010 08:50:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=78</guid>
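		<description><![CDATA[Recently at work I needed to convert a certain kind of literal string into simple regexes. Doing it by hand is certainly possible, but a large amount of repetitive work is naturally better left to a machine. Last night I wrote a script for this and am posting it here. Because it is a script for my own work... ]]></description>
			<content:encoded><![CDATA[<p>Recently at work I needed to convert a certain kind of literal string into simple regexes. Doing it by hand is certainly possible, but a large amount of repetitive work is naturally better left to a machine. Last night I wrote a script for this and am posting it here. Because the script handles my own work, it is posted here only as a record and archive, and may be of little practical use to anyone else. That said, converting existing literal strings into regexes is a nice problem in its own right, and interested readers may want to dig deeper.</p>
<p>It is worth mentioning that the code uses <font color="#FF00FF">$&#038;, (?{})</font>, which are <font color="#FF00FF">perl only</font> features; they clarified the logic and simplified the code. Without them, the code would be <strong>five times longer</strong>. Also, as far as efficiency goes, it is said that after <font color="#FF00FF">use English</font>, using <font color="#FF00FF">$MATCH</font> is <strong>five times faster</strong> than using <font color="#FF00FF">$&#038;</font> directly. But for a command-line program that is typed and run on the spot, <font color="#FF00FF">$&#038;</font> is good enough.</p>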
<p><span id="more-78"></span></p>
<p>A real-world example:</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #c20cb9; font-weight: bold;">perl</span> hash2re.pl H:aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA0.zip<span style="color: #000000; font-weight: bold;">/</span>H:aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span><span style="color: #000000; font-weight: bold;">/</span>aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span><span style="color: #000000; font-weight: bold;">/</span>aaa<span style="color: #000000; font-weight: bold;">/</span>Aaaaa<span style="color: #000000; font-weight: bold;">/</span>aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span>.exe<br />
RE <span style="color: #000000;">1</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>\.zip$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA0.zip&quot;</span><br />
<br />
RE <span style="color: #000000;">2</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0&quot;</span><br />
<br />
RE <span style="color: #000000;">3</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0&quot;</span><br />
<br />
RE <span style="color: #000000;">4</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa&quot;</span><br />
<br />
RE <span style="color: #000000;">5</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">4</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;Aaaaa&quot;</span><br />
<br />
RE <span style="color: #000000;">6</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>\.exe$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0.exe&quot;</span></div></td></tr></tbody></table></div>
<p>Source code:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<br />
<span style="color: #666666; font-style: italic;"># &nbsp; by rex zhang </span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; Feb 09 2010 in Shanghai</span><br />
<br />
<span style="color: #666666; font-style: italic;"># &nbsp; usage: split and regexize hashed filename</span><br />
<span style="color: #666666; font-style: italic;">#</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$lines</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$ARGV</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$lines</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #666666; font-style: italic;">#(C:[^/]+)#)</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$c</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
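&nbsp; &nbsp; <span style="color: #0000ff;">$lines</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\Q$c\E//</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;"># \Q...\E: match the path literally, so the loop always makes progress</span><br />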
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;ClearText Filename Ignored:<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\&quot;</span>$c<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@array</span><span style="color: #339933;">=</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">m</span><span style="color: #339933;">!</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>\<span style="color: #339933;">/|</span>H<span style="color: #339933;">:</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*!,</span> <span style="color: #0000ff;">$lines</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$line</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@array</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">next</span> <span style="color: #b1b100;">if</span> <span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$re</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">local</span> <span style="color: #0000ff;">$len</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<br />
<br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(?=[.\[\]()])/\\/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\?/./g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/0+(?{ $len=length($&amp;)})/[0-9]\{$len\}/g</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/A+(?{ $len=length($&amp;)})/[A-Z]\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/a+(?{ $len=length($&amp;)})/[a-z]\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(.)\1+(?{ $len=length($&amp;)})/$1\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\{1\}//g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=</span> &nbsp;<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\^</span>$re<span style="color: #000099; font-weight: bold;">\$</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">++;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$line</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/$re/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;RE $counter:<span style="color: #000099; font-weight: bold;">\t</span>$re<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>Matches: <span style="color: #000099; font-weight: bold;">\&quot;</span>$line<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;RE $counter:<span style="color: #000099; font-weight: bold;">\t</span>$re<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>Failed: <span style="color: #000099; font-weight: bold;">\&quot;</span>$line<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
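<p>As an alternative sketch of the same idea, the mask can be converted in a single left-to-right pass, so that text produced by one substitution is never rescanned by a later one. This is not the script above: mask_to_regex and %class are hypothetical names, and quotemeta is swapped in for the hand-written escaping step, so literal characters such as the hyphen come out backslash-escaped.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mask characters that stand for a character class (same convention as the
# script above: 0 = digit, A = uppercase, a = lowercase, ? = any character).
my %class = ('0' => '[0-9]', 'A' => '[A-Z]', 'a' => '[a-z]', '?' => '.');

# Hypothetical helper: turn a filename mask into an anchored pattern string.
sub mask_to_regex {
    my ($mask) = @_;
    my $body = '';
    # Consume the mask one run of identical characters at a time.
    while ($mask =~ /\G((.)\2*)/gs) {
        my ($run, $char) = ($1, $2);
        my $atom = exists $class{$char} ? $class{$char} : quotemeta $char;
        my $n    = length $run;
        $body .= $n > 1 ? $atom . '{' . $n . '}' : $atom;
    }
    return '^' . $body . '$';
}

my $mask = 'aaa-Aaaa-AAA-0.exe';
my $re   = mask_to_regex($mask);
print "RE:\t$re\n";
print "\t", ($mask =~ /$re/ ? 'Matches' : 'Failed'), ": \"$mask\"\n";
```

<p>Building the pattern in one pass means a repeat count such as {11} is never mistaken for a run of identical characters, which could happen when several substitutions are applied one after another.</p>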
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/literal-text-to-regex.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Two Ways to Count Duplicate Lines of Text</title>
		<link>http://iregex.org/blog/get-duplicated-lines.html</link>
		<comments>http://iregex.org/blog/get-duplicated-lines.html#comments</comments>
		<pubDate>Sat, 06 Feb 2010 07:09:43 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=77</guid>
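		<description><![CDATA[Suppose the sample file a.txt contains the following: hello world! hello world! I love regex. hello world! I love regex. hello world! A quick look shows that hello world! appears on 4 lines and I love regex. on 2. How can we use regular expressions to write a program that counts... ]]></description>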
			<content:encoded><![CDATA[<p>Suppose the sample file <font color="#FF00FF">a.txt</font> contains the following:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">hello world!<br />
hello world!<br />
I love regex.<br />
hello world!<br />
I love regex.<br />
hello world!</div></td></tr></tbody></table></div>
<p>A quick look shows that <font color="#FF00FF">hello world!</font> appears on 4 lines and <font color="#FF00FF">I love regex.</font> on 2. How can we write a program that uses regular expressions to collect these counts? Real-world files, after all, are never small enough to tally by eye. I came up with two approaches: the first relies on regular expressions alone (otherwise this post would not be on this blog); in the second, a hash table plays the lead and the regular expression plays a supporting role.<span id="more-77"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The regular-expression solution</h3>
<p>The idea: take a line of text; if an identical line appears anywhere after it (zero or more lines later, up to EOF), record its content and count how many times it occurs; then delete every copy of that line and move on to the next one, printing each count along the way.</p>
<p>Below is the corresponding Perl program, with comments.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl </span><br />
<span style="color: #666666; font-style: italic;">#usage: &nbsp;./dup_re.pl &lt;a.txt</span><br />
<br />
<span style="color: #000066;">undef</span> <span style="color: #0000ff;">$/</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># enable &quot;slurp&quot; mode</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$file</span> <span style="color: #339933;">=</span> <span style="color: #009999;">&lt;STDIN&gt;</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;"># whole file now here</span><br />
<br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$file</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><br />
&nbsp; &nbsp; <span style="color: #339933;">/</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#for each line;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#skip leading whitespace;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">\S</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#capture the line content as group 1; \S skips empty lines;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; \<span style="color: #000066;">s</span><span style="color: #339933;">*</span>$ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#skip trailing whitespace up to the end of the line;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">.*?</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#then, after 0 or more lines,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span>\<span style="color: #cc66cc;">1</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span>$ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#the same content as group 1 appears again;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">/</span>smx<span style="color: #009900;">&#41;</span> <br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$count</span><span style="color: #339933;">=</span> <span style="color: #0000ff;">$file</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\Q$line\E//g</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#delete the duplicated lines; \Q...\E matches $line literally</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save the number to $count;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$count</span><span style="color: #339933;">,</span><span style="color: #ff0000;">&quot; times:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$line</span><span style="color: #339933;">,</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The hash-table solution</h3>
<p>This approach benefits from Perl's powerful built-in hashes. The idea is as follows:</p>
<ul>
<li>Create an empty hash;</li>
<li>Read the file line by line;</li>
<li>Use the text of each line as a key. On its first appearance the value becomes 1; on every later appearance the value is incremented.</li>
<li>Print the hash entries whose value is &gt;= 2.</li>
</ul>
<p>The Perl program:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<span style="color: #666666; font-style: italic;">#usage: &nbsp;./dup_hash.pl a.txt</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%hash</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">&lt;&gt;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp;<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/^\s*(\S.*?)\s*$/</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#ignore whitespaces at both ends; </span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#123;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines by using \S</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">++;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save the line to $1, and count the time it appears</span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#sort the hash by values; </span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$key</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">sort</span> <span style="color: #009900;">&#123;</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$b</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">&lt;=&gt;</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#125;</span> <span style="color: #009900;">&#125;</span> <span style="color: #000066;">keys</span> <span style="color: #0000ff;">%hash</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$key</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#only print the lines that duplicates;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#for all results, just remove the 'if' line</span><br />
&nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #000066;">printf</span> <span style="color: #ff0000;">&quot;%d times:<span style="color: #000099; font-weight: bold;">\t</span>%s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$key</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$key</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
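<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Results</h3>
<p>Save the two programs above as dup_re.pl and dup_hash.pl. They read their input differently (dup_re.pl reads the file from standard input, while dup_hash.pl takes the file name as an argument), so they are also invoked differently, as the screenshot below shows:<br />
<img src="http://public.bay.livefilestore.com/y1p84mh-sSb8s2jIOokB1tAnVJQnNdmS1ir1v9A0nRbWPPZ6AdIQV896FPpKr_LNzQvJ6kJQ-Ue94wHK8LVscG8uQ/20100206_144726.png" alt="我爱正则表达式 | Two ways to count duplicate lines of text" /></p>
<h4>Update</h4>
<p>It just occurred to me that the script would be more useful if it ignored case and runs of whitespace between words, so that <font color="#FF00FF">Hello world!</font> and <font color="#FF00FF">      　　hello　　       WORLd!   </font> are treated as duplicate lines. I tested this, and regular expressions did not let me down.</p>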
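<p>A minimal sketch of that idea, built on the hash-table version: each line is reduced to a canonical key before it is counted. normalize_line and count_normalized are hypothetical names, and folding with lc and collapsing every whitespace run to a single space is an assumption about what "ignoring" should mean here. Full-width spaces only count as whitespace once the input has been decoded to characters, which the sketch assumes.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a line to a canonical key.
sub normalize_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim whitespace at both ends
    $line =~ s/\s+/ /g;         # collapse inner whitespace runs to one space
    return lc $line;            # fold case
}

# Hypothetical helper: count lines by their normalized key.
sub count_normalized {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        my $key = normalize_line($line);
        next if $key eq '';     # skip blank lines
        $count{$key}++;
    }
    return %count;
}

# \x{3000} is the full-width (ideographic) space.
my @lines = ("Hello world!", "   hello \x{3000}  WORLd!  ", "I love regex.");
my %count = count_normalized(@lines);
for my $key (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%d times:\t%s\n", $count{$key}, $key if $count{$key} >= 2;
}
```

<p>Normalizing first keeps the counting logic unchanged; the price is that the printed key is the normalized form rather than any of the original spellings.</p>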
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/get-duplicated-lines.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>[Translation] Recursive Regular Expressions</title>
		<link>http://iregex.org/blog/recursive-regular-expressions.html</link>
		<comments>http://iregex.org/blog/recursive-regular-expressions.html#comments</comments>
		<pubDate>Wed, 16 Dec 2009 15:06:47 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Translation]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[recursive]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=73</guid>
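		<description><![CDATA[The original is here. Translated by rex on December 15-17, 2009, using Google Docs in Prism on Firefox on Ubuntu 9.10, a very pleasant experience. Thanks to 余晟 for his careful guidance on both regular expressions and translation. The regular expressions we use every day are not really that "... ]]></description>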
			<content:encoded><![CDATA[<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>The original is <a title="here" href="http://www.catonmat.net/blog/recursive-regular-expressions/" id="eklo">here</a>. Translated by <a href="http://iregex.org">rex</a> on December 15-17, 2009, using Google Docs in Prism on Firefox on Ubuntu 9.10, a very pleasant experience. Thanks to <a href="http://www.luanxiang.org/blog/">余晟</a> for his careful guidance on both regular expressions and translation.</p></blockquote>
<p><a href="http://iregex.org/blog/recursive-regular-expressions.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/yo-dawg-regex.jpg" alt="Recursive regular expressions" border="0"></a></p>
<p>The regular expressions we use every day are not really that "regular". The extended regular expressions supported by most programming languages are far more powerful than the "<a title="regular" href="http://en.wikipedia.org/wiki/Regular_expression" id="gkvv">regular</a>" regular expressions defined in <a title="formal language theory" href="http://en.wikipedia.org/wiki/Formal_language" id="vcy6">formal language theory</a>.</p>
<p><span id="more-73"></span></p>
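<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>rex's note: this passage shows that if <b>正则表达式</b> (the usual Chinese term for "regular expression") were instead rendered as <b>正规表达式</b>, the wordplay on "regular" would carry over into Chinese intact:<br />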
平时我们用到的正规表达式，其实没那么“正规”。多数编程语言所支持的扩展的正规表达式，其运算能力比起正式语言理论所定义的“正规”正规表达式要强得多。</p>
<p>The Japanese term for regular expression is “正规表现”, and I believe 鸟哥 (VBird)'s book also calls it 正规表达式. <a title="via" href="http://fanfou.alwaysdata.net/status&amp;id=5144919240" id="bg:v">via</a></p>
</blockquote>
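<p>For example, the familiar <a title="capture buffers" href="http://perldoc.perl.org/perlre.html#Capture-buffers" id="emj9">capture buffers</a> temporarily store the text matched by any part of a pattern so that it can be reused later. Likewise, "<a title="look-around assertions" href="http://perldoc.perl.org/perlre.html#Look-Around-Assertions" id="xsz-">look-around assertions</a>" let the regex engine peek at the surrounding text before it commits to a decision. These extensions make regular expressions powerful enough to describe some "<a title="context-free grammars" href="http://en.wikipedia.org/wiki/Context-free_grammar" id="acrs">context-free grammars</a>".</p>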
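<p>To make the two features concrete, here is a small sketch. The patterns and the helper names find_doubled_word and amount_before_unit are made up for illustration; only the \1 backreference and the (?=...) lookahead syntax come from Perl itself.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Backreference: \1 must match the same text that group 1 just captured.
sub find_doubled_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

# Lookahead: (?=...) checks what follows without consuming it.
sub amount_before_unit {
    my ($text) = @_;
    return $text =~ /(\d+)(?=\s*USD\b)/ ? $1 : undef;
}

print find_doubled_word("this is is a test"), "\n";       # the repeated word
print amount_before_unit("pay 42 USD or 40 EUR"), "\n";   # digits followed by USD
```

<p>Neither construct can be expressed with the concatenation, alternation and repetition of a formal regular expression alone when the captured text is unbounded, which is the sense in which these extensions go beyond "regular".</p>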
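<p>Perl's regular expression engine is exceptionally feature-rich. One of its features is the <strong>lazy regular subexpression</strong> (the perlre documentation calls it a "postponed" regular subexpression), written (??{code}), where "code" can be any piece of Perl code. The code runs at the moment the subexpression may match, and its result is used as the pattern to match at that point.</p>
<p>We can use this feature to build something very interesting: embed the regular expression itself in its own "code" part, producing <b>a recursive regular expression</b>!</p>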
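<p>As a warm-up before the recursive case, here is a non-recursive sketch of what "code" can do: it builds a pattern from text captured earlier in the same match. The length-prefixed format (a decimal count, a colon, then exactly that many characters) and the name is_counted_string are assumptions made up for this example.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# (??{ ... }) runs during matching; the string it returns is compiled and
# matched at that point. Here it becomes ".{N}", with N taken from group 1.
my $counted = qr/\A(\d+):(??{ '.{' . $1 . '}' })\z/s;

# Hypothetical helper: true if the payload length equals the declared count.
sub is_counted_string {
    my ($s) = @_;
    return $s =~ $counted ? 1 : 0;
}

for my $s ('5:hello', '5:hell', '3:abc', '3:abcd') {
    printf "%-8s %s\n", $s, is_counted_string($s) ? 'matches' : 'does not match';
}
```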
<p>Regular expressions in the formal sense have never been able to match strings of the form 0<sup>n</sup>1<sup>n</sup>, that is, some number of 0s followed by the same number of 1s. With a lazy regular subexpression, this classic problem is easily solved.</p>
<p>Here is a Perl regular expression that matches 0<sup>n</sup>1<sup>n</sup> strings.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=</span> <span style="color: #009966; font-style: italic;">qr/0(??{$regex})?1/</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
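<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>rex's note: Chapter 4 of the Purple Dragon Book says: "Grammars are a more powerful notation than regular expressions. Every construct that can be described by a regular expression can be described by a grammar, but not vice versa. Alternatively, every regular language is a context-free language, but not vice versa." The regular expressions the book refers to are presumably the classical ones, not the extended regular expressions of modern languages. In general, regular expressions are used to build small components, while grammars are used to build the framework of a language.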
</p></blockquote>
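<p>This regular expression matches a character 0, followed by the regular expression itself zero or one times, followed by a character 1. If the "itself" part does not match, the whole expression can only match 01. If it does match, the expression matches 00($regex)?11, which yields 0011 when the inner part does not match again and 000($regex)?111 when it does, and so on.</p>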
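<p>The unfolding described above can be checked with a short sketch. The helper name is_0n1n and the test strings are made up for illustration; the variable is declared before it is assigned, which is one way to make it visible inside the code block under use strict.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $regex;    # declared first so the code block below can refer to it
$regex = qr/0(??{ $regex })?1/;

# Hypothetical helper: true if the whole string has the form 0^n 1^n, n >= 1.
sub is_0n1n {
    my ($s) = @_;
    return $s =~ /\A$regex\z/ ? 1 : 0;
}

for my $s (qw(01 0011 000111 001 0101 10)) {
    printf "%-8s %s\n", $s, is_0n1n($s) ? 'matches' : 'does not match';
}
```

<p>Anchoring with \A and \z matters here: without the anchors, a string such as 001 would still report a match, because the pattern would find the 01 inside it.</p>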
<p>Here is a Perl program that matches 0<sup>50000</sup>1<sup>50000</sup>:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<br />
<span style="color: #0000ff;">$str</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;0&quot;</span>x50000 <span style="color: #339933;">.</span> <span style="color: #ff0000;">&quot;1&quot;</span>x50000<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=</span> <span style="color: #009966; font-style: italic;">qr/0(??{$regex})?1/</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$str</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/^$regex$/</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;yes, it matches&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;no, it doesn't match&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Now look at the Yo Dawg regular expression in the picture at the top. Can you guess what it does? The answer: it matches fully parenthesized expressions such as (foo(bar())baz), or balanced parentheses such as ((()()())()).</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=</span> <span style="color: #000066;">qr</span><span style="color: #339933;">/</span><br />
&nbsp; \<span style="color: #009900;">&#40;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># (1) match an open paren (</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#40;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;"># (2) followed by</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span> &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; (3) one or more non-paren characters</span><br />
&nbsp; &nbsp; <span style="color: #339933;">|</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># (4) OR</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #339933;">??</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$regex</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span> &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; (5) the regex itself</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#41;</span><span style="color: #339933;">*</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;"># (6) repeated zero or more times</span><br />
&nbsp; \<span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># (7) followed by a close paren )</span><br />
<span style="color: #339933;">/</span>x<span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
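<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>rex's note: for the meaning of the Yo Dawg image, see <a title="here" href="http://yoyodawgdawg.com/about" id="az.x">here</a>. These are image captions that all follow the pattern “Yo dawg, I herd you like X, so we put a Y in your Y so you can Z while you Z”.</p></blockquote>
<p>The idea behind this regular expression is as follows. A fully nested parenthesized expression begins with an open paren. That is the easiest step, so we write it down directly ((1) in the code above). Likewise, it ends with a close paren, which gives (7). Now for the part that takes some thought: what goes between the parens? It can be a run of characters that are neither open nor close parens (point (3)), <b>or</b> another fully nested expression (point (5)). This whole group may match zero times (point (6)), which yields the smallest fully nested expression, (), or it may repeat many times to match more complex expressions.</p>
<p>Dropping the /x option (that is, giving up the multi-line, commented layout), the regex can be written compactly as:</p>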
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=</span> <span style="color: #009966; font-style: italic;">qr/\(([^()]+|(??{$regex}))*\)/</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
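<p>To see how the pattern behaves, here is a small self-contained sketch. The helper name is_balanced, the sample strings, and the \A ... \z anchors are assumptions added for illustration; they are not part of the original article. $regex is declared with our before it is assigned so that the pattern can refer to itself by name.</p>

```perl
use strict;
use warnings;

# Declare the variable first so the pattern can refer to itself.
our $regex;
$regex = qr/\(([^()]+|(??{$regex}))*\)/;

# Hypothetical helper: true only if the WHOLE string is one balanced group.
sub is_balanced {
    my ($text) = @_;
    return $text =~ /\A$regex\z/ ? 1 : 0;
}

for my $sample ('()', '(a(b)c)', '((()))', '(a(b)', ')(') {
    printf "%-10s %s\n", $sample,
        is_balanced($sample) ? 'balanced' : 'not balanced';
}
```

<p>Without the anchors the regex would simply find the first balanced group somewhere inside a longer string, which is what the original one-liner does.</p>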
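<p>However, do not use this feature in production code: it is quirky and hard to get right. Use the stable, mature <a href="http://search.cpan.org/dist/Text-Balanced/">Text::Balanced</a> or <a href="http://search.cpan.org/dist/Regexp-Common/">Regexp::Common</a> modules instead.</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>rex's note: regarding (??{code}), the <a href="http://perldoc.perl.org/perlre.html#%28??{-code-}%29">official Perl documentation</a> warns that this construct is experimental and may be changed without notice. Code with side effects may not behave identically from version to version, because of future optimisations in the regex engine.</p></blockquote>
<p>Finally, note that from Perl 5.10 on you can use <a title="recursive capture buffers" href="http://perldoc.perl.org/perlre.html#%28?PARNO%29-%28?-PARNO%29-%28?+PARNO%29-%28?R%29-%28?0%29" id="vchb">recursive capture buffers</a> in place of the postponed regular subexpression (??{ }), with the same result.</p>
<p>Here is a regex for 0<sup>n</sup>1<sup>n</sup> written with the recursive capture buffer syntax (?N):</p>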
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$rx</span> <span style="color: #339933;">=</span> <span style="color: #009966; font-style: italic;">qr/(0(?1)?1)/</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>(?1)? means “match the first group zero or one time”, where the first group is the whole regular expression. (With * in place of ?, the pattern would also accept strings such as 001011, which are not of the form 0<sup>n</sup>1<sup>n</sup>.)</p>
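<p>The effect of the quantifier after (?1) is easiest to see by running two anchored variants side by side. The following is only a sketch: the variable names, the helper subs, the sample strings, and the \A ... \z anchors are assumptions added here, and Perl 5.10 or later is required for (?1).</p>

```perl
use strict;
use warnings;

# Anchored so that the whole string has to match (the anchors are an addition).
my $exact = qr/\A(0(?1)?1)\z/;   # at most one nested copy: 0^n 1^n only
my $loose = qr/\A(0(?1)*1)\z/;   # several nested copies may sit side by side

# Hypothetical helpers for illustration.
sub is_exact { my ($s) = @_; return $s =~ $exact ? 1 : 0 }
sub is_loose { my ($s) = @_; return $s =~ $loose ? 1 : 0 }

for my $sample ('01', '0011', '000111', '001011', '011', '10') {
    printf "%-8s exact=%d loose=%d\n",
        $sample, is_exact($sample), is_loose($sample);
}
```

<p>Only 001011 is treated differently by the two patterns: it is accepted with * and rejected with ?.</p>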
<p>As an exercise, rewrite the balanced-parentheses regex in this style yourself.</p>
<p>Have fun!</p>
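<p>For readers who want to check their answer, here is one possible solution, offered as a sketch rather than the author's own. The variable and helper names and the anchors are assumptions; the pattern wraps the whole expression in capture group 1 and recurses into it with (?1), so it needs Perl 5.10 or later.</p>

```perl
use strict;
use warnings;

# Group 1 is the whole parenthesized expression; (?1) recurses into it.
my $balanced = qr/(\((?:[^()]+|(?1))*\))/;

# Hypothetical helper: true only if the whole string is one balanced group.
sub is_balanced_510 {
    my ($text) = @_;
    return $text =~ /\A$balanced\z/ ? 1 : 0;
}

for my $sample ('()', '(a(b)c)', '((()))', '(a(b)', ')(') {
    printf "%-10s %s\n", $sample,
        is_balanced_510($sample) ? 'balanced' : 'not balanced';
}
```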
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/recursive-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Converting Numbers to US Dollar Amounts in English</title>
		<link>http://iregex.org/blog/convert-digits-to-english-value.html</link>
		<comments>http://iregex.org/blog/convert-digits-to-english-value.html#comments</comments>
		<pubDate>Sun, 15 Feb 2009 05:09:08 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[杂项]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[美元]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=57</guid>
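		<description><![CDATA[This program converts a number into a US dollar amount written out in English. For example, the input ./num2eng.pl 1,100,834.10 produces the output: Total: Say US Dollars One Million One Hundred Thousand Eight Hundred and Thirty-Four and Ten Cents Only. Notes: The integer part may use... ]]></description>
			<content:encoded><![CDATA[<p>This program converts a number into a US dollar amount written out in English. For example, given the input</p>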
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">.<span style="color: #000000; font-weight: bold;">/</span>num2eng.pl <span style="color: #000000;">1</span>,<span style="color: #000000;">100</span>,<span style="color: #000000;">834.10</span></div></td></tr></tbody></table></div>
<p>it prints:</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Total: Say US Dollars One Million One Hundred Thousand Eight Hundred and Thirty-Four and Ten Cents Only.</div></td></tr></tbody></table></div>
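<p>Notes:</p>
<ol>
<li>The integer part may be grouped with half-width commas, spaces, single quotes, underscores, or hyphens.</li>
<li>Separators may appear at any position (every 3 digits or every 4 digits both work) and may be mixed freely.</li>
<li>If you use single quotes, wrap the whole argument in double quotes so that the shell does not interpret them.
</li>
</ol>
<p>Full program:<br />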
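<p>The separator rules above can be sketched in a few lines. This is an illustrative assumption, not code taken from the program below: the helper name split_amount and its return convention are hypothetical. It splits the argument at the first decimal point and deletes every allowed separator from the integer part.</p>

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Returns (integer digits, cents digits) for an amount such as "1,100,834.10".
sub split_amount {
    my ($input) = @_;
    my ($integer, $cents) = split /\./, $input, 2;
    $integer = '' unless defined $integer;
    $cents   = '' unless defined $cents;
    # Delete comma, space, single quote, underscore and hyphen, wherever they occur.
    $integer =~ tr/, '_-//d;
    return ($integer, $cents);
}

for my $arg ('1,100,834.10', "1'100_834", '110-0834.5') {
    my ($integer, $cents) = split_amount($arg);
    print "$arg => integer=$integer cents=$cents\n";
}
```

<p>Because the separators are simply deleted, grouping by 3 digits, by 4 digits, or with mixed separators all reduce to the same digit string.</p>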
<span id="more-57"></span></p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br />48<br />49<br />50<br />51<br />52<br />53<br />54<br />55<br />56<br />57<br />58<br />59<br />60<br />61<br />62<br />63<br />64<br />65<br />66<br />67<br />68<br />69<br />70<br />71<br />72<br />73<br />74<br />75<br />76<br />77<br />78<br />79<br />80<br />81<br />82<br />83<br />84<br />85<br />86<br />87<br />88<br />89<br />90<br />91<br />92<br />93<br />94<br />95<br />96<br />97<br />98<br />99<br />100<br />101<br />102<br />103<br />104<br />105<br />106<br />107<br />108<br />109<br />110<br />111<br />112<br />113<br />114<br />115<br />116<br />117<br />118<br />119<br />120<br />121<br />122<br />123<br />124<br />125<br />126<br />127<br />128<br />129<br />130<br />131<br />132<br />133<br />134<br />135<br />136<br />137<br />138<br />139<br />140<br />141<br />142<br />143<br />144<br />145<br />146<br />147<br />148<br />149<br />150<br />151<br />152<br />153<br />154<br />155<br />156<br />157<br />158<br />159<br />160<br />161<br />162<br />163<br />164<br />165<br />166<br />167<br />168<br />169<br />170<br />171<br />172<br />173<br />174<br />175<br />176<br />177<br />178<br />179<br />180<br />181<br />182<br />183<br />184<br />185<br />186<br />187<br />188<br />189<br />190<br />191<br />192<br />193<br />194<br 
/>195<br />196<br />197<br />198<br />199<br />200<br />201<br />202<br />203<br />204<br />205<br />206<br />207<br />208<br />209<br />210<br />211<br />212<br />213<br />214<br />215<br />216<br />217<br />218<br />219<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<span style="color: #666666; font-style: italic;">#</span><br />
<span style="color: #666666; font-style: italic;">#rex[at]zhasm[dot]com</span><br />
<span style="color: #666666; font-style: italic;">#13 Feb 2009 on perl 5.10 and Ubuntu 8.10</span><br />
<br />
<span style="color: #0000ff;">@single</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;One&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Two&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Three&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Four&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Five&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Six&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Seven&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Eight&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Nine&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Ten&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Eleven&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Twelve&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Thirteen&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Fourteen&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Fifteen&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Sixteen&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Seventeen&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Eighteen&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Nineteen&quot;</span><br />
<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">@tens</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Ten&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Twenty&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Thirty&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Forty&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Fifty&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Sixty&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Seventy&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Eighty&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Ninety&quot;</span><span style="color: #339933;">,</span><br />
<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">@scale</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Thousand&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Million&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Billion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Trillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Quadrillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Quintillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Sextillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Septillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Octillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Nonillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Decillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Undecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Duodecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Tredecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Quattuordecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Quindecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Sexdecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Septendecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Octodecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Novemdecillion&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;Vigintillion&quot;</span><br />
<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> dot<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">length</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#$dot=1*$dot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">20</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$single</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$ten</span><span style="color: #339933;">=</span><span style="color: #000066;">int</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">/</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$extra</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">-</span><span style="color: #0000ff;">$ten</span><span style="color: #339933;">*</span><span style="color: #cc66cc;">10</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;$tens[$ten] $single[$extra]&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">length</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;$tens[$dot]&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> debug<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> hundred<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$number</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$result</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$number</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">999</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; debug<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;too large&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$c</span><span style="color: #339933;">=</span><span style="color: #000066;">int</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$number</span><span style="color: #339933;">/</span><span style="color: #cc66cc;">100</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$b</span><span style="color: #339933;">=</span><span style="color: #000066;">int</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$number</span><span style="color: #339933;">-</span><span style="color: #0000ff;">$c</span><span style="color: #339933;">*</span><span style="color: #cc66cc;">100</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$a</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$number</span> <span style="color: #339933;">%</span> <span style="color: #cc66cc;">10</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$cc</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$bb</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$aa</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$c</span><span style="color: #009900;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$cc</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;$single[$c] Hundred&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$b</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$bb</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;$tens[$b]-$single[$a]&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$b</span><span style="color: #339933;">&lt;=</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$bb</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$single</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">$b</span><span style="color: #339933;">*</span><span style="color: #cc66cc;">10</span><span style="color: #339933;">+</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$b</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$bb</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;$tens[$b]&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$b</span><span style="color: #339933;">&lt;=</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#including the condition when $b==0;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$bb</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$single</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">$b</span><span style="color: #339933;">*</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$c</span> <span style="color: #b1b100;">and</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span> <span style="color: #b1b100;">or</span> <span style="color: #0000ff;">$b</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;$cc and $bb&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">elsif</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$c</span> <span style="color: #b1b100;">and</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">0</span> <span style="color: #b1b100;">and</span> <span style="color: #0000ff;">$b</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$cc</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">elsif</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><span style="color: #0000ff;">$c</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> &nbsp;<span style="color: #ff0000;">&quot;and $bb&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> integer<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$money</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$index</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$string</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$flag</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$flag</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$small</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$money</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">1000</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$small</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$money</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$flag</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$money</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">1000</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$small</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$money</span> <span style="color: #339933;">%</span> <span style="color: #cc66cc;">1000</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$money</span><span style="color: #339933;">=</span><span style="color: #000066;">int</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$money</span><span style="color: #339933;">/</span><span style="color: #cc66cc;">1000</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$small</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$string</span><span style="color: #339933;">=</span>hundred<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$small</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot; $scale[$index] &quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$string</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$index</span><span style="color: #339933;">++;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$string</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\s+$//</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$string</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/^\s+//</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$string</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/^and\s+//i</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$string</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> convert<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$digits</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$dot</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0.0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$integer</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$result</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$digits</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/[-,_' ]//g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$digits</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/(\d*)\.(\d*)/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$integer</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$dot</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$2</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$digits</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/(\d+)/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$integer</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$integer</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$result</span> <span style="color: #339933;">.=</span> integer<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$integer</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span> <span style="color: #b1b100;">and</span> <span style="color: #0000ff;">$integer</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$result</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">&quot; and &quot;</span><span style="color: #339933;">.</span> dot<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot; Cent&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #0000ff;">$result</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">&quot; and &quot;</span><span style="color: #339933;">.</span> dot<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot; Cents&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span> <span style="color: #b1b100;">and</span> <span style="color: #0000ff;">$integer</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$result</span> <span style="color: #339933;">.=</span> dot<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot; Cent&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #0000ff;">$result</span> <span style="color: #339933;">.=</span> dot<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$dot</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot; Cents&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$result</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;Total: Say US Dollars &quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$result</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot; Only.&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$money</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #339933;">@</span><span style="color: #000000; font-weight: bold;">ARGV</span><span style="color: #339933;">;</span> <br />
<span style="color: #000066;">print</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">=</span>convert<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$money</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/convert-digits-to-english-value.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Extracting Fanfou Messages: regex vs. XML</title>
		<link>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html</link>
		<comments>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html#comments</comments>
		<pubDate>Wed, 08 Oct 2008 10:53:59 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=33</guid>
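		<description><![CDATA[In-page navigation: Can the official API alone fetch every Fanfou message? Structure of Fanfou messages. Parsing Fanfou messages with regex. Parsing Fanfou messages with XML. Comparing the two. Related reading. There are many ways to export Fanfou messages in bulk, but the basic idea is always to first save the web... ]]></description>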
			<content:encoded><![CDATA[<p>In-page navigation:</p>
<ul>
<li><a href="#xiaochaqu"><strong>Can the official API alone fetch every Fanfou message?</strong></a></li>
<li><a href="#饭否消息结构"><strong>Structure of Fanfou messages</strong></a></li>
<li><a href="#regex"><strong>Parsing Fanfou messages with regex</strong></a></li>
<li><a href="#python"><strong>Parsing Fanfou messages with XML</strong></a></li>
<li><a href="#compare"><strong>Comparing the two</strong></a></li>
<li><a href="#xiangguan"><strong>Related reading</strong></a></li>
</ul>
<p>
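There are many ways to export Fanfou messages in bulk, but the basic idea is always the same: save the pages locally first, then extract the useful Fanfou messages from them. This article does not cover how to download the pages (Xunlei, wget, curl, and so on will do). It focuses on how to extract the useful messages from pages that have already been saved locally.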
<p><span id="more-33"></span></p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
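<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;"><a href="#xiaochaqu"><strong><span style="color: #ff008c;">An aside: can the official API alone fetch every Fanfou message?</span></strong></a></h2>
<p>You might ask why we do not simply use Fanfou's own API. Indeed, the API is faster, more convenient, and highly compatible. However, the official API only lets you download the 20 most recent messages. There is a way to download every message using nothing but the official API, but it is rather evil:</p>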
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>true<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; download <span style="color: #cc66cc;">20</span> messages via API<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; store them<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">delete</span> these <span style="color: #cc66cc;">20</span> messages via API<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</p>
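<p>Downloading and deleting in turn does eventually yield every message: once the first 20 are deleted, the next 20 are guaranteed to surface as the newest ones. In theory this works. But what we want is lossless copying, the way Peter does it in Heroes, not Sylar's brutal cut. Since the official API is limited, we will do the job ourselves. Please read on.</p>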
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Structure of Fanfou Messages</h2>
<p>Open the source of a Fanfou message page, for example my own<br />
<a name="饭否消息结构"></a><a title=" 我爱正则表达式" href="http://fanfou.com/regex" target="_blank">http://fanfou.com/regex/p.1</a> (http://fanfou.com/regex is simply a shortcut for http://fanfou.com/regex/p.1; the full path is used here for generality). On inspection, the useful Fanfou messages sit inside the following structure (the listing is fairly long):</p>
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"> <br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;ol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 代码非抄不能懂也。<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-03 12:07&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;/statuses/QD6qHiqUbeE&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-03 12:07<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过 <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://del.icio.us/fanfou/API%E5%BA%94%E7%94%A8&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; API<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 向自由的身心致敬！ - 早嗷嗷也盼~晚安安也盼~望穿安安双眼~~怎知道今日里打土匪进深山自己的队伍来哎到嗷~面安前安呐啊啊啊啊啊~~~ <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://fanfou.com/linkto/aHR0cDovL3d3dy5kb3ViYW4uY29tL2V2ZW50LzEwMjczNDg3Lw&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://www.douban.com/event/10273487/<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-06 14:07&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;/share/bd96z1U-gHw&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-06 14:07<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://help.fanfou.com/share_button.html&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 饭否分享<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;photo&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://fanfou.com/photo/8JsezhHM_VU&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;img</span> <span style="color: #000066;">src</span>=<span style="color: #ff0000;">&quot;http://photo.fanfou.com/m0/00/19/e2_36807.jpg&quot;</span> <span style="color: #000066;">alt</span>=<span style="color: #ff0000;">&quot;caixinceshi - no description&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 上传了新照片：caixinceshi - no description<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-03 11:33&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-03 11:33<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://help.fanfou.com/mobile_mms.html&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 彩信<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; # more <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> entries follow, at most 20 per page.<br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/ol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
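<p>Tip: in the raw Fanfou source, all the messages sit on a single line, which is hard to read. Copy the code you need (taking care that opening and closing tags stay balanced) into vim, run <tt class="string">:%s/&gt;/&gt;\r/g</tt> (which adds a line break after every &gt;), then press <tt class="string">ggvG</tt> to select everything and <tt class="string">=</tt> to re-indent. The code then appears neatly indented and is easy to read.</p>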
</p>
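<p>The same line-splitting step can also be scripted. The following is a minimal Python sketch, not part of the original program; the function name <tt>split_tags</tt> and the sample string are made up for illustration. It only reproduces the <tt>:%s/&gt;/&gt;\r/g</tt> substitution and does not re-indent the result the way vim's <tt>=</tt> does.</p>

```python
def split_tags(html: str) -> str:
    """Insert a line break after every '>' (same effect as :%s/>/>\\r/g in vim)."""
    return html.replace(">", ">\n")


# Hypothetical one-line fragment standing in for a saved Fanfou page.
sample = '<ol><li><span class="content">hello</span></li></ol>'

if __name__ == "__main__":
    print(split_tags(sample), end="")
```

<p>Each tag now ends its own line, which is usually enough to see where one <tt>&lt;li&gt;</tt> entry stops and the next begins.</p>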
<p><a name="regex"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Parsing Fanfou Messages with regex</h2>
<p> </a></p>
<p>Below is the code that parses Fanfou messages with a regex (copied directly from my original Perl Fanfou-scraping program).</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;"># Under /x, literal whitespace in the pattern is ignored, so spaces inside tags are written as \s+.</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$ffmsg</span><span style="color: #339933;">=</span><span style="color: #000066;">qr</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a\s+href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/(?:statuses|share)/([-_a-zA-Z0-9]{11})&quot;</span>\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;time&quot;</span>\s+title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;([-: 0-9]{16})&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;">&#40;</span>网页<span style="color: #339933;">|</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*&lt;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&gt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&gt;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:&lt;/</span>a<span style="color: #339933;">&gt;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>li<span style="color: #339933;">&gt;</span><br />
<span style="color: #009900;">&#125;</span>xi<span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
</p>
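<p>As you can see, a regular expression written this way mirrors the layout of the original page source quite faithfully. A few details deserve explanation:</p>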
<ul>
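<li>In the first pair of parentheses I use <tt class="regex">([^&lt;]+?)</tt> to capture the message body. (A complete message consists of the body, the send time, the message uuid such as QD6qHiqUbeE, the sending method, and the type: picture message or text.) At first I used <tt class="regex">.*?</tt>, but that was imprecise: sometimes two messages ended up merged into one match. <tt class="regex">([^&lt;]+?)</tt>, by contrast, captures everything from the current position up to the next &lt;. You may ask whether a &lt; inside the message body could cause trouble. It cannot, because Fanfou converts every &lt;, along with any other character that might interfere with parsing, into an entity of the form &amp;lt;, so parsing is unaffected. At the same time, <strong>a precise regular expression helps efficiency, because a pattern that is not going to match fails as early as possible.</strong></li>
<li><tt class="regex">(?:statuses|share)</tt>. This alternation is used when capturing the Fanfou uuid. It covers not only messages posted in the ordinary ways (web, SMS, mobile, API, IM tools, and so on) but also messages posted by the "Fanfou Share" tool. I am not very fond of Fanfou Share. (Perhaps some day I will find the time to write a post exposing its shortcomings?) The reason for treating "Fanfou Share" messages separately from ordinary ones is that the two have different structures.</li>
<li>The regex <tt class="regex">(网页|(?:\s*&lt;[^&gt;]+&gt;)[^&lt;]+(?:&lt;/a&gt;))</tt> (网页 means "web page") uses both capturing and non-capturing parentheses. The non-capturing ones keep the program from becoming too complicated and make it easier to refer to groups by number ($1, $2, and so on; the more groups there are, the more confusing it gets, and once the regex is modified it turns into a complete mess). They also save memory: if a program captures too much and does not release it in time, it may exhaust resources. After all, we are not capturing just a few dozen bytes. Bear in mind that a Fanfou user may have close to a hundred thousand messages. (I mean prolific chatterboxes like <a href="http://fanfou.com/appleice">苹果流冰</a>.)</li>
<li>The <tt class="regex">xi</tt> flags: <tt class="regex">x</tt> makes the regex ignore whitespace and allows comments; <tt class="regex">i</tt> makes matching case-insensitive. Note that under <tt class="regex">x</tt> a literal space outside a character class has to be escaped (for example as <tt class="regex">\ </tt>), otherwise it is ignored.</li>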
</ul>
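<p>To illustrate the first point, here is a minimal Python 3 sketch (the post's own regex is Perl). The sample markup and the names SAMPLE, LAZY and NEGATED are invented for illustration and are not Fanfou's real page source. It shows how a lazy <tt class="regex">.*?</tt> can run across tag boundaries and merge two messages when the rest of the pattern forces it onward, while <tt class="regex">[^&lt;]+?</tt> can never cross a &lt;.</p>

```python
import re

# Hypothetical markup: only the second message was sent via "web".
SAMPLE = (
    '<li><span class="content">first</span><span class="method">sms</span></li>'
    '<li><span class="content">second</span><span class="method">web</span></li>'
)

# Lazy dot: free to run across tags until the tail of the pattern matches.
LAZY = re.compile(r'<span class="content">(.*?)</span><span class="method">web', re.S)
# Negated class: the body can never contain '<', so it stays inside one span.
NEGATED = re.compile(r'<span class="content">([^<]+?)</span><span class="method">web')

lazy_body = LAZY.search(SAMPLE).group(1)
negated_body = NEGATED.search(SAMPLE).group(1)

print(lazy_body)     # starts in the first message and ends in the second
print(negated_body)  # only the body of the message that really matches
```

<p>With the negated class, the attempt that starts at the first message fails as soon as it meets the first &lt;, which is also the "fail early" behaviour described above.</p>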
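<p>Extracting Fanfou message text with regular expressions involves a great many details. Overlook one of them and the program will embarrass you as soon as it runs. I will skip the format of Fanfou picture messages; the principle is the same, so this is enough to make the point.</p>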
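<p>Two of those details, spaces under the extended flag and the split between capturing and non-capturing groups, can be sketched in Python 3 with <tt class="regex">re.VERBOSE</tt>, which plays roughly the role of Perl's <tt class="regex">/x</tt>. The snippet ITEM, its timestamp, and the group names body, uuid and time are assumptions made up for this example; only the uuid QD6qHiqUbeE and the sub-patterns come from the post.</p>

```python
import re

# Hypothetical one-message snippet, loosely modelled on the pattern in the post.
ITEM = ('<li><span class="content">hello</span>'
        '<span class="stamp"><a href="/statuses/QD6qHiqUbeE" class="time" '
        'title="2009-02-23 10:15">1 hour ago</a></span></li>')

MESSAGE = re.compile(r'''
    <span\ class="content">               # a literal space must be escaped
        (?P<body>[^<]+?)                  # message body
    </span>
    <span\ class="stamp">
        <a\ href="/(?:statuses|share)/    # non-capturing: gets no group number
            (?P<uuid>[-_a-zA-Z0-9]{11})"
        \ class="time"\ title="(?P<time>[-:\ 0-9]{16})">
''', re.VERBOSE | re.IGNORECASE)

m = MESSAGE.search(ITEM)
fields = m.groupdict()
print(fields)
```

<p>Named groups are one way to avoid the renumbering problem: adding or removing a group does not change how the others are referenced.</p>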
</p>
<p><a name="python"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Parsing Fanfou Messages with XML</h2>
<p></a></p>
<p>Now let's look at parsing Fanfou messages as XML in Python. Note: this program draws on the <a href="http://code.google.com/p/pyfan/" target="_blank">pyfan</a> program by <a href="http://www.happysky.org/" target="_blank"><strong><span style="color: #ff008c;">ppip</span></strong></a>.<br />
 </p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span> <span style="color: #808080; font-style: italic;"># Python 2 standard library; needed for urlopen below</span><br />
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">xml</span>.<span style="color: black;">dom</span> <span style="color: #ff7700;font-weight:bold;">import</span> minidom, Node <span style="color: #808080; font-style: italic;"># bring in the parsing tool: minidom</span><br />
node = minidom.<span style="color: black;">parse</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;http://fanfou.com/zhasm/p.1&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<span style="color: #808080; font-style: italic;"># fetch the page http://fanfou.com/zhasm/p.1 and parse all of it into node</span><br />
l = node.<span style="color: black;">getElementsByTagName</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ol&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
<span style="color: #808080; font-style: italic;"># keep the part that holds the Fanfou messages in l</span><br />
number = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>l.<span style="color: black;">childNodes</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;"># assumption: the original listing never defines number</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> c <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, number<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># skip nodes that carry a class attribute</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">hasAttribute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;class&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">continue</span><br />
&nbsp; &nbsp; content = <span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># send time:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>.\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; firstChild.<span style="color: black;">getAttribute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;title&quot;</span><span style="color: black;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># message body:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">22</span>:-<span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># uuid:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>.<span style="color: black;">firstChild</span>.\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; getAttribute<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;href&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">10</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
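<p>In an XML document, matching opening and closing tags become vivid landmarks. XML parsing functions recognize these landmarks easily and pull out the content you need.</p>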
<ul>
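<li><strong>childNodes[c].childNodes[0].toxml()[22:-7]</strong>: for each Fanfou message (childNodes[c]), this takes the first child node of the message (childNodes[0]), serializes it, and keeps everything from the 23rd character up to, but not including, the 7th character from the end. Which part is that? The slice drops the 22-character opening tag and the 7-character closing tag, so what remains is the content shown as dots in &lt;span class=&quot;content&quot;&gt;&#8230;&lt;/span&gt;.</li>
<li>The send time, body, and uuid of each message are stored in a tuple.</li>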
</ul>
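<p>As a variation on the same idea, the fields can be looked up by tag name and class rather than by child position and fixed slice offsets. The following is a minimal Python 3 sketch under stated assumptions: PAGE is an invented, well-formed stand-in for a Fanfou page (the real markup may differ), and text_of and extract are hypothetical helper names, not part of pyfan or of the listing above.</p>

```python
from xml.dom import minidom

# Hypothetical, well-formed stand-in for a Fanfou timeline page.
PAGE = """<ol>
  <li><span class="content">hello &lt;world&gt;</span>
      <span class="stamp"><a href="/statuses/QD6qHiqUbeE" class="time"
        title="2009-02-23 10:15">1 hour ago</a></span></li>
</ol>"""

def text_of(node):
    """Concatenate the text nodes directly under node."""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)

def extract(xml_text):
    """Return a list of (send time, body, uuid) tuples, one per <li>."""
    doc = minidom.parseString(xml_text)
    messages = []
    for li in doc.getElementsByTagName("li"):
        # Index the spans of this message by their class attribute.
        spans = {s.getAttribute("class"): s
                 for s in li.getElementsByTagName("span")}
        link = spans["stamp"].getElementsByTagName("a")[0]
        messages.append((
            link.getAttribute("title"),                   # send time
            text_of(spans["content"]),                    # body, entities decoded
            link.getAttribute("href").rsplit("/", 1)[-1]  # uuid
        ))
    return messages

messages = extract(PAGE)
print(messages)
```

<p>Splitting the href on its last slash, instead of cutting a fixed number of characters, is meant to cope with both the /statuses/ and /share/ prefixes mentioned earlier.</p>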
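<p>Once you have the content, how you cook it afterwards is entirely up to you.</p>
<p>It is worth mentioning that, while downloading Fanfou messages in bulk, I more than once found Fanfou pages inaccessible. I asked 郭万怀 of Fanfou, who answered that, to reduce server load, each IP address is allowed 100 page requests per minute; go over that and the address is blocked automatically. In my own tests the limit turned out to be fewer than 100 pages. A fairly reliable interval is sleep(15) after each page extracted. That is rather slow, but there is no way around it. Of course, some people say they ran the grabber script I wrote earlier, downloaded several hundred pages in one go, and never got blocked. To them I can only say: you have good karma and good luck.</p>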
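<p>The pause between pages can be sketched as follows in Python 3. This is only an illustration: fetch_pages and its parameters are hypothetical names, the page URLs p.2 and p.3 are extrapolated from the p.1 address used above, and the fetcher is a stand-in so that the example runs without touching the network.</p>

```python
import time

def fetch_pages(urls, fetch, delay=15.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:            # no pause before the very first request
            sleep(delay)
        pages.append(fetch(url))
    return pages

# Offline demonstration: the fetcher only echoes the URL and the "sleep"
# only records the requested pause. In real use, fetch could wrap
# urllib.request.urlopen(url).read() and sleep would stay time.sleep.
pauses = []
demo_urls = ["http://fanfou.com/zhasm/p.%d" % n for n in range(1, 4)]
demo_pages = fetch_pages(demo_urls, fetch=lambda url: "page:" + url,
                         sleep=pauses.append)
print(len(demo_pages), pauses)
```

<p>Passing the fetch and sleep functions in as arguments keeps the throttling logic separate from the download code, so the delay can be tuned without changing anything else.</p>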
</p>
<p><a name="compare"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Comparing the Two</h2>
<p></a> </p>
<p>In my opinion, XML parsing compares with regex as follows:</p>
<ul>
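<li><strong>Generality:</strong> XML parsing is general-purpose. It can parse not only Fanfou messages but any other well-formed HTML with only small changes to the code. A regular expression is special-purpose and cannot be applied everywhere. When Fanfou's interface or page framework is tweaked slightly, the tools that parse with regular expressions will probably be the first to fall over.</li>
<li><strong>Readability:</strong> Some say Perl is a write-only language, and regex even more so. The point is that Perl or regex code is written on impulse, flows freely, and runs efficiently, but if it is badly formatted and has no comments or documentation, rereading it days or months later is like reading hieroglyphics. Python code that parses with an XML library, by contrast, is neatly formatted and uses library functions whose names say what they do, so it is more readable. That is the general picture. Still, we can write code (even Perl or regex code) as neatly and readably as possible, especially since Perl supports the <tt class="regex">/x</tt> flag.</li>
<li><strong>Efficiency:</strong> A well-compiled regex should run faster than XML parsing. On the other hand, using XML saves programming time, while using a regex costs some programming time and in theory buys a little efficiency. Interested readers can write a small program, loop it thousands of times, and compare the average times.</li>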
</ul>
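<p>The timing experiment suggested in the last item could look like the Python 3 sketch below. The sample page, the function names with_regex and with_minidom, and the run count are all assumptions chosen for illustration. No result is claimed here: the numbers depend on the machine, the Python version, and the size of the page.</p>

```python
import re
import timeit
from xml.dom import minidom

# Hypothetical well-formed sample; real pages are larger and messier.
PAGE = "<ol>" + "".join(
    '<li><span class="content">msg %d</span></li>' % n for n in range(20)
) + "</ol>"

CONTENT = re.compile(r'<span class="content">([^<]+)</span>')

def with_regex(page=PAGE):
    """Extract every message body with a precompiled regex."""
    return CONTENT.findall(page)

def with_minidom(page=PAGE):
    """Extract every message body by parsing the page into a DOM."""
    doc = minidom.parseString(page)
    return [span.firstChild.data for span in doc.getElementsByTagName("span")]

runs = 200
regex_avg = timeit.timeit(with_regex, number=runs) / runs
dom_avg = timeit.timeit(with_minidom, number=runs) / runs
print("regex:   %.6f s per run" % regex_avg)
print("minidom: %.6f s per run" % dom_avg)
```

<p>Both functions return the same list for this sample, so the comparison measures two ways of doing the same job.</p>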
<p>
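Having written this far, I am reminded of the comparison of two martial arts in chapter 5 of Jin Yong's 《鹿鼎记》 (The Deer and the Cauldron), "金戈运启驱除会，玉匣书留想象间", which is quite apt:<br />
The "Great Compassion Thousand-Leaf Hand" (大慈大悲千叶手) has too many moves and is a chore to memorize. The "Eight Trigrams Roaming Dragon Palm" (八卦游龙掌) has only eight times eight, sixty-four moves, yet with its endless variations it can hold its own against the Thousand-Leaf Hand. So which art is stronger? Both are first-rate palm techniques, and neither can be called the stronger. Whoever has trained more deeply and applies the art more skillfully wins.
<p>In terms of this post, regex is the Great Compassion Thousand-Leaf Hand, with too many details to watch, while the XML approach is the Eight Trigrams Roaming Dragon Palm with its mere sixty-four moves. Both tools are very useful.</p>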
</p>
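<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">A Call for More Official Features</h2>
<p>I digress. This is just some venting and has nothing to do with XML or regex. More than once, on Fanfou and on my own blog, I have complained that downloading and parsing with the clumsy method above is a last resort. The most convenient way would be for Fanfou to provide an official bulk-export tool: a single database export query would accomplish what takes us half a day of hard work to achieve through workarounds. Perhaps the Fanfou staff are all busy enhancing and polishing 海内 (Hainei). Fanfou has been left to fend for itself, with no updates for a long time, while jiwai.de, zuosa, and others roll out one new feature after another. </p>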
</p>
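<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Extension</h2>
<p>The approach in this post applies to Twitter as well. But Twitter has been getting slower and slower, and for a while it apparently did not even support viewing older pages.</p>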
</p>
<p><a name="xiangguan"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Related Reading</h2>
<p></a></p>
<ul>
<li><a href="http://iregex.org/blog/fanfou-private-message-format-analysis.html" target="_blank">Analysis of the Fanfou private message format</a></li>
<li><a href="http://zhasm.com/blog/fanfou-msg-grabber-limitation-and-suggestion-on-sharing-msg.html">On the limits of bulk-downloading Fanfou messages, and suggestions for the Fanfou Share feature</a></li>
<li><a href="http://zhasm.com/blog/about-my-fanfou-applications.html">A few words about the Fanfou applications I have written</a></li>
<li><a href="http://zhasm.com/blog/comments-on-fanfou.html">Fanfou: can it still eat?</a></li>
<li><a href="http://zhasm.com/blog/uuid-in-twitter-and-fanfou.html">uuid in twitter and fanfou</a></li>
<li><a href="http://zhasm.com/blog/fanfou-message-grabber.html">Bulk grabber script: export all of your own Fanfou messages in one go!</a></li>
<li><a href="http://zhasm.com/blog/fanfou-vs-twitter-base64-vs-tinyurl.html">fanfou vs twitter, base64 vs tinyurl?</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
