<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; 应用</title>
	<atom:link href="http://iregex.org/blog/category/%e5%ba%94%e7%94%a8/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Tue, 29 Mar 2011 05:04:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>		<item>
		<title>A Safer rm Script</title>
		<link>http://iregex.org/blog/safer-rm-command.html</link>
		<comments>http://iregex.org/blog/safer-rm-command.html#comments</comments>
		<pubDate>Tue, 29 Mar 2011 05:00:18 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=155</guid>
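		<description><![CDATA[Many people have probably felt the despair of accidentally deleting a file on Linux. I put together a somewhat safer rm script and am posting it here. Features: Takes over the native /bin/rm command and moves files slated for deletion to a trash folder with mv, which keeps them in one place and, more importantly,... ]]></description>
			<content:encoded><![CDATA[<p>Many people have probably felt the despair of accidentally deleting a file on Linux. I put together a somewhat safer rm script and am posting it here.</p>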
<p><span id="more-155"></span></p>
<h2 style="background-color: rgb(153, 204, 0); border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); font-size: 21px; line-height: 35px; padding-top: 3px; text-indent: 6px;">Features</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Takes the place of the native <code class="codecolorer bash default"><span class="bash"><span style="color: #000000; font-weight: bold;">/</span>bin<span style="color: #000000; font-weight: bold;">/</span><span style="color: #c20cb9; font-weight: bold;">rm</span></span></code> command and moves files slated for deletion to a trash folder with <code class="codecolorer bash default"><span class="bash"><span style="color: #c20cb9; font-weight: bold;">mv</span></span></code>, which keeps them in one place and, more importantly, lets you rescue files deleted by mistake.</li>
<li>When you need the native <code class="codecolorer bash default"><span class="bash"><span style="color: #c20cb9; font-weight: bold;">rm</span></span></code>, call it by its full path, for example: <code class="codecolorer bash default"><span class="bash"><span style="color: #000000; font-weight: bold;">/</span>bin<span style="color: #000000; font-weight: bold;">/</span><span style="color: #c20cb9; font-weight: bold;">rm</span> <span style="color: #660033;">-rf</span> somefolder</span></code> </li>
<li>Logs each deletion to <code class="codecolorer bash default"><span class="bash"><span style="color: #000000; font-weight: bold;">/</span>var<span style="color: #000000; font-weight: bold;">/</span>log<span style="color: #000000; font-weight: bold;">/</span>trash.log</span></code>. If you do not want a log, just set the <code class="codecolorer bash default"><span class="bash">log</span></code> variable to an empty string. </li>
<li>Renames each file automatically as it is moved to the trash, so that files with the same name can be deleted again and again.</li>
<li>Screenshot: <img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/Screenshot2011-03-29at123750PM.png" alt="我爱正则表达式" /></li>
</ul>
</blockquote>
<h2 style="background-color: rgb(153, 204, 0); border: 1px solid rgb(102, 102, 102); color: rgb(0, 0, 0); font-size: 21px; line-height: 35px; padding-top: 3px; text-indent: 6px;">The Code</h2>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Usage</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Paste the code below into <code class="codecolorer bash default"><span class="bash">~<span style="color: #000000; font-weight: bold;">/</span>.bashrc</span></code> or <code class="codecolorer bash default"><span class="bash">~<span style="color: #000000; font-weight: bold;">/</span>.bash_profile</span></code>, then reload the file with <code class="codecolorer bash default"><span class="bash"><span style="color: #7a0874; font-weight: bold;">source</span> ~<span style="color: #000000; font-weight: bold;">/</span>.bashrc</span></code>.</li>
<li>To bypass the custom <code class="codecolorer bash default"><span class="bash"><span style="color: #c20cb9; font-weight: bold;">rm</span></span></code> temporarily, either call <code class="codecolorer bash default"><span class="bash"><span style="color: #000000; font-weight: bold;">/</span>bin<span style="color: #000000; font-weight: bold;">/</span><span style="color: #c20cb9; font-weight: bold;">rm</span></span></code> as described above, or remove the function definition from the current shell: <code class="codecolorer bash default"><span class="bash"><span style="color: #7a0874; font-weight: bold;">unset</span> <span style="color: #660033;">-f</span> <span style="color: #c20cb9; font-weight: bold;">rm</span></span></code>.</li>
<li>Adjust the variable definitions to suit your own system.</li>
</ul>
</blockquote>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#safe remove: mv the files to .Trash under a unique name</span><br />
<span style="color: #666666; font-style: italic;">#and log the action</span><br />
<span style="color: #000000; font-weight: bold;">function</span> <span style="color: #c20cb9; font-weight: bold;">rm</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><br />
<span style="color: #7a0874; font-weight: bold;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #007800;">trash</span>=<span style="color: #ff0000;">&quot;<span style="color: #007800;">$HOME</span>/.Trash&quot;</span><br />
&nbsp; &nbsp; <span style="color: #007800;">log</span>=<span style="color: #ff0000;">&quot;/var/log/trash.log&quot;</span><br />
&nbsp; &nbsp; <span style="color: #007800;">stamp</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">date</span> <span style="color: #ff0000;">&quot;+%Y-%m-%d %H:%M:%S&quot;</span><span style="color: #000000; font-weight: bold;">`</span> <span style="color: #666666; font-style: italic;">#current time</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #660033;">-f</span> <span style="color: #ff0000;">&quot;$1&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">do</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#remove the possible ending /</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007800;">file</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$1</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sed</span> <span style="color: #ff0000;">'s#\/$##'</span> <span style="color: #000000; font-weight: bold;">`</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007800;">pure_filename</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$file</span> &nbsp;<span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">awk</span> <span style="color: #660033;">-F</span> <span style="color: #000000; font-weight: bold;">/</span> <span style="color: #ff0000;">'{print $NF}'</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sed</span> <span style="color: #660033;">-e</span> <span style="color: #ff0000;">&quot;s#^\.##&quot;</span> <span style="color: #000000; font-weight: bold;">`</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$pure_filename</span> <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">grep</span> <span style="color: #ff0000;">&quot;\.&quot;</span> <span style="color: #000000; font-weight: bold;">`</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007800;">new_file</span>=<span style="color: #000000; font-weight: bold;">`</span> <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$pure_filename</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sed</span> <span style="color: #660033;">-e</span> <span style="color: #ff0000;">&quot;s/\([^.]*$\)/<span style="color: #007800;">$RANDOM</span>.\1/&quot;</span> <span style="color: #000000; font-weight: bold;">`</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007800;">new_file</span>=<span style="color: #ff0000;">&quot;<span style="color: #007800;">$pure_filename</span>.<span style="color: #007800;">$RANDOM</span>&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">fi</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007800;">trash_file</span>=<span style="color: #ff0000;">&quot;<span style="color: #007800;">$trash</span>/<span style="color: #007800;">$new_file</span>&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #c20cb9; font-weight: bold;">mv</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$file</span>&quot;</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$trash_file</span>&quot;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #660033;">-w</span> <span style="color: #007800;">$log</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #660033;">-e</span> <span style="color: #ff0000;">&quot;[<span style="color: #007800;">$stamp</span>]<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #007800;">$file</span><span style="color: #000099; font-weight: bold;">\t</span>=&gt;<span style="color: #000099; font-weight: bold;">\t</span>[<span style="color: #007800;">$trash_file</span>]&quot;</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">tee</span> <span style="color: #660033;">-a</span> <span style="color: #007800;">$log</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #660033;">-e</span> <span style="color: #ff0000;">&quot;[<span style="color: #007800;">$stamp</span>]<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #007800;">$file</span><span style="color: #000099; font-weight: bold;">\t</span>=&gt;<span style="color: #000099; font-weight: bold;">\t</span>[<span style="color: #007800;">$trash_file</span>]&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">fi</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #7a0874; font-weight: bold;">shift</span> &nbsp; <span style="color: #666666; font-style: italic;">#increment the loop</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">done</span><br />
<span style="color: #7a0874; font-weight: bold;">&#125;</span></div></td></tr></tbody></table></div>
</blockquote>
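<p>The renaming step is the densest part of the script, so here is a rough sketch of the same rule in Python 3. It is only an illustration and not part of the script itself: the function name <code>trash_name</code> and the explicit <code>token</code> argument (standing in for <code>$RANDOM</code>) are assumptions made for this example.</p>

```python
def trash_name(path, token):
    """Return the name a deleted file would get inside the trash folder."""
    # Drop one trailing slash, as the sed 's#\/$##' step does.
    if path.endswith("/"):
        path = path[:-1]
    # Keep only the last path component (the awk -F / '{print $NF}' step).
    name = path.rsplit("/", 1)[-1]
    # Strip one leading dot (the sed "s#^\.##" step).
    if name.startswith("."):
        name = name[1:]
    if "." in name:
        # Insert the token before the last extension: notes.txt -> notes.42.txt
        stem, ext = name.rsplit(".", 1)
        return f"{stem}.{token}.{ext}"
    # No extension: append the token instead.
    return f"{name}.{token}"


if __name__ == "__main__":
    import random

    # $RANDOM in bash yields an integer between 0 and 32767.
    for p in ["docs/notes.txt", ".bashrc", "backup.tar.gz", "somefolder/"]:
        print(p, "=>", trash_name(p, random.randint(0, 32767)))
```

<p>Under this rule the token lands before the last extension when there is one and at the end of the name otherwise, which is why deleting the same file name twice does not overwrite the earlier copy unless the two random numbers happen to collide.</p>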
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/safer-rm-command.html/feed</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Getting the Zhengma Input Method Working on Mac</title>
		<link>http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html</link>
		<comments>http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html#comments</comments>
		<pubDate>Sun, 07 Nov 2010 14:43:05 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[ime]]></category>
		<category><![CDATA[openvanilla]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[zhengma]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=152</guid>
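		<description><![CDATA[I recently switched to a Mac and was frustrated to find no usable Zhengma input method. So, not being afraid of a little tinkering, I built a code table myself and am recording the process here. Zhengma: I doubt many people use Zhengma. It is a very niche input scheme, similar to Wubi and said to be more... ]]></description>
			<content:encoded><![CDATA[<p>I recently switched to a Mac and was frustrated to find no usable Zhengma input method. So, not being afraid of a little tinkering, I built a code table myself and am recording the process here. </p>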
<p> <span id="more-152"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Zhengma</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
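<p>I doubt many people use <a href="http://zh.wikipedia.org/zh/%E9%83%91%E7%A0%81%E8%BE%93%E5%85%A5%E6%B3%95">Zhengma</a>. It is a very niche input scheme, similar to Wubi and said to be more &#8220;regular&#8221;. I have never used Wubi, so I cannot really compare the two. I like Zhengma and have used it all along. In any case, pinyin input methods are not for me. </p>
<p>On Ubuntu you can use ibus, but on the Mac things are not so lucky. Fit is a decent input method, but it only ships with built-in pinyin and Wubi and does not yet support custom code tables. QIM is paid software and appears to support custom code tables, but it lacks documentation, and after a long search I found little useful information. </p>
<p>After some searching I found OpenVanilla. It is free, open, and supports custom code tables; see the <a href="http://openvanilla.org/">OpenVanilla site</a> . I also came across a similar solution on the Zhengma enthusiasts' forum, though it is less than perfect (for example, its word frequencies are unbearably jumbled); see <a href="http://www.cn25.net/zm/showbbs.asp?bd=14&#038;id=841&#038;totable=1">the forum thread</a>  . </p>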
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Building a Zhengma Code Table</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
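<p><strong>Word list:</strong> I used the corpus text from Sogou Labs, available from the <a href="http://www.sogou.com/labs/dl/w.html">Sogou Labs download page</a>  . Its format is &#8220;word frequency POS1 POS2 … POSN&#8221;; I simply kept the words I found useful together with their frequencies. The file only contains words of two or more characters and has no frequency information for single characters. </p>
<p><strong>Character frequency list:</strong> I found a character frequency list on <a href="http://lingua.mtsu.edu/chinese-computing/statistics/">lingua.mtsu.edu</a> . To my delight, it also includes the pinyin and tone for each character, including characters with multiple readings. That made it much easier to build a code table with pinyin as a fallback. Pinyin only serves as an aid to the shape-based codes: I look up a character by pinyin only when I cannot type it by shape, so no abbreviated pinyin for characters or words is needed. And because it is only an aid, all the pinyin entries can simply go at the end of the finished list. </p>
<p>With these two files in hand, I could start building the code table. I still needed two more things, though: a table of the codes that single characters contribute when forming words, and a table of conventional shortcut codes (for example, the word &#8220;北京&#8221; (Beijing) can be typed in Zhengma either with the short code ts or with the regular code trsj). I had already prepared the single-character word-formation table earlier; for the latter, I found a large-character-set code table on a Zhengma CD that turned up in a web search.</p>
<p>The process is not difficult, but there are many details, which I will not go into here. Naturally, <a href="http://iregex.org">regular expressions</a> were indispensable along the way.</p>
<p>The end result is a five-code Zhengma table, and it works very well. OpenVanilla takes about one second to start each time, but once it is running the delay is no longer noticeable. After all, the final word list has some 190,000 entries.</p>
<p>OpenVanilla's biggest advantages are that it is open and free. Compared with QIM or Fit, though, the features and customization it offers as an input method are quite limited. You cannot even customize the key for switching between Chinese and English, let alone adjust word frequencies dynamically or add and remove entries. So I wrote a bash function to search the existing entries, a bash script to delete entries, and a Python program to add new words on the fly.</p>
<p>That last Python program for adding new words is rather fun. It can read a list of words from the command line or from a file and add them to the dictionary in bulk. While adding them, it generates correctly formatted Zhengma codes by itself; when it is done, it also kills OpenVanilla so that the new dictionary gets reloaded. </p>
<p>The programs have been pushed to GitHub: <a href="https://github.com/zhasm/zhengma">zhasm/zhengma</a>. </p>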
</blockquote>
<p><a href="http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/Screenshot2010-11-07at103904PM.png" border="0" alt="Photobucket"></a></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Update</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li><strong>2010-11-08</strong> I looked through the code tables of OpenVanilla's other input methods and sorted out the punctuation marks. </li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Simple Chinese Word Segmentation Program</title>
		<link>http://iregex.org/blog/simple-nlp-for-chinese.html</link>
		<comments>http://iregex.org/blog/simple-nlp-for-chinese.html#comments</comments>
		<pubDate>Sun, 26 Sep 2010 14:41:12 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=151</guid>
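		<description><![CDATA[kds: "Former ambassador to France Wu Jianmin pointed out that one should 理**国." After thinking about it for a moment, the two asterisks turned out to be the characters 性爱 ("sex"). Living in China in an age of mechanical keyword blocking really does have its funny side. (via) You have probably seen the joke on Twitter about "理＊＊国" as well.... ]]></description>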
			<content:encoded><![CDATA[<blockquote  style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
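    kds: "Former ambassador to France Wu Jianmin pointed out that one should 理**国." After thinking about it for a moment, the two asterisks turned out to be the characters 性爱 ("sex"). Living in China in an age of mechanical keyword blocking really does have its funny side. (<a href="https://twitter.com/rightf/status/25555437368" title="我爱正则表达式" target="_blank">via</a>)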
</p></blockquote>
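<p>You have probably seen the joke on Twitter about "<a href="https://twitter.com/rex_zhasm/status/25567030862" title="我爱正则表达式" target="_blank">理＊＊国</a>" as well: the phrase is 理性爱国 (roughly, "rational patriotism"), and a keyword filter masked its middle two characters, 性爱 ("sex"). I happen to want to learn about Chinese word segmentation, and this is my first post on the subject.</p>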
<p><span id="more-151"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">How Segmentation Works and How It Is Implemented</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
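<p>For English and other languages that separate words with whitespace, segmentation is not a problem. Chinese word segmentation involves far more detail. On the single issue of "<a href="http://is.gd/ftZNO" title="我爱正则表达式" target="_blank">true ambiguity</a>" (in short, sentences so ambiguous that, without context, even a living person cannot tell where the word breaks belong), earlier researchers have already written at great length. That is an advanced topic, though. I want to start by implementing the simplest possible segmenter.</p>
<p>As I understand it, the simplest segmenter first cuts the Chinese text into its smallest units, single characters, then looks words up in a dictionary and merges the characters into words following the leftmost-longest rule (which happens to match the spirit of regular expressions). This should also be the fastest approach: it only splits and merges according to the given data, and does not need to weigh grammatical elements (parts of speech such as nouns, verbs, adjectives, numerals, measure words, and pronouns; syntactic roles such as subject, predicate, object, attributive, adverbial, and complement) or count occurrences in context.</p>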
<p>To split the source text, I follow the approach from <a href="http://iregex.org/blog/words-counter-in-python.html" title="我爱正则表达式" target="_blank">"Counting Chinese Characters and English Words"</a> and simply match with the regular expression <code class="codecolorer python default"><span class="python">r<span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span></span></code>.</p>
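<p>The pattern above is written for UTF-8 byte strings in Python 2, where it treats every run of three high bytes as one Chinese character. On Python 3 text strings, a roughly equivalent split might look like the sketch below. The names <code>TOKEN_RE</code> and <code>tokenize</code> are made up for this example, and the pattern is an approximation rather than the post's own code.</p>

```python
import re

# A run of ASCII letters, digits, underscores, or hyphens is one token;
# every non-ASCII character (for example a Chinese character) is a token by itself.
# ASCII whitespace and punctuation match neither branch and are skipped.
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+|[^\x00-\x7f]")


def tokenize(text):
    """Split text into ASCII words and single non-ASCII characters."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("我爱regex, 2010"))
```

<p>The ASCII class is spelled out because <code>\w</code> on a Python 3 string also matches Chinese characters, which would swallow a whole sentence as a single token.</p>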
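<p>For the dictionary I use <a href="http://www.mdbg.net/chindict/chindict.php?page=cedict" title="我爱正则表达式" target="_blank">CC-CEDICT</a>, for three reasons: it is free of copyright problems; it is fairly fast; and Chrome uses it too (you may have noticed that double-clicking a Chinese sentence in Chrome selects and highlights a Chinese word rather than a single character or the whole line).</p>
<p>Next comes the segmentation itself. After some thought, I found that the search-tree idea can be used as is. For the principle, see <a href="http://iregex.org/blog/trie-in-python.html" title="我爱正则表达式" target="_blank">Trie in Python</a>. Concretely, the word list is read into memory character by character to build the search tree; the target text is then analyzed character by character. If the search can continue past the current character, it continues; otherwise it stops and the characters matched so far are treated as one word.</p>
<p>In theory this algorithm is fairly fast (I have not benchmarked it), for three reasons. First, the trie is built from nested hash tables, trading space for time: each character step is a constant-time lookup, so matching a word costs time proportional to the word's length rather than to the size of the dictionary. Second, the word list is only about 800K, so it loads easily and takes up little memory. Third, the slowest part of the algorithm is building the trie at startup; after that, speed is no longer affected.</p>
<p>As for extensibility, for now new words can only be added to words.txt by hand; there is no machine learning.</p>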
</blockquote>
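<p>The approach described above can be sketched in Python 3 as follows. This is a minimal illustration under stated assumptions, not the post's program: the names <code>build_trie</code>, <code>segment</code>, and <code>END</code> and the tiny word list are invented for the example, and the input is assumed to be already split into single characters.</p>

```python
END = ""  # marker key for "a word ends here", as with ref[''] = 1 in the post


def build_trie(words):
    """Build a trie out of nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment(chars, trie):
    """Split a sequence of characters into words, leftmost-longest first."""
    result = []
    i = 0
    while i < len(chars):
        node = trie
        longest = 0  # length of the longest complete word starting at i
        j = i
        while j < len(chars) and chars[j] in node:
            node = node[chars[j]]
            j += 1
            if END in node:
                longest = j - i
        length = longest or 1  # unknown character: emit it on its own
        result.append("".join(chars[i:i + length]))
        i += length
    return result


if __name__ == "__main__":
    # A tiny hypothetical word list; the post loads words.txt instead.
    trie = build_trie(["理性", "爱国", "性爱", "体验", "音乐", "音乐之声"])
    for line in ["理性爱国", "性爱体验", "音乐之美"]:
        print(" * ".join(segment(list(line), trie)))
```

<p>Because the sketch records where complete words end, 音乐之美 comes out as 音乐 * 之 * 美 even though the longer entry 音乐之声 shares the prefix 音乐之. The loop is iterative, so long texts do not run into Python's recursion limit.</p>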
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Source Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The complete program (including the word list I processed) is on <a href="http://github.com/zhasm/simpleNLP" title="我爱正则表达式" target="_blank">github</a>. Feel free to play with it if you are interested. Here is the main program (written for Python 2):</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;nlp.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-09-26 19:15</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<br />
regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> init_wordslist<span style="color: black;">&#40;</span>fn<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;./words.txt&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; f<span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span>fn<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; lines<span style="color: #66cc66;">=</span><span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>f.<span style="color: black;">readlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> lines<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> words_2_trie<span style="color: black;">&#40;</span>wordslist<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; d<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> wordslist: <br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>d<br />
&nbsp; &nbsp; &nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>word<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> chars:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span>ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">and</span> ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span><span style="color: #483d8b;">''</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> d<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>trie<br />
&nbsp; &nbsp; index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> chars:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> char<span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> index<span style="color: #66cc66;">==</span><span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> char<span style="color: #66cc66;">,</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'*'</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>chars<span style="color: black;">&#91;</span>index:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">except</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">pass</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">break</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#init</span><br />
&nbsp; &nbsp; words<span style="color: #66cc66;">=</span>init_wordslist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; trie<span style="color: #66cc66;">=</span>words_2_trie<span style="color: black;">&#40;</span>words<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#read content</span><br />
&nbsp; &nbsp; fn<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #dc143c;">string</span><span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span>fn<span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#do the job</span><br />
&nbsp; &nbsp; search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Local Test</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The test text is as follows:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">只听得一个女子低低应了一声。绿竹翁道：“姑姑请看，这部琴谱可有些古怪。”那<br />
女子又嗯了一声，琴音响起，调了调弦，停了一会，似是在将断了的琴弦换去，又调了调<br />
弦，便奏了起来。初时所奏和绿竹翁相同，到后来越转越高，那琴韵竟然履险如夷，举重<br />
若轻，毫不费力的便转了上去。令狐冲又惊又喜，依稀记得便是那天晚上所听到曲洋所奏<br />
的琴韵。这一曲时而慷慨激昂，时而温柔雅致，令狐冲虽不明乐理，但觉这位婆婆所奏，<br />
和曲洋所奏的曲调虽同，意趣却大有差别。这婆婆所奏的曲调平和中正，令人听着只觉音<br />
乐之美，却无曲洋所奏热血如沸的激奋。奏了良久，琴韵渐缓，似乎乐音在不住远去，倒<br />
像奏琴之人走出了数十丈之遥，又走到数里之外，细微几不可再闻。<br />
<br />
理性爱国<br />
性爱体验<br />
我爱正则表达式</div></td></tr></tbody></table></div>
<p>Note the last three lines.</p>
<p>Now look at the program's output (* marks the boundary between words):</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">只 * 听 得 * 一 个 * 女 子 * 低 低 * 应 * 了 * 一 声 * 。 * 绿 * 竹 * 翁 * 道 * ： * “ * 姑 姑 * 请 看 * ， * 这 * 部 * 琴 * 谱 * 可 有 * 些 * 古 怪 * 。 * ” * 那 * 女 子 * 又 * 嗯 * 了 * 一 声 * ， * 琴 * 音 响 * 起 * ， * 调 * 了 * 调 * 弦 * ， * 停 * 了 * 一 会 * ， * 似 是 * 在 * 将 * 断 * 了 * 的 * 琴 弦 * 换 * 去 * ， * 又 * 调 * 了 * 调 * 弦 * ， * 便 * 奏 * 了 * 起 来 * 。 * 初 * 时 * 所 * 奏 * 和 * 绿 * 竹 * 翁 * 相 同 * ， * 到 * 后 来 * 越 * 转 * 越 * 高 * ， * 那 * 琴 * 韵 * 竟 然 * 履 险 如 夷 * ， * 举 重 * 若 * 轻 * ， * 毫 不 费 力 * 的 * 便 * 转 * 了 * 上 去 * 。 * 令 狐 * 冲 * 又 * 惊 * 又 * 喜 * ， * 依 稀 * 记 得 * 便 是 * 那 天 * 晚 上 * 所 * 听 到 * 曲 * 洋 * 所 * 奏 * 的 * 琴 * 韵 * 。 * 这 一 * 曲 * 时 而 * 慷 慨 * 激 昂 * ， * 时 而 * 温 柔 * 雅 致 * ， * 令 狐 * 冲 * 虽 * 不 明 * 乐 理 * ， * 但 * 觉 * 这 位 * 婆 婆 * 所 * 奏 * ， * 和 * 曲 * 洋 * 所 * 奏 * 的 * 曲 调 * 虽 * 同 * ， * 意 趣 * 却 * 大 有 * 差 别 * 。 * 这 * 婆 婆 * 所 * 奏 * 的 * 曲 调 * 平 和 * 中 正 * ， * 令 人 * 听 * 着 * 只 * 觉 * 音 乐 之 * 美 * ， * 却 * 无 * 曲 * 洋 * 所 * 奏 * 热 血 * 如 * 沸 * 的 * 激 * 奋 * 。 * 奏 * 了 * 良 久 * ， * 琴 * 韵 * 渐 * 缓 * ， * 似 乎 * 乐 音 * 在 * 不 住 * 远 * 去 * ， * 倒 像 * 奏 * 琴 * 之 * 人 * 走 出 * 了 * 数 十 * 丈 * 之 * 遥 * ， * 又 * 走 * 到 * 数 * 里 * 之 外 * ， * 细 微 * 几 * 不 可 再 * 闻 * 。 * 理 性 * 爱 国 * 性 爱 * 体 验 * 我 * 爱 * 正 则 * 表 达 式</div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Update</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
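<p><strong>Update, 2010-10-03</strong>: I found a bug in this program. The algorithm has since been improved and is now more accurate and faster. The full program is on GitHub; the link is given earlier in this post.</p>
<p>Here is the segmentation output of the new program:</p>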
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>只 * 听 * 得 * 一 * 个 * 女子 * 低 * 低 * 应 * 了 * 一声 * 。 * 绿 * 竹 * 翁 * 道 * ： * “ * 姑姑 * 请看 * ， * 这 * 部 * 琴谱 * 可 * 有些 * 古怪 * 。 * ” * 那 * 女子 * 又 * 嗯 * 了 * 一声 * ， * 琴 * 音响 * 起 * ， * 调 * 了 * 调 * 弦 * ， * 停 * 了 * 一会 * ， * 似 * 是 * 在 * 将 * 断 * 了 * 的 * 琴弦 * 换 * 去 * ， * 又 * 调 * 了 * 调 * 弦 * ， * 便 * 奏 * 了 * 起来 * 。 * 初 * 时 * 所 * 奏 * 和 * 绿 * 竹 * 翁 * 相同 * ， * 到 * 后来 * 越 * 转 * 越 * 高 * ， * 那 * 琴 * 韵 * 竟然 * 履 * 险 * 如 * 夷 * ， * 举重 * 若 * 轻 * ， * 毫不 * 费力 * 的 * 便 * 转 * 了 * 上去 * 。 * 令狐 * 冲 * 又 * 惊 * 又 * 喜 * ， * 依稀 * 记得 * 便是 * 那天 * 晚上 * 所 * 听到 * 曲 * 洋 * 所 * 奏 * 的 * 琴 * 韵 * 。 * 这 * 一 * 曲 * 时而 * 慷慨 * 激昂 * ， * 时而 * 温柔 * 雅致 * ， * 令狐 * 冲 * 虽 * 不明 * 乐理 * ， * 但 * 觉 * 这位 * 婆婆 * 所 * 奏 * ， * 和 * 曲 * 洋 * 所 * 奏 * 的 * 曲调 * 虽 * 同 * ， * 意趣 * 却 * 大有 * 差别 * 。 * 这 * 婆婆 * 所 * 奏 * 的 * 曲调 * 平和 * 中正 * ， * 令人 * 听 * 着 * 只 * 觉 * 音乐 * 之 * 美 * ， * 却 * 无 * 曲 * 洋 * 所 * 奏 * 热血 * 如 * 沸 * 的 * 激 * 奋 * 。 * 奏 * 了 * 良久 * ， * 琴 * 韵 * 渐 * 缓 * ， * 似乎 * 乐音 * 在 * 不住 * 远 * 去 * ， * 倒像 * 奏 * 琴 * 之 * 人 * 走出 * 了 * 数 * 十 * 丈 * 之 * 遥 * ， * 又 * 走 * 到 * 数 * 里 * 之外 * ， * 细微 * 几 * 不可 * 再 * 闻 * 。 *</p>
<p>理性 * 爱国 *</p>
<p>性爱 * 体验 *</p>
<p>我 * 爱 * 正则 * 表达式 *</p>
<p>轻 * 音乐 *</p>
</blockquote>
</blockquote>
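<p>The listing above calls <code>search_in_trie(chars, trie)</code>, so the segmenter is dictionary-based and backed by a trie. The updated algorithm itself lives on GitHub and is not reproduced here. What follows is only a minimal Python 3 sketch of one common strategy, greedy forward longest match. The word list and the names <code>build_trie</code> and <code>segment</code> are assumptions made for illustration, not the author's code.</p>

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return trie


def segment(text, trie):
    """Greedy forward longest match; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        node, end = trie, i + 1  # fall back to a single character
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if None in node:
                end = j + 1  # longest dictionary word found so far
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical mini-dictionary, for illustration only.
trie = build_trie(["理性", "性爱", "爱国", "体验", "正则", "表达式"])
print(" * ".join(segment("理性爱国", trie)))
print(" * ".join(segment("我爱正则表达式", trie)))
```

<p>With this toy dictionary the sketch splits the three tricky test lines the same way as the updated output above. A real dictionary and a smarter disambiguation rule would behave differently on other input.</p>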
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/simple-nlp-for-chinese.html/feed</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Counting Chinese characters and English words</title>
		<link>http://iregex.org/blog/words-counter-in-python.html</link>
		<comments>http://iregex.org/blog/words-counter-in-python.html#comments</comments>
		<pubDate>Sat, 25 Sep 2010 11:25:38 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=148</guid>
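		<description><![CDATA[A simple program that counts the words and Chinese characters in text documents and lists them in descending order of frequency (most frequent first). Implemented in Python. Approach: use the regex &#34;(?x) (?: [\w-]+ &#160;&#124; [\x80-\xff]{3} )&#34; to extract the English words from a UTF-8 document... ]]></description>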
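			<content:encoded><![CDATA[<p>A simple program that counts the words and Chinese characters in text documents and lists them in descending order of frequency (most frequent first). Implemented in Python.</p>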
<p><span id="more-148"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Approach</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
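<li>Use the regex <code class="codecolorer python default"><span class="python"><span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span></span></code> to extract a list of the English words and Chinese characters in a UTF-8 document. The regex works on the raw bytes: its second alternative matches three consecutive bytes in the range 0x80-0xff, which is how UTF-8 encodes one common Chinese character.
            </li>
<li>Use a dictionary to record how often each word or character occurs: increment the count if the key is already present, otherwise set it to 1.</li>
<li>Sort the dictionary by value and print the result.</li>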
</ul>
</blockquote>
</blockquote>
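<p>The three steps above can also be sketched in Python 3, where strings are Unicode rather than bytes. This is a hedged sketch, not the script from this post. It assumes that "Chinese character" means the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the names <code>TOKEN</code>, <code>count_tokens</code> and <code>report</code> are made up for the example.</p>

```python
import re
from collections import Counter

# Python 3 strings are Unicode, so one CJK character is one code point and
# the byte-oriented [\x80-\xff]{3} alternative is not needed.
# Assumption: "Chinese character" means U+4E00..U+9FFF.
TOKEN = re.compile(r"[A-Za-z0-9_-]+|[\u4e00-\u9fff]")


def count_tokens(texts):
    """Accumulate token frequencies over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text))
    return counts


def report(counts):
    """Return (token, frequency) pairs, most frequent first."""
    return counts.most_common()


sample = ["我爱正则表达式 regex regex", "我爱 Python"]
for token, freq in report(count_tokens(sample)):
    print(token, freq)
```

<p><code>Counter</code> replaces the manual "increment or set to 1" bookkeeping, and <code>most_common()</code> replaces sorting the items by value.</p>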
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Source code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br />48<br />49<br />50<br />51<br />52<br />53<br />54<br />55<br />56<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;counter.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;Mon Sep 20 21:00:52 2010</span><br />
<span style="color: #808080; font-style: italic;">#desc: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; count word and Chinese character frequencies in UTF-8 text files.</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">operator</span> <span style="color: #ff7700;font-weight:bold;">import</span> itemgetter<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> readfile<span style="color: black;">&#40;</span>f<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">with</span> <span style="color: #008000;">file</span><span style="color: black;">&#40;</span>f<span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;r&quot;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">as</span> pFile:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> pFile.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">def</span> divide<span style="color: black;">&#40;</span>c<span style="color: #66cc66;">,</span> regex<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#the regex below is only valid for utf8 coding</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>c<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> update_dict<span style="color: black;">&#40;</span>di<span style="color: #66cc66;">,</span>li<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> li:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> di.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; di<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; di<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> di<br />
&nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#receive files from bash</span><br />
&nbsp; &nbsp; files<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span> <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#regex compile only once</span><br />
&nbsp; &nbsp; regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #008000;">dict</span><span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#get all words from files</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> f <span style="color: #ff7700;font-weight:bold;">in</span> files:<br />
&nbsp; &nbsp; &nbsp; &nbsp; words<span style="color: #66cc66;">=</span>divide<span style="color: black;">&#40;</span>readfile<span style="color: black;">&#40;</span>f<span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> regex<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">dict</span><span style="color: #66cc66;">=</span>update_dict<span style="color: black;">&#40;</span><span style="color: #008000;">dict</span><span style="color: #66cc66;">,</span> words<span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#sort dictionary by value </span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#dict is now a list.</span><br />
&nbsp; &nbsp; <span style="color: #008000;">dict</span><span style="color: #66cc66;">=</span><span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span><span style="color: #008000;">dict</span>.<span style="color: black;">items</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> key<span style="color: #66cc66;">=</span>itemgetter<span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> reverse<span style="color: #66cc66;">=</span><span style="color: #008000;">True</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#output to standard-output</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">dict</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> i<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span> i<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> <br />
<br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Tips</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Because the script reads its arguments with <code class="codecolorer python default"><span class="python">files<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span></span></code>, running <code class="codecolorer bash default"><span class="bash">.<span style="color: #000000; font-weight: bold;">/</span>counter.py file1 file2 ...</span></code> accumulates the word frequencies of all the files named on the command line and prints the combined result. </p>
<p>The program can be customized. For example:</p>
<ul>
<li>Use
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;(?x) ( [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
words<span style="color: #66cc66;">=</span><span style="color: black;">&#91;</span>w <span style="color: #ff7700;font-weight:bold;">for</span> w <span style="color: #ff7700;font-weight:bold;">in</span> regex.<span style="color: black;">split</span><span style="color: black;">&#40;</span>line<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">if</span> w<span style="color: black;">&#93;</span></div></td></tr></tbody></table></div>
<p>The resulting list contains the words together with the separators between them, which is convenient for further processing of the segmented text later.</p>
            </li>
<li>Process the file line by line instead of reading the whole file into memory; this saves memory on large files.</li>
<li>Preprocess the whole file with a regex like this one to strip any HTML tags: <code class="codecolorer python default"><span class="python">content<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;&lt;[^&gt;]+&gt;&quot;</span><span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;&quot;</span><span style="color: #66cc66;">,</span>content<span style="color: black;">&#41;</span></span></code>. For some documents this gives more accurate counts.
            </li>
</ul>
</blockquote>
</blockquote>
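<p>The three tips can be combined in one short Python 3 sketch. It is an illustration under assumptions, not the original script: the token pattern targets Unicode strings (U+4E00 to U+9FFF for Chinese characters), the tag pattern includes the closing bracket, and the helper names <code>split_keep_separators</code> and <code>count_lines</code> are invented here.</p>

```python
import re
from collections import Counter

# Capturing group: re.split then keeps the tokens as well as the separators.
TOKEN = re.compile(r"([A-Za-z0-9_-]+|[\u4e00-\u9fff])")
TAG = re.compile(r"<[^>]+>")  # note the closing bracket


def split_keep_separators(line):
    """Return tokens and the text between them, dropping empty pieces."""
    return [piece for piece in TOKEN.split(line) if piece]


def count_lines(lines):
    """Count tokens one line at a time, so the whole file never sits in memory."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN.findall(TAG.sub("", line)))
    return counts


print(split_keep_separators("我爱 regex!"))
print(count_lines(["<p>regex</p>\n", "<b>我爱</b> regex\n"]))
```

<p>An open file object can be passed straight to <code>count_lines</code>, because iterating over a file yields one line at a time. One caveat of working per line: a tag that spans a line break is not removed by this pattern.</p>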
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/words-counter-in-python.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Building your own regex helper tool</title>
		<link>http://iregex.org/blog/diy-regexbuddy.html</link>
		<comments>http://iregex.org/blog/diy-regexbuddy.html#comments</comments>
		<pubDate>Wed, 12 May 2010 05:32:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[cgi]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[regexbuddy]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=115</guid>
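		<description><![CDATA[RegexBuddy is actually a very good tool, and I have been using it all along. A lot could be written about how to use it and what it does well, and this site has covered it before. Still, there are reasons not to use it, which is also one of the reasons for this post. I put some thought and a little time into it and have already built... ]]></description>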
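			<content:encoded><![CDATA[<p>RegexBuddy is actually a very good tool, and I have been using it all along. A lot could be written about how to use it and what it does well, and this site has covered it before. Still, there are reasons not to use it, which is also one of the reasons for this post. I put some thought and a little time into it and already have a prototype. I am publishing the approach here so we can discuss it.</p>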
<p><span id="more-115"></span></p>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Motivation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Why not keep using RegexBuddy</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
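<li>It is commercial software and not cheap: $39.95. A Google search may turn up a pleasant surprise.</li>
<li>It runs only on Windows. On Ubuntu I do install wine, solely to run RegexBuddy.</li>
<li>RegexBuddy does not run on the Mac. I recently started working on a Mac and do not want to maintain a separate environment just for Windows software, so it looks like RegexBuddy and I are parting ways. I searched <a href="http://search.macupdate.com/search.php?keywords=regex&#038;os=mac" title="我爱正则表达式|打造自己的正则表达式助手程序">here</a> and <a href="http://www.apple.com/search/?q=regex&#038;sec=downloads" title="我爱正则表达式|打造自己的正则表达式助手程序">here</a>; the few tools I found are unimpressive. Most support only a fairly plain regex flavor such as JavaScript's and lack support for multiple languages and options. RegexBuddy's excellent performance has made me extremely picky about regex helper tools, and ordinary software no longer impresses me, heh.</li>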
</ul>
<p>With no ready-made solution, I started thinking about how to build one myself.</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">My ideal regex helper tool</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>Like RegexBuddy, it should support the following:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
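<li>Multiple regex flavors: at least Perl, Python, PHP, and JavaScript. I rarely use .NET (only when answering other people's questions, which does not count), so it can be ignored;</li>
<li>Match, replace, and split;</li>
<li>Code snippet generation. This matters a lot. I do not memorize things a computer can handle for me unless I use them often, and what I use often gradually becomes muscle memory anyway.</li>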
</ol>
</blockquote>
<li>Beyond that, it would ideally also:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
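<li>Run on all common platforms, meaning Win/Lin/Mac.</li>
<li>Support each language natively. Frankly, I suspect RegexBuddy still uses Perl 5.8-style regexes; many of the new and useful features in 5.10 are not yet supported in RegexBuddy. The likely reason is that RegexBuddy's author built the Perl and other regex engines from scratch, so they differ from the latest releases in details and version coverage. Speaking of languages, this reminds me of a point made by Yu Sheng (余晟): when thinking about a regex problem, do not start from a particular language or version; keep a unified syntax in mind. I agree, but I also think that once you know the seemingly universal basics, you should understand clearly the syntax details of the flavor you use most and how it differs from the others, to avoid assumptions that look right but are wrong. I digress; enough of that.</li>
<li>Be open source, legitimately licensed, and free. When we introduce regexes to other people, we need a presentable tool to recommend. Being free is not a hard requirement, though; good software deserves to be rewarded.</li>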
</ol>
</blockquote>
</ol>
<p>The question is where to find such a tool. And if it cannot be found, how should I go about implementing one from scratch? </p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">How my thinking evolved</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
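<li>Implement it in Objective-C. This idea was dropped before long. Obj-C is worth learning, but I cannot wait: I use tools like RegexBuddy every day, and learning a language for this would be digging a well only when already thirsty. Even after developing for the Mac, the code would at least have to be compiled separately for Win/Lin. And with Obj-C, what about the regex engine? Implement it from scratch? According to xiaofei, building a usable regex engine takes an excellent team half a year. Of course, Obj-C can also call existing modules, which led to my current approach.</li>
<li>Build it as a web application: the front end takes the user's input, and the back end uses CGI to call the native regex engines on the server (perl, python), then shows the result of matching or replacing in the front end. Its biggest advantage is that every language is 100% native. With a network connection it works from any browser; without one it still works on localhost, and faster. JavaScript and PHP do not need CGI at all, since they can run in their own native environments.</li>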
</ul>
<p>                             I chose the second option and set about implementing it.
                        </p></blockquote>
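<p>The core of the second option is small: the back end receives a pattern, a text and an action, runs them through the language's own regex engine, and returns the result. The post plans a Python back end alongside the Perl one, so here is a minimal Python 3 sketch of what that core might look like. The function names <code>highlight</code> and <code>run</code> and their parameters are assumptions for illustration; the CGI/HTTP wiring is deliberately left out.</p>

```python
import html
import re
from itertools import cycle


def highlight(pattern, text, flags=0):
    """Wrap each match in a span, alternating two background colours."""
    colours = cycle(["#ff0", "#0ff"])
    out, pos = [], 0
    for m in re.finditer(pattern, text, flags):
        out.append(html.escape(text[pos:m.start()]))
        out.append("<span style='background-color:%s'>%s</span>"
                   % (next(colours), html.escape(m.group(0))))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out).replace("\n", "<br />")


def run(action, pattern, text, replacement="", flags=0):
    """Dispatch one request; 'action' is match, replace or split."""
    if action == "match":
        return highlight(pattern, text, flags)
    if action == "replace":
        return re.sub(pattern, replacement, text, flags=flags)
    if action == "split":
        return re.split(pattern, text, flags=flags)
    raise ValueError("unknown action: %r" % action)


print(run("match", r"\d+", "a1b22"))
print(run("replace", r"\d+", "a1b22", "#"))
print(run("split", r"\d+", "a1b22c"))
```

<p>Calling the <code>re</code> functions directly avoids assembling source code from user input and passing it to <code>eval</code>, and escaping the unmatched text keeps user-supplied markup from being rendered by the browser. An invalid pattern raises <code>re.error</code>, which a real handler would catch and report back to the front end.</p>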
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Current progress</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
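<li>A simple interface has been built with HTML + jQuery, and a CGI program for perl 5.10 is implemented; it can match, replace, and split.</li>
<li>Not yet implemented: automatic generation of code snippets; versions for other languages.</li>
<li>For my own purposes it is basically usable. I am eating my own dog food right now, improving it as I use it. Releasing it for everyone, however, will take a long stretch of further feature work and interface polish.</li>
<li>Screenshots are at the end of the post.<br/></li>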
</ul>
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Perl CGI code and brief notes</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br />48<br />49<br />50<br />51<br />52<br />53<br />54<br />55<br />56<br />57<br />58<br />59<br />60<br />61<br />62<br />63<br />64<br />65<br />66<br />67<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #000000; font-weight: bold;">use</span> CGI<span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> cl <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#the last expression is the return value, so pick the colour in a single expression</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$counter</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">?</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> h_color<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$color</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$counter</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">?</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&lt;span style='background-color:$color'&gt;&quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$a</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;&lt;/span&gt;&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">=</span>CGI<span style="color: #339933;">-&gt;</span><span style="color: #000000; font-weight: bold;">new</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">die</span> <span style="color: #ff0000;">&quot;$!&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">header</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">-</span>type<span style="color: #339933;">=&gt;</span><span style="color: #ff0000;">&quot;text/html; charset=UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;regex&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;">#quit immediately if no $regex input</span><br />
<span style="color: #000066;">die</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;text&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$mode</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;mode&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;space&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$action</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;action&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\s+//g</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;match&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'$text =~ s@$regex'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@&amp;h_color($&amp;)'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@eg'</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$mode</span><span style="color: #339933;">.</span><span style="color: #ff0000;">';'</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br /&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$text</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$replace</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\$</span>text =~ s:<span style="color: #000099; font-weight: bold;">\$</span>regex:$replace:g;&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #ff0000;">&quot;$code&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br/&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;&lt;pre&gt;$text&lt;/pre&gt;&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;split&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#@result=split(m@$regex@mode, $text);</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=split(m@$regex@'</span><span style="color: #339933;">.</span> <span style="color: #0000ff;">$mode</span> <span style="color: #339933;">.</span> <span style="color: #ff0000;">', $text);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=grep /\S/, @result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'my $count=@result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;font color=\&quot;#ff008c\&quot;&gt;$count&lt;/font&gt; record(s) returned:&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;ol&gt;&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;li&gt;&quot;.&amp;h_color($_).&quot;&lt;/li&gt;&quot; foreach (@result);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;/ol&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</blockquote>
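<p>The code is fairly concise. It receives the parameters and processes them lightly, generates Perl code dynamically for the requested action (match/replace/split), runs that code with eval, and prints the result. The match and split branches also add a small highlighting feature. Perl is compact enough that the implementation stays short. Thanks to <a href="http://twitter.com/cnhacktnt">cnhacktnt</a> for the help.</p>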
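<p>As a rough illustration of the generate-then-eval flow, here is a minimal sketch. The helper name <code>run_action</code>, its argument order, and its return convention are assumptions made for this example, not the code of the actual tool. Unlike the listing above, the replacement text is used as a plain string rather than pasted into the generated code, and the mode flags are checked before being appended.</p>

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original tool): build a snippet of
# Perl for the requested action, run it with string eval, and return
# (\@result, undef) on success or (undef, $error) on failure.
sub run_action {
    my ($action, $regex, $mode, $text, $replace) = @_;

    # $mode is appended to the generated code, so accept only regex flags.
    return (undef, "bad mode") unless $mode =~ /\A[imsx]*\z/;

    my @result;
    my $code;
    if ($action eq 'match') {
        # In list context, m//g returns every match.
        $code = '@result = $text =~ m{$regex}g' . $mode . ';';
    }
    elsif ($action eq 'replace') {
        # $replace is interpolated as a plain string at run time.
        $code = '$text =~ s{$regex}{$replace}g' . $mode . '; @result = ($text);';
    }
    elsif ($action eq 'split') {
        # Drop pieces that contain only whitespace, as the listing does.
        $code = '@result = grep { /\S/ } split(m{$regex}' . $mode . ', $text);';
    }
    else {
        return (undef, "unknown action");
    }

    # String eval sees the lexical variables declared above; an invalid
    # regex dies inside the eval and leaves its message in $@.
    eval $code;
    return (undef, $@) if $@;
    return (\@result, undef);
}

my ($matches, $error) = run_action('match', '\d+', '', 'a1 b22 c333');
print defined $matches ? "@$matches\n" : "error: $error\n";
```

<p>Checking <code>$@</code> after the eval is what lets a malformed user regex produce an error message instead of terminating the script.</p>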
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Screenshots</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match_cn.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/replace.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式|打造自己的正则表达式助手程序"><img src="http://i293.photobucket.com/albums/mm60/zhasm/split_cn.png" border="0" alt="Photobucket"></a></li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/diy-regexbuddy.html/feed</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>better downloader in perl</title>
		<link>http://iregex.org/blog/better-downloader-in-perl.html</link>
		<comments>http://iregex.org/blog/better-downloader-in-perl.html#comments</comments>
		<pubDate>Sun, 14 Mar 2010 16:08:12 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[perl curl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=80</guid>
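		<description><![CDATA[Over the past couple of days I wrote a Perl program that takes a URL, downloads the file, computes its MD5, and checks whether the file is a known virus; if there is no record, it calls another Perl script to upload the file to a website for detailed testing. What I want to say is that these tasks might... ]]></description>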
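			<content:encoded><![CDATA[<p>Over the past couple of days I wrote a Perl program: given a URL, it downloads the file, computes its MD5, and checks whether the file is a known virus; if there is no record, it calls another Perl script to upload the file to a website for detailed testing. Admittedly, these tasks might be more direct to write in bash, and using Perl to do bash's job feels like overstepping. Still, I am very interested in Perl right now and use it whenever I get the chance. Besides, posting immature code gives my future self a chance to look down on my present self :)</p>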
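<p>One small discovery: <a href="http://perldoc.perl.org/perlop.html#Quote-Like-Operators">qx</a> runs the enclosed text as a shell command and returns its output. The book also says that <a href="http://perldoc.perl.org/functions/eval.html">eval</a> is extremely useful; I did not need it this time, but I will try it out at the next opportunity.</p>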
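<p>A minimal sketch of how qx behaves, assuming a POSIX shell with <code>echo</code> available; the variable names are made up for this example.</p>

```perl
use strict;
use warnings;

# In scalar context, qx{} returns the command's standard output as one string.
my $greeting = qx{echo hello};
chomp $greeting;

# In list context, it returns one element per output line.
my @lines = qx{echo one; echo two};
chomp @lines;

# $? holds the wait status of the last command; shift by 8 for the exit code.
my $exit_code = $? >> 8;

print "$greeting: @lines (exit $exit_code)\n";
```

<p>The list-context form is what makes a loop such as <code>foreach (qx{...})</code> iterate over the output line by line.</p>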
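<p>I used shell commands rather than modules for curl, MD5, and so on because those modules were not installed on the virtual machine I was using, whereas the corresponding executables are standard under bash. This costs a little efficiency, but it keeps the script portable, with nothing extra to install.</p>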
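<p>For comparison, where the Digest::MD5 module is available (it ships with standard Perl distributions), the checksum step could be done without a shell call. This is only a sketch: the helper name <code>md5_of_file</code> is hypothetical and is not used by the script below.</p>

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the hex MD5 digest of a file,
# or nothing (undef in scalar context) if the file cannot be opened.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;
    binmode($fh);    # hash the raw bytes, with no newline translation
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $hex;
}

# Usage: print the digest of this script itself.
my $digest = md5_of_file($0);
print "$digest\n" if defined $digest;
```

<p>This avoids passing the filename through the shell, at the price of depending on a module being installed.</p>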
<p>Enough talk; here is the code.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #666666; font-style: italic;">#this script offers a better download service</span><br />
<span style="color: #666666; font-style: italic;">#integrating virustotal info and send-sample.pl</span><br />
<span style="color: #666666; font-style: italic;">#by rex.zhang </span><br />
<span style="color: #666666; font-style: italic;">#on 03-11-2010 in Shanghai</span><br />
<span style="color: #666666; font-style: italic;">#updated @003122010;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$ARGV</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#it is assumed that the filename is the last part of the url, </span><br />
<span style="color: #666666; font-style: italic;">#just after the last / and before the $</span><br />
main<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> main<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#print help message and quit if no url input;</span><br />
&nbsp; &nbsp; help<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$md5</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$size</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$name</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$url</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/\/([^\/]+)$/</span><span style="color: #339933;">;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get the filename from the url; </span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$name</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get filesize from the url; if no size got, quit.</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$size</span><span style="color: #339933;">=</span>get_filesize<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">exit</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$size</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#try 5 times at most to get the file.</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$try</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">5</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$ok</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$try</span><span style="color: #339933;">--</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$ok</span><span style="color: #339933;">=</span>download_file<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">last</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$ok</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$ok</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;can not download the file, quit!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get the md5 locally</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$md5</span><span style="color: #339933;">=</span>get_md5<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$name</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#and the url link from $virustotal;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span>get_vt_link<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#and even the virus infor;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$info</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span>get_vt_info<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$info</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; v_test<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$name</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>Sample has been sent to vtest. <span style="color: #000099; font-weight: bold;">\n</span>thanks for using.<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#return the md5 value of the file.</span><br />
<span style="color: #666666; font-style: italic;">#the filename is in the current directory</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_md5<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$filename</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$md5</span><span style="color: #339933;">=</span><span style="color: #ff0000;">`md5sum $filename`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$md5</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/^(\w{32})/</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>the md5 of the $filename is:<span style="color: #000099; font-weight: bold;">\t</span> $1.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#get the virustotal link with the given md5 value;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_vt_link<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #ff0000;">`curl -s -e &quot;https://www.virustotal.com&quot; -d &quot;x=80&amp;y=23&amp;hash=$md5&quot; &quot;http://www.virustotal.com/vt/en/consultamd5&quot; | grep href`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$bool</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span> <span style="color: #0000ff;">$link</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/href=&quot;([^&quot;]+)&quot;/i</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$bool</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;http://www.virustotal.com&quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#get the virus info from a virustotal report link</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_vt_info<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$sophos</span> <span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#if sophos has no detection, do the vtest.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">qx</span><span style="color: #009900;">&#123;</span>curl <span style="color: #339933;">-</span><span style="color: #000066;">s</span> <span style="color: #0000ff;">$url</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$line</span><span style="color: #339933;">.=</span><span style="color: #0000ff;">$_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #339933;">!&lt;</span><span style="color: #000066;">tr</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&gt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">*&gt;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009999;">&lt;td&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>Sophos<span style="color: #339933;">|</span>Symantec<span style="color: #339933;">|</span>TrendMicro<span style="color: #339933;">|</span>McAfee<span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;/</span>td<span style="color: #339933;">&gt;.*?&lt;/</span>tr<span style="color: #339933;">&gt;!</span>sig<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>No record in VirusTotal. Sending v-test...<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>virustotal record found as following:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$tmp</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$tmp</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(&lt;[^&gt;]+&gt;\s*)+/\t/sig</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$tmp</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$tmp</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">m/(?:Sophos[.\s0-9]+)(?!-)(\S+)\s*$/i</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>Sophos has detection as $sophos, no v-test needed.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;you can read the virus details here:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>$url<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#get the filesize by using the curl -I option.</span><br />
<span style="color: #666666; font-style: italic;">#print the filesize if it is greater than a given value, 2MB by default.</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_filesize<span style="color: #009900;">&#123;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$size</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$unit</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">qx</span><span style="color: #009900;">&#123;</span>curl <span style="color: #339933;">-</span><span style="color: #000066;">s</span> <span style="color: #339933;">-</span>I <span style="color: #0000ff;">$url</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/Content-Length:\s(\d+)/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$size</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/Accept-Ranges:\s(\w+)/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$unit</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$size</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;can not get the length of the file, exit!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$size</span><span style="color: #339933;">=</span><span style="color: #000066;">int</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$size</span> <span style="color: #339933;">/</span> <span style="color: #cc66cc;">1024</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$flag</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#if flag=1 it is too large ; </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$size</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">1024</span> <span style="color: #339933;">*</span> <span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$flag</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;the file is ${size}KB and greater than 2MB!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;it is too large to be a virus. exit.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>The file size is $size kb, downloading...<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$flag</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#simply download the file</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> download_file<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$url</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/([^\/]+)$/</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$filename</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #ff0000;">`curl -O $url`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #339933;">-</span>e <span style="color: #0000ff;">$filename</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;no filename, retrying...<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>file is downloaded and saved as $filename.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#print the help message and exit if no url input</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> help<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;Usage: ./dl http://.../file.exe<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>the last part of the url is regarded as filename.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#send the v-test email, with md5 value in the subject.</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> v_test<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$file</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">`zip $file.zip $file`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">`send-sample.pl -a $file.zip -s &quot;$file.zip URL MD5 $md5&quot;`</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/better-downloader-in-perl.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Rethinking How to Extract Regular Expressions from Plain Text</title>
		<link>http://iregex.org/blog/text-2-regular-expressions-again.html</link>
		<comments>http://iregex.org/blog/text-2-regular-expressions-again.html#comments</comments>
		<pubDate>Mon, 08 Mar 2010 18:32:19 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[array]]></category>
		<category><![CDATA[engine]]></category>
		<category><![CDATA[hash]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[recursive]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=79</guid>
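		<description><![CDATA[rex's note: Since finishing the previous post, I have kept thinking about how to actually implement the induction of a regular expression from plain text. I took many detours and learned a good deal along the way, for example the complex data structures, anonymous hashes and arrays, and references from the Perl "black panther book"... ]]></description>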
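			<content:encoded><![CDATA[<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">rex's note:</h3>
<p>Since finishing <a id="vt:b" title="A Personal Use Case: From Literal Strings to Regexes" href="http://iregex.org/blog/literal-text-to-regex.html"><span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">the previous post</span></span></a><span class="Apple-style-span" style="font-style: normal;">, I have kept thinking about how to actually implement the induction of regular expressions from plain text. I took many detours and learned a good deal along the way. The complex data structures, anonymous hashes and arrays, and references from the Perl "black panther book", the construction of state machines from the "purple dragon book" on compilers, and the graph theory from data structures courses all proved useful. I also learned to use </span><a id="mk58" title="graphviz" href="http://www.graphviz.org/"><span class="Apple-style-span" style="font-style: normal;">graphviz</span></a><span class="Apple-style-span" style="font-style: normal;">. It used to seem mysterious, but it turned out to be quite intuitive once I tried it. The figures in this post were drawn with </span><a id="a1la" title="an online version of graphviz" href="http://graph.gafol.net/create"><span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">an online version of graphviz</span></span></a><span class="Apple-style-span" style="font-style: normal;">.</span></p>
<p>Besides the graph-based method described in this post, I also implemented a second, much simpler approach based on keywords. Read the text line by line and split each line on \s+ or similar into an array; intersect all the arrays (using a hash) to find the keywords that occur in every line; then build, from left to right, a regex of the form ^(.*?)keyword1(.*?)keyword2&#8230;keywordN(.*?)$ and match it against each line, which yields a hash of hashes (or an array of arrays) of captures. Transpose that table and emit each column as an alternation to get a regex such as ^(option1|option2&#8230;)keyword1(option&#8230;)&#8230;$. As a final check, match the generated regex against every line.</p>
<p>Once this word-level pass is done, split the text into individual characters and recursively process <span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">the (option1|option2&#8230;) parts. Working at the word level first and the character level second finds the largest repeated chunks first; the coarse pass and the fine pass follow the same idea and differ only in granularity.</span></span></p>
<p><span class="Apple-style-span" style="color: #474747;"> Beginners, please read this with caution; experts, please share your advice. My approach is not necessarily correct or optimal.</span></p></blockquote>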
<p><span id="more-79"></span></p>
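<p>The keyword-based approach described in the note can be sketched roughly as follows. This is a minimal illustration, not the author's script: the names <strong>esc</strong>, <strong>group</strong> and <strong>keyword_regex</strong> are hypothetical, it works on whole words instead of the ^(.*?)keyword(.*?)$ matching step, and it assumes every keyword occurs once per line and in the same order in all lines.</p>

```perl
use strict;
use warnings;

# Escape regex metacharacters, leaving letters, digits and spaces readable.
sub esc {
    my ($s) = @_;
    $s =~ s/([^\w\s])/\\$1/g;
    return $s;
}

# Turn the alternatives collected for one slot into "(x|y)" or "(x|y)?".
sub group {
    my (%seen, @uniq);
    for my $alt (@_) { push @uniq, $alt unless $seen{$alt}++ }
    my @nonempty = grep { $_ ne '' } @uniq;
    return '' unless @nonempty;
    my $optional = @nonempty != @uniq;    # some line had nothing in this slot
    return '(' . join('|', map { esc($_) } @nonempty) . ')' . ($optional ? '?' : '');
}

sub keyword_regex {
    my @lines = @_;
    my @rows  = map { [ split ' ', $_ ] } @lines;

    # Keywords: the words of the first line that occur in every line.
    my %count;
    for my $row (@rows) {
        my %in_row = map { $_ => 1 } @$row;
        $count{$_}++ for keys %in_row;
    }
    my @keywords = grep { $count{$_} == @rows } @{ $rows[0] || [] };
    return '^' . group(@lines) . '$' unless @keywords;

    # $slot[$i] holds, for each line, the words that precede keyword $i;
    # the final slot holds whatever follows the last keyword.
    my @slot;
    for my $row (@rows) {
        my $i = 0;
        my @chunk;
        for my $word (@$row) {
            if ($i != @keywords and $word eq $keywords[$i]) {
                push @{ $slot[$i++] }, [@chunk];
                @chunk = ();
            }
            else {
                push @chunk, $word;
            }
        }
        push @{ $slot[$i] }, [@chunk];
    }

    my $last = @keywords;    # index of the slot after the final keyword
    my $re   = '^';
    for my $i (0 .. $last) {
        my $is_tail = $i == $last;
        my @alts;
        for my $chunk (@{ $slot[$i] || [] }) {
            # Words before a keyword carry a trailing space, words after the
            # last keyword a leading one, so the pieces concatenate cleanly.
            push @alts, join '', map { $is_tail ? " $_" : "$_ " } @$chunk;
        }
        $re .= ' ' if $i > 0 and not $is_tail;
        $re .= group(@alts);
        $re .= esc($keywords[$i]) unless $is_tail;
    }
    return $re . '$';
}

my @lines = ('this is a red fox', 'this is a blue firefox', 'this is a pig', 'a red fox');
my $re = keyword_regex(@lines);
print "$re\n";
# Final check, as in the note: the result must match every input line.
print "verified: every line matches\n" unless grep { $_ !~ /$re/ } @lines;
```

<p>For the four sample lines used later in this post, the only word shared by every line is "a", so the sketch prints ^(this is )?a( red fox| blue firefox| pig)$. The separating space ends up inside the last group, a cosmetic difference from the hand-written target regex.</p>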
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">The Problem</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>Given a text file text.txt with the following content:</div>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div>this is a red fox</div>
<div>this is a blue firefox</div>
<div>this is a pig</div>
<div>a red fox</div>
</blockquote>
<div>Write a program that automatically builds a (reasonably sensible) regular expression from the text, such that it matches <strong>every line</strong> of the file.</div>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">The Target Regex</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Two extreme solutions are unacceptable:</p>
<ol>
<li><span class="Apple-style-span" style="color: #ff00ff;">^.*$</span></li>
<li><span class="Apple-style-span" style="color: #ff00ff;">^(this is a red fox|this is a blue firefox|this is a pig|a red fox)$</span></li>
</ol>
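<p>The first is too broad and the second is too narrow. Too broad, and it lets everything through, matching any text at all; too narrow, and it is rigid and inflexible. A good regular expression is drawn from the sample text (it captures the pattern in the samples) yet goes beyond it (it also matches other text that follows the same pattern). What to match and what to reject both follow definite rules; as the saying goes, "a gentleman knows what to do and what not to do" (I seem to be digressing :)).</p>
<div>So what does a reasonable regular expression look like? For the example above, it could be:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">^(this is )?a (red fox|blue firefox|pig)$</span></div>
<div>Let us now work toward this target answer.</div>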
</blockquote>
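<p>The difference between the three candidates can be checked with a few lines of Perl. This is only an illustrative sketch; the <strong>count_matches</strong> helper and the "unseen" test lines are assumptions added here, not part of the original problem.</p>

```perl
use strict;
use warnings;

my @sample = (
    'this is a red fox',
    'this is a blue firefox',
    'this is a pig',
    'a red fox',
);
# Lines that are not in the sample: two follow the same pattern, one does not.
my @unseen = ('a pig', 'a blue firefox', 'hello world');

my %candidate = (
    too_broad  => qr/^.*$/,
    too_narrow => qr/^(this is a red fox|this is a blue firefox|this is a pig|a red fox)$/,
    balanced   => qr/^(this is )?a (red fox|blue firefox|pig)$/,
);

# Count how many of the given lines a regex matches.
sub count_matches {
    my ($re, @lines) = @_;
    return scalar grep { $_ =~ $re } @lines;
}

for my $name (sort keys %candidate) {
    printf "%-10s sample: %d/%d  unseen: %d/%d\n",
        $name,
        count_matches($candidate{$name}, @sample), scalar @sample,
        count_matches($candidate{$name}, @unseen), scalar @unseen;
}
```

<p>All three candidates match the four sample lines. On the unseen lines, the too-broad regex accepts all three (including "hello world"), the too-narrow one accepts none, and the balanced one accepts exactly the two that follow the pattern.</p>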
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Approach</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>Any circuit diagram, however complex, can be broken down into three simple relationships: series, parallel, and short circuit. The same holds for regular expressions.</div>
<p>Since a single regex must match all of the text, that regex (call it <span class="Apple-style-span" style="color: #ff00ff;">$re</span>) must also match the first line.</p>
<p>The first line is this is a red fox, so what <span class="Apple-style-span" style="color: #ff00ff;">^this is a red fox$</span> matches should be a (proper) subset of what <span class="Apple-style-span" style="color: #ff00ff;">$re</span> matches. Its path is <span class="Apple-style-span" style="color: #ff00ff;">"^"-&gt;this-&gt;is-&gt;a-&gt;red-&gt;fox-&gt;"$"</span>. All of the nodes are in series, simply laid out from left to right.</p>
<p>The diagram is shown below (click to view it at full size; the same applies to the later figures):</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001329.png" target="_blank"><img style="margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001329.png" border="0" alt="Photobucket" /></a></p>
<p>Likewise, the second line must also be a subset of <span class="Apple-style-span" style="color: #ff00ff;">$re</span>. Because the path <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span> already exists, a branch appears at a: <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;blue-&gt;firefox-&gt;$</span>.</p>
<p>Adding this path to the diagram gives:</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001747.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001747.png" border="0" alt="Photobucket" /></a></p>
<p>Clearly, these two parallel branches start at a and end at $, so they can be joined with |.</p>
<p>Let us summarize the rule:</p>
<div><span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;"><strong><span class="Apple-style-span"><span class="Apple-style-span" style="background-color: #6fa8dc;">Parallel</span></span></strong>: if both A-&gt;B-&gt;C and A-&gt;D-&gt;C exist, then B and D are in parallel. That is, the branches share the same start point and the same end point, and each branch has at least one node between the two. Parallel branches are written inside parentheses and separated by |. For example, <span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;">A-&gt;B-&gt;C and A-&gt;D-&gt;C can be expressed as the regex A(B|D)C.</span></span></span></span></div>
<div><span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;">Why insist on at least one node? I will keep you in suspense for now; please read on.</span></span></div>
<p>Next comes this is a pig. In the same way, we only need to add the branch <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;pig-&gt;$</span> to the existing diagram, which now looks like this:</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309002851.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309002851.png" border="0" alt="Photobucket" /></a></p>
<div>
The last line is a red fox. It looks complicated, but all it does is add a new direct path <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;a</span>; the existing path <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;red-&gt;fox-&gt;$</span> can be reused. The complete diagram is now:</div>
<p><a style="color: #ed1e24; margin-left: 0px; margin-right: 0px; text-decoration: none;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" border="0" alt="Photobucket" /></a></p>
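<p>Looking at the diagram, a new situation has appeared: the two paths <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;a</span> and <span class="Apple-style-span" style="color: #ff00ff;">"^"-&gt;this-&gt;is-&gt;a</span> exist at the same time. Thinking back to the circuit diagrams of middle-school physics, we can call this a "short circuit": between the nodes ^ and a of the line <span class="Apple-style-span" style="color: #ff00ff;">"^"-&gt;this-&gt;is-&gt;a</span>, an unobstructed shortcut has been added that bypasses this and is, which makes the path <span class="Apple-style-span" style="color: #ff00ff;">this-&gt;is</span> <strong>optional</strong>. Let us summarize another rule:</p>
<p>If there is a path A-&gt;B-&gt;&#8230;-&gt;C-&gt;D and also a path A-&gt;D, we say that there is a short circuit between A and D. In that case B-&gt;&#8230;-&gt;C can be written as (B-&gt;&#8230;-&gt;C)? (the parentheses enclose the bypassed part, and the question mark marks it as optional).</p>
<div>Between vertices A and D there can be at most one short circuit, but there may be one or more parallel branches.</div>
<div>That completes the analysis, and we obtain this regex:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">^(this is )?a (red fox|blue firefox|pig)$</span></div>
<p>This is why the parallel rule above insists on at least one node: a branch with no intermediate node is a short circuit, not a parallel alternative.</p>
<div>
<p>To refine the result further, we can <strong><span class="Apple-style-span" style="color: #ff00ff;">recursively</span></strong> apply the same analysis to the <span class="Apple-style-span" style="color: #ff00ff;">red fox|blue firefox|pig</span> part and obtain<span class="Apple-style-span" style="color: #ff00ff;"> (red |blue fire)fox|pig</span>.</p></div></blockquote>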
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Implementation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>With the approach settled, the coding is straightforward. In Perl, one can certainly represent the links between nodes with a fairly compact nested hash:</div>
<div>For example:</div>
<div>my $hash;</div>
<div style="margin-left: 0px; margin-right: 0px;">$$hash{"^"}{"this"}{"is"}{"a"}{"red"}{"fox"}{"\$"}="";</div>
<div>$$hash{"^"}{"this"}{"is"}{"a"}{"blue"}{"firefox"}{"\$"}="";</div>
<p>&#8230;</p>
<div>However, adding, deleting, and modifying nodes is all a hassle. (I was lost in the hash maze for a long time before I found my way out.)</div>
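<p>As a rough sketch of the nested-hash idea, the keys can be filled in by a loop and the hash walked recursively to emit a regex. The function names <strong>add_line</strong> and <strong>trie_to_regex</strong> are hypothetical, and the sketch assumes that no input word is literally "$", since that key is used as the end-of-line marker.</p>

```perl
use strict;
use warnings;

# Insert one line into the nested hash, one level per word.
sub add_line {
    my ($trie, $line) = @_;
    my $node = $trie->{'^'} ||= {};
    $node = $node->{$_} ||= {} for split ' ', $line;
    $node->{'$'} = '';    # end-of-line marker, as in the hand-written keys
    return;
}

# Walk the nested hash and emit a regex fragment for everything below $node.
# $sep is the text placed before each word ('' at the top, ' ' further down).
sub trie_to_regex {
    my ($node, $sep) = @_;
    my (@alts, $can_end);
    for my $word (sort keys %$node) {
        if ($word eq '$') { $can_end = 1; next }
        push @alts, $sep . quotemeta($word) . trie_to_regex($node->{$word}, ' ');
    }
    return '' unless @alts;
    return $alts[0] if @alts == 1 and not $can_end;
    return '(?:' . join('|', @alts) . ')' . ($can_end ? '?' : '');
}

my $hash = {};
add_line($hash, $_)
    for 'this is a red fox', 'this is a blue firefox', 'this is a pig', 'a red fox';

my $re = '^' . trie_to_regex($hash->{'^'}, '') . '$';
print "$re\n";
```

<p>For the sample text this prints ^(?:a red fox|this is a(?: blue firefox| pig| red fox))$. The nested hash only merges common prefixes, so the shared tail "a red fox" appears twice; that duplication is one more reason to look for a different representation.</p>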
<div>I took some time to brush up on <strong>directed graphs</strong>, and I think the problem can be simplified as follows.</div>
<p><a style="color: #ed1e24; margin-left: 0px; margin-right: 0px; text-decoration: none;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" target="_blank"><img style="margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" border="0" alt="Photobucket" /></a></p>
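<p>The figure above is in fact a directed graph. We only need to record the set of vertices and the set of edges, then work out the relationships among the paths; printing the result gives the answer.</p>
<div>The vertex set is:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">(^, this, is, a, red, fox, blue, firefox, pig, $);</span></div>
<div>The edge set is:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">(^-&gt;this, this-&gt;is,&#8230;)</span></div>
<p>Both sets can be built in a single pass while reading the lines of the text file, which is not complicated. The key is determining the relationships.</p>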
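<p>Building the two sets in one pass might look like the sketch below. The function name <strong>build_graph</strong> is hypothetical, and the sketch assumes that each distinct word maps to a single vertex, as in the figure.</p>

```perl
use strict;
use warnings;

# Collect the vertex list and the edge list in order of first appearance.
sub build_graph {
    my (@words, %seen_word, @edges, %seen_edge);
    for my $line (@_) {
        my @path = ('^', split(' ', $line), '$');
        for my $i (1 .. $#path) {
            my ($from, $to) = @path[ $i - 1, $i ];
            push @edges, [ $from, $to ] unless $seen_edge{"$from $to"}++;
            push @words, $to unless $to eq '$' or $seen_word{$to}++;
        }
    }
    return ([ '^', @words, '$' ], \@edges);
}

my ($vertices, $edges) = build_graph(
    'this is a red fox',
    'this is a blue firefox',
    'this is a pig',
    'a red fox',
);
print 'vertices: (', join(', ', @$vertices), ")\n";
print 'edges:    (', join(', ', map { join '->', @$_ } @$edges), ")\n";
```

<p>For the sample text this yields the ten vertices listed above and twelve edges; the last line contributes only one new edge, the one from ^ to a.</p>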
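<div>To summarize once more:</div>
<ul>
<li>The N branches that leave a vertex A must merge again at some point M (sometimes all at the same vertex, sometimes not; the example in this post is the simplest case, so here we can assume that they merge at the same vertex).</li>
<li>The length of each of these N branches is the number of nodes it passes through. In the figure above, for example, there are two paths from ^ to a: the upper one has length 2 and the lower one has length 0.</li>
<li>The shortest branch determines the relationship among the N branches.</li>
<li>Between any two vertices there can be at most one edge of length 0.</li>
<li>If an edge of length 0 exists, the other branches at the same level are short-circuited.</li>
<li>The N-1 branches of nonzero length are in parallel with one another.</li>
<li>The whole graph starts at ^ and ends at $.</li>
</ul>
<div>Each of these conditions and checks can be turned into a function. The actual program is omitted here.</div>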
</blockquote>
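<p>Since the actual program is omitted, here is one possible sketch of the rules above for the simplest case. It is not the author's implementation. All function names are hypothetical, and it assumes the graph is acyclic (no word repeats within a line) and that the branches between two "checkpoint" vertices (vertices that lie on every path from ^ to $) can be enumerated directly.</p>

```perl
use strict;
use warnings;

# Adjacency list: $next->{$from} lists the successors of $from.
sub build_graph {
    my (%next, %seen);
    for my $line (@_) {
        my @path = ('^', split(' ', $line), '$');
        for my $i (1 .. $#path) {
            my ($from, $to) = @path[ $i - 1, $i ];
            push @{ $next{$from} }, $to unless $seen{"$from $to"}++;
        }
    }
    return \%next;
}

# Every path from $from to $to, each as the list of vertices strictly between.
sub paths_between {
    my ($next, $from, $to) = @_;
    my @found;
    for my $v (@{ $next->{$from} || [] }) {
        if ($v eq $to) { push @found, []; next }    # a branch of length 0
        push @found, map { [ $v, @$_ ] } paths_between($next, $v, $to);
    }
    return @found;
}

# Vertices that lie on every path from ^ to $: the points where branches merge.
sub checkpoints {
    my ($next) = @_;
    my @paths = paths_between($next, '^', '$');
    my %hits;
    for my $path (@paths) { $hits{$_}++ for @$path }
    return ('^', (grep { $hits{$_} == @paths } @{ $paths[0] || [] }), '$');
}

sub branch_text { return join ' ', map { quotemeta($_) } @_ }

# Regex for everything between two neighbouring checkpoints.
sub segment_regex {
    my ($next, $left, $right) = @_;
    my @branches = paths_between($next, $left, $right);
    my @texts    = map { branch_text(@$_) } grep { @$_ } @branches;
    my $pre  = $left  eq '^' ? '' : ' ';
    my $post = $right eq '$' ? '' : ' ';
    my $sep  = ($pre ne '' and $post ne '') ? ' ' : '';
    return $sep unless @texts;                   # direct edge only
    if (@texts == @branches) {                   # parallel branches only
        my $body = @texts == 1 ? $texts[0] : '(' . join('|', @texts) . ')';
        return $pre . $body . $post;
    }
    # A branch of length 0 short-circuits the rest: make them optional and
    # keep the separating space inside the group.
    return '(' . join('|', map { $pre . $_ } @texts) . ')?' if $post eq '';
    return $pre . '(' . join('|', map { $_ . $post } @texts) . ')?';
}

sub graph_regex {
    my $next  = build_graph(@_);
    my @stops = checkpoints($next);
    my $re    = '^';
    for my $i (1 .. $#stops) {
        $re .= segment_regex($next, @stops[ $i - 1, $i ]);
        $re .= quotemeta($stops[$i]) unless $stops[$i] eq '$';
    }
    return $re . '$';
}

print graph_regex('this is a red fox', 'this is a blue firefox',
                  'this is a pig', 'a red fox'), "\n";
```

<p>For the sample text the checkpoints are ^, a and $, and the sketch prints ^(this is )?a (red fox|blue firefox|pig)$, the target regex from the analysis. The harder case mentioned in the first bullet, where branches merge at different vertices, and the recursive letter-level refinement are not handled.</p>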
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/text-2-regular-expressions-again.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A Personal Use Case: From Literal Strings to Regexes</title>
		<link>http://iregex.org/blog/literal-text-to-regex.html</link>
		<comments>http://iregex.org/blog/literal-text-to-regex.html#comments</comments>
		<pubDate>Wed, 10 Feb 2010 08:50:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=78</guid>
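		<description><![CDATA[Recently at work I needed to convert a certain kind of literal string into simple regular expressions. It can of course be done by hand, but large amounts of repetitive labor are better left to a machine. Last night I wrote a script for this and am posting it here. Because it is a script for handling my own work... ]]></description>
			<content:encoded><![CDATA[<p>Recently at work I needed to convert a certain kind of literal string into simple regular expressions. It can of course be done by hand, but large amounts of repetitive labor are better left to a machine. Last night I wrote a script for this and am posting it here. Because the script handles my own work, I am posting it only as a record and an archive; it may be of little practical use to anyone else. That said, converting existing literal strings into regexes should make a nice exercise, and interested readers are welcome to dig deeper.</p>
<p>It is worth mentioning that the code uses <font color="#FF00FF">Perl-only</font> features such as <font color="#FF00FF">$&#038;, (?{})</font>, which clarify the logic and simplify the code. Without them the code would be roughly <strong>5 times longer</strong>. As for efficiency, I have heard it said that after <font color="#FF00FF">use English</font>, <font color="#FF00FF">$MATCH</font> is <strong>5 times faster</strong> than using <font color="#FF00FF">$&#038;</font> directly. In fact <font color="#FF00FF">$MATCH</font> is only an alias for <font color="#FF00FF">$&#038;</font>, so both carry the same cost, which on Perl versions before 5.20 is a slowdown of every regex match in the program. For a command-line program that runs once and exits, though, <font color="#FF00FF">$&#038;</font> is good enough.</p>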
<p><span id="more-78"></span></p>
<p>A real-world example:</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #c20cb9; font-weight: bold;">perl</span> hash2re.pl H:aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA0.zip<span style="color: #000000; font-weight: bold;">/</span>H:aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span><span style="color: #000000; font-weight: bold;">/</span>aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span><span style="color: #000000; font-weight: bold;">/</span>aaa<span style="color: #000000; font-weight: bold;">/</span>Aaaaa<span style="color: #000000; font-weight: bold;">/</span>aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span>.exe<br />
RE <span style="color: #000000;">1</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>\.zip$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA0.zip&quot;</span><br />
<br />
RE <span style="color: #000000;">2</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0&quot;</span><br />
<br />
RE <span style="color: #000000;">3</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0&quot;</span><br />
<br />
RE <span style="color: #000000;">4</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa&quot;</span><br />
<br />
RE <span style="color: #000000;">5</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">4</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;Aaaaa&quot;</span><br />
<br />
RE <span style="color: #000000;">6</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>\.exe$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0.exe&quot;</span></div></td></tr></tbody></table></div>
<p>Source code:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<br />
<span style="color: #666666; font-style: italic;"># &nbsp; by rex zhang </span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; Feb 09 2010 in Shanghai</span><br />
<br />
<span style="color: #666666; font-style: italic;"># &nbsp; usage: split and regexize hashed filename</span><br />
<span style="color: #666666; font-style: italic;">#</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$lines</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$ARGV</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$lines</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #666666; font-style: italic;">#(C:[^/]+)#)</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$c</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$lines</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\Q$c\E//</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;ClearText Filename Ignored:<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\&quot;</span>$c<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@array</span><span style="color: #339933;">=</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">m</span><span style="color: #339933;">!</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>\<span style="color: #339933;">/|</span>H<span style="color: #339933;">:</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*!,</span> <span style="color: #0000ff;">$lines</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$line</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@array</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">next</span> <span style="color: #b1b100;">if</span> <span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$re</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">local</span> <span style="color: #0000ff;">$len</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<br />
<br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(?=[.\[\]()])/\\/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\?/./g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/0+(?{ $len=length($&amp;)})/[0-9]\{$len\}/g</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/A+(?{ $len=length($&amp;)})/[A-Z]\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/a+(?{ $len=length($&amp;)})/[a-z]\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(.)\1+(?{ $len=length($&amp;)})/$1\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\{1\}//g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=</span> &nbsp;<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\^</span>$re<span style="color: #000099; font-weight: bold;">\$</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">++;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$line</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/$re/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;RE $counter:<span style="color: #000099; font-weight: bold;">\t</span>$re<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>Matches: <span style="color: #000099; font-weight: bold;">\&quot;</span>$line<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;RE $counter:<span style="color: #000099; font-weight: bold;">\t</span>$re<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>Failed: <span style="color: #000099; font-weight: bold;">\&quot;</span>$line<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
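<p>For readers who prefer Python, the same conversion can be sketched as follows. This is a minimal illustration rather than a port of the script above: the function name <code>template_to_regex</code> and the <code>CLASSES</code> table are assumptions made for this example, and <code>re.escape</code> may escape more characters (the hyphen, for instance) than the Perl version does.</p>
<pre><code class="python">import re

# Hypothetical mapping, chosen for this sketch: each template character
# stands for a character class, as in the Perl substitutions above.
CLASSES = {"0": "[0-9]", "A": "[A-Z]", "a": "[a-z]", "?": "."}

def template_to_regex(template):
    """Turn a template such as 'aaa-AAA-0.exe' into an anchored regex."""
    parts = []
    # (.)\1* captures one character plus its immediate repeats as a run.
    for match in re.finditer(r"(.)\1*", template):
        char = match.group(1)
        count = len(match.group(0))
        atom = CLASSES.get(char, re.escape(char))
        parts.append(atom if count == 1 else "%s{%d}" % (atom, count))
    return "^" + "".join(parts) + "$"

name = "aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0.exe"
pattern = template_to_regex(name)
print(pattern)
print(re.match(pattern, name) is not None)
</code></pre>
<p>Grouping the input into runs first means every run is handled in a single loop, whereas the Perl script applies one substitution per character class.</p>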
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/literal-text-to-regex.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Two Ways to Count Duplicate Text Lines</title>
		<link>http://iregex.org/blog/get-duplicated-lines.html</link>
		<comments>http://iregex.org/blog/get-duplicated-lines.html#comments</comments>
		<pubDate>Sat, 06 Feb 2010 07:09:43 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=77</guid>
		<description><![CDATA[Suppose the sample file a.txt contains: hello world! hello world! I love regex. hello world! I love regex. hello world! A quick look shows that hello world! appears on 4 lines and I love regex. on 2. How can we write a program that uses regular expressions to count... ]]></description>
			<content:encoded><![CDATA[<p>Suppose the sample file <font color="#FF00FF">a.txt</font> contains the following:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">hello world!<br />
hello world!<br />
I love regex.<br />
hello world!<br />
I love regex.<br />
hello world!</div></td></tr></tbody></table></div>
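<p>A quick look shows that <font color="#FF00FF">hello world!</font> appears on 4 lines and <font color="#FF00FF">I love regex.</font> on 2. How can we write a program that uses regular expressions to gather these counts? Real files that need counting are never small enough to check by eye. I came up with two methods. The first relies on regular expressions (otherwise this post would not be here); in the second, a hash table plays the lead and the regular expression is only a supporting actor.<span id="more-77"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The Regex Approach</h3>
<p>The idea: for any line, if an identical line appears anywhere later in the file (from the next line through EOF), record the line's content and count its occurrences; then delete every copy of that line and move on to the next one. Finally, print the counts.</p>
<p>Below is the corresponding Perl program, with comments.</p>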
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl </span><br />
<span style="color: #666666; font-style: italic;">#usage: &nbsp;./dup_re.pl &lt;a.txt</span><br />
<br />
<span style="color: #000066;">undef</span> <span style="color: #0000ff;">$/</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># enable &quot;slurp&quot; mode</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$file</span> <span style="color: #339933;">=</span> <span style="color: #009999;">&lt;STDIN&gt;</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;"># whole file now here</span><br />
<br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$file</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><br />
&nbsp; &nbsp; <span style="color: #339933;">/</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#for each line;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore the whitespaces at both ends; </span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">\S</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get the line content, save to $1;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; \<span style="color: #000066;">s</span><span style="color: #339933;">*</span>$ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines by using \S</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">.*?</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#check if there is the same pattern of $1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span>\<span style="color: #cc66cc;">1</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span>$ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#after 0 or more lines;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">/</span>smx<span style="color: #009900;">&#41;</span> <br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$count</span><span style="color: #339933;">=</span> <span style="color: #0000ff;">$file</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/^\s*\Q$line\E\s*$//mg</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#delete every whole-line copy; \Q..\E quotes metacharacters</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save the number to $count;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$count</span><span style="color: #339933;">,</span><span style="color: #ff0000;">&quot; times:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$line</span><span style="color: #339933;">,</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
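<p>The backreference idea carries over to other regex flavors. Below is a hedged Python sketch of it, not a translation of the script above: the names <code>DUP</code> and <code>count_duplicates</code> are made up for this example, and the pattern uses <code>[ \t]</code> and <code>[^\n]</code> instead of <code>\s</code> and <code>.</code> so that a captured line can never span a line break.</p>
<pre><code class="python">import re

# Finds a line whose text reappears on a later line, via the \1 backreference.
DUP = re.compile(r"^[ \t]*(\S[^\n]*?)[ \t]*$.*?^[ \t]*\1[ \t]*$", re.M | re.S)

def count_duplicates(text):
    """Return (line, count) pairs for every line that occurs at least twice."""
    results = []
    while True:
        found = DUP.search(text)
        if found is None:
            break
        line = found.group(1)
        # Delete every whole-line copy and let subn report how many there were.
        whole_line = r"^[ \t]*" + re.escape(line) + r"[ \t]*$\n?"
        text, count = re.subn(whole_line, "", text, flags=re.M)
        results.append((line, count))
    return results

sample = "\n".join([
    "hello world!", "hello world!", "I love regex.",
    "hello world!", "I love regex.", "hello world!",
])
for line, count in count_duplicates(sample):
    print("%d times:\t%s" % (count, line))
</code></pre>
<p>On the sample from <font color="#FF00FF">a.txt</font> this reports <font color="#FF00FF">hello world!</font> 4 times and <font color="#FF00FF">I love regex.</font> 2 times. Escaping the captured text with <code>re.escape</code> before reusing it as a pattern keeps lines that contain regex metacharacters from breaking the deletion step.</p>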
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The Hash Approach</h3>
<p>This method takes advantage of Perl's powerful built-in hash tables. The idea is as follows:</p>
<ul>
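<li>Create an empty hash table.</li>
<li>Read the file line by line.</li>
<li>Use each line's text as a key in the table. On its first appearance the value becomes 1; on every later appearance the value is incremented.</li>
<li>Print the hash entries whose value is at least 2.</li>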
</ul>
<p>The Perl program:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<span style="color: #666666; font-style: italic;">#usage: &nbsp;./dup_hash.pl a.txt</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%hash</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">&lt;&gt;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp;<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/^\s*(\S.*?)\s*$/</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#ignore whitespaces at both ends; </span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#123;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines by using \S</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">++;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save the line to $1, and count the time it appears</span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#sort the hash by values; </span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$key</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">sort</span> <span style="color: #009900;">&#123;</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$b</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">&lt;=&gt;</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#125;</span> <span style="color: #009900;">&#125;</span> <span style="color: #000066;">keys</span> <span style="color: #0000ff;">%hash</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$key</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#only print the lines that duplicates;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#for all results, just remove the 'if' line</span><br />
&nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #000066;">printf</span> <span style="color: #ff0000;">&quot;%d times:<span style="color: #000099; font-weight: bold;">\t</span>%s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$key</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$key</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
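<p>The same counting idea can be sketched in Python with <code>collections.Counter</code>, which plays the role of the Perl hash. This is an illustrative sketch only; the function name <code>duplicate_lines</code> and the inline sample are assumptions made for the example.</p>
<pre><code class="python">from collections import Counter

def duplicate_lines(lines):
    """Count stripped, non-empty lines and keep those seen at least twice."""
    counts = Counter(line.strip() for line in lines if line.strip())
    # most_common() orders by count, highest first, like the Perl sort.
    return [(text, n) for text, n in counts.most_common() if n != 1]

sample = """hello world!
hello world!
I love regex.
hello world!
I love regex.
hello world!
"""
for text, n in duplicate_lines(sample.splitlines()):
    print("%d times:\t%s" % (n, text))
</code></pre>
<p>Because the text is only ever used as a dictionary key, no escaping is needed, which is one reason the hash approach is the more robust of the two.</p>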
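<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Results</h3>
<p>Save the programs above as dup_re.pl and dup_hash.pl. Because they read their input differently, they are also invoked differently: dup_re.pl reads the file from standard input (./dup_re.pl &lt;a.txt), while dup_hash.pl takes the file name as an argument (./dup_hash.pl a.txt). See the screenshot below:<br />
<img src="http://public.bay.livefilestore.com/y1p84mh-sSb8s2jIOokB1tAnVJQnNdmS1ir1v9A0nRbWPPZ6AdIQV896FPpKr_LNzQvJ6kJQ-Ue94wHK8LVscG8uQ/20100206_144726.png" alt="我爱正则表达式 | Two Ways to Count Duplicate Text Lines" /></p>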
<h4>Update</h4>
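<p>It occurred to me that the script would be more useful if it could ignore case and runs of multiple spaces between words, so that <font color="#FF00FF">Hello world!</font> and <font color="#FF00FF">      　　hello　　       WORLd!   </font> count as duplicate lines. I tried it, and the regex did not let me down.</p>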
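<p>One way to get this behavior is to normalize each line before comparing it. The Python sketch below shows the idea under stated assumptions: <code>normalize_line</code> is a hypothetical helper, not code from the scripts above, and it relies on <code>\s</code> matching Unicode whitespace (including the full-width space U+3000) in Python 3 string patterns.</p>
<pre><code class="python">import re
from collections import Counter

def normalize_line(line):
    """Collapse each run of whitespace to one space and ignore letter case."""
    # strip() and \s both treat the full-width space U+3000 as whitespace.
    return re.sub(r"\s+", " ", line.strip()).lower()

lines = ["Hello world!", "   \u3000hello\u3000   WORLd!   ", "I love regex."]
counts = Counter(normalize_line(line) for line in lines)
print(counts["hello world!"])
</code></pre>
<p>Normalizing the key rather than loosening the matching pattern keeps the comparison itself a plain string equality.</p>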
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/get-duplicated-lines.html/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Generating Text from a Regex: REExtractor</title>
		<link>http://iregex.org/blog/reextractor.html</link>
		<comments>http://iregex.org/blog/reextractor.html#comments</comments>
		<pubDate>Tue, 02 Feb 2010 09:12:35 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[REExtractor]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=76</guid>
		<description><![CDATA[I came across a simple, fun regex application: REExtractor. You enter a regular expression and it outputs the text that the expression describes. The author describes it as Generate all possibilities of Regular Expression... ]]></description>
			<content:encoded><![CDATA[<p>I came across a simple, fun regex application: <a id="f-4f" href="http://re2form.appspot.com/" title="我爱正则表达式 | Generating Text from a Regex">REExtractor</a>. You enter a regular expression and it outputs the text that the expression describes. The author describes it as &quot;Generate all possibilities of Regular Expression&quot;. In theory that is achievable, but in practice there are limits.<span id="more-76"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some Limitations</h3>
<ol>
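<li>It runs on GAE and is written in Python, so it uses Python's regex flavor. You may need a proxy to reach it.</li>
<li>Supported metacharacters and shorthands: <font class="Apple-style-span" color="#FF00FF">(), [],{m,n},{n},|,\w,\d</font>. To use any of these characters literally, escape it with a backslash. Here \w is equivalent to <font class="Apple-style-span" color="#FF00FF">[a-zA-Z0-9]</font>, one of 62 characters, rather than the usual <font class="Apple-style-span" color="#FF00FF">[_a-zA-Z0-9]</font>, one of 63 characters including the underscore. You can write <font class="Apple-style-span" color="#FF00FF">[_\w]</font> instead, which works fine.</li>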
<li>Unsupported metacharacters: <font class="Apple-style-span" color="#FF00FF">. (dot),^,$,\b,\D,\W,\1&#8230; (backreferences), (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font>, and so on.
<ul>
<li>A <font class="Apple-style-span" color="#FF00FF">.</font> (dot) is output as a literal character.</li>
<li><font class="Apple-style-span" color="#FF00FF">^, $, \b, \1, (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font> are silently ignored.</li>
<li><font class="Apple-style-span" color="#FF00FF">\D, \W, or [^]</font> makes the program report an error, because the range they cover is too wide.</li>
</ul>
</li>
<li>Regexes with 1000 or more possible results are not supported. For example, <font class="Apple-style-span" color="#FF00FF">\w{2}</font> is rejected because it has 62×62 = 3844 possibilities, whereas \w\d is accepted because it has only 62×10 = 620.</li>
</ol>
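<p>The arithmetic behind the result cap is just a product of the number of choices at each position. A small Python sketch (requires Python 3.8 or later for <code>math.prod</code>; the constant and function names are made up for this example and are not part of REExtractor):</p>
<pre><code class="python">from math import prod

WORD = 62    # \w as this tool defines it: a-z, A-Z, 0-9 (no underscore)
DIGIT = 10   # \d

def combinations(*sizes):
    """Number of strings produced by a sequence of independent choices."""
    return prod(sizes)

print(combinations(WORD, WORD))   # \w{2} gives 3844
print(combinations(WORD, DIGIT))  # \w\d gives 620
</code></pre>
<p>Estimating the count this way before submitting a pattern shows quickly whether it will stay under the cap.</p>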
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">What It Can Do</h3>
<p><a href="http://iregex.org/blog/REExtractor.html" target="_blank" title="我爱正则表达式 | Generating Text from a Regex"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100202170343.jpg" border="0" alt="我爱正则表达式 | Generating Text from a Regex"></a><br />
Despite its many limitations, you can still use it for some fun things. Here are two examples.</p>
<ul>
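<li>Generate some simple email addresses. Try this regex: <font class="Apple-style-span" color="#FF00FF">[abc]{3}\d@1(26|63).com</font>. It generates 540 addresses (3×3×3 letter combinations × 10 digits × 2 domains).</li>
<li>Generate some personal names. Try this regex: <font class="Apple-style-span" color="#FF00FF">张[小大勇赞强战海][虎猫龙彪平]</font>. It generates 35 names (7×5). Yes, it supports Chinese, and each Chinese character is treated as a single character. If a baby is on the way in your family, you could line up some candidate characters, see which combinations look and sound pleasing, and pick one of them.</li>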
</ul>
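<p>Both examples can be checked by hand with a Cartesian product. The Python sketch below spells out each pattern as a list of alternatives and expands it with <code>itertools.product</code>; it does not parse regexes and is not how REExtractor itself works, and the helper name <code>expand</code> is an assumption made for this example.</p>
<pre><code class="python">from itertools import product

def expand(*parts):
    """Yield every string formed by picking one option from each part."""
    for choice in product(*parts):
        yield "".join(choice)

# [abc]{3}\d@1(26|63).com, with the dot emitted literally
letters = "abc"
digits = "0123456789"
emails = list(expand(letters, letters, letters, digits,
                     ["@1"], ["26", "63"], [".com"]))
print(len(emails))  # 540

# 张[小大勇赞强战海][虎猫龙彪平]
names = list(expand(["张"], "小大勇赞强战海", "虎猫龙彪平"))
print(len(names))  # 35
</code></pre>
<p>A plain string works as a character class here because iterating over it yields one character at a time, Chinese characters included.</p>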
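<p>To be fair, these little applications could of course be programmed directly, with fewer limits and more flexibility and power. But is it worth firing up a compiler every time? Trying this small program is fun too. Moreover, the limitations mentioned in the previous section are actually quite reasonable: generating text from a regex has no use for most zero-width assertions (although backreferences such as <font class="Apple-style-span" color="#FF00FF">\1</font> are probably common, yet unsupported). Just treat it as a little toy.</p>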
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/reextractor.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

