<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式</title>
	<atom:link href="http://iregex.org/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Tue, 31 Aug 2010 04:35:39 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>
	<item>
		<title>Counting recently used Linux commands</title>
		<link>http://iregex.org/blog/most-frequently-used-linux-commands.html</link>
		<comments>http://iregex.org/blog/most-frequently-used-linux-commands.html#comments</comments>
		<pubDate>Tue, 31 Aug 2010 04:32:59 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[bash ubuntu]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=142</guid>
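		<description><![CDATA[Counting recently used Linux commands. It has no real practical use; it is just bash practice. Steps: get the most recent 1000 commands from the history command. Strip the line number from each line. Record the commands on each line: the first word at the start of the line, and the first... ]]></description>
			<content:encoded><![CDATA[<p>Counting recently used Linux commands. It has no real practical use; it is just bash practice.</p>
<p><span id="more-142"></span></p>
<p>Steps:</p>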
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
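<li>Get the most recent 1000 commands from the <code class="codecolorer bash default"><span class="bash"><span style="color: #7a0874; font-weight: bold;">history</span></span></code> command.
        </li>
<li>Strip the line number from each line.</li>
<li>Record the commands on each line: the first word at the start of the line and the first word after each pipe are treated as command names.</li>
<li>Sort the resulting list of commands.</li>
<li>Count the occurrences of each command, sorting by count in descending order and then by command name in ascending order.</li>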
</ul>
</blockquote>
<p>The full command is: <code class="codecolorer bash default"><span class="bash"><span style="color: #7a0874; font-weight: bold;">history</span> <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">sed</span> <span style="color: #ff0000;">&quot;s#^\s\+[0-9]\+\s\+##g&quot;</span> <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">grep</span> <span style="color: #660033;">-oP</span> <span style="color: #ff0000;">&quot;(?&lt;=^|\|)\w+&quot;</span><span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sort</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">uniq</span> -c<span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">sort</span> -k1,1nr <span style="color: #660033;">-k2</span></span></code>.</p>
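<p>The same steps can also be sketched in Python 3, which makes each stage easier to see. This is only an illustration and not the command above: the function name count_commands and the sample history lines are made up, and unlike the grep pattern it assumes that a command name may be preceded by spaces after a pipe.</p>

```python
import re
from collections import Counter

def count_commands(history_lines):
    """Count command names in the output lines of `history`."""
    counts = Counter()
    for line in history_lines:
        # Drop the leading history number, e.g. "  101  ls -l" becomes "ls -l".
        command_line = re.sub(r"^\s*\d+\s+", "", line)
        # The first word of each pipeline segment is treated as a command name.
        for segment in command_line.split("|"):
            match = re.match(r"\s*(\w+)", segment)
            if match:
                counts[match.group(1)] += 1
    # Sort by count descending, then by name ascending.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Invented sample input in the format printed by `history`.
sample = ["  1  ls -l", "  2  cat a.txt | grep foo | sort", "  3  ls | sort"]
for name, count in count_commands(sample):
    print(count, name)
```

<p>The final sort key mirrors sort -k1,1nr -k2: count in descending order, then name in ascending order.</p>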
<p>Here is the output on my Ubuntu machine:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&nbsp; &nbsp; 157 ls<br />
&nbsp; &nbsp; 134 cd<br />
&nbsp; &nbsp; &nbsp;89 pcregrep<br />
&nbsp; &nbsp; &nbsp;76 cat<br />
&nbsp; &nbsp; &nbsp;56 xargs<br />
&nbsp; &nbsp; &nbsp;52 python<br />
&nbsp; &nbsp; &nbsp;49 vim<br />
&nbsp; &nbsp; &nbsp;47 sudo<br />
&nbsp; &nbsp; &nbsp;46 git<br />
&nbsp; &nbsp; &nbsp;44 exit<br />
&nbsp; &nbsp; &nbsp;37 rename<br />
&nbsp; &nbsp; &nbsp;28 echo<br />
&nbsp; &nbsp; &nbsp;27 sed<br />
&nbsp; &nbsp; &nbsp;27 tstp<br />
&nbsp; &nbsp; &nbsp;26 adt<br />
&nbsp; &nbsp; &nbsp;26 grep<br />
&nbsp; &nbsp; &nbsp;19 curl<br />
&nbsp; &nbsp; &nbsp;18 rm<br />
&nbsp; &nbsp; &nbsp;16 history<br />
&nbsp; &nbsp; &nbsp;16 wget<br />
&nbsp; &nbsp; &nbsp;12 ps<br />
&nbsp; &nbsp; &nbsp;10 kill<br />
&nbsp; &nbsp; &nbsp;10 make<br />
&nbsp; &nbsp; &nbsp;10 perl<br />
&nbsp; &nbsp; &nbsp; 8 ll<br />
&nbsp; &nbsp; &nbsp; 8 mv<br />
&nbsp; &nbsp; &nbsp; 8 scp<br />
&nbsp; &nbsp; &nbsp; 8 sfo<br />
&nbsp; &nbsp; &nbsp; 7 ctags<br />
&nbsp; &nbsp; &nbsp; 7 tst<br />
&nbsp; &nbsp; &nbsp; 6 awk<br />
&nbsp; &nbsp; &nbsp; 6 gvim<br />
&nbsp; &nbsp; &nbsp; 6 mkdir<br />
&nbsp; &nbsp; &nbsp; 6 sort<br />
&nbsp; &nbsp; &nbsp; 4 chmod<br />
&nbsp; &nbsp; &nbsp; 4 man<br />
&nbsp; &nbsp; &nbsp; 4 uniq<br />
&nbsp; &nbsp; &nbsp; 3 cjb<br />
&nbsp; &nbsp; &nbsp; 3 md5sum<br />
&nbsp; &nbsp; &nbsp; 3 tt<br />
&nbsp; &nbsp; &nbsp; 3 vmxp<br />
&nbsp; &nbsp; &nbsp; 3 which<br />
&nbsp; &nbsp; &nbsp; 2 chown<br />
&nbsp; &nbsp; &nbsp; 2 ctag<br />
&nbsp; &nbsp; &nbsp; 2 docky<br />
&nbsp; &nbsp; &nbsp; 2 ex<br />
&nbsp; &nbsp; &nbsp; 2 ks<br />
&nbsp; &nbsp; &nbsp; 2 pyton<br />
&nbsp; &nbsp; &nbsp; 2 set<br />
&nbsp; &nbsp; &nbsp; 2 tar<br />
&nbsp; &nbsp; &nbsp; 1 bc<br />
&nbsp; &nbsp; &nbsp; 1 cdcd<br />
&nbsp; &nbsp; &nbsp; 1 cp<br />
&nbsp; &nbsp; &nbsp; 1 cpanm<br />
&nbsp; &nbsp; &nbsp; 1 date<br />
&nbsp; &nbsp; &nbsp; 1 efr<br />
&nbsp; &nbsp; &nbsp; 1 firefox<br />
&nbsp; &nbsp; &nbsp; 1 gawk<br />
&nbsp; &nbsp; &nbsp; 1 gi<br />
&nbsp; &nbsp; &nbsp; 1 less<br />
&nbsp; &nbsp; &nbsp; 1 lua<br />
&nbsp; &nbsp; &nbsp; 1 PWD<br />
&nbsp; &nbsp; &nbsp; 1 re<br />
&nbsp; &nbsp; &nbsp; 1 sleep<br />
&nbsp; &nbsp; &nbsp; 1 tpo<br />
&nbsp; &nbsp; &nbsp; 1 unzip<br />
&nbsp; &nbsp; &nbsp; 1 vi<br />
&nbsp; &nbsp; &nbsp; 1 vm<br />
&nbsp; &nbsp; &nbsp; 1 xarg</div></td></tr></tbody></table></div>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/most-frequently-used-linux-commands.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Musings on anti-spam</title>
		<link>http://iregex.org/blog/anti-spam.html</link>
		<comments>http://iregex.org/blog/anti-spam.html#comments</comments>
		<pubDate>Sun, 15 Aug 2010 02:03:50 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[akismet]]></category>
		<category><![CDATA[antispam]]></category>
		<category><![CDATA[discuz]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=140</guid>
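		<description><![CDATA[This is an informal essay that discusses email anti-spam techniques together with forum flood control, purely from a technical angle and nothing else. Germany is said to have a proverb: a beer without foam is not a good beer. By extension: a forum that nobody... ]]></description>
			<content:encoded><![CDATA[<p>This is an informal essay that discusses email anti-spam techniques together with forum flood control, purely from a technical angle and nothing else.</p>
<p><span id="more-140"></span></p>
<p>Germany is said to have a proverb: a beer without foam is not a good beer. By extension, a forum that nobody floods is not a good forum, a mail system without spam is not a good mail system (or at least it is an obscure system or address), an OS that no virus bothers is not a good OS, and so on. But a beer that is nothing but foam is hardly a good beer either. The point is to keep unwanted content within tolerable limits. For anyone running a forum or fighting spam on a mail system, moderation techniques are useful and necessary; otherwise the site will soon drown in a sea of junk advertising. Running your own ads on your own forum to cover hosting costs is one thing, but uninvited ads are intolerable.</p>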
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Spam</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>
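    Spam has two characteristics:<br />
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>
                <strong>Volume</strong>. A spam message or two may matter to an individual user, but a server handles mail by the millions. With a denominator that large, it is only to be expected that an occasional spam message is classified as legitimate, or that a legitimate message is misclassified as spam.
            </li>
<li><strong>Unwanted</strong>. Whether a message is wanted depends on the user's subjective judgment. Most people consider porn and drug content spam, but some users may well mark such mail as ham.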
            </li>
</ol>
</blockquote>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Anti-Spam</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Common anti-spam approaches include the following:
</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
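<li>Static keyword lists. If the message (subject, sender and recipient, email addresses, body) contains certain keywords (for example Viagra), it is marked as spam or ham. This is the simplest approach and is used by the major mail providers, Gmail included. Its biggest advantage is speed. It has many drawbacks: first, the list needs maintenance (adding, changing, and removing keywords); second, accuracy is low (not every message containing a blacklisted keyword is spam, and vice versa); third, it cannot recognize messages that contain deliberate noise. Take V1agra or 发*漂: a human instantly recognizes mail with such keywords as spam, but the static approach is stumped.
        </li>
<li>Regular-expression rule tables. Compared with static keyword lists, they are somewhat slower but extremely powerful and flexible. They work especially well on mail headers: headers follow regular patterns, and bulk spam in particular cannot avoid leaving traces there.<br />
        This approach has its weak points too. Like the previous one it needs dedicated maintainers, and both configuration difficulty and maintenance cost are far higher. On message bodies, regular-expression scanning is fairly slow, especially against spam with carefully designed noise. A sizeable share of messages are clearly spam to the human eye yet are hard to capture in a regular-expression rule. A new rule writer may need a long time to get a feel for the trade-off between accuracy and the total amount caught.
        </li>
<li>Bayesian probability. If certain words are known to appear often in spam but rarely in legitimate mail, then a message containing those words is very likely spam. See this article: <a href="http://home.q.yesky.com/space-4148078-do-blog-id-412454.html" title="我爱正则表达式" target="_blank">Bayesian filtering</a>.<br />
        Its biggest advantage is that, given a sufficiently robust algorithm and a large enough sample space, its accuracy is very high. It also relies mainly on machine learning and does not need much manual intervention afterwards.</li>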
</ul>
</blockquote>
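<p>To make the Bayesian idea concrete, here is a minimal Python 3 sketch of a naive Bayes style score. It is a simplified illustration under stated assumptions, not how any particular filter is implemented: the tiny corpora, the function names train and spam_probability, and the use of add-one smoothing are all choices made for this example, and only the words present in a message are considered.</p>

```python
import math
from collections import Counter

def train(messages):
    """Count in how many tokenized messages each word appears."""
    counts = Counter()
    for words in messages:
        counts.update(set(words))
    return counts

def spam_probability(words, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-word evidence with Bayes' rule, working in log space."""
    # Start from the prior odds of spam versus ham.
    log_odds = math.log(n_spam / n_ham)
    for word in set(words):
        # Add-one smoothing keeps unseen words from producing zero probabilities.
        p_word_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_ham = (ham_counts[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_word_spam / p_word_ham)
    # Convert log odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))

# Invented training data, already split into words.
spam_corpus = [["cheap", "viagra"], ["viagra", "offer"], ["cheap", "offer"]]
ham_corpus = [["meeting", "tomorrow"], ["lunch", "offer"], ["meeting", "notes"]]
spam_counts, ham_counts = train(spam_corpus), train(ham_corpus)

for text in (["viagra", "cheap"], ["meeting", "notes"]):
    p = spam_probability(text, spam_counts, ham_counts, len(spam_corpus), len(ham_corpus))
    print(text, round(p, 3))
```

<p>Words seen mostly in the spam corpus push the score towards 1, and words seen mostly in the ham corpus push it towards 0; a word that appears in both, such as offer here, carries weaker evidence.</p>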
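<p>Some sites in China have extremely crude content filters: as soon as a single character such as “日”, “操”, or “干” appears, the text is treated as spam mail or a spam comment with no analysis of the context, which is both infuriating and laughable. Likewise, the common surname characters from 《百家姓》 (the Hundred Family Surnames) are not junk or banned words in themselves; add them to a static list and even 萝卜 (radish) can no longer be searched for. In fact, a little regular-expression work (lookaround) or Bayesian technique (conditional probability) would raise the filtering quality and keep everyone happy. Of course, when hundreds of millions of pages have to be scanned, speed necessarily takes priority over accuracy, and in some cases roughly right is the best one can do. But what use is a slightly faster system that keeps getting things wrong (as the idiom 南辕北辙 has it: heading south while driving north)? Then again, the policy has always been to block too much rather than too little.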
</p>
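<p>As a rough illustration of the lookaround idea, the Python 3 sketch below compares a bare single-character rule with one that uses a negative lookahead. The list of benign following characters is hypothetical and far from complete, and words in which the character comes last (such as 生日) would need a lookbehind as well.</p>

```python
import re

# Hypothetical list: characters that make "日" part of an ordinary word
# (日期 date, 日本 Japan, 日志 log, 日常 daily).
BENIGN_FOLLOWERS = "期本志常"

# Naive rule: any occurrence of the character is flagged.
naive_rule = re.compile("日")
# Context-aware rule: flag the character only when no benign follower comes next.
lookahead_rule = re.compile("日(?![" + BENIGN_FOLLOWERS + "])")

def is_flagged(rule, text):
    """Return True when the rule matches anywhere in the text."""
    return rule.search(text) is not None

for text in ("日期", "日志分析", "日"):
    print(text, is_flagged(naive_rule, text), is_flagged(lookahead_rule, text))
```

<p>The naive rule flags all three strings, while the lookahead rule lets the ordinary words through and still flags the bare character.</p>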
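<p>By its nature, a static list can be open only to administrators and is kept secret from ordinary visitors. This leads to another phenomenon: “this post cannot be displayed because it contains a certain keyword.” Which words are keywords? Sorry, that cannot be disclosed, for fear that once the list is public, people wanting to post such content could easily get around it. So it falls to the administrators to censor themselves strictly and carefully guess at the intentions from above.</p>
<p> Combining static and dynamic methods, and combining humans and machines, is what keeps a system improving with use. @chunzi put it vividly: fighting spam is not a marathon that tests endurance but evolution, survival of the fittest. The devil always rises ten feet first; the way then climbs a foot at a time until it finally overcomes the devil. Meanwhile a new devil is about to appear.</p>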
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some tools</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
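<p>If you say your own mailbox has hardly any spam, or that what there is already gets moved to the junk folder automatically, there are two possibilities. One is that your mail provider's own anti-spam system is doing a good job (a great many obvious spam messages of every kind are blocked on the server before they ever reach your mailbox). The other is that your address has not been harvested or guessed by crawlers.</p>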
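<ul>
<li>
    <a href="http://spamassassin.apache.org/" title="我爱正则表达式" target="_blank">SpamAssassin</a> is a good anti-spam system. It is free, is an Apache Software Foundation project, and has strong regular-expression support. A group in China maintains a regularly updated Chinese rule set <a href="http://www.ccert.edu.cn/spam/sa/Chinese_rules.htm" title="我爱正则表达式" target="_blank">here</a>, which is worth consulting. SpamAssassin also has a Bayesian module; it is simply better known for its regular expressions.
    </li>
<li>WordPress has the Akismet plugin for blocking spam comments on blogs. It is trivial to set up (you only need to apply for an API key), works fairly intelligently, and does not require the user to add any rules by hand. Users are expected to report misclassifications to Akismet so that it can learn new variants. Missed spam and false positives submitted by users are an indispensable corpus; without them, Akismet would stubbornly keep making the same mistakes along its established line of reasoning.
    </li>
</ul>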
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Personal use</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
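<p>Discuz, developed in China, is a good forum package. However, it has no decent anti-spam module, and I kept wearily deleting spam and spammers by hand. A look at the Akismet site showed that nearly every well-known forum package from abroad already has an Akismet plugin. I studied the Discuz data structures and wrote a Python script that periodically looks up new posts and submits them to Akismet for a verdict. If a post is judged to be spam, it is hidden and the member loses credits. I have only just started using it, and the results are decent so far. It could also be built as a native PHP plugin and integrated into Discuz.</p>
<p>The flow is straightforward:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Periodically look up the newest posts;</li>
<li>For each new post:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Submit it to Akismet for checking.</li>
<li>If it is not spam, ignore it.</li>
<li>If it is spam, make the post visible to administrators only and deduct credits from the user.</li>
</ul>
</blockquote>
<li>Collect the pending operations, execute the SQL, and commit.</li>
<li>Generate a report and email it to the administrator.</li>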
</ul>
</blockquote>
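<p>The flow above can be sketched in Python 3 without touching a database or the network. Everything here is a stand-in for illustration: review_posts, build_report, and the is_spam callback (which takes the place of the Akismet check) are hypothetical names, and the sample posts are invented.</p>

```python
from collections import Counter

def review_posts(posts, is_spam):
    """Return the post IDs to hide and the spam count per author."""
    spam_pids = []
    spam_per_user = Counter()
    for post in posts:
        # is_spam stands in for the Akismet check; the caller supplies it.
        if is_spam(post["msg"]):
            spam_pids.append(post["pid"])
            spam_per_user[post["uid"]] += 1
    return spam_pids, dict(spam_per_user)

def build_report(spam_pids, spam_per_user):
    """Summarize the run in plain text for the administrator."""
    lines = ["%d spam posts hidden" % len(spam_pids)]
    for uid in sorted(spam_per_user):
        lines.append("user %s: %d spam posts" % (uid, spam_per_user[uid]))
    return "\n".join(lines)

# Invented posts; a real run would read them from the forum database.
posts = [
    {"pid": 1, "uid": 10, "msg": "hello everyone"},
    {"pid": 2, "uid": 11, "msg": "buy cheap pills"},
    {"pid": 3, "uid": 11, "msg": "cheap pills again"},
]
pids, per_user = review_posts(posts, lambda msg: "pills" in msg)
print(build_report(pids, per_user))
```

<p>Keeping the classifier behind a callback means the decision logic can be exercised with a fake check before it is wired to the real service.</p>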
<p>The Akismet developer page is here: <a href="http://akismet.com/development/" title="我爱正则表达式" target="_blank">Akismet API Documentation</a>. I used the <a href="http://www.voidspace.org.uk/python/modules.shtml" title="我爱正则表达式" target="_blank">Python module</a> listed there and wrapped it in a class that only needs init and check:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;comment.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-14 15:58</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<span style="color: #ff7700;font-weight:bold;">from</span> akismet <span style="color: #ff7700;font-weight:bold;">import</span> Akismet<br />
<br />
<span style="color: #ff7700;font-weight:bold;">class</span> Comment<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">api</span>=Akismet<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">api</span> <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span>.<span style="color: black;">api</span>.<span style="color: black;">verify_key</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'No Valid Akismet API'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">sys</span>.<span style="color: black;">exit</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> init<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, comment, <span style="color: #dc143c;">user</span>=<span style="color: #483d8b;">''</span>, ip=<span style="color: #483d8b;">''</span>, <span style="color: #dc143c;">email</span>=<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: #dc143c;">user</span>=<span style="color: #dc143c;">user</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">comment</span>=comment.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">ip</span>=ip<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: #dc143c;">email</span>=<span style="color: #dc143c;">email</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> check<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">api</span>.<span style="color: black;">comment_check</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">comment</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'comment_author'</span>: <span style="color: #008000;">self</span>.<span style="color: #dc143c;">user</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'comment_author_email'</span>:<span style="color: #008000;">self</span>.<span style="color: #dc143c;">email</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'user_ip'</span>:<span style="color: #008000;">self</span>.<span style="color: black;">ip</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'user_agent'</span>:<span style="color: #483d8b;">&quot;Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>This is the main part of the program:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;dzas.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-14 15:20</span><br />
<br />
<span style="color: #808080; font-style: italic;">#anti spam for discuz! bbs.</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">from</span> comment <span style="color: #ff7700;font-weight:bold;">import</span> Comment <span style="color: #ff7700;font-weight:bold;">as</span> C<br />
<span style="color: #ff7700;font-weight:bold;">from</span> eml <span style="color: #ff7700;font-weight:bold;">import</span> Email<br />
<br />
<span style="color: #808080; font-style: italic;">#########################################################</span><br />
<span style="color: #808080; font-style: italic;">#global settings</span><br />
send_email_log =<span style="color: #ff4500;">1</span><br />
<br />
dbhost = <span style="color: #483d8b;">'yourhost.website.com'</span><br />
dbuser = <span style="color: #483d8b;">'db_user'</span><br />
dbpw = <span style="color: #483d8b;">'yourpassword'</span><br />
dbname = <span style="color: #483d8b;">'dbname'</span><br />
dbpre=<span style="color: #483d8b;">'cdb_'</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#punish_score: credits deducted from a user for each spam post</span><br />
punish_score=<span style="color: #ff4500;">2</span><br />
<br />
<span style="color: #808080; font-style: italic;">#the query looks back 2 hours (2*3600 seconds); adjust as needed</span><br />
sql=<span style="color: black;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #483d8b;">'last_n_hr'</span>:<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'select `%sposts`.pid, `%sposts`.author, `%sposts`.message, `%sposts`.useip, `%smembers`.email, authorid from `%sposts`, `%smembers` where dateline&gt;UNIX_TIMESTAMP(now())-2*3600 and `%sposts`.author=`%smembers`.username order by pid desc;'</span><span style="color: #483d8b;">''</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>dbpre,dbpre,dbpre,dbpre,dbpre,dbpre,dbpre,dbpre,dbpre,<span style="color: black;">&#41;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'hide'</span>:<span style="color: #483d8b;">'update %sposts set `status`= 1 where pid in (%s);'</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'punish'</span>:<span style="color: #483d8b;">'update %smembers set credits=credits-%s where uid=%s;'</span>, <br />
&nbsp; &nbsp; <span style="color: #483d8b;">'find_hided'</span>:<span style="color: #483d8b;">&quot;SELECT * FROM &nbsp;`%sposts` WHERE &nbsp;`status`=1;&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span>dbpre<span style="color: black;">&#41;</span>,<br />
<span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> MySQLdb<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> init_db<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; db=MySQLdb.<span style="color: black;">connect</span><span style="color: black;">&#40;</span>host=dbhost, <span style="color: #dc143c;">user</span>=dbuser, passwd=dbpw,db=dbname, charset=<span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> db<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> hide_spam<span style="color: black;">&#40;</span>db, spam<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; sql_str=sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'hide'</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>dbpre,<span style="color: #483d8b;">','</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>spam<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql_str<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; db.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> punish<span style="color: black;">&#40;</span>db,score<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> u <span style="color: #ff7700;font-weight:bold;">in</span> score.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; s=score<span style="color: black;">&#91;</span>u<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; sql_str=sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'punish'</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>dbpre,s<span style="color: #66cc66;">*</span>punish_score,u<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql_str<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; db.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> get_msg<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'last_n_hr'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; records= c.<span style="color: black;">fetchall</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; result=<span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> records:<br />
&nbsp; &nbsp; &nbsp; &nbsp; result.<span style="color: black;">append</span><span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'pid'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'user'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'msg'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'ip'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'email'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'uid'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#125;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> result<br />
<br />
<span style="color: #808080; font-style: italic;">#send log to admin. change the global variable </span><br />
<span style="color: #808080; font-style: italic;">#send_email_log = 1 or 0 to enable/disable</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> report<span style="color: black;">&#40;</span>spam, score<span style="color: black;">&#41;</span>:<br />
<br />
&nbsp; &nbsp; sub=<span style="color: #483d8b;">&quot;%s spams captured&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>spam<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; body=<span style="color: #483d8b;">&quot;spammers including: %s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #483d8b;">', '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span> <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'user'</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; body+=<span style="color: #483d8b;">&quot;spam pids including: %s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #483d8b;">', '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span> <span style="color: black;">&#91;</span> <span style="color: #008000;">str</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'pid'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><br />
&nbsp;<br />
&nbsp; &nbsp; body+=<span style="color: #483d8b;">&quot;useful sql: %s<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #66cc66;">%</span> sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'find_hided'</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; body+=<span style="color: #483d8b;">&quot;spam preview: <span style="color: #000099; font-weight: bold;">\n</span>%s&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span> m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'msg'</span><span style="color: black;">&#93;</span>.<span style="color: black;">splitlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; Email<span style="color: black;">&#40;</span>sub.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span>,body.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#the core part</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> anti_spam<span style="color: black;">&#40;</span>db, msgs<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c=C<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; spam=<span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; score=<span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>.<span style="color: black;">fromkeys</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'uid'</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> msgs<span style="color: black;">&#93;</span>, <span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> msgs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; c.<span style="color: black;">init</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'msg'</span><span style="color: black;">&#93;</span>, m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'user'</span><span style="color: black;">&#93;</span>, m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'ip'</span><span style="color: black;">&#93;</span>, m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'email'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> c.<span style="color: black;">check</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; score<span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'uid'</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>+=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; spam.<span style="color: black;">append</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> s <span style="color: #ff7700;font-weight:bold;">in</span> score.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> score<span style="color: black;">&#91;</span>s<span style="color: black;">&#93;</span>==<span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">del</span> score<span style="color: black;">&#91;</span>s<span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> score <span style="color: #ff7700;font-weight:bold;">and</span> spam: <br />
&nbsp; &nbsp; &nbsp; &nbsp; hide_spam<span style="color: black;">&#40;</span>db, <span style="color: black;">&#91;</span><span style="color: #008000;">str</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'pid'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; punish<span style="color: black;">&#40;</span>db, score<span style="color: black;">&#41;</span> <br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> send_email_log:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; report<span style="color: black;">&#40;</span>spam, score<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>: <br />
&nbsp; &nbsp; db=init_db<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; msg=get_msg<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; anti_spam<span style="color: black;">&#40;</span>db, msg<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__==<span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/anti-spam.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A Regex Everyman</title>
		<link>http://iregex.org/blog/fanke.html</link>
		<comments>http://iregex.org/blog/fanke.html#comments</comments>
		<pubDate>Tue, 10 Aug 2010 12:00:51 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=139</guid>
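		<description><![CDATA[Replace on demand, search with precision; sometimes greedy, sometimes lazy. I like regex and dislike backtracking; I like solving problems with regex, and I know that not every problem can be solved with regex. I like discussing puzzles on the forum and sharing what I learn on the blog. Self-taught... ]]></description>
			<content:encoded><![CDATA[<p>Replace on demand, search with precision;<br />
Sometimes greedy, sometimes lazy.<br />
I like regex and dislike backtracking;<br />
I like solving problems with regex, and I know that not every problem can be solved with regex.<br />
I like discussing puzzles on the <a href="http://regex.me">forum</a> and sharing what I learn on the <a href="http://iregex.org">blog</a>.<br />
Self-taught but not yet accomplished, and no guru either: I am a regex everyman. </p>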
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanke.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How Should a Complex Regular Expression Be Constructed?</title>
		<link>http://iregex.org/blog/craft-complex-regex.html</link>
		<comments>http://iregex.org/blog/craft-complex-regex.html#comments</comments>
		<pubDate>Fri, 06 Aug 2010 07:02:46 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Q&A]]></category>
		<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=138</guid>
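		<description><![CDATA[Yesterday Snopo asked me how to write a regular expression that extracts the conditions from an SQL statement. After answering, I wanted to write an article to share the experience. The title was originally going to be "How to Construct Complex Regular Expressions", but that seemed ambiguous, as if regular expressions were simple to begin with... ]]></description>
			<content:encoded><![CDATA[<p>Yesterday Snopo asked me how to write a regular expression that extracts the conditions from an SQL statement. After answering, I wanted to write an article to share the experience. The title was originally going to be "How to Construct Complex Regular Expressions", but that seemed ambiguous: it could read as if regular expressions were simple to begin with and I were teaching people how to make them complicated. My intent is the opposite: even a complex regular expression is nothing to fear once you find a suitable method for building it.</p>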
<p><span id="more-138"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Taking the Easy Way Out</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>
    The text Snopo gave me looks like this: <code class="codecolorer text default"><span class="text">or and name='zhangsan' and id=001 or age&gt;20 or area='%renmin%' and like</span></code>. The question was how to extract the valid SQL condition from it.</p>
<p>
       A quick look shows that the middle part is valid; only the two ends carry some stray <code class="codecolorer text default"><span class="text">like, or, and</span></code> keywords. A regular expression that parses any syntactically valid SQL condition would be fairly complex, but a concrete problem can often be solved more simply. The malformed SQL above was presumably generated by a program, which leaves unwanted text at both ends. Removing that text is all that is needed.</p>
<p>So I wrote this substitution: <code class="codecolorer perl default"><span class="perl"><span style="color: #000066;">s</span><span style="color: #339933;">/^</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #b1b100;">or</span><span style="color: #339933;">|</span>and<span style="color: #339933;">|</span>like<span style="color: #009900;">&#41;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+|</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #b1b100;">or</span><span style="color: #339933;">|</span>and<span style="color: #339933;">|</span>like<span style="color: #009900;">&#41;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span><span style="color: #0000ff;">$/</span><span style="color: #339933;">/</span>gmi<span style="color: #339933;">;</span></span></code>. It strips the <code class="codecolorer text default"><span class="text">like, or, and</span></code> keywords, together with any surrounding whitespace, from the start and end of every line of a multi-line string; what remains is the answer.</p>
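<p>The same trimming can be sketched in Python. This is only an illustration: the name <code>strip_edge_keywords</code> is invented here, and the sample string is the one quoted above. Because <code>re.sub</code> replaces every match by default, the leading run and the trailing run are both removed in one call.</p>

```python
import re

# Same alternation as the substitution above: a run of keywords at the start
# of a line, or a run of keywords (with leading whitespace) at the end of a line.
EDGE_KEYWORDS = re.compile(
    r"^(?:(?:or|and|like)\s*)+|\s*(?:(?:or|and|like)\s*)+$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_edge_keywords(text):
    """Remove stray or/and/like keywords from both ends of each line.

    Hypothetical helper name, used only for this sketch.
    """
    return EDGE_KEYWORDS.sub("", text)


sample = "or and name='zhangsan' and id=001 or age>20 or area='%renmin%' and like"
print(strip_edge_keywords(sample))
```

<p>Like the original pattern, this sketch does not require a word boundary after each keyword, so a line that starts with a word such as <code>order</code> would lose its first two letters; adding <code>\b</code> after the keyword group is one way to tighten it.</p>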
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Divide and Conquer</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>
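        After I sent the answer, Snopo was clearly not satisfied with this "lazy" approach. He went on to ask whether a regular expression could be written to match condition clauses that conform to SQL syntax. (Only the where part needs to be considered, not a complete select.)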
    </p>
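<p>Indeed, if the goal is to solve the problem quickly, any method that works will do; but if the goal is to learn, the right path is to get to the bottom of things rather than dodge the hard part. So let us see how to handle this SQL condition with regular expressions.</p>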
<p>
        The simplest condition is a plain truth value, such as <code class="codecolorer text default"><span class="text">where 1; where True; where false</span></code>, and so on. A regular expression for these is simply <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:-?</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+|</span>True<span style="color: #339933;">|</span>False<span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span>i</span></code>.
        </p>
<p>
        A slightly more complex single statement is a comparison between a left side and a right side, for example:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">name like 'zhang%', or age&gt;25, or work in ('it', 'hr', 'R&amp;D')</div></td></tr></tbody></table></div>
<p>Simplified, the structure becomes <code class="codecolorer text default"><span class="text">A OP B</span></code>, where A is the variable, OP is the comparison operator, and B is the value.
        </p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>A: The simplest A is <code class="codecolorer text default"><span class="text">\w+</span></code>. In practice a variable may contain dots or backticks, for example <code class="codecolorer text default"><span class="text">`table.salary`</span></code>, so it can be written as <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\w</span><span style="color: #339933;">.</span><span style="color: #ff0000;">`]+/</span></span></code>. This is a fairly loose refinement. If stricter matching is required, the pattern can also insist that the backticks appear on both sides together (using a conditional). </li>
<li>OP: The comparison operators commonly used in a where clause are: <code class="codecolorer text default"><span class="text">=, &lt;&gt;, &gt;, &lt;, &gt;=, &lt;=, Between, Like, in</span></code>. Described with a simple regular expression, they become: <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">&lt;&gt;=</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">|</span>Between<span style="color: #339933;">|</span>Like<span style="color: #339933;">|</span>In<span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span>i</span></code>.
                </li>
<li>B: B falls into four cases: variable, number, string, and list. For simplicity, arithmetic expressions are not considered here.<br />
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Variable: simply reuse the definition of A; nothing more to add.</li>
<li>Number: defined with <code class="codecolorer text default"><span class="text">/\d+/</span></code>. Decimals and negative numbers are not considered.</li>
<li>String: both single-quoted and double-quoted strings, which may contain escaped quotes. I wrote a quoted-string regular expression that meets this requirement, of the form: <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #ff0000;">&quot;]|[^<span style="color: #000099; font-weight: bold;">\\</span>1])*?<span style="color: #000099; font-weight: bold;">\1</span>/</span></span></code>. However, because it is only one part of a much larger machine, writing it this way is very risky. First, it uses a backreference; second, that backreference relies on a global group number, which shifts as soon as other groups are added to the combined pattern. I wrote a function that generates unique group identifiers automatically to solve this problem. But this is already too deep into detail: the framework should come first and the details later, rather than drowning in a sea of details from the start. </li>
<li>List: a list looks like <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">3</span> <span style="color: #339933;">,</span> <span style="color: #cc66cc;">4</span><span style="color: #009900;">&#41;</span> or <span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;it&quot;</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;hr&quot;</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;r&amp;d&quot;</span><span style="color: #009900;">&#41;</span></span></code>: simple values joined by commas and enclosed in parentheses. Let I denote a single list item, standing for number|string. The list then becomes <code class="codecolorer text default"><span class="text">/\(I(?:,I)*?\)/</span></code>, which reads: a left parenthesis, one I, zero or more further items each consisting of a comma and an I, then a right parenthesis. For simplicity, whitespace is not considered.
                        </li>
</ul>
</blockquote>
</li>
<li>At this point the regex skeleton of a single statement can be summarized as <code class="codecolorer perl default"><span class="perl">S <span style="color: #339933;">=~</span> <span style="color: #339933;">/</span>A OP B<span style="color: #339933;">/</span>i</span></code>, where S stands for a single statement. </li>
</ul>
</blockquote>
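<p>One caveat about the quoted-string pattern shown in the string item: inside a character class <code>\1</code> is not a backreference, so <code>[^\\1]</code> merely excludes a backslash and the digit 1. A negative lookahead is one common way to say "any character except the opening quote". The sketch below combines that with a named group, so the backreference no longer depends on global group numbering. The helper name <code>quoted_string</code> and the group names are assumptions made for this illustration, not part of the original code.</p>

```python
import re


def quoted_string(name):
    """Return a quoted-string pattern whose backreference uses a named group.

    `name` is a caller-chosen group name (hypothetical helper for this sketch);
    it must be unique within the pattern the fragment is embedded in.
    """
    return (
        r"(?P<%(n)s>['\"])"          # opening quote, single or double
        r"(?:\\.|(?!(?P=%(n)s)).)*"  # escaped character, or anything but the opening quote
        r"(?P=%(n)s)"                # closing quote, same as the opening one
    ) % {"n": name}


# Embedding the fragment after other groups does not disturb the backreference.
combined = re.compile(
    r"(\w+)\s*(=|like)\s*(" + quoted_string("q1") + r")",
    re.IGNORECASE,
)
m = combined.match("name like 'zhang%'")
print(m.group(1), m.group(2), m.group(3))
```

<p>Unlike a pattern that forbids both quote characters in the body, this version accepts the other kind of quote inside a string, for example a single quote inside a double-quoted string.</p>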
<p>
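        More complex still are compound statements, built from single statements joined by "and" or "or". Construct the single statement soundly, weave it reliably into compound statements, and the job is done.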
    </p>
<p>
        Continuing with the notation above, with S standing for a single statement, the compound statement C is <code class="codecolorer perl default"><span class="perl">C <span style="color: #339933;">=~</span> <span style="color: #339933;">/</span>S<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #b1b100;">or</span><span style="color: #339933;">|</span>and<span style="color: #009900;">&#41;</span> S<span style="color: #009900;">&#41;</span><span style="color: #339933;">*?/</span></span></code>. With that, a rudimentary condition parser is born. Below it is implemented step by step, using Python as the example.
    </p>
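<p>As a quick sanity check of the composition S = A OP B and C = S joined by and/or, here is a condensed Python 3 sketch that builds the pattern from its parts and runs it against the sample text. It keeps the same simplifications (no arithmetic, no decimals). The helper names <code>quoted</code>, <code>item</code>, <code>items</code>, <code>single</code>, and <code>compound</code>, as well as the counter-based group names, are assumptions of this sketch.</p>

```python
import re
from itertools import count

_ids = count(1)  # supplies a unique suffix for every quoted-string group

A = r"[\w.`]+"                          # variable, e.g. age or `table.salary`
OP = r"(?:[<>=]{1,2}|between|like|in)"  # comparison operators


def quoted():
    """Quoted string; each call gets its own group name, so names never clash."""
    name = "q%d" % next(_ids)
    return r"""(?P<%s>['"])(?:\\.|(?!(?P=%s)).)*(?P=%s)""" % (name, name, name)


def item():
    """One list item: a variable/number or a quoted string."""
    return r"(?:%s|%s)" % (A, quoted())


def items():
    """A parenthesised, comma-separated list of items."""
    return r"\(\s*%s(?:\s*,\s*%s)*\s*\)" % (item(), item())


def single():
    """S = A OP B, where B is a quoted string, a list, or a variable/number."""
    return r"%s\s*%s\s*(?:%s|%s|%s)" % (A, OP, quoted(), items(), A)


def compound():
    """C = S followed by any number of (and|or) S."""
    return r"%s(?:\s+(?:and|or)\s+%s)*" % (single(), single())


CONDITION = re.compile(compound(), re.IGNORECASE)

sample = "or and name='zhangsan' and id=001 or age>20 or area='%renmin%' and like"
print(CONDITION.search(sample).group(0))
```

<p>Searching the sample skips the stray keywords at both ends, because they never form a complete A OP B statement.</p>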
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Python Implementation</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>To repeat: although an implementation is given, please focus on the approach rather than the code.</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;test.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-06 17:12</span><br />
<br />
<span style="color: #808080; font-style: italic;">#generate quoted string;</span><br />
<span style="color: #808080; font-style: italic;">#including ' and &quot; string</span><br />
<span style="color: #808080; font-style: italic;">#allow \' and \&quot; inside</span><br />
index=<span style="color: #ff4500;">0</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> gen_quote_str<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>: <br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">global</span> index<br />
&nbsp; &nbsp; index+=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; char=<span style="color: #008000;">chr</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">96</span>+index<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;(?P&lt;quote_%s&gt;['&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['&quot;]|[^'&quot;])*?(?P=quote_%s)&quot;&quot;&quot;</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>char, char<span style="color: black;">&#41;</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#simple variable </span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">'[<span style="color: #000099; font-weight: bold;">\w</span>.`]+'</span><br />
<br />
<span style="color: #808080; font-style: italic;">#operators</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> op<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">'(?:[&lt;&gt;=]{1,2}|Between|Like|In)'</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#list item within (,)</span><br />
<span style="color: #808080; font-style: italic;">#eg: 'a', 23, a.b, &quot;asdfasdf\&quot;aasdf&quot;</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;(?:%s|%s)&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, gen_quote_str<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#a complete list, like</span><br />
<span style="color: #808080; font-style: italic;">#eg: (23, 24, 44), (&quot;regex&quot;, &quot;is&quot;, &quot;good&quot;)</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;<span style="color: #000099; font-weight: bold;">\(</span> <span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; %s <br />
&nbsp; &nbsp; (?:,<span style="color: #000099; font-weight: bold;">\s</span>* %s)* <span style="color: #000099; font-weight: bold;">\s</span>* <br />
<span style="color: #000099; font-weight: bold;">\)</span>&quot;&quot;&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#simple comparison</span><br />
<span style="color: #808080; font-style: italic;">#eg: a=15 , b&gt;23</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;%s <span style="color: #000099; font-weight: bold;">\s</span>* %s <span style="color: #000099; font-weight: bold;">\s</span>* (?:<span style="color: #000099; font-weight: bold;">\w</span>+| %s | %s )&quot;&quot;&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, op<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, gen_quote_str<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
<span style="color: #808080; font-style: italic;">#complex comparison</span><br />
<span style="color: #808080; font-style: italic;"># name like 'zhang%' and age&gt;23 and work in (&quot;hr&quot;, &quot;it&quot;, 'r&amp;d')</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> c<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;<br />
(?ix) %s <br />
(?:<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; (?:and|or)<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; %s &nbsp;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
)*<br />
&nbsp; &nbsp; &nbsp;&quot;&quot;&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;A:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>, a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;OP:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>, op<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;ITEM:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>, item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;ITEMS:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>, items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;S:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>, s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;C:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>, c<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>On my machine (Ubuntu 10.04, Python 2.6.5), the code produces the following output:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">A:&nbsp; <span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+<br />
OP: <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;</span>=<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span><br />
ITEM: &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_a<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_a<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
ITEMS:&nbsp; \<span style="color: black;">&#40;</span> \s<span style="color: #66cc66;">*</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_b<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_b<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:,\s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_c<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_c<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*</span> \s<span style="color: #66cc66;">*</span> <br />
\<span style="color: black;">&#41;</span><br />
S:&nbsp; <span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+ \s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;</span>=<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span> \s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:\w+| <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_d<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_d<span style="color: black;">&#41;</span> | \<span style="color: black;">&#40;</span> \s<span style="color: #66cc66;">*</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_e<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_e<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:,\s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_f<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_f<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*</span> \s<span style="color: #66cc66;">*</span> <br />
\<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span><br />
C:&nbsp; <br />
<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>ix<span style="color: black;">&#41;</span> <span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+ \s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;</span>=<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span> \s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:\w+| <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_g<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_g<span style="color: black;">&#41;</span> | \<span style="color: black;">&#40;</span> \s<span style="color: #66cc66;">*</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_h<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_h<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:,\s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_i<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_i<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*</span> \s<span style="color: #66cc66;">*</span> <br />
\<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> <br />
<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:\s<span style="color: #66cc66;">*</span><br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: #ff7700;font-weight:bold;">and</span>|or<span style="color: black;">&#41;</span>\s<span style="color: #66cc66;">*</span><br />
&nbsp; &nbsp; <span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+ \s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;</span>=<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span> \s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:\w+| <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_j<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_j<span style="color: black;">&#41;</span> | \<span style="color: black;">&#40;</span> \s<span style="color: #66cc66;">*</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_k<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_k<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:,\s<span style="color: #66cc66;">*</span> <span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>:<span style="color: black;">&#91;</span>\w.<span style="color: #66cc66;">`</span><span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P<span style="color: #66cc66;">&lt;</span>quote_l<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*?</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>P=quote_l<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">*</span> \s<span style="color: #66cc66;">*</span> <br />
\<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> &nbsp;\s<span style="color: #66cc66;">*</span><br />
<span style="color: black;">&#41;</span><span style="color: #66cc66;">*</span></div></td></tr></tbody></table></div>
<p>Here is a screenshot of the match:</p>
<p><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/2010-08-07_Selection_02.png" /></p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Arithmetic expressions</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
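<p>Earlier I said that, for simplicity, arithmetic expressions would not be considered here. Still, parsing arithmetic expressions is a very interesting topic that nearly every algorithms book covers (converting infix expressions to prefix form, and so on). It can also be described with regular expressions, at least when nested parentheses are left out.</p>
<p>The main idea is the following grammar:</p>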
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">expr -&gt; expr + term | expr - term | term<br />
term -&gt; term * factor | term / factor | factor<br />
factor -&gt; digit | ( expr )</div></td></tr></tbody></table></div>
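<p>And the code. It leaves out the parenthesized branch of factor and accepts decimal numbers rather than single digits:</p>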
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;math.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-07 00:44</span><br />
<br />
integer=r<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\d</span>+&quot;</span><br />
<br />
factor=r<span style="color: #483d8b;">&quot;%s (?:<span style="color: #000099; font-weight: bold;">\.</span> %s)?&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>integer, integer<span style="color: black;">&#41;</span><br />
<br />
term=r<span style="color: #483d8b;">&quot;%s(?: <span style="color: #000099; font-weight: bold;">\s</span>* [*/] <span style="color: #000099; font-weight: bold;">\s</span>* %s)* &quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>factor, factor<span style="color: black;">&#41;</span><br />
<br />
expr=r<span style="color: #483d8b;">&quot;(?x) %s(?: <span style="color: #000099; font-weight: bold;">\s</span>* [+-] <span style="color: #000099; font-weight: bold;">\s</span>* %s)* &quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>term, term<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">print</span> expr</div></td></tr></tbody></table></div>
<p>Here is its output, with a screenshot of the match:</p>
<p><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/2010-08-07_Selection_01.png"/></p>
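<p>As a rough check of the pattern built above, here is a minimal Python 3 sketch (the original listing is Python 2). The helper name <code>is_arithmetic</code> and the sample strings are assumptions made for illustration. It rebuilds the same pieces as raw strings and tests whole strings with <code>re.fullmatch</code>:</p>

```python
import re

# Same building blocks as the listing above, written as raw strings.
integer = r"\d+"
factor = r"%s (?: \. %s )?" % (integer, integer)
term = r"%s (?: \s* [*/] \s* %s )*" % (factor, factor)
# (?x) turns on free-spacing mode, so the blanks in the pieces are ignored.
expr = r"(?x) %s (?: \s* [+-] \s* %s )*" % (term, term)


def is_arithmetic(text):
    """Return True if the whole string is a parenthesis-free expression."""
    return re.fullmatch(expr, text) is not None


for sample in ["1 + 2*3.5 - 4/2", "3.14", "1 + ", "(1+2)*3"]:
    print(repr(sample), is_arithmetic(sample))
```

<p>The last sample is rejected because this pattern, like the listing above, has no branch for parentheses.</p>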
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Tips</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
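<li>If the problem can be solved without a complex regex, do not use one.</li>
<li>If you must write a fairly complex regex, follow the principles below.</li>
<li>Start with the big picture: understand the overall structure of the text to be parsed, and divide it into small components.</li>
<li>Then work on the details: implement each component so that it is complete and robust on its own and does not conflict with the others when placed in the whole.
</li>
<li>Assemble the components sensibly.</li>
<li>The benefit of divide and conquer: when one module is wrong and the rest are right, the error can be located quickly and the bug fixed.</li>
<li>Use capturing parentheses with care, unless you know what you are doing, what side effects they have, and whether a workable remedy exists. In a short regex, one or two redundant pairs of parentheses do no harm; in a complex regex, a single redundant pair can be a fatal error.</li>
<li>Use free-spacing mode whenever possible. It lets you add comments and whitespace freely, which makes the regex more readable.</li>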
</ul>
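<p>The last two tips can be illustrated with a small Python 3 sketch. The sample text and variable names are assumptions chosen for illustration. It shows one common side effect of a stray capturing group (<code>re.findall</code> returns the group instead of the whole match) and the same pattern written in free-spacing mode:</p>

```python
import re

text = "price: 12.50, tax: 1.25"

# With a capturing group, findall returns only what the group captured.
with_capture = re.findall(r"\d+(\.\d+)?", text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r"\d+(?:\.\d+)?", text)

# Free-spacing mode: whitespace is ignored and comments are allowed.
number = re.compile(r"""
    \d+            # integer part
    (?: \. \d+ )?  # optional fractional part, non-capturing
""", re.VERBOSE)
verbose_matches = number.findall(text)

print(with_capture)
print(without_capture)
print(verbose_matches)
```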
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/craft-complex-regex.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Formatting HTML tag indentation</title>
		<link>http://iregex.org/blog/html-tag-indentation.html</link>
		<comments>http://iregex.org/blog/html-tag-indentation.html#comments</comments>
		<pubDate>Wed, 04 Aug 2010 05:06:40 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Q&A]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tag]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=137</guid>
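		<description><![CDATA[Reader “神の呼出” left a comment asking how to format HTML tag indentation, and shared his own approach and solution, which relies purely on regex. For example, finding paired tags requires backreferences, and nested tags call for recursion. However, although these two features... ]]></description>
			<content:encoded><![CDATA[<p>Reader “神の呼出” left a <a href="http://iregex.org/blog/recursive-regex-in-php.html">comment</a> asking how to format HTML tag indentation, and shared his own approach and solution, which relies purely on regex. For example, finding paired tags requires backreferences, and nested tags call for recursion. Both features are useful, but they should not be overused. This post approaches the problem from another angle, simplifying the logic and reducing the reliance on regex in order to improve speed.</p>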
<p><span id="more-137"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Problem description</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Overall goal: format HTML text so that the output is indented by tag level.</li>
<li>Tags are balanced, for example <code class="codecolorer html4strict default"><span class="html4strict"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">div</span>&gt;&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">div</span>&gt;</span></span></code>.</li>
<li>Some tags may not come in pairs, for example <code class="codecolorer html4strict default"><span class="html4strict"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">img</span> ... <span style="color: #66cc66;">/</span>&gt;</span></span></code>.</li>
<li>Tags may be nested, for example:
<div class="codecolorer-container html4strict mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br /></div></td><td><div class="html4strict codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">div</span>&gt;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">div</span>&gt;&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">div</span>&gt;</span><br />
<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">div</span>&gt;</span></div></td></tr></tbody></table></div>
</li>
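<li>The whole text may be given as a single line, without line breaks, horizontal tabs, or other such whitespace.</li>
<li>In the output, each new pair of tags indents one more level.</li>
<li>The innermost opening and closing tags sit at the same level.</li>
<li>All text sits at the same level as its parent tag.</li>
<li>Standalone elements are indented at the same level as their siblings.</li>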
</ul>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Approach</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>I will mainly explain the approach and give a Python implementation. The regexes involved are all simple and port easily to other languages.</p>
<ul>
<li>Because the source text is a single line (the default case), line breaks must be inserted at suitable places. My approach is to insert a line break before every <strong>left angle bracket that is not at the start of a line</strong>. Multiline mode is needed so that <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">^</span></span></code> can match line starts inside the string. The regex is <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!^</span><span style="color: #009900;">&#41;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?=&lt;</span><span style="color: #009900;">&#41;</span></span></code>. Note that it also removes any whitespace to the left of the bracket. In practice it handles multi-line input as well as single-line input.</li>
<li>Likewise, insert a line break to the right of every <strong>right angle bracket that is not at the end of a line</strong>. The regex is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#40;</span><span style="color: #66cc66;">?&lt;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#41;</span>\s<span style="color: #66cc66;">*</span><span style="color: black;">&#40;</span><span style="color: #66cc66;">?</span>=\S<span style="color: black;">&#41;</span></span></code>. In the same way, it also removes any whitespace to the right of the bracket.</li>
<li>Now process the text one <strong>line</strong> at a time.</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
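<li>Set a level variable, level, with an initial value of 0.</li>
<li>When the indentation moves one step to the right, level++; when it moves back to the left, level--.</li>
<li>If the line contains <code class="codecolorer text default"><span class="text">&quot;/&gt;&quot;</span></code>, it is a standalone unit and the level does not change: output level indentation marks followed by the line.</li>
<li>Otherwise, if the line starts with <code class="codecolorer text default"><span class="text">&quot;&lt;/&quot;</span></code>, it marks the end of a level: do level-- first, then output the line. The order matters. </li>
<li>Otherwise, if the line starts with <code class="codecolorer text default"><span class="text">&quot;&lt;&quot;</span></code>, it marks the start of a level: output the line first, then do level++, so that the opening tag lines up with its closing tag. The order matters here too. </li>
<li>In all other cases the line is plain text: it takes the indentation of the current level and is output as is.</li>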
</ul>
</blockquote>
</ul>
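<p>The two line-splitting substitutions described above can be tried on their own. This is a minimal Python 3 sketch; the function name <code>split_tags</code> and the sample input are assumptions for illustration, and the blank-line filter is an addition that is not in the original script:</p>

```python
import re


def split_tags(content):
    """Put every tag and every text run on its own line."""
    # Line break before each "<" that is not already at a line start;
    # whitespace to the left of the bracket is swallowed.
    content = re.sub(r"(?m)(?<!^)\s*(?=<)", "\n", content)
    # Line break after each ">" that is followed by more content;
    # whitespace to the right of the bracket is swallowed.
    content = re.sub(r"(?<=>)\s*(?=\S)", "\n", content)
    # Drop blank lines that the substitutions may leave behind.
    return [line for line in content.splitlines() if line.strip()]


for line in split_tags("<div><p>hello</p>  <img src='a.png' /></div>"):
    print(line)
```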
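<p>That is the whole program. To handle more complex cases, statements can be added or removed as needed. For example, some of the text I process contains tags so long that they are written across several lines:</p>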
<div class="codecolorer-container html4strict mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br /></div></td><td><div class="html4strict codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #00bbdd;">&lt;!DOCTYPE </span><br />
<span style="color: #00bbdd;">&nbsp; &nbsp; html </span><br />
<span style="color: #00bbdd;">&nbsp; &nbsp; PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; </span><br />
<span style="color: #00bbdd;">&nbsp; &nbsp; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;</span></div></td></tr></tbody></table></div>
<p>For this case I have to fall back on a pure regex solution. It is not fast, but it neither repeats nor misses anything:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#combine &lt; \n..&gt; lines</span><br />
x=<span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span>,content<span style="color: black;">&#41;</span> <br />
<span style="color: #ff7700;font-weight:bold;">while</span> x: <br />
&nbsp; &nbsp; content=content.<span style="color: black;">replace</span><span style="color: black;">&#40;</span>x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>,x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot; &quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; x=<span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span>,content<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>For this case, I cannot think of a solution that avoids regex.
</p></blockquote>
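<p>As an alternative to the search-and-replace loop above, the same joining step can be sketched as a single pass in Python 3. This is not the author's method: the function name <code>join_wrapped_tags</code> is hypothetical, and unlike the loop it only touches tags that have a closing <code>&gt;</code>:</p>

```python
import re


def join_wrapped_tags(content):
    """Collapse line breaks inside a tag into single spaces."""
    def collapse(match):
        # Inside one tag, replace each line break and the whitespace
        # around it with a single space.
        return re.sub(r"\s*\n\s*", " ", match.group(0))

    # A tag is "<", then anything except angle brackets, then ">".
    return re.sub(r"<[^<>]+>", collapse, content)


doctype = '<!DOCTYPE \n    html \n    PUBLIC "x">\n<p>\nhi\n</p>'
print(join_wrapped_tags(doctype))
```

<p>Line breaks outside of tags are left alone, so the text between tags keeps its original layout.</p>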
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Python code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
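<p>The implementation is the minor part. Once the approach above is understood, it is easy to port to other languages such as JS or PHP; that is left to the reader. If you have questions, please leave a comment.</p>
<p>The Python script is used like this: <code class="codecolorer bash default"><span class="bash">&nbsp;.<span style="color: #000000; font-weight: bold;">/</span>format_html.py source.html<span style="color: #000000; font-weight: bold;">&gt;</span> dest.html</span></code></p>
<p>Full code:</p>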
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;format_html.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-04</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span>,<span style="color: #dc143c;">sys</span><br />
<br />
indent=<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><br />
f=<span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
content=f.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
content=content.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#combine &lt; \n..&gt; lines</span><br />
x=<span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span>,content<span style="color: black;">&#41;</span> <br />
<span style="color: #ff7700;font-weight:bold;">while</span> x: <br />
&nbsp; &nbsp; content=content.<span style="color: black;">replace</span><span style="color: black;">&#40;</span>x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>,x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot; &quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; x=<span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span>,content<span style="color: black;">&#41;</span><br />
<br />
content=<span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?m)(?&lt;!^)<span style="color: #000099; font-weight: bold;">\s</span>*(?=&lt;)&quot;</span>,<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, content<span style="color: black;">&#41;</span><br />
content=<span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?&lt;=&gt;)<span style="color: #000099; font-weight: bold;">\s</span>*(?=<span style="color: #000099; font-weight: bold;">\S</span>)&quot;</span>,<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, content<span style="color: black;">&#41;</span><span style="color: #66cc66;">;</span><br />
lines=content.<span style="color: black;">splitlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
level=<span style="color: #ff4500;">0</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> l <span style="color: #ff7700;font-weight:bold;">in</span> lines: <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #483d8b;">&quot;/&gt;&quot;</span> <span style="color: #ff7700;font-weight:bold;">in</span> l:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%s%s&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span>indent<span style="color: #66cc66;">*</span>level,l<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">elif</span> l<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>==<span style="color: #483d8b;">'&lt;/'</span> : &nbsp;<br />
&nbsp; &nbsp; &nbsp; &nbsp; level -=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%s%s&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span>indent<span style="color: #66cc66;">*</span>level,l<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">elif</span> l<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>==<span style="color: #483d8b;">'&lt;'</span>: <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%s%s&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span>indent<span style="color: #66cc66;">*</span>level,l<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; level +=<span style="color: #ff4500;">1</span> <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%s%s&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span>indent<span style="color: #66cc66;">*</span>level,l<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
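<p>For readers on Python 3, the script above can be sketched as an importable function. This is an adaptation under stated assumptions, not the original script: the name <code>format_html</code>, the <code>indent</code> parameter, and the blank-line filter are additions, while the regexes and the level logic follow the listing above.</p>

```python
import re


def format_html(content, indent="\t"):
    """Return content with one tag or text run per line, indented by depth."""
    content = content.strip()

    # Join tags that were wrapped across several lines.
    wrapped = re.compile(r"(<[^<>]+)\s*\n\s*")
    while wrapped.search(content):
        content = wrapped.sub(lambda m: m.group(1) + " ", content)

    # One tag or text run per line.
    content = re.sub(r"(?m)(?<!^)\s*(?=<)", "\n", content)
    content = re.sub(r"(?<=>)\s*(?=\S)", "\n", content)
    # Drop blank lines that the substitutions may leave behind.
    lines = [line for line in content.splitlines() if line.strip()]

    level = 0
    out = []
    for line in lines:
        if "/>" in line:               # standalone tag: level unchanged
            out.append(indent * level + line)
        elif line.startswith("</"):    # closing tag: step back, then output
            level -= 1
            out.append(indent * level + line)
        elif line.startswith("<"):     # opening tag: output, then step in
            out.append(indent * level + line)
            level += 1
        else:                          # plain text: current level
            out.append(indent * level + line)
    return "\n".join(out)


print(format_html('<div><div></div><img src="a.png" />text</div>', indent="  "))
```

<p>The default indent is a tab, as in the script; passing two spaces just makes the printed result easier to read.</p>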
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/html-tag-indentation.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Trie in Python</title>
		<link>http://iregex.org/blog/trie-in-python.html</link>
		<comments>http://iregex.org/blog/trie-in-python.html#comments</comments>
		<pubDate>Sun, 01 Aug 2010 14:58:54 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[trie]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=136</guid>
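		<description><![CDATA[For an introduction to Trie, see the previous post, Trie; it is not repeated here. This post analyzes how Trie works and gives a Python implementation. Building the retrieval tree First, a correction to an imprecise statement in the previous post, which said: Specifically, it extracts the common parts of the alternatives,... ]]></description>
			<content:encoded><![CDATA[<p>For an introduction to Trie, see the previous post, <a href="http://iregex.org/blog/trie.html" title="我爱正则表达式" target="_blank">Trie</a>; it is not repeated here. This post analyzes how Trie works and gives a Python implementation.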
</p>
<p><span id="more-136"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Building the retrieval tree</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>First, a correction to an imprecise statement in the previous post, which said:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Specifically, it extracts the common parts of the alternatives and builds a “retrieval tree”…</p></blockquote>
<p>In fact, Trie does not extract <strong>all</strong> the “common parts”; it extracts only common prefixes. For example, for the regex <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>abbcc<span style="color: #339933;">|</span>abcc<span style="color: #339933;">/</span></span></code>, it produces <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?-</span>xism<span style="color: #339933;">:</span>ab<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>bcc<span style="color: #339933;">|</span>cc<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></span></code> rather than <code class="codecolorer perl default"><span class="perl">abb<span style="color: #339933;">?</span>cc</span></code>. So it is not all that smart: use it, but do not trust it blindly. The reason becomes clear from reading the source code and the analysis in this post.
    </p>
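<p>To see that the two forms differ only in shape and not in what they accept, here is a minimal Python 3 sketch that runs both against a few strings. The candidate list and the helper name <code>accepted</code> are made up for illustration, and Perl's <code>(?-xism:...)</code> wrapper is dropped on the assumption that it does not change which strings match here.</p>

```python
import re

# The prefix-factored form quoted above, without Perl's (?-xism:...) wrapper
# (assumption: the wrapper does not affect which strings match here).
prefix_only = re.compile(r"ab(?:bcc|cc)")
# The tighter hand-written form.
hand_written = re.compile(r"abb?cc")

# Hypothetical test strings: the two real alternatives plus some near misses.
candidates = ["abbcc", "abcc", "abbbcc", "acc", "abc", "abbc"]


def accepted(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if pattern.fullmatch(w)]


print(accepted(prefix_only, candidates))
print(accepted(hand_written, candidates))
```

<p>Both calls print the same two strings, <code>abbcc</code> and <code>abcc</code>, so the prefix-only result is correct, just less compact than the hand-written one.</p>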
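<p>After a new Trie object is created, each string added to it is processed as follows:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>The overall data structure is a hash table, i.e. a dictionary in Python.</li>
<li>Each character of the input string serves as a key. If the key already exists, its existing value is reused; if not, the key is pointed at a newly created anonymous hash. The reference then moves down into that hash.</li>
<li>The last element has the empty string '' as its key and 1 as its value. This is also the marker that tells whether a branch has ended.</li>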
</ul>
</blockquote>
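<p>The three steps above can be sketched in Python with plain nested dictionaries. This is an illustration only, not the module's code: the function name <code>add_word</code> is hypothetical, and <code>dict.setdefault</code> is simply one convenient way to say "reuse the child if it exists, otherwise create it".</p>

```python
def add_word(trie, word):
    """Insert word into a nested-dict trie, reusing existing branches."""
    node = trie
    for char in word:
        # setdefault returns the existing child, or stores and returns a new one
        node = node.setdefault(char, {})
    node[""] = 1  # the empty-string key marks the end of a word
    return trie


trie = add_word({}, "foobar")
print(trie)
```

<p>Calling <code>add_word</code> again on the same dictionary with a word that shares a prefix only adds keys where the new word diverges from what is already stored.</p>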
<p>Here is the data structure generated for the string <code class="codecolorer text default"><span class="text">foobar</span></code>:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">{<br />
'f' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'b' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'r' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp;}<br />
}</div></td></tr></tbody></table></div>
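<p>Pretty, isn't it? What is neat is that this is only the structure generated for the first string. The second, third, ... Nth strings inserted afterwards do not start over from scratch; they follow the path already laid down and slot in wherever there is room, making full use of the structure built so far. This is thanks to the nature of the hash/dictionary data structure. Here is the result after <code class="codecolorer text default"><span class="text">foobar foobah fooxar foozap fooza</span></code> have all been inserted:</p>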
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">{<br />
&nbsp; 'f' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'b' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'h' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; },<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'r' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; },<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'x' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'r' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; },<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'z' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'' =&gt; 1,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'p' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
}</div></td></tr></tbody></table></div>
<p>This diagram shows quite directly why it is prefixes that get extracted, rather than suffixes or infixes.</p>
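<p>A small sketch, again with nested dictionaries, makes the point concrete: walking down from the root while there is only one way forward recovers the shared prefix, whereas the shared ending <code>ar</code> sits in two separate branches. The helper names <code>build</code> and <code>common_prefix</code> are assumptions of this sketch, not part of the module.</p>

```python
def build(words):
    """Build a nested-dict trie; the '' key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = 1
    return root


def common_prefix(trie):
    """Follow the trie from the root while exactly one branch continues."""
    prefix = []
    node = trie
    while len(node) == 1 and "" not in node:
        char, child = next(iter(node.items()))
        prefix.append(char)
        node = child
    return "".join(prefix)


trie = build(["foobar", "foobah", "fooxar", "foozap", "fooza"])
print(common_prefix(trie))            # foo

# The ending "ar" of foobar and fooxar is stored twice, once per branch.
after_foo = trie["f"]["o"]["o"]
print(sorted(after_foo["b"]["a"]))    # ['h', 'r']
print(sorted(after_foo["x"]["a"]))    # ['r']
```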
<p>Building the retrieval tree is no longer a problem. Now let's see how to convert it into a regular expression.</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The Retrieval Tree as a Regular Expression</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
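<p>The original source is quite terse: it uses a lot of Perl-specific syntax and has no comments. Here is a brief explanation of the author's approach.</p>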
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>If the current node has the empty-string key and that is its only key, the branch ends here and nothing is returned. This is the payoff of the setup mentioned earlier: it is why <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">$ref</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">''</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span></span></code> is added. </li>
<li>Otherwise, each key of the node is examined in turn.</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>If the value under a key is itself a hash, the function calls itself on it; when that call returns a pattern, the current key plus the returned pattern is pushed onto the array <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@alt</span></span></code> for later use. </li>
<li>If the call returns nothing, only the key is pushed onto <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@cc</span></span></code>. </li>
</ul>
</blockquote>
<li><code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@cc</span></span></code> holds single characters, while <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@alt</span></span></code> holds multi-character alternatives. </li>
<li>The elements of <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@cc</span></span></code> are formatted as <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span>abc<span style="color: #009900;">&#93;</span></span></code>. With only one element, of course, the brackets are unnecessary. </li>
<li>The elements of <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@alt</span></span></code> are formatted as <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>abc<span style="color: #339933;">|</span>xyz<span style="color: #009900;">&#41;</span></span></code>; with a single element the group is omitted.
        </li>
<li>A question mark is added where appropriate to mark the part as optional.</li>
</ul>
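<p>The last three steps in the list above (character class, alternation group, optional marker) can be tried in isolation. The following is a rough Python sketch under my own naming: <code>combine</code>, <code>cc_only</code> and <code>optional</code> are hypothetical names, and the function only mirrors the description, not the module's actual source.</p>

```python
def combine(cc, alt, optional):
    """Format one trie node from its collected branches.

    cc       -- keys whose subtree ends immediately (single characters)
    alt      -- key + sub-pattern strings from deeper branches
    optional -- True when the node itself also ends a word
    """
    alt = list(alt)
    cc_only = not alt  # decided before cc is folded into alt
    if len(cc) == 1:
        alt.append(cc[0])                    # one character: no brackets
    elif cc:
        alt.append("[" + "".join(cc) + "]")  # several: character class
    if len(alt) == 1:
        result = alt[0]                      # one alternative: no group
    else:
        result = "(?:" + "|".join(alt) + ")"
    if optional:
        # a bare character or class takes "?" directly; anything longer
        # has to be grouped first so that "?" applies to all of it
        result = result + "?" if cc_only else "(?:" + result + ")?"
    return result


print(combine(["h", "r"], [], False))                 # [hr]
print(combine(["p"], [], True))                       # p?
print(combine([], ["ba[hr]", "xar", "zap?"], False))  # (?:ba[hr]|xar|zap?)
print(combine([], ["b[cd]"], True))                   # (?:b[cd])?
```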
<p>Once the source is understood, implementing it yourself is no longer a problem. While reading this code I copied it out with pen and paper and watched the output of <code class="codecolorer perl default"><span class="perl"><span style="color: #000066;">print</span></span></code> and <code class="codecolorer perl default"><span class="perl">Data<span style="color: #339933;">::</span><span style="color: #006600;">Dumper</span></span></code> to help me follow it. It proved very effective.</p>
</blockquote>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Porting to Python</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
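<p>Going from Perl to Python is much like going from English to French: adjust the spelling and some fine points of grammar and you are done; it is hardly major surgery. I used the port to check my understanding and to get familiar with syntax details. The code follows. Just <code class="codecolorer python default"><span class="python"><span style="color: #ff7700;font-weight:bold;">from</span> trie <span style="color: #ff7700;font-weight:bold;">import</span> Trie</span></code> to use it.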
</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;trie.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-01 20:24</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">class</span> Trie<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;&quot;Regexp::Trie in python&quot;&quot;&quot;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">data</span>=<span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> add<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, word<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; ref=<span style="color: #008000;">self</span>.<span style="color: black;">data</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> word:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref.<span style="color: black;">setdefault</span><span style="color: black;">&#40;</span>char, <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span> &nbsp;<span style="color: #808080; font-style: italic;">#reuse the existing child node, or create an empty one</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref=ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span><span style="color: #483d8b;">''</span><span style="color: black;">&#93;</span>=<span style="color: #ff4500;">1</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> dump<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">data</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> _regexp<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, pData<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; data=pData<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #483d8b;">&quot;&quot;</span> <span style="color: #ff7700;font-weight:bold;">in</span> data <span style="color: #ff7700;font-weight:bold;">and</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>==<span style="color: #ff4500;">1</span>: &nbsp;<span style="color: #808080; font-style: italic;">#only the end marker is left: this branch is finished</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">None</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; alt=<span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; cc=<span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; q=<span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>data.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span>,<span style="color: #008000;">dict</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; recurse=<span style="color: #008000;">self</span>._regexp<span style="color: black;">&#40;</span>data<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> recurse <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span>: &nbsp;<span style="color: #808080; font-style: italic;">#the child is a bare end marker, so char is a single-character branch</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cc.<span style="color: black;">append</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt.<span style="color: black;">append</span><span style="color: black;">&#40;</span>char+recurse<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; q=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; cconly=<span style="color: #ff7700;font-weight:bold;">not</span> alt &nbsp;<span style="color: #808080; font-style: italic;">#True when there are no multi-character branches; must be set before cc is merged into alt</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#41;</span><span style="color: #66cc66;">&gt;</span><span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#41;</span>==<span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt.<span style="color: black;">append</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'['</span>+<span style="color: #483d8b;">''</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#41;</span>+<span style="color: #483d8b;">']'</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>alt<span style="color: black;">&#41;</span>==<span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result=alt<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result=<span style="color: #483d8b;">&quot;(?:&quot;</span>+<span style="color: #483d8b;">&quot;|&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>alt<span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot;)&quot;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> q:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> cconly:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result+=<span style="color: #483d8b;">&quot;?&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result=<span style="color: #483d8b;">&quot;(?:%s)?&quot;</span> <span style="color: #66cc66;">%</span> result<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> result <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> regexp<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;(?-xism:%s)&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #008000;">self</span>._regexp<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">dump</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__==<span style="color: #483d8b;">'__main__'</span>: &nbsp;<span style="color: #808080; font-style: italic;">#run the demo only when executed directly, not on import</span><br />
&nbsp; &nbsp; a=Trie<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> w <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #483d8b;">'foobar'</span>, <span style="color: #483d8b;">'foobah'</span>, <span style="color: #483d8b;">'fooxar'</span>, <span style="color: #483d8b;">'foozap'</span>, <span style="color: #483d8b;">'fooza'</span><span style="color: black;">&#93;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; a.<span style="color: black;">add</span><span style="color: black;">&#40;</span>w<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>a.<span style="color: black;">regexp</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>Tested on Ubuntu 10.04 with Python 2.6.5.</p>
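<p>One way to sanity-check a generated pattern is to compile it with Python's <code>re</code> module and confirm that it fully matches every input word and nothing nearby. The sketch below assumes Python 3 (for <code>fullmatch</code>) and hard-codes the pattern body that the word list above should produce. The <code>(?-xism:...)</code> wrapper is swapped for a plain non-capturing group, and the <code>near_misses</code> list is invented for the check.</p>

```python
import re

words = ["foobar", "foobah", "fooxar", "foozap", "fooza"]

# Pattern body expected for this word list; the Perl-style (?-xism:...)
# wrapper is replaced by a plain (?:...) group (an assumption of this sketch).
pattern = re.compile(r"(?:foo(?:ba[hr]|xar|zap?))")

# Hypothetical strings that are close to the inputs but must not match.
near_misses = ["foo", "fooba", "foobaz", "fooxa", "foozapp", "xfoobar"]

matched = [w for w in words if pattern.fullmatch(w)]
rejected = [w for w in near_misses if not pattern.fullmatch(w)]

print(matched == words)         # every input word is accepted
print(rejected == near_misses)  # every near miss is refused
```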
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/trie-in-python.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Trie</title>
		<link>http://iregex.org/blog/trie.html</link>
		<comments>http://iregex.org/blog/trie.html#comments</comments>
		<pubDate>Sun, 01 Aug 2010 00:51:16 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[trie]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=133</guid>
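		<description><![CDATA[I learned about a module from "Effective Perl": Regexp::Trie. It is a regex-optimization module; specifically, it extracts the common parts of the alternatives' text and builds a "retrieval tree" so as to minimize backtracking and improve efficiency. I tried it... ]]></description>
			<content:encoded><![CDATA[<p>I learned about a module from "<a href="http://book.douban.com/subject/4073062/">Effective Perl</a>": <a href="http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm">Regexp::Trie</a>. It is a regex-optimization module; specifically, it extracts the common parts of the alternatives' text and builds a "retrieval tree" so as to minimize backtracking and improve efficiency.</p>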
<p><span id="more-133"></span></p>
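<p>I tried it out. On simple examples its results are stunning; on complex ones the optimizer's "robotic" character shows, and the result feels less efficient and less targeted than a hand-written pattern. See the example:</p>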
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<span style="color: #666666; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #666666; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #666666; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;a.pl</span><br />
<span style="color: #666666; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-01 </span><br />
<br />
<span style="color: #000000; font-weight: bold;">use</span> Regexp<span style="color: #339933;">::</span><span style="color: #006600;">Trie</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@simple</span><span style="color: #339933;">=</span><span style="color: #009966; font-style: italic;">qw/foobar foobah fooxar foozap fooza/</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@complex</span><span style="color: #339933;">=</span><span style="color: #009966; font-style: italic;">qw/AuthenticViagra BestDealsOnViagra BestMaleTablets GenuineOnlineViagra GreatViagraShop OriginalViagra TopViagraShop VIAGRA ViagraBestOnlineShop ViagraBrandReseller ViagraPills onlineshop/</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;">#the simple one</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$t_s</span><span style="color: #339933;">=</span>Regexp<span style="color: #339933;">::</span><span style="color: #006600;">Trie</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@simple</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$t_s</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">add</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#the complex one</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$t_c</span><span style="color: #339933;">=</span>Regexp<span style="color: #339933;">::</span><span style="color: #006600;">Trie</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@complex</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$t_c</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">add</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;simple regex trie: &quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$t_s</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">regexp</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;complex regex trie: &quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$t_c</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">regexp</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>Its output is:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">simple regex trie<span style="color: #339933;">:</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">?-</span>xism<span style="color: #339933;">:</span>foo<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>ba<span style="color: #009900;">&#91;</span>hr<span style="color: #009900;">&#93;</span><span style="color: #339933;">|</span>xar<span style="color: #339933;">|</span>zap<span style="color: #339933;">?</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
complex regex trie<span style="color: #339933;">:</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">?-</span>xism<span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>AuthenticViagra<span style="color: #339933;">|</span>Best<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>DealsOnViagra<span style="color: #339933;">|</span>MaleTablets<span style="color: #009900;">&#41;</span><span style="color: #339933;">|</span>G<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>enuineOnlineViagra<span style="color: #339933;">|</span>reatViagraShop<span style="color: #009900;">&#41;</span><span style="color: #339933;">|</span>OriginalViagra<span style="color: #339933;">|</span>TopViagraShop<span style="color: #339933;">|</span>V<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>IAGRA<span style="color: #339933;">|</span>iagra<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>B<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>estOnlineShop<span style="color: #339933;">|</span>randReseller<span style="color: #009900;">&#41;</span><span style="color: #339933;">|</span>Pills<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">|</span>onlineshop<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></div></td></tr></tbody></table></div>
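<p>The second example shows that the trie optimizes on individual characters rather than on whole words. Even so, it saves the programmer a great deal of manual work. An unoptimized alternation is obviously inefficient; once it has been run through the trie, which merges the shared prefixes, its efficiency takes a qualitative leap, and hand tuning adds only a small improvement on top of that leap. Programmers who find regular expressions a headache can skip hand optimization altogether and simply use this module. Then again, programmers who use Perl probably don't find regular expressions a headache, do they?</p>
<p>Perl 5.10 and later have a similar trie-style optimization built in, so optimizing by hand is no longer necessary. Still, I was curious about how the module is implemented:</p>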
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000066;">package</span> Regexp<span style="color: #339933;">::</span><span style="color: #006600;">Trie</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">use</span> <span style="color: #cc66cc;">5.008001</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">use</span> strict<span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">use</span> warnings<span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">our</span> <span style="color: #0000ff;">$VERSION</span> <span style="color: #339933;">=</span> <span style="color: #000066;">sprintf</span> <span style="color: #ff0000;">&quot;%d.%02d&quot;</span><span style="color: #339933;">,</span> <span style="color: #000066;">q</span><span style="color: #0000ff;">$Revision</span><span style="color: #339933;">:</span> <span style="color: #cc66cc;">0.2</span> $ <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/(\d+)/g</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #666666; font-style: italic;"># use overload q(&quot;&quot;) =&gt; sub { shift-&gt;regexp };</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> <span style="color: #000000; font-weight: bold;">new</span><span style="color: #009900;">&#123;</span> <span style="color: #000066;">bless</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000066;">shift</span> <span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> add<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$self</span> <span style="color: #339933;">=</span> <span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$str</span> &nbsp;<span style="color: #339933;">=</span> <span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$ref</span> &nbsp;<span style="color: #339933;">=</span> <span style="color: #0000ff;">$self</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">for</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$char</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">split</span> <span style="color: #339933;">//,</span> <span style="color: #0000ff;">$str</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$ref</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$char</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">||=</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$ref</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$ref</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$char</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$ref</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">''</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;"># { '' =&gt; 1 } as terminator</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$self</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> _regexp<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$self</span> <span style="color: #339933;">=</span> <span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$self</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">''</span><span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">and</span> <span style="color: #000066;">scalar</span> <span style="color: #000066;">keys</span> <span style="color: #0000ff;">%$self</span> <span style="color: #339933;">==</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;"># terminator</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@alt</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">@cc</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$q</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">for</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$char</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">sort</span> <span style="color: #000066;">keys</span> <span style="color: #0000ff;">%$self</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$qchar</span> <span style="color: #339933;">=</span> <span style="color: #000066;">quotemeta</span> <span style="color: #0000ff;">$char</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">ref</span> <span style="color: #0000ff;">$self</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$char</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">defined</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$recurse</span> <span style="color: #339933;">=</span> _regexp<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$self</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$char</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">push</span> <span style="color: #0000ff;">@alt</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$qchar</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">$recurse</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><span style="color: #b1b100;">else</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">push</span> <span style="color: #0000ff;">@cc</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$qchar</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><span style="color: #b1b100;">else</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$q</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$cconly</span> <span style="color: #339933;">=</span> <span style="color: #339933;">!</span><span style="color: #0000ff;">@alt</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">@cc</span> <span style="color: #b1b100;">and</span> <span style="color: #000066;">push</span> <span style="color: #0000ff;">@alt</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">@cc</span> <span style="color: #339933;">==</span> <span style="color: #cc66cc;">1</span> <span style="color: #339933;">?</span> <span style="color: #0000ff;">$cc</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">'['</span><span style="color: #339933;">.</span> <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">''</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">@cc</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span> <span style="color: #ff0000;">']'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$result</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">@alt</span> <span style="color: #339933;">==</span> <span style="color: #cc66cc;">1</span> <span style="color: #339933;">?</span> <span style="color: #0000ff;">$alt</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">'(?:'</span> <span style="color: #339933;">.</span> <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'|'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">@alt</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #ff0000;">')'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$q</span> <span style="color: #b1b100;">and</span> <span style="color: #0000ff;">$result</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$cconly</span> <span style="color: #339933;">?</span> <span style="color: #ff0000;">&quot;$result?&quot;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">&quot;(?:$result)?&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$result</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> regexp<span style="color: #009900;">&#123;</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$str</span> <span style="color: #339933;">=</span> shift<span style="color: #339933;">-&gt;</span>_regexp<span style="color: #339933;">;</span> <span style="color: #009966; font-style: italic;">qr/$str/</span> <span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
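<p>For readers who do not use Perl, the same idea can be sketched in Python 3. This is only a rough port written for illustration, not part of Regexp::Trie: the function names <code>add</code> and <code>to_pattern</code> are made up here, and the word list is an assumption chosen so that the result matches the first pattern shown earlier.</p>

```python
import re


def add(trie, word):
    """Insert one word into a nested-dict trie, one character per level."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[""] = True  # the empty key marks "a word ends here"


def to_pattern(node):
    """Return regex source for a trie node, or None for a bare terminator."""
    if "" in node and len(node) == 1:
        return None
    alternatives, singles, optional = [], [], False
    for ch in sorted(node):
        if ch == "":
            optional = True  # a word may end here, so what follows is optional
            continue
        tail = to_pattern(node[ch])
        if tail is None:
            singles.append(re.escape(ch))  # leaf: goes into a character class
        else:
            alternatives.append(re.escape(ch) + tail)
    only_singles = not alternatives
    if singles:
        # one leaf stays a plain character, several become a class like [hr]
        alternatives.append(
            singles[0] if len(singles) == 1 else "[" + "".join(singles) + "]"
        )
    if len(alternatives) == 1:
        result = alternatives[0]
    else:
        result = "(?:" + "|".join(alternatives) + ")"
    if optional:
        result = result + "?" if only_singles else "(?:" + result + ")?"
    return result


# Assumed word list, picked to reproduce the simple example's pattern.
words = ["foobar", "fooxar", "foozap", "fooza", "foobah"]
trie = {}
for word in words:
    add(trie, word)

pattern = to_pattern(trie)
print(pattern)  # foo(?:ba[hr]|xar|zap?)
print(all(re.fullmatch(pattern, word) for word in words))
```

<p>The nested dictionary plays the role of the blessed hash in the Perl code, and the empty-string key is the same terminator trick. Leaves that share a parent are collected into a character class, which is why the merging happens per character rather than per word.</p>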
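<p>Skimming through it reminded me of an earlier post of mine, "<a href="http://iregex.org/blog/text-2-regular-expressions-again.html" target="_blank" title="我爱正则表达式">Second Thoughts on Extracting Regular Expressions from Plain Text</a>". Mine, though, is amateur work next to the professional Trie, haha. I will study this code over the next couple of days and post anything worth sharing.</p>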
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/trie.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Notes on Matching Chinese with Regular Expressions in Python</title>
		<link>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html</link>
		<comments>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html#comments</comments>
		<pubDate>Sun, 27 Jun 2010 03:50:41 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[cjk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=129</guid>
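		<description><![CDATA[A summary of what I have learned about matching Chinese with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python. From a string-handling point of view, Chinese is less tidy and regular than English; that is an unavoidable fact. This post draws on material found online and... ]]></description>
			<content:encoded><![CDATA[<p>A summary of what I have learned about matching Chinese with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python.</p>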
<p><span id="more-129"></span></p>
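<p>From a string-handling point of view, Chinese is less tidy and regular than English; that is an unavoidable fact. Drawing on material found online and on my own experience, this post gives a brief summary, using Python as the example language. Additions and corrections are welcome.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A few tips</h3>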
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>You can use the <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function to inspect the raw form of a string, which helps when writing regular expressions.
            </li>
<li>Python's <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span></span></code> module has two similar functions: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span>.<span style="color: black;">match</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. They match in exactly the same way and differ only in where they may start. <code class="codecolorer python default"><span class="python">match</span></code> tries only at the beginning of the string and gives up if that fails, whereas <code class="codecolorer python default"><span class="python">search</span></code> doggedly tries every possible position in the string until it either finds a match or reaches the end and fails. If you understand how <code class="codecolorer python default"><span class="python">match</span></code> behaves (it is faster in some cases), feel free to use it; if you are unsure, <code class="codecolorer python default"><span class="python">search</span></code> is usually the function you want.</li>
<li>To find every match in a body of text and get the results back as a list, use <code class="codecolorer python default"><span class="python">findall<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. The sample program later in this post shows an example.</li>
<li>In <code class="codecolorer python default"><span class="python">utf8</span></code>, each common Chinese character occupies 3 bytes, so a byte-level pattern for one character is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#91;</span>\x80-\xff<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#125;</span></span></code>. You probably know this already. Note that this class matches any three non-ASCII bytes, not Chinese alone.</li>
<li>In <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, a Chinese character has the form <code class="codecolorer python default"><span class="python">\uXXXX</span></code>. Once you know the code-point range of a script, you can match the corresponding text, which makes it easy to pick one language out of multilingual text. For an agglutinative language such as Japanese, however, whose text mixes Chinese characters with hiragana and katakana, the results may be somewhat off.</li>
<li>Several character ranges can be combined in one class. For example, hiragana, katakana and Chinese can be put together as <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span></span></code>, so you can define exactly the text you want to match.</li>
<li>When matching Chinese, the regular expression and the target string must be in the same format. This is crucial. Either both are the default <code class="codecolorer python default"><span class="python">utf8</span></code>, in which case you need not do anything extra; or, if the target is <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, the pattern must be written in the <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;&quot;</span></span></code> form.</li>
<li>A <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string can be defined like this: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">string</span>=u<span style="color: #483d8b;">&quot;我爱正则表达式&quot;</span></span></code>. If a string is not <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, it can be converted with the <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function. If you know the encoding of the source string, convert it with <code class="codecolorer python default"><span class="python">newstr=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>oldstring, original_coding_name<span style="color: black;">&#41;</span></span></code>. For example, <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>, <span style="color: #483d8b;">&quot;utf8&quot;</span><span style="color: black;">&#41;</span></span></code> is common on linux; on windows one would perhaps use <code class="codecolorer python default"><span class="python">cp936</span></code>, though I have not tested that.</li>
</ul>
</blockquote>
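<p>The difference between <code>match</code>, <code>search</code> and <code>findall</code> described in the tips can be seen in a few lines. This is a minimal sketch that assumes Python 3, where every <code>str</code> is already unicode and no <code>u""</code> prefix is needed; the sample sentence is invented for the demonstration.</p>

```python
import re

# Invented sample: ASCII first, then two runs of Chinese characters.
text = "regex: 我爱正则表达式, 也爱 Python"
han = re.compile("[\u4e00-\u9fa5]+")

# match() only tries at position 0, where the text is ASCII, so it fails.
print(han.match(text))

# search() scans forward and returns the first run it finds.
print(han.search(text).group())

# findall() returns every non-overlapping run as a list.
print(han.findall(text))
```

<p>If the text had started with a Chinese character, <code>match</code> and <code>search</code> would have returned the same result.</p>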
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Sample program</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;py_utf8_unicode.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-06-27 09:11</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> findPart<span style="color: black;">&#40;</span>regex, text, name<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; res=<span style="color: #dc143c;">re</span>.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>regex, text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;There are %d %s parts:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>res<span style="color: black;">&#41;</span>, name<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>,r<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><br />
<br />
<span style="color: #808080; font-style: italic;">#sample is utf8 by default.</span><br />
sample=<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'en: Regular expression is a powerful tool for manipulating text.<br />
zh: 正则表达式是一种很有用的处理文本的工具。<br />
jp: 正規表現は非常に役に立つツールテキストを操作することです。<br />
jp-char: あアいイうウえエおオ<br />
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.<br />
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃<br />
'</span><span style="color: #483d8b;">''</span><br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw utf8 string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#find the non-ascii chars:</span><br />
findPart<span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]+&quot;</span>,sample,<span style="color: #483d8b;">&quot;non-ascii&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#convert the utf8 to unicode</span><br />
usample=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sample,<span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw unicode string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>usample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#get the parts in each language:</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode chinese&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>ac00-<span style="color: #000099; font-weight: bold;">\u</span>d7ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode korean&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese katakana&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese hiragana&quot;</span><span style="color: black;">&#41;</span> <br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3000-<span style="color: #000099; font-weight: bold;">\u</span>303f<span style="color: #000099; font-weight: bold;">\u</span>fb00-<span style="color: #000099; font-weight: bold;">\u</span>fffd]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode cjk Punctuation&quot;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>The output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br />48<br />49<br />50<br />51<br />52<br />53<br />54<br />55<br />56<br />57<br />58<br />59<br />60<br />61<br />62<br />63<br />64<br />65<br />66<br />67<br />68<br />69<br />70<br />71<br />72<br />73<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">the raw utf8 string is:<br />
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'<br />
<br />
There are 14 non-ascii parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具。<br />
&nbsp; &nbsp; 正規表現は非常に役に立つツールテキストを操作することです。<br />
&nbsp; &nbsp; あアいイうウえエおオ<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
&nbsp; &nbsp; 。？！、，；：“<br />
&nbsp; &nbsp; ”‘<br />
&nbsp; &nbsp; ’——……·－·《》〈〉！￥％＆＊＃<br />
<br />
the raw unicode string is:<br />
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'<br />
<br />
There are 6 unicode chinese parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具<br />
&nbsp; &nbsp; 正規表現<br />
&nbsp; &nbsp; 非常<br />
&nbsp; &nbsp; 役<br />
&nbsp; &nbsp; 立<br />
&nbsp; &nbsp; 操作<br />
<br />
There are 8 unicode korean parts:<br />
<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
<br />
There are 6 unicode japanese katakana parts:<br />
<br />
&nbsp; &nbsp; ツールテキスト<br />
&nbsp; &nbsp; ア<br />
&nbsp; &nbsp; イ<br />
&nbsp; &nbsp; ウ<br />
&nbsp; &nbsp; エ<br />
&nbsp; &nbsp; オ<br />
<br />
There are 11 unicode japanese hiragana parts:<br />
<br />
&nbsp; &nbsp; は<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; つ<br />
&nbsp; &nbsp; を<br />
&nbsp; &nbsp; することです<br />
&nbsp; &nbsp; あ<br />
&nbsp; &nbsp; い<br />
&nbsp; &nbsp; う<br />
&nbsp; &nbsp; え<br />
&nbsp; &nbsp; お<br />
<br />
There are 5 unicode cjk Punctuation parts:<br />
<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。？！、，；：<br />
&nbsp; &nbsp; －<br />
&nbsp; &nbsp; 《》〈〉！￥％＆＊＃</div></td></tr></tbody></table></div>
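<p>To illustrate the idea behind the output above, here is a minimal Python 3 sketch that splits a string into runs of each script. The Unicode ranges and the names <code>SCRIPT_PATTERNS</code> and <code>split_by_script</code> are assumptions made for illustration; the script that produced the output above may use different ranges.</p>

```python
import re

# Assumed Unicode block ranges (illustrative, not taken from the original script).
SCRIPT_PATTERNS = {
    "chinese": re.compile(r"[\u4e00-\u9fff]+"),      # CJK Unified Ideographs
    "hiragana": re.compile(r"[\u3040-\u309f]+"),     # Hiragana block
    "katakana": re.compile(r"[\u30a0-\u30ff]+"),     # Katakana block
    "korean": re.compile(r"[\uac00-\ud7a3]+"),       # Hangul syllables
    "punctuation": re.compile(r"[\u3000-\u303f\uff00-\uffef]+"),  # CJK and full-width forms
}


def split_by_script(text):
    """Return, for each script name, the list of consecutive runs found in text."""
    return {name: pattern.findall(text) for name, pattern in SCRIPT_PATTERNS.items()}


if __name__ == "__main__":
    sample = "正規表現は非常に役に立つツールテキストを操作することです。"
    for name, parts in split_by_script(sample).items():
        print(name, len(parts), parts)
```

<p>Each pattern is a single character class followed by <code>+</code>, so one "part" is a maximal run of characters from the same block. This is why one Japanese sentence breaks into several short kanji, hiragana and katakana pieces.</p>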
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Regex Notes</title>
		<link>http://iregex.org/blog/regex-note-20100621.html</link>
		<comments>http://iregex.org/blog/regex-note-20100621.html#comments</comments>
		<pubDate>Mon, 21 Jun 2010 15:04:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[pos]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=128</guid>
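		<description><![CDATA[Three short notes, posted here. Case-insensitive matching for first letters only: for a while, when I wrote regular expressions to match Drug keywords, I often ended up with expressions like /viagra&#124;cialis&#124;anti-ed/. To make them tidier, I sorted the keywords; to... ]]></description>
			<content:encoded><![CDATA[<p>Three short notes, posted here.</p>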
<p><span id="more-128"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Case-insensitive matching for first letters only</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>For a while, when I wrote regular expressions to match <code class="codecolorer text default"><span class="text">Drug</span></code> keywords, I often ended up with expressions like <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">|</span>cialis<span style="color: #339933;">|</span>anti<span style="color: #339933;">-</span>ed<span style="color: #339933;">/</span></span></code>. To make them tidier, I sorted the keywords; to make them faster, I used <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span>Vv<span style="color: #009900;">&#93;</span>iagra<span style="color: #339933;">/</span></span></code> rather than <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">/</span>i</span></code>, so that only the part that needs it is matched case-insensitively. More precisely, I needed case-insensitive matching on the first letter of each word. </p>
<p>I wrote the following function to do the conversion in bulk.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#convert regex to sorted list, then provide both lower/upper case for the first letter of each word</span><br />
<span style="color: #666666; font-style: italic;">#luf means lower upper first</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> luf<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; split the regex with the delimiter |</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@arr</span><span style="color: #339933;">=</span><span style="color: #000066;">sort</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\|/</span><span style="color: #339933;">,</span><span style="color: #000066;">shift</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; provide both the upper and lower case for the &nbsp;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; first letter of each word </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #0000ff;">\b</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>zA<span style="color: #339933;">-</span>Z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\l</span><span style="color: #0000ff;">$1</span><span style="color: #0000ff;">\u</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">/</span>g<span style="color: #339933;">;</span><span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; join the keyword to a regex again</span><br />
&nbsp; &nbsp; <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'|'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000066;">print</span> luf <span style="color: #ff0000;">&quot;sex pill|viagra|cialis|anti-ed&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; the output is:[aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra</span></div></td></tr></tbody></table></div>
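<p>For comparison, the same conversion can be sketched in Python. This is a rough port under the assumption that keywords are separated by <code>|</code> and contain no other regex syntax; the helper name <code>both_cases</code> is made up for this example.</p>

```python
import re


def luf(pattern):
    """Sort the |-separated keywords and accept either case for the
    first letter of every word ("luf" = lower/upper first)."""
    words = sorted(pattern.split("|"))

    def both_cases(mo):
        # Replace the captured first letter with a two-character class.
        letter = mo.group(1)
        return "[" + letter.lower() + letter.upper() + "]"

    return "|".join(re.sub(r"\b([a-zA-Z])", both_cases, word) for word in words)


if __name__ == "__main__":
    print(luf("sex pill|viagra|cialis|anti-ed"))
    # [aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra
```

<p>A callback is used here because Python replacement strings have no equivalent of Perl's <code>\l</code> and <code>\u</code> case modifiers.</p>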
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Controlling where the next global match starts</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
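<p>I remember jyf once asking me how to control the position where a match starts. Well, now I can answer that question. Perl provides the pos function, which can adjust where the next match starts during a <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>g</span></code> global match. For example:</p>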
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>It prints the string two letters at a time, that is, <code class="codecolorer text default"><span class="text">ab, cd, ef</span></code></p>
<p>You can use pos($_) to reposition the start of the next match, for example:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">pos</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">--;</span> &nbsp;<span style="color: #666666; font-style: italic;">#pos($_)++;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Output:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">pos($_)--: &nbsp;ab, bc, cd, de, ef, fg.<br />
pos($_)++: &nbsp;ab, de.</div></td></tr></tbody></table></div>
<p>See the <a href="http://perldoc.perl.org/functions/pos.html" title="我爱正则表达式" target="_blank">pos</a> section of the Perl documentation for details.</p>
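<p>Python has no direct counterpart to assigning to <code>pos</code>, but a compiled pattern's <code>search</code> method accepts a start position, which allows a similar effect. The sketch below is only an approximation of the Perl loops above; the function name <code>scan</code> and its <code>step</code> parameter are hypothetical.</p>

```python
import re


def scan(text, step):
    """Collect two-character matches, shifting the next start position
    by `step` after each match (step=-1 mimics pos($_)--, step=1 mimics pos($_)++)."""
    pair = re.compile(r"..")
    found = []
    pos = 0
    while True:
        mo = pair.search(text, pos)
        if mo is None:
            break
        found.append(mo.group(0))
        # Always move at least one character past the match start,
        # so a large negative step cannot loop forever.
        pos = max(mo.end() + step, mo.start() + 1)
    return found


if __name__ == "__main__":
    print(scan("abcdefg", 0))   # ['ab', 'cd', 'ef']
    print(scan("abcdefg", -1))  # ['ab', 'bc', 'cd', 'de', 'ef', 'fg']
    print(scan("abcdefg", 1))   # ['ab', 'de']
```

<p>Keeping the position in an explicit variable makes the overlap (or gap) between successive matches visible, at the cost of writing the loop by hand.</p>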
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Hashes and regex substitution</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
Chapter 3 of effective-perl-2e has the following example (see the code below), which escapes special characters.</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">my %ent = ( '&amp;' =&gt; 'amp', '&lt;' =&gt; 'lt', '&gt;' =&gt; 'gt' );<br />
$html =~ s/([&amp;&lt;&gt;])/&amp;$ent{$1};/g;</div></td></tr></tbody></table></div>
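<p>This example is very clever. It uses a hash, with each piece of text to be replaced as a key and its replacement as the value. Whenever the pattern matches, the matched text is captured, used as a key to look up the value, and the value is spliced into the replacement, which shows how efficient a high-level language can be.</p>
<p>But can Perl code like this be ported to Python? Python also supports regular expressions and hashes (called a dictionary in Python), but it does not seem to allow anything this fancy inside the replacement (interpolating variables in the replacement string).</p>
<p>Consulting the Python documentation (run python in a shell, then import re, then help(re)) shows:</p>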
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sub(pattern, repl, string, count=0)<br />
&nbsp; &nbsp; Return the string obtained by replacing the leftmost<br />
&nbsp; &nbsp; non-overlapping occurrences of the pattern in string by the<br />
&nbsp; &nbsp; replacement repl. &nbsp;repl can be either a string or a callable;<br />
&nbsp; &nbsp; if a string, backslash escapes in it are processed. &nbsp;If it is<br />
&nbsp; &nbsp; a callable, it's passed the match object and must return<br />
&nbsp; &nbsp; a replacement string to be used.</div></td></tr></tbody></table></div>
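<p>It turns out that Python, like PHP, supports a callable (callback function) during substitution. The callback receives a match object as its argument. With that, the problem becomes simple:</p>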
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
ent=<span style="color: black;">&#123;</span><span style="color: #483d8b;">'&lt;'</span>:<span style="color: #483d8b;">&quot;lt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&gt;'</span>:<span style="color: #483d8b;">&quot;gt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&amp;'</span>:<span style="color: #483d8b;">&quot;amp&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> rep<span style="color: black;">&#40;</span>mo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&amp;&quot;</span> + ent<span style="color: black;">&#91;</span>mo.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> + <span style="color: #483d8b;">&quot;;&quot;</span><br />
<br />
html=<span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;([&amp;&lt;&gt;])&quot;</span>,rep, html<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
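<p>The key point about a Python replacement callback is that its argument is a match object. Once that is clear, check the manual to see which attributes and methods the object offers and put them to use; that is all it takes to write flexible, efficient regex substitutions in Python.</p>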
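<p>As a self-contained illustration of that point, the sketch below uses two match-object methods, <code>group()</code> and <code>start()</code>, inside replacement callbacks. The names <code>ENT</code>, <code>escape_html</code> and <code>mark_positions</code> are made up for this example.</p>

```python
import re

# Characters to escape, mapped to their entity names.
ENT = {"<": "lt", ">": "gt", "&": "amp"}


def escape_html(text):
    """Replace &, < and > with entities by looking up the matched text in ENT."""
    return re.sub(r"[&<>]", lambda mo: "&" + ENT[mo.group(0)] + ";", text)


def mark_positions(text):
    """Annotate each special character with the offset where it matched."""
    return re.sub(r"[&<>]", lambda mo: "[%s@%d]" % (mo.group(0), mo.start()), text)


if __name__ == "__main__":
    print(escape_html("a < b && c > d"))  # a &lt; b &amp;&amp; c &gt; d
    print(mark_positions("a<b>c"))        # a[<@1]b[>@3]c
```

<p>Note that <code>mo.start()</code> reports the offset in the original string, not in the partially rewritten result.</p>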
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-note-20100621.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Regex video tutorials by Superor</title>
		<link>http://iregex.org/blog/regex-tutorial-by-superor.html</link>
		<comments>http://iregex.org/blog/regex-tutorial-by-superor.html#comments</comments>
		<pubDate>Sun, 20 Jun 2010 01:35:06 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=127</guid>
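		<description><![CDATA[While browsing CU I came across Superor's "Exploring the World of Perl (updated to episode 40): Perl teaching videos" (by a Chinese author, in Chinese), five episodes of which cover regular expressions. I watched them and liked them, so I am posting them here. I previously posted 余晟 (Yu Sheng)'s regex videos; ... ]]></description>
			<content:encoded><![CDATA[<p>While browsing CU I came across Superor's "<a href="http://bbs.chinaunix.net/thread-1707137-1-2.html">Exploring the World of Perl (updated to episode 40): Perl teaching videos</a>" (by a Chinese author, in Chinese), five episodes of which cover regular expressions. I watched them and liked them, so I am posting them here.<span id="more-127"></span></p>
<p>I previously posted 余晟 (Yu Sheng)'s regex videos. For various reasons beyond anyone's control, the copies uploaded to the major hosting sites have gradually become inaccessible. I originally got them from my old friend 牛腩粉; the address is <a href="http://tieba.baidu.com/f?kz=464065073">here</a>, so you can leave a message there and try your luck.</p>
<p>Superor's videos are not limited to regular expressions; they are a systematic Perl course. I have taken the regex episodes out of context and collected them here. Superor has also said that when he was learning Perl he gathered quite a few books but did not read them cover to cover; instead, whenever he needed a topic, he read everything on that topic closely. Regular expressions can be learned the same way.</p>
<p>The videos are hosted online and the quality is good, although ads are inserted.</p>
<p>Episode 20: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3748" target="_blank">http://www.boobooke.com/v/bbk3748</a></p>
<p>Episode 21: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3749" target="_blank">http://www.boobooke.com/v/bbk3749</a></p>
<p>Episode 22: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3750" target="_blank">http://www.boobooke.com/v/bbk3750</a></p>
<p>Episode 23: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3751" target="_blank">http://www.boobooke.com/v/bbk3751</a></p>
<p>Episode 24: Chapter 8, Regular Expressions<br /> <br />
<a href="http://www.boobooke.com/v/bbk3752" target="_blank">http://www.boobooke.com/v/bbk3752</a> </p>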
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-tutorial-by-superor.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
