<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; python</title>
	<atom:link href="http://iregex.org/blog/tag/python/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Tue, 29 Mar 2011 05:04:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>		<item>
		<title>Getting the Zhengma Input Method Working on the Mac</title>
		<link>http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html</link>
		<comments>http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html#comments</comments>
		<pubDate>Sun, 07 Nov 2010 14:43:05 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[ime]]></category>
		<category><![CDATA[openvanilla]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[zhengma]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=152</guid>
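		<description><![CDATA[I recently switched to a Mac and could not find a good Zhengma input method for it. Not being afraid of a little tinkering, I built my own code table and am writing the process down here. Zhengma: I doubt many people use Zhengma. It is a very niche input scheme, similar to Wubi and said to be more&#8... ]]></description>
			<content:encoded><![CDATA[<p>I recently switched to a Mac and could not find a good Zhengma input method for it. Not being afraid of a little tinkering, I built my own code table and am writing the process down here.</p>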
<p> <span id="more-152"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Zhengma</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
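<p>I doubt many people use <a href="http://zh.wikipedia.org/zh/%E9%83%91%E7%A0%81%E8%BE%93%E5%85%A5%E6%B3%95">Zhengma</a>. It is a very niche input scheme, similar to Wubi and said to be more &#8220;regular&#8221;. I have never used Wubi, so I cannot compare the two. I simply like Zhengma and have used it all along. Either way, pinyin input methods are not for me.</p>
<p>On Ubuntu you can use ibus, but on the Mac you are not so lucky. Fit is a nice input method, but it only ships with built-in pinyin and Wubi and does not yet support custom code tables. QIM is paid software that apparently supports custom code tables, but it lacks documentation, and after a long search I found little useful information.</p>
<p>After some searching I found OpenVanilla (香草输入法): free, open, and with support for custom code tables. It is available <a href="http://openvanilla.org/">here</a>. I also came across a similar solution on the 郑码爱好者家园 forum for Zhengma enthusiasts, but it is less than perfect (for example, its word-frequency ordering is unbearably jumbled). It is <a href="http://www.cn25.net/zm/showbbs.asp?bd=14&#038;id=841&#038;totable=1">here</a>.</p>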
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Building a Zhengma Code Table</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
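<p><strong>Word list:</strong> I used the corpus text from Sogou Labs, available <a href="http://www.sogou.com/labs/dl/w.html">here</a>. Each line has the format &#8220;word frequency POS1 POS2 … POSN&#8221;; just take the words you need along with their frequencies. The file contains only words of two or more characters and has no frequency data for single characters.</p>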
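<p>Extracting the word and frequency columns from that format might look like the sketch below. It assumes whitespace-separated fields, as in the format described above; the function name and the sample lines (including their part-of-speech tags) are made up for illustration and are not taken from the Sogou file.</p>

```python
from typing import Dict, Iterable


def parse_word_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Map each word to its frequency, ignoring the part-of-speech tags."""
    freqs: Dict[str, int] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or a line without a frequency column
        word, count = fields[0], fields[1]
        try:
            freqs[word] = int(count)
        except ValueError:
            continue  # second column is not a number; skip the line
    return freqs


# Hypothetical sample lines in the "word frequency POS..." layout.
sample = ["北京 12345 N", "我们 67890 PRON", ""]
print(parse_word_frequencies(sample))
```

<p>Lines that do not fit the layout are skipped rather than raising, which seems the safer default when filtering a large corpus file.</p>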
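<p><strong>Character frequency list:</strong> I found one at <a href="http://lingua.mtsu.edu/chinese-computing/statistics/">lingua.mtsu.edu</a>. To my delight, it also includes the pinyin and tone of each character (including characters with multiple readings). This made it much easier to build a code table with pinyin as an aid. Pinyin only assists the shape-based codes: I look a character up by pinyin only when I cannot type it by shape, so no abbreviated pinyin for characters or words is needed. And because pinyin is only an aid, all the pinyin entries can simply go at the end of the finished table.</p>
<p>With these two files in hand, I could start building the table. I still needed a table of the word-building codes of single characters, plus a table of the conventional shortcut codes (for example, Zhengma offers two ways to type &#8220;北京&#8221;: the short ts and the regular trsj). I had already prepared the former; for the latter, I found a large-character-set code table on a Zhengma CD that turned up in a web search.</p>
<p>The process is not hard, but there are many details, which I will not go into. Naturally, <a href="http://iregex.org">regular expressions</a> helped along the way.</p>
<p>The result is a five-code Zhengma table, and it works very well. OpenVanilla takes about a second to start each time, but once it is running the delay is unnoticeable. After all, the final word list has a hefty 190,000 entries.</p>
<p>OpenVanilla's biggest strengths are that it is open and free. Compared with QIM or Fit, though, its features and customization options as an input method are quite limited. You cannot even customize how to switch between Chinese and English, let alone adjust word frequencies dynamically or add and remove entries. So I wrote a bash function to search the existing entries, a bash script to delete entries, and a Python program to add new words on the fly.</p>
<p>The Python program for adding new words is the fun one. It reads a word list from the command line or from a file and adds the words to the dictionary in bulk. Along the way it generates correctly formatted Zhengma codes by itself, and when it is done it kills OpenVanilla so that the new dictionary gets reloaded.</p>
<p>The program is on GitHub, <a href="https://github.com/zhasm/zhengma">here</a>.</p>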
</blockquote>
<p><a href="http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/Screenshot2010-11-07at103904PM.png" border="0" alt="Photobucket"></a></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Update</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul><li><strong>2010-11-08</strong>: After looking at the code tables of OpenVanilla's other input methods, I got punctuation working.</li></ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/zhengma-on-openvanilla-for-mac.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Simple Chinese Word Segmentation Program</title>
		<link>http://iregex.org/blog/simple-nlp-for-chinese.html</link>
		<comments>http://iregex.org/blog/simple-nlp-for-chinese.html#comments</comments>
		<pubDate>Sun, 26 Sep 2010 14:41:12 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=151</guid>
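		<description><![CDATA[kds: “Former ambassador to France Wu Jianmin pointed out that we should 理**国.” It took me a moment to realize that the two asterisks stand for the characters 性爱 (“sex”); living in China in an age of mechanical keyword filtering really has its comic side. (via) You have probably seen the joke on Twitter about “理＊＊国”.... ]]></description>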
			<content:encoded><![CDATA[<blockquote  style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
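    kds: “Former ambassador to France Wu Jianmin pointed out that we should 理**国.” It took me a moment to realize that the two asterisks stand for the characters 性爱 (“sex”); living in China in an age of mechanical keyword filtering really has its comic side. (<a href="https://twitter.com/rightf/status/25555437368" title="我爱正则表达式" target="_blank">via</a>)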
</p></blockquote>
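<p>You have probably seen the joke on Twitter about “<a href="https://twitter.com/rex_zhasm/status/25567030862" title="我爱正则表达式" target="_blank">理＊＊国</a>” (理性爱国, “rational patriotism”, with the middle two characters filtered out). I happened to want to learn about Chinese word segmentation, and this is my first post on the subject.</p>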
<p><span id="more-151"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Segmentation: Principle and Implementation</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
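<p>For languages such as English that separate words with whitespace, segmentation is not a problem. Chinese segmentation involves far too many details. On the handling of “<a href="http://is.gd/ftZNO" title="我爱正则表达式" target="_blank">true ambiguity</a>” alone (in short, sentences that even a living person cannot segment with certainty without context), our predecessors have written at great length. That is an advanced topic, though. I want to start by implementing the simplest possible segmenter.</p>
<p>As I understand it, the simplest segmenter first cuts the Chinese text into its smallest units, single characters, then looks words up in a dictionary and merges the characters into a sequence of words by the leftmost-longest rule (which happens to match the spirit of regular expressions). This should also be the fastest approach: it only splits and merges according to the given data, without weighing grammatical elements (parts of speech such as noun, verb, adjective, numeral, measure word, and pronoun; syntactic roles such as subject, predicate, object, attributive, adverbial, and complement) or counting occurrences in context.</p>
<p>To split the source text, I follow the approach from <a href="http://iregex.org/blog/words-counter-in-python.html" title="我爱正则表达式" target="_blank">“Counting Chinese Characters and English Words”</a> and match with the regular expression <code class="codecolorer python default"><span class="python">r<span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span></span></code>.</p>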
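<p>For readers on Python 3, where strings are Unicode rather than bytes, the same pattern can be tried as a bytes pattern. This is only a sketch: the <code>tokenize</code> helper is a hypothetical name, and like the original pattern it assumes every non-ASCII character occupies exactly three bytes in UTF-8, which holds for common Chinese characters and punctuation but not for every character.</p>

```python
import re

# The pattern from the post, written as a bytes pattern so that
# [\x80-\xff] matches raw UTF-8 bytes instead of code points.
TOKEN = re.compile(rb"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")


def tokenize(text: str) -> list:
    """Return ASCII words and three-byte UTF-8 characters, in order."""
    raw = text.encode("utf-8")
    return [match.decode("utf-8") for match in TOKEN.findall(raw)]


print(tokenize("I love 正则表达式, really-really"))
```

<p>Characters that are two or four bytes long in UTF-8 would throw the three-byte grouping out of step, so this approach is best kept to text that mixes only ASCII and common Chinese characters.</p>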
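<p>For the dictionary I use <a href="http://www.mdbg.net/chindict/chindict.php?page=cedict" title="我爱正则表达式" target="_blank">CC-CEDICT</a>, for three reasons: it has no copyright problems; it is fast; and Chrome uses it too (you may have noticed that double-clicking a Chinese sentence in Chrome selects a whole Chinese word rather than a single character or the entire line).</p>
<p>Next comes the segmentation itself. After some thought I realized that a search tree (trie) can be used as is; for the principle, see <a href="http://iregex.org/blog/trie-in-python.html" title="我爱正则表达式" target="_blank">Trie in Python</a>. The method is to read the word list into memory character by character and build the trie, then walk the target text character by character: if the search can continue past the current character, keep going; otherwise stop and treat what has been matched so far as one word.</p>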
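<p>The idea can be sketched in Python 3 as follows. This is not the program from this post or its later GitHub revision; it is a minimal illustration that works on the characters of a <code>str</code> instead of regex tokens, and the names <code>build_trie</code> and <code>segment</code> and the tiny word list are assumptions made for the example. One detail it adds is remembering the last position where a complete word ended, so that a mere prefix of a dictionary word is not emitted as a word.</p>

```python
END = ""  # end-of-word marker stored as a key in the last node of a word


def build_trie(words):
    """Build a nested-dict trie; the node ending each word gets an END key."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment(text, trie):
    """Split text by leftmost-longest match; unknown characters stand alone."""
    result = []
    i = 0
    while i < len(text):
        node = trie
        longest = i + 1  # fall back to emitting a single character
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                longest = j  # remember the last complete dictionary word
        result.append(text[i:longest])
        i = longest
    return result


# Hypothetical mini dictionary, for illustration only.
trie = build_trie(["理性", "爱国", "性爱", "音乐", "正则", "表达式"])
print(" * ".join(segment("理性爱国", trie)))
print(" * ".join(segment("音乐之美", trie)))
```

<p>The loop is iterative rather than recursive, so long texts do not run into Python's recursion limit.</p>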
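<p>In theory this algorithm is fairly fast (I have not benchmarked it), for three reasons. First, the trie is built from nested hash tables, trading space for time: each character costs one O(1) lookup, so matching a word takes time proportional to its length, regardless of the size of the dictionary. Second, the word list is only 800K, so it loads easily and takes up little memory. Third, the slowest part is building the trie; after that, speed is no longer affected.</p>
<p>As for extensibility, for now new words can only be added to words.txt by hand; there is no machine learning.</p>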
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Source Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The complete program (including the word list I processed) is on <a href="http://github.com/zhasm/simpleNLP" title="我爱正则表达式" target="_blank">GitHub</a>. Feel free to play with it. The main program is listed here:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;nlp.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-09-26 19:15</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<br />
regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> init_wordslist<span style="color: black;">&#40;</span>fn<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;./words.txt&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; f<span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span>fn<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; lines<span style="color: #66cc66;">=</span><span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>f.<span style="color: black;">readlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> lines<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> words_2_trie<span style="color: black;">&#40;</span>wordslist<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; d<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> wordslist: <br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>d<br />
&nbsp; &nbsp; &nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>word<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> chars:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span>ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">and</span> ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span><span style="color: #483d8b;">''</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> d<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>trie<br />
&nbsp; &nbsp; index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> chars:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> char<span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> index<span style="color: #66cc66;">==</span><span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> char<span style="color: #66cc66;">,</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'*'</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>chars<span style="color: black;">&#91;</span>index:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">except</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">pass</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">break</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#init</span><br />
&nbsp; &nbsp; words<span style="color: #66cc66;">=</span>init_wordslist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; trie<span style="color: #66cc66;">=</span>words_2_trie<span style="color: black;">&#40;</span>words<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#read content</span><br />
&nbsp; &nbsp; fn<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #dc143c;">string</span><span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span>fn<span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#do the job</span><br />
&nbsp; &nbsp; search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Local Test</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The test text is as follows:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">只听得一个女子低低应了一声。绿竹翁道：“姑姑请看，这部琴谱可有些古怪。”那<br />
女子又嗯了一声，琴音响起，调了调弦，停了一会，似是在将断了的琴弦换去，又调了调<br />
弦，便奏了起来。初时所奏和绿竹翁相同，到后来越转越高，那琴韵竟然履险如夷，举重<br />
若轻，毫不费力的便转了上去。令狐冲又惊又喜，依稀记得便是那天晚上所听到曲洋所奏<br />
的琴韵。这一曲时而慷慨激昂，时而温柔雅致，令狐冲虽不明乐理，但觉这位婆婆所奏，<br />
和曲洋所奏的曲调虽同，意趣却大有差别。这婆婆所奏的曲调平和中正，令人听着只觉音<br />
乐之美，却无曲洋所奏热血如沸的激奋。奏了良久，琴韵渐缓，似乎乐音在不住远去，倒<br />
像奏琴之人走出了数十丈之遥，又走到数里之外，细微几不可再闻。<br />
<br />
理性爱国<br />
性爱体验<br />
我爱正则表达式</div></td></tr></tbody></table></div>
<p>Note the last three lines.</p>
<p>Now look at the program's output (* marks the boundaries between words):</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">只 * 听 得 * 一 个 * 女 子 * 低 低 * 应 * 了 * 一 声 * 。 * 绿 * 竹 * 翁 * 道 * ： * “ * 姑 姑 * 请 看 * ， * 这 * 部 * 琴 * 谱 * 可 有 * 些 * 古 怪 * 。 * ” * 那 * 女 子 * 又 * 嗯 * 了 * 一 声 * ， * 琴 * 音 响 * 起 * ， * 调 * 了 * 调 * 弦 * ， * 停 * 了 * 一 会 * ， * 似 是 * 在 * 将 * 断 * 了 * 的 * 琴 弦 * 换 * 去 * ， * 又 * 调 * 了 * 调 * 弦 * ， * 便 * 奏 * 了 * 起 来 * 。 * 初 * 时 * 所 * 奏 * 和 * 绿 * 竹 * 翁 * 相 同 * ， * 到 * 后 来 * 越 * 转 * 越 * 高 * ， * 那 * 琴 * 韵 * 竟 然 * 履 险 如 夷 * ， * 举 重 * 若 * 轻 * ， * 毫 不 费 力 * 的 * 便 * 转 * 了 * 上 去 * 。 * 令 狐 * 冲 * 又 * 惊 * 又 * 喜 * ， * 依 稀 * 记 得 * 便 是 * 那 天 * 晚 上 * 所 * 听 到 * 曲 * 洋 * 所 * 奏 * 的 * 琴 * 韵 * 。 * 这 一 * 曲 * 时 而 * 慷 慨 * 激 昂 * ， * 时 而 * 温 柔 * 雅 致 * ， * 令 狐 * 冲 * 虽 * 不 明 * 乐 理 * ， * 但 * 觉 * 这 位 * 婆 婆 * 所 * 奏 * ， * 和 * 曲 * 洋 * 所 * 奏 * 的 * 曲 调 * 虽 * 同 * ， * 意 趣 * 却 * 大 有 * 差 别 * 。 * 这 * 婆 婆 * 所 * 奏 * 的 * 曲 调 * 平 和 * 中 正 * ， * 令 人 * 听 * 着 * 只 * 觉 * 音 乐 之 * 美 * ， * 却 * 无 * 曲 * 洋 * 所 * 奏 * 热 血 * 如 * 沸 * 的 * 激 * 奋 * 。 * 奏 * 了 * 良 久 * ， * 琴 * 韵 * 渐 * 缓 * ， * 似 乎 * 乐 音 * 在 * 不 住 * 远 * 去 * ， * 倒 像 * 奏 * 琴 * 之 * 人 * 走 出 * 了 * 数 十 * 丈 * 之 * 遥 * ， * 又 * 走 * 到 * 数 * 里 * 之 外 * ， * 细 微 * 几 * 不 可 再 * 闻 * 。 * 理 性 * 爱 国 * 性 爱 * 体 验 * 我 * 爱 * 正 则 * 表 达 式</div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Update</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
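<p><strong>Update, 2010-10-03</strong>: I found a bug in this program. The algorithm has been improved and is now more accurate and faster. See GitHub for the program; the link is the same as above.</p>
<p>Here is the segmentation result from the new program:</p>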
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>只 * 听 * 得 * 一 * 个 * 女子 * 低 * 低 * 应 * 了 * 一声 * 。 * 绿 * 竹 * 翁 * 道 * ： * “ * 姑姑 * 请看 * ， * 这 * 部 * 琴谱 * 可 * 有些 * 古怪 * 。 * ” * 那 * 女子 * 又 * 嗯 * 了 * 一声 * ， * 琴 * 音响 * 起 * ， * 调 * 了 * 调 * 弦 * ， * 停 * 了 * 一会 * ， * 似 * 是 * 在 * 将 * 断 * 了 * 的 * 琴弦 * 换 * 去 * ， * 又 * 调 * 了 * 调 * 弦 * ， * 便 * 奏 * 了 * 起来 * 。 * 初 * 时 * 所 * 奏 * 和 * 绿 * 竹 * 翁 * 相同 * ， * 到 * 后来 * 越 * 转 * 越 * 高 * ， * 那 * 琴 * 韵 * 竟然 * 履 * 险 * 如 * 夷 * ， * 举重 * 若 * 轻 * ， * 毫不 * 费力 * 的 * 便 * 转 * 了 * 上去 * 。 * 令狐 * 冲 * 又 * 惊 * 又 * 喜 * ， * 依稀 * 记得 * 便是 * 那天 * 晚上 * 所 * 听到 * 曲 * 洋 * 所 * 奏 * 的 * 琴 * 韵 * 。 * 这 * 一 * 曲 * 时而 * 慷慨 * 激昂 * ， * 时而 * 温柔 * 雅致 * ， * 令狐 * 冲 * 虽 * 不明 * 乐理 * ， * 但 * 觉 * 这位 * 婆婆 * 所 * 奏 * ， * 和 * 曲 * 洋 * 所 * 奏 * 的 * 曲调 * 虽 * 同 * ， * 意趣 * 却 * 大有 * 差别 * 。 * 这 * 婆婆 * 所 * 奏 * 的 * 曲调 * 平和 * 中正 * ， * 令人 * 听 * 着 * 只 * 觉 * 音乐 * 之 * 美 * ， * 却 * 无 * 曲 * 洋 * 所 * 奏 * 热血 * 如 * 沸 * 的 * 激 * 奋 * 。 * 奏 * 了 * 良久 * ， * 琴 * 韵 * 渐 * 缓 * ， * 似乎 * 乐音 * 在 * 不住 * 远 * 去 * ， * 倒像 * 奏 * 琴 * 之 * 人 * 走出 * 了 * 数 * 十 * 丈 * 之 * 遥 * ， * 又 * 走 * 到 * 数 * 里 * 之外 * ， * 细微 * 几 * 不可 * 再 * 闻 * 。 *</p>
<p>理性 * 爱国 *</p>
<p>性爱 * 体验 *</p>
<p>我 * 爱 * 正则 * 表达式 *</p>
<p>轻 * 音乐 *</p>
</blockquote>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/simple-nlp-for-chinese.html/feed</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Counting Chinese Characters and English Words</title>
		<link>http://iregex.org/blog/words-counter-in-python.html</link>
		<comments>http://iregex.org/blog/words-counter-in-python.html#comments</comments>
		<pubDate>Sat, 25 Sep 2010 11:25:38 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=148</guid>
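		<description><![CDATA[A simple program that counts the words and Chinese characters in text documents and lists them in descending order (most frequent first). Implemented in Python. Approach: use the regular expression &#34;(?x) (?: [\w-]+ &#160;&#124; [\x80-\xff]{3} )&#34; to get the English words in a UTF-8 document... ]]></description>
			<content:encoded><![CDATA[<p>A simple program that counts the words and Chinese characters in text documents and lists them in descending order (most frequent first). Implemented in Python.</p>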
<p><span id="more-148"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Approach</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Use the regular expression <code class="codecolorer python default"><span class="python"><span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span></span></code> to get the list of English words and Chinese characters in a UTF-8 document.</li>
<li>Use a dictionary to record how often each word or character occurs: add 1 if it has been seen before, otherwise set it to 1.</li>
<li>Sort the dictionary by value and print it.</li>
</ul>
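<p>In Python 3 the three steps above could be sketched with <code>collections.Counter</code>, which takes care of both the counting and the sort by value. This is an illustrative variant, not the listing from this post: the <code>count_tokens</code> helper and the sample text are assumptions, and the pattern is written as a bytes pattern so that it still matches raw UTF-8 bytes.</p>

```python
import re
from collections import Counter

# Bytes version of the pattern above: ASCII words, or three raw UTF-8 bytes.
TOKEN = re.compile(rb"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")


def count_tokens(data: bytes) -> Counter:
    """Count English words and three-byte UTF-8 characters in raw bytes."""
    return Counter(match.decode("utf-8") for match in TOKEN.findall(data))


# Made-up sample text; in practice the bytes would come from open(path, "rb").
sample = "正则 regex 正则 表达式 regex 正则".encode("utf-8")
for token, count in count_tokens(sample).most_common():
    print(token, count)
```

<p><code>Counter.most_common()</code> returns the entries sorted by count in descending order, which corresponds to the sort-by-value step.</p>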
</blockquote>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Source Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;counter.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;Mon Sep 20 21:00:52 2010</span><br />
<span style="color: #808080; font-style: italic;">#desc: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; count English words and Chinese characters in UTF-8 files.</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">operator</span> <span style="color: #ff7700;font-weight:bold;">import</span> itemgetter<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> readfile<span style="color: black;">&#40;</span>f<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">with</span> <span style="color: #008000;">file</span><span style="color: black;">&#40;</span>f<span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;r&quot;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">as</span> pFile:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> pFile.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">def</span> divide<span style="color: black;">&#40;</span>c<span style="color: #66cc66;">,</span> regex<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#the regex below is only valid for utf8 coding</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>c<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> update_dict<span style="color: black;">&#40;</span>di<span style="color: #66cc66;">,</span>li<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> li:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> di.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; di<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; di<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> di<br />
&nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#receive files from bash</span><br />
&nbsp; &nbsp; files<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span> <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#regex compile only once</span><br />
&nbsp; &nbsp; regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; counts<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#get all words from files</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> f <span style="color: #ff7700;font-weight:bold;">in</span> files:<br />
&nbsp; &nbsp; &nbsp; &nbsp; words<span style="color: #66cc66;">=</span>divide<span style="color: black;">&#40;</span>readfile<span style="color: black;">&#40;</span>f<span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> regex<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; counts<span style="color: #66cc66;">=</span>update_dict<span style="color: black;">&#40;</span>counts<span style="color: #66cc66;">,</span> words<span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#sort by count, highest first</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#ranked is a list of (word, count) pairs</span><br />
&nbsp; &nbsp; ranked<span style="color: #66cc66;">=</span><span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>counts.<span style="color: black;">items</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> key<span style="color: #66cc66;">=</span>itemgetter<span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> reverse<span style="color: #66cc66;">=</span><span style="color: #008000;">True</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#output to standard-output</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> word<span style="color: #66cc66;">,</span> count <span style="color: #ff7700;font-weight:bold;">in</span> ranked:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> word<span style="color: #66cc66;">,</span> count<br />
<br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Tips</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Because the script collects its arguments with <code class="codecolorer python default"><span class="python">files<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span></span></code>, running <code class="codecolorer bash default"><span class="bash">.<span style="color: #000000; font-weight: bold;">/</span>counter.py file1 file2 ...</span></code> accumulates the word frequencies of all the files named on the command line and prints the combined counts.</p>
<p>The program can be customized. For example:</p>
<ul>
<li>Use
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;(?x) ( [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
words<span style="color: #66cc66;">=</span><span style="color: black;">&#91;</span>w <span style="color: #ff7700;font-weight:bold;">for</span> w <span style="color: #ff7700;font-weight:bold;">in</span> regex.<span style="color: black;">split</span><span style="color: black;">&#40;</span>line<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">if</span> w<span style="color: black;">&#93;</span></div></td></tr></tbody></table></div>
<p>Because the pattern is wrapped in a capturing group, the list returned by split contains the words together with the separators between them, which is convenient for further processing after the full text has been tokenized.</p>
            </li>
<li>Process the file line by line instead of reading the whole file into memory; this saves memory on large files.</li>
<li>Preprocess the whole file with a regular expression such as <code class="codecolorer python default"><span class="python">content<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;&lt;[^&gt;]+&gt;&quot;</span><span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;&quot;</span><span style="color: #66cc66;">,</span>content<span style="color: black;">&#41;</span></span></code> to strip any HTML tags first; for some documents this gives more accurate results.
            </li>
</ul>
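<p>The tips above can be combined into one small sketch. It is written for Python 3 (the listings in this post are Python 2), so the pattern is a bytes pattern in order to treat UTF-8 input the way the original script does. The names <code>count_words</code> and <code>count_file</code> and the sample lines are made up for illustration, and stripping tags per line assumes that no tag spans a line break.</p>

```python
import re
from collections import Counter

# The post's pattern as a bytes pattern: in Python 3 this mirrors how the
# Python 2 script treats UTF-8 byte strings (three bytes per CJK character).
WORD = re.compile(rb"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")
# Tag-stripping pattern; assumes a tag never spans two lines.
TAG = re.compile(rb"<[^>]+>")


def count_words(lines):
    """Accumulate word counts from an iterable of byte-string lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD.findall(TAG.sub(b"", line)))
    return counts


def count_file(path):
    # A file opened in binary mode yields one line per iteration,
    # so the whole file is never held in memory at once.
    with open(path, "rb") as handle:
        return count_words(handle)


# Hypothetical sample input, standing in for the lines of a real file.
sample = [
    "<p>regex 正则 regex</p>\n".encode("utf-8"),
    b"<b>re-use</b> regex\n",
]
for word, n in count_words(sample).most_common():
    print(word.decode("utf-8"), n)
```

<p>Using <code>Counter</code> replaces the hand-written dictionary update and the final sort; <code>most_common()</code> returns the pairs ordered by count, highest first.</p>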
</blockquote>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/words-counter-in-python.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Notes on Anti-Spam</title>
		<link>http://iregex.org/blog/anti-spam.html</link>
		<comments>http://iregex.org/blog/anti-spam.html#comments</comments>
		<pubDate>Sun, 15 Aug 2010 02:03:50 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[akismet]]></category>
		<category><![CDATA[antispam]]></category>
		<category><![CDATA[discuz]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=140</guid>
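		<description><![CDATA[This is an informal essay that discusses email anti-spam techniques together with anti-flooding measures for forums, from a purely technical angle and nothing else. There is said to be a German proverb: a beer without foam is not a good beer. By extension: a forum that nobody... ]]></description>
			<content:encoded><![CDATA[<p>This is an informal essay that discusses email anti-spam techniques together with anti-flooding measures for forums. It takes a purely technical angle and touches on nothing else.</p>
<p><span id="more-140"></span></p>
<p>There is said to be a German proverb: a beer without foam is not a good beer. By extension, a forum that nobody floods is not a good forum, a mail system that receives no spam is not a good mail system (or at least it is an obscure system or address), an OS that no virus bothers to attack is not a good OS, and so on. But a beer that is nothing but foam is hardly a good beer either. The key is to keep unwanted content within a tolerable range. From the standpoint of running a forum or a mail service, moderation technology is both useful and necessary; without it, the site is soon drowned in a sea of junk advertising. Running your own ads on your own forum pays for the hosting, but uninvited ads are intolerable.</p>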
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Spam</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>
    Spam has two defining characteristics:<br />
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>
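                <strong>Volume</strong>. An individual user may care about one or two spam messages, but a server handles mail by the million. With a denominator that large, it is only to be expected that an occasional spam message is classified as legitimate, or that a legitimate message is misjudged as spam.
            </li>
<li><strong>Unwanted</strong>. Whether a message is wanted depends on the user's subjective judgment. Most people regard porn and drug mail as spam, but some users may well mark such mail as ham.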
            </li>
</ol>
</blockquote>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Anti-Spam</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The common anti-spam approaches are:
</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
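<li>Static keyword lists. If the message (subject, sender and recipient, email addresses, body) contains certain keywords, for example Viagra, it is marked as spam or ham. This is the simplest approach and is used by the major mail providers, Gmail included. Its biggest advantage is speed. Its drawbacks are many: first, the list has to be maintained (keywords added, changed, and removed); second, accuracy is low (a message containing a blacklisted keyword is not necessarily spam, and a message containing none is not necessarily ham); third, it cannot recognize messages that have been deliberately obfuscated. Take V1agra or 发*漂: a human reader sees at once that a message containing such words is spam, but a static list is helpless.
        </li>
<li>Regular-expression rule sets. Compared with a static keyword list, they are somewhat slower but extremely powerful and flexible. They work especially well on mail headers: headers follow regular patterns, and bulk spam in particular cannot avoid leaving traces there.<br />
        This approach has weaknesses too. Like the keyword list, it needs someone to maintain it, and both the difficulty of configuration and the cost of maintenance are far higher. On message bodies, regex scanning is relatively slow, especially against spam carefully crafted with obfuscation. A sizable share of messages are plainly spam to a human reader, yet are hard to capture in a regex rule. A new rule writer may need a long time to get a feel for the trade-off between accuracy and the number of messages caught.
        </li>
<li>Bayesian filtering. If certain words are known to appear often in spam but rarely in legitimate mail, then a message containing those words is very likely to be spam. See <a href="http://home.q.yesky.com/space-4148078-do-blog-id-412454.html" title="我爱正则表达式" target="_blank">Bayesian filtering techniques</a>.<br />
        Its biggest advantage is that, given a sufficiently robust algorithm and enough training samples, its accuracy is very high. It also relies mainly on machine learning and needs little manual intervention afterwards.</li>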
</ul>
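<p>The difference between the first two approaches can be sketched in a few lines. The blacklist and the rule below are hypothetical and written for Python 3; they are not taken from any real filter.</p>

```python
import re

# Hypothetical blacklist and rule, for illustration only.
KEYWORDS = {"viagra"}
# Allow "1", "!" or "|" in place of "i", and any run of punctuation,
# spaces or underscores between the letters.
RULE = re.compile(r"v[\W_]*[i1!|][\W_]*a[\W_]*g[\W_]*r[\W_]*a", re.IGNORECASE)


def keyword_hit(text):
    """Static list: a plain, case-insensitive substring lookup."""
    lowered = text.lower()
    return any(word in lowered for word in KEYWORDS)


def regex_hit(text):
    """Regex rule: tolerates look-alike characters and inserted punctuation."""
    return RULE.search(text) is not None


for subject in ["Cheap Viagra here", "Cheap V1agra here", "Cheap V.i.a.g.r.a"]:
    print(subject, keyword_hit(subject), regex_hit(subject))
```

<p>The static lookup only catches the literal spelling, while the single rule also covers the obfuscated variants. The price is a pattern that is harder to read and maintain.</p>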
</blockquote>
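<p>The Bayesian idea can be illustrated with a deliberately tiny naive Bayes classifier. This is a sketch under simplifying assumptions (whitespace tokenization, add-one smoothing, at least one message of each class, a made-up training set) and not the algorithm of any particular product.</p>

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam) pairs.

    Returns per-class word counts and per-class message counts.
    """
    words = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        words[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return words, totals


def spam_probability(text, words, totals):
    """Estimate P(spam | text) with add-one (Laplace) smoothing."""
    vocab = set(words[True]) | set(words[False])
    n_spam = sum(words[True].values())
    n_ham = sum(words[False].values())
    # Start from the prior odds, then add the evidence of each known word.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (words[True][word] + 1) / (n_spam + len(vocab))
        p_ham = (words[False][word] + 1) / (n_ham + len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Made-up training data, for illustration only.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting notes attached", False),
    ("lunch at noon tomorrow", False),
]
words, totals = train(training)
print(spam_probability("buy cheap pills", words, totals))
print(spam_probability("meeting at noon", words, totals))
```

<p>With more training samples the word statistics become more reliable, which is why user feedback matters so much to this kind of filter.</p>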
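<p>Some websites in China have extremely crude content filters: as soon as a single character such as “日”, “操”, or “干” appears, the text is treated as spam mail or a spam comment without any analysis of the context, which is both infuriating and laughable. Likewise, the common surname characters in the Hundred Family Surnames (《百家姓》) are not junk or banned words in themselves; put them on a static list and even a word like 萝卜 (radish) becomes unsearchable. In fact, a little regular-expression work (lookaround) or Bayesian technique (conditional probability) would raise the filtering quality to everyone's satisfaction. Of course, when hundreds of millions of pages have to be scanned, speed certainly takes priority over accuracy, and in some cases roughly right is the best that can be done. But what use is a system that is constantly wrong, even if it is a little faster? (As the idiom 南辕北辙 has it: heading south with the cart pointed north.) Then again, the rule has always been to punish the innocent rather than let the guilty slip through.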
</p>
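<p>The lookaround idea mentioned above can be sketched as follows. The list of harmless words is made up and far from complete; the point is only that a rule can inspect the neighbouring characters instead of reacting to a single character. The sketch is written for Python 3.</p>

```python
import re

# Naive rule: flag the character wherever it occurs.
NAIVE = re.compile("干")

# Hypothetical contextual rule: ignore the character when it is part of a
# few harmless words: 能干 (capable), 干净 (clean), 干燥 (dry),
# 干部 (cadre), 干杯 (cheers).
CONTEXTUAL = re.compile("(?<!能)干(?!净|燥|部|杯)")


def flagged(text, rule):
    """Return True if the rule finds a suspicious occurrence in text."""
    return rule.search(text) is not None


for text in ["把桌子擦干净", "他很能干", "干"]:
    print(text, flagged(text, NAIVE), flagged(text, CONTEXTUAL))
```

<p>The negative lookbehind and lookahead consume no characters, so the rule still matches only the single character while taking its context into account.</p>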
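<p>By its nature, a static list can be open only to administrators and must be kept strictly secret from ordinary visitors. This leads to another phenomenon: a post cannot be displayed because it contains a certain keyword. As for which words are keywords: sorry, that cannot be disclosed, for fear that once the list is public, anyone wanting to post such content could easily get around it. So it falls to the moderators to police themselves strictly and do their best to divine what their superiors want.</p>
<p>Only by combining static and dynamic methods, and people and machines, can a system stay fresh the longer it is used. @chunzi put it vividly: fighting spam is not a marathon that tests endurance but evolution by survival of the fittest. The demon is always ten feet ahead before the defender grows, foot by foot, and finally overpowers it; meanwhile a new demon is about to appear.</p>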
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some Tools</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
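<p>If you say your mailbox does not actually contain much spam, or that whatever arrives is already moved to the junk folder automatically, there are two possible explanations. One is that your mail provider's own anti-spam system is doing a good job (a great many obvious spam messages of every kind are blocked on the server before they ever reach your mailbox). The other is that your address has not yet been harvested by a crawler or guessed.</p>
<ul>
<li>
    <a href="http://spamassassin.apache.org/" title="我爱正则表达式" target="_blank">SpamAssassin</a> is a good anti-spam system. It is free, it is an Apache Software Foundation project, and it has powerful regular-expression support. A group in China maintains a regularly updated Chinese rule set, available <a href="http://www.ccert.edu.cn/spam/sa/Chinese_rules.htm" title="我爱正则表达式" target="_blank">here</a>, which is worth consulting. SpamAssassin does in fact have a Bayesian module as well; it is simply better known for its regex rules.
    </li>
<li>WordPress has the Akismet plugin for blocking spam comments on blogs. It is very easy to set up (you only need to apply for an API key), and it works intelligently: users never have to add rules by hand. When it gets a verdict wrong, users are expected to report the mistake to Akismet so that it can learn new variants. These user reports of missed spam and false positives are an indispensable corpus. Without them, Akismet would stubbornly keep making the same mistakes.
    </li>
</ul>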
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Personal Use</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
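<p>Discuz, developed in China, is a good forum package. It lacks a decent anti-spam module, though, and I had been tediously deleting spam and spammers by hand. A look at the Akismet (Ak) website showed that nearly every well-known forum package outside China already has an Akismet plugin. So I studied the Discuz (dz) database schema and wrote a Python script that periodically looks for new posts and submits them to Akismet for a verdict. If a post is judged to be spam, the script hides it and deducts points from the member who posted it. I have only just started trying it out, and so far it works reasonably well. It could equally be written as a native PHP plugin and integrated into Discuz.</p>
<p>The workflow is straightforward:</p>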
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
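<li>Periodically fetch the newest posts;</li>
<li>For each new post:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Submit it to Akismet for checking.</li>
<li>If it is not spam, ignore it.</li>
<li>If it is spam, make the post visible to administrators only and deduct points from the user.</li>
</ul>
</blockquote>
<li>Collect the pending operations, execute the SQL, and commit.</li>
<li>Generate a report and email it to the administrator.</li>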
</ul>
</blockquote>
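<p>The steps above can be sketched without a database or network access. In this Python 3 sketch the functions <code>triage</code>, <code>build_sql</code>, and <code>fake_check</code> are hypothetical: <code>fake_check</code> merely stands in for the Akismet call, and the SQL text is modelled on the statements used in the script further down.</p>

```python
from collections import Counter


def triage(posts, is_spam):
    """Return (spam_pids, spam_count_per_uid) for the given posts.

    posts   -- list of dicts with 'pid', 'uid' and 'msg' keys
    is_spam -- callable taking a post dict and returning True for spam
    """
    spam_pids = []
    per_user = Counter()
    for post in posts:
        if not is_spam(post):
            continue  # legitimate post: ignore it
        spam_pids.append(post["pid"])
        per_user[post["uid"]] += 1
    return spam_pids, per_user


def build_sql(spam_pids, per_user, prefix="cdb_", punish_score=2):
    """Collect the statements to run in one batch before committing."""
    statements = []
    if spam_pids:
        # int() guards against anything but numeric ids reaching the SQL.
        ids = ",".join(str(int(pid)) for pid in spam_pids)
        statements.append(
            "update %sposts set `status`=1 where pid in (%s);" % (prefix, ids))
    for uid, count in sorted(per_user.items()):
        statements.append(
            "update %smembers set credits=credits-%d where uid=%d;"
            % (prefix, count * punish_score, int(uid)))
    return statements


def fake_check(post):
    # Stand-in for the Akismet verdict, for illustration only.
    return "buy now" in post["msg"]


posts = [
    {"pid": 101, "uid": 7, "msg": "hello everyone"},
    {"pid": 102, "uid": 9, "msg": "cheap pills, buy now"},
    {"pid": 103, "uid": 9, "msg": "buy now, limited offer"},
]
pids, per_user = triage(posts, fake_check)
for statement in build_sql(pids, per_user):
    print(statement)
```

<p>Passing the checker in as a function keeps the decision logic testable without calling the remote service.</p>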
<p>Akismet's developer page is here: <a href="http://akismet.com/development/" title="我爱正则表达式" target="_blank">Akismet API Documentation</a>. I took the <a href="http://www.voidspace.org.uk/python/modules.shtml" title="我爱正则表达式" target="_blank">Python module</a> listed there and wrapped it in a class, so that only init and check are needed:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;comment.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-14 15:58</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">from</span> akismet <span style="color: #ff7700;font-weight:bold;">import</span> Akismet<br />
<br />
<span style="color: #ff7700;font-weight:bold;">class</span> Comment<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">api</span><span style="color: #66cc66;">=</span>Akismet<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">api</span> <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span>.<span style="color: black;">api</span>.<span style="color: black;">verify_key</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'No valid Akismet API key'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">raise</span> <span style="color: #008000;">SystemExit</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> init<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: #66cc66;">,</span> comment<span style="color: #66cc66;">,</span> <span style="color: #dc143c;">user</span><span style="color: #66cc66;">=</span><span style="color: #483d8b;">''</span><span style="color: #66cc66;">,</span> ip<span style="color: #66cc66;">=</span><span style="color: #483d8b;">''</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">email</span><span style="color: #66cc66;">=</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: #dc143c;">user</span><span style="color: #66cc66;">=</span><span style="color: #dc143c;">user</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">comment</span><span style="color: #66cc66;">=</span>comment.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">ip</span><span style="color: #66cc66;">=</span>ip<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: #dc143c;">email</span><span style="color: #66cc66;">=</span><span style="color: #dc143c;">email</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> check<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">api</span>.<span style="color: black;">comment_check</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">comment</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'comment_author'</span>: <span style="color: #008000;">self</span>.<span style="color: #dc143c;">user</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'comment_author_email'</span>:<span style="color: #008000;">self</span>.<span style="color: #dc143c;">email</span><span style="color: #66cc66;">,</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'user_ip'</span>:<span style="color: #008000;">self</span>.<span style="color: black;">ip</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'user_agent'</span>:<span style="color: #483d8b;">&quot;Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8&quot;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#125;</span><span style="color: #66cc66;">,</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>This is the main part of the program:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;dzas.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-14 15:20</span><br />
<br />
<span style="color: #808080; font-style: italic;">#anti spam for discuz! bbs.</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">from</span> comment <span style="color: #ff7700;font-weight:bold;">import</span> Comment <span style="color: #ff7700;font-weight:bold;">as</span> C<br />
<span style="color: #ff7700;font-weight:bold;">from</span> eml <span style="color: #ff7700;font-weight:bold;">import</span> Email<br />
<br />
<span style="color: #808080; font-style: italic;">#########################################################</span><br />
<span style="color: #808080; font-style: italic;">#global settings</span><br />
send_email_log <span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
<br />
dbhost <span style="color: #66cc66;">=</span> <span style="color: #483d8b;">'yourhost.website.com'</span><br />
dbuser <span style="color: #66cc66;">=</span> <span style="color: #483d8b;">'db_user'</span><br />
dbpw <span style="color: #66cc66;">=</span> <span style="color: #483d8b;">'yourpassword'</span><br />
dbname <span style="color: #66cc66;">=</span> <span style="color: #483d8b;">'dbname'</span><br />
dbpre<span style="color: #66cc66;">=</span><span style="color: #483d8b;">'cdb_'</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#punish_score: points deducted from a user for each spam post</span><br />
punish_score<span style="color: #66cc66;">=</span><span style="color: #ff4500;">2</span><br />
<br />
<span style="color: #808080; font-style: italic;">#the look-back window is 2 hours (2*3600 seconds in the query); change it as needed</span><br />
sql<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #483d8b;">'last_n_hr'</span>:<span style="color: #483d8b;">'''select `%sposts`.pid, `%sposts`.author, `%sposts`.message, `%sposts`.useip, `%smembers`.email, authorid from `%sposts`, `%smembers` where dateline&gt;UNIX_TIMESTAMP(now())-2*3600 and `%sposts`.author=`%smembers`.username order by pid desc;'''</span> % <span style="color: black;">&#40;</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span>dbpre<span style="color: #66cc66;">,</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; <span style="color: #483d8b;">'hide'</span>:<span style="color: #483d8b;">'update %sposts set `status`= 1 where pid in (%s);'</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; <span style="color: #483d8b;">'punish'</span>:<span style="color: #483d8b;">'update %smembers set credits=credits-%s where uid=%s;'</span><span style="color: #66cc66;">,</span> <br />
&nbsp; &nbsp; <span style="color: #483d8b;">'find_hided'</span>:<span style="color: #483d8b;">&quot;SELECT * FROM &nbsp;`%sposts` WHERE &nbsp;`status`=1;&quot;</span>%<span style="color: black;">&#40;</span>dbpre<span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span><br />
<span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> MySQLdb<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> init_db<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; db<span style="color: #66cc66;">=</span>MySQLdb.<span style="color: black;">connect</span><span style="color: black;">&#40;</span>host<span style="color: #66cc66;">=</span>dbhost<span style="color: #66cc66;">,</span> <span style="color: #dc143c;">user</span><span style="color: #66cc66;">=</span>dbuser<span style="color: #66cc66;">,</span> passwd<span style="color: #66cc66;">=</span>dbpw<span style="color: #66cc66;">,</span>db<span style="color: #66cc66;">=</span>dbname<span style="color: #66cc66;">,</span> charset<span style="color: #66cc66;">=</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> db<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> hide_spam<span style="color: black;">&#40;</span>db<span style="color: #66cc66;">,</span> spam<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c<span style="color: #66cc66;">=</span>db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; sql_str<span style="color: #66cc66;">=</span>sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'hide'</span><span style="color: black;">&#93;</span> % <span style="color: black;">&#40;</span>dbpre<span style="color: #66cc66;">,</span><span style="color: #483d8b;">','</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>spam<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql_str<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; db.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> punish<span style="color: black;">&#40;</span>db<span style="color: #66cc66;">,</span>score<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c<span style="color: #66cc66;">=</span>db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> u <span style="color: #ff7700;font-weight:bold;">in</span> score.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; s<span style="color: #66cc66;">=</span>score<span style="color: black;">&#91;</span>u<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; sql_str<span style="color: #66cc66;">=</span>sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'punish'</span><span style="color: black;">&#93;</span> % <span style="color: black;">&#40;</span>dbpre<span style="color: #66cc66;">,</span>s*punish_score<span style="color: #66cc66;">,</span>u<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql_str<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; db.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> get_msg<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c<span style="color: #66cc66;">=</span>db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'last_n_hr'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; records<span style="color: #66cc66;">=</span> c.<span style="color: black;">fetchall</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; result<span style="color: #66cc66;">=</span><span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> records:<br />
&nbsp; &nbsp; &nbsp; &nbsp; result.<span style="color: black;">append</span><span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'pid'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'user'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'msg'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'ip'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'email'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'uid'</span>:r<span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#125;</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> result<br />
<br />
<span style="color: #808080; font-style: italic;">#send log to admin. change the global variable </span><br />
<span style="color: #808080; font-style: italic;">#send_email_log = 1 or 0 to enable/disable</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> report<span style="color: black;">&#40;</span>spam<span style="color: #66cc66;">,</span> score<span style="color: black;">&#41;</span>:<br />
<br />
&nbsp; &nbsp; sub<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;%s spam posts captured&quot;</span> % <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>spam<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; body<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;spammers including: %s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> % <span style="color: #483d8b;">', '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span> <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'user'</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; body+<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;spam pids including: %s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> % <span style="color: #483d8b;">', '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span> <span style="color: black;">&#91;</span> <span style="color: #008000;">str</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'pid'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><br />
&nbsp;<br />
&nbsp; &nbsp; body+<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;useful sql: %s<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> % sql<span style="color: black;">&#91;</span><span style="color: #483d8b;">'find_hided'</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; body+<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;spam preview: <span style="color: #000099; font-weight: bold;">\n</span>%s&quot;</span> % <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span> m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'msg'</span><span style="color: black;">&#93;</span>.<span style="color: black;">splitlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; Email<span style="color: black;">&#40;</span>sub.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span>body.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#the core part</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> anti_spam<span style="color: black;">&#40;</span>db<span style="color: #66cc66;">,</span> msgs<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; c<span style="color: #66cc66;">=</span>C<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; spam<span style="color: #66cc66;">=</span><span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; score<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>.<span style="color: black;">fromkeys</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'uid'</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> msgs<span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span> <span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> msgs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; c.<span style="color: black;">init</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'msg'</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span> m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'user'</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span> m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'ip'</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span> m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'email'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> c.<span style="color: black;">check</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; score<span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'uid'</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; spam.<span style="color: black;">append</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> s <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">list</span><span style="color: black;">&#40;</span>score.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> score<span style="color: black;">&#91;</span>s<span style="color: black;">&#93;</span><span style="color: #66cc66;">==</span><span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">del</span> score<span style="color: black;">&#91;</span>s<span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> score <span style="color: #ff7700;font-weight:bold;">and</span> spam: <br />
&nbsp; &nbsp; &nbsp; &nbsp; hide_spam<span style="color: black;">&#40;</span>db<span style="color: #66cc66;">,</span> <span style="color: black;">&#91;</span><span style="color: #008000;">str</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#91;</span><span style="color: #483d8b;">'pid'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> m <span style="color: #ff7700;font-weight:bold;">in</span> spam<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; punish<span style="color: black;">&#40;</span>db<span style="color: #66cc66;">,</span> score<span style="color: black;">&#41;</span> <br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> send_email_log:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; report<span style="color: black;">&#40;</span>spam<span style="color: #66cc66;">,</span> score<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>: <br />
&nbsp; &nbsp; db<span style="color: #66cc66;">=</span>init_db<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; msg<span style="color: #66cc66;">=</span>get_msg<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; anti_spam<span style="color: black;">&#40;</span>db<span style="color: #66cc66;">,</span> msg<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/anti-spam.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>How to Build a Complex Regular Expression</title>
		<link>http://iregex.org/blog/craft-complex-regex.html</link>
		<comments>http://iregex.org/blog/craft-complex-regex.html#comments</comments>
		<pubDate>Fri, 06 Aug 2010 07:02:46 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Q&A]]></category>
		<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=138</guid>
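		<description><![CDATA[Yesterday Snopo asked me how to write a regular expression that extracts the condition clause of a SQL statement. Having answered, I wanted to write up the experience. The title was originally going to be "How to Construct Complicated Regular Expressions", but that felt ambiguous, as if regular expressions were simple to begin with... ]]></description>
			<content:encoded><![CDATA[<p>Yesterday Snopo asked me how to write a regular expression that extracts the condition clause of a SQL statement. Having answered, I wanted to write up the experience. The title was originally going to be "How to Construct Complicated Regular Expressions", but that felt ambiguous, as if regular expressions were simple to begin with and I were teaching people how to make them complicated. The opposite is true: my point is that even a complex regular expression is nothing to fear once you find the right method for building it.</p>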
<p><span id="more-138"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Taking the Easy Way Out</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>
    The text Snopo gave me was: <code class="codecolorer text default"><span class="text">or and name='zhangsan' and id=001 or age&gt;20 or area='%renmin%' and like</span></code>. The question: how do you extract the valid SQL condition from it?</p>
<p>
       A quick look shows that the middle part is valid; only the two ends carry a few stray <code class="codecolorer text default"><span class="text">like, or, and</span></code> keywords. A regular expression that parses syntactically valid SQL conditions would be fairly complex. For this specific problem, though, there is a simpler way. The malformed SQL above was presumably generated by a program, which leaves some unwanted text at both ends. Removing that text is all it takes.</p>
<p>So I wrote this regular expression: <code class="codecolorer perl default"><span class="perl"><span style="color: #000066;">s</span><span style="color: #339933;">/^</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #b1b100;">or</span><span style="color: #339933;">|</span><span style="color: #b1b100;">and</span><span style="color: #339933;">|</span>like<span style="color: #009900;">&#41;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+|</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #b1b100;">or</span><span style="color: #339933;">|</span><span style="color: #b1b100;">and</span><span style="color: #339933;">|</span>like<span style="color: #009900;">&#41;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span><span style="color: #0000ff;">$/</span><span style="color: #339933;">/</span>gmi<span style="color: #339933;">;</span></span></code>. It strips the <code class="codecolorer text default"><span class="text">like, or, and</span></code> keywords, along with any whitespace, from the start and end of each line of a multi-line string; the <code class="codecolorer text default"><span class="text">g</span></code> flag is needed so that both ends are replaced, not just the first match. What remains is the answer.</p>
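<p>The same trimming idea can be sketched in Python. This is a minimal sketch, not the author's code: the names <code>STRAY</code> and <code>strip_stray_keywords</code> are hypothetical, and the <code>\b</code> word boundaries are an added assumption so that a column name such as <code>orders</code> is left alone.</p>

```python
import re

# Runs of stray keywords at the start or end of a line, plus whitespace.
# The \b boundaries are an addition to the pattern discussed above.
STRAY = re.compile(
    r"^(?:\b(?:or|and|like)\b\s*)+|\s*(?:\b(?:or|and|like)\b\s*)+$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_stray_keywords(text):
    """Remove dangling or/and/like keywords from both ends of each line."""
    # re.sub replaces every match, which plays the role of Perl's /g flag.
    return STRAY.sub("", text)


raw = "or and name='zhangsan' and id=001 or age>20 or area='%renmin%' and like"
cleaned = strip_stray_keywords(raw)
print(cleaned)
```

<p>Because <code>re.sub</code> replaces all matches by default, no extra flag is needed to handle both ends in one call.</p>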
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Divide and Conquer</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>
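        After I sent the answer, Snopo was clearly not satisfied with such a "lazy" approach. He went on to ask whether a regular expression could match condition clauses that conform to SQL syntax. (Only the where part needs to be considered, not a complete select.)
    </p>
<p>True, if the goal is to solve the problem quickly, any method that works will do. If the goal is to learn, however, the right path is to get to the bottom of things rather than dodge the hard part. So let us see how to handle this SQL condition with regular expressions.</p>
<p>
        The simplest condition is a truth test, such as <code class="codecolorer text default"><span class="text">where 1; where True; where false</span></code>, and so on. A regular expression for these is simply <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:-?</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+|</span>True<span style="color: #339933;">|</span>False<span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span>i</span></code>.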
        </p>
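<p>As a quick illustration, the truth-test pattern can be checked in Python. This is a sketch under assumptions: the names <code>LITERAL</code> and <code>is_literal_condition</code> are hypothetical, and <code>fullmatch</code> is used so that the whole condition, not just a fragment of it, has to be a literal.</p>

```python
import re

# An optionally signed integer, or the words True/False in any letter case.
LITERAL = re.compile(r"(?:-?\d+|True|False)", re.IGNORECASE)


def is_literal_condition(condition):
    """Return True when the whole condition is a bare truth value."""
    return LITERAL.fullmatch(condition.strip()) is not None


for cond in ["1", "True", "false", "-3", "name"]:
    print(cond, is_literal_condition(cond))
```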
<p>
        A slightly more complex single statement is a left-to-right comparison, such as</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">name like 'zhang%', or age&gt;25, or work in ('it', 'hr', 'R&amp;D')</div></td></tr></tbody></table></div>
<p>Simplified, the structure becomes <code class="codecolorer text default"><span class="text">A OP B</span></code>, where A stands for the variable, OP for the comparison operator, and B for the value.
        </p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>A: The simplest A is <code class="codecolorer text default"><span class="text">\w+</span></code>. In practice a variable may contain dots or backticks, as in <code class="codecolorer text default"><span class="text">`table.salary`</span></code>, so it can be written as <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\w</span><span style="color: #339933;">.</span><span style="color: #ff0000;">`]+/</span></span></code>. This is a fairly loose refinement. If stricter matching is required, a conditional can force the backticks to appear on both sides together. </li>
<li>OP: The comparison operators commonly used in a where clause are: <code class="codecolorer text default"><span class="text">=, &lt;&gt;, &gt;, &lt;, &gt;=, &lt;=, Between, Like, in</span></code>. A simple regular expression for them is: <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">&lt;&gt;=</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">|</span>Between<span style="color: #339933;">|</span>Like<span style="color: #339933;">|</span>In<span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span>i</span></code>.
                </li>
<li>B: B falls into four cases: variable, number, string, and list. To keep things simple, arithmetic expressions are not considered here.<br />
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Variable: reuse the definition of A directly; nothing more to add.</li>
<li>Number: defined as <code class="codecolorer text default"><span class="text">/\d+/</span></code>. Decimals and negative numbers are not considered.</li>
<li>String: covers both single-quoted and double-quoted strings, which may contain escaped quotes. I wrote a quoted-string regular expression that meets this requirement, of the form <code class="codecolorer perl default"><span class="perl">/(['&quot;])(?:\\['&quot;]|(?!\1)[^\\])*?\1/</span></code>. However, since it is only one part of a large machine, writing it this way is very risky. First, it uses a backreference; second, that backreference uses a global group number. I wrote a function that generates the global numbering automatically to solve this problem. But going into such detail here is getting ahead of ourselves. The framework should come first and the details second, rather than drowning in details from the start. </li>
<li>List: a list looks like <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">3</span> <span style="color: #339933;">,</span> <span style="color: #cc66cc;">4</span><span style="color: #009900;">&#41;</span> or <span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;it&quot;</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;hr&quot;</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;r&amp;d&quot;</span><span style="color: #009900;">&#41;</span></span></code>: simple values joined by commas and wrapped in parentheses. Let I denote a single list item, meaning number|string. The list then becomes <code class="codecolorer text default"><span class="text">/\(I(?:,I)*?\)/</span></code>, that is, a left parenthesis, one I, zero or more further items each made of a comma and an I, and a right parenthesis. Whitespace is ignored for simplicity.
                        </li>
</ul>
</blockquote>
</li>
<li>At this point the regex skeleton for a single statement can be summarized as <code class="codecolorer perl default"><span class="perl">S <span style="color: #339933;">=~</span> <span style="color: #339933;">/</span>A OP B<span style="color: #339933;">/</span>i</span></code>, where S stands for a single statement. </li>
</ul>
</blockquote>
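<p>The building blocks A, OP and B listed above can be assembled roughly as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the helper names <code>quoted</code>, <code>item</code>, <code>item_list</code> and <code>single</code> and the group names <code>q1</code>, <code>q2</code>, ... are hypothetical. Each quoted-string fragment gets its own named group, so its backreference never depends on global group numbering.</p>

```python
import itertools
import re

_counter = itertools.count(1)  # source of unique group names (assumption)

A = r"[\w.`]+"                               # variable, possibly backticked
OP = r"(?:[<>=]{1,2}|Between|Like|In)"       # comparison operators


def quoted():
    """Quoted string allowing escapes; every call uses a fresh group name."""
    name = "q%d" % next(_counter)
    # \\. skips an escaped character; the lookahead stops at the closing quote.
    return r"""(?P<%s>['"])(?:\\.|(?!(?P=%s))[^\\])*(?P=%s)""" % (name, name, name)


def item():
    """A list item I: number or quoted string."""
    return r"(?:\d+|%s)" % quoted()


def item_list():
    """Parenthesised list: ( I , I , ... ) with optional whitespace."""
    return r"\(\s*%s(?:\s*,\s*%s)*\s*\)" % (item(), item())


def single():
    """Single statement S = A OP B, where B is a list, a string, or a word."""
    return r"%s\s*%s\s*(?:%s|%s|%s)" % (A, OP, item_list(), quoted(), A)


S = re.compile(single(), re.IGNORECASE)

for text in ["name like 'zhang%'", "age>25", "work in ('it', 'hr', 'R&D')"]:
    print(text, bool(S.fullmatch(text)))
```

<p>Numbers need no separate branch in B here because <code>\w</code> already covers digits.</p>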
<p>
        More complex still is a compound statement, which is made of single statements joined by and or or. Build the single statement properly, compose it reliably into the compound statement, and the job is done.
    </p>
<p>
        Continuing the notation above, with S for a single statement, the compound statement C is <code class="codecolorer perl default"><span class="perl">C <span style="color: #339933;">=~</span> <span style="color: #339933;">/</span>S<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #b1b100;">or</span><span style="color: #339933;">|</span><span style="color: #b1b100;">and</span><span style="color: #009900;">&#41;</span> S<span style="color: #009900;">&#41;</span><span style="color: #339933;">*/</span></span></code>. With that, a rudimentary condition parser is born. Below it is implemented step by step in Python.
    </p>
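<p>Before the full implementation, here is a compact sketch of the C = S ((and|or) S)* composition, applied to Snopo's original input. It rests on simplifying assumptions that are not in the original: quoted strings are spelled out once per quote type, so no backreference is needed and the same fragment can be repeated freely, and the names <code>QUOTED</code>, <code>ITEM</code>, <code>LIST</code>, <code>S</code> and <code>C</code> are illustrative only.</p>

```python
import re

# Simplified fragments (assumptions of this sketch).
QUOTED = r"""(?:'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*")"""
A = r"[\w.`]+"
OP = r"(?:[<>=]{1,2}|between|like|in)"
ITEM = r"(?:\d+|%s)" % QUOTED
LIST = r"\(\s*%s(?:\s*,\s*%s)*\s*\)" % (ITEM, ITEM)

# Single statement: A OP B, where B is a list, a quoted string, or a word.
S = r"%s\s*%s\s*(?:%s|%s|%s)" % (A, OP, LIST, QUOTED, A)

# Compound statement: one S followed by any number of "and S" / "or S".
C = re.compile(r"%s(?:\s+(?:and|or)\s+%s)*" % (S, S), re.IGNORECASE)

raw = "or and name='zhangsan' and id=001 or age>20 or area='%renmin%' and like"
match = C.search(raw)
print(match.group(0))
```

<p>Searching rather than anchoring lets the pattern skip the stray keywords at both ends and return only the well-formed middle part.</p>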
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Python Implementation</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>To repeat: although an implementation is given, please focus on the approach rather than the code.</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;test.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-06 17:12</span><br />
<br />
<span style="color: #808080; font-style: italic;">#generate quoted string;</span><br />
<span style="color: #808080; font-style: italic;">#including ' and &quot; string</span><br />
<span style="color: #808080; font-style: italic;">#allow \' and \&quot; inside</span><br />
index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">0</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> gen_quote_str<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>: <br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">global</span> index<br />
&nbsp; &nbsp; index+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; char<span style="color: #66cc66;">=</span><span style="color: #008000;">chr</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">96</span>+index<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;(?P&lt;quote_%s&gt;['&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['&quot;]|[^'&quot;])*?(?P=quote_%s)&quot;&quot;&quot;</span>% <span style="color: black;">&#40;</span>char<span style="color: #66cc66;">,</span> char<span style="color: black;">&#41;</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#simple variable </span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">'[<span style="color: #000099; font-weight: bold;">\w</span>.`]+'</span><br />
<br />
<span style="color: #808080; font-style: italic;">#operators</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> op<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">'(?:[&lt;&gt;=]{1,2}|Between|Like|In)'</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#list item within (,)</span><br />
<span style="color: #808080; font-style: italic;">#eg: 'a', 23, a.b, &quot;asdfasdf\&quot;aasdf&quot;</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;(?:%s|%s)&quot;</span> % <span style="color: black;">&#40;</span>a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> gen_quote_str<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<br />
<span style="color: #808080; font-style: italic;">#a complete list, like</span><br />
<span style="color: #808080; font-style: italic;">#eg: (23, 24, 44), (&quot;regex&quot;, &quot;is&quot;, &quot;good&quot;)</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;<span style="color: #000099; font-weight: bold;">\(</span> <span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; %s <br />
&nbsp; &nbsp; (?:,<span style="color: #000099; font-weight: bold;">\s</span>* %s)* <span style="color: #000099; font-weight: bold;">\s</span>* <br />
<span style="color: #000099; font-weight: bold;">\)</span>&quot;&quot;&quot;</span> % <span style="color: black;">&#40;</span>item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#simple comparison</span><br />
<span style="color: #808080; font-style: italic;">#eg: a=15 , b&gt;23</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;%s <span style="color: #000099; font-weight: bold;">\s</span>* %s <span style="color: #000099; font-weight: bold;">\s</span>* (?:<span style="color: #000099; font-weight: bold;">\w</span>+| %s | %s )&quot;&quot;&quot;</span> % <span style="color: black;">&#40;</span>a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> op<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> gen_quote_str<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
<span style="color: #808080; font-style: italic;">#complex comparison</span><br />
<span style="color: #808080; font-style: italic;"># name like 'zhang%' and age&gt;23 and work in (&quot;hr&quot;, &quot;it&quot;, 'r&amp;d')</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> c<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> r<span style="color: #483d8b;">&quot;&quot;&quot;<br />
(?ix) %s <br />
(?:<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; (?:and|or)<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; %s &nbsp;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
)*<br />
&nbsp; &nbsp; &nbsp;&quot;&quot;&quot;</span> % <span style="color: black;">&#40;</span>s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;A:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span> a<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;OP:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span> op<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;ITEM:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span> item<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;ITEMS:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span> items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;S:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span> s<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;C:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span> c<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>On my machine (Ubuntu 10.04, Python 2.6.5), the code produces the following output:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">A:&nbsp; <span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+<br />
OP: <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;=</span><span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span><span style="color: #66cc66;">,</span><span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span><br />
ITEM: &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_a<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_a<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
ITEMS:&nbsp; \<span style="color: black;">&#40;</span> \s* <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_b<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_b<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: #66cc66;">,</span>\s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_c<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_c<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>* \s* <br />
\<span style="color: black;">&#41;</span><br />
S:&nbsp; <span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+ \s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;=</span><span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span><span style="color: #66cc66;">,</span><span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span> \s* <span style="color: black;">&#40;</span>?:\w+| <span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_d<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_d<span style="color: black;">&#41;</span> | \<span style="color: black;">&#40;</span> \s* <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_e<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_e<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: #66cc66;">,</span>\s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_f<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_f<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>* \s* <br />
\<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span><br />
C:&nbsp; <br />
<span style="color: black;">&#40;</span>?ix<span style="color: black;">&#41;</span> <span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+ \s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;=</span><span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span><span style="color: #66cc66;">,</span><span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span> \s* <span style="color: black;">&#40;</span>?:\w+| <span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_g<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_g<span style="color: black;">&#41;</span> | \<span style="color: black;">&#40;</span> \s* <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_h<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_h<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: #66cc66;">,</span>\s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_i<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_i<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>* \s* <br />
\<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> <br />
<span style="color: black;">&#40;</span>?:\s*<br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: #ff7700;font-weight:bold;">and</span>|<span style="color: #ff7700;font-weight:bold;">or</span><span style="color: black;">&#41;</span>\s*<br />
&nbsp; &nbsp; <span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+ \s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span><span style="color: #66cc66;">&lt;&gt;=</span><span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">1</span><span style="color: #66cc66;">,</span><span style="color: #ff4500;">2</span><span style="color: black;">&#125;</span>|Between|Like|In<span style="color: black;">&#41;</span> \s* <span style="color: black;">&#40;</span>?:\w+| <span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_j<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_j<span style="color: black;">&#41;</span> | \<span style="color: black;">&#40;</span> \s* <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_k<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_k<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: black;">&#40;</span>?:<span style="color: #66cc66;">,</span>\s* <span style="color: black;">&#40;</span>?:<span style="color: black;">&#91;</span>\w.`<span style="color: black;">&#93;</span>+|<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">&lt;</span>quote_l<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'&quot;])(?:<span style="color: #000099; font-weight: bold;">\\</span>['</span><span style="color: #483d8b;">&quot;]|[^'&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>*?<span style="color: black;">&#40;</span>?P<span style="color: #66cc66;">=</span>quote_l<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>* \s* <br />
\<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> &nbsp;\s*<br />
<span style="color: black;">&#41;</span>*</div></td></tr></tbody></table></div>
<p>Here is a screenshot of the match results:</p>
<p><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/2010-08-07_Selection_02.png" /></p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Arithmetic expressions</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
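<p>Earlier I said that, for simplicity, arithmetic expressions would be left out. Still, parsing arithmetic expressions is an interesting topic that virtually every algorithms book covers (converting infix expressions to prefix form, and so on). It can also be described with a regular expression.</p>
<p>The main idea is this grammar:</p>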
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">expr -&gt; expr + term | expr - term | term<br />
term -&gt; term * factor | term / factor | factor<br />
factor -&gt; digit | ( expr )</div></td></tr></tbody></table></div>
<p>And the code:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;math.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-07 00:44</span><br />
<br />
integer<span style="color: #66cc66;">=</span>r<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\d</span>+&quot;</span><br />
<br />
factor<span style="color: #66cc66;">=</span>r<span style="color: #483d8b;">&quot;%s (?:<span style="color: #000099; font-weight: bold;">\.</span> %s)?&quot;</span> % <span style="color: black;">&#40;</span>integer<span style="color: #66cc66;">,</span> integer<span style="color: black;">&#41;</span><br />
<br />
term<span style="color: #66cc66;">=</span> r<span style="color: #483d8b;">&quot;%s(?: <span style="color: #000099; font-weight: bold;">\s</span>* [*/] <span style="color: #000099; font-weight: bold;">\s</span>* %s)* &quot;</span> % <span style="color: black;">&#40;</span>factor<span style="color: #66cc66;">,</span> factor<span style="color: black;">&#41;</span><br />
<br />
expr<span style="color: #66cc66;">=</span> r<span style="color: #483d8b;">&quot;(?x) %s(?: <span style="color: #000099; font-weight: bold;">\s</span>* [+-] <span style="color: #000099; font-weight: bold;">\s</span>* %s)* &quot;</span> % <span style="color: black;">&#40;</span>term<span style="color: #66cc66;">,</span> term<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>expr<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>Here is its output, along with a screenshot of the match results:</p>
<p><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/2010-08-07_Selection_01.png"/></p>
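<p>As a rough check of the composed pattern, the sketch below rebuilds it with raw strings and tests whole strings against it. The helper name is_flat_expression is hypothetical, and re.VERBOSE stands in for the inline (?x) flag. As in the code above, factor is only a number, so the ( expr ) branch of the grammar is not covered.</p>

```python
import re

# Building blocks, mirroring the names used in the post.
integer = r"\d+"
factor = r"%s (?:\. %s)?" % (integer, integer)
term = r"%s (?: \s* [*/] \s* %s)*" % (factor, factor)
expr = r"%s (?: \s* [+-] \s* %s)*" % (term, term)

# re.VERBOSE ignores the spaces inside the pattern, like the (?x) flag.
EXPR_RE = re.compile(expr, re.VERBOSE)


def is_flat_expression(text):
    """Return True if text is an arithmetic expression without parentheses."""
    return EXPR_RE.fullmatch(text) is not None


print(is_flat_expression("1 + 2*3.5 - 4/5"))
print(is_flat_expression("(1+2)*3"))
```

<p>Matching the whole string, rather than searching inside it, makes it easy to see which inputs the pattern rejects, such as a trailing operator or any parenthesized group.</p>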
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Tips</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
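<li>If the problem can be solved without a complex regex, do not use one.</li>
<li>If you must write a fairly complex regex, follow the principles below.</li>
<li>Start from the big picture: understand the overall structure of the text to be parsed, then divide it into small components.</li>
<li>Work on the details: implement each component so that it is complete and robust on its own and does not conflict with the others when placed in the whole.
</li>
<li>Assemble the components sensibly.</li>
<li>The benefit of divide and conquer: when one module is wrong and the others are not, you can locate the error quickly and fix the bug.</li>
<li>Use capturing parentheses with care unless you know what you are doing, what side effects they have, and whether a workable remedy exists. In a short regex, one or two redundant pairs of parentheses do no harm; in a complex regex, a single redundant pair can be a fatal mistake.</li>
<li>Use free-spacing mode whenever possible. It lets you add comments and whitespace freely, which makes the regex more readable.</li>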
</ul>
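<p>The last two tips can be illustrated with a small sketch. The pattern and the names DECIMAL, whole, and frac are made up for this example; it only shows free-spacing mode with comments, plus a non-capturing group that keeps the group count down.</p>

```python
import re

# Free-spacing (verbose) mode: whitespace and "#" comments inside the
# pattern are ignored, so each component can sit on its own line.
DECIMAL = re.compile(r"""
    (?P<whole> \d+ )            # integer part, captured by name
    (?: \. (?P<frac> \d+ ) )?   # optional fraction; the wrapper group is non-capturing
""", re.VERBOSE)

match = DECIMAL.fullmatch("3.14")
print(match.group("whole"), match.group("frac"))
print(DECIMAL.groups)  # number of capturing groups in the pattern
```

<p>Only the two named groups are counted; the wrapper around the fraction groups its contents without adding a capture.</p>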
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/craft-complex-regex.html/feed</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Formatting HTML tag indentation</title>
		<link>http://iregex.org/blog/html-tag-indentation.html</link>
		<comments>http://iregex.org/blog/html-tag-indentation.html#comments</comments>
		<pubDate>Wed, 04 Aug 2010 05:06:40 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Q&A]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tag]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=137</guid>
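		<description><![CDATA[Reader “神の呼出” left a comment asking how to format the indentation of HTML tags, along with his own approach and solution, which relies purely on regex: backreferences to find matching tags, and recursion to handle nested tags. However, although these two features... ]]></description>
			<content:encoded><![CDATA[<p>Reader “神の呼出” <a href="http://iregex.org/blog/recursive-regex-in-php.html">left a comment</a> asking how to format the indentation of HTML tags, and gave his own approach and solution, which relies purely on regex: backreferences to find matching tags, and recursion to handle nested tags. Both features are useful, but they should not be overused. This post approaches the problem from another angle: it simplifies the idea and reduces the reliance on regex in order to improve speed.</p>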
<p><span id="more-137"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Problem description</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Overall goal: format HTML text and print it indented by tag level.</li>
<li>Tags are balanced, for example <code class="codecolorer html4strict default"><span class="html4strict"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">div</span>&gt;&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">div</span>&gt;</span></span></code>.</li>
<li>Tags may also appear unpaired, for example <code class="codecolorer html4strict default"><span class="html4strict"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">img</span> ... <span style="color: #66cc66;">/</span>&gt;</span></span></code>.</li>
<li>Tags may be nested, for example:
<div class="codecolorer-container html4strict mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br /></div></td><td><div class="html4strict codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">div</span>&gt;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">div</span>&gt;&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">div</span>&gt;</span><br />
<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">div</span>&gt;</span></div></td></tr></tbody></table></div>
</li>
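<li>The whole text may be given as a single line, with no newlines, horizontal tabs, or other whitespace.</li>
<li>On output, each new pair of tags indents one more level.</li>
<li>Innermost tags should sit at the same level.</li>
<li>All text should sit at the same level as its parent tag.</li>
<li>Standalone elements are indented at the same level.</li>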
</ul>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Approach</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>I will mainly describe the approach and give a Python implementation. The regexes used are all simple and can easily be ported to other languages.</p>
<ul>
<li>Since the source text is a single line (the default case), newlines must be inserted at suitable places. My idea is to insert a newline <strong>at every left angle bracket that is not at the start of a line</strong>. Multiline mode is needed so that <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">^</span></span></code> can match line starts inside the string. The regex is <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?&lt;!^</span><span style="color: #009900;">&#41;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?=&lt;</span><span style="color: #009900;">&#41;</span></span></code>. Note that it also removes any whitespace to the left of the angle bracket. In practice it handles multi-line input as well as single-line input.</li>
<li>Likewise, insert a newline to the right of every right angle bracket <strong>that is not at the end of a line</strong>. The regex is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#40;</span>?<span style="color: #66cc66;">&lt;=&gt;</span><span style="color: black;">&#41;</span>\s*<span style="color: black;">&#40;</span>?<span style="color: #66cc66;">=</span>\S<span style="color: black;">&#41;</span></span></code>, which also removes any whitespace to the right of the bracket.</li>
<li>Now process the text one <strong>line</strong> at a time.</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
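<li>Set a level variable, level, initialized to 0.</li>
<li>If the indentation moves one level to the right, level++; if it moves one level to the left, level--.</li>
<li>If the line contains <code class="codecolorer text default"><span class="text">&quot;/&gt;&quot;</span></code>, it is a standalone unit and the level does not change: print level indentation marks followed by the line.</li>
<li>Otherwise, if the line starts with <code class="codecolorer text default"><span class="text">&quot;&lt;/&quot;</span></code>, it ends a level: do level-- first, then print the line. The order matters. </li>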
<li>Otherwise, if the line starts with <code class="codecolorer text default"><span class="text">&quot;&lt;&quot;</span></code>, it starts a level: print the line first, then do level++. The order matters here too. </li>
<li>Anything else is plain text: print it with the current indentation level.</li>
</ul>
</blockquote>
</ul>
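<p>The rules in the list above can be tried on a small sample before reading the full script. This is a sketch rather than the author's code: the function name format_html, its indent parameter, and the blank-line filter are assumptions added for illustration.</p>

```python
import re


def format_html(content, indent="\t"):
    """Indent HTML by tag level, following the line-by-line rules above."""
    content = content.strip()
    # Newline before every "<" that is not already at the start of a line.
    content = re.sub(r"(?m)(?<!^)\s*(?=<)", "\n", content)
    # Newline after every ">" that is followed by more content.
    content = re.sub(r"(?<=>)\s*(?=\S)", "\n", content)

    out = []
    level = 0
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines that the substitutions may leave behind
        if "/>" in line:            # standalone element: level unchanged
            out.append(indent * level + line)
        elif line.startswith("</"):  # closing tag: step out first, then print
            level -= 1
            out.append(indent * level + line)
        elif line.startswith("<"):   # opening tag: print first, then step in
            out.append(indent * level + line)
            level += 1
        else:                        # plain text: current level
            out.append(indent * level + line)
    return "\n".join(out)


sample = '<div><p>hi</p><img src="a.png" /></div>'
print(format_html(sample, indent="  "))
```

<p>Returning a string instead of printing makes the rules easy to check on short inputs; the sample contains one nested pair, one text node, and one standalone element.</p>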
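<p>    That is the whole program. To handle more complicated cases, add or remove statements as needed. For example, some of the text I processed contains tags that are so long they are split across several lines:</p>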
<div class="codecolorer-container html4strict mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br /></div></td><td><div class="html4strict codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #00bbdd;">&lt;!DOCTYPE </span><br />
<span style="color: #00bbdd;">&nbsp; &nbsp; html </span><br />
<span style="color: #00bbdd;">&nbsp; &nbsp; PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; </span><br />
<span style="color: #00bbdd;">&nbsp; &nbsp; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;</span></div></td></tr></tbody></table></div>
<p>For this case I can only offer a pure regex solution. It is not fast, but it neither repeats nor misses anything:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#combine &lt; \n..&gt; lines</span><br />
x<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span><span style="color: #66cc66;">,</span>content<span style="color: black;">&#41;</span> <br />
<span style="color: #ff7700;font-weight:bold;">while</span> x: <br />
&nbsp; &nbsp; content<span style="color: #66cc66;">=</span>content.<span style="color: black;">replace</span><span style="color: black;">&#40;</span>x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span>x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot; &quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; x<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span><span style="color: #66cc66;">,</span>content<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
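<p>One possible variation on the loop above, offered as a sketch and not as the author's method, is a single substitution with a replacement function. It assumes every tag is closed by a right angle bracket; the name join_tag_lines is hypothetical.</p>

```python
import re


def join_tag_lines(content):
    """Collapse line breaks that occur inside a single <...> tag."""
    def collapse(match):
        # Replace each newline, plus the whitespace around it, with one space.
        return re.sub(r"\s*\n\s*", " ", match.group(0))

    # [^<>]+ can span several lines, so one pass sees each whole tag.
    return re.sub(r"<[^<>]+>", collapse, content)


doctype = '<!DOCTYPE \n    html \n    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">'
print(join_tag_lines(doctype))
```

<p>Line breaks outside tags are left alone, because only text between a left and a right angle bracket is passed to the inner substitution.</p>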
<p>For this case, I cannot think of a solution that does not use regex.
</p></blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Python code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
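<p>The implementation is the minor part. Once you understand the approach above, it is easy to port to other languages such as JS or PHP; I leave that to the reader. If you have questions, please leave a comment.</p>
<p>Usage of the Python script: <code class="codecolorer bash default"><span class="bash">&nbsp;.<span style="color: #000000; font-weight: bold;">/</span>format_html.py source.html<span style="color: #000000; font-weight: bold;">&gt;</span> dest.html</span></code></p>
<p>Full code:</p>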
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;format_html.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-04</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><span style="color: #66cc66;">,</span><span style="color: #dc143c;">sys</span><br />
<br />
indent<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><br />
f<span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
content<span style="color: #66cc66;">=</span>f.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
content<span style="color: #66cc66;">=</span>content.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#combine &lt; \n..&gt; lines</span><br />
x<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span><span style="color: #66cc66;">,</span>content<span style="color: black;">&#41;</span> <br />
<span style="color: #ff7700;font-weight:bold;">while</span> x: <br />
&nbsp; &nbsp; content<span style="color: #66cc66;">=</span>content.<span style="color: black;">replace</span><span style="color: black;">&#40;</span>x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span>x.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot; &quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; x<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(&lt;[^&lt;&gt;]+)<span style="color: #000099; font-weight: bold;">\s</span>*<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\s</span>*&quot;</span><span style="color: #66cc66;">,</span>content<span style="color: black;">&#41;</span><br />
<br />
content<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?m)(?&lt;!^)<span style="color: #000099; font-weight: bold;">\s</span>*(?=&lt;)&quot;</span><span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">,</span> content<span style="color: black;">&#41;</span><br />
content<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?&lt;=&gt;)<span style="color: #000099; font-weight: bold;">\s</span>*(?=<span style="color: #000099; font-weight: bold;">\S</span>)&quot;</span><span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">,</span> content<span style="color: black;">&#41;</span><span style="color: #66cc66;">;</span><br />
lines<span style="color: #66cc66;">=</span>content.<span style="color: black;">splitlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
level<span style="color: #66cc66;">=</span><span style="color: #ff4500;">0</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> l <span style="color: #ff7700;font-weight:bold;">in</span> lines: <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #483d8b;">&quot;/&gt;&quot;</span> <span style="color: #ff7700;font-weight:bold;">in</span> l:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;%s%s&quot;</span>%<span style="color: black;">&#40;</span>indent*level<span style="color: #66cc66;">,</span>l<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">elif</span> l<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">==</span><span style="color: #483d8b;">'&lt;/'</span> : &nbsp;<br />
&nbsp; &nbsp; &nbsp; &nbsp; level -<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;%s%s&quot;</span>%<span style="color: black;">&#40;</span>indent*level<span style="color: #66cc66;">,</span>l<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">elif</span> l<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">==</span><span style="color: #483d8b;">'&lt;'</span>: <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;%s%s&quot;</span>%<span style="color: black;">&#40;</span>indent*level<span style="color: #66cc66;">,</span>l<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; level +<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span> <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;%s%s&quot;</span>%<span style="color: black;">&#40;</span>indent*level<span style="color: #66cc66;">,</span>l<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/html-tag-indentation.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Trie in Python</title>
		<link>http://iregex.org/blog/trie-in-python.html</link>
		<comments>http://iregex.org/blog/trie-in-python.html#comments</comments>
		<pubDate>Sun, 01 Aug 2010 14:58:54 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[trie]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=136</guid>
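		<description><![CDATA[For an introduction to Trie, see the previous post, Trie; I will not repeat it here. This post analyzes how Trie is implemented and gives a Python implementation. Building the retrieval tree: first, a correction to an imprecise statement in the previous post, which said: specifically, it extracts the common parts of the alternatives,... ]]></description>
			<content:encoded><![CDATA[<p>For an introduction to Trie, see the previous post, <a href="http://iregex.org/blog/trie.html" title="我爱正则表达式" target="_blank">Trie</a>; I will not repeat it here. This post analyzes how Trie is implemented and gives a Python implementation.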
</p>
<p><span id="more-136"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Building the retrieval tree</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>First, a correction to an imprecise statement in the previous post, which said:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Specifically, it extracts the common parts of the alternatives and builds a “retrieval tree”…</p></blockquote>
<p>In fact, Trie does not extract <strong>all</strong> the “common parts”, only the common prefixes. For example, for the regex <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>abbcc<span style="color: #339933;">|</span>abcc<span style="color: #339933;">/</span></span></code> it produces <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?-</span>xism<span style="color: #339933;">:</span>ab<span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>bcc<span style="color: #339933;">|</span>cc<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></span></code> rather than <code class="codecolorer perl default"><span class="perl">abb<span style="color: #339933;">?</span>cc</span></code>. It is not all that smart, so use it, but do not trust it blindly. The reasons can be understood by reading the source code and the analysis in this post.
    </p>
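<p>A quick way to convince yourself that the two patterns accept the same strings is to test them side by side. This is only a sketch in Python 3 (an assumption; the listing later in this post targets Python 2), the Perl-only <code>(?-xism:...)</code> wrapper is left out, and the probe strings are made up for illustration:</p>

```python
import re

# The prefix-only pattern quoted above (without the Perl wrapper)
# and the shorter hand-written alternative.
prefix_only = re.compile(r"ab(?:bcc|cc)")
hand_tuned = re.compile(r"abb?cc")

# Hypothetical probe strings, chosen only for this sketch.
candidates = ["abbcc", "abcc", "abc", "abbbcc", "acc", "abbc"]

# Strings accepted in full by each pattern.
accepted_prefix = [t for t in candidates if prefix_only.fullmatch(t)]
accepted_tuned = [t for t in candidates if hand_tuned.fullmatch(t)]

print(accepted_prefix)
print(accepted_tuned)
```

<p>Both lists come out the same on these probes: the generated pattern is correct, just not the shortest possible one.</p>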
<p>After a new Trie object is created, each string added to it is handled as follows:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
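<li>The overall data structure is a hash table, that is, a dictionary in Python.</li>
<li>Each letter of the input string is used as a key. If the key already exists, its entry keeps pointing at the existing sub-hash; if it does not, the entry points to a newly created anonymous hash.</li>
<li>The final entry has the empty string '' as its key and 1 as its value. This is also the marker that tells whether a branch has ended.</li>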
</ul>
</blockquote>
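<p>The steps above can be sketched in a few lines. This is a minimal Python 3 illustration rather than the module's actual code; the function name <code>add</code> and the variable names are assumptions made for the example:</p>

```python
def add(trie, word):
    """Insert word into a nested-dict trie, reusing existing branches."""
    node = trie
    for char in word:
        # Reuse the sub-dict for this letter if it exists, else create one.
        node = node.setdefault(char, {})
    # The empty-string key marks the end of a complete word.
    node[""] = 1


trie = {}
add(trie, "foobar")
print(trie)
```

<p><code>dict.setdefault</code> does the "reuse or create" step in one call, which is what lets later words share the branches built by earlier ones.</p>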
<p>Here is the data structure produced for the string <code class="codecolorer text default"><span class="text">foobar</span></code>:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">{<br />
'f' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'b' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'r' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp;}<br />
}</div></td></tr></tbody></table></div>
<p>Neat, isn't it? That is the structure produced for the first string. The remarkable part is what happens with the second, third, ... Nth string: instead of starting over from scratch, each one follows the path already laid down and slots in wherever it fits, making full use of the structure built so far. This is thanks to the nature of the hash/dictionary data structure. Here is the result after inserting all of <code class="codecolorer text default"><span class="text">foobar foobah fooxar foozap fooza</span></code>:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">{<br />
&nbsp; 'f' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'o' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'b' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'h' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; },<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'r' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; },<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'x' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'r' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; },<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'z' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 'a' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'' =&gt; 1,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'p' =&gt; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; '' =&gt; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}<br />
}</div></td></tr></tbody></table></div>
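<p>This structure shows very directly why it is prefixes that get extracted, rather than suffixes, infixes, or anything else.</p>
<p>Building a lookup tree is no longer a problem. Now let's see how to convert it into a regular expression.</p>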
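<p>To make the prefix-only sharing concrete, one can count how many letter nodes the tree actually stores. The sketch below (Python 3; the helper names <code>add</code> and <code>count_nodes</code> are invented for this example) uses the <code>abbcc</code>/<code>abcc</code> pair from earlier: the leading <code>ab</code> is stored once, while the trailing <code>cc</code> is stored twice, once under each branch.</p>

```python
def add(trie, word):
    """Insert word into a nested-dict trie."""
    node = trie
    for char in word:
        node = node.setdefault(char, {})
    node[""] = 1  # end-of-word marker


def count_nodes(node):
    """Count the letter nodes (sub-dicts) stored below this node."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key != "")


trie = {}
for word in ["abbcc", "abcc"]:
    add(trie, word)

stored = count_nodes(trie)                     # letters actually stored
written = len("abbcc") + len("abcc")           # letters in the input words
branches = sorted(trie["a"]["b"])              # where the two words diverge

print(stored, written, branches)
```

<p>The count is 7 rather than 9: the two letters saved are exactly the shared prefix, and nothing is saved on the shared suffix.</p>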
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Rendering the lookup tree as a regex</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The original source is quite terse, uses a lot of Perl-specific syntax, and has no comments. Let me briefly explain the author's approach.</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>If the current node contains the empty key and that is its only key, the branch ends and an empty value is returned. This is the payoff of the earlier setup: it is why <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">$ref</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">''</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span></span></code> was added. </li>
<li>Non-empty nodes are analyzed one by one.</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>As long as the current node is not empty, keep calling this function recursively, and push the current key plus the recursive result for the next node onto the array <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@alt</span></span></code> for later use. </li>
<li>Otherwise, push only the key onto <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@cc</span></span></code>. </li>
</ul>
</blockquote>
<li><code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@cc</span></span></code> holds single letters, while <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@alt</span></span></code> holds multi-letter alternatives. </li>
<li>Format the elements of <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@cc</span></span></code> as <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#91;</span>abc<span style="color: #009900;">&#93;</span></span></code>. Of course, this is unnecessary when there is only one element. </li>
<li>Format the elements of <code class="codecolorer perl default"><span class="perl"><span style="color: #0000ff;">@alt</span></span></code> as <code class="codecolorer perl default"><span class="perl"><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>abc<span style="color: #339933;">|</span>xyz<span style="color: #009900;">&#41;</span></span></code>; this is skipped for a single element.
        </li>
<li>Add a question mark where appropriate, to mark that part as optional.</li>
</ul>
<p>Once you understand the source, implementing it yourself is no problem. While reading this code, I copied it out by hand on paper and watched the output of <code class="codecolorer perl default"><span class="perl"><span style="color: #000066;">print</span></span></code> and <code class="codecolorer perl default"><span class="perl">Data<span style="color: #339933;">::</span><span style="color: #006600;">Dumper</span></span></code> to help me follow it. This proved very effective.</p>
</blockquote>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Porting to Python</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
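<p>Going from Perl to Python is really like going from English to French: you adjust the spelling and the finer points of grammar, and that is all; it is hardly major surgery. I did the port to check my understanding and to get familiar with the syntax details. The code follows. Just <code class="codecolorer python default"><span class="python"><span style="color: #ff7700;font-weight:bold;">from</span> trie <span style="color: #ff7700;font-weight:bold;">import</span> Trie</span></code> and it is ready to use.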
</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;trie.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-08-01 20:24</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">class</span> Trie<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;&quot;Regexp::Trie in python&quot;&quot;&quot;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">data</span><span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> add<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: #66cc66;">,</span> word<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span><span style="color: #008000;">self</span>.<span style="color: black;">data</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> word:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span>ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">and</span> ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span><span style="color: #483d8b;">''</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> dump<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">data</span><br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> _regexp<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: #66cc66;">,</span> pData<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; data<span style="color: #66cc66;">=</span>pData<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> data.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">and</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>data.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">==</span><span style="color: #ff4500;">1</span>: <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">None</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; alt<span style="color: #66cc66;">=</span><span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; cc<span style="color: #66cc66;">=</span><span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; q<span style="color: #66cc66;">=</span><span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>data.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: #66cc66;">,</span><span style="color: #008000;">dict</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; recurse<span style="color: #66cc66;">=</span><span style="color: #008000;">self</span>._regexp<span style="color: black;">&#40;</span>data<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> recurse <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span>: &nbsp;<span style="color: #808080; font-style: italic;">#child holds only the end-of-word marker</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cc.<span style="color: black;">append</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt.<span style="color: black;">append</span><span style="color: black;">&#40;</span>char+recurse<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; q<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; cconly<span style="color: #66cc66;">=</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>alt<span style="color: black;">&#41;</span><span style="color: #66cc66;">==</span><span style="color: #ff4500;">0</span> &nbsp;<span style="color: #808080; font-style: italic;">#true only when there are no multi-letter alternatives</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#41;</span><span style="color: #66cc66;">&gt;</span><span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#41;</span><span style="color: #66cc66;">==</span><span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt.<span style="color: black;">append</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'['</span>+<span style="color: #483d8b;">''</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>cc<span style="color: black;">&#41;</span>+<span style="color: #483d8b;">']'</span><span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>alt<span style="color: black;">&#41;</span><span style="color: #66cc66;">==</span><span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result<span style="color: #66cc66;">=</span>alt<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;(?:&quot;</span>+<span style="color: #483d8b;">&quot;|&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>alt<span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot;)&quot;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> q:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> cconly:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result+<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;?&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; result<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;(?:%s)?&quot;</span> % result<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> result <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> regexp<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;(?-xism:%s)&quot;</span> % <span style="color: #008000;">self</span>._regexp<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">dump</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>: &nbsp;<span style="color: #808080; font-style: italic;">#demo; skipped when the module is imported</span><br />
&nbsp; &nbsp; a<span style="color: #66cc66;">=</span>Trie<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> w <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #483d8b;">'foobar'</span><span style="color: #66cc66;">,</span> <span style="color: #483d8b;">'foobah'</span><span style="color: #66cc66;">,</span> <span style="color: #483d8b;">'fooxar'</span><span style="color: #66cc66;">,</span> <span style="color: #483d8b;">'foozap'</span><span style="color: #66cc66;">,</span> <span style="color: #483d8b;">'fooza'</span><span style="color: black;">&#93;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; a.<span style="color: black;">add</span><span style="color: black;">&#40;</span>w<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> a.<span style="color: black;">regexp</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>Tested on Ubuntu 10.04 with Python 2.6.5.</p>
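<p>For readers on Python 3, the same idea can be sketched as follows. This is a variant written under a few assumptions of my own, not a drop-in replacement for the listing above: it escapes each letter with <code>re.escape</code> (as the Perl original does with <code>quotemeta</code>), it returns the bare pattern without the Perl-style <code>(?-xism:...)</code> wrapper so that the result can be handed straight to <code>re.compile</code>, and the local names (<code>node</code>, <code>tail</code>, <code>optional</code>) are mine.</p>

```python
import re


class Trie:
    """Nested-dict trie that can render itself as a regex (Python 3 sketch)."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        node = self.data
        for char in word:
            node = node.setdefault(char, {})
        node[""] = 1  # end-of-word marker

    def _regexp(self, node):
        # A node holding only the marker ends the branch: nothing to emit.
        if "" in node and len(node) == 1:
            return None
        alt = []          # multi-character alternatives
        cc = []           # single characters, candidates for a [class]
        optional = False  # becomes True if a word may end at this node
        for char in sorted(node):
            child = node[char]
            if isinstance(child, dict):
                tail = self._regexp(child)
                if tail is None:
                    cc.append(re.escape(char))
                else:
                    alt.append(re.escape(char) + tail)
            else:
                optional = True
        cc_only = not alt  # decided before the class is merged into alt
        if cc:
            alt.append(cc[0] if len(cc) == 1 else "[" + "".join(cc) + "]")
        result = alt[0] if len(alt) == 1 else "(?:" + "|".join(alt) + ")"
        if optional:
            result = result + "?" if cc_only else "(?:%s)?" % result
        return result

    def regexp(self):
        return self._regexp(self.data)


words = ["foobar", "foobah", "fooxar", "foozap", "fooza"]
trie = Trie()
for word in words:
    trie.add(word)

pattern = trie.regexp()
compiled = re.compile(pattern)
print(pattern)
print(all(compiled.fullmatch(w) for w in words))
```

<p>Checking the generated pattern against the input words with <code>fullmatch</code>, as in the last line, is a cheap sanity test whenever the word list changes.</p>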
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/trie-in-python.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Notes on Chinese regular expressions in Python</title>
		<link>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html</link>
		<comments>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html#comments</comments>
		<pubDate>Sun, 27 Jun 2010 03:50:41 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[cjk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=129</guid>
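		<description><![CDATA[A summary of experience using regular expressions to match Chinese in Python. Keywords: Chinese, cjk, utf8, unicode, python. From a string-handling point of view, Chinese is less tidy and regular than English; that is an unavoidable fact. This post draws on material from the web and... ]]></description>
			<content:encoded><![CDATA[<p>A summary of experience using regular expressions to match Chinese in Python. Keywords: Chinese, cjk, utf8, unicode, python.</p>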
<p><span id="more-129"></span></p>
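<p>From a string-handling point of view, Chinese is less tidy and regular than English; that is an unavoidable fact. Drawing on material from the web and my own experience, this post gives a brief summary, using Python as the example. Additions and corrections are welcome.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A few tips</h3>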
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>You can use the <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function to see the raw form of a string. This helps when writing regular expressions.
            </li>
<li>Python's <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span></span></code> module has two similar functions: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span>.<span style="color: black;">match</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">re</span>.<span style="color: black;">search</span></span></code>. The two match in exactly the same way and differ only in where they start. <code class="codecolorer python default"><span class="python">match</span></code> tries only at the beginning of the string and gives up if that fails, whereas <code class="codecolorer python default"><span class="python">search</span></code> doggedly works through every possible position in the string until it finds a match or reaches the end without one. If you understand how <code class="codecolorer python default"><span class="python">match</span></code> behaves (it is faster in some cases), feel free to use it; if you are unsure, <code class="codecolorer python default"><span class="python">search</span></code> is usually the function you want.</li>
<li>To find every match in a body of text and get the results back as a list, use <code class="codecolorer python default"><span class="python">findall<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. See the code further down for an example.</li>
<li>In <code class="codecolorer python default"><span class="python">utf8</span></code>, each commonly used Chinese character occupies 3 bytes (3 positions in a byte string), so the regex is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#91;</span>\x80-\xff<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#125;</span></span></code>; you probably knew that already.</li>
<li>In <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, Chinese characters have the form <code class="codecolorer python default"><span class="python">\uXXXX</span></code>. Once you know the range of the relevant character set, you can match the corresponding strings, which makes it easy to pick out text in one language from multilingual text. For an agglutinative language such as Japanese, though, which uses Chinese characters alongside hiragana and katakana, the results may be somewhat off.</li>
<li>Several character ranges can be placed side by side in one class. For example, to cover hiragana, katakana, and Chinese characters together, use <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span></span></code> to define exactly the text you want to match.</li>
<li>When matching Chinese, the regular expression and the target string must be of the same type. This is crucial. Either both use the default <code class="codecolorer python default"><span class="python">utf8</span></code>, in which case you need do nothing extra; or, for <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, the regex needs the <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;&quot;</span></span></code> prefix.</li>
<li>A <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string can be defined like this: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">string</span><span style="color: #66cc66;">=</span>u<span style="color: #483d8b;">&quot;我爱正则表达式&quot;</span></span></code>. If a string is not <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, the <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function converts it. If you know the encoding of the source string, convert it with <code class="codecolorer python default"><span class="python">newstr<span style="color: #66cc66;">=</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>oldstring<span style="color: #66cc66;">,</span> original_coding_name<span style="color: black;">&#41;</span></span></code>; for example, <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span><span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;utf8&quot;</span><span style="color: black;">&#41;</span></span></code> is common on Linux, and on Windows it would probably be <code class="codecolorer python default"><span class="python">cp936</span></code>, though I have not tested that.</li>
</ul>
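<p>The difference between <code>match</code>, <code>search</code> and <code>findall</code> described in the list above can be seen in a few lines. This sketch assumes Python 3, where every <code>str</code> is already Unicode text and the <code>u""</code> prefix is optional; the sample sentence is made up for illustration.</p>

```python
import re

# Hypothetical sample: Chinese runs embedded in ASCII text.
text = "regex 正则 and 表达式"
han = re.compile("[\u4e00-\u9fa5]+")

at_start = han.match(text)            # tries position 0 only
first = han.search(text).group()      # scans forward to the first hit
all_runs = han.findall(text)          # every non-overlapping hit, as a list

print(at_start)
print(first)
print(all_runs)
```

<p>Because the text starts with ASCII letters, <code>match</code> finds nothing, while <code>search</code> and <code>findall</code> still locate the Chinese runs further in.</p>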
</blockquote>
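<p>The rule that pattern and target must be of the same type carries over to Python 3, where the two types are <code>bytes</code> and <code>str</code>. The sketch below is an illustration under that assumption (it is not the Python 2 program that follows): it matches the same text once as UTF-8 bytes with the byte-range pattern and once as text with the code-point range, then shows that mixing the two is rejected.</p>

```python
import re

text = "我爱正则表达式 regex"       # str: Unicode text
raw = text.encode("utf-8")          # bytes: the same text encoded as UTF-8

# Each of these ideographs is one code point but three UTF-8 bytes.
chars, nbytes = len("我"), len("我".encode("utf-8"))

# Byte pattern against bytes, text pattern against text.
byte_hits = re.findall(rb"(?:[\x80-\xff]{3})+", raw)
text_hits = re.findall("[\u4e00-\u9fa5]+", text)
decoded = [hit.decode("utf-8") for hit in byte_hits]

# A text pattern applied to bytes raises TypeError in Python 3.
try:
    re.findall("[\u4e00-\u9fa5]+", raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(chars, nbytes)
print(decoded, text_hits, mixed_ok)
```

<p>The byte-range pattern accepts any run of high bytes whose length is a multiple of three, so it is much looser than the code-point range; decoding first and matching on text is usually the safer choice.</p>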
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Example program</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;py_utf8_unicode.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-06-27 09:11</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> findPart<span style="color: black;">&#40;</span>regex<span style="color: #66cc66;">,</span> text<span style="color: #66cc66;">,</span> name<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; res<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>regex<span style="color: #66cc66;">,</span> text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;There are %d %s parts:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>% <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>res<span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> name<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span>r<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><br />
<br />
<span style="color: #808080; font-style: italic;">#sample is utf8 by default.</span><br />
sample<span style="color: #66cc66;">=</span><span style="color: #483d8b;">'''en: Regular expression is a powerful tool for manipulating text.<br />
zh: 正则表达式是一种很有用的处理文本的工具。<br />
jp: 正規表現は非常に役に立つツールテキストを操作することです。<br />
jp-char: あアいイうウえエおオ<br />
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.<br />
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃<br />
'''</span><br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw utf8 string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#find the non-ascii chars:</span><br />
findPart<span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]+&quot;</span><span style="color: #66cc66;">,</span>sample<span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;non-ascii&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#convert the utf8 to unicode</span><br />
usample<span style="color: #66cc66;">=</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sample<span style="color: #66cc66;">,</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw unicode string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>usample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#get the parts for each language:</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode chinese&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>ac00-<span style="color: #000099; font-weight: bold;">\u</span>d7ff]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode korean&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode japanese katakana&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode japanese hiragana&quot;</span><span style="color: black;">&#41;</span> <br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3000-<span style="color: #000099; font-weight: bold;">\u</span>303f<span style="color: #000099; font-weight: bold;">\u</span>fb00-<span style="color: #000099; font-weight: bold;">\u</span>fffd]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode cjk Punctuation&quot;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>Its output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">the raw utf8 string is:<br />
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'<br />
<br />
There are 14 non-ascii parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具。<br />
&nbsp; &nbsp; 正規表現は非常に役に立つツールテキストを操作することです。<br />
&nbsp; &nbsp; あアいイうウえエおオ<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
&nbsp; &nbsp; 。？！、，；：“<br />
&nbsp; &nbsp; ”‘<br />
&nbsp; &nbsp; ’——……·－·《》〈〉！￥％＆＊＃<br />
<br />
the raw unicode string is:<br />
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'<br />
<br />
There are 6 unicode chinese parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具<br />
&nbsp; &nbsp; 正規表現<br />
&nbsp; &nbsp; 非常<br />
&nbsp; &nbsp; 役<br />
&nbsp; &nbsp; 立<br />
&nbsp; &nbsp; 操作<br />
<br />
There are 8 unicode korean parts:<br />
<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
<br />
There are 6 unicode japanese katakana parts:<br />
<br />
&nbsp; &nbsp; ツールテキスト<br />
&nbsp; &nbsp; ア<br />
&nbsp; &nbsp; イ<br />
&nbsp; &nbsp; ウ<br />
&nbsp; &nbsp; エ<br />
&nbsp; &nbsp; オ<br />
<br />
There are 11 unicode japanese hiragana parts:<br />
<br />
&nbsp; &nbsp; は<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; つ<br />
&nbsp; &nbsp; を<br />
&nbsp; &nbsp; することです<br />
&nbsp; &nbsp; あ<br />
&nbsp; &nbsp; い<br />
&nbsp; &nbsp; う<br />
&nbsp; &nbsp; え<br />
&nbsp; &nbsp; お<br />
<br />
There are 5 unicode cjk Punctuation parts:<br />
<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。？！、，；：<br />
&nbsp; &nbsp; －<br />
&nbsp; &nbsp; 《》〈〉！￥％＆＊＃</div></td></tr></tbody></table></div>
</blockquote>
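<p>The program above targets Python 2. As a hedged sketch of how the same idea might look on Python 3, where every <code>str</code> is already Unicode and no <code>unicode()</code> call is needed, the version below reuses the character ranges from the example program. The names <code>RANGES</code> and <code>find_parts</code>, and the shortened sample text, are my own illustrative choices.</p>

```python
import re

# Unicode ranges taken from the example program above.
RANGES = {
    "chinese": "[\u4e00-\u9fa5]+",
    "korean": "[\uac00-\ud7ff]+",
    "katakana": "[\u30a0-\u30ff]+",
    "hiragana": "[\u3040-\u309f]+",
}

def find_parts(text):
    """Return a dict mapping each script name to the runs of text it matched."""
    return {name: re.findall(pattern, text) for name, pattern in RANGES.items()}

# A shortened sample; Python 3 source files are UTF-8 by default.
sample = "zh: 正则表达式\njp: 正規表現は非常に役に立つ\nkr: 정규 표현식"

parts = find_parts(sample)
for name, found in parts.items():
    print(name, found)
```

<p>Note that the "chinese" range also picks up the kanji in the Japanese line, just as it does in the output above, because both scripts share the same CJK ideograph block.</p>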
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Regex Notes</title>
		<link>http://iregex.org/blog/regex-note-20100621.html</link>
		<comments>http://iregex.org/blog/regex-note-20100621.html#comments</comments>
		<pubDate>Mon, 21 Jun 2010 15:04:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[pos]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=128</guid>
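		<description><![CDATA[Three notes, posted here. Case-insensitive first letters: for a while, when I wrote regular expressions to match Drug keywords, I often ended up with expressions like /viagra&#124;cialis&#124;anti-ed/. To make them tidier, I sorted the keywords; to... ]]></description>
			<content:encoded><![CDATA[<p>Three notes, posted here.</p>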
<p><span id="more-128"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Case-insensitive first letters</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>For a while, when I wrote regular expressions to match <code class="codecolorer text default"><span class="text">Drug</span></code> keywords, I often ended up with expressions like <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">|</span>cialis<span style="color: #339933;">|</span>anti<span style="color: #339933;">-</span>ed<span style="color: #339933;">/</span></span></code>. To make them tidier, I sorted the keywords; to make them faster, I used <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span>Vv<span style="color: #009900;">&#93;</span>iagra<span style="color: #339933;">/</span></span></code> rather than <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">/</span>i</span></code>, so that only the part that needs it is matched case-insensitively. More precisely, I needed case-insensitive matching on the first letter of each word.</p>
<p>I wrote the following function to do this conversion in bulk.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#convert regex to sorted list, then provide both lower/upper case for the first letter of each word</span><br />
<span style="color: #666666; font-style: italic;">#luf means lower upper first</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> luf<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; split the regex with the delimiter |</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@arr</span><span style="color: #339933;">=</span><span style="color: #000066;">sort</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\|/</span><span style="color: #339933;">,</span><span style="color: #000066;">shift</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; provide both the upper and lower case for the &nbsp;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; first letter of each word </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #0000ff;">\b</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>zA<span style="color: #339933;">-</span>Z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\l</span><span style="color: #0000ff;">$1</span><span style="color: #0000ff;">\u</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">/</span>g<span style="color: #339933;">;</span><span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; join the keyword to a regex again</span><br />
&nbsp; &nbsp; <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'|'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000066;">print</span> luf <span style="color: #ff0000;">&quot;sex pill|viagra|cialis|anti-ed&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; the output is:[aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra</span></div></td></tr></tbody></table></div>
</blockquote>
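<p>Since this feed collects the Python posts, here is a rough Python 3 counterpart of <code>luf</code>, offered as a sketch rather than a drop-in replacement. It assumes the keywords are plain ASCII words separated by <code>|</code>, as in the Perl example; the helper name <code>both_cases</code> is hypothetical.</p>

```python
import re

def luf(regex):
    """Sort the |-separated keywords and let each word's first letter match either case."""
    words = sorted(regex.split("|"))

    def both_cases(mo):
        # Turn a word-initial letter such as "v" into the class "[vV]".
        ch = mo.group(1)
        return "[" + ch.lower() + ch.upper() + "]"

    # \b([a-zA-Z]) matches a letter that starts a word, as in the Perl version.
    return "|".join(re.sub(r"\b([a-zA-Z])", both_cases, w) for w in words)

print(luf("sex pill|viagra|cialis|anti-ed"))
```

<p>The callable passed to <code>re.sub</code> stands in for Perl's <code>\l$1\u$1</code> interpolation, which Python replacement strings do not offer.</p>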
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Controlling where the next global match starts</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
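<p>I remember jyf once asking me how to control the position at which matching starts. Well, now I can answer that question. Perl provides the pos function, which lets you adjust where the next match starts during a <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>g</span></code> global match. For example:</p>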
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>The output is each successive pair of letters, namely <code class="codecolorer text default"><span class="text">ab, cd, ef</span></code>.</p>
<p>You can use pos($_) to reposition where the next match starts, for example:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">pos</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">--;</span> &nbsp;<span style="color: #666666; font-style: italic;">#pos($_)++;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Output:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">pos($_)--: &nbsp;ab, bc, cd, de, ef, fg.<br />
pos($_)++: &nbsp;ab, de.</div></td></tr></tbody></table></div>
<p>See the <a href="http://perldoc.perl.org/functions/pos.html" title="我爱正则表达式" target="_blank">pos</a> entry in the Perl documentation for details.</p>
</blockquote>
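<p>Python has no <code>pos</code> function, but a compiled pattern's <code>search</code> method accepts a start position, which can give a similar effect. The sketch below is one possible way to imitate the two Perl loops; the function name <code>pairs</code> and its <code>step</code> parameter are my own inventions for illustration.</p>

```python
import re

def pairs(text, step):
    """Collect two-character matches, shifting the restart position by `step` after each one."""
    pattern = re.compile("..")
    found = []
    pos = 0
    while True:
        mo = pattern.search(text, pos)
        if mo is None:
            break
        found.append(mo.group())
        # mo.end() is where Perl would leave pos($_); step -1 mimics pos($_)--,
        # step +1 mimics pos($_)++. A step of -2 or lower would loop forever.
        pos = mo.end() + step
    return found

print(pairs("abcdefg", 0))
print(pairs("abcdefg", -1))
print(pairs("abcdefg", 1))
```

<p>With a step of -1 each match overlaps the previous one by a character, which is the same overlapping behaviour the <code>pos($_)--</code> loop produces.</p>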
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Hashes and regular expression substitution</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
Chapter 3 of effective-perl-2e has an example (see the code below) that escapes special characters.</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">my %ent = ( '&amp;' =&gt; 'amp', '&lt;' =&gt; 'lt', '&gt;' =&gt; 'gt' );<br />
$html =~ s/([&amp;&lt;&gt;])/&amp;$ent{$1};/g;</div></td></tr></tbody></table></div>
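<p>This example is very clever. It makes flexible use of the hash data structure: the text to be replaced is the key, and the corresponding replacement is the value. Every match is captured, the captured text is used as the key to look up its value, and that value goes into the replacement. It shows how economical a high-level language can be.</p>
<p>But can this Perl code be ported to Python? Python also supports regular expressions and hashes (called dictionaries in Python), but it does not seem to allow anything too fancy inside the replacement, such as interpolating variables into the replacement string.</p>
<p>Consulting the Python documentation (run python in a shell, then import re, then help(re)) turns up this:</p>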
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sub(pattern, repl, string, count=0)<br />
&nbsp; &nbsp; Return the string obtained by replacing the leftmost<br />
&nbsp; &nbsp; non-overlapping occurrences of the pattern in string by the<br />
&nbsp; &nbsp; replacement repl. &nbsp;repl can be either a string or a callable;<br />
&nbsp; &nbsp; if a string, backslash escapes in it are processed. &nbsp;If it is<br />
&nbsp; &nbsp; a callable, it's passed the match object and must return<br />
&nbsp; &nbsp; a replacement string to be used.</div></td></tr></tbody></table></div>
<p>It turns out that Python, like PHP, accepts a callable as the replacement. The function receives a single argument, the match object. That makes the problem simple:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<br />
ent<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: #483d8b;">'&lt;'</span>:<span style="color: #483d8b;">&quot;lt&quot;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&gt;'</span>:<span style="color: #483d8b;">&quot;gt&quot;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&amp;'</span>:<span style="color: #483d8b;">&quot;amp&quot;</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; <span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> rep<span style="color: black;">&#40;</span>mo<span style="color: black;">&#41;</span>:<br />
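&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&amp;&quot;</span> <span style="color: #66cc66;">+</span> ent<span style="color: black;">&#91;</span>mo.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">+</span> <span style="color: #483d8b;">&quot;;&quot;</span><br />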
<br />
html<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;([&amp;&lt;&gt;])&quot;</span><span style="color: #66cc66;">,</span>rep<span style="color: #66cc66;">,</span> html<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
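<p>The key point about a Python replacement callback is that its argument is a match object. Once you understand that, check the manual to see which attributes and methods such objects offer, put them to use, and you can write flexible, efficient Python regex substitution code.</p>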
</blockquote>
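<p>As a possible refinement of the callback approach, the character class can be derived from the dictionary keys so that the pattern and the table cannot drift apart. This is only a sketch: the names <code>ENTITIES</code>, <code>PATTERN</code> and <code>escape_html</code> are hypothetical, and for real HTML escaping the standard library's <code>html.escape</code> is the safer choice.</p>

```python
import re

# Same table as the example above: character to entity name.
ENTITIES = {"&": "amp", "<": "lt", ">": "gt"}

# Build the character class from the keys; re.escape guards any regex-special keys.
PATTERN = re.compile("[" + "".join(re.escape(key) for key in ENTITIES) + "]")

def escape_html(text):
    """Replace each special character with its &name; entity via a dictionary lookup."""
    return PATTERN.sub(lambda mo: "&" + ENTITIES[mo.group(0)] + ";", text)

print(escape_html("a < b && c > d"))
```

<p>Because the callback only sees characters from the original text, the ampersands it inserts are never escaped a second time.</p>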
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-note-20100621.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Generating text from a regular expression: REExtractor</title>
		<link>http://iregex.org/blog/reextractor.html</link>
		<comments>http://iregex.org/blog/reextractor.html#comments</comments>
		<pubDate>Tue, 02 Feb 2010 09:12:35 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[REExtractor]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=76</guid>
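		<description><![CDATA[I came across a simple, fun regular expression application: REExtractor. You enter a regular expression and it outputs text matching that expression. The author describes it as Generate all possibilities of Regular Expression, that is, it generates every string the expression can match... ]]></description>
			<content:encoded><![CDATA[<p>I came across a simple, fun regular expression application: <a id="f-4f" href="http://re2form.appspot.com/" title="我爱正则表达式|由正则式反推文本">REExtractor</a>. You enter a regular expression and it outputs text matching that expression. The author describes it as<br />
Generate all possibilities of Regular Expression, that is, it generates every string the expression can match. That is possible in theory, but in practice there are limits.<span id="more-76"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some limitations</h3>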
<ol>
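<li>The platform is GAE and the language is Python, so it uses Python regular expressions. You may need a proxy to access it.</li>
<li>Supported metacharacters and shorthands: <font class="Apple-style-span" color="#FF00FF">(), [],{m,n},{n},|,\w,\d</font>. To use any of these characters literally, escape it with a backslash. Note that \w here is equivalent to <font class="Apple-style-span" color="#FF00FF">[a-zA-Z0-9]</font>, one of 62 characters, rather than the usual <font class="Apple-style-span" color="#FF00FF">[_a-zA-Z0-9]</font>, one of 63 characters including the underscore. You can use <font class="Apple-style-span" color="#FF00FF">[_\w]</font> instead, which works fine.</li>
<li>Unsupported metacharacters: <font class="Apple-style-span" color="#FF00FF">.(dot),^,$,\b,\D,\W,\1&#8230; (backreferences), (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font>, among others.</li>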
<ul>
<li>A <font class="Apple-style-span" color="#FF00FF">.</font> (dot) is output literally.</li>
<li><font class="Apple-style-span" color="#FF00FF">^, $, \b, \1, (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font> are ignored by the program.</li>
<li><font class="Apple-style-span" color="#FF00FF">\D, \W, or [^]</font> makes the program report an error, because the range is too wide.</li>
</ul>
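<li>Regular expressions with more than 1000 possible results are not supported. For example, <font class="Apple-style-span" color="#FF00FF">\w{2}</font> is rejected because it has 62×62 = 3844 possibilities, but you can use \w\d, because it has only 62×10 = 620.</li>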
</ol>
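<p>The arithmetic behind the 1000-result limit is easy to check by hand, but a small sketch makes it explicit. This is not REExtractor's code, which I have not seen; it simply multiplies the size of each character class, assuming the 62-character <code>\w</code> and the limit described in the list above. All names are illustrative.</p>

```python
import string

WORD = string.ascii_letters + string.digits   # the tool's \w: 62 characters, no underscore
DIGIT = string.digits                         # \d: 10 characters
LIMIT = 1000                                  # the result cap described above

def count(*classes):
    """Multiply the number of choices at each position of the pattern."""
    total = 1
    for choices in classes:
        total *= len(choices)
    return total

# \w{2} versus \w\d
print(count(WORD, WORD), count(WORD, WORD) <= LIMIT)
print(count(WORD, DIGIT), count(WORD, DIGIT) <= LIMIT)
```

<p>Each extra <code>\w</code> multiplies the count by 62, so only very short patterns stay under the limit.</p>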
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">What it can do</h3>
<p><a href="http://iregex.org/blog/REExtractor.html" target="_blank" title="我爱正则表达式|由正则式反推文本"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100202170343.jpg" border="0" alt="我爱正则表达式|由正则式反推文本"></a><br />
Well, despite the many limitations, you can still use it for some fun things. Here are two examples.</p>
<ul>
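<li>Generate some simple email addresses. Try this regular expression: <font class="Apple-style-span" color="#FF00FF">[abc]{3}\d@1(26|63).com</font>, which generates 540 email addresses.</li>
<li>Generate some personal names. Try this regular expression: <font class="Apple-style-span" color="#FF00FF">张[小大勇赞强战海][虎猫龙彪平]</font>. It generates 35 names. Yes, it supports Chinese, and each Chinese character is treated as a single character. If your family is expecting a baby, you can line up some candidate characters, see which combinations look and sound good, and pick one of them.</li>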
</ul>
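<p>For comparison, the two examples above can be reproduced in a few lines of Python with <code>itertools.product</code>. This is a hand-written sketch, not a regex parser and not how REExtractor necessarily works: each pattern has been split into its alternatives by hand, and the function name <code>expand</code> is hypothetical.</p>

```python
from itertools import product

def expand(*parts):
    """Yield every string formed by picking one alternative from each part, in order."""
    for combo in product(*parts):
        yield "".join(combo)

# 张[小大勇赞强战海][虎猫龙彪平]: one fixed character, then 7 choices, then 5 choices.
names = list(expand(["张"], "小大勇赞强战海", "虎猫龙彪平"))

# [abc]{3}\d@1(26|63).com, with the dot taken literally as the tool does.
emails = list(expand("abc", "abc", "abc", "0123456789", ["@1"], ["26", "63"], [".com"]))

print(len(names), names[:3])
print(len(emails), emails[0], emails[-1])
```

<p>A plain string works as a part because iterating over it yields one character at a time, which is also why each Chinese character counts as a single choice.</p>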
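<p>To be fair, these little applications could of course be programmed directly, with fewer restrictions and more flexibility and power. But is it worth firing up a compiler every time? Trying out this little program is fun in its own right. Besides, some of the limitations mentioned in the previous section are quite reasonable. After all, generating text from a regular expression has no use for most zero-width assertions (although backreferences such as <font class="Apple-style-span" color="#FF00FF">\1</font> should be fairly common, and they are not supported). Just treat it as a small toy.</p>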
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/reextractor.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

