<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; utf8</title>
	<atom:link href="http://iregex.org/blog/tag/utf8/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Tue, 29 Mar 2011 05:04:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>		<item>
		<title>A Simple Chinese Word Segmentation Program</title>
		<link>http://iregex.org/blog/simple-nlp-for-chinese.html</link>
		<comments>http://iregex.org/blog/simple-nlp-for-chinese.html#comments</comments>
		<pubDate>Sun, 26 Sep 2010 14:41:12 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=151</guid>
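		<description><![CDATA[kds: "Former ambassador to France Wu Jianmin pointed out that we should 理**国." It took me a moment to realize that the two asterisks stand for 性爱 ("sex"), the middle of 理性爱国 ("rational patriotism"). Life in China in an age of mechanical censorship really does have its comic side. (via) You have probably seen the joke about "理＊＊国" on Twitter too.... ]]></description>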
			<content:encoded><![CDATA[<blockquote  style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
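    kds: "Former ambassador to France Wu Jianmin pointed out that we should 理**国." It took me a moment to realize that the two asterisks stand for 性爱 ("sex"), the middle of 理性爱国 ("rational patriotism"). Life in China in an age of mechanical censorship really does have its comic side. (<a href="https://twitter.com/rightf/status/25555437368" title="我爱正则表达式" target="_blank">via</a>)
</p></blockquote>
<p>You have probably seen the "<a href="https://twitter.com/rex_zhasm/status/25567030862" title="我爱正则表达式" target="_blank">理＊＊国</a>" joke on Twitter too. I happen to want to learn about Chinese word segmentation, and this is the first post on the subject.</p>
<p><span id="more-151"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Segmentation: Principle and Implementation</h3>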
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
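<p>For languages such as English that separate words with whitespace, segmentation is not a problem. Chinese segmentation involves far more detail. On the single issue of "<a href="http://is.gd/ftZNO" title="我爱正则表达式" target="_blank">true ambiguity</a>" (in short, sentences that even a human reader cannot segment with certainty without context), a great deal has already been written. That is an advanced topic, though. I want to start by implementing the simplest possible segmenter.</p>
<p>As I understand it, the simplest segmenter first cuts Chinese text into its smallest units, individual characters, then looks words up in a dictionary and merges the characters into words by the leftmost-longest rule (which happens to match the spirit of regular expressions). This should also be the fastest approach: it only splits and merges according to the given data, without weighing grammatical elements (parts of speech such as noun, verb, adjective, numeral, measure word, and pronoun; syntactic roles such as subject, predicate, object, attributive, adverbial, and complement) or counting occurrences in context.</p>
<p>To split the source text, I follow the approach from <a href="http://iregex.org/blog/words-counter-in-python.html" title="我爱正则表达式" target="_blank">Counting Chinese Characters and English Words</a> and match UTF-8 byte strings with the regular expression <code class="codecolorer python default"><span class="python">r<span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span></span></code>.</p>
<p>For the dictionary I use <a href="http://www.mdbg.net/chindict/chindict.php?page=cedict" title="我爱正则表达式" target="_blank">CC-CEDICT</a>, for three reasons: it has no copyright problems; it is fairly fast; and Chrome uses it too (you may have noticed that double-clicking a Chinese sentence in Chrome selects and highlights a word rather than a single character or the whole line).</p>
<p>Next comes the segmentation itself. After some thought, I found that a search tree (trie) can be used as is; the principle is explained in <a href="http://iregex.org/blog/trie-in-python.html" title="我爱正则表达式" target="_blank">Trie in Python</a>. Concretely: read the word list into memory character by character and build a trie; then scan the target text character by character. If the trie can still be followed after the current character, keep going; otherwise stop and treat what has been collected so far as one word.</p>
<p>In theory this algorithm is fairly fast (I have not benchmarked it), for three reasons: the trie is built from nested hash tables, trading space for time, so each character step is an O(1) lookup on average and matching a word of k characters costs O(k) regardless of dictionary size; the word list is only about 800 KB, so it loads easily and takes little memory; and the slowest part is building the trie, after which speed is no longer affected.</p>
<p>As for extensibility, for now new words can only be added to words.txt by hand; there is no machine learning.</p>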
</blockquote>
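<p>As a minimal sketch of the splitting and trie-building steps described above, here is a Python 3 version that works on text strings rather than UTF-8 bytes. The names <code>tokenize</code>, <code>build_trie</code>, <code>contains</code>, and <code>END</code>, as well as the three-word dictionary, are assumptions made for illustration; they are not taken from the program on GitHub.</p>

```python
import re

# One token is either a run of ASCII word characters/hyphens or a single
# non-ASCII character (for Chinese text, one character per token).
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+|[^\x00-\x7f]")

END = ""  # hypothetical key marking "a dictionary word ends at this node"


def tokenize(text):
    """Cut text into the smallest units used for dictionary lookup."""
    return TOKEN_RE.findall(text)


def build_trie(words):
    """Build a trie of nested dicts, one level per token."""
    root = {}
    for word in words:
        node = root
        for token in tokenize(word):
            node = node.setdefault(token, {})
        node[END] = True
    return root


def contains(trie, word):
    """Return True if word is stored in the trie as a complete entry."""
    node = trie
    for token in tokenize(word):
        if token not in node:
            return False
        node = node[token]
    return END in node


# Hypothetical three-word dictionary, for illustration only.
trie = build_trie(["正则", "表达式", "正则表达式"])
print(tokenize("我爱regex正则"))   # ['我', '爱', 'regex', '正', '则']
print(contains(trie, "正则"))      # True
print(contains(trie, "正则表"))    # False: a prefix, not a stored word
```

<p>Each level of the trie is an ordinary <code>dict</code>, so following one token costs one hash lookup on average. The end-of-word marker is what distinguishes a stored word from a mere prefix of a longer word.</p>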
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Source Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The complete program (including my processed word list) is on <a href="http://github.com/zhasm/simpleNLP" title="我爱正则表达式" target="_blank">GitHub</a>. Feel free to play with it if you are interested. The main program is listed here:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;nlp.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-09-26 19:15</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<br />
regex<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;(?x) (?: [<span style="color: #000099; font-weight: bold;">\w</span>-]+ &nbsp;| [<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3} )&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> init_wordslist<span style="color: black;">&#40;</span>fn<span style="color: #66cc66;">=</span><span style="color: #483d8b;">&quot;./words.txt&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; f<span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span>fn<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; lines<span style="color: #66cc66;">=</span><span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>f.<span style="color: black;">readlines</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> lines<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> words_2_trie<span style="color: black;">&#40;</span>wordslist<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; d<span style="color: #66cc66;">=</span><span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> wordslist: <br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>d<br />
&nbsp; &nbsp; &nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>word<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> chars:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span>ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">and</span> ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">or</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: black;">&#91;</span><span style="color: #483d8b;">''</span><span style="color: black;">&#93;</span><span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> d<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>trie<br />
&nbsp; &nbsp; index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> char <span style="color: #ff7700;font-weight:bold;">in</span> chars:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> ref.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>char<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> char<span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ref<span style="color: #66cc66;">=</span>ref<span style="color: black;">&#91;</span>char<span style="color: black;">&#93;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index+<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> index<span style="color: #66cc66;">==</span><span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index<span style="color: #66cc66;">=</span><span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> char<span style="color: #66cc66;">,</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'*'</span><span style="color: #66cc66;">,</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>chars<span style="color: black;">&#91;</span>index:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">except</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">pass</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">break</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#init</span><br />
&nbsp; &nbsp; words<span style="color: #66cc66;">=</span>init_wordslist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; trie<span style="color: #66cc66;">=</span>words_2_trie<span style="color: black;">&#40;</span>words<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#read content</span><br />
&nbsp; &nbsp; fn<span style="color: #66cc66;">=</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #dc143c;">string</span><span style="color: #66cc66;">=</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span>fn<span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; chars<span style="color: #66cc66;">=</span>regex.<span style="color: black;">findall</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#do the job</span><br />
&nbsp; &nbsp; search_in_trie<span style="color: black;">&#40;</span>chars<span style="color: #66cc66;">,</span> trie<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">if</span> __name__<span style="color: #66cc66;">==</span><span style="color: #483d8b;">'__main__'</span>:<br />
&nbsp; &nbsp; main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</blockquote>
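<p>The processing of the word list itself is not listed here. One possible way to produce such a list is sketched below in Python 3, under the assumption that the input follows the usual CC-CEDICT line layout (traditional form, simplified form, pinyin in square brackets, then glosses between slashes, with comment lines starting with <code>#</code>). The function name <code>simplified_words</code> and the sample lines are hypothetical.</p>

```python
import re

# Assumed CC-CEDICT line layout:
#   Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[")


def simplified_words(lines):
    """Collect the simplified-script headwords from CC-CEDICT lines."""
    words = set()
    for line in lines:
        if line.startswith("#"):      # comment and metadata lines
            continue
        match = ENTRY_RE.match(line)
        if match:
            words.add(match.group(2))
    return sorted(words)


# Sample lines in that layout, typed by hand for illustration.
sample = [
    "# CC-CEDICT",
    "愛國 爱国 [ai4 guo2] /to love one's country/patriotic/",
    "理性 理性 [li3 xing4] /reason/rationality/rational/",
]
print(simplified_words(sample))
```

<p>Writing the returned words one per line would give a file in the shape that <code>init_wordslist</code> reads.</p>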
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Local Test</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The test text is as follows:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">只听得一个女子低低应了一声。绿竹翁道：“姑姑请看，这部琴谱可有些古怪。”那<br />
女子又嗯了一声，琴音响起，调了调弦，停了一会，似是在将断了的琴弦换去，又调了调<br />
弦，便奏了起来。初时所奏和绿竹翁相同，到后来越转越高，那琴韵竟然履险如夷，举重<br />
若轻，毫不费力的便转了上去。令狐冲又惊又喜，依稀记得便是那天晚上所听到曲洋所奏<br />
的琴韵。这一曲时而慷慨激昂，时而温柔雅致，令狐冲虽不明乐理，但觉这位婆婆所奏，<br />
和曲洋所奏的曲调虽同，意趣却大有差别。这婆婆所奏的曲调平和中正，令人听着只觉音<br />
乐之美，却无曲洋所奏热血如沸的激奋。奏了良久，琴韵渐缓，似乎乐音在不住远去，倒<br />
像奏琴之人走出了数十丈之遥，又走到数里之外，细微几不可再闻。<br />
<br />
理性爱国<br />
性爱体验<br />
我爱正则表达式</div></td></tr></tbody></table></div>
<p>Note the last three lines.</p>
<p>Now look at the program's output (* marks the boundary between words):</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">只 * 听 得 * 一 个 * 女 子 * 低 低 * 应 * 了 * 一 声 * 。 * 绿 * 竹 * 翁 * 道 * ： * “ * 姑 姑 * 请 看 * ， * 这 * 部 * 琴 * 谱 * 可 有 * 些 * 古 怪 * 。 * ” * 那 * 女 子 * 又 * 嗯 * 了 * 一 声 * ， * 琴 * 音 响 * 起 * ， * 调 * 了 * 调 * 弦 * ， * 停 * 了 * 一 会 * ， * 似 是 * 在 * 将 * 断 * 了 * 的 * 琴 弦 * 换 * 去 * ， * 又 * 调 * 了 * 调 * 弦 * ， * 便 * 奏 * 了 * 起 来 * 。 * 初 * 时 * 所 * 奏 * 和 * 绿 * 竹 * 翁 * 相 同 * ， * 到 * 后 来 * 越 * 转 * 越 * 高 * ， * 那 * 琴 * 韵 * 竟 然 * 履 险 如 夷 * ， * 举 重 * 若 * 轻 * ， * 毫 不 费 力 * 的 * 便 * 转 * 了 * 上 去 * 。 * 令 狐 * 冲 * 又 * 惊 * 又 * 喜 * ， * 依 稀 * 记 得 * 便 是 * 那 天 * 晚 上 * 所 * 听 到 * 曲 * 洋 * 所 * 奏 * 的 * 琴 * 韵 * 。 * 这 一 * 曲 * 时 而 * 慷 慨 * 激 昂 * ， * 时 而 * 温 柔 * 雅 致 * ， * 令 狐 * 冲 * 虽 * 不 明 * 乐 理 * ， * 但 * 觉 * 这 位 * 婆 婆 * 所 * 奏 * ， * 和 * 曲 * 洋 * 所 * 奏 * 的 * 曲 调 * 虽 * 同 * ， * 意 趣 * 却 * 大 有 * 差 别 * 。 * 这 * 婆 婆 * 所 * 奏 * 的 * 曲 调 * 平 和 * 中 正 * ， * 令 人 * 听 * 着 * 只 * 觉 * 音 乐 之 * 美 * ， * 却 * 无 * 曲 * 洋 * 所 * 奏 * 热 血 * 如 * 沸 * 的 * 激 * 奋 * 。 * 奏 * 了 * 良 久 * ， * 琴 * 韵 * 渐 * 缓 * ， * 似 乎 * 乐 音 * 在 * 不 住 * 远 * 去 * ， * 倒 像 * 奏 * 琴 * 之 * 人 * 走 出 * 了 * 数 十 * 丈 * 之 * 遥 * ， * 又 * 走 * 到 * 数 * 里 * 之 外 * ， * 细 微 * 几 * 不 可 再 * 闻 * 。 * 理 性 * 爱 国 * 性 爱 * 体 验 * 我 * 爱 * 正 则 * 表 达 式</div></td></tr></tbody></table></div>
</blockquote>
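<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Update</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p><strong>Update, 2010-10-03</strong>: I found a bug in this program. The algorithm has been improved and is now more accurate and faster. See GitHub for the program; the link is the same as above. Compare, for instance, 音 乐 之 and 不 可 再 in the first output with 音乐 * 之 and 不可 * 再 in the new one.</p>
<p>Here is the new program's segmentation output:</p>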
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>只 * 听 * 得 * 一 * 个 * 女子 * 低 * 低 * 应 * 了 * 一声 * 。 * 绿 * 竹 * 翁 * 道 * ： * “ * 姑姑 * 请看 * ， * 这 * 部 * 琴谱 * 可 * 有些 * 古怪 * 。 * ” * 那 * 女子 * 又 * 嗯 * 了 * 一声 * ， * 琴 * 音响 * 起 * ， * 调 * 了 * 调 * 弦 * ， * 停 * 了 * 一会 * ， * 似 * 是 * 在 * 将 * 断 * 了 * 的 * 琴弦 * 换 * 去 * ， * 又 * 调 * 了 * 调 * 弦 * ， * 便 * 奏 * 了 * 起来 * 。 * 初 * 时 * 所 * 奏 * 和 * 绿 * 竹 * 翁 * 相同 * ， * 到 * 后来 * 越 * 转 * 越 * 高 * ， * 那 * 琴 * 韵 * 竟然 * 履 * 险 * 如 * 夷 * ， * 举重 * 若 * 轻 * ， * 毫不 * 费力 * 的 * 便 * 转 * 了 * 上去 * 。 * 令狐 * 冲 * 又 * 惊 * 又 * 喜 * ， * 依稀 * 记得 * 便是 * 那天 * 晚上 * 所 * 听到 * 曲 * 洋 * 所 * 奏 * 的 * 琴 * 韵 * 。 * 这 * 一 * 曲 * 时而 * 慷慨 * 激昂 * ， * 时而 * 温柔 * 雅致 * ， * 令狐 * 冲 * 虽 * 不明 * 乐理 * ， * 但 * 觉 * 这位 * 婆婆 * 所 * 奏 * ， * 和 * 曲 * 洋 * 所 * 奏 * 的 * 曲调 * 虽 * 同 * ， * 意趣 * 却 * 大有 * 差别 * 。 * 这 * 婆婆 * 所 * 奏 * 的 * 曲调 * 平和 * 中正 * ， * 令人 * 听 * 着 * 只 * 觉 * 音乐 * 之 * 美 * ， * 却 * 无 * 曲 * 洋 * 所 * 奏 * 热血 * 如 * 沸 * 的 * 激 * 奋 * 。 * 奏 * 了 * 良久 * ， * 琴 * 韵 * 渐 * 缓 * ， * 似乎 * 乐音 * 在 * 不住 * 远 * 去 * ， * 倒像 * 奏 * 琴 * 之 * 人 * 走出 * 了 * 数 * 十 * 丈 * 之 * 遥 * ， * 又 * 走 * 到 * 数 * 里 * 之外 * ， * 细微 * 几 * 不可 * 再 * 闻 * 。 *</p>
<p>理性 * 爱国 *</p>
<p>性爱 * 体验 *</p>
<p>我 * 爱 * 正则 * 表达式 *</p>
<p>轻 * 音乐 *</p>
</blockquote>
</blockquote>
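<p>Comparing the two outputs suggests that the first version cut wherever the walk through the trie stopped, without checking that a complete dictionary word ended there. The Python 3 sketch below illustrates that difference; it is not the code in the GitHub repository. The function names and the mini-dictionary are assumptions made for illustration.</p>

```python
END = ""  # hypothetical key marking the end of a dictionary word


def build_trie(words):
    """Build a trie of nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment_prefix(chars, trie):
    """Follow the trie as far as possible, ignoring word-end markers."""
    out, i = [], 0
    while i != len(chars):
        node, j = trie, i
        while j != len(chars) and chars[j] in node:
            node = node[chars[j]]
            j += 1
        j = max(j, i + 1)          # unknown character: emit it alone
        out.append(chars[i:j])
        i = j
    return out


def segment_longest(chars, trie):
    """Leftmost-longest match: cut at the last position where a word ended."""
    out, i = [], 0
    while i != len(chars):
        node, j, last_end = trie, i, i + 1   # fall back to one character
        while j != len(chars) and chars[j] in node:
            node = node[chars[j]]
            j += 1
            if END in node:
                last_end = j
        out.append(chars[i:last_end])
        i = last_end
    return out


# Hypothetical mini-dictionary; 音乐之声 makes 音乐之 a valid trie prefix.
trie = build_trie(["音乐", "音乐之声", "理性", "爱国", "性爱"])
print(" * ".join(segment_prefix("音乐之美", trie)))    # 音乐之 * 美
print(" * ".join(segment_longest("音乐之美", trie)))   # 音乐 * 之 * 美
print(" * ".join(segment_longest("理性爱国", trie)))   # 理性 * 爱国
```

<p>Both functions loop instead of recursing, so the length of the input is not limited by Python's recursion depth.</p>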
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/simple-nlp-for-chinese.html/feed</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Notes on Chinese Regular Expressions in Python</title>
		<link>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html</link>
		<comments>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html#comments</comments>
		<pubDate>Sun, 27 Jun 2010 03:50:41 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[cjk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=129</guid>
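		<description><![CDATA[A summary of experience matching Chinese with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python. As strings go, Chinese is less tidy and regular than English; that is an unavoidable reality. This post draws on online material and personal... ]]></description>
			<content:encoded><![CDATA[<p>A summary of experience matching Chinese with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python.</p>
<p><span id="more-129"></span></p>
<p>As strings go, Chinese is less tidy and regular than English; that is an unavoidable reality. This post draws on online material and personal experience to give a short summary, using Python as the example language. Additions and corrections are welcome.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A Few Lessons</h3>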
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Use the <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function to see a string's raw form. This helps when writing regular expressions.
            </li>
<li>Python's <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span></span></code> module has two similar functions: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span>.<span style="color: black;">match</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. The two match in exactly the same way and differ only in where they start. <code class="codecolorer python default"><span class="python">match</span></code> tries only at the beginning of the string and gives up if that fails, while <code class="codecolorer python default"><span class="python">search</span></code> doggedly tries every possible position in the string until it finds a match or runs out of string and fails. If you understand what <code class="codecolorer python default"><span class="python">match</span></code> does (it is faster in some cases), feel free to use it; if you are unsure, <code class="codecolorer python default"><span class="python">search</span></code> is usually the function you want.</li>
<li>To find every possible match in a body of text and get the results back as a list, use the <code class="codecolorer python default"><span class="python">findall<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function. See the code later in this post for an example.</li>
<li>In <code class="codecolorer python default"><span class="python">utf8</span></code>, each common Chinese character takes 3 bytes, so the regex is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#91;</span>\x80-\xff<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#125;</span></span></code>, as you probably already know. (Characters outside the Basic Multilingual Plane take 4 bytes and are not covered by this pattern.)</li>
<li>In <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, a Chinese character is written as <code class="codecolorer python default"><span class="python">\uXXXX</span></code>. Once you know the range of the relevant character set, you can match the corresponding text, which makes it easy to pick one language out of multilingual text. However, for an agglutinative language such as Japanese, whose writing mixes Chinese characters with hiragana and katakana, the results may be somewhat off.</li>
<li>Several character ranges can be placed side by side in one character class, for example hiragana, katakana, and Chinese together, <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span></span></code>, to define exactly the text you want to match.</li>
<li>When matching Chinese, the regular expression and the target string must be of the same type. This is crucial. Either both are the default <code class="codecolorer python default"><span class="python">utf8</span></code> byte strings, in which case you need not do anything extra; or both are <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, in which case the regex needs the <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;&quot;</span></span></code> prefix.</li>
<li>You can define a <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string like this: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">string</span><span style="color: #66cc66;">=</span>u<span style="color: #483d8b;">&quot;我爱正则表达式&quot;</span></span></code>. If a string is not <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, you can convert it with the <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function. If you know the encoding of the source string, convert it with <code class="codecolorer python default"><span class="python">newstr<span style="color: #66cc66;">=</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>oldstring<span style="color: #66cc66;">,</span> original_coding_name<span style="color: black;">&#41;</span></span></code>; for example, on Linux it is usually <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span><span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;utf8&quot;</span><span style="color: black;">&#41;</span></span></code>, and on Windows it would probably be <code class="codecolorer python default"><span class="python">cp936</span></code>, though I have not tested that.</li>
</ul>
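<p>The difference between <code>match</code>, <code>search</code>, and <code>findall</code> described in the list above can be seen in a few lines. This is a minimal sketch in Python 3, where ordinary strings are already Unicode; the sample text is made up for illustration.</p>

```python
import re

text = "regex: 正则表达式 and 中文"
pattern = re.compile("[\u4e00-\u9fa5]+")   # a run of common Chinese characters

print(pattern.match(text))             # None: the text does not start with Chinese
print(pattern.search(text).group())    # 正则表达式: the first match anywhere
print(pattern.findall(text))           # ['正则表达式', '中文']: every match
```

<p>Because <code>match</code> and <code>search</code> return <code>None</code> on failure, check the result before calling <code>group()</code> on it when a match is not guaranteed.</p>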
</blockquote>
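<p>The rule that the pattern and the target string must be of the same type carries over to Python 3, where the two types are <code>bytes</code> and <code>str</code>. The sketch below is written for Python 3, not the Python 2 used elsewhere in this post. It shows the 3-byte UTF-8 pattern applied to bytes, the code point range applied to text, and what happens when the two are mixed.</p>

```python
import re

text = "我爱正则表达式 regex"
data = text.encode("utf-8")            # the same content as UTF-8 bytes

# Each of these common Chinese characters encodes to 3 bytes in UTF-8.
print(len("我"), len("我".encode("utf-8")))          # 1 3

# Byte pattern against bytes: one match per 3-byte character.
byte_chars = re.findall(rb"[\x80-\xff]{3}", data)
print([b.decode("utf-8") for b in byte_chars])

# Text pattern against str: match by code point range instead.
print(re.findall("[\u4e00-\u9fa5]", text))

# Mixing the two kinds raises TypeError in Python 3.
try:
    re.findall("[\u4e00-\u9fa5]", data)
except TypeError as exc:
    print("TypeError:", exc)
```

<p>The byte pattern only lines up with character boundaries when every non-ASCII character in the data is 3 bytes long; the code point range does not depend on the encoding at all.</p>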
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Example Program</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;py_utf8_unicode.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-06-27 09:11</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> findPart<span style="color: black;">&#40;</span>regex<span style="color: #66cc66;">,</span> text<span style="color: #66cc66;">,</span> name<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; res<span style="color: #66cc66;">=</span><span style="color: #dc143c;">re</span>.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>regex<span style="color: #66cc66;">,</span> text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;There are %d %s parts:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>% <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>res<span style="color: black;">&#41;</span><span style="color: #66cc66;">,</span> name<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #66cc66;">,</span>r<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><br />
<br />
<span style="color: #808080; font-style: italic;">#sample is utf8 by default.</span><br />
sample<span style="color: #66cc66;">=</span><span style="color: #483d8b;">'''en: Regular expression is a powerful tool for manipulating text.<br />
zh: 正则表达式是一种很有用的处理文本的工具。<br />
jp: 正規表現は非常に役に立つツールテキストを操作することです。<br />
jp-char: あアいイうウえエおオ<br />
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.<br />
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃<br />
'''</span><br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw utf8 string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#find the non-ascii chars:</span><br />
findPart<span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]+&quot;</span><span style="color: #66cc66;">,</span>sample<span style="color: #66cc66;">,</span><span style="color: #483d8b;">&quot;non-ascii&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#convert the utf8 to unicode</span><br />
usample<span style="color: #66cc66;">=</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sample<span style="color: #66cc66;">,</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw unicode string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">,</span> <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>usample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#get the parts in each language:</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode chinese&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>ac00-<span style="color: #000099; font-weight: bold;">\u</span>d7ff]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode korean&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode japanese katakana&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode japanese hiragana&quot;</span><span style="color: black;">&#41;</span> <br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3000-<span style="color: #000099; font-weight: bold;">\u</span>303f<span style="color: #000099; font-weight: bold;">\u</span>fb00-<span style="color: #000099; font-weight: bold;">\u</span>fffd]+&quot;</span><span style="color: #66cc66;">,</span> usample<span style="color: #66cc66;">,</span> <span style="color: #483d8b;">&quot;unicode cjk Punctuation&quot;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>The output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">the raw utf8 string is:<br />
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'<br />
<br />
There are 14 non-ascii parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具。<br />
&nbsp; &nbsp; 正規表現は非常に役に立つツールテキストを操作することです。<br />
&nbsp; &nbsp; あアいイうウえエおオ<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
&nbsp; &nbsp; 。？！、，；：“<br />
&nbsp; &nbsp; ”‘<br />
&nbsp; &nbsp; ’——……·－·《》〈〉！￥％＆＊＃<br />
<br />
the raw unicode string is:<br />
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'<br />
<br />
There are 6 unicode chinese parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具<br />
&nbsp; &nbsp; 正規表現<br />
&nbsp; &nbsp; 非常<br />
&nbsp; &nbsp; 役<br />
&nbsp; &nbsp; 立<br />
&nbsp; &nbsp; 操作<br />
<br />
There are 8 unicode korean parts:<br />
<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
<br />
There are 6 unicode japanese katakana parts:<br />
<br />
&nbsp; &nbsp; ツールテキスト<br />
&nbsp; &nbsp; ア<br />
&nbsp; &nbsp; イ<br />
&nbsp; &nbsp; ウ<br />
&nbsp; &nbsp; エ<br />
&nbsp; &nbsp; オ<br />
<br />
There are 11 unicode japanese hiragana parts:<br />
<br />
&nbsp; &nbsp; は<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; つ<br />
&nbsp; &nbsp; を<br />
&nbsp; &nbsp; することです<br />
&nbsp; &nbsp; あ<br />
&nbsp; &nbsp; い<br />
&nbsp; &nbsp; う<br />
&nbsp; &nbsp; え<br />
&nbsp; &nbsp; お<br />
<br />
There are 5 unicode cjk Punctuation parts:<br />
<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。？！、，；：<br />
&nbsp; &nbsp; －<br />
&nbsp; &nbsp; 《》〈〉！￥％＆＊＃</div></td></tr></tbody></table></div>
</blockquote>
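<p>For readers on Python 3, where every <code>str</code> is already Unicode and no <code>unicode()</code> conversion is needed, the per-script extraction above can be sketched as follows. This is a minimal sketch rather than the <code>findPart</code> function used in the post: the helper name <code>find_parts</code>, the <code>SCRIPT_RANGES</code> table, and the short sample string are assumptions made for illustration; only the character ranges are taken from the calls above.</p>

```python
import re

# Character ranges copied from the findPart calls above.
SCRIPT_RANGES = {
    "chinese": "[\u4e00-\u9fa5]+",
    "korean": "[\uac00-\ud7ff]+",
    "katakana": "[\u30a0-\u30ff]+",
    "hiragana": "[\u3040-\u309f]+",
}


def find_parts(text):
    """Return {script name: list of matched runs} for the ranges above."""
    return {name: re.findall(pattern, text) for name, pattern in SCRIPT_RANGES.items()}


# Hypothetical sample: one short line per script.
sample = "zh: 正则表达式\njp: あアいイ\nkr: 정규 표현식"
parts = find_parts(sample)
for name, runs in parts.items():
    print(name, len(runs), runs)
```

<p>Because hiragana and katakana sit in separate ranges, interleaved kana such as <tt class="string">あアいイ</tt> come back as single-character runs, which is the same effect visible in the output above.</p>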
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>A WordPress UTF-8 Chinese Character Count Plugin</title>
		<link>http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html</link>
		<comments>http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html#comments</comments>
		<pubDate>Fri, 02 Jan 2009 14:37:55 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[utf8]]></category>
		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=50</guid>
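		<description><![CDATA[I recently wanted my blog to show something like “This post has XXX characters, continue reading&#8230;”. I found the Word Count Plugin for WordPress by Murray Williams online, but it only counts English words and cannot count Chinese characters. I downloaded... ]]></description>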
			<content:encoded><![CDATA[<p><a target="_blank" href="http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html"><img src="http://i293.photobucket.com/albums/mm60/zhasm/wordpressutf8-1.png" /></a> </p>
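<p>I recently wanted my blog to show something like “This post has XXX characters, continue reading&#8230;”. I found the <a target="_blank" href="http://www.murraywilliams.com/software/word-count-plugin-for-wordpress/" rel="nofollow">Word Count Plugin for WordPress</a> by <a target="_blank" href="http://www.murraywilliams.com/">Murray Williams</a> online, but it only counts English words and cannot count Chinese characters. I downloaded the source and modified it myself to get the feature I wanted. The changes involve using <a target="_blank" href="http://iregex.org">regular expressions</a> in PHP to match Chinese, so I am writing the process down here. </p>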
<p><span id="more-50"></span><br />
The core of the plugin looks like this:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">function</span> mtw_wordcount<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">global</span> <span style="color: #000088;">$page</span><span style="color: #339933;">,</span> <span style="color: #000088;">$pages</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">function_exists</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'str_word_count'</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">str_word_count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #339933;">,</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
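<p>So the plugin does not count words itself; it relies on PHP's str_word_count. The official documentation is at <a title="http://us2.php.net/str_word_count" href="http://us2.php.net/str_word_count">http://us2.php.net/str_word_count</a>. The function counts the English words in a string and, depending on the requested format, returns either the number of words or an array of the words. It treats runs of letters, plus any additional characters you specify, as words and counts them one by one. Because it does not handle UTF-8 text, a UTF-8 version should be used instead: <a target="_blank" href="http://us2.php.net/manual/en/function.str-word-count.php#85592">str_word_count_utf8</a>.</p>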
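<p>English words are separated by spaces and other delimiters, but how should Chinese be counted? Since WordPress stores Chinese text as UTF-8, we start from UTF-8. As explained in my earlier post <a target="_blank" href="http://iregex.org/blog/regex-to-match-chinese.html">Regular Expressions That Match Chinese</a>, commonly used Chinese characters each occupy three bytes in UTF-8, so a single character can be matched at the byte level with [\x80-\xff]{3}. This pattern matches any three consecutive non-ASCII bytes, so it is only reliable when all of the non-ASCII text consists of three-byte characters. If each Chinese character is then treated as one English word, the resulting word count should be correct. For example:</p>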
<p><tt class="string">你好，世界。Hello world.</tt></p>
<p>My approach is to replace every Chinese character with the English letter a surrounded by spaces, using this regular expression:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/[\x80-\xff]{3}/'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">' a '</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>Since Chinese punctuation should not be counted, one more substitution has to run before it:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/～|！|｀|·|＃|￥|％|…|—|（|）|＋|－|＝|｛|｝|［|］|\＼|｜|“|”|’|‘|；|：|《|》|〈|〉|、|？|。|，/&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">' '</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>Running the program now gives the correct result: 6 (four Chinese characters plus two English words).</p>
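<p>The two substitutions can also be tried outside WordPress. The following is a minimal Python 3 sketch of the same idea, not the plugin's PHP code: it works on the UTF-8 bytes so that <tt class="regex">[\x80-\xff]{3}</tt> keeps its byte-level meaning, and it finishes with a plain whitespace split instead of a word-matching pattern. The names <code>count_words</code>, <code>PUNCT_RE</code>, and <code>THREE_BYTES_RE</code> are assumptions made for illustration.</p>

```python
import re

# Full-width punctuation taken from the substitution above.
PUNCTUATION = "～！｀·＃￥％…—（）＋－＝｛｝［］＼｜“”’‘；：《》〈〉、？。，"

# Work on UTF-8 bytes so [\x80-\xff]{3} means the same thing as in PHP.
PUNCT_RE = re.compile(b"|".join(re.escape(ch.encode("utf-8")) for ch in PUNCTUATION))
THREE_BYTES_RE = re.compile(rb"[\x80-\xff]{3}")


def count_words(text):
    """Count ASCII words plus one 'word' per three-byte UTF-8 character."""
    data = text.encode("utf-8")
    data = PUNCT_RE.sub(b" ", data)          # punctuation does not count
    data = THREE_BYTES_RE.sub(b" a ", data)  # each three-byte character becomes a word
    return len(data.split())


print(count_words("你好，世界。Hello world."))
```

<p>The order matters: if the punctuation were not removed first, each three-byte punctuation mark would also be turned into a word by the second substitution.</p>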
<p>The complete PHP program for the WordPress UTF-8 Chinese character count plugin is:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span><br />
load_plugin_textdomain<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'mtw-wordcount'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'wp-content/plugins/mtw-wordcount'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #666666; font-style: italic;">/*&nbsp; Call this from inside &quot;The Loop&quot; (see WordPress documentation) to get<br />
&nbsp; &nbsp; a WordCount of the current posting. Function RETURNS the Word Count as<br />
&nbsp; &nbsp; a value, it does not automatically display (ie. 'echo') the value.<br />
&nbsp;*/</span><br />
<br />
&nbsp; &nbsp; <span style="color: #990000;">define</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;WORD_COUNT_MASK&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">function</span> str_word_count_utf8<span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$format</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/～|！|｀|·|＃|￥|％|…|—|（|）|＋|－|＝|｛|｝|［|］|\＼|｜|“|”|’|‘|；|：|《|》|〈|〉|、|？|。|，/&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">' '</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/[\x80-\xff]{3}/'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">' a '</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">switch</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$format</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">case</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span>WORD_COUNT_MASK<span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">case</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span>WORD_COUNT_MASK<span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #339933;">,</span> PREG_OFFSET_CAPTURE<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$match</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$result</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$match</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$match</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #000088;">$result</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span>WORD_COUNT_MASK<span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<br />
<br />
<span style="color: #000000; font-weight: bold;">function</span> mtw_wordcount<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">global</span> <span style="color: #000088;">$page</span><span style="color: #339933;">,</span> <span style="color: #000088;">$pages</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">function_exists</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'str_word_count_utf8'</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> str_word_count_utf8<span style="color: #009900;">&#40;</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #339933;">,</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">/*&nbsp; Auxiliary method. In case you want to use this somewhere outside the<br />
&nbsp; &nbsp; loop, where the $page and $pages globals don't necessarily work. Pass<br />
&nbsp; &nbsp; the string you want counted instead.<br />
*/</span><br />
<span style="color: #000000; font-weight: bold;">function</span> mtw_string_wordcount<span style="color: #009900;">&#40;</span><span style="color: #000088;">$instring</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">function_exists</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'str_word_count_utf8'</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> str_word_count_utf8<span style="color: #009900;">&#40;</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$instring</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #339933;">,</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$instring</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
<p>With these changes, the plugin counts Chinese characters correctly in WordPress. This is how I use it:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span> the_content<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;本文共计&quot;</span><span style="color: #339933;">.</span>mtw_wordcount<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">'字，已浏览'</span><span style="color: #339933;">.</span>the_views<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #339933;">,</span><span style="color: #009900; font-weight: bold;">false</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">'次。　继续阅读... »'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
<p>The the_views(&quot;&quot;,false) part comes from another plugin, <a href="http://fantasyworld.idv.tw/programs/wp_postviews_plus/">WP-PostViews Plus</a>, which I will not cover here.</p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Exploring Regular Expressions That Match Chinese</title>
		<link>http://iregex.org/blog/exploration-on-regular-rexpressions-that-match-chinese.html</link>
		<comments>http://iregex.org/blog/exploration-on-regular-rexpressions-that-match-chinese.html#comments</comments>
		<pubDate>Sat, 23 Aug 2008 16:22:29 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=31</guid>
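		<description><![CDATA[Note: this post uses RegexBuddy 3.1.0 (full version), not the latest version 3.1.1 (as of 2008.08.23). If you need that version, please leave a comment on this post. Also: following the style of www.regular-expressions.info, I updated this theme's style.css and added... ]]></description>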
			<content:encoded><![CDATA[<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
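<p>Note: this post uses RegexBuddy 3.1.0 (full version), not the latest version 3.1.1 (as of 2008.08.23). If you need that version, please leave a comment on <a href="http://iregex.org/blog/regexbuddy.html" target="_blank"><font color="#ff008c">this post</font></a>.</p>
<p>Also: following the style of <a href="http://www.regular-expressions.info" target="_blank">www.regular-expressions.info</a>, I updated this theme's style.css and added formatting for regex-related code: </p>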
<ul>
<li><strong>Regex</strong> format example: <tt class="regex">[a-z]+@[a-z]+?\.[a-z]+</tt> </li>
<li><strong>Match</strong> format example: <tt class="match">pig@animals.com</tt> and <tt class="match">chicken@birds.com</tt> </li>
<li><strong>Plain text</strong> format example: <tt class="string">这是一些普通文本。hello regex world. pig@animals.com和chicken@birds.com</tt> </li>
</ul>
<p><span id="more-31"></span></p>
<p>These can be used like this: applying the regex <tt class="regex">[a-z]+@[a-z]+?\.[a-z]+</tt> to the string <tt class="string">这是一些普通文本。hello regex world. pig@animals.com和chicken@birds.com</tt> yields the matches <tt class="match">pig@animals.com</tt> and <tt class="match">chicken@birds.com</tt>. </p>
</blockquote>
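<p>The match described in the note above can be checked with a few lines of Python 3. This is only an illustrative sketch using the standard <code>re</code> module; the post itself works in RegexBuddy, and the variable names are made up for the example.</p>

```python
import re

# Sample string and pattern quoted from the note above.
text = "这是一些普通文本。hello regex world. pig@animals.com和chicken@birds.com"
pattern = re.compile(r"[a-z]+@[a-z]+?\.[a-z]+")

# findall returns every non-overlapping match, left to right.
matches = pattern.findall(text)
print(matches)
```

<p>The Chinese character <tt class="string">和</tt> between the two addresses is not in <tt class="regex">[a-z]</tt>, so it ends the first match cleanly.</p>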
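<p><strong>The extremely coarse approach</strong>: the dot is close to universal. It matches any character, the only restriction being line breaks, so matching Chinese is no problem at all. Used as optional filler, a single <tt class="regex">.*</tt> swallows every character, Chinese included. This is an extreme case, though, because it says nothing about what makes a Chinese string distinctive, and it is not what this post explores.</p>
<p><strong>The extremely narrow approach</strong>: to search for a specific piece of text, for example <tt class="regex">十拾</tt> in <tt class="string">一二三四五六七八九十拾佰百千仟万亿</tt>, m/<tt class="regex">十拾</tt>/ does the job directly. This is not what this post explores either. Just as <tt class="regex">\w</tt> matches English letters (along with digits and the underscore), what I am after is a shorthand that matches all Chinese characters and nothing else. </p>
<p><strong>The general approach</strong>: Chinese characters are part of Unicode, so that is where we look. <a href="http://unicode.org/reports/tr18/" target="_blank">Unicode Regular Expressions</a> lists many ways of expressing Unicode in regular expressions. Searching it for “chinese” turns up this row:</p>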
<table width="400" border="1" cellpadding="2" cellspacing="1" unselectable="on">
<tbody>
<tr>
<td  valign="top" width="200">Writing Systems</td>
<td  valign="top" width="200">Blocks</td>
</tr>
<tr>
<td  valign="top" width="200">&#8230;</td>
<td  valign="top" width="200">&#8230;</td>
</tr>
<tr>
<td  valign="top" width="200">Chinese</td>
<td  valign="top" width="200">CJK Unified Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility Ideographs, CJK Compatibility Forms, Enclosed CJK Letters and Months, Small Form Variants, Bopomofo, Bopomofo Extended</td>
</tr>
</tbody>
</table>
<p>CJK stands for Chinese, Japanese, and Korean; the CJK Unified Ideographs are the ideographic characters shared by those writing systems. See the <a href="http://baike.baidu.com/view/628156.html" target="_blank">Baidu Baike entry</a> or the <a href="http://en.wikipedia.org/wiki/CJK" target="_blank">Wikipedia</a> article.</p>
<p>I then checked <a href="http://www.regular-expressions.info/" target="_blank">regular expressions</a>, whose <a href="http://www.regular-expressions.info/unicode.html" target="_blank">unicode</a> section contains the following:
</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p><tt class="regex">\p{InCJK_Unified_Ideographs}</tt>: U+4E00..U+9FFF </p></blockquote>
<p>
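This reminded me of an earlier post, <a href="http://iregex.org/blog/regular-expressions-to-match-chinese-username-in-asp.html" target="_blank">"An ASP regular expression for matching usernames (including Chinese)"</a>, which matched Chinese with <tt class="regex">[\u4e00-\u9fa5]</tt>. So there is a corresponding shorthand after all, although the two differ in the last few code points: the block runs to U+9FFF, while the range stops at U+9FA5. The attached image shows what U+9FA5, the last character in that range, looks like.<img src="http://i3.6.cn/cvbnm/80/3c/69/ac41d1186fde1c67bf7cef334bc6a0c7.jpg" style="border: 1px solid rgb(255, 255, 255); margin: 0px 10px 10px; clear: both; padding-left: 0px; " alt="我爱正则表达式 | How to match Chinese characters with regular expressions in RegexBuddy | http://iregex.org" /> The first code point in the sequence, U+4E00, is the character <tt class="string">一</tt>.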
</p>
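<p>To make the difference between the two ranges concrete, here is a small sketch in Python. Python's standard <tt>re</tt> module does not support the <tt>\p{...}</tt> notation, so the block is written out as an explicit range; the names <tt>HAN_OLD</tt> and <tt>HAN_BLOCK</tt> and the sample text are hypothetical.</p>

```python
import re

# The older pattern stops at U+9FA5; the CJK Unified Ideographs
# block quoted above runs from U+4E00 to U+9FFF.
HAN_OLD = re.compile("[\u4e00-\u9fa5]+")
HAN_BLOCK = re.compile("[\u4e00-\u9fff]+")

text = "hello 正则表达式 world 一二三"

old_matches = HAN_OLD.findall(text)
block_matches = HAN_BLOCK.findall(text)
print(old_matches)
print(block_matches)

# The first code point of the block is the character 一.
first_code_point = hex(ord("一"))
print(first_code_point)
```

<p>For everyday text the two patterns give the same result; they only diverge for characters between U+9FA6 and U+9FFF.</p>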
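<p><strong>Custom classes</strong>: so far we have found the official identity and notation for Chinese characters: <tt class="regex">\p{InCJK_Unified_Ideographs}</tt> matches any character in the CJK Unified Ideographs block. We can also wrap up things that occur repeatedly and keep them ready for reuse. For Arabic digits, for example, we have <tt class="regex">\d</tt>. Is there something similar for Chinese numerals such as 一二三四?</p>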
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$zh_digit</span><span style="color: #339933;">=</span><span style="color: #009966; font-style: italic;">qr/一|二|三|四|五|六|七|八|九|十|零|〇|百|千|万|亿|佰|仟|壹|贰|叁|肆|伍|陆|柒|捌|玖|拾/</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #0000ff;">$str</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;人民币五十一万零三百元整。大写：伍拾壹万零三佰元整。&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$str</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #0000ff;">$zh_digit</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">//</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$1</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</p>
<p>
<img src="http://i3.6.cn/cvbnm/6f/3d/c2/f974a15dbf6a2ceed6c6744961f39b27.jpg" style="border: 1px solid rgb(255, 255, 255); margin: 0px 10px 10px; clear: both; padding-left: 0px; " alt="我爱正则表达式 | How to match Chinese characters with regular expressions in RegexBuddy | http://iregex.org" />
</p>
<p>The output is shown in the attached image.</p>
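<p>The same idea can be sketched in Python. This is not the Perl code above, only a rough equivalent under one assumption: the alternation of single characters is rewritten as a character class. The names <tt>ZH_DIGITS</tt> and <tt>zh_number</tt> are hypothetical.</p>

```python
import re

# Same numeral characters as the Perl alternation, written as one character class.
ZH_DIGITS = "一二三四五六七八九十零〇百千万亿佰仟壹贰叁肆伍陆柒捌玖拾"
zh_number = re.compile("[" + ZH_DIGITS + "]+")

text = "人民币五十一万零三百元整。大写：伍拾壹万零三佰元整。"

# Collect each run of consecutive Chinese numeral characters.
numbers = zh_number.findall(text)
for number in numbers:
    print(number)
```

<p>Unlike the Perl loop, which removes each match from the string with <tt>s///</tt>, <tt>findall</tt> leaves the source string untouched. It finds the two runs <tt class="match">五十一万零三百</tt> and <tt class="match">伍拾壹万零三佰</tt>.</p>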
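<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Conclusion</h3>
<p>In engines that support it, <tt class="regex">\p{InCJK_Unified_Ideographs}</tt> matches any Chinese character in that block. Where this notation is not supported, <tt class="regex">[\u4e00-\u9fa5]</tt> can be used instead.</p>
<p>I don't think I have gotten to the bottom of regular expressions for Chinese yet. Earlier, on Linux (UTF-8 encoding), when I was writing the Zhengma code table for the scim input platform, I matched Chinese with <tt class="regex">[\x80-\xff]{3}</tt>, and that worked well too. See this post: <a href="http://zhasm.com/blog/longwen-zhengma-ime-table-in-scim-format.html" target="_blank" title="我爱正则表达式 | How to match Chinese characters with regular expressions in RegexBuddy | http://iregex.org">Longwen Zhengma code table for scim</a>. I don't yet understand why it works and will look into it when I have time. If you know, please enlighten me; thanks in advance.</p>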
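<p>A likely explanation lies in how UTF-8 encodes code points from U+0800 to U+FFFF: each becomes three bytes, and every one of those bytes has its high bit set, so it falls in the range 0x80 to 0xFF. The sketch below checks this for a single character; it assumes the regex is applied to raw bytes rather than to decoded text, and the variable names are hypothetical.</p>

```python
import re

ch = "一"  # U+4E00, the first character of the CJK Unified Ideographs block
encoded = ch.encode("utf-8")

# Three bytes, each in the range 0x80..0xFF.
byte_count = len(encoded)
byte_values = [hex(b) for b in encoded]
print(byte_count)
print(byte_values)

# A byte-level pattern therefore matches the whole encoded character.
byte_pattern = re.compile(rb"[\x80-\xff]{3}")
is_match = byte_pattern.fullmatch(encoded) is not None
print(is_match)
```

<p>The pattern is broader than "Chinese", though: any three consecutive non-ASCII bytes satisfy it, whatever characters they encode.</p>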
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/exploration-on-regular-rexpressions-that-match-chinese.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Regular Expressions for Matching Chinese</title>
		<link>http://iregex.org/blog/regex-to-match-chinese.html</link>
		<comments>http://iregex.org/blog/regex-to-match-chinese.html#comments</comments>
		<pubDate>Mon, 02 Jun 2008 06:23:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=14</guid>
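		<description><![CDATA[When I was writing the Zhengma code table for scim on Linux, I ran into the problem of matching Chinese with regular expressions. The lesson I took away was that, under UTF-8 encoding, a regex for Chinese should be written like this: &#91;\x80-\xff&#93;&#123;3&#125; Of course, this does not depend on the... ]]></description>
			<content:encoded><![CDATA[<p>When I was writing the Zhengma code table for scim on Linux, I ran into the problem of matching Chinese with regular expressions. The lesson I took away was that, under UTF-8 encoding, a regex for Chinese should be written like this:</p>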
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\x80</span><span style="color: #339933;">-</span><span style="color: #0000ff;">\xff</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
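<p>Of course, this does not depend on the language: it works the same way in Perl and in Python.</p>
<p>Now this regex has come in handy again. A small program I am writing, <a href="http://code.google.com/p/fanfoufans/wiki/MiniBlogsUpdater" target="_blank" title="Type once, update five places! Updates your microblogs on Twitter, Hainei, Jiwai, Zuosa, and Fanfou at the same time.">MiniBlogs Updater</a>, needs to count the number of characters the user has typed. Chinese and English characters are encoded with different lengths, so calling Python's len() directly on the UTF-8 string returns its length in bytes, where one Chinese character does not count the same as one English letter. Each Chinese character therefore needs to be treated as if it were a single English letter.</p>
<p>I wrote this statement to handle it:</p>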
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">length<span style="color: #66cc66;">=</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3}'</span><span style="color: #66cc66;">,</span><span style="color: #483d8b;">'a'</span><span style="color: #66cc66;">,</span>msg<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
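<p>It replaces every Chinese character with the English letter a and then counts the characters. (It only counts; the source string is not modified.) This statement works correctly in a UTF-8 encoded file on Windows.</p>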
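<p>The statement above relies on <tt>msg</tt> being a UTF-8 byte string. A rough Python 3 sketch of the same technique, assuming the input is a <tt>str</tt> that is first encoded to bytes, could look like this; <tt>count_chars</tt> is a hypothetical helper name, not part of MiniBlogs Updater.</p>

```python
import re

def count_chars(msg):
    """Count characters by collapsing each 3-byte UTF-8 sequence into one byte."""
    data = msg.encode("utf-8")
    # Work on bytes, so both the pattern and the replacement are bytes literals.
    return len(re.sub(rb"[\x80-\xff]{3}", b"a", data))

msg = "hello 中文"

collapsed_length = count_chars(msg)
str_length = len(msg)
byte_length = len(msg.encode("utf-8"))
print(collapsed_length)
print(str_length)
print(byte_length)
```

<p>In Python 3 a <tt>str</tt> already counts code points, so <tt>len(msg)</tt> gives the same answer here without any regex. The byte-level pattern also has a limit: characters that UTF-8 encodes in two or four bytes do not fit the three-byte assumption and may be miscounted.</p>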
<p>Here are two more useful links on regular expressions for matching Chinese:</p>
<ul>
<li><a href="http://bbs.chinaunix.net/viewthread.php?tid=975358" target="_blank">Comparison of match results for common Chinese regular expressions</a></li>
<li><a href="http://bbs.chinaunix.net/viewthread.php?tid=907172" target="_blank">[Share] Summary of the encoding ranges of various character sets [updated 2007-03-12]</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-to-match-chinese.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

