<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; chinese</title>
	<atom:link href="http://iregex.org/blog/tag/chinese/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Tue, 31 Aug 2010 04:35:39 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>
	<item>
		<title>Notes on Matching Chinese with Regular Expressions in Python</title>
		<link>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html</link>
		<comments>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html#comments</comments>
		<pubDate>Sun, 27 Jun 2010 03:50:41 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[cjk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=129</guid>
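		<description><![CDATA[Notes on matching Chinese text with regular expressions in Python. Keywords: Chinese, CJK, UTF-8, Unicode, Python. As string data, Chinese is less tidy and regular than English, and there is no way around that. Drawing on material found online and on... ]]></description>
			<content:encoded><![CDATA[<p>Notes on matching Chinese text with regular expressions in Python. Keywords: Chinese, CJK, UTF-8, Unicode, Python.</p>
<p><span id="more-129"></span></p>
<p>As string data, Chinese is less tidy and regular than English, and there is no way around that. Drawing on material found online and on my own experience, this post summarizes a few points, using Python as the example language. Additions and corrections are welcome.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">A few tips</h3>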
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Use the <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function to inspect the raw form of a string. This helps when writing a regular expression for it.</li>
<li>Python's <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span></span></code> module has two similar functions: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span>.<span style="color: black;">match</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">search</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. Both run the same matching process and differ only in where they start. <code class="codecolorer python default"><span class="python">match</span></code> tries only at the beginning of the string and gives up if that fails, whereas <code class="codecolorer python default"><span class="python">search</span></code> tries every possible position in turn until it finds a match or reaches the end of the string without one. If you understand how <code class="codecolorer python default"><span class="python">match</span></code> behaves (it can be faster in some cases), feel free to use it; if you are unsure, <code class="codecolorer python default"><span class="python">search</span></code> is usually the function you want.</li>
<li>To find every match in a body of text and get the matches back as a list, use <code class="codecolorer python default"><span class="python">findall<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. The sample program later in this post has an example.</li>
<li>In <code class="codecolorer python default"><span class="python">utf8</span></code>, each common Chinese character occupies three bytes, so the byte-level pattern is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#91;</span>\x80-\xff<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#125;</span></span></code>. Note that this pattern matches any three consecutive non-ASCII bytes, not only Chinese characters.</li>
<li>In <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, a Chinese character is written as <code class="codecolorer python default"><span class="python">\uXXXX</span></code>. Once you know the code-point range of a script, you can match text in that script, which makes it easy to pick one language out of multilingual text. For an agglutinative language like Japanese, however, whose text mixes Chinese characters with hiragana and katakana, the results may be somewhat off.</li>
<li>Several character ranges can be placed side by side in one class. For example, Chinese characters, hiragana, and katakana together: <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span></span></code>. This lets you define exactly the text you want to match.</li>
<li>When matching Chinese, the regular expression and the target string must be in the same format. This is crucial. Either both are in the default <code class="codecolorer python default"><span class="python">utf8</span></code>, in which case you need to do nothing extra, or both are <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, in which case the pattern must be written with the <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;&quot;</span></span></code> prefix.</li>
<li>A <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string can be defined like this: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">string</span>=u<span style="color: #483d8b;">&quot;我爱正则表达式&quot;</span></span></code>. If a string is not <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, convert it with the <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function. If you know the encoding of the source string, use <code class="codecolorer python default"><span class="python">newstr=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>oldstring, original_coding_name<span style="color: black;">&#41;</span></span></code>. On Linux this is commonly <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>, <span style="color: #483d8b;">&quot;utf8&quot;</span><span style="color: black;">&#41;</span></span></code>; on Windows the encoding is probably <code class="codecolorer python default"><span class="python">cp936</span></code>, though I have not tested that.</li>
</ul>
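<p>The difference between the three functions in the list above can be sketched as follows. This is a minimal illustration in Python 3 syntax (the post itself targets Python 2); the sample text and the variable names are assumptions made up for the example.</p>

```python
import re

# Illustrative sample text: ASCII first, then two runs of Chinese.
text = "regex: 正则表达式 and 中文"
han = re.compile("[\u4e00-\u9fa5]+")  # the Han range used in this post

# match() only tries position 0, so it fails: the text starts with ASCII.
print(han.match(text))

# search() scans forward and stops at the first hit.
first = han.search(text)
print(first.group(), first.start())

# findall() returns every non-overlapping hit as a list.
print(han.findall(text))
```

<p>Here <code>match()</code> returns <code>None</code>, <code>search()</code> finds the first run of Chinese, and <code>findall()</code> returns both runs.</p>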
</blockquote>
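<p>The rule that the pattern and the target must be in the same format can be illustrated with a short sketch. It assumes Python 3, where the two formats are the <code>bytes</code> and <code>str</code> types rather than Python 2's <code>str</code> and <code>unicode</code>; the variable names are made up for the example.</p>

```python
import re

text = "我爱正则表达式 regex"
raw = text.encode("utf-8")

# Each of these seven Chinese characters encodes to three bytes in UTF-8.
print(len(text), len(raw))

# Byte pattern against bytes: any run of three non-ASCII bytes.
byte_hits = re.findall(rb"[\x80-\xff]{3}", raw)
print([b.decode("utf-8") for b in byte_hits])

# Text pattern against text: one code point per character.
print(re.findall("[\u4e00-\u9fa5]", text))

# Mixing the two kinds is rejected with a TypeError in Python 3.
try:
    re.findall("[\u4e00-\u9fa5]", raw)
except TypeError as exc:
    print("TypeError:", exc)
```

<p>Python 3 refuses to mix a text pattern with a byte string, which makes the mismatch easier to notice.</p>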
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Sample program</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;py_utf8_unicode.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-06-27 09:11</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> findPart<span style="color: black;">&#40;</span>regex, text, name<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; res=<span style="color: #dc143c;">re</span>.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>regex, text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;There are %d %s parts:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>res<span style="color: black;">&#41;</span>, name<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>,r<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><br />
<br />
<span style="color: #808080; font-style: italic;">#sample is utf8 by default.</span><br />
sample=<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'en: Regular expression is a powerful tool for manipulating text.<br />
zh: 正则表达式是一种很有用的处理文本的工具。<br />
jp: 正規表現は非常に役に立つツールテキストを操作することです。<br />
jp-char: あアいイうウえエおオ<br />
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.<br />
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃<br />
'</span><span style="color: #483d8b;">''</span><br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw utf8 string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#find the non-ascii chars:</span><br />
findPart<span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]+&quot;</span>,sample,<span style="color: #483d8b;">&quot;non-ascii&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#convert the utf8 to unicode</span><br />
usample=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sample,<span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw unicode string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>usample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#get the parts for each language:</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode chinese&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>ac00-<span style="color: #000099; font-weight: bold;">\u</span>d7ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode korean&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese katakana&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese hiragana&quot;</span><span style="color: black;">&#41;</span> <br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3000-<span style="color: #000099; font-weight: bold;">\u</span>303f<span style="color: #000099; font-weight: bold;">\u</span>fb00-<span style="color: #000099; font-weight: bold;">\u</span>fffd]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode cjk Punctuation&quot;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>Its output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">the raw utf8 string is:<br />
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'<br />
<br />
There are 14 non-ascii parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具。<br />
&nbsp; &nbsp; 正規表現は非常に役に立つツールテキストを操作することです。<br />
&nbsp; &nbsp; あアいイうウえエおオ<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
&nbsp; &nbsp; 。？！、，；：“<br />
&nbsp; &nbsp; ”‘<br />
&nbsp; &nbsp; ’——……·－·《》〈〉！￥％＆＊＃<br />
<br />
the raw unicode string is:<br />
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'<br />
<br />
There are 6 unicode chinese parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具<br />
&nbsp; &nbsp; 正規表現<br />
&nbsp; &nbsp; 非常<br />
&nbsp; &nbsp; 役<br />
&nbsp; &nbsp; 立<br />
&nbsp; &nbsp; 操作<br />
<br />
There are 8 unicode korean parts:<br />
<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
<br />
There are 6 unicode japanese katakana parts:<br />
<br />
&nbsp; &nbsp; ツールテキスト<br />
&nbsp; &nbsp; ア<br />
&nbsp; &nbsp; イ<br />
&nbsp; &nbsp; ウ<br />
&nbsp; &nbsp; エ<br />
&nbsp; &nbsp; オ<br />
<br />
There are 11 unicode japanese hiragana parts:<br />
<br />
&nbsp; &nbsp; は<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; つ<br />
&nbsp; &nbsp; を<br />
&nbsp; &nbsp; することです<br />
&nbsp; &nbsp; あ<br />
&nbsp; &nbsp; い<br />
&nbsp; &nbsp; う<br />
&nbsp; &nbsp; え<br />
&nbsp; &nbsp; お<br />
<br />
There are 5 unicode cjk Punctuation parts:<br />
<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。？！、，；：<br />
&nbsp; &nbsp; －<br />
&nbsp; &nbsp; 《》〈〉！￥％＆＊＃</div></td></tr></tbody></table></div>
</blockquote>
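<p>The program above is written for Python 2 (<code>print</code> statements, <code>unicode()</code>). A rough Python 3 sketch of the same per-script extraction might look like this. The code-point ranges are copied from the program above; the <code>RANGES</code> table, the <code>find_parts</code> name, and the shortened sample are assumptions for illustration only.</p>

```python
import re

# Code-point ranges copied from the program above; the dict keys are illustrative.
RANGES = {
    "chinese": "[\u4e00-\u9fa5]+",
    "korean": "[\uac00-\ud7ff]+",
    "katakana": "[\u30a0-\u30ff]+",
    "hiragana": "[\u3040-\u309f]+",
}

def find_parts(text):
    """Return {name: list of matched runs} for every range in RANGES."""
    return {name: re.findall(pattern, text) for name, pattern in RANGES.items()}

# In Python 3 a str is already Unicode text, so no conversion step is needed.
sample = "zh: 正则表达式\njp-char: あアいイ\nkr:정규 표현식은"
for name, parts in find_parts(sample).items():
    print(name, len(parts), parts)
```

<p>Keeping the ranges in one table makes it easy to add another script without touching the matching code.</p>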
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>A WordPress Word-Count Plugin for UTF-8 Chinese</title>
		<link>http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html</link>
		<comments>http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html#comments</comments>
		<pubDate>Fri, 02 Jan 2009 14:37:55 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[utf8]]></category>
		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=50</guid>
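		<description><![CDATA[I recently wanted my blog to show something like “This post has XXX words, continue reading&#8230;”. I found the Word Count Plugin for WordPress by Murray Williams online, but unfortunately it counts only English words, not Chinese characters. I downloaded... ]]></description>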
			<content:encoded><![CDATA[<p><a target="_blank" href="http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html"><img src="http://i293.photobucket.com/albums/mm60/zhasm/wordpressutf8-1.png" /></a> </p>
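<p>I recently wanted my blog to show something like “This post has XXX words, continue reading&#8230;”. I found the <a target="_blank" href="http://www.murraywilliams.com/software/word-count-plugin-for-wordpress/" rel="nofollow">Word Count Plugin for WordPress</a> by <a target="_blank" href="http://www.murraywilliams.com/">Murray Williams</a> online, but unfortunately it counts only English words, not Chinese characters. I downloaded the source and modified it myself to get the feature I wanted. The changes involved using <a target="_blank" href="http://iregex.org">regular expressions</a> to match Chinese in PHP, so I am writing the process down here. </p>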
<p><span id="more-50"></span><br />
The core of the plugin looks like this:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">function</span> mtw_wordcount<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">global</span> <span style="color: #000088;">$page</span><span style="color: #339933;">,</span> <span style="color: #000088;">$pages</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">function_exists</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'str_word_count'</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">str_word_count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #339933;">,</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
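<p>So the plugin does not count words itself; it calls the PHP function str_word_count. The official documentation is at <a title="http://us2.php.net/str_word_count" href="http://us2.php.net/str_word_count">http://us2.php.net/str_word_count</a>. The function counts the English words in a string and, depending on the format requested, returns either the number of words or an array of the words. It works by splitting the string on the default or specified separators and then counting the pieces. Because the function does not work on UTF-8 text, a UTF-8 version should be used instead: <a target="_blank" href="http://us2.php.net/manual/en/function.str-word-count.php#85592">str_word_count_utf8</a>.</p>
<p>English words are separated by spaces and other delimiters, but how should Chinese be counted? WordPress encodes Chinese text as UTF-8, so we start from there. As described in my earlier post <a target="_blank" href="http://iregex.org/blog/regex-to-match-chinese.html">Regular expressions for matching Chinese</a>, the pattern that matches a single Chinese character in UTF-8 is [\x80-\xff]{3}. So if every Chinese character is treated as one English word, the resulting word count should be correct. For example:</p>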
<p><tt class="string">你好，世界。Hello world.</tt></p>
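<p>To see why a space-based count falls short on this sentence, here is a tiny Python sketch (Python is used for illustration only; the plugin itself is PHP). Splitting on whitespace, much like the plugin's <code>explode(" ", ...)</code> fallback, treats the whole Chinese clause plus the first English word as a single token.</p>

```python
text = "你好，世界。Hello world."

# Whitespace is the only separator, so the Chinese clause sticks to "Hello".
tokens = text.split()
print(tokens)
print(len(tokens))  # 2
```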
<p>My approach is to replace every Chinese character with the letter a surrounded by spaces, using this regular expression:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/[\x80-\xff]{3}/'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">' a '</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>Since Chinese punctuation should not count toward the total, one more statement has to run first:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/～|！|｀|·|＃|￥|％|…|—|（|）|＋|－|＝|｛|｝|［|］|\＼|｜|“|”|’|‘|；|：|《|》|〈|〉|、|？|。|，/&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">' '</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
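<p>Running the program now gives the correct result for the example sentence: 6 (four Chinese characters plus two English words).</p>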
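<p>The same two-step idea can be sketched in Python to check the arithmetic. This is not the plugin's code: it matches Unicode code points in a Python 3 <code>str</code> instead of three-byte UTF-8 sequences, the punctuation list is abridged, and the names <code>PUNCTUATION</code>, <code>HAN</code>, and <code>count_words</code> are hypothetical.</p>

```python
import re

# Hypothetical helper names; the punctuation list is abridged from the post.
PUNCTUATION = re.compile("[～！·＃￥％…—（）＋－＝；：“”‘’《》〈〉、？。，]")
HAN = re.compile("[\u4e00-\u9fa5]")

def count_words(text):
    text = PUNCTUATION.sub(" ", text)  # Chinese punctuation does not count
    text = HAN.sub(" a ", text)        # one placeholder word per character
    return len(re.findall(r"[A-Za-z]+", text))

print(count_words("你好，世界。Hello world."))  # 6
```

<p>Stripping punctuation first matters in the byte-based PHP version, because full-width punctuation would otherwise match the three-byte pattern and be counted as a character.</p>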
<p>The complete PHP program for the WordPress UTF-8 Chinese word-count plugin is:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span><br />
load_plugin_textdomain<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'mtw-wordcount'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'wp-content/plugins/mtw-wordcount'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #666666; font-style: italic;">/*&nbsp; Call this from inside &quot;The Loop&quot; (see WordPress documentation) to get<br />
&nbsp; &nbsp; a WordCount of the current posting. Function RETURNS the Word Count as<br />
&nbsp; &nbsp; a value, it does not automatically display (ie. 'echo') the value.<br />
&nbsp;*/</span><br />
<br />
&nbsp; &nbsp; <span style="color: #990000;">define</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;WORD_COUNT_MASK&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">function</span> str_word_count_utf8<span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$format</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/～|！|｀|·|＃|￥|％|…|—|（|）|＋|－|＝|｛|｝|［|］|\＼|｜|“|”|’|‘|；|：|《|》|〈|〉|、|？|。|，/&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">' '</span><span style="color: #339933;">,</span><span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/[\x80-\xff]{3}/'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">' a '</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">switch</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$format</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">case</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span>WORD_COUNT_MASK<span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">case</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span>WORD_COUNT_MASK<span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #339933;">,</span> PREG_OFFSET_CAPTURE<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$match</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$result</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$match</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$match</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #000088;">$result</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span>WORD_COUNT_MASK<span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<br />
<br />
<span style="color: #000000; font-weight: bold;">function</span> mtw_wordcount<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">global</span> <span style="color: #000088;">$page</span><span style="color: #339933;">,</span> <span style="color: #000088;">$pages</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">function_exists</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'str_word_count_utf8'</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> str_word_count_utf8<span style="color: #009900;">&#40;</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #339933;">,</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">/*&nbsp; Auxiliary method. In case you want to use this somewhere outside the<br />
&nbsp; &nbsp; loop, where the $page and $pages globals don't necessarily work. Pass<br />
&nbsp; &nbsp; the string you want counted instead.<br />
*/</span><br />
<span style="color: #000000; font-weight: bold;">function</span> mtw_string_wordcount<span style="color: #009900;">&#40;</span><span style="color: #000088;">$instring</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">function_exists</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'str_word_count_utf8'</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> str_word_count_utf8<span style="color: #009900;">&#40;</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$instring</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #339933;">,</span><span style="color: #990000;">strip_tags</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$instring</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
<p>With these changes, the plugin counts Chinese characters correctly in WordPress. This is how I use it:</p>
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span> the_content<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;本文共计&quot;</span><span style="color: #339933;">.</span>mtw_wordcount<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">'字，已浏览'</span><span style="color: #339933;">.</span>the_views<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #339933;">,</span><span style="color: #009900; font-weight: bold;">false</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">'次。　继续阅读... »'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
<p>The the_views(&quot;&quot;,false) call comes from another plugin, <a href="http://fantasyworld.idv.tw/programs/wp_postviews_plus/">WP-PostViews Plus</a>, which I won't go into here.</p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/wordpress-word-counter-for-utf8-chinese.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Regex tools on Windows, part 3: MTracer 2.0 and how it compares with RegexBuddy</title>
		<link>http://iregex.org/blog/mtracer-vs-regexbuddy.html</link>
		<comments>http://iregex.org/blog/mtracer-vs-regexbuddy.html#comments</comments>
		<pubDate>Tue, 16 Sep 2008 03:29:13 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[软件]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[mtracer]]></category>
		<category><![CDATA[powergrep]]></category>
		<category><![CDATA[regexbuddy]]></category>
		<category><![CDATA[tool]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=32</guid>
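		<description><![CDATA[RegexBuddy and PowerGrep are the two regex tools I use most often on Windows. The former is an assistant for writing regular expressions; the latter is a utility for batch search and replace. Both were written by developers outside China. Today I introduce, from the Chinese developer Mr. Shi Shouwei (史寿伟), a... ]]></description>
			<content:encoded><![CDATA[<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://www.regexlab.com/images/logo_s.gif" />RegexBuddy and PowerGrep are the two regex tools I use most often on Windows. The former is an assistant for writing regular expressions; the latter is a utility for batch search and replace. Both were written by developers outside China. Today I want to introduce a regex tool written by a Chinese developer, <a target="_blank" href="http://www.softreg.com.cn/Search.aspx?t=authorid&amp;v=/efff07ff-b552-4034-8860-1964f285c4fc/">Mr. Shi Shouwei (史寿伟)</a>: MTracer 2.0. </p>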
<p><span id="more-32"></span></p>
<p>&#160;</p>
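<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">MTracer 2.0 version</h2>
<p>MTracer 2.0's full name is RegexMatchTracer, and its official home page is regexlab.com. The site gives 2007.10.07 as the latest update, but the program I downloaded today has a modification date of 2008.09.13. The difference is that the author used to ship a portable (no-install) build and now ships an msi installer. Unless stated otherwise, "MTracer" below means MTracer 2.0.</p>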
<p>&#160;</p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">MTracer features</h2>
<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i3.6.cn/cvbnm/61/db/85/73daa684cce5f9fe1d6f327aa5d5502c.jpg" />As an assistant for writing regular expressions, it has three modes, match, replace, and split, each performing the corresponding operation. The first two are the most common, and every language or regex tool has a statement or function for them. The third, <strong>split mode</strong>, means <strong>using a regular expression to describe the delimiter so that a string can be split into an array of substrings.</strong> As a simple example, the regex <tt class="regex">\d+;?</tt> splits the string <tt class="string">abcd12;sdf55656asdfasd82asd33x</tt> into this array of substrings:</p>
<ul>
<li>abcd </li>
<li>sdf </li>
<li>asdfasd </li>
<li>asd </li>
<li>x </li>
</ul>
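<p>In practice, you will sooner or later run into problems that are solved most efficiently with split mode.</p>
<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i3.6.cn/cvbnm/df/9f/96/cfd4162eebd15ed35237a49bf94b82ae.jpg" /> Besides the usual options (ignore case, single-line/multi-line, global), it offers two more matching options: right-to-left and extended mode.</p>
<p><strong>Right-to-left</strong>: normally, to find the rightmost match in a string, you position the pattern with the help of <tt class="regex">?</tt> and <tt class="regex">$</tt>. MTracer's option is a refreshing and fun alternative, although I have not yet met a real task that truly needs it.</p>
<p><strong>Extended mode</strong> covers the following:</p>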
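<p>A minimal sketch of the same split outside MTracer, assuming Python's standard <tt>re</tt> module (which the original post does not use) as a stand-in for MTracer's split mode:</p>

```python
import re

# The delimiter is "one or more digits, optionally followed by a semicolon".
text = "abcd12;sdf55656asdfasd82asd33x"
parts = re.split(r"\d+;?", text)

# Print one substring per line, like the list above.
for part in parts:
    print(part)
```

<p>Each run of digits (with its optional semicolon) is dropped, and the pieces between the delimiters come back as a list.</p>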
<ul>
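<li>Comments <tt class="regex">(?#xxx)</tt>: add comments inside a regex to improve readability; </li>
<li>Mode modifiers <tt class="regex">(?ismg-ismg)</tt>: change the matching mode within a limited scope; </li>
<li>Non-capturing groups <tt class="regex">(?:xxx)</tt>: match without capturing, which keeps group numbering simple and also saves memory and improves efficiency (according to <em>Mastering Regular Expressions</em>, 《精通正则表达式》); </li>
<li>Lookaround (zero-width assertions): a very useful option that <strong>matches only a position and consumes no characters</strong>; there are four forms, see the <a target="_blank" href="http://www.mediafire.com/?ugf3z8tklbl">manual</a>; </li>
<li>Independent expressions <tt class="regex">(?&gt;pattern)</tt>: in <em>Mastering Regular Expressions</em>, Yu Sheng (余晟) translates this as &#8220;固化分组&#8221; (atomic grouping). Whether or not the overall match succeeds, the engine never backtracks into the group and never retries it; </li>
<li>Conditional expressions <tt class="regex">(?(x)y|z)</tt>: similar to C's ternary operator. If condition x holds, y is matched, otherwise z. The x part has four forms, see the <a target="_blank" href="http://www.mediafire.com/?ugf3z8tklbl">manual</a>; </li>
<li>Recursive expressions <tt class="regex">(?R)</tt>: a reference to another subexpression itself, not to the text it matched. When the referenced expression contains itself, the reference becomes recursive. </li>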
</ul>
<p>Personally, the options I use most are non-capturing groups and lookaround.</p>
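<p>To make those two options concrete, here is a small sketch. It assumes Python's standard <tt>re</tt> module rather than MTracer's engine, and the sample strings are made up for illustration:</p>

```python
import re

# Non-capturing group: (?:...) groups the alternatives without creating
# a numbered group, so the year is still group 1.
m = re.search(r"(?:Jan|Feb|Mar) (\d{4})", "released Feb 2008")
year = m.group(1)

# Lookahead: (?=...) checks what follows without consuming it,
# so only the digits end up in the match.
prices = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

print(year, prices)
```

<p>The non-capturing group keeps the group numbering short, and the lookahead filters matches by their context without including that context in the result.</p>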
<p>&#160;</p>
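<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">MTracer's most useful feature</h2>
<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i3.6.cn/cvbnm/1c/a8/f0/2fed4bd2ff41a771aecc1558210c5442.jpg" /> Compared with RegexBuddy, another regex-writing assistant, MTracer's most useful feature is its regex parse tree in Chinese. RegexBuddy has a parse tree too, but only MTracer's is in Chinese. That is very convenient for users who need the feature but prefer not to use English software, and it helps beginners as well.</p>
<p>It is worth noting that Chinese regex terminology has not yet been standardized. In my own reading it falls roughly into two schools: the terms used throughout He Weiping's (何伟平) translation of <em>Programming Perl</em>, third edition, form the first set, and the terms used in Yu Sheng's (余晟) translation of <em>Mastering Regular Expressions</em> (《精通正则表达式》), third edition, form the second. I prefer the second.</p>
<p>The terminology MTracer uses differs from both sets in places.</p>
<p>As an aside: if someone localized RegexBuddy into Chinese using the second set of terms, I suspect its adoption in China would rise sharply.</p>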
<p>&#160;</p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">A brief comparison with RegexBuddy</h2>
<p>A table is the easiest way to show this.</p>
<table border="1" cellspacing="1" cellpadding="2" width="401">
<tbody>
<tr>
<td width="105" align="center"><strong>Property</strong></td>
<td width="132" align="center"><strong>MTracer</strong></td>
<td width="158" align="center"><strong>RegexBuddy</strong></td>
</tr>
<tr>
<td width="103">Interface language</td>
<td width="132">Chinese</td>
<td width="159">English</td>
</tr>
<tr>
<td width="102">Size</td>
<td width="132">471 KB</td>
<td width="160">9.1 MB</td>
</tr>
<tr>
<td width="103">Price (single user)</td>
<td width="131">Personal: RMB 49.00<br />
        <br />Company: RMB 298.00</td>
<td width="160">US$ 39.95</td>
</tr>
<tr>
<td width="103">Free-version limits</td>
<td width="131">Limit on regex length</td>
<td width="160">7-day free trial</td>
</tr>
<tr>
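<td width="103">Help files</td>
<td width="131">Included in earlier versions; the latest msi build no longer ships a help file. If you need it, download the <a target="_blank" href="http://www.mediafire.com/?ugf3z8tklbl">manual</a> uploaded by this site.</td>
<td width="160">Complete. Four e-books are included: three are regex tutorials and reference manuals, and one is the RegexBuddy user manual. Very thorough. In English.</td>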
</tr>
<tr>
<td width="103">Matching modes</td>
<td width="131">
<ol>
<li>Match </li>
<li>Replace </li>
<li>Split </li>
</ol>
</td>
<td width="160">
<ol>
<li>Match </li>
<li>Replace </li>
<li>Split </li>
</ol>
</td>
</tr>
<tr>
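<td width="103">Matching options</td>
<td width="131">
<ol>
<li>Ignore case (toggle) </li>
<li>Single-line / multi-line </li>
<li>Global or non-global, selectable </li>
<li>Right to left </li>
<li>Extended mode (see above) </li>
</ol>
</td>
<td width="160">
<ol>
<li>Ignore case (toggle) </li>
<li>Single-line / multi-line </li>
<li>Global by default </li>
<li>Dot matches newline </li>
<li>^ and $ match at line breaks </li>
<li>Free-spacing mode </li>
<li>Extended mode (depends on the language) </li>
</ol>
</td>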
</tr>
<tr>
<td width="103">History feature</td>
<td width="131">Provided by &#8220;Text snippets&#8221;</td>
<td width="160">History</td>
</tr>
<tr>
<td width="103">Export the regex as a string for a given language</td>
<td width="131">
<ol>
<li>As is </li>
<li>Visual Basic </li>
<li>C/C++ </li>
</ol>
</td>
<td width="160">
<ol>
<li>As is </li>
<li>C/C# </li>
<li>Perl (m// or s/// form) </li>
<li>Basic </li>
<li>Java </li>
<li>JavaScript </li>
<li>Pascal </li>
<li>PHP (//) </li>
<li>PostgreSQL </li>
<li>Python </li>
<li>Ruby </li>
<li>SQL </li>
<li>Tcl </li>
<li>XML </li>
<li><strong>Also includes a module (Use) that shows how to call the regex in each of the languages above. Powerful, useful, and easy to use.</strong> </li>
</ol>
</td>
</tr>
<tr>
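<td width="103">Library of common regexes</td>
<td width="131">Mentioned in the help file</td>
<td width="160">Built into the program, with definitions, sample code, and match examples.<br />
        <br />Also covered in the help file.</td>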
</tr>
<tr>
<td width="103">Extensibility</td>
<td width="131">Supports plugins (stdplgin.dll in the installation directory appears to be one; what it does is unclear.)</td>
<td width="160">Integrates with PowerGrep.</td>
</tr>
<tr>
<td width="103">Support forum</td>
<td width="131">Open to all; the address is <a target="_blank" href="http://www.regexlab.com/zh/discuss/forum.aspx?b=1">here</a></td>
<td width="160">Open only to users who have paid for and registered the software.</td>
</tr>
<tr>
<td width="103">Highly customizable interface</td>
<td width="132">Not supported</td>
<td width="161">Supported</td>
</tr>
<tr>
<td width="103">Batch replace in external files</td>
<td width="132">Not supported</td>
<td width="161">Supported</td>
</tr>
</tbody>
</table>
<p>Of course, plenty of features and details are left out; the table lists only the properties I care about.</p>
<p>&#160;</p>
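<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Registering MTracer 2.0</h2>
<p>Being paid for one's work is a rule every industry accepts today, so it is entirely reasonable that MTracer charges a registration fee. The price is RMB 48. Compared with RegexBuddy at US$ 39.95, that is excellent value.</p>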
<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i3.6.cn/cvbnm/a7/08/a1/f3e2daa8f96c1c47d4b08ef7a6c14224.jpg" /> </p>
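<p>What are the limits if you don't register? See the screenshot: regexes are limited to 100 characters.</p>
<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i3.6.cn/cvbnm/4c/d6/e2/5dcb18e7a4610987954ee363624a3584.jpg" /> This is the screenshot after registration.</p>
<p>The unregistered version is fine for writing short regexes day to day. If you want to use the program without limits, why not spend RMB 48 to support a Chinese-made program?</p>
<p>Whether willingly or not, people are becoming more aware of copyright. That is one reason I have never posted download links for the full versions of RegexBuddy and PowerGrep on this blog and instead ask readers to request them by email.</p>
<p><img style="border-bottom: rgb(255,255,255) 1px solid; border-left: rgb(255,255,255) 1px solid; margin: 0px 10px 10px; padding-left: 0px; float: right; clear: both; border-top: rgb(255,255,255) 1px solid; border-right: rgb(255,255,255) 1px solid" src="http://i3.6.cn/cvbnm/06/e9/0d/fe543fab025def9627a7f1a4df30f354.jpg" /> MTracer 2.0 is registered with a registration code. The author clearly knows how common cracking is in China, and humorously added this menu item: "How to crack this software in 8 hours?" Heh. There is a lot we can do in 8 hours; there is no need to waste them on track, debug, and crack. Your time is valuable.</p>
<p>I have the full versions of both programs. If you need RegexBuddy, click the link and leave your email address to request it; but please do <strong>not</strong> ask for MTracer.</p>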
<p>&#160;</p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Related reading</h2>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;"><a href="http://iregex.org/blog/regexbuddy.html">Regex tools on Windows, part 1: RegexBuddy</a></h3>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;"><a href="http://iregex.org/blog/powergrep.html">Regex tools on Windows, part 2: powergrep</a></h3>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/mtracer-vs-regexbuddy.html/feed</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>Exploring regular expressions that match Chinese</title>
		<link>http://iregex.org/blog/exploration-on-regular-rexpressions-that-match-chinese.html</link>
		<comments>http://iregex.org/blog/exploration-on-regular-rexpressions-that-match-chinese.html#comments</comments>
		<pubDate>Sat, 23 Aug 2008 16:22:29 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>
		<category><![CDATA[正则表达式]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=31</guid>
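		<description><![CDATA[Note: this article uses RegexBuddy 3.1.0 (full version), not the latest 3.1.1 (as of 2008.08.23). If you need that version, leave a comment on this post. Also: following the style of www.regular-expressions.info, I updated this theme's style.css and added... ]]></description>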
			<content:encoded><![CDATA[<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>Note: this article uses RegexBuddy 3.1.0 (full version), not the latest 3.1.1 (as of 2008.08.23). If you need that version, leave a comment on <a href="http://iregex.org/blog/regexbuddy.html" target="_blank"><font color="#ff008c">this post</font></a>.</p>
<p>Also: following the style of <a href="http://www.regular-expressions.info" target="_blank">www.regular-expressions.info</a>, I updated this theme's style.css file and added formats for regex-related code: </p>
<ul>
<li><strong>Regex</strong> format, for example: <tt class="regex">[a-z]+@[a-z]+?\.[a-z]+</tt> </li>
<li><strong>Match</strong> format, for example: <tt class="match">pig@animals.com</tt> and <tt class="match">chicken@birds.com</tt> </li>
<li><strong>Plain text</strong> format, for example: <tt class="string">这是一些普通文本。hello regex world. pig@animals.com和chicken@birds.com</tt> </li>
</ul>
<p><span id="more-31"></span></p>
<p>They are used like this: matching the string <tt class="string">这是一些普通文本。hello regex world. pig@animals.com和chicken@birds.com</tt> against the regex <tt class="regex">[a-z]+@[a-z]+?\.[a-z]+</tt> yields <tt class="match">pig@animals.com</tt> and <tt class="match">chicken@birds.com</tt>. </p>
</blockquote>
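<p>The note's example can be reproduced with a few lines of code. This is only a sketch, assuming Python's standard <tt>re</tt> module in place of RegexBuddy:</p>

```python
import re

# Sample text and regex taken from the note above.
text = "这是一些普通文本。hello regex world. pig@animals.com和chicken@birds.com"
pattern = r"[a-z]+@[a-z]+?\.[a-z]+"

# findall() returns every non-overlapping match, left to right.
matches = re.findall(pattern, text)
print(matches)
```

<p>Because the character class only covers lowercase ASCII letters, the Chinese character between the two addresses naturally ends the first match.</p>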
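<p><strong>The extremely coarse approach</strong>: the dot is close to universal. It matches any character, the only restriction being newlines, so of course it matches Chinese. As filler around the part you actually care about, a single <tt class="regex">.*</tt> swallows every character, Chinese included. This is an extreme case, since it captures nothing specific to Chinese text, and it is not what this article explores.</p>
<p><strong>The extremely narrow approach</strong>: to search for specific text, for example to match <tt class="regex">十拾</tt> in <tt class="string">一二三四五六七八九十拾佰百千仟万亿</tt>, m/<tt class="regex">十拾</tt>/ does the job directly. That is not what this article explores either. Just as <tt class="regex">\w</tt> matches English letters, what I am after is a shorthand that matches all Chinese characters and nothing else. </p>
<p><strong>The general-purpose approach</strong>: since Chinese characters are part of Unicode, let's look there. <a href="http://unicode.org/reports/tr18/" target="_blank">Unicode Regular Expressions</a> lists many ways of expressing Unicode in regexes. Searching it for "chinese" turns up this row:</p>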
<table width="400" border="1" cellpadding="2" cellspacing="1" unselectable="on">
<tbody>
<tr>
<td  valign="top" width="200">Writing Systems</td>
<td  valign="top" width="200">Blocks</td>
</tr>
<tr>
<td  valign="top" width="200">&#8230;</td>
<td  valign="top" width="200">&#8230;</td>
</tr>
<tr>
<td  valign="top" width="200">Chinese</td>
<td  valign="top" width="200">CJK Unified Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility Ideographs, CJK Compatibility Forms, Enclosed CJK Letters and Months, Small Form Variants, Bopomofo, Bopomofo Extended</td>
</tr>
</tbody>
</table>
<p>CJK stands for Chinese, Japanese, and Korean, as in CJK Unified Ideographs; see the <a href="http://baike.baidu.com/view/628156.html" target="_blank">Baidu Baike entry</a> or the <a href="http://en.wikipedia.org/wiki/CJK" target="_blank">wiki</a> article.</p>
<p>I then checked <a href="http://www.regular-expressions.info/" target="_blank">regular expressions</a>, whose <a href="http://www.regular-expressions.info/unicode.html" target="_blank">unicode</a> section has this entry:
</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p><tt class="regex">\p{InCJK_Unified_Ideographs}</tt>: U+4E00..U+9FFF </p></blockquote>
<p>
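This reminded me that my earlier article <a href="http://iregex.org/blog/regular-expressions-to-match-chinese-username-in-asp.html" target="_blank">An ASP regular expression for matching usernames (including Chinese)</a> matched Chinese with <tt class="regex">[\u4e00-\u9fa5]</tt>. So there is a corresponding shorthand after all, although the two differ in the final group of characters: the block runs to U+9FFF, while the character class stops at U+9FA5. The screenshot shows U+9fa5, the last character in that class.<img src="http://i3.6.cn/cvbnm/80/3c/69/ac41d1186fde1c67bf7cef334bc6a0c7.jpg" style="border: 1px solid rgb(255, 255, 255); margin: 0px 10px 10px; clear: both; padding-left: 0px; " alt="我爱正则表达式｜在RegexBuddy中如何使用正则表达式匹配中文字符｜http://iregex.org" /> The first code point in the sequence, U+4e00, is the character <tt class="string">一</tt>.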
</p>
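<p>A minimal sketch of the explicit range in use. It assumes Python, whose built-in <tt>re</tt> module does not accept <tt>\p{...}</tt> properties, so the code-point range is the practical choice there; the sample text is made up:</p>

```python
import re

# Explicit code-point range from the article: U+4E00 .. U+9FA5.
han = re.compile(r"[\u4e00-\u9fa5]+")

text = "hello 正则表达式 world, 一"
runs = han.findall(text)

# The first code point of the range is the character 一.
first_code_point = hex(ord("一"))

# The range ends at U+9FA5, so the next code point is not matched.
last_in_range = han.fullmatch(chr(0x9FA5)) is not None
next_after_range = han.fullmatch(chr(0x9FA6)) is not None

print(runs, first_code_point, last_in_range, next_after_range)
```

<p>The two boundary checks show where the class stops relative to the full U+4E00..U+9FFF block.</p>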
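<p><strong>Custom classes</strong>: at this point we have, in effect, found the official identity of Chinese characters: <tt class="regex">\p{InCJK_Unified_Ideographs}</tt> matches the Chinese characters in the CJK Unified Ideographs block. We can also wrap up things that recur so they are ready for reuse. For Arabic digits, for example, we have <tt class="regex">\d</tt>. Is there something similar for the Chinese numerals 一, 二, 三, 四, and so on?</p>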
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$zh_digit</span><span style="color: #339933;">=</span><span style="color: #009966; font-style: italic;">qr/一|二|三|四|五|六|七|八|九|十|零|〇|百|千|万|亿|佰|仟|壹|贰|叁|肆|伍|陆|柒|捌|玖|拾/</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #0000ff;">$str</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;人民币五十一万零三百元整。大写：伍拾壹万零三佰元整。&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$str</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span><span style="color: #0000ff;">$zh_digit</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">//</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$1</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</p>
<p>
<img src="http://i3.6.cn/cvbnm/6f/3d/c2/f974a15dbf6a2ceed6c6744961f39b27.jpg" style="border: 1px solid rgb(255, 255, 255); margin: 0px 10px 10px; clear: both; padding-left: 0px; " alt="我爱正则表达式｜在RegexBuddy中如何使用正则表达式匹配中文字符｜http://iregex.org" />
</p>
<p>The output is shown in the screenshot.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Conclusion</h3>
<p>You can use <tt class="regex">\p{InCJK_Unified_Ideographs}</tt> to match any Chinese character in that block. Where that notation is not supported, <tt class="regex">[\u4e00-\u9fa5]</tt> works as well.</p>
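<p>For readers who do not use Perl, the same idea can be sketched in Python. This is an assumption-laden rewrite rather than the author's code: it swaps the alternation for an equivalent character class and uses <tt>findall</tt> instead of the substitution loop:</p>

```python
import re

# Character class with the same members as the Perl $zh_digit alternation.
ZH_DIGITS = "一二三四五六七八九十零〇百千万亿佰仟壹贰叁肆伍陆柒捌玖拾"
zh_number = re.compile("[" + ZH_DIGITS + "]+")

text = "人民币五十一万零三百元整。大写：伍拾壹万零三佰元整。"

# Collect every maximal run of Chinese numeral characters.
found = zh_number.findall(text)
for item in found:
    print(item)
```

<p>The compiled pattern plays the role of a home-made <tt>\d</tt> for Chinese numerals and can be reused wherever such runs need to be extracted.</p>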
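<p>I don't think I have exhausted the subject of regular expressions for Chinese. Earlier, on Linux (UTF-8 encoding), when I was building a Zhengma code table for the scim input platform, I matched Chinese with <tt class="regex">[\x80-\xff]{3}</tt>, and that worked well too. See this post: <a href="http://zhasm.com/blog/longwen-zhengma-ime-table-in-scim-format.html" target="_blank" title="我爱正则表达式｜在RegexBuddy中如何使用正则表达式匹配中文字符｜http://iregex.org">龙文郑码码表 for scim</a>. I don't yet understand why it works and will look into it when I have time. If you know, please enlighten me; thanks in advance.</p>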
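<p>One likely explanation, offered as a sketch rather than a definitive answer: in UTF-8, every code point from U+0800 to U+FFFF, which includes the CJK Unified Ideographs block, is encoded as exactly three bytes, and each of those bytes has a value of 0x80 or above. A tool that matches byte by byte therefore sees each such character as three bytes in the <tt>\x80-\xff</tt> range. The Python snippet below, with made-up sample text, illustrates this:</p>

```python
import re

text = "中文abc"
data = text.encode("utf-8")

# Each of the two Chinese characters encodes to exactly three bytes.
byte_lengths = [len(ch.encode("utf-8")) for ch in "中文"]

# Every one of those bytes falls in the 0x80..0xFF range.
all_high = all(b >= 0x80 for b in "中文".encode("utf-8"))

# The byte-oriented pattern from the article, applied to raw UTF-8 bytes.
chunks = re.findall(rb"[\x80-\xff]{3}", data)
decoded = [chunk.decode("utf-8") for chunk in chunks]

print(byte_lengths, all_high, decoded)
```

<p>Note that the pattern is tied to the encoding: it also matches any other three-byte UTF-8 character, such as full-width punctuation, and it would miss characters encoded in four bytes.</p>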
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/exploration-on-regular-rexpressions-that-match-chinese.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>An ASP regular expression for matching usernames (including Chinese)</title>
		<link>http://iregex.org/blog/regular-expressions-to-match-chinese-username-in-asp.html</link>
		<comments>http://iregex.org/blog/regular-expressions-to-match-chinese-username-in-asp.html#comments</comments>
		<pubDate>Sun, 13 Jul 2008 14:01:17 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[问答]]></category>
		<category><![CDATA[asp]]></category>
		<category><![CDATA[chinese]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=18</guid>
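		<description><![CDATA[Someone posted this question on the Chinese regular-expression site: Wanted: an ASP username expression. The username is 2-20 characters long and consists of Chinese characters, upper- and lowercase letters, digits, hyphens (-), and underscores (_). The problem is not hard; a single core line of code does... ]]></description>
			<content:encoded><![CDATA[<p>Someone posted <a href="http://www.regex.net.cn/redirect.php?tid=30&amp;goto=lastpost#lastpost" target="_blank">this question</a> on the <a href="http://www.regex.net.cn" target="_blank">Chinese regular-expression site (正则表达式中文站)</a>:</p>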
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
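<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Wanted: an ASP username expression</h3>
<p>The username is 2-20 characters long and consists of Chinese characters, upper- and lowercase letters, digits, hyphens (-), and underscores (_).</p></blockquote>
<p>The problem is not hard; the following single line is the core of the solution:</p>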
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff0000;">&quot;^[-_a-zA-Z0-9<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]{2,20}$&quot;</span></div></td></tr></tbody></table></div>
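<p>As a quick check of this pattern outside ASP, here is a sketch in Python. The helper name <tt>is_valid_username</tt> and the sample names are hypothetical, and <tt>fullmatch()</tt> is used in place of the <tt>^</tt> and <tt>$</tt> anchors:</p>

```python
import re

# Same character class and length limits as the pattern above;
# fullmatch() requires the whole string to match, like ^...$.
USERNAME = re.compile(r"[-_a-zA-Z0-9\u4e00-\u9fa5]{2,20}")

def is_valid_username(name):
    """Return True if name is 2-20 allowed characters and nothing else."""
    return USERNAME.fullmatch(name) is not None

samples = ["张三_01", "a", "bad name!", "x" * 21]
results = {name: is_valid_username(name) for name in samples}
print(results)
```

<p>Placing the hyphen first inside the character class keeps it literal, so it cannot be mistaken for a range operator.</p>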
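<p>The catch was that I had never used ASP. I set up an ASP environment following <span style="color: #6666cc;"><a href="http://www.webase.net.cn/html/Program/Asp/200711/29.html" target="_blank">this page</a></span>, read through a few introductory ASP tutorials online, and came up with the following:<span id="more-19"></span></p>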
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-weight: bold;">&lt;</span>form action<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;verify.asp&quot;</span> method<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;post&quot;</span><span style="color: #006600; font-weight: bold;">&gt;</span><br />
姓名：<br />
<span style="color: #006600; font-weight: bold;">&lt;</span>input name<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;name&quot;</span> type<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;text&quot;</span> <span style="color: #006600; font-weight: bold;">/&gt;</span><br />
<br />
<span style="color: #006600; font-weight: bold;">&lt;</span>input name<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;Submit&quot;</span> type<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;submit&quot;</span> value<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;提交&quot;</span> <span style="color: #006600; font-weight: bold;">/&gt;</span><br />
<span style="color: #006600; font-weight: bold;">&lt;</span>input name<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;Submit2&quot;</span> type<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;reset&quot;</span> value<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;重置&quot;</span> <span style="color: #006600; font-weight: bold;">/&gt;</span><br />
<span style="color: #006600; font-weight: bold;">&lt;/</span>form<span style="color: #006600; font-weight: bold;">&gt;</span></div></td></tr></tbody></table></div>
<p>It calls the following verify.asp file:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-weight: bold;">&lt;%</span><br />
<span style="color: #0000ff; font-weight: bold;">Function</span> RegExpTest<span style="color: #006600; font-weight:bold;">&#40;</span>patrn, strng<span style="color: #006600; font-weight:bold;">&#41;</span><br />
<span style="color: #990099; font-weight: bold;">Dim</span> regEx, retVal <span style="color: #008000;">' Declare variables.</span><br />
<span style="color: #990099; font-weight: bold;">Set</span> regEx <span style="color: #006600; font-weight: bold;">=</span> <span style="color: #0000ff; font-weight: bold;">New</span> RegExp <span style="color: #008000;">' Create the regular expression object.</span><br />
regEx.<span style="color: #9900cc;">Pattern</span> <span style="color: #006600; font-weight: bold;">=</span> patrn <span style="color: #008000;">' Set the pattern.</span><br />
regEx.<span style="color: #9900cc;">IgnoreCase</span> <span style="color: #006600; font-weight: bold;">=</span> <span style="color: #0000ff; font-weight: bold;">False</span> <span style="color: #008000;">' Match case-sensitively.</span><br />
retVal <span style="color: #006600; font-weight: bold;">=</span> regEx.<span style="color: #9900cc;">Test</span><span style="color: #006600; font-weight:bold;">&#40;</span>strng<span style="color: #006600; font-weight:bold;">&#41;</span> <span style="color: #008000;">' Run the search test.</span><br />
<span style="color: #990099; font-weight: bold;">If</span> retVal <span style="color: #990099; font-weight: bold;">Then</span><br />
RegExpTest <span style="color: #006600; font-weight: bold;">=</span> <span style="color: #cc0000;">&quot;Valid username.&quot;</span><br />
<span style="color: #990099; font-weight: bold;">Else</span><br />
RegExpTest <span style="color: #006600; font-weight: bold;">=</span> <span style="color: #cc0000;">&quot;Invalid username.&quot;</span><br />
<span style="color: #990099; font-weight: bold;">End</span> <span style="color: #990099; font-weight: bold;">If</span><br />
<span style="color: #990099; font-weight: bold;">End</span> <span style="color: #0000ff; font-weight: bold;">Function</span><br />
<br />
name<span style="color: #006600; font-weight: bold;">=</span><span style="color: #990099; font-weight: bold;">request</span>.<span style="color: #330066;">form</span><span style="color: #006600; font-weight:bold;">&#40;</span><span style="color: #cc0000;">&quot;name&quot;</span><span style="color: #006600; font-weight:bold;">&#41;</span><br />
psw<span style="color: #006600; font-weight: bold;">=</span><span style="color: #990099; font-weight: bold;">request</span>.<span style="color: #330066;">form</span><span style="color: #006600; font-weight:bold;">&#40;</span><span style="color: #cc0000;">&quot;psw&quot;</span><span style="color: #006600; font-weight:bold;">&#41;</span><br />
sex<span style="color: #006600; font-weight: bold;">=</span><span style="color: #990099; font-weight: bold;">request</span>.<span style="color: #330066;">form</span><span style="color: #006600; font-weight:bold;">&#40;</span><span style="color: #cc0000;">&quot;sex&quot;</span><span style="color: #006600; font-weight:bold;">&#41;</span><br />
city<span style="color: #006600; font-weight: bold;">=</span><span style="color: #990099; font-weight: bold;">request</span>.<span style="color: #330066;">form</span><span style="color: #006600; font-weight:bold;">&#40;</span><span style="color: #cc0000;">&quot;city&quot;</span><span style="color: #006600; font-weight:bold;">&#41;</span><br />
<span style="color: #990099; font-weight: bold;">Response</span>.<span style="color: #330066;">write</span> RegExpTest<span style="color: #006600; font-weight:bold;">&#40;</span><span style="color: #cc0000;">&quot;^[-_a-zA-Z0-9\u4e00-\u9fa5]{2,20}$&quot;</span>, name<span style="color: #006600; font-weight:bold;">&#41;</span><br />
<span style="color: #006600; font-weight: bold;">%&gt;</span></div></td></tr></tbody></table></div>
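<p>The screenshot below shows the page in action.<img style="max-width: 800px;" src="http://i3.6.cn/cvbnm/83/1a/5c/bc56d8b70e9fcc5f9565b47cc651def5.jpg" alt="" /></p>
<p>In addition, here are a few more <a href="http://iregex.org">regular expressions</a> for reference:</p>
<p>A <a href="http://iregex.org">regular expression</a> that matches Chinese characters:</p>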
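<p>For comparison, here is a minimal Python sketch of the same check that verify.asp performs. The pattern is the one from the ASP code; the name <code>is_valid_username</code> is only illustrative and does not appear in the original. It uses <code>fullmatch</code> instead of <code>^...$</code> so that a trailing newline is not silently accepted.</p>

```python
import re

# Same character set and length limits as the VBScript example:
# 2 to 20 characters, each a hyphen, underscore, ASCII letter, ASCII digit,
# or a character in the range U+4E00..U+9FA5.
USERNAME_RE = re.compile(r"[-_a-zA-Z0-9\u4e00-\u9fa5]{2,20}")


def is_valid_username(name: str) -> bool:
    """Return True if the whole string matches the username pattern.

    is_valid_username is a hypothetical helper name used for illustration.
    """
    return USERNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["rex", "正则表达式", "a", "bad name!"]:
        print(candidate, is_valid_username(candidate))
```

<p>Names that are too short, too long, or contain spaces or punctuation are rejected, just as in the ASP version.</p>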
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-weight:bold;">&#91;</span>\u4e00-\u9fa5<span style="color: #006600; font-weight:bold;">&#93;</span></div></td></tr></tbody></table></div>
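<p>As a quick illustration, the range above can be tried out in Python roughly like this. The helper names <code>find_chinese</code> and <code>contains_chinese</code> are made up for this sketch. Note that the range covers common Chinese characters only, so full-width punctuation such as "，" is not matched.</p>

```python
import re

# One character in the range U+4E00..U+9FA5, the range shown above.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fa5]")


def find_chinese(text: str) -> list:
    """Return every character of text that falls in U+4E00..U+9FA5."""
    return CHINESE_CHAR.findall(text)


def contains_chinese(text: str) -> bool:
    """Return True if text has at least one character in that range."""
    return CHINESE_CHAR.search(text) is not None


if __name__ == "__main__":
    print(find_chinese("regex 正则 123"))
    print(contains_chinese("plain ascii"))
```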
<p>A <a href="http://iregex.org">regular expression</a> that matches double-byte characters (Chinese characters included):</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-weight:bold;">&#91;</span>^\x00-\xff<span style="color: #006600; font-weight:bold;">&#93;</span></div></td></tr></tbody></table></div>
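<p>One way this pattern might be used, sketched in Python: count each character above U+00FF as two columns when estimating how wide a string is. This use case and the name <code>display_width</code> are assumptions for illustration; the original only lists the pattern.</p>

```python
import re

# Any character outside U+0000..U+00FF, i.e. anything that does not fit
# in a single byte in a Latin-1 style encoding.
NON_LATIN1 = re.compile(r"[^\x00-\xff]")


def display_width(text: str) -> int:
    """Count characters above U+00FF as 2 and everything else as 1.

    display_width is a hypothetical helper; the doubling rule is an
    assumption made for this example.
    """
    # Replace each wide character with two placeholder letters, then count.
    return len(NON_LATIN1.sub("aa", text))


if __name__ == "__main__":
    print(display_width("abc中文"))
```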
<p>A <a href="http://iregex.org">regular expression</a> that matches blank lines:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">\n\s<span style="color: #006600; font-weight: bold;">*</span>\r</div></td></tr></tbody></table></div>
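<p>A blank-line pattern of the form <code>\n\s*\r</code> assumes Windows-style <code>\r\n</code> line endings: it matches from the newline that ends one line up to the carriage return that ends the following empty (or whitespace-only) line. A rough Python sketch, with the hypothetical helper name <code>drop_blank_lines</code>:</p>

```python
import re

# From the "\n" ending one line to the "\r" ending the next line,
# with nothing but whitespace in between.
BLANK_LINE = re.compile(r"\n\s*\r")


def drop_blank_lines(text: str) -> str:
    """Remove blank lines from text that uses \\r\\n line endings.

    drop_blank_lines is an illustrative name, not from the original post.
    """
    return BLANK_LINE.sub("", text)


if __name__ == "__main__":
    print(repr(drop_blank_lines("a\r\n\r\nb\r\n")))
```

<p>Text that uses bare <code>\n</code> line endings contains no <code>\r</code>, so this pattern leaves it untouched.</p>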
<p>A <a href="http://iregex.org">regular expression</a> that matches HTML tags:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-weight: bold;">/&lt;</span><span style="color: #006600; font-weight:bold;">&#40;</span>.<span style="color: #006600; font-weight: bold;">*</span><span style="color: #006600; font-weight:bold;">&#41;</span><span style="color: #006600; font-weight: bold;">&gt;</span>.<span style="color: #006600; font-weight: bold;">*</span><span style="color: #006600; font-weight: bold;">&lt;</span>\<span style="color: #006600; font-weight: bold;">/</span>\<span style="color: #800000;">1</span><span style="color: #006600; font-weight: bold;">&gt;</span><span style="color: #006600; font-weight: bold;">|</span><span style="color: #006600; font-weight: bold;">&lt;</span><span style="color: #006600; font-weight:bold;">&#40;</span>.<span style="color: #006600; font-weight: bold;">*</span><span style="color: #006600; font-weight:bold;">&#41;</span> \<span style="color: #006600; font-weight: bold;">/&gt;</span><span style="color: #006600; font-weight: bold;">/</span></div></td></tr></tbody></table></div>
<p>A <a href="http://iregex.org">regular expression</a> that matches leading and trailing whitespace:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #006600; font-weight:bold;">&#40;</span>^\s<span style="color: #006600; font-weight: bold;">*</span><span style="color: #006600; font-weight:bold;">&#41;</span><span style="color: #006600; font-weight: bold;">|</span><span style="color: #006600; font-weight:bold;">&#40;</span>\s<span style="color: #006600; font-weight: bold;">*</span>$<span style="color: #006600; font-weight:bold;">&#41;</span></div></td></tr></tbody></table></div>
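<p>Replacing every match of this pattern with an empty string trims a string at both ends. A small Python sketch follows; the name <code>trim</code> is chosen for illustration only, and for plain whitespace the built-in <code>str.strip()</code> does the same job.</p>

```python
import re

# Either whitespace at the very start or whitespace at the very end.
EDGE_SPACE = re.compile(r"(^\s*)|(\s*$)")


def trim(text: str) -> str:
    """Remove leading and trailing whitespace, leaving inner spaces alone.

    trim is a hypothetical helper name used only in this sketch.
    """
    return EDGE_SPACE.sub("", text)


if __name__ == "__main__":
    print(repr(trim("  hi there  ")))
```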
<p>Using a <a href="http://iregex.org">regular expression</a> to allow only Chinese input:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">onkeyup<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;value=value.replace(/[^\u4E00-\u9FA5]/g, '')&quot;</span> onbeforepaste<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;clipboardData.setData('text', clipboardData.getData('text').replace(/[^\u4E00-\u9FA5]/g, ''))&quot;</span></div></td></tr></tbody></table></div>
<p>Using a <a href="http://iregex.org">regular expression</a> to allow only full-width characters:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">onkeyup<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;value=value.replace(/[^\uFF00-\uFFFF]/g, '')&quot;</span> onbeforepaste<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;clipboardData.setData('text', clipboardData.getData('text').replace(/[^\uFF00-\uFFFF]/g, ''))&quot;</span></div></td></tr></tbody></table></div>
<p>Using a <a href="http://iregex.org">regular expression</a> to allow only digits:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">onkeyup<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;value=value.replace(/[^\d]/g, '')&quot;</span> onbeforepaste<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;clipboardData.setData('text', clipboardData.getData('text').replace(/[^\d]/g, ''))&quot;</span></div></td></tr></tbody></table></div>
<p>Using a <a href="http://iregex.org">regular expression</a> to allow only digits and English letters:</p>
<div class="codecolorer-container asp mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="asp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">onkeyup<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;value=value.replace(/[\W]/g, '')&quot;</span> onbeforepaste<span style="color: #006600; font-weight: bold;">=</span><span style="color: #cc0000;">&quot;clipboardData.setData('text', clipboardData.getData('text').replace(/[\W]/g, ''))&quot;</span></div></td></tr></tbody></table></div>
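<p>The four input filters above all follow one idea: delete every character that is not in the allowed set. Here is a hedged Python sketch of that idea. The table <code>FILTERS</code> and the function <code>keep_only</code> are hypothetical names, and <code>re.ASCII</code> is added so that <code>\d</code> and <code>\W</code> behave as they do in JavaScript (in Python they are Unicode-aware by default). Note that <code>\W</code> keeps the underscore as well as letters and digits.</p>

```python
import re

# Each pattern matches the characters to delete, mirroring the handlers above.
# FILTERS and keep_only are illustrative names, not part of the original post.
FILTERS = {
    "chinese": re.compile(r"[^\u4E00-\u9FA5]"),
    "fullwidth": re.compile(r"[^\uFF00-\uFFFF]"),
    # re.ASCII restricts \d and \W to their ASCII meaning, as in JavaScript.
    "digits": re.compile(r"[^\d]", re.ASCII),
    "word": re.compile(r"[\W]", re.ASCII),
}


def keep_only(kind: str, value: str) -> str:
    """Strip every character that the chosen filter does not allow."""
    return FILTERS[kind].sub("", value)


if __name__ == "__main__":
    print(keep_only("chinese", "abc正则123"))
    print(keep_only("digits", "tel: 010-1234"))
```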
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regular-expressions-to-match-chinese-username-in-asp.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Regular Expressions for Matching Chinese</title>
		<link>http://iregex.org/blog/regex-to-match-chinese.html</link>
		<comments>http://iregex.org/blog/regex-to-match-chinese.html#comments</comments>
		<pubDate>Mon, 02 Jun 2008 06:23:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=14</guid>
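		<description><![CDATA[Some time ago, while writing a Zhengma code table for SCIM on Linux, I had to deal with matching Chinese in regular expressions. The rule of thumb I settled on was that, with UTF-8 encoded text, a regex for a Chinese character should be written like this: &#91;\x80-\xff&#93;&#123;3&#125; Of course, this is independent of... ]]></description>
			<content:encoded><![CDATA[<p>Some time ago, while writing a Zhengma code table for SCIM on Linux, I had to deal with matching Chinese in regular expressions. The rule of thumb I settled on was that, with UTF-8 encoded text, a regex for a Chinese character should be written like this:</p>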
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\x80</span><span style="color: #339933;">-</span><span style="color: #0000ff;">\xff</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
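<p>The pattern works on raw bytes: in UTF-8 a common Chinese character is encoded as three bytes, each in the range 0x80 to 0xFF. Naturally, this does not depend on the language; it is the same in Perl and Python.</p>
<p>Now this regex has come in handy again. A small program I am writing, <a href="http://code.google.com/p/fanfoufans/wiki/MiniBlogsUpdater" target="_blank" title="Type once, update five places! Updates your microblogs on Twitter, Hainei, Jiwai, Zuosa, and Fanfou at the same time.">MiniBlogs Updater</a>, needs to count the number of characters the user has typed. Chinese and English characters have different encoded lengths, so calling Python's len() directly on the string returns its actual length in bytes, in which one Chinese character does not count the same as one English letter. Each Chinese character therefore has to be treated as if it were a single English letter.</p>
<p>I wrote the following statement to handle it:</p>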
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">length=<span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3}'</span>,<span style="color: #483d8b;">'a'</span>,msg<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
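<p>It replaces every Chinese character with the English letter a and then counts the characters. (It only counts; the source string is not modified.) This statement works correctly in a UTF-8 file on Windows.</p>
<p>Here are two more useful links about regular expressions for matching Chinese:</p>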
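<p>The statement above relies on byte strings, as in Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It encodes the text to UTF-8 first and applies the same byte pattern; <code>count_chars</code> is a hypothetical name. In Python 3 a <code>str</code> is already a sequence of Unicode code points, so <code>len(msg)</code> gives the same count directly for text like this.</p>

```python
import re

# Three consecutive bytes in 0x80..0xFF: the shape of one common
# Chinese character once the text is encoded as UTF-8.
THREE_HIGH_BYTES = re.compile(rb"[\x80-\xff]{3}")


def count_chars(msg: str) -> int:
    """Count characters, treating each 3-byte UTF-8 sequence as one letter.

    count_chars is an illustrative helper, not from the original post.
    """
    data = msg.encode("utf-8")
    # Replace each 3-byte sequence with a single placeholder byte, then count.
    return len(THREE_HIGH_BYTES.sub(b"a", data))


if __name__ == "__main__":
    text = "regex 正则表达式"
    print(len(text.encode("utf-8")), count_chars(text), len(text))
```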
<ul>
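<li><a href="http://bbs.chinaunix.net/viewthread.php?tid=975358" target="_blank">A comparison of the matching results of common Chinese regular expressions</a></li>
<li><a href="http://bbs.chinaunix.net/viewthread.php?tid=907172" target="_blank">[Share] A summary of the encoding ranges of various character sets [updated 2007-03-12]</a></li>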
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-to-match-chinese.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
