<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; python</title>
	<atom:link href="http://iregex.org/blog/tag/python/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Sun, 27 Jun 2010 04:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>	
	<item>
		<title>Notes on Matching Chinese with Regular Expressions in Python</title>
		<link>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html</link>
		<comments>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html#comments</comments>
		<pubDate>Sun, 27 Jun 2010 03:50:41 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[cjk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=129</guid>
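		<description><![CDATA[A summary of my experience matching Chinese text with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python. From a string-processing point of view, Chinese is less tidy and regular than English, and that is an unavoidable fact. This article draws on material found online and on my own... ]]></description>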
			<content:encoded><![CDATA[<p>A summary of my experience matching Chinese text with regular expressions in Python. Keywords: Chinese, cjk, utf8, unicode, python.</p>
<p><span id="more-129"></span></p>
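<p>From a string-processing point of view, Chinese is less tidy and regular than English, and that is an unavoidable fact. Drawing on material found online and on my own experience, this article summarizes a few points using Python as the example language. Additions and corrections are welcome.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some tips</h3>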
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li>Use the <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function to inspect the raw representation of a string. This helps when writing regular expressions.
            </li>
<li>Python's <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span></span></code> module has two similar functions: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">re</span>.<span style="color: black;">match</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">search</span></span></code>. Both use exactly the same matching process and differ only in where they start. <code class="codecolorer python default"><span class="python">match</span></code> attempts a match only at the beginning of the string and gives up if that fails, whereas <code class="codecolorer python default"><span class="python">search</span></code> tries every possible position in the string until it either finds a match or reaches the end and fails. If you understand how <code class="codecolorer python default"><span class="python">match</span></code> behaves (it can be faster in some cases), feel free to use it; if you are unsure, <code class="codecolorer python default"><span class="python">search</span></code> is usually the function you want.</li>
<li>To find every match in a body of text and get the results back as a list, use <code class="codecolorer python default"><span class="python">findall<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code>. See the sample program below for an example.</li>
<li>Under <code class="codecolorer python default"><span class="python">utf8</span></code>, each common Chinese character occupies 3 bytes, so in a byte string the regex for one character is <code class="codecolorer python default"><span class="python"><span style="color: black;">&#91;</span>\x80-\xff<span style="color: black;">&#93;</span><span style="color: black;">&#123;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#125;</span></span></code>. Note that this pattern matches any three non-ASCII bytes, so it also matches kana, hangul, and full-width punctuation.</li>
<li>In a <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string, Chinese characters are written as <code class="codecolorer python default"><span class="python">\uXXXX</span></code>. Once you know the code-point range of the script you want, you can match text in that script, which makes it easy to pick one language out of multilingual text. For an agglutinative language such as Japanese, however, whose text mixes Chinese characters with hiragana and katakana, the results may be somewhat off.</li>
<li>Several ranges can be placed side by side in one character class. For example, hiragana, katakana, and Chinese characters can be combined as <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span></span></code> to define exactly the text you want to match.</li>
<li>When matching Chinese, the regex and the target string must be of the same type. This is crucial. Either both are plain byte strings (<code class="codecolorer python default"><span class="python">utf8</span></code> by default here), in which case nothing extra is needed; or, if the target is <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, the regex must be written with the <code class="codecolorer python default"><span class="python">u<span style="color: #483d8b;">&quot;&quot;</span></span></code> prefix.</li>
<li>A <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code> string can be defined like this: <code class="codecolorer python default"><span class="python"><span style="color: #dc143c;">string</span>=u<span style="color: #483d8b;">&quot;我爱正则表达式&quot;</span></span></code>. If a string is not <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span></span></code>, convert it with the <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></span></code> function. If you know the encoding of the source string, use the form <code class="codecolorer python default"><span class="python">newstr=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>oldstring, original_coding_name<span style="color: black;">&#41;</span></span></code>; on Linux this is commonly <code class="codecolorer python default"><span class="python"><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>, <span style="color: #483d8b;">&quot;utf8&quot;</span><span style="color: black;">&#41;</span></span></code>, while on Windows the encoding may be <code class="codecolorer python default"><span class="python">cp936</span></code>, though I have not tested that.</li>
</ul>
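<p>The tips above are written for Python 2. As a minimal sketch of the same ideas in Python 3 (an assumption on my part, since the article does not cover it), the snippet below contrasts <code>match</code>, <code>search</code>, and <code>findall</code>, and shows that the pattern and the subject must both be <code>str</code> or both be <code>bytes</code>. The names <code>han</code>, <code>raw</code>, <code>byte_runs</code>, and <code>mixed_ok</code> are illustrative only.</p>

```python
import re

text = "regex: 正则表达式"
han = re.compile("[\u4e00-\u9fa5]+")

# match() only tries position 0; search() scans forward for the first match.
print(han.match(text))             # None, because the text starts with ASCII
print(han.search(text).group())    # 正则表达式

# findall() returns every non-overlapping match as a list.
print(han.findall("中文 and 日本語"))

# In UTF-8 each of these characters takes three bytes, so a bytes pattern
# has to be used against the encoded form.
raw = text.encode("utf-8")
byte_runs = re.findall(rb"[\x80-\xff]{3}", raw)
print(len(byte_runs))              # 5, one three-byte run per character

# Mixing a str pattern with a bytes subject raises TypeError.
try:
    han.search(raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                    # False
```

<p>In Python 3 the type mismatch fails loudly, whereas the Python 2 behaviour described in the list tends to fail silently by simply not matching.</p>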
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Sample program</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<span style="color: #808080; font-style: italic;">#</span><br />
<span style="color: #808080; font-style: italic;">#author: &nbsp; &nbsp; &nbsp; &nbsp; rex</span><br />
<span style="color: #808080; font-style: italic;">#blog: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://iregex.org</span><br />
<span style="color: #808080; font-style: italic;">#filename &nbsp; &nbsp; &nbsp; &nbsp;py_utf8_unicode.py</span><br />
<span style="color: #808080; font-style: italic;">#created: &nbsp; &nbsp; &nbsp; &nbsp;2010-06-27 09:11</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> findPart<span style="color: black;">&#40;</span>regex, text, name<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; res=<span style="color: #dc143c;">re</span>.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>regex, text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;There are %d %s parts:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>res<span style="color: black;">&#41;</span>, name<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> res:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>,r<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span><br />
<br />
<span style="color: #808080; font-style: italic;">#sample is utf8 by default.</span><br />
sample=<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'en: Regular expression is a powerful tool for manipulating text.<br />
zh: 正则表达式是一种很有用的处理文本的工具。<br />
jp: 正規表現は非常に役に立つツールテキストを操作することです。<br />
jp-char: あアいイうウえエおオ<br />
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.<br />
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃<br />
'</span><span style="color: #483d8b;">''</span><br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw utf8 string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#find the non-ascii chars:</span><br />
findPart<span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]+&quot;</span>,sample,<span style="color: #483d8b;">&quot;non-ascii&quot;</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#convert the utf8 to unicode</span><br />
usample=<span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sample,<span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">#let's look at its raw representation under the hood:</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;the raw unicode string is:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #dc143c;">repr</span><span style="color: black;">&#40;</span>usample<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">print</span> <br />
<br />
<span style="color: #808080; font-style: italic;">#get the parts for each language:</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>4e00-<span style="color: #000099; font-weight: bold;">\u</span>9fa5]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode chinese&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>ac00-<span style="color: #000099; font-weight: bold;">\u</span>d7ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode korean&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>30a0-<span style="color: #000099; font-weight: bold;">\u</span>30ff]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese katakana&quot;</span><span style="color: black;">&#41;</span><br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3040-<span style="color: #000099; font-weight: bold;">\u</span>309f]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode japanese hiragana&quot;</span><span style="color: black;">&#41;</span> <br />
findPart<span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;[<span style="color: #000099; font-weight: bold;">\u</span>3000-<span style="color: #000099; font-weight: bold;">\u</span>303f<span style="color: #000099; font-weight: bold;">\u</span>fb00-<span style="color: #000099; font-weight: bold;">\u</span>fffd]+&quot;</span>, usample, <span style="color: #483d8b;">&quot;unicode cjk Punctuation&quot;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>The output is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">the raw utf8 string is:<br />
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'<br />
<br />
There are 14 non-ascii parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具。<br />
&nbsp; &nbsp; 正規表現は非常に役に立つツールテキストを操作することです。<br />
&nbsp; &nbsp; あアいイうウえエおオ<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
&nbsp; &nbsp; 。？！、，；：“<br />
&nbsp; &nbsp; ”‘<br />
&nbsp; &nbsp; ’——……·－·《》〈〉！￥％＆＊＃<br />
<br />
the raw unicode string is:<br />
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'<br />
<br />
There are 6 unicode chinese parts:<br />
<br />
&nbsp; &nbsp; 正则表达式是一种很有用的处理文本的工具<br />
&nbsp; &nbsp; 正規表現<br />
&nbsp; &nbsp; 非常<br />
&nbsp; &nbsp; 役<br />
&nbsp; &nbsp; 立<br />
&nbsp; &nbsp; 操作<br />
<br />
There are 8 unicode korean parts:<br />
<br />
&nbsp; &nbsp; 정규<br />
&nbsp; &nbsp; 표현식은<br />
&nbsp; &nbsp; 매우<br />
&nbsp; &nbsp; 유용한<br />
&nbsp; &nbsp; 도구<br />
&nbsp; &nbsp; 텍스트를<br />
&nbsp; &nbsp; 조작하는<br />
&nbsp; &nbsp; 것입니다<br />
<br />
There are 6 unicode japanese katakana parts:<br />
<br />
&nbsp; &nbsp; ツールテキスト<br />
&nbsp; &nbsp; ア<br />
&nbsp; &nbsp; イ<br />
&nbsp; &nbsp; ウ<br />
&nbsp; &nbsp; エ<br />
&nbsp; &nbsp; オ<br />
<br />
There are 11 unicode japanese hiragana parts:<br />
<br />
&nbsp; &nbsp; は<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; に<br />
&nbsp; &nbsp; つ<br />
&nbsp; &nbsp; を<br />
&nbsp; &nbsp; することです<br />
&nbsp; &nbsp; あ<br />
&nbsp; &nbsp; い<br />
&nbsp; &nbsp; う<br />
&nbsp; &nbsp; え<br />
&nbsp; &nbsp; お<br />
<br />
There are 5 unicode cjk Punctuation parts:<br />
<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。<br />
&nbsp; &nbsp; 。？！、，；：<br />
&nbsp; &nbsp; －<br />
&nbsp; &nbsp; 《》〈〉！￥％＆＊＃</div></td></tr></tbody></table></div>
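<p>The listing above relies on Python 2 features (the <code>print</code> statement and the <code>unicode()</code> function). A rough Python 3 sketch of the same idea might look like the following. It assumes the same code-point ranges as the original program; <code>RANGES</code> and <code>find_parts</code> are hypothetical names, and the function returns the matches instead of printing them.</p>

```python
import re

# Code-point ranges taken from the original program; the dict itself is
# a hypothetical convenience, not part of the original script.
RANGES = {
    "chinese": "[\u4e00-\u9fa5]+",
    "korean": "[\uac00-\ud7ff]+",
    "katakana": "[\u30a0-\u30ff]+",
    "hiragana": "[\u3040-\u309f]+",
}

def find_parts(text):
    """Return a dict mapping each script name to the list of matching runs."""
    return {name: re.findall(rx, text) for name, rx in RANGES.items()}

# In Python 3 a str is already Unicode, so no unicode() conversion is needed.
sample = "zh: 正则表达式\njp-char: あアいイ\nkr: 정규 표현식은"
parts = find_parts(sample)
for name, found in parts.items():
    print("There are %d %s parts:" % (len(found), name), found)
```

<p>Because hiragana and katakana alternate in the sample, each kana character ends up as its own one-character run, just as in the original output.</p>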
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-chinese-unicode-regular-expressions.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Regex Notes</title>
		<link>http://iregex.org/blog/regex-note-20100621.html</link>
		<comments>http://iregex.org/blog/regex-note-20100621.html#comments</comments>
		<pubDate>Mon, 21 Jun 2010 15:04:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[pos]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=128</guid>
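		<description><![CDATA[Three notes, posted here. Case-insensitive first letters: for a while, when I was writing regular expressions to match Drug keywords, I often wrote expressions like /viagra&#124;cialis&#124;anti-ed/. To make them tidier, I sorted the keywords; to... ]]></description>
			<content:encoded><![CDATA[<p>Three notes, posted here.</p>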
<p><span id="more-128"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Case-insensitive first letters</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>For a while, when I was writing regular expressions to match <code class="codecolorer text default"><span class="text">Drug</span></code> keywords, I often wrote expressions like <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">|</span>cialis<span style="color: #339933;">|</span>anti<span style="color: #339933;">-</span>ed<span style="color: #339933;">/</span></span></code>. To make them tidier, I sorted the keywords; to improve speed, I used <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span>Vv<span style="color: #009900;">&#93;</span>iagra<span style="color: #339933;">/</span></span></code> rather than <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>viagra<span style="color: #339933;">/</span>i</span></code>, so that only the necessary part is matched case-insensitively. More precisely, I needed case-insensitive matching on the first letter of each word.</p>
<p>I wrote the following function to do this conversion in bulk.</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#convert regex to sorted list, then provide both lower/upper case for the first letter of each word</span><br />
<span style="color: #666666; font-style: italic;">#luf means lower upper first</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> luf<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; split the regex with the delimiter |</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@arr</span><span style="color: #339933;">=</span><span style="color: #000066;">sort</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/\|/</span><span style="color: #339933;">,</span><span style="color: #000066;">shift</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; provide both the upper and lower case for the &nbsp;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; first letter of each word </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><span style="color: #000066;">s</span><span style="color: #339933;">/</span><span style="color: #0000ff;">\b</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span>a<span style="color: #339933;">-</span>zA<span style="color: #339933;">-</span>Z<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\l</span><span style="color: #0000ff;">$1</span><span style="color: #0000ff;">\u</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">/</span>g<span style="color: #339933;">;</span><span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># &nbsp; join the keyword to a regex again</span><br />
&nbsp; &nbsp; <span style="color: #000066;">join</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'|'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">@arr</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #000066;">print</span> luf <span style="color: #ff0000;">&quot;sex pill|viagra|cialis|anti-ed&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; the output is:[aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra</span></div></td></tr></tbody></table></div>
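<p>For comparison, here is a minimal Python sketch of the same transformation. It is not from the original post: it assumes the input is a plain <code>|</code>-separated keyword list, as in the Perl version, and the helper name <code>both_cases</code> is made up for illustration.</p>

```python
import re

def luf(regex):
    """Sort the |-separated keywords and make each word's first letter
    match in both lower and upper case."""
    words = sorted(regex.split("|"))

    def both_cases(mo):
        # Turn a leading letter such as "v" into the class "[vV]".
        letter = mo.group(1)
        return "[" + letter.lower() + letter.upper() + "]"

    # \b followed by a letter marks the first letter of each word.
    return "|".join(re.sub(r"\b([a-zA-Z])", both_cases, w) for w in words)

print(luf("sex pill|viagra|cialis|anti-ed"))
# [aA]nti-[eE]d|[cC]ialis|[sS]ex [pP]ill|[vV]iagra
```

<p>The replacement is computed by a callback rather than by <code>\l</code> and <code>\u</code> escapes, because Python replacement strings have no case-changing escapes.</p>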
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Controlling where the next global match starts</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>I remember jyf once asked me how to control the position at which matching starts. Well, now I can answer that question. Perl provides the pos function, which adjusts where the next match begins during a <code class="codecolorer perl default"><span class="perl"><span style="color: #339933;">/</span>g</span></code> global match. For example:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>The output is the string taken two letters at a time, i.e. <code class="codecolorer text default"><span class="text">ab, cd, ef</span></code> (the trailing g is left unmatched).</p>
<p>You can use pos($_) to reposition where the next match starts, for example:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$_</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;abcdefg&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/../g</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">pos</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">--;</span> &nbsp;<span style="color: #666666; font-style: italic;">#pos($_)++;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$&amp;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Output:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">pos($_)--: &nbsp;ab, bc, cd, de, ef, fg.<br />
pos($_)++: &nbsp;ab, de.</div></td></tr></tbody></table></div>
<p>See the <a href="http://perldoc.perl.org/functions/pos.html" title="我爱正则表达式" target="_blank">pos</a> section of the Perl documentation for details.</p>
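<p>Python has no direct counterpart to <code>pos</code>, but a compiled pattern's <code>search()</code> method accepts a start position, so a similar effect can be sketched by hand. The function name <code>pairs</code> and its <code>step</code> parameter below are hypothetical; <code>step</code> plays the role of <code>pos($_)--</code> (-1) or <code>pos($_)++</code> (+1).</p>

```python
import re

def pairs(text, step=0):
    """Collect two-character matches, shifting the resume point by `step`
    after every match."""
    pattern = re.compile("..")
    found, pos = [], 0
    while True:
        mo = pattern.search(text, pos)   # start searching at offset pos
        if mo is None:
            break
        found.append(mo.group())
        pos = mo.end() + step            # adjust where the next match starts
    return found

print(pairs("abcdefg"))       # ['ab', 'cd', 'ef']
print(pairs("abcdefg", -1))   # ['ab', 'bc', 'cd', 'de', 'ef', 'fg']
print(pairs("abcdefg", +1))   # ['ab', 'de']
```

<p>With this two-character pattern, a step of -2 or lower would stop the position from advancing and loop forever, so this sketch is only safe for small adjustments.</p>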
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Hashes and regex substitution</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>
Chapter 3 of effective-perl-2e has the following example (see the code below), which escapes special characters.</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">my %ent = ( '&amp;' =&gt; 'amp', '&lt;' =&gt; 'lt', '&gt;' =&gt; 'gt' );<br />
$html =~ s/([&amp;&lt;&gt;])/&amp;$ent{$1};/g;</div></td></tr></tbody></table></div>
<p>This example is very clever. It makes flexible use of the hash data structure: the text to be replaced is the key, and the corresponding replacement is the value. Whenever there is a match it is captured, the captured text is used as the key to look up the value, and that value is used in the replacement, which shows how expressive a high-level language can be.</p>
<p>But can this kind of Perl code be ported to Python? Python also supports regular expressions and hashes (called dictionaries in Python), but it does not seem to allow much fancy logic inside the replacement itself (there is no variable interpolation in the replacement string).</p>
<p>Consulting the Python documentation (run python in a shell, then import re, then help(re)) turns up this:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">sub(pattern, repl, string, count=0)<br />
&nbsp; &nbsp; Return the string obtained by replacing the leftmost<br />
&nbsp; &nbsp; non-overlapping occurrences of the pattern in string by the<br />
&nbsp; &nbsp; replacement repl. &nbsp;repl can be either a string or a callable;<br />
&nbsp; &nbsp; if a string, backslash escapes in it are processed. &nbsp;If it is<br />
&nbsp; &nbsp; a callable, it's passed the match object and must return<br />
&nbsp; &nbsp; a replacement string to be used.</div></td></tr></tbody></table></div>
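<p>So Python, like PHP, does support a callable (a callback function) as the replacement. The function receives a single argument, the match object. With that, the problem becomes simple:</p>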
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">ent=<span style="color: black;">&#123;</span><span style="color: #483d8b;">'&lt;'</span>:<span style="color: #483d8b;">&quot;lt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&gt;'</span>:<span style="color: #483d8b;">&quot;gt&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: #483d8b;">'&amp;'</span>:<span style="color: #483d8b;">&quot;amp&quot;</span>,<br />
&nbsp; &nbsp; <span style="color: black;">&#125;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> rep<span style="color: black;">&#40;</span>mo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&amp;&quot;</span> + ent<span style="color: black;">&#91;</span>mo.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> + <span style="color: #483d8b;">&quot;;&quot;</span><br />
<br />
html=<span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;([&amp;&lt;&gt;])&quot;</span>,rep, html<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
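<p>The key to a replacement callback in Python is that its argument is a match object. Once that is clear, look up the attributes and methods of the match object in the manual and use whichever you need; that is enough to write flexible, efficient regex replacements in Python.</p>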
</blockquote>
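<p>For readers on Python 3, the same idea can be sketched as a small self-contained script. This is a minimal sketch, not the quoted author's code: the names ENTITIES, to_entity, and escape_html are made up for illustration, and the callback adds the surrounding &amp; and ; itself so that the output matches the Perl version.</p>

```python
import re

# Map each special character to its HTML entity name (same table as above).
ENTITIES = {"<": "lt", ">": "gt", "&": "amp"}


def to_entity(match):
    """Return the full entity for the single character captured in group 1."""
    return "&" + ENTITIES[match.group(1)] + ";"


def escape_html(text):
    # re.sub calls to_entity once per match and splices in its return value.
    return re.sub(r"([&<>])", to_entity, text)


print(escape_html("a < b && c > d"))
```

<p>Because re.sub scans the input only once, the &amp; characters it inserts are not escaped a second time. For everyday use, the standard library function html.escape covers these three characters as well (and quotes by default).</p>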
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-note-20100621.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Generating text from a regex: REExtractor</title>
		<link>http://iregex.org/blog/reextractor.html</link>
		<comments>http://iregex.org/blog/reextractor.html#comments</comments>
		<pubDate>Tue, 02 Feb 2010 09:12:35 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[REExtractor]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=76</guid>
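		<description><![CDATA[I came across a simple, fun regular-expression application: REExtractor. You give it a regular expression and it outputs the text that the expression describes. The author introduces it as "Generate all possibilities of Regular Expression", that is, generating every string the regular expression can match... ]]></description>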
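			<content:encoded><![CDATA[<p>I came across a simple, fun regular-expression application: <a id="f-4f" href="http://re2form.appspot.com/" title="我爱正则表达式|由正则式反推文本">REExtractor</a>. You give it a regular expression and it outputs the text that the expression describes. The author introduces it as
"Generate all possibilities of Regular Expression", that is, generating every string the regular expression can match. That is possible in theory, but in practice there are limits.<span id="more-76"></span></p>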
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some limitations</h3>
<ol>
<li>The platform is GAE and the language is Python, so it uses Python's regex flavor. You may need a proxy to reach it.</li>
<li>Supported metacharacters and shorthands: <font class="Apple-style-span" color="#FF00FF">(), [],{m,n},{n},|,\w,\d</font>. To use any of these characters literally, escape it with a backslash. Here \w is equivalent to <font class="Apple-style-span" color="#FF00FF">[a-zA-Z0-9]</font>, one of 62 characters, rather than the usual <font class="Apple-style-span" color="#FF00FF">[_a-zA-Z0-9]</font>, which includes the underscore and covers 63 characters. You can simply write <font class="Apple-style-span" color="#FF00FF">[_\w]</font> instead.</li>
<li>Unsupported metacharacters: <font class="Apple-style-span" color="#FF00FF">.(dot),^,$,\b,\D,\W,\1&#8230; (backreferences), (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font> and so on.
<ul>
<li>A <font class="Apple-style-span" color="#FF00FF">.</font> dot is output literally.</li>
<li><font class="Apple-style-span" color="#FF00FF">^, $, \b, \1, (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font> are ignored by the program.</li>
<li><font class="Apple-style-span" color="#FF00FF">\D, \W, or [^]</font> makes the program report an error, because the range they cover is too broad.</li>
</ul>
</li>
<li>Regular expressions with more than 1000 possible results are not supported. For example, <font class="Apple-style-span" color="#FF00FF">\w{2}</font> is rejected because it has 62×62 = 3844 possibilities, but \w\d works because it has 62×10 = 620.</li>
</ol>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">What it can do</h3>
<p><a href="http://iregex.org/blog/REExtractor.html" target="_blank" title="我爱正则表达式|由正则式反推文本"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100202170343.jpg" border="0" alt="我爱正则表达式|由正则式反推文本"></a><br />
Limited as it is, you can still use it for some fun things. Here are two examples.</p>
<ul>
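<li>Generate some simple e-mail addresses. Try this regex: <font class="Apple-style-span" color="#FF00FF">[abc]{3}\d@1(26|63).com</font>. It generates 540 addresses.</li>
<li>Generate some names. Try this regex: <font class="Apple-style-span" color="#FF00FF">张[小大勇赞强战海][虎猫龙彪平]</font>. It generates 35 names. Yes, it supports Chinese, and each Chinese character is treated as a single character. If you are expecting a baby, you can line up some candidate characters, see which combinations look and sound good, and pick one.</li>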
</ul>
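<p>To be fair, all of these little tasks could be programmed directly, with fewer limits and more flexibility and power. But is it worth firing up a compiler every time? This little program is fun to try. Besides, the limitations listed in the previous section are quite reasonable: generating text from a regex has little use for most zero-width assertions (though backreferences such as <font class="Apple-style-span" color="#FF00FF">\1</font> would be handy, and they are not supported). Just treat it as a small toy.</p>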
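<p>The counts quoted for the two examples can be checked with a few lines of Python. The sketch below does not parse regular expressions and says nothing about how REExtractor itself is implemented; each pattern is written out by hand as one list of alternatives per position (the helper name expand is made up), and itertools.product enumerates the combinations.</p>

```python
from itertools import product


def expand(parts):
    """Yield every string formed by picking one alternative from each part."""
    for combo in product(*parts):
        yield "".join(combo)


# [abc]{3}\d@1(26|63).com, written out position by position.
# The dot is kept as a literal, as the tool is described as doing.
email_parts = ["abc"] * 3 + ["0123456789", ["@1"], ["26", "63"], [".com"]]
emails = list(expand(email_parts))

# 张[小大勇赞强战海][虎猫龙彪平]: a str is iterated one character at a time.
name_parts = [["张"], "小大勇赞强战海", "虎猫龙彪平"]
names = list(expand(name_parts))

print(len(emails), emails[0])
print(len(names), names[0])
```

<p>The script reports 540 addresses and 35 names, matching the figures above: 3×3×3×10×2 = 540 and 7×5 = 35.</p>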
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/reextractor.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>An efficiency question</title>
		<link>http://iregex.org/blog/20091022-efficiency.html</link>
		<comments>http://iregex.org/blog/20091022-efficiency.html#comments</comments>
		<pubDate>Thu, 22 Oct 2009 12:13:46 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=70</guid>
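		<description><![CDATA[Last week I posted "Two regular expression problems related to password validation". Today I was reading about Python regular expressions and, on a whim, wanted to see which of those regexes is the most efficient. The code and the results are below. Why do the results come out this way?... ]]></description>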
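			<content:encoded><![CDATA[<p>Last week I posted "<a href="http://iregex.org/blog/2-regex-problems-about-password-verification.html" target="_blank" title="我爱正则表达式|效率问题">Two regular expression problems related to password validation</a>". Today I was reading about Python regular expressions and, on a whim, wanted to see which of those regexes is the most efficient. The code and the results are below. Why do the results come out this way?</p>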
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">time</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">fpformat</span><br />
Regex1 = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;^(?=[0-9a-zA-Z@%&amp;]*<span style="color: #000099; font-weight: bold;">\d</span>)(?=[0-9a-zA-Z@%&amp;]*[a-zA-Z])(?=[0-9a-zA-Z@%&amp;]*[@%&amp;])[0-9a-zA-Z@%&amp;]{8,}$&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">MULTILINE</span><span style="color: black;">&#41;</span><br />
Regex2 = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;^(?=.*<span style="color: #000099; font-weight: bold;">\d</span>)(?=.*[a-zA-Z])(?=.*[@%&amp;])[0-9a-zA-Z@%&amp;]{8,}$&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">MULTILINE</span><span style="color: black;">&#41;</span><br />
Regex3 = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;^(?![0-9a-z]{8,16}$|[@%&amp;]{8,16}$)[a-z0-9@%&amp;]{8,16}$&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">IGNORECASE</span> | <span style="color: #dc143c;">re</span>.<span style="color: black;">MULTILINE</span><span style="color: black;">&#41;</span><br />
<br />
<br />
<br />
TimesToDo = <span style="color: #ff4500;">1250</span><br />
TestString = <span style="color: #483d8b;">&quot;&quot;</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">800</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; TestString += <span style="color: #483d8b;">&quot;aba134babdedfg@&amp;%<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><br />
<br />
StartTime = <span style="color: #dc143c;">time</span>.<span style="color: #dc143c;">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>TimesToDo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp;Regex1.<span style="color: black;">search</span><span style="color: black;">&#40;</span>TestString<span style="color: black;">&#41;</span><br />
Seconds = <span style="color: #dc143c;">time</span>.<span style="color: #dc143c;">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> - StartTime<br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;R1 takes &quot;</span> + <span style="color: #dc143c;">fpformat</span>.<span style="color: black;">fix</span><span style="color: black;">&#40;</span>Seconds,<span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span> + <span style="color: #483d8b;">&quot; seconds&quot;</span><br />
<br />
StartTime = <span style="color: #dc143c;">time</span>.<span style="color: #dc143c;">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>TimesToDo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp;Regex2.<span style="color: black;">search</span><span style="color: black;">&#40;</span>TestString<span style="color: black;">&#41;</span><br />
Seconds = <span style="color: #dc143c;">time</span>.<span style="color: #dc143c;">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> - StartTime<br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;R2 takes &quot;</span> + <span style="color: #dc143c;">fpformat</span>.<span style="color: black;">fix</span><span style="color: black;">&#40;</span>Seconds,<span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span> + <span style="color: #483d8b;">&quot; seconds&quot;</span><br />
<br />
StartTime = <span style="color: #dc143c;">time</span>.<span style="color: #dc143c;">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>TimesToDo<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp;Regex3.<span style="color: black;">search</span><span style="color: black;">&#40;</span>TestString<span style="color: black;">&#41;</span><br />
Seconds = <span style="color: #dc143c;">time</span>.<span style="color: #dc143c;">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> - StartTime<br />
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;R3 takes &quot;</span> + <span style="color: #dc143c;">fpformat</span>.<span style="color: black;">fix</span><span style="color: black;">&#40;</span>Seconds,<span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span> + <span style="color: #483d8b;">&quot; seconds&quot;</span></div></td></tr></tbody></table></div>
<p>Results:<br />
<a href="http://iregex.org/blog/20091022-efficiency.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20091022.png" border="0" alt="我爱正则表达式|效率问题"></a> </p>
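<p>For anyone who wants to rerun the comparison on Python 3, here is a hedged restatement of the script above. It is a sketch rather than the original code: it swaps time.time and the fpformat module (which no longer exists in Python 3) for timeit and an f-string, and the names PATTERNS, LINE, TEST_STRING, and benchmark are made up. It also records whether each regex matches at all, because that turns out to matter: the test line is 17 characters long, so the {8,16} upper bound in the third regex can never be satisfied, and search() has to try every one of the 800 lines before giving up, while the first two regexes succeed on the very first line. That difference alone may account for much of any gap in the timings.</p>

```python
import re
import timeit

PATTERNS = {
    "R1": re.compile(
        r"^(?=[0-9a-zA-Z@%&]*\d)(?=[0-9a-zA-Z@%&]*[a-zA-Z])"
        r"(?=[0-9a-zA-Z@%&]*[@%&])[0-9a-zA-Z@%&]{8,}$",
        re.MULTILINE,
    ),
    "R2": re.compile(
        r"^(?=.*\d)(?=.*[a-zA-Z])(?=.*[@%&])[0-9a-zA-Z@%&]{8,}$",
        re.MULTILINE,
    ),
    "R3": re.compile(
        r"^(?![0-9a-z]{8,16}$|[@%&]{8,16}$)[a-z0-9@%&]{8,16}$",
        re.IGNORECASE | re.MULTILINE,
    ),
}

LINE = "aba134babdedfg@&%"  # 17 characters, same line as in the script above
TEST_STRING = (LINE + "\n") * 800


def benchmark(repeat=1250):
    """Return {name: (seconds, matched)} for each pattern."""
    results = {}
    for name, regex in PATTERNS.items():
        seconds = timeit.timeit(lambda: regex.search(TEST_STRING), number=repeat)
        matched = regex.search(TEST_STRING) is not None
        results[name] = (seconds, matched)
    return results


if __name__ == "__main__":
    for name, (seconds, matched) in benchmark().items():
        print(f"{name} takes {seconds:.3f} seconds (matched: {matched})")
```

<p>To compare the three regexes on equal terms, one option is to shorten the test line to 16 characters or fewer so that all three can match it.</p>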
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/20091022-efficiency.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Writing a bulk Fanfou message downloader with the new Fanfou API</title>
		<link>http://iregex.org/blog/fanfou-msg-extractor-via-new-api.html</link>
		<comments>http://iregex.org/blog/fanfou-msg-extractor-via-new-api.html#comments</comments>
		<pubDate>Tue, 06 Jan 2009 02:27:50 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=52</guid>
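		<description><![CDATA[I have been working on and off on a Fanfou message grabber. The planned features are downloading and updating Fanfou messages, search, and statistics. Fanfou recently released an official search feature that lets you search your own past messages by keyword. An offline Fanfou message manager... ]]></description>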
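			<content:encoded><![CDATA[<p><img style="display: inline; margin-left: 0px; margin-right: 0px" align="right" src="http://static.fanfou.com/img/fanfou.png"> I have been working on and off on a Fanfou message grabber. The planned features are downloading and updating Fanfou messages, search, and statistics. </p>
<p>Fanfou recently released an official search feature that lets you search your own past messages by keyword, so an offline Fanfou message manager may seem unnecessary. Still, some users like to list their Fanfou messages on their blogs, so my program remains useful. </p>
<p>In the program I wrote earlier, most of the time went into downloading and parsing Fanfou messages. Fortunately, the new Fanfou API serves messages by arbitrary page number, which makes grabbing them much easier, so writing a Fanfou message manager is no longer hard. Using Python as the example, I am writing down my approach for anyone with similar interests.</p>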
<p><span id="more-52"></span></p>
<ol>
<li><strong>Comparing the two export methods: (web scraping|Fanfou API).</strong>
<ol>
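<li><strong>Difficulty</strong>: Scraping web pages is clearly the more complicated route, whether you parse them with regular expressions or as XML. Fanfou now provides a complete API that exports nearly all messages by page number, which makes an export program easier to write than ever.
<li><strong>Reliability</strong>: With hand-written scraping I control every step and every detail, so I consider its results the most reliable. With the API, I have found in practice that some messages are still missed.
<li><strong>Coverage</strong>: Hand-written scraping can grab ordinary messages, picture messages (MMS), "Fanfou share" messages, and so on; it can also grab only shares, only private messages, or only @me messages. The API only allows grabbing ordinary messages. </li>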
</ol>
<li><strong>Downloading Fanfou messages.</strong>
<ol>
<li>
<p><strong>Using curl from the command line. <br /></strong>The official Fanfou API documentation (<a target="_blank" href="http://help.fanfou.com/api.html">old Fanfou API</a>, <a target="_blank" href="http://code.google.com/p/fanfou-api/wiki/ApiDocumentation">new Fanfou API</a>) contains this sentence: </p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>If your system has cURL, you can use these APIs in a very simple way. </p></blockquote>
<p>That sentence is how I discovered curl, which went on to play a major role in my program. cURL is available for Windows and Linux, has bindings for PHP, Python, and Perl, and is a download tool I strongly recommend. I usually download Fanfou messages through the API http://api.fanfou.com/statuses/user_timeline.[json|xml|rss]. Because it accepts id, since_id, and page, the following command is all I need to download my own messages:</p>
<div class="codecolorer-container txt mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;">[The command is missing from the source: the syntax highlighter failed to render this block.]</div>
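<p>It downloads the Fanfou messages of the user with id zhasm, pages 1 through 180, and saves each page as &#8220;page-number.xml&#8221;: page 1 becomes 1.xml, and so on. </p>
<p>After that, cat *.xml &gt;complete.xml merges all the messages into the file complete.xml, ready for the parsing step. </p>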
<li>
<p><strong>Downloading from a program</strong> <br />Python, Perl, and PHP are much the same here. I still prefer to use the curl module. In Python: </p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;">[The function is missing from the source: the syntax highlighter failed to render this block.]</div>
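<p>This Python function takes a user ID, a page number, and other parameters, and downloads the corresponding page of Fanfou messages. Note that it only downloads the complete page; it does not parse it. </p></li>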
</ol>
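<p>As a rough illustration of the download step described above, the following sketch builds the user_timeline URL from a user id, a page number, and an optional since_id, and fetches one page. It is not the author's function, which is missing from the source: the helper names build_timeline_url and download_page are hypothetical, it uses the standard urllib module instead of the curl module the author prefers, and it assumes the endpoint returns UTF-8 XML without authentication, which may no longer hold.</p>

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://api.fanfou.com/statuses/user_timeline.xml"


def build_timeline_url(user_id, page=1, since_id=None):
    """Build the user_timeline URL for one page (hypothetical helper)."""
    params = {"id": user_id, "page": page}
    if since_id:
        params["since_id"] = since_id
    return API_BASE + "?" + urlencode(params)


def download_page(user_id, page=1, since_id=None, timeout=30):
    """Fetch one page and return the raw XML text; nothing is parsed here."""
    url = build_timeline_url(user_id, page, since_id)
    with urlopen(url, timeout=timeout) as response:
        # Assumption: the API responds in UTF-8.
        return response.read().decode("utf-8")


print(build_timeline_url("zhasm", page=2))
```

<p>Keeping the URL construction separate from the network call makes it possible to check the query string without contacting the server.</p>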
<li><strong>Parsing Fanfou messages</strong>
<ol>
<li><strong>Message format <br /></strong>Let us look at the format of a Fanfou message before dissecting it:
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br /></div></td><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;statuses<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;status<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;created_at<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Mon Jan 05 05:56:36 +0000 2009<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/created_at<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>M6pa52Ykb1s<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>[抓饭]由于饭否释出新的API，我用python重写了抓饭工具，共150行（包括注释）。功能：下载、同步、输出饭否消息（不重复下载旧消息；不处理彩信、分享）。命令行版已经写完。GUI太烦琐了。现在网速慢，今晚还要聚会，只好明晚上传程序。<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;source<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>网页<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/source<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;truncated<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/truncated<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;in_reply_to_status_id<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/in_reply_to_status_id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;in_reply_to_user_id<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/in_reply_to_user_id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;favorited<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/favorited<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;in_reply_to_screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/in_reply_to_screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;user<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>zhasm<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>.rex<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>.rex<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;location<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>北京<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/location<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>?【内测ing】好玩、有用的饭否批量处理程序： <br />
<br />
http://code.google.com/p/fanfoufans/?<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;profile_image_url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>http://avatar.fanfou.com/s0/00/57/sg.jpg?1225428475<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/profile_image_url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>http://fanfou.com/zhasm<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;protected<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/protected<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;followers_count<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>229<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/followers_count<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/user<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/status<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; ... <br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/statuses<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
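<li><strong>Parsing as XML</strong> <br />This is relatively simple because XPath is available. For example, the expression //statuses/status/text selects the message text, //statuses/status/created_at selects the time it was sent, and so on.
<li><strong>Regular expressions (Python version)</strong> <br />This is more involved than XPath, but it is still a fairly simple use of regular expressions, because the text to be parsed is extremely "regular". The regex is: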
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">p=<span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; r<span style="color: #483d8b;">&quot;&quot;&quot;&lt;created_at&gt;([^&lt;]+)&lt;/created_at&gt;<span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;id&gt;([^&lt;]+)&lt;/id&gt;<span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;text&gt;(.*?)&lt;/text&gt;<span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;source&gt;([^&lt;]+)&lt;/source&gt;<span style="color: #000099; font-weight: bold;">\s</span>*&quot;&quot;&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">DOTALL</span> | <span style="color: #dc143c;">re</span>.<span style="color: black;">VERBOSE</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p><strong>Notes:</strong> </p>
<ul>
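<li>re.VERBOSE turns on free-spacing mode, which makes it easy to break one long regex across several lines;
<li>re.DOTALL lets the dot &#8220;.&#8221; match any text, including newlines. The text field of a Fanfou message can contain special characters; a regex copes with them, while an XML parser chokes. When I used XPath, I spent a lot of effort handling special characters, whereas in a regex a single dot takes care of it.
<li>The other fields, such as created_at and source, only ever contain a few predictable kinds of characters, so I match and capture them with ([^&lt;]+), which captures everything up to the next &lt;.
<li>Because an unpredictable amount of whitespace (zero or more characters) can appear between a &gt; and the following &lt;, I added \s* to match it. </li>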
</ul>
<p>With the regex written, parsing takes a single call:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">return</span> p.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</li>
</ol>
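<p>Put together, the regex approach can be tried out as a short Python 3 script. This is a sketch under stated assumptions: the pattern is the one from the article, while the names STATUS_RE and parse_statuses and the shortened SAMPLE document (with a made-up two-line text field to exercise re.DOTALL) are invented for illustration.</p>

```python
import re

# The article's pattern: whitespace inside it is ignored thanks to re.VERBOSE.
STATUS_RE = re.compile(
    r"""<created_at>([^<]+)</created_at>\s*
        <id>([^<]+)</id>\s*
        <text>(.*?)</text>\s*
        <source>([^<]+)</source>\s*""",
    re.DOTALL | re.VERBOSE,
)


def parse_statuses(xml_text):
    """Return a list of (created_at, id, text, source) tuples."""
    return STATUS_RE.findall(xml_text)


# Made-up sample in the shape of the API response shown earlier.
SAMPLE = """<statuses>
    <status>
        <created_at>Mon Jan 05 05:56:36 +0000 2009</created_at>
        <id>M6pa52Ykb1s</id>
        <text>hello
world</text>
        <source>web</source>
        <truncated>false</truncated>
    </status>
</statuses>"""

for created_at, status_id, text, source in parse_statuses(SAMPLE):
    print(status_id, created_at, source, repr(text))
```

<p>findall returns one tuple per status, with the four captured groups in order, so each tuple can be passed on to the storage step as it is. Note that the captured text is the raw markup; any XML entities in it are left undecoded.</p>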
<li><strong>Storage</strong>
<ol>
<li><strong>Creating the table</strong> <br />I use the SQLite library to handle the data: store first, output later. The SQLite statement is:
<div class="codecolorer-container sql mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br /></div></td><td><div class="sql codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">cu<span style="color: #66cc66;">.</span>execute<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #ff0000;">&quot;create table if not exists msg( <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; content Text, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid Varchar(12) NOT NULL PRIMARY KEY, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; time Time, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tool Text <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )&quot;</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #66cc66;">&#41;</span></div></td></tr></tbody></table></div>
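<p>It first checks whether the table exists and creates it only if it does not.&nbsp;&nbsp; </p>
<li><strong>Storing: </strong><br />After parsing each page (20 messages), store it and call commit() once; this is convenient and efficient. 
<li><strong>Incremental updates</strong><br />Nobody wants every download to start at message 1 and run all the way to the current message 3333. By the time you have posted message 3344, you only need the newest 11 messages; there is no need to download the earlier 3333 again. For the user this saves download time; for Fanfou's servers it saves load.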
<p>Here is the API parameter that Fanfou newly released for this purpose: since_id</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>* since_id (optional) &#8211; return only messages whose ID is greater than this one. Example: <a href="http://api.fanfou.com/statuses/user_timeline.xml?since_id=6IAZmgy1TzA1">http://api.fanfou.com/statuses/user_timeline.xml?since_id=6IAZmgy1TzA1</a></p></blockquote>
<p>With this parameter, things become easy:</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">curl -<span style="color: #666666; font-style: italic;">#1 http://api.fanfou.com/statuses/user_timeline.xml?id=zhasm&amp;page=[N]&amp;since_id=6IAZmgy1TzA1 (N varies; since_id stays fixed.)</span></div></td></tr></tbody></table></div>
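<p>This way I can keep downloading until I reach the message where the last update stopped. My exit condition is that the download function returns zero messages: at that point the page no longer contains anything new, so the update is done. <br />How do I find the point where the last update stopped? The SQL statement I use is:</p>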
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">select distinct uuid from msg order by time DESC limit 1 <br />
#In the msg table, sort by time and return the uuid of the single newest row.</div></td></tr></tbody></table></div>
<ul>
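<li>If one exists (the table is not empty), I generate a condition of the form &amp;since_id=uuid and append it to the curl download URL.
<li>If none exists (the table was just created), the condition is left empty.&nbsp; </li>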
</ul>
</li>
</ol>
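<p>The storage and incremental-update steps can be sketched with Python's built-in sqlite3 module. The table definition and the lookup query follow the article; everything else is an assumption made for illustration: the helper names open_db, store_page, and since_id_clause, the use of insert or ignore to skip rows that are already stored, the in-memory database, the sample rows, and the choice to store times as sortable strings such as 2009-01-05 13:56:36.</p>

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the msg table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("""create table if not exists msg(
        content Text,
        uuid Varchar(12) NOT NULL PRIMARY KEY,
        time Time,
        tool Text
    )""")
    return conn


def store_page(conn, rows):
    """Insert one page of (time, uuid, content, tool) rows, then commit once."""
    conn.executemany(
        "insert or ignore into msg(time, uuid, content, tool) values (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def since_id_clause(conn):
    """Return '&since_id=<uuid>' for the newest stored row, or '' for an empty table."""
    row = conn.execute("select uuid from msg order by time DESC limit 1").fetchone()
    return "&since_id=" + row[0] if row else ""


conn = open_db()
print(repr(since_id_clause(conn)))  # empty table: no extra condition
store_page(conn, [
    ("2009-01-05 13:56:36", "M6pa52Ykb1s", "first message", "web"),
    ("2009-01-05 19:35:27", "6IAZmgy1TzA1", "second message", "web"),
])
print(since_id_clause(conn))
```

<p>Sorting by the time column only finds the newest message if the stored values sort chronologically, which is why the sketch assumes a year-first string format.</p>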
<li><strong>Details</strong> <br />A few details remain that the programmer has to take care of; you cannot leave these problems to the users of the program.
<ol>
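<li><strong>Time zone conversion</strong> <br />Looking at the text returned by the Fanfou API, the created_at field gives the time in this format:<br />
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Mon Jan 05 11:35:27 +0000 2009 </p></blockquote>
<p>It means Monday, January 5, 2009, 11:35:27, in time zone 0 (UTC). <br />However, the vast majority of Fanfou users are in UTC+8, so both the time format and the time zone need adjusting. The function I wrote is:</p>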
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">def</span> time_from_0_to_8<span style="color: black;">&#40;</span>timestr,timezone=<span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>: <br />
<br />
&nbsp; &nbsp; TIMEFORMAT=<span style="color: #483d8b;">&quot;%a %b %d %X +0000 %Y&quot;</span> <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#Sat Jan 03 23:08:54 +0000 2009 </span><br />
&nbsp; &nbsp; ISOTIMEFORMAT=<span style="color: #483d8b;">'%Y-%m-%d %X'</span> <br />
&nbsp; &nbsp; x=<span style="color: #dc143c;">time</span>.<span style="color: black;">strptime</span><span style="color: black;">&#40;</span>timestr, TIMEFORMAT<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; m=<span style="color: #dc143c;">time</span>.<span style="color: black;">mktime</span><span style="color: black;">&#40;</span>x<span style="color: black;">&#41;</span>+<span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span><span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span>timezone <br />
&nbsp; &nbsp; p=<span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span>ISOTIMEFORMAT,<span style="color: #dc143c;">time</span>.<span style="color: black;">localtime</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> p</div></td></tr></tbody></table></div>
<p>The timezone parameter defaults to 8 (for UTC+8); if you live elsewhere, pass the offset you need instead. </p>
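<p>For comparison, the same conversion can be sketched with the datetime module. This is a Python 3 sketch under stated assumptions, not the author's function: the name to_local_time is hypothetical, and parsing %a and %b assumes an English (or C) locale.</p>

```python
from datetime import datetime, timedelta, timezone

# Layout of created_at, e.g. "Mon Jan 05 11:35:27 +0000 2009"
FANFOU_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_local_time(timestr, offset_hours=8):
    """Convert a created_at string to 'YYYY-MM-DD HH:MM:SS' at the given UTC offset."""
    parsed = datetime.strptime(timestr, FANFOU_FORMAT)  # timezone-aware, +0000
    local = parsed.astimezone(timezone(timedelta(hours=offset_hours)))
    return local.strftime("%Y-%m-%d %H:%M:%S")


print(to_local_time("Mon Jan 05 11:35:27 +0000 2009"))  # 2009-01-05 19:35:27
```

<p>Because %z reads the offset from the string itself, this variant does not depend on the time zone of the machine it runs on.</p>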
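<li><strong>Escape encoding</strong><br />To keep Fanfou messages safe as HTML, many characters are converted to their escaped entities; for example, the less-than sign &lt; is replaced with &amp;lt; so that it cannot be confused with the &lt; that page markup needs. I take advantage of this (rather than converting the entities back myself) and write the messages out as HTML, so the escaped characters show their true form again in the browser. Since Fanfou messages are UTF-8 by default, I naturally also add this to the output page: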
<div class="codecolorer-container html4strict mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="html4strict codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">meta</span> <span style="color: #000066;">http-equiv</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;Content-Type&quot;</span> <span style="color: #000066;">content</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;text/html; charset=utf-8&quot;</span> <span style="color: #66cc66;">/</span>&gt;</span></div></td></tr></tbody></table></div>
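<p>To see what this escaping looks like, here is a small Python 3 illustration using the standard html module. It is only a sketch: that Fanfou escapes exactly these characters is an assumption, and the sample string is made up.</p>

```python
import html

raw = "if a < b & b > c"                  # made-up sample message
escaped = html.escape(raw, quote=False)   # escapes &, < and >

print(escaped)                         # if a &lt; b &amp; b &gt; c
print(html.unescape(escaped) == raw)   # True
```

<p>Writing the escaped text straight into an HTML page, as described above, lets the browser do the unescaping when it renders the page.</p>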
</li>
</ol>
</li>
</ol>
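<p>That wraps up parsing, downloading and output. With Fanfou's powerful API behind you, the barrier to writing Fanfou programs, especially ones built on downloading messages, has never been lower. What applications fellow programming enthusiasts come up with is up to each person's ingenuity. My own program is attached below for reference. I will hold off on releasing the compiled command-line build for now; I am currently working on a GUI. </p>
<p>Appendix: the Python program. It depends on several third-party modules; please download and install them yourself.</p>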
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<span style="color: #008000;">reload</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span><span style="color: black;">&#41;</span><br />
<span style="color: #dc143c;">sys</span>.<span style="color: black;">setdefaultencoding</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#ensure the utf8 encoding</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> pysqlite2.<span style="color: black;">dbapi2</span> <span style="color: #ff7700;font-weight:bold;">as</span> sqlite &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#sqlite3 </span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#regular expression to parse msg</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> pycurl &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#downloading engine </span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">StringIO</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#to &nbsp;receive the downloaded text</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">time</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#time zone conversion</span><br />
<br />
<span style="color: #808080; font-style: italic;"># important regex to parse the xml file</span><br />
p=<span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; r<span style="color: #483d8b;">&quot;&quot;&quot;&lt;created_at&gt;([^&lt;]+)&lt;/created_at&gt;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;id&gt;([^&lt;]+)&lt;/id&gt;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;text&gt;(.*?)&lt;/text&gt;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;source&gt;([^&lt;]+)&lt;/source&gt;<span style="color: #000099; font-weight: bold;">\s</span>*&quot;&quot;&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">DOTALL</span> | <span style="color: #dc143c;">re</span>.<span style="color: black;">VERBOSE</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> time_from_0_to_8<span style="color: black;">&#40;</span>timestr,timezone=<span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'convert fanfou +0000 time string to locole chinese time string.<br />
&nbsp; &nbsp; &nbsp;if you live in another timezone, please modify the timezone parameter.<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; TIMEFORMAT=<span style="color: #483d8b;">&quot;%a %b %d %X +0000 %Y&quot;</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#Sat Jan 03 23:08:54 +0000 2009</span><br />
&nbsp; &nbsp; ISOTIMEFORMAT=<span style="color: #483d8b;">'%Y-%m-%d %X'</span><br />
&nbsp; &nbsp; x=<span style="color: #dc143c;">time</span>.<span style="color: black;">strptime</span><span style="color: black;">&#40;</span>timestr, TIMEFORMAT<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; m=<span style="color: #dc143c;">time</span>.<span style="color: black;">mktime</span><span style="color: black;">&#40;</span>x<span style="color: black;">&#41;</span>+<span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span><span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span>timezone<br />
&nbsp; &nbsp; p=<span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span>ISOTIMEFORMAT,<span style="color: #dc143c;">time</span>.<span style="color: black;">localtime</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> p <br />
<br />
<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> download<span style="color: black;">&#40;</span><span style="color: #008000;">id</span>,page=<span style="color: #ff4500;">1</span>,other=<span style="color: #483d8b;">&quot;&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;&quot;<br />
&nbsp; &nbsp; to download user id's message by page number. the default <br />
&nbsp; &nbsp; page is the 1st one. <br />
&nbsp; &nbsp; &quot;&quot;&quot;</span><br />
&nbsp; &nbsp; c = pycurl.<span style="color: black;">Curl</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; url=<span style="color: #483d8b;">&quot;http://api.fanfou.com/statuses/user_timeline.xml?id=%s%s&amp;page=%d&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span><span style="color: #008000;">id</span>,other,page<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; c.<span style="color: black;">setopt</span><span style="color: black;">&#40;</span>pycurl.<span style="color: black;">URL</span>, url<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">setopt</span><span style="color: black;">&#40;</span>pycurl.<span style="color: black;">HTTPHEADER</span>, <span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;Accept:&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; b = <span style="color: #dc143c;">StringIO</span>.<span style="color: #dc143c;">StringIO</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">setopt</span><span style="color: black;">&#40;</span>pycurl.<span style="color: black;">WRITEFUNCTION</span>, b.<span style="color: black;">write</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">perform</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> b.<span style="color: black;">getvalue</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> parsemsg<span style="color: black;">&#40;</span>text,p<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; parse all the messages from the given text, <br />
&nbsp; &nbsp; return the message timestamp, uuid, msg text, and source tool.<br />
&nbsp; &nbsp; the structure of the returned list:<br />
&nbsp; &nbsp; list[(time,id,msg,tool),(time,id,msg,tool)...]<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; p.<span style="color: black;">match</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> p.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> initdb<span style="color: black;">&#40;</span><span style="color: #008000;">id</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; &nbsp; &nbsp; init the database, create if not exists.<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; dbname=<span style="color: #008000;">id</span>+<span style="color: #483d8b;">'.db3'</span><br />
&nbsp; &nbsp; cx=sqlite.<span style="color: black;">connect</span><span style="color: black;">&#40;</span>dbname<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu=cx.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;&quot;create table if not exists msg(<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; content Text,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid Varchar(12) NOT NULL PRIMARY KEY,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; time Time,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tool Text<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )&quot;&quot;&quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> cx<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> latest_uid<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; cu=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'select distinct uuid from msg order by time DESC limit 1'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; rs=cu.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> rs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> rs<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">''</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> store<span style="color: black;">&#40;</span><span style="color: #008000;">list</span>,db<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; list[(time,id,msg,tool),(time,id,msg,tool)...]<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; cu=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; index=<span style="color: #ff4500;">0</span> <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> item <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">list</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">time</span>=time_from_0_to_8<span style="color: black;">&#40;</span>item<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">id</span>=item<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; msg=item<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; tool=item<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'insert into msg values(?,?,?,?)'</span>, <span style="color: black;">&#40;</span>msg.<span style="color: black;">decode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf-8'</span><span style="color: black;">&#41;</span>,<span style="color: #008000;">id</span>,<span style="color: #dc143c;">time</span>,tool.<span style="color: black;">decode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf-8'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index+=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">except</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'insert error'</span> <br />
&nbsp; &nbsp; db.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%d messages parsed&quot;</span> <span style="color: #66cc66;">%</span> index<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> printmsg<span style="color: black;">&#40;</span>db,index,sep=<span style="color: #483d8b;">&quot;　&quot;</span><span style="color: black;">&#41;</span>: <br />
&nbsp; &nbsp; cu=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'select content, time from msg where 1 order by time'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; rs=cu.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; result=<span style="color: #483d8b;">&quot;&quot;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">while</span> rs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; result+=<span style="color: #008000;">str</span><span style="color: black;">&#40;</span>index<span style="color: black;">&#41;</span>+sep+rs<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>+sep+rs<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>+<span style="color: #483d8b;">&quot;&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; rs=cu.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; index+=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> result<br />
&nbsp;<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">&lt;</span><span style="color: #ff4500;">2</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;please start this program with your id&quot;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;for example: ff.exe zhasm, where zhasm is the fanfou id&quot;</span><br />
&nbsp; &nbsp; <span style="color: #dc143c;">sys</span>.<span style="color: black;">exit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">id</span>=<span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
<br />
db=initdb<span style="color: black;">&#40;</span><span style="color: #008000;">id</span><span style="color: black;">&#41;</span><br />
since=latest_uid<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">if</span> since:<br />
&nbsp; &nbsp; condition=<span style="color: #483d8b;">&quot;&amp;since_id=&quot;</span>+since<br />
<span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; condition=<span style="color: #483d8b;">''</span><br />
page=<span style="color: #ff4500;">1</span><br />
<span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'downloading page'</span>,page<br />
&nbsp; &nbsp; msg=download<span style="color: black;">&#40;</span><span style="color: #008000;">id</span>,page,condition<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #008000;">list</span>=parsemsg<span style="color: black;">&#40;</span>msg,p<span style="color: black;">&#41;</span> &nbsp; &nbsp;<br />
&nbsp; &nbsp; store<span style="color: black;">&#40;</span><span style="color: #008000;">list</span>,db<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">list</span><span style="color: black;">&#41;</span>==<span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">break</span><br />
&nbsp; &nbsp; page+=<span style="color: #ff4500;">1</span><br />
filename=<span style="color: #008000;">id</span>+<span style="color: #483d8b;">&quot;.html&quot;</span><br />
<span style="color: #008000;">file</span> = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span>filename,<span style="color: #483d8b;">&quot;w&quot;</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&lt;html&gt;<br />
&lt;head&gt;<br />
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=utf-8&quot; /&gt;<br />
&lt;/head&gt;<br />
&lt;body&gt;'</span><span style="color: #483d8b;">''</span> <span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>printmsg<span style="color: black;">&#40;</span>db,<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; &lt;/body&gt;<br />
&nbsp; &nbsp; &lt;/html&gt;'</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">close</span><span style="color: black;">&#40;</span> <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
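<p>As a side note, the storage step of the program above can also be sketched with the sqlite3 module from the Python 3 standard library. This is an illustration under assumptions rather than a drop-in replacement: the function names init_db, store and latest_uuid and the in-memory default path are hypothetical. It uses ? placeholders so that quotes inside a message cannot break the insert statement, and insert or ignore so that a uuid already in the table is skipped quietly.</p>

```python
import sqlite3


def init_db(path=":memory:"):
    """Open the database and create the msg table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute(
        """create table if not exists msg(
               content Text,
               uuid Varchar(12) NOT NULL PRIMARY KEY,
               time Time,
               tool Text)"""
    )
    return db


def store(rows, db):
    """Store rows shaped like [(time, id, msg, tool), ...]; return how many were new."""
    inserted = 0
    for time_, uuid, msg, tool in rows:
        cur = db.execute(
            "insert or ignore into msg values (?, ?, ?, ?)",
            (msg, uuid, time_, tool),
        )
        inserted += cur.rowcount  # 0 when the uuid already existed
    db.commit()
    return inserted


def latest_uuid(db):
    """Return the uuid of the newest message, or '' for an empty table."""
    row = db.execute(
        "select uuid from msg order by time desc limit 1"
    ).fetchone()
    return row[0] if row else ""
```

<p>The value returned by latest_uuid is what would be appended to the request as since_id on the next run.</p>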
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-msg-extractor-via-new-api.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Parsing Fanfou Messages: From minidom to XPath</title>
		<link>http://iregex.org/blog/fanfou-message-extractor-from-minidom-to-xpath.html</link>
		<comments>http://iregex.org/blog/fanfou-message-extractor-from-minidom-to-xpath.html#comments</comments>
		<pubDate>Tue, 14 Oct 2008 10:00:58 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[firefox]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=35</guid>
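		<description><![CDATA[Tossing out a brick to attract jade: why not XPath, and what is XPath? I recently picked an old side project back up. After I published the previous article on improving it, a reply from "that guy" caught my interest. He asked, "Why not use XPath?" What on earth is XPath? I asked back. Before asking... ]]></description>
			<content:encoded><![CDATA[<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Tossing out a brick to attract jade: why not XPath, and what is XPath?</h2>
<p>I recently picked an old side project back up. After I published the <a href="http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html">previous article</a> on improving it, a reply from "that guy" caught my interest. He asked, "Why not use XPath?"</p>
<p>What on earth is XPath? I asked back. Before asking, of course, I googled it first, so as not to... you know.<br />
<span id="more-35"></span><br />
The first hit was the <a href="http://www.w3c.org/TR/xpath">W3C</a>, which introduces XPath as follows:</p>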
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. </p></blockquote>
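<p>Put plainly, that means:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>XPath is a language for locating parts of an XML document, and it can be called from XSLT or XPointer.</p></blockquote>
<p>I also found an <a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">XPath</a> tutorial, here. I skimmed it without paying much attention at the time.</p>
<p>Even so, Python's minidom module can do the same job. Why insist on XPath? Especially since it needs an extra installation in Python, whereas minidom runs anywhere Python does.</p>
<p>I talked it over with that guy again, and his verdict was still "strongly recommended". He also suggested that I read the <a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">tutorial</a> closely and use the <a href="https://addons.mozilla.org/zh-CN/firefox/addon/1095">XPath Checker</a> add-on in Firefox.</p>
<p>So I did.</p>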
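<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">A freshly honed blade: its sharpness speaks for itself</h2>
<p>One try of XPath Checker and I was stunned. Select some text on a page, choose &#8220;View XPath&#8221; from the context menu, and the XPath of that node is shown at once: clearly layered and precisely located. I just did not understand the syntax yet. So I read the tutorial closely, learning as I went; half an hour later I could already apply it to the Fanfou message scraper I had written earlier. Writing the code was still a bit laborious, but the approach was clear, and I no longer got bogged down in details.</p>
<p>That guy also pointed out that ordinary HTML documents are not well-formed XML, so it is best to normalize them before parsing with XPath.</p>
<p>I had noticed this problem too. The useful content in a Fanfou HTML page is only a small part of the whole; the rest just slows things down and makes extraction harder.</p>
<p>After some experimenting, I improved the original code as follows:</p>
<p>1. Still use the original minidom module to download and parse the document, taking only the part between &lt;ol&gt; and &lt;/ol&gt;. This part is saved as a string for later use. Keeping only what is needed makes the structure clear and the nesting shallow.</p>
<p>2. Use XPath to parse the string extracted in the previous step.</p>
<p>By now, /, //, @, [], = and so on have each gone from meaningless to helpful, finding a proper place in my toolbox, ready to use whenever needed. I have become a beneficiary of XPath, and only now do I see how fun and useful learning it is.</p>
<p>There is still one small problem that I cannot solve with pure XPath syntax. It goes like this:</p>
<p>XPath can only pull out the content of a node; it cannot grab the node whole, markup and all. For example:</p>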
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">'http://a.com'</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>hello world<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
<p>在view xpath 下，使用/li/a，得到的是</p>
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">'http://a.com'</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>hello world<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
<p>that is, the whole element.</p>
<p>In Python, however, using</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">method=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/li/a)'</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
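<p>returns only hello world, even though /li/a/@href does return &#8217;http://a.com&#8217;.</p>
<p>The XPath string() function returns a node's string value, that is, the concatenated text of its descendants, so everything inside &lt;&gt; is dropped. It looked very odd to me at first.</p>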
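<p>In this situation, when I want the complete message with its markup, I fall back on list.childNodes[index-1].firstChild.toxml()[22:-7] as a workaround. The slice [22:-7] removes the opening &lt;span class="content"&gt; tag (22 characters) and the closing &lt;/span&gt; tag (7 characters). That said, I am rather pleased with the earlier doc = Parse(str(list.toxml())), a small &#8220;invention&#8221; of my own; reusing traditional XML parsing once in the program is defensible. Of course, it would be best to handle all of the above within XPath.</p>
<p>After a few fixes and improvements, the final Fanfou message program looks like this (core code only):</p>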
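<p>The difference between a node's string value and its serialized markup can be reproduced with the standard library alone. The following is a minimal sketch, assuming Python 3 and <tt>xml.etree.ElementTree</tt> rather than the 4Suite <tt>Parse</tt> and <tt>xpath</tt> calls used in this post; the sample document is the one shown above.</p>

```python
import xml.etree.ElementTree as ET

# Same sample document as in the example above.
root = ET.fromstring("<li><a href='http://a.com'>hello world</a></li>")
a = root.find("a")  # the <a> child of the <li> root

# String value: text only, markup dropped (comparable to XPath string()).
text = "".join(a.itertext())

# Attribute lookup, comparable to /li/a/@href.
href = a.get("href")

# Serializing the node keeps the tags and attributes.
markup = ET.tostring(a, encoding="unicode")

print(text)
print(href)
print(markup)
```

<p>Serializing the selected node, rather than asking for its string value, is what the <tt>toxml()</tt> workaround does by hand.</p>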
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> __getMsgByPage__<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>,page<span style="color: black;">&#41;</span>:<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; url=<span style="color: #483d8b;">&quot;http://fanfou.com/&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: #dc143c;">user</span>+<span style="color: #483d8b;">&quot;/p.&quot;</span>+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span>page<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; node = minidom.<span style="color: black;">parse</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">list</span> = node.<span style="color: black;">getElementsByTagName</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ol&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; doc = Parse<span style="color: black;">&#40;</span><span style="color: #008000;">str</span><span style="color: black;">&#40;</span><span style="color: #008000;">list</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; cu=<span style="color: #008000;">self</span>.<span style="color: black;">sql</span>.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">max</span>=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'count(/ol/li)'</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">max</span>=<span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #008000;">max</span><span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">max</span>==<span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">max</span>=<span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #008000;">max</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> index <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>,<span style="color: #008000;">max</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; method=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//span[@class='</span>method<span style="color: #483d8b;">'])'</span><span style="color: #483d8b;">''</span> <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; method=method.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> method==<span style="color: #483d8b;">&quot;彩信&quot;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">time</span>=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//span[@class=&quot;time&quot;]/@title)'</span><span style="color: #483d8b;">''</span>\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//a[@class='</span>photo<span style="color: #483d8b;">']/@href)'</span><span style="color: #483d8b;">''</span>\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span> <span style="color: black;">&#91;</span>-<span style="color: #ff4500;">11</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">time</span>=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//a[@class='</span><span style="color: #dc143c;">time</span><span style="color: #483d8b;">']/@title)'</span><span style="color: #483d8b;">''</span>\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//a[@class='</span><span style="color: #dc143c;">time</span><span style="color: #483d8b;">']/@href)'</span><span style="color: #483d8b;">''</span> <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">11</span>:<span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; content = <span style="color: #008000;">list</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>index-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">firstChild</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">22</span>:-<span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># content, uuid, time, method are now available for further use.</span></div></td></tr></tbody></table></div>
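<p>The key code is only a few lines, replacing the long-winded code I had before. Performance is decent too: downloading my nearly 3,000 Fanfou messages in bulk, a little over 150 pages, took 86 seconds. The Fanfou server was kind enough not to block me along the way.</p>
<p><strong>To sum up</strong>: XPath is well suited to locating content inside XML. It is precise and highly descriptive, a powerful search tool for XML. If you parse XML often, give it a try.</p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Personal reflections</h2>
<p>Going from hand-written regex parsing, to minidom, to XPath may look like a detour, but I learned a lot along the way. Moving from controlling every detail myself (working by hand), to letting a tool do part of the job (hand and brain together), to letting the right tool handle everything (working with the brain) seems a healthy progression. I can say with some pride that, having parsed everything by hand with regexes, I can advance or fall back as needed even when the available tools do not suit me. I know the details of parsing, so the tools (nice wrappers, after all) cannot fool me: however well packaged, regexes are still the engine underneath (I once read the Python source of Python's XML libraries; thank you, open source). Moving from an implementation that merely works (it works!) to an excellent one (the excellent solution) is also a natural step forward. I am not saying that using regexes is crude. I have never said anything of the kind, about regexes or about the people who use them. In fact regexes have always been the flying blade in my toolbox, and I love regular expressions! My point is only that different tools are effective in different settings. Knowing a tool's weaknesses lets you avoid them, but it matters even more to know its strengths so you can play to them. Then you can deploy your tools with ease, and none of them goes unused.</p>
<p>Related links:</p>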
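<p>For readers without 4Suite, the same per-message logic can be approximated with the standard library. This is only a sketch, assuming Python 3 and the limited XPath subset of <tt>xml.etree.ElementTree</tt>; the function name <tt>extract</tt> and the sample data are invented for illustration and merely follow the class names used in the XPath expressions above.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one ordinary message and one MMS message.
SAMPLE = """<ol>
<li><span class="content">text one</span><span class="stamp"><a class="time" title="2008-10-03 12:07" href="/statuses/QD6qHiqUbeE">2008-10-03 12:07</a><span class="method">通过 <a href="http://example.com/api">API</a></span></span></li>
<li><span class="content"><a class="photo" href="http://fanfou.com/photo/8JsezhHM_VU"><img src="x.jpg" alt="x" /></a>photo</span><span class="stamp"><span class="time" title="2008-10-03 11:33">2008-10-03 11:33</span><span class="method">通过<a href="http://example.com/mms">彩信</a></span></span></li>
</ol>"""


def extract(ol_xml):
    """Return one dict per <li>, mirroring the method/time/uuid logic above."""
    messages = []
    for li in ET.fromstring(ol_xml).findall("li"):
        method_node = li.find(".//span[@class='method']")
        # Remove spaces, then drop the leading two characters ("通过").
        method = "".join(method_node.itertext()).replace(" ", "")[2:]
        if method == "彩信":
            # MMS messages keep the time in a <span> and the id in the photo link.
            time_node = li.find(".//span[@class='time']")
            link_node = li.find(".//a[@class='photo']")
        else:
            # Ordinary messages keep both in the <a class="time"> link.
            time_node = link_node = li.find(".//a[@class='time']")
        messages.append({
            "method": method,
            "time": time_node.get("title"),
            "uuid": link_node.get("href")[-11:],  # ids are 11 characters long
        })
    return messages


messages = extract(SAMPLE)
for message in messages:
    print(message)
```

<p>Iterating over the <tt>li</tt> elements directly avoids the separate <tt>count()</tt> query and the index arithmetic, at the cost of relying on ElementTree instead of a full XPath engine.</p>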
<ul>
<li><a href="http://www.w3.org/TR/xpath">The W3C introduction to XPath</a></li>
<li><a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">XPath tutorial</a>, also available in Chinese, well illustrated and easy to follow.</li>
<li><a href="http://4suite.org">4suite</a>, an XPath toolkit for Python</li>
<li><a href="http://search.cpan.org/~samtregar/Class-XPath-1.4/XPath.pm">Perl has XPath too</a>. Untested.</li>
<li><a href="https://addons.mozilla.org/zh-CN/firefox/addon/1095">XPath Checker</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-message-extractor-from-minidom-to-xpath.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Extracting Fanfou messages: regex vs xml</title>
		<link>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html</link>
		<comments>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html#comments</comments>
		<pubDate>Wed, 08 Oct 2008 10:53:59 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=33</guid>
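		<description><![CDATA[In-page navigation: Can the official API alone fetch all Fanfou messages? Fanfou message structure Parsing Fanfou messages with regex Parsing Fanfou messages with XML Comparison Related reading There are many ways to bulk-export Fanfou messages, but the basic idea is always to first save the web... ]]></description>
			<content:encoded><![CDATA[<p>In-page navigation:</p>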
<ul>
<li><a href="#xiaochaqu"><strong>Can the official API alone fetch all Fanfou messages?</strong></a></li>
<li><a href="#饭否消息结构"><strong>Fanfou message structure</strong></a></li>
<li><a href="#regex"><strong>Parsing Fanfou messages with regex</strong></a></li>
<li><a href="#python"><strong>Parsing Fanfou messages with XML</strong></a></li>
<li><a href="#compare"><strong>Comparison</strong></a></li>
<li><a href="#xiangguan"><strong>Related reading</strong></a></li>
</ul>
<p>
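There are many ways to bulk-export Fanfou messages, but the basic idea is always the same: save the pages locally, then extract the useful messages from them. This post does not cover how to download the pages (with Xunlei, wget, curl and so on); it focuses on how to extract the messages from pages that are already saved locally.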
<p><span id="more-33"></span></p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
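<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;"><a href="#xiaochaqu"><strong><span style="color: #ff008c;">Aside: can the official API alone fetch all Fanfou messages?</span></strong></a></h2>
<p>You might ask why I do not simply use Fanfou's own API. Indeed, the API is faster, more convenient and highly compatible. However, the official API only offers the first 20 messages. There is a way to download every message using nothing but the official API, but it is rather evil:</p>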
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>true<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; download <span style="color: #cc66cc;">20</span> messages via API<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; store them<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">delete</span> this <span style="color: #cc66cc;">20</span> messages via API<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</p>
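<p>Downloading and deleting as you go will indeed eventually fetch every message: deleting the first 20 guarantees that the next 20 show up as the newest. In theory this works. But what we need is a lossless copy, like Peter in Heroes, not a brutal cut, like Sylar. Since the official API is limited, we will do it ourselves. Read on.</p>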
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Fanfou message structure</h2>
<p>Open the source of a Fanfou message page, for example my own<br />
<a name="饭否消息结构"></a><a title=" 我爱正则表达式" href="http://fanfou.com/regex" target="_blank">http://fanfou.com/regex/p.1</a> (http://fanfou.com/regex is in fact a shortcut for http://fanfou.com/regex/p.1; the full path is used here to show the general form). The useful messages sit inside the following structure:</p>
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"> <br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;ol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 代码非抄不能懂也。<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-03 12:07&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;/statuses/QD6qHiqUbeE&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-03 12:07<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过 <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://del.icio.us/fanfou/API%E5%BA%94%E7%94%A8&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; API<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 向自由的身心致敬！ - 早嗷嗷也盼~晚安安也盼~望穿安安双眼~~怎知道今日里打土匪进深山自己的队伍来哎到嗷~面安前安呐啊啊啊啊啊~~~ <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://fanfou.com/linkto/aHR0cDovL3d3dy5kb3ViYW4uY29tL2V2ZW50LzEwMjczNDg3Lw&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://www.douban.com/event/10273487/<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-06 14:07&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;/share/bd96z1U-gHw&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-06 14:07<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://help.fanfou.com/share_button.html&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 饭否分享<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;photo&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://fanfou.com/photo/8JsezhHM_VU&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;img</span> <span style="color: #000066;">src</span>=<span style="color: #ff0000;">&quot;http://photo.fanfou.com/m0/00/19/e2_36807.jpg&quot;</span> <span style="color: #000066;">alt</span>=<span style="color: #ff0000;">&quot;caixinceshi - no description&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 上传了新照片：caixinceshi - no description<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-03 11:33&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-03 11:33<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://help.fanfou.com/mobile_mms.html&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 彩信<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; # more <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> entries, at most 20 per page.<br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/ol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
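<p>Tip: in the Fanfou page source, all the messages are on a single line, which is hard to read. Copy the code you need (taking care that opening and closing tags match) into vim, run <tt class="string">:%s/&gt;/&gt;\r/g</tt> (insert a line break after every &gt;), press <tt class="string">ggvG</tt> to select everything, then press <tt class="string">=</tt> to reindent. The code is now neatly indented and easy to read.</p>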
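<p>The line-breaking step of this tip can also be done outside vim. Below is a small sketch in Python 3; the helper name <tt>break_after_tags</tt> and the sample string are made up for illustration, and the reindenting done by <tt>=</tt> is not reproduced.</p>

```python
def break_after_tags(html):
    """Insert a line break after every '>', like :%s/>/>\\r/g in vim."""
    return html.replace(">", ">\n")


one_line = '<ol><li><span class="content">hi</span></li></ol>'
print(break_after_tags(one_line))
```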
</p>
<p><a name="regex"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Parsing Fanfou messages with regex</h2>
<p> </a></p>
<p>Below is the code that parses Fanfou messages with a regex (copied directly from my earlier Perl Fanfou-scraping program).</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$ffmsg</span><span style="color: #339933;">=</span><span style="color: #000066;">qr</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># Under /x, unescaped whitespace in the pattern is ignored, so literal spaces are written as \s+.</span><br />
&nbsp; &nbsp; <span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a\s+href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/(?:statuses|share)/([-_a-zA-Z0-9]{11})&quot;</span>\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;time&quot;</span>\s+title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;([-: 0-9]{16})&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;">&#40;</span>网页<span style="color: #339933;">|</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*&lt;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&gt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&gt;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:&lt;/</span>a<span style="color: #339933;">&gt;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>li<span style="color: #339933;">&gt;</span><br />
<span style="color: #009900;">&#125;</span>xi<span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
</p>
<p>As you can see, a regular expression can mirror the layout of the original page markup quite faithfully. A few details deserve a note:</p>
<ul>
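<li>In the first pair of parentheses I use <tt class="regex">([^&lt;]+?)</tt> to capture the message body. (A complete message consists of the body, the time sent, the message uuid such as QD6qHiqUbeE, the sending method, and the type: picture message or plain text.) I originally used <tt class="regex">.*?</tt>, but that was not precise enough, and two messages were sometimes merged into one match. <tt class="regex">([^&lt;]+?)</tt> instead captures everything from the current position up to, but not including, the next &lt;. You may ask whether a &lt; inside the message body could break this. It cannot: Fanfou converts every &lt;, along with any other character that might interfere with parsing, into its entity form such as &amp;lt;, so the body never contains a literal &lt;. In addition, <strong>a precise regular expression improves efficiency, because a pattern that is not going to match fails as early as possible.</strong></li>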
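<li><tt class="regex">(?:statuses|share)</tt>. This alternation matches the path segment in front of the Fanfou uuid, so the uuid is captured not only for messages posted by ordinary means (web, SMS, mobile, API, IM tools and so on) but also for messages posted with the "Fanfou Share" (饭否分享) tool. I am not very fond of that tool. (Perhaps I will find time some day to write a post about its shortcomings.) I mention "Fanfou Share" messages separately from ordinary ones because the two have different structures.</li>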
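<li>The regex <tt class="regex">(网页|(?:\s*&lt;[^&gt;]+&gt;)[^&lt;]+(?:&lt;/a&gt;))</tt> uses both capturing and non-capturing parentheses (网页 is the literal text "web page" as it appears on the site). Non-capturing groups keep the program from getting too complicated and make it easier to refer to captures by number ($1, $2 and so on; the more of them there are, the more confusing they become, and after the regex is edited they turn into a complete mess). They also save memory: if a program captures too much and does not release it promptly, it may exhaust its resources. After all, we are not capturing just a few dozen bytes; a Fanfou user may have close to a hundred thousand messages. I am thinking of "chatterboxes" like <a href="http://fanfou.com/appleice">苹果流冰</a>.</li>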
<li>The <tt class="regex">xi</tt> flags: <tt class="regex">x</tt> makes the regex ignore whitespace in the pattern and allows comments; <tt class="regex">i</tt> makes matching case-insensitive.</li>
</ul>
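<p>The points above can be tried out in Python as well. The following is a minimal sketch, not the original Perl program: the SAMPLE markup is a simplified, hypothetical stand-in for a Fanfou page, and the names MESSAGE, LOOSE and STRICT are made up for illustration. It shows the <tt class="regex">x</tt> and <tt class="regex">i</tt> flags as re.VERBOSE and re.IGNORECASE, a non-capturing group, and how <tt class="regex">.*?</tt> can merge two messages where <tt class="regex">[^&lt;]+?</tt> cannot.</p>

```python
import re

# Hypothetical, simplified markup standing in for two entries of a Fanfou page.
SAMPLE = (
    '<li><span class="content">first message</span>'
    '<span class="stamp"><a href="/statuses/QD6qHiqUbeE" class="time" '
    'title="2008-06-20 10:15">2 hours ago</a></span></li>'
    '<li><span class="content">second message</span>'
    '<span class="stamp"><a href="/share/abcdefghijk" class="time" '
    'title="2008-06-20 09:00">3 hours ago</a></span></li>'
)

# re.VERBOSE plays the role of Perl's /x and re.IGNORECASE the role of /i.
# Under re.VERBOSE a literal space must be escaped or put in a character class.
MESSAGE = re.compile(
    r"""
    <span\ class="content">
        ([^<]+?)                        # group 1: body, never crosses a "<"
    </span>
    \s*<span\ class="stamp">
    \s*<a\ href="/(?:statuses|share)/   # non-capturing: either path prefix
        ([-_a-zA-Z0-9]{11})"            # group 2: the 11-character uuid
    \ class="time"\ title="
        ([-: 0-9]{16})"                 # group 3: the 16-character timestamp
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Two patterns that look only for "share" entries and differ in the body part.
TAIL = r'</span>\s*<span class="stamp">\s*<a href="/share/'
LOOSE = re.compile(r'class="content">(.*?)' + TAIL)
STRICT = re.compile(r'class="content">([^<]+?)' + TAIL)

for body, uuid, sent in MESSAGE.findall(SAMPLE):
    print(sent, uuid, body)

# LOOSE starts at the first message and runs on into the second one;
# STRICT fails on the first message and matches the second one alone.
print(LOOSE.search(SAMPLE).group(1))
print(STRICT.search(SAMPLE).group(1))
```

<p>Because only three groups capture, the tuple positions stay stable even if more non-capturing alternatives are added to the pattern later.</p>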
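<p>Extracting Fanfou message text with regular expressions involves many details, and a single oversight is enough to make the program misbehave at run time. I will skip the format of Fanfou picture messages; the principle is the same, so this is enough to make the point.</p>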
</p>
<p><a name="python"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Parsing Fanfou messages with XML</h2>
<p></a></p>
<p>Now let's look at parsing Fanfou messages with XML in Python. Note: this program draws on the <a href="http://code.google.com/p/pyfan/" target="_blank">pyfan</a> program by <a href="http://www.happysky.org/" target="_blank"><strong><span style="color: #ff008c;">ppip</span></strong></a>.<br />
 </p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span> <span style="color: #808080; font-style: italic;"># Python 2 standard library</span><br />
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">xml</span>.<span style="color: black;">dom</span> <span style="color: #ff7700;font-weight:bold;">import</span> minidom, Node <span style="color: #808080; font-style: italic;"># bring in the parsing tool: minidom</span><br />
node = minidom.<span style="color: black;">parse</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;http://fanfou.com/zhasm/p.1&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<span style="color: #808080; font-style: italic;"># fetch http://fanfou.com/zhasm/p.1 and parse the whole page into node</span><br />
l = node.<span style="color: black;">getElementsByTagName</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ol&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
<span style="color: #808080; font-style: italic;"># l now holds the list of Fanfou messages</span><br />
number = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>l.<span style="color: black;">childNodes</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;"># assumption: number is not defined in this excerpt; here it is the child count of l</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> c <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, number<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># skip entries that carry a class attribute</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">hasAttribute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;class&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">continue</span><br />
&nbsp; &nbsp; content = <span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># time sent:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>.\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; firstChild.<span style="color: black;">getAttribute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;title&quot;</span><span style="color: black;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># message body:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">22</span>:-<span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># uuid</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>.<span style="color: black;">firstChild</span>.\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; getAttribute<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;href&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">10</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
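<p>In an XML document, matching opening and closing tags are vivid landmarks. XML parsing functions can recognize them easily and extract the content you need.</p>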
<ul>
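<li><strong>childNodes[c].childNodes[0].toxml()[22:-7]</strong>: for each Fanfou message (childNodes[c]), this takes the first child node (childNodes[0]), serializes it, and keeps the text from the 23rd character up to, but not including, the last 7 characters. Which part is that? It is the content shown as dots between each pair of &lt;span class=&quot;content&quot;&gt;&#8230;&lt;/span&gt;: the opening tag is 22 characters long and the closing tag is 7.</li>
<li>The time sent, body and uuid of each message are stored in a tuple.</li>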
</ul>
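<p>The slice can be checked in isolation. The sketch below is written for Python 3 and uses a hypothetical one-message fragment, not a real Fanfou page; body_by_slice and body_by_children are illustrative names. The first function repeats the slicing trick from the listing, the second is an alternative that does not depend on the tag lengths.</p>

```python
from xml.dom import minidom

# Hypothetical one-message fragment; a real page carries much more markup.
SAMPLE = '<li><span class="content">hello <b>fanfou</b></span></li>'

OPEN_TAG = '<span class="content">'  # 22 characters
CLOSE_TAG = '</span>'                # 7 characters


def body_by_slice(span):
    """Serialize the node, then cut off the opening and closing tags."""
    return span.toxml()[len(OPEN_TAG):-len(CLOSE_TAG)]


def body_by_children(span):
    """Serialize only the children, so tag lengths do not matter."""
    return "".join(child.toxml() for child in span.childNodes)


span = minidom.parseString(SAMPLE).getElementsByTagName("span")[0]
print(body_by_slice(span))
print(body_by_children(span))
```

<p>The slicing version breaks as soon as the span gains another attribute, which is one reason to prefer walking the child nodes.</p>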
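<p>Once you have the content, what you cook up with it afterwards is entirely up to you.</p>
<p>It is worth mentioning that while downloading Fanfou messages in bulk, I found Fanfou pages unreachable more than once. I asked 郭万怀 of Fanfou, who replied that, to reduce server load, each IP address is allowed 100 page requests per minute and is blocked automatically beyond that. In my tests the real limit was below 100 pages. A more dependable pace is to sleep(15) after extracting each page. That is rather slow, but there is no way around it. Some people do say they ran my earlier Fanfou grabber script, downloaded several hundred pages in one go and never ran into any trouble. To them I can only say: you have good karma and good luck.</p>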
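<p>A throttled download loop along these lines might look as follows. This is only a sketch under stated assumptions: fetch_pages and its fetch and sleep parameters are hypothetical names, and the caller supplies the function that actually downloads a page. It pauses between consecutive requests rather than after the last one.</p>

```python
import time


def fetch_pages(urls, fetch, delay=15.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` is any callable supplied by the caller (hypothetical here);
    `sleep` can be replaced in tests so that nothing really waits.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Demonstration with a fake fetcher and a recording sleep: no network, no waiting.
waits = []
urls = ["http://fanfou.com/zhasm/p.%d" % n for n in (1, 2, 3)]
pages = fetch_pages(urls, fetch=lambda url: "page for " + url, sleep=waits.append)
print(len(pages), waits)
```

<p>Passing the sleep function in as a parameter keeps the pacing logic testable without a 15-second wait per page.</p>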
</p>
<p><a name="compare"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Comparing the two</h2>
<p></a> </p>
<p>In my opinion, XML parsing compares with regex as follows:</p>
<ul>
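<li><strong>Generality:</strong> XML parsing is general-purpose. It handles not only Fanfou messages but any other well-formed HTML, with only minor code changes. A regular expression is special-purpose and cannot be applied everywhere unchanged. When Fanfou tweaks its layout or framework, tools that parse with regular expressions will probably be the first to break.</li>
<li><strong>Readability:</strong> Some say Perl is a write-only language, and regex even more so. The point is that Perl or regex code flows freely as you write it and runs efficiently, but if it is badly formatted and undocumented, reading it again days or months later feels like deciphering hieroglyphics. Python code that parses with an XML library is neatly formatted, and the library functions have self-explanatory names, so it is more readable. That is the general picture. Still, we can write code (even Perl or regex) as tidily and readably as possible, especially since Perl supports the <tt class="regex">/x</tt> flag.</li>
<li><strong>Efficiency:</strong> A well-written, compiled regex should run faster than XML parsing. On the other hand, XML saves programming time; a regex trades some programming time for what is, in theory, a small gain in speed. Interested readers can write a short program that loops thousands of times and compare the average timings.</li>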
</ul>
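<p>The timing experiment suggested in the last item can be sketched with Python's timeit module. Everything here is an assumption made for illustration: the SAMPLE markup is synthetic, with_regex and with_minidom are made-up names, and the numbers will vary by machine, so no result is claimed.</p>

```python
import re
import timeit
from xml.dom import minidom

# Synthetic list of 50 messages; not real Fanfou markup.
SAMPLE = "<ol>%s</ol>" % "".join(
    '<li><span class="content">message %d</span></li>' % i for i in range(50)
)

CONTENT = re.compile(r'<span class="content">([^<]+)</span>')


def with_regex(text=SAMPLE):
    """Extract every message body with a precompiled regex."""
    return CONTENT.findall(text)


def with_minidom(text=SAMPLE):
    """Extract every message body by parsing the markup with minidom."""
    doc = minidom.parseString(text)
    return [span.firstChild.data for span in doc.getElementsByTagName("span")]


if __name__ == "__main__":
    for func in (with_regex, with_minidom):
        seconds = timeit.timeit(func, number=200)
        print("%s: %.4f s for 200 runs" % (func.__name__, seconds))
```

<p>Both functions return the same list, so the comparison measures only how the extraction is done, not what is extracted.</p>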
<p>
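Writing this, I am reminded of the comparison of two martial arts in Chapter 5 of Jin Yong's The Deer and the Cauldron (《鹿鼎记》, chapter title “金戈运启驱除会，玉匣书留想象间”), which fits rather well:<br />
The "Thousand-Leaf Hands of Great Compassion" (大慈大悲千叶手) has too many moves, and memorizing them is a chore. The "Eight Trigrams Swimming Dragon Palm" (八卦游龙掌) has only eight times eight, sixty-four moves, yet with its endless variations it can hold its own against the Thousand-Leaf Hands. So which art is stronger? Both are first-rate palm techniques, and neither can be called the stronger one. Whoever has the deeper skill and applies it more cleverly wins.
<p>In terms of this post, regex is the Thousand-Leaf Hands of Great Compassion, with too many details to keep in mind; the XML approach is the Eight Trigrams Swimming Dragon Palm with its mere sixty-four moves. Both tools are very useful.</p>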
</p>
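<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">A plea for more official features</h2>
<p>This is a digression. I am just venting here; it has nothing to do with XML or regex. More than once, on Fanfou and on my own blog, I have complained that downloading and parsing pages with the clumsy method above is a last resort. The most convenient solution would be an official bulk-export tool: a single database export query would do what takes us half a day of roundabout effort. Perhaps the Fanfou staff are all busy enhancing and polishing 海内 (Hainei). Fanfou has been left to fend for itself, with no updates for a long time, while jiwai.de, zuosa and others roll out one new feature after another.</p>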
</p>
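<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Going further</h2>
<p>The approach in this post applies equally to twitter. But twitter keeps getting slower, and for a while it did not even seem to support viewing older pages.</p>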
</p>
<p><a name="xiangguan"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Related reading</h2>
<p></a></p>
<ul>
<li><a href="http://iregex.org/blog/fanfou-private-message-format-analysis.html" target="_blank">An analysis of the Fanfou private message format</a></li>
<li><a href="http://zhasm.com/blog/fanfou-msg-grabber-limitation-and-suggestion-on-sharing-msg.html">Limits on bulk downloading of Fanfou messages, and suggestions for the Fanfou Share feature</a></li>
<li><a href="http://zhasm.com/blog/about-my-fanfou-applications.html">A few words about the Fanfou applications I have written</a></li>
<li><a href="http://zhasm.com/blog/comments-on-fanfou.html">Fanfou: is it still up to it? (饭否，尚能饭否？)</a></li>
<li><a href="http://zhasm.com/blog/uuid-in-twitter-and-fanfou.html">uuid in twitter and fanfou</a></li>
<li><a href="http://zhasm.com/blog/fanfou-message-grabber.html">Bulk Fanfou grabber script: pack and export all of your Fanfou messages in one go!</a></li>
<li><a href="http://zhasm.com/blog/fanfou-vs-twitter-base64-vs-tinyurl.html">fanfou vs twitter, base64 vs tinyurl?</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Regular expressions for matching Chinese</title>
		<link>http://iregex.org/blog/regex-to-match-chinese.html</link>
		<comments>http://iregex.org/blog/regex-to-match-chinese.html#comments</comments>
		<pubDate>Mon, 02 Jun 2008 06:23:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=14</guid>
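		<description><![CDATA[Back when I was building a Zhengma (郑码) code table for scim on Linux, I had to deal with matching Chinese in regular expressions. The lesson I took away was that, in UTF-8 encoded text, a regex for a Chinese character should be written like this: &#91;\x80-\xff&#93;&#123;3&#125; Of course, this has nothing to do with the... ]]></description>
			<content:encoded><![CDATA[<p>Back when I was building a Zhengma (郑码) code table for scim on Linux, I had to deal with matching Chinese in regular expressions. The lesson I took away was that, in UTF-8 encoded text, a regex for a Chinese character should be written like this:</p>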
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">\x80</span><span style="color: #339933;">-</span><span style="color: #0000ff;">\xff</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
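<p>Of course, this does not depend on the programming language: it is the same in Perl and Python.</p>
<p>This regex has now come in handy again. A small program I am writing, <a href="http://code.google.com/p/fanfoufans/wiki/MiniBlogsUpdater" target="_blank" title="Type once, update five places! Update your twitter, 海内, 叽歪的, 做啥 and 饭否 microblogs at the same time.">MiniBlogs Updater</a>, needs to count the characters the user has typed. Chinese and English characters are encoded with different byte lengths, so calling Python's len() directly on the UTF-8 string returns its length in bytes, and one Chinese character does not count the same as one English letter. Each Chinese character therefore needs to be treated like a single English letter.</p>
<p>I wrote this statement to handle it:</p>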
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">length=<span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'[<span style="color: #000099; font-weight: bold;">\x</span>80-<span style="color: #000099; font-weight: bold;">\x</span>ff]{3}'</span>,<span style="color: #483d8b;">'a'</span>,msg<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
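<p>It replaces every Chinese character with the English letter a and then counts the characters. (It only counts; the source string is not modified.) This statement works correctly in UTF-8 files on Windows.</p>
<p>Here are two more useful links about regular expressions for matching Chinese:</p>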
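<p>For comparison, here is a sketch of the same count in Python 3, where the statement above needs an explicit byte string. The function names are illustrative. It also shows a limit of the byte pattern: it matches any three consecutive high bytes, so text with two-byte UTF-8 characters (such as é) is not counted per character.</p>

```python
import re

# Same byte-level pattern as the statement above, as a Python 3 bytes regex.
HIGH_BYTES = re.compile(rb"[\x80-\xff]{3}")


def count_via_bytes(msg):
    """Encode to UTF-8, replace each run of three high bytes with b'a', count."""
    data = msg.encode("utf-8")
    return len(HIGH_BYTES.sub(b"a", data))


def count_via_str(msg):
    """In Python 3, len() on a str already counts characters (code points)."""
    return len(msg)


for text in ("正则abc", "caf\u00e9"):
    # Chinese text agrees (5 and 5); the two-byte é makes the byte count differ.
    print(count_via_bytes(text), count_via_str(text))
```

<p>When the input is already a decoded string, counting code points directly avoids the assumption that every non-ASCII character takes exactly three bytes.</p>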
<ul>
<li><a href="http://bbs.chinaunix.net/viewthread.php?tid=975358" target="_blank">A comparison of match results for common Chinese regular expressions</a></li>
<li><a href="http://bbs.chinaunix.net/viewthread.php?tid=907172" target="_blank">[Share] A summary of the encoding ranges of various character sets [updated 2007-03-12]</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/regex-to-match-chinese.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Python regular expression links</title>
		<link>http://iregex.org/blog/python-regular-expression.html</link>
		<comments>http://iregex.org/blog/python-regular-expression.html#comments</comments>
		<pubDate>Mon, 26 May 2008 15:43:24 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[url]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=12</guid>
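		<description><![CDATA[I have recently fallen for Python and cannot praise its triple quotes enough. The UTF-8 string problem that had long plagued me in Perl is neatly solved in Python. I mean the encoding problem when sending private messages in the fanfou application I have been writing. Calling the Fanfou API to send Fanfou... ]]></description>
			<content:encoded><![CDATA[<p>I have recently fallen for Python and cannot praise its triple quotes enough. The UTF-8 string problem that had long plagued me in Perl is neatly solved in Python. I mean the encoding problem when sending private messages in the <a href="http://code.google.com/p/fanfoufans/" target="_blank">fanfou application</a> I have been writing. Sending ordinary messages through the Fanfou API is no problem, because it accepts both UTF-8 and GB2312; private messages, however, may only be sent in UTF-8. The most telling test is to send the two characters "<font color="#ff0084">联通</font>".</p>
<p>Enough chatter; on to the topic: regular expressions in Python. I recommend two links:</p>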
<ol>
<li><a target="_blank" href="http://wiki.ubuntu.org.cn/Python%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%93%8D%E4%BD%9C%E6%8C%87%E5%8D%97">Python正则表达式操作指南</a> (a guide to regular expression operations in Python), written by A.M. Kuchling (amk@amk.ca), translated by FireHare, and published on <a target="_blank" href="http://forum.ubuntu.org.cn">ubuntu.org.cn</a>. It works well as a reference manual.</li>
<li><a target="_blank" href="http://www.woodpecker.org.cn/diveintopython/regular_expressions/index.html">The regular expressions chapter of Dive into python, online</a>, published on <a target="_blank" href="http://www.woodpecker.org.cn"><span class="trans">啄木鸟</span></a>. It explains things in an accessible way, starting from examples, and is well suited to self-study.</li>
</ol>
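<p>In the book Mastering Regular Expressions, Perl and PHP each get an entire chapter, but Python has no chapter of its own. Fortunately, once you know the basics of regular expressions, all that is left is looking up the syntax, and the first link above is enough for that.</p>
<p>Let me also quote a line from Mastering Regular Expressions:</p>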
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>I <b><i><span class="docEmphasis">thought</span> </i></b>I knew regular expressions until I read  <i><span class="docEmphasis">Mastering Regular Expressions. <b>Now</b></span></i><b> </b>I do.</p></blockquote>
<p>I long to reach the level of <strong>mastering</strong> regular expressions.</p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/python-regular-expression.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
