<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 &#187; fanfou</title>
	<atom:link href="http://iregex.org/blog/tag/fanfou/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Sun, 27 Jun 2010 04:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>	
	<item>
		<title>Writing a Program to Batch-Download Fanfou Messages with the New Fanfou API</title>
		<link>http://iregex.org/blog/fanfou-msg-extractor-via-new-api.html</link>
		<comments>http://iregex.org/blog/fanfou-msg-extractor-via-new-api.html#comments</comments>
		<pubDate>Tue, 06 Jan 2009 02:27:50 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=52</guid>
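		<description><![CDATA[I have been working, on and off, on a Fanfou-grabbing program. The planned features are downloading and updating Fanfou messages, search, and statistics. Fanfou recently released an official search feature that lets you search your own past messages by keyword. Building an offline Fanfou message management tool... ]]></description>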
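			<content:encoded><![CDATA[<p><img style="display: inline; margin-left: 0px; margin-right: 0px" align="right" src="http://static.fanfou.com/img/fanfou.png"> I have been working, on and off, on a Fanfou-grabbing program. The planned features are downloading and updating Fanfou messages, search, and statistics. </p>
<p>Fanfou recently released an official search feature that lets you search your own past messages by keyword, so building an offline Fanfou message management tool may seem unnecessary. Still, some users like to list their Fanfou messages on their blogs, so my program remains useful. </p>
<p>My earlier program spent most of its time downloading and parsing Fanfou messages. Fortunately, the new Fanfou API serves messages by arbitrary page number, which makes fetching them much simpler, so writing a Fanfou message management tool is no longer difficult. Using Python as the example, I describe my approach here for anyone with similar interests.</p>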
<p><span id="more-52"></span></p>
<ol>
<li><strong>Two export methods compared: (page scraping|Fanfou API).</strong>
<ol>
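<li><strong>Difficulty</strong>: Scraping web pages is clearly the more complicated route, whether you parse them with regular expressions or as XML. Fanfou now provides a complete API that can export nearly all of your messages page by page, which brings the difficulty of writing an export program to a new low.
<li><strong>Reliability</strong>: I feel that hand-written page scraping lets me control every step and every detail, so its results are the most reliable. With the API, I have found in practice that some messages are still missed.
<li><strong>Coverage</strong>: Hand-written page scraping can fetch ordinary Fanfou messages, picture messages (MMS), "Fanfou Share" messages, and so on; it can also fetch only shares, only private messages, or only @me messages. The API only allows fetching ordinary Fanfou messages. </li>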
</ol>
<li><strong>Downloading Fanfou messages.</strong>
<ol>
<li>
<p><strong>Using curl from the command line. <br /></strong>The official Fanfou API documentation (<a target="_blank" href="http://help.fanfou.com/api.html">old Fanfou API</a>, <a target="_blank" href="http://code.google.com/p/fanfou-api/wiki/ApiDocumentation">new Fanfou API</a>) contains this sentence: </p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>If cURL is installed on your system, you can use these APIs in a very simple way. </p></blockquote>
<p>It was this sentence that led me to curl, which has since played a huge role in my programs. cURL is available for both Windows and Linux, has bindings for PHP, Python, and Perl, and is a download tool I strongly recommend. I usually download Fanfou messages through the API endpoint http://api.fanfou.com/statuses/user_timeline.[json|xml|rss]. Because it supports the id, since_id, and page parameters, a single curl command is enough to download my own messages.</p>
<p>The command I use downloads the messages of the user with id zhasm, pages 1 through 180, and saves each page as &#8220;page-number.xml&#8221;: page 1 becomes 1.xml, and so on. </p>
<p>Afterwards, running cat *.xml &gt;complete.xml merges all the Fanfou messages into the file complete.xml, ready for the parsing step. </p>
<li>
<p><strong>Downloading from a program</strong> <br />Python, Perl, and PHP differ little here. I still prefer to use the curl module. In Python, this is the download() function in the program attached at the end of this post. </p>
<p>This Python function takes a Fanfou user ID, a page number, and any other parameters, and downloads that page of Fanfou messages. Note that it only downloads the complete page; it does not parse it. </p></li>
</ol>
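<p>As a rough sketch of the page-by-page download described above, the following Python 3 snippet builds one request URL and one output file name per page, following the "page-number.xml" naming. The function name page_jobs and the use of urllib.parse are assumptions made for illustration; the original program uses pycurl, and nothing here contacts the network.</p>

```python
from urllib.parse import urlencode

# Endpoint taken from the post; everything else in this sketch is hypothetical.
API_URL = "http://api.fanfou.com/statuses/user_timeline.xml"


def page_jobs(user_id, first_page, last_page, extra=None):
    """Return (url, filename) pairs, one per page in the inclusive range."""
    jobs = []
    for page in range(first_page, last_page + 1):
        params = {"id": user_id, "page": page}
        if extra:
            # Optional extra query parameters, e.g. {"since_id": "..."}.
            params.update(extra)
        url = API_URL + "?" + urlencode(params)
        jobs.append((url, "%d.xml" % page))
    return jobs


if __name__ == "__main__":
    # Print the first three download jobs for the user id used in the post.
    for url, filename in page_jobs("zhasm", 1, 3):
        print(filename, url)
```

<p>Each pair can then be handed to whichever download routine you prefer; keeping URL construction separate from the download itself makes it easy to add since_id later.</p>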
<li><strong>Parsing Fanfou messages</strong>
<ol>
<li><strong>Message format <br /></strong>Let us look at the format of a Fanfou message before "dissecting" it:
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;statuses<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;status<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;created_at<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Mon Jan 05 05:56:36 +0000 2009<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/created_at<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>M6pa52Ykb1s<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>[抓饭]由于饭否释出新的API，我用python重写了抓饭工具，共150行（包括注释）。功能：下载、同步、输出饭否消息（不重复下载旧消息；不处理彩信、分享）。命令行版已经写完。GUI太烦琐了。现在网速慢，今晚还要聚会，只好明晚上传程序。<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/text<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;source<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>网页<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/source<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;truncated<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/truncated<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;in_reply_to_status_id<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/in_reply_to_status_id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;in_reply_to_user_id<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/in_reply_to_user_id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;favorited<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/favorited<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;in_reply_to_screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/in_reply_to_screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;user<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>zhasm<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/id<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>.rex<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>.rex<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/screen_name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;location<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>北京<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/location<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>?【内测ing】好玩、有用的饭否批量处理程序： <br />
<br />
http://code.google.com/p/fanfoufans/?<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/description<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;profile_image_url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>http://avatar.fanfou.com/s0/00/57/sg.jpg?1225428475<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/profile_image_url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>http://fanfou.com/zhasm<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/url<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;protected<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>false<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/protected<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;followers_count<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>229<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/followers_count<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/user<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/status<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> <br />
&nbsp; &nbsp; ... <br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/statuses<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
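<li><strong>Parsing as XML</strong> <br />This is relatively simple, because XPath is available. For example, the expression //statuses/status/text locates the message text, //statuses/status/created_at locates the posting time, and so on.
<li><strong>Regular expressions (Python version)</strong> <br />This is more involved than XPath, but it is still a fairly simple use of regular expressions, because the text to be parsed is extremely "regular". The regular expression is: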
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">p=<span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; r<span style="color: #483d8b;">&quot;&quot;&quot;&lt;created_at&gt;([^&lt;]+)&lt;/created_at&gt;<span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;id&gt;([^&lt;]+)&lt;/id&gt;<span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;text&gt;(.*?)&lt;/text&gt;<span style="color: #000099; font-weight: bold;">\s</span>* <br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;source&gt;([^&lt;]+)&lt;/source&gt;<span style="color: #000099; font-weight: bold;">\s</span>*&quot;&quot;&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">DOTALL</span> | <span style="color: #dc143c;">re</span>.<span style="color: black;">VERBOSE</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
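<p><strong>Notes:</strong> </p>
<ul>
<li>re.VERBOSE enables free-spacing mode, which makes it easy to wrap a long regular expression across several lines;
<li>re.DOTALL lets the dot &#8220;.&#8221; match any character, including newlines. Fanfou's text field can contain special characters; the regular expression copes with them, while an XML parser fails on them. When I parsed with XPath in the past, handling those special characters took a lot of effort, whereas in the regular expression a single dot takes care of it.
<li>The other fields, such as created_at and source, only ever contain a small, predictable set of characters, so I match and capture them with ([^&lt;]+), which captures all text up to the next &lt;.
<li>Because an indeterminate amount of whitespace (zero or more characters) can appear between &gt; and &lt;, I added \s* to match it. </li>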
</ul>
<p>With the regular expression written, parsing takes only two lines:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">p.<span style="color: black;">match</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">return</span> p.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
</li>
</ol>
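<p>The XML route mentioned above can be sketched with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is only an illustration under assumptions: the sample document is built in code with made-up message text, and the helper names build_sample and extract_messages are hypothetical, not part of the original program.</p>

```python
import xml.etree.ElementTree as ET


def build_sample():
    """Build a tiny statuses document shaped like the API output shown in the post."""
    statuses = ET.Element("statuses")
    status = ET.SubElement(statuses, "status")
    ET.SubElement(status, "created_at").text = "Mon Jan 05 05:56:36 +0000 2009"
    ET.SubElement(status, "id").text = "M6pa52Ykb1s"
    ET.SubElement(status, "text").text = "hello fanfou"  # made-up message text
    ET.SubElement(status, "source").text = "web"
    return ET.tostring(statuses, encoding="unicode")


def extract_messages(xml_text):
    """Return (created_at, id, text, source) tuples, one per status element."""
    root = ET.fromstring(xml_text)  # root is the statuses element
    rows = []
    # Relative to the root, the post's //statuses/status path is just "status".
    for status in root.findall("status"):
        rows.append((
            status.findtext("created_at"),
            status.findtext("id"),
            status.findtext("text"),
            status.findtext("source"),
        ))
    return rows


if __name__ == "__main__":
    for row in extract_messages(build_sample()):
        print(row)
```

<p>The tuples come out in the same field order as the groups captured by the regular expression, so either parser could feed the same storage step.</p>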
<li><strong>Storage</strong>
<ol>
<li><strong>Creating the table</strong> <br />I use the SQLite library to handle the data: store first, then output. The SQLite statement is:
<div class="codecolorer-container sql mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="sql codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">cu<span style="color: #66cc66;">.</span>execute<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #ff0000;">&quot;create table if not exists msg( <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; content Text, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid Varchar(12) NOT NULL PRIMARY KEY, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; time Time, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tool Text <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )&quot;</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #66cc66;">&#41;</span></div></td></tr></tbody></table></div>
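<p>The statement first checks whether the table exists and creates it only if it does not.&nbsp;&nbsp; </p>
<li><strong>Storing: </strong><br />After each page (20 messages) is parsed, I store it and call commit() once, which is convenient and efficient. 
<li><strong>Incremental updates</strong><br />Nobody wants every download to start at message 1 and run all the way to the current message 3333; once you have posted up to message 3344, you only need the newest 11 messages, and there is no reason to download the previous 3333 again. This saves download time for the user and load for Fanfou's servers.
<p>Here is the API parameter that Fanfou has newly released for this purpose: since_id&nbsp;</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>* since_id (optional) &#8211; return only messages whose ID is greater than this ID. Example: <a href="http://api.fanfou.com/statuses/user_timeline.xml?since_id=6IAZmgy1TzA1">http://api.fanfou.com/statuses/user_timeline.xml?since_id=6IAZmgy1TzA1</a></p></blockquote>
<p>With this parameter, the job becomes easy:</p>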
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">curl -<span style="color: #666666; font-style: italic;">#1 http://api.fanfou.com/statuses/user_timeline.xml?id=zhasm&amp;page=[N]&amp;since_id=6IAZmgy1TzA1 (N varies; since_id stays the same.)</span></div></td></tr></tbody></table></div>
<p>This way, the download can continue page by page until it reaches the message where the last update stopped. My exit condition is that the download function returns zero messages: at that point the page contains no new messages, and the update is considered finished. <br />How do I find the point where the last update stopped? The SQL statement I use is:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">select distinct uuid from msg order by time DESC limit 1 <br />
#In the msg table, order by time and return the uuid of the single newest row.</div></td></tr></tbody></table></div>
<ul>
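<li>If it exists (the table is not empty), I generate a condition of the form &amp;since_id=uuid and append it to the curl download parameters.
<li>If it does not exist (a newly created table), the condition is left empty.&nbsp; </li>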
</ul>
</li>
</ol>
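<p>The storage and incremental-update steps above can be sketched with Python's standard sqlite3 module. This is a minimal illustration, not the original program: the helper names store_page and since_params are hypothetical, the rows are made up, and the use of "insert or ignore" to skip messages that are already stored is my own assumption.</p>

```python
import sqlite3

# Same table layout as the create-table statement in the post.
SCHEMA = """create table if not exists msg(
    content Text,
    uuid Varchar(12) NOT NULL PRIMARY KEY,
    time Time,
    tool Text
)"""


def store_page(conn, rows):
    """Insert one parsed page of (content, uuid, time, tool) rows, then commit once."""
    # "insert or ignore" (an assumption) skips rows whose uuid is already stored.
    conn.executemany("insert or ignore into msg values (?, ?, ?, ?)", rows)
    conn.commit()


def since_params(conn):
    """Return {'since_id': uuid} for the newest stored row, or {} for an empty table."""
    row = conn.execute(
        "select uuid from msg order by time DESC limit 1"
    ).fetchone()
    if row is None:
        return {}
    return {"since_id": row[0]}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    print(since_params(conn))  # empty table: no since_id condition
    store_page(conn, [
        ("older message", "aaa", "2009-01-04 10:00:00", "web"),
        ("newer message", "bbb", "2009-01-05 13:56:36", "web"),
    ])
    print(since_params(conn))  # newest row decides the since_id value
```

<p>Ordering by the time column works here because the times are stored as "YYYY-MM-DD HH:MM:SS" strings, which sort chronologically as plain text.</p>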
<li><strong>Details</strong> <br />Some details remain for the programmer to take care of; you cannot leave these problems to the users of the program.
<ol>
<li><strong>Time zone conversion</strong> <br />In the text returned by the Fanfou API, the created_at field gives the time in this format:<br />
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Mon Jan 05 11:35:27 +0000 2009 </p></blockquote>
<p>It means Monday, January 5, 2009, 11:35:27, in time zone UTC+0. <br />However, the vast majority of Fanfou users are in UTC+8, so both the time format and the time zone need adjusting. The function I wrote is:</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">def</span> time_from_0_to_8<span style="color: black;">&#40;</span>timestr,timezone=<span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>: <br />
<br />
&nbsp; &nbsp; TIMEFORMAT=<span style="color: #483d8b;">&quot;%a %b %d %X +0000 %Y&quot;</span> <br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#Sat Jan 03 23:08:54 +0000 2009 </span><br />
&nbsp; &nbsp; ISOTIMEFORMAT=<span style="color: #483d8b;">'%Y-%m-%d %X'</span> <br />
&nbsp; &nbsp; x=<span style="color: #dc143c;">time</span>.<span style="color: black;">strptime</span><span style="color: black;">&#40;</span>timestr, TIMEFORMAT<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; m=<span style="color: #dc143c;">time</span>.<span style="color: black;">mktime</span><span style="color: black;">&#40;</span>x<span style="color: black;">&#41;</span>+<span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span><span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span>timezone <br />
&nbsp; &nbsp; p=<span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span>ISOTIMEFORMAT,<span style="color: #dc143c;">time</span>.<span style="color: black;">localtime</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> p</div></td></tr></tbody></table></div>
<p>The default value of timezone is 8 (for UTC+8); you can of course change it to whatever offset you need. </p>
<li><strong>Escape encoding</strong><br />To make Fanfou messages safer as HTML, many characters are converted to their escaped forms; for example, the less-than sign &lt; is replaced with &amp;lt;, so that it is not confused with the &lt; used by page markup. I take advantage of this (rather than converting the characters back myself) by writing the messages out as HTML, so that the escaped characters show up in their original form in the browser. Since Fanfou messages are encoded as UTF-8 by default, I naturally add this to the output page as well:
<div class="codecolorer-container html4strict mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="html4strict codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">meta</span> <span style="color: #000066;">http-equiv</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;Content-Type&quot;</span> <span style="color: #000066;">content</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;text/html; charset=utf-8&quot;</span> <span style="color: #66cc66;">/</span>&gt;</span></div></td></tr></tbody></table></div>
</li>
</ol>
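<p>For comparison, here is one way the time zone conversion could look in Python 3 using the datetime module. It is a sketch under assumptions rather than the original function: the name to_local_time is hypothetical, the offset is treated as a fixed number of hours, and the %a and %b directives assume an English (C) locale for the day and month names.</p>

```python
from datetime import datetime, timedelta, timezone

API_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # e.g. Mon Jan 05 11:35:27 +0000 2009
OUT_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_local_time(timestr, offset_hours=8):
    """Convert a created_at string to a time string at a fixed UTC offset."""
    # %z reads the +0000 part, so the parsed value already knows it is UTC.
    moment = datetime.strptime(timestr, API_FORMAT)
    local = moment.astimezone(timezone(timedelta(hours=offset_hours)))
    return local.strftime(OUT_FORMAT)


if __name__ == "__main__":
    print(to_local_time("Mon Jan 05 11:35:27 +0000 2009"))
```

<p>Because the offset is read from the string itself, this version does not depend on the time zone of the machine running it.</p>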
</li>
</ol>
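<p>That completes the explanation of downloading, parsing, and output. With the support of Fanfou's powerful API, the barrier to writing Fanfou programs, especially ones built on downloading messages, has dropped to a new low. What applications fellow programming enthusiasts build with it is for each of you to show. I have attached my own program below for reference. I will hold off on releasing the compiled command-line version for now; I am currently working on the GUI. </p>
<p>Appendix: the Python program. It requires several third-party modules; please download them yourself.</p>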
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span><br />
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span><br />
<span style="color: #008000;">reload</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span><span style="color: black;">&#41;</span><br />
<span style="color: #dc143c;">sys</span>.<span style="color: black;">setdefaultencoding</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#ensure the utf8 encoding</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> pysqlite2.<span style="color: black;">dbapi2</span> <span style="color: #ff7700;font-weight:bold;">as</span> sqlite &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#sqlite3 </span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#regular expression to parse msg</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> pycurl &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#downloading engine </span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">StringIO</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#to &nbsp;receive the downloaded text</span><br />
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">time</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#time zone conversion</span><br />
<br />
<span style="color: #808080; font-style: italic;"># important regex to parse the xml file</span><br />
p=<span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; r<span style="color: #483d8b;">&quot;&quot;&quot;&lt;created_at&gt;([^&lt;]+)&lt;/created_at&gt;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;id&gt;([^&lt;]+)&lt;/id&gt;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;text&gt;(.*?)&lt;/text&gt;<span style="color: #000099; font-weight: bold;">\s</span>*<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;source&gt;([^&lt;]+)&lt;/source&gt;<span style="color: #000099; font-weight: bold;">\s</span>*&quot;&quot;&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">DOTALL</span> | <span style="color: #dc143c;">re</span>.<span style="color: black;">VERBOSE</span><span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> time_from_0_to_8<span style="color: black;">&#40;</span>timestr,timezone=<span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'convert fanfou +0000 time string to locole chinese time string.<br />
&nbsp; &nbsp; &nbsp;if you live in another timezone, please modify the timezone parameter.<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; TIMEFORMAT=<span style="color: #483d8b;">&quot;%a %b %d %X +0000 %Y&quot;</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#Sat Jan 03 23:08:54 +0000 2009</span><br />
&nbsp; &nbsp; ISOTIMEFORMAT=<span style="color: #483d8b;">'%Y-%m-%d %X'</span><br />
&nbsp; &nbsp; x=<span style="color: #dc143c;">time</span>.<span style="color: black;">strptime</span><span style="color: black;">&#40;</span>timestr, TIMEFORMAT<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; m=<span style="color: #dc143c;">time</span>.<span style="color: black;">mktime</span><span style="color: black;">&#40;</span>x<span style="color: black;">&#41;</span>+<span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span><span style="color: #ff4500;">60</span><span style="color: #66cc66;">*</span>timezone<br />
&nbsp; &nbsp; p=<span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span>ISOTIMEFORMAT,<span style="color: #dc143c;">time</span>.<span style="color: black;">localtime</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> p <br />
<br />
<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> download<span style="color: black;">&#40;</span><span style="color: #008000;">id</span>,page=<span style="color: #ff4500;">1</span>,other=<span style="color: #483d8b;">&quot;&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;&quot;<br />
&nbsp; &nbsp; to download user id's message by page number. the default <br />
&nbsp; &nbsp; page is the 1st one. <br />
&nbsp; &nbsp; &quot;&quot;&quot;</span><br />
&nbsp; &nbsp; c = pycurl.<span style="color: black;">Curl</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; url=<span style="color: #483d8b;">&quot;http://api.fanfou.com/statuses/user_timeline.xml?id=%s%s&amp;page=%d&quot;</span><span style="color: #66cc66;">%</span><span style="color: black;">&#40;</span><span style="color: #008000;">id</span>,other,page<span style="color: black;">&#41;</span> <br />
&nbsp; &nbsp; c.<span style="color: black;">setopt</span><span style="color: black;">&#40;</span>pycurl.<span style="color: black;">URL</span>, url<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">setopt</span><span style="color: black;">&#40;</span>pycurl.<span style="color: black;">HTTPHEADER</span>, <span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;Accept:&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; b = <span style="color: #dc143c;">StringIO</span>.<span style="color: #dc143c;">StringIO</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">setopt</span><span style="color: black;">&#40;</span>pycurl.<span style="color: black;">WRITEFUNCTION</span>, b.<span style="color: black;">write</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; c.<span style="color: black;">perform</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> b.<span style="color: black;">getvalue</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> parsemsg<span style="color: black;">&#40;</span>text,p<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; parse all the messages from the given text, <br />
&nbsp; &nbsp; return the message timestamp, msg text, and uuid.<br />
&nbsp; &nbsp; the structure of the returned list:<br />
&nbsp; &nbsp; list[(time,id,msg,tool),(time,id,msg,tool)...]<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> p.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span><br />
<br />
<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> initdb<span style="color: black;">&#40;</span><span style="color: #008000;">id</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; &nbsp; &nbsp; init the database, create if not exists.<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; dbname=<span style="color: #008000;">id</span>+<span style="color: #483d8b;">'.db3'</span><br />
&nbsp; &nbsp; cx=sqlite.<span style="color: black;">connect</span><span style="color: black;">&#40;</span>dbname<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu=cx.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;&quot;create table if not exists msg(<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; content Text,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid Varchar(12) NOT NULL PRIMARY KEY,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; time Time,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tool Text<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )&quot;&quot;&quot;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> cx<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> latest_uid<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; cu=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'select distinct uuid from msg order by time DESC limit 1'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; rs=cu.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> rs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> rs<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">''</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> store<span style="color: black;">&#40;</span><span style="color: #008000;">list</span>,db<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; list[(time,id,msg,tool),(time,id,msg,tool)...]<br />
&nbsp; &nbsp; '</span><span style="color: #483d8b;">''</span><br />
&nbsp; &nbsp; cu=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; index=<span style="color: #ff4500;">0</span> <br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> item <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">list</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">time</span>=time_from_0_to_8<span style="color: black;">&#40;</span>item<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">id</span>=item<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; msg=item<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; tool=item<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'insert into msg values(&quot;%s&quot;,&quot;%s&quot;,&quot;%s&quot;,&quot;%s&quot;)'</span><span style="color: #483d8b;">''</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>msg,<span style="color: #008000;">id</span>,<span style="color: #dc143c;">time</span>,tool<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index+=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">except</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'insert error'</span> <br />
&nbsp; &nbsp; db.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%d messages parsed&quot;</span> <span style="color: #66cc66;">%</span> index<br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> printmsg<span style="color: black;">&#40;</span>db,index,sep=<span style="color: #483d8b;">&quot;　&quot;</span><span style="color: black;">&#41;</span>: <br />
&nbsp; &nbsp; cu=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; cu.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'select content, time from msg where 1 order by time'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; rs=cu.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; result=<span style="color: #483d8b;">&quot;&quot;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">while</span> rs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; result+=<span style="color: #008000;">str</span><span style="color: black;">&#40;</span>index<span style="color: black;">&#41;</span>+sep+rs<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>+sep+rs<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>+<span style="color: #483d8b;">&quot;&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; rs=cu.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; index+=<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> result<br />
&nbsp;<span style="color: #808080; font-style: italic;">###############################################################################</span><br />
<span style="color: #808080; font-style: italic;"># check the argument count before reading sys.argv[1]</span><br />
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">&lt;</span><span style="color: #ff4500;">2</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;please start this program with your id&quot;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;for example: ff.exe zhasm, where zhasm is the fanfou id&quot;</span><br />
&nbsp; &nbsp; <span style="color: #dc143c;">sys</span>.<span style="color: black;">exit</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">id</span>=<span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><br />
<br />
db=initdb<span style="color: black;">&#40;</span><span style="color: #008000;">id</span><span style="color: black;">&#41;</span><br />
since=latest_uid<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">if</span> since:<br />
&nbsp; &nbsp; condition=<span style="color: #483d8b;">&quot;&amp;since_id=&quot;</span>+since<br />
<span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; condition=<span style="color: #483d8b;">''</span><br />
page=<span style="color: #ff4500;">1</span><br />
<span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'downloading page'</span>,page<br />
&nbsp; &nbsp; msg=download<span style="color: black;">&#40;</span><span style="color: #008000;">id</span>,page,condition<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #008000;">list</span>=parsemsg<span style="color: black;">&#40;</span>msg,p<span style="color: black;">&#41;</span> &nbsp; &nbsp;<br />
&nbsp; &nbsp; store<span style="color: black;">&#40;</span><span style="color: #008000;">list</span>,db<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">list</span><span style="color: black;">&#41;</span>==<span style="color: #ff4500;">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">break</span><br />
&nbsp; &nbsp; page+=<span style="color: #ff4500;">1</span><br />
filename=<span style="color: #008000;">id</span>+<span style="color: #483d8b;">&quot;.html&quot;</span><br />
<span style="color: #008000;">file</span> = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span>filename,<span style="color: #483d8b;">&quot;w&quot;</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&lt;html&gt;<br />
&lt;head&gt;<br />
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=utf-8&quot; /&gt;<br />
&lt;/head&gt;<br />
&lt;body&gt;'</span><span style="color: #483d8b;">''</span> <span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>printmsg<span style="color: black;">&#40;</span>db,<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'<br />
&nbsp; &nbsp; &lt;/body&gt;<br />
&nbsp; &nbsp; &lt;/html&gt;'</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span><br />
<span style="color: #008000;">file</span>.<span style="color: black;">close</span><span style="color: black;">&#40;</span> <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
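<p>As a hedged aside, the time conversion and the <code>store</code> step of the script above could be sketched in present-day Python as follows. This is only an illustration under assumptions: it uses the standard <code>sqlite3</code> and <code>datetime</code> modules rather than the script's own imports, and the names <code>to_local</code>, <code>init_db</code> and <code>tz_hours</code> are hypothetical. It passes values through <code>?</code> placeholders instead of formatting them into the SQL text, so a message containing a double quote no longer breaks the insert.</p>

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Format of the API timestamps, e.g. "Sat Jan 03 23:08:54 +0000 2009".
FANFOU_FORMAT = "%a %b %d %H:%M:%S %z %Y"
ISO_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_local(timestr, tz_hours=8):
    """Convert an API timestamp to a wall-clock string in UTC+tz_hours.

    tz_hours is a hypothetical parameter; 8 corresponds to China Standard Time.
    """
    parsed = datetime.strptime(timestr, FANFOU_FORMAT)  # timezone-aware via %z
    local = parsed.astimezone(timezone(timedelta(hours=tz_hours)))
    return local.strftime(ISO_FORMAT)


def init_db(path=":memory:"):
    """Open the database and create the msg table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute(
        """create table if not exists msg(
               content text,
               uuid varchar(12) not null primary key,
               time text,
               tool text)"""
    )
    return db


def store(rows, db):
    """Store rows shaped like [(time, uuid, msg, tool), ...].

    Returns the number of rows actually inserted; rows whose uuid is
    already present are skipped by "insert or ignore".
    """
    before = db.total_changes
    db.executemany(
        "insert or ignore into msg values (?, ?, ?, ?)",
        [(msg, uuid, to_local(when), tool) for when, uuid, msg, tool in rows],
    )
    db.commit()
    return db.total_changes - before
```

<p>Calling <code>store</code> twice with the same rows inserts them only once, which matches the script's reliance on the <code>uuid</code> primary key to reject duplicates.</p>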
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-msg-extractor-via-new-api.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Parsing Fanfou Messages: From minidom to XPath</title>
		<link>http://iregex.org/blog/fanfou-message-extractor-from-minidom-to-xpath.html</link>
		<comments>http://iregex.org/blog/fanfou-message-extractor-from-minidom-to-xpath.html#comments</comments>
		<pubDate>Tue, 14 Oct 2008 10:00:58 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[firefox]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=35</guid>
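		<description><![CDATA[Tossing out a brick to attract jade: why not XPath, and what is XPath? I recently picked an old side project back up. After I polished and published the previous post, a reply from “that someone” caught my interest. He asked, “Why not use XPath?” “What is XPath?” I asked back... ]]></description>
			<content:encoded><![CDATA[<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Tossing Out a Brick to Attract Jade: Why Not XPath, and What Is XPath?</h2>
<p>I recently picked an old side project back up. After I polished and published the <a href="http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html">previous post</a>, a reply from “that someone” caught my interest. He asked, “Why not use XPath?”</p>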
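<p>“What is XPath?” I asked back. Before asking, of course, I googled it first, so as not to... well, you know.<br />
<span id="more-35"></span><br />
The first result to catch my eye was the <a href="http://www.w3c.org/TR/xpath">W3C</a> recommendation, which introduces XPath as follows:</p>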
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. </p></blockquote>
<p>Put plainly:</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>XPath is a language for locating the individual parts of an XML document, and both XSLT and XPointer can make use of it.</p></blockquote>
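<p>I also found an <a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">XPath tutorial</a>. I skimmed it, but did not pay much attention at the time.</p>
<p>Even so, Python's minidom module can do the same job. Why insist on XPath, especially since it needs an extra install in Python, whereas minidom ships with the standard library and runs anywhere?</p>
<p>I talked it over with “that someone” again, and his verdict was still “strongly recommended”. He also suggested that I read the <a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">tutorial</a> carefully and use the <a href="https://addons.mozilla.org/zh-CN/firefox/addon/1095">XPath Checker</a> add-on in Firefox.</p>
<p>So I did.</p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">A Freshly Honed Blade: Sharp at First Try</h2>
<p>Trying XPath Checker was a revelation. Select some text on a page, choose &#8220;View XPath&#8221; from the context menu, and it immediately shows the XPath of that node, with a clear hierarchy and a precise location. I just did not understand the syntax yet. So I read the tutorial carefully, learning as I went; half an hour later I could apply it to the Fanfou message scraper I had written earlier. Writing the code still took some effort, but the approach was clear, and I no longer got bogged down in details.</p>
<p>He also pointed out that ordinary HTML documents are usually not well-formed XML, so it is best to tidy them up before parsing them with XPath.</p>
<p>I had noticed this problem too. The useful content in a Fanfou HTML page is only a small part of the whole; the rest just slows things down and makes extraction harder.</p>
<p>After some experimenting, I improved the original code as follows:</p>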
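<p>1. Still fetch and parse the document with minidom as before, but keep only the part between &lt;ol&gt; and &lt;/ol&gt;. This part is saved as a string for later use. Taking only what is needed keeps the structure clear and the nesting shallow.</p>
<p>2. Use XPath to parse the string extracted in the previous step.</p>
<p>By now /, //, @, [], = and the rest have each gone from meaningless to helpful. Each has its place in my toolbox, ready to hand whenever I need it. I have become a beneficiary of XPath, and only now do I see how interesting and useful it is to learn.</p>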
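<p>The two steps above can be sketched roughly as follows. This is a minimal illustration, not the program from this post: it swaps 4Suite for the limited XPath subset in the standard library's <code>xml.etree.ElementTree</code>, and the sample markup, the <code>content</code> class name, and the function names <code>extract_ol</code> and <code>parse_messages</code> are all assumptions made for the example.</p>

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical, simplified stand-in for a Fanfou timeline page.
PAGE = """<html><body><div id="header">site chrome</div>
<ol>
  <li><span class="content">first message</span>
      <a class="time" href="/statuses/abcdefghijk" title="2008-10-14 10:00">3 hours ago</a>
      <span class="method">via web</span></li>
  <li><span class="content">second message</span>
      <a class="time" href="/statuses/lmnopqrstuv" title="2008-10-13 09:30">1 day ago</a>
      <span class="method">via sms</span></li>
</ol></body></html>"""


def extract_ol(page):
    """Step 1: parse with minidom and keep only the first <ol>, as a string."""
    dom = minidom.parseString(page)
    return dom.getElementsByTagName("ol")[0].toxml()


def parse_messages(ol_xml):
    """Step 2: query the small fragment with XPath-style paths.

    Returns a list of (time, uuid, content, method) tuples.
    """
    root = ET.fromstring(ol_xml)          # root element is <ol>
    rows = []
    for li in root.findall("./li"):       # "/" steps to child elements
        link = li.find(".//a[@class='time']")          # "//", "[]", "@" and "="
        method = li.find(".//span[@class='method']")
        content = li.find(".//span[@class='content']")
        uuid = link.get("href")[-11:]     # last 11 characters of the link
        rows.append((link.get("title"), uuid, content.text, method.text))
    return rows


messages = parse_messages(extract_ol(PAGE))
```

<p>Because the XPath queries only ever see the small &lt;ol&gt; fragment, the paths stay short and do not depend on the rest of the page layout.</p>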
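<p>One small problem remains that I cannot solve with pure XPath syntax. It goes like this:</p>
<p>XPath seems to return only the text content of a node; it cannot swallow the node whole, markup and all. For example:</p>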
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br /></div></td><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">'http://a.com'</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>hello world<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
<p>In View XPath, the expression /li/a returns</p>
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">'http://a.com'</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>hello world<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
<p>the whole element, tags included;</p>
<p>but in Python, using</p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">method=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/li/a)'</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
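<p>only hello world comes back. XPath has wiped out everything inside the &lt;&gt;, which puzzled me at first. (This is in fact how string() is defined: it returns the node's string-value, the concatenated text of its descendants, without any markup.)</p>
<p>Admittedly, /li/a/@href does return &#8217;http://a.com&#8217;.</p>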
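<p>In this situation, when I want the whole message, I fall back on a workaround: list.childNodes[index-1].firstChild.toxml()[22:-7]. That said, I am rather pleased with the earlier doc = Parse(str(list.toxml())), a little &#8220;invention&#8221; of my own, and there is nothing wrong with using the traditional XML parsing approach once more in the program. Of course, it would be best if XPath alone could handle all of the above.</p>
<p>After a bit of patching and polishing, the final Fanfou message program looks like this (core code only):</p>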
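<p>To make the difference concrete, here is a small sketch of the three things one can ask of a node: its string-value, an attribute, and its full serialized markup. It is an assumption-laden illustration that uses the standard library's <code>xml.etree.ElementTree</code> instead of 4Suite; the helper names <code>string_value</code> and <code>outer_xml</code> are hypothetical.</p>

```python
import xml.etree.ElementTree as ET

SAMPLE = "<li><a href='http://a.com'>hello <b>world</b></a></li>"


def string_value(elem):
    """Concatenate all descendant text, like XPath string() on an element."""
    return "".join(elem.itertext())


def outer_xml(elem):
    """Serialize the element itself, tags and attributes included."""
    return ET.tostring(elem, encoding="unicode")


li = ET.fromstring(SAMPLE)
a = li.find("./a")

text_only = string_value(a)   # text without any markup
href = a.get("href")          # counterpart of /li/a/@href
markup = outer_xml(a)         # the whole <a>...</a> element as a string
```

<p>Selecting the node and serializing it is a separate step from taking its string-value, which is why the <code>toxml()</code> workaround above is needed when the markup itself matters.</p>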
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> __getMsgByPage__<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>,page<span style="color: black;">&#41;</span>:<br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; url=<span style="color: #483d8b;">&quot;http://fanfou.com/&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: #dc143c;">user</span>+<span style="color: #483d8b;">&quot;/p.&quot;</span>+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span>page<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; node = minidom.<span style="color: black;">parse</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">list</span> = node.<span style="color: black;">getElementsByTagName</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ol&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; doc = Parse<span style="color: black;">&#40;</span><span style="color: #008000;">str</span><span style="color: black;">&#40;</span><span style="color: #008000;">list</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; cu=<span style="color: #008000;">self</span>.<span style="color: black;">sql</span>.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">max</span>=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'count(/ol/li)'</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">max</span>=<span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #008000;">max</span><span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">max</span>==<span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">max</span>=<span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #008000;">max</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> index <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>,<span style="color: #008000;">max</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; method=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//span[@class='</span>method<span style="color: #483d8b;">'])'</span><span style="color: #483d8b;">''</span> <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; method=method.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> method==<span style="color: #483d8b;">&quot;彩信&quot;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">time</span>=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//span[@class=&quot;time&quot;]/@title)'</span><span style="color: #483d8b;">''</span>\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//a[@class='</span>photo<span style="color: #483d8b;">']/@href)'</span><span style="color: #483d8b;">''</span>\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span> <span style="color: black;">&#91;</span>-<span style="color: #ff4500;">11</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">time</span>=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//a[@class='</span><span style="color: #dc143c;">time</span><span style="color: #483d8b;">']/@title)'</span><span style="color: #483d8b;">''</span>\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uuid=doc.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'string(/ol/li[%d]//a[@class='</span><span style="color: #dc143c;">time</span><span style="color: #483d8b;">']/@href)'</span><span style="color: #483d8b;">''</span> <span style="color: #66cc66;">%</span> index<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">11</span>:<span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; content = <span style="color: #008000;">list</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>index-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">firstChild</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">22</span>:-<span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># content, uuid, time, method are now available for further use.</span></div></td></tr></tbody></table></div>
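<p>The essential code is only a few lines, which spares me the long-winded coding of the earlier version. Performance is decent too: I batch-downloaded my nearly 3,000 Fanfou messages, over 150 pages in all, in 86 seconds. The Fanfou servers were kind enough not to block me along the way.</p>
<p><strong>To sum up</strong>: XPath is well suited to locating parts of an XML document. It is precise and highly descriptive, a powerful search tool for XML. If you parse XML often, give it a try.</p>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Personal Reflections</h2>
<p>Going from hand-written regular expressions, to minidom, and then to XPath may look like a detour, but I learned a lot along the way. From controlling every detail myself (working with my hands), to letting a tool do part of the job (hands and brain together), to handing everything to the right tool (working with my brain): this seems like a healthy progression. I can say with some pride that, having done the parsing by hand with regular expressions, I can advance or fall back as needed even when the available tools do not suit me. I know the details of parsing, so the existing tools (just pretty wrappers, after all) cannot fool me; however well packaged they are, in the Python XML libraries whose source I have read it is still regular expressions doing the work underneath (thank you, open source). Moving from a solution that merely works (it works!) to an excellent solution is also a natural step forward. I am not saying that using regular expressions is inferior. I have never said anything of the sort, about regular expressions or about the people who use them. In fact, regular expressions have always been the flying dagger in my kit. I love regular expressions! My point is only that different tools have different strengths in the right setting. It is not enough to know a tool's weaknesses so you can avoid them; it matters even more to know its strengths so you can put them to use. Only then can you deploy your forces with ease, with no tool in your hands left idle.</p>
<p>Related links:</p>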
<ul>
<li><a href="http://www.w3.org/TR/xpath">The W3C introduction to XPath</a></li>
<li><a href="http://www.zvon.org/xxl/XPathTutorial/General/examples.html">XPath tutorial</a>; a Chinese version is available, well illustrated and easy to follow.</li>
<li><a href="http://4suite.org">4suite</a>, an XPath toolkit for Python</li>
<li><a href="http://search.cpan.org/~samtregar/Class-XPath-1.4/XPath.pm">Perl has XPath too</a>. Untested.</li>
<li><a href="https://addons.mozilla.org/zh-CN/firefox/addon/1095">XPath Checker</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-message-extractor-from-minidom-to-xpath.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Extracting Fanfou Messages: regex vs. XML</title>
		<link>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html</link>
		<comments>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html#comments</comments>
		<pubDate>Wed, 08 Oct 2008 10:53:59 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[教程]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=33</guid>
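		<description><![CDATA[In-page navigation: Can the official API alone fetch every Fanfou message? Fanfou message structure Parsing Fanfou messages with regex Parsing Fanfou messages with xml Comparing the two Related reading There are many ways to bulk-export Fanfou messages, but the basic idea is always to first save the web... ]]></description>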
			<content:encoded><![CDATA[<p>In-page navigation:</p>
<ul>
<li><a href="#xiaochaqu"><strong>Can the official API alone fetch every Fanfou message?</strong></a></li>
<li><a href="#饭否消息结构"><strong>Fanfou message structure</strong></a></li>
<li><a href="#regex"><strong>Parsing Fanfou messages with regex</strong></a></li>
<li><a href="#python"><strong>Parsing Fanfou messages with xml</strong></a></li>
<li><a href="#compare"><strong>Comparing the two</strong></a></li>
<li><a href="#xiangguan"><strong>Related reading</strong></a></li>
</ul>
<p>
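There are many ways to bulk-export Fanfou messages, but the basic idea is always the same: save the page locally, then extract the useful messages from it. This article does not cover how to download the pages (with Xunlei, wget, curl and so on); it focuses on how to extract the useful Fanfou messages from pages that are already saved locally.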
<p><span id="more-33"></span></p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
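<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;"><a href="#xiaochaqu"><strong><span style="color: #ff008c;">An aside: can the official API alone fetch every Fanfou message?</span></strong></a></h2>
<p>You may ask why I do not simply use Fanfou's own API. True, the Fanfou API is quicker, more convenient and highly compatible. However, the official API only lets you download the 20 most recent messages. There is a way to download every message using nothing but the official API, but it is rather evil:</p>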
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>true<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; download <span style="color: #cc66cc;">20</span> messages via API<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; store them<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">delete</span> these <span style="color: #cc66cc;">20</span> messages via API<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</p>
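<p>Downloading while deleting does eventually yield every message: once the first 20 are deleted, the next 20 surface as the newest messages. In theory this works. But what we want is a lossless copy, the way Peter does it in Heroes, not a brutal cut, the way Sylar does it. Since the official API is limited, we will do the job ourselves. Please read on.</p>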
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Fanfou Message Structure</h2>
<p>Open the source of a Fanfou message page, for example my own<br />
<a name="饭否消息结构"></a><a title=" 我爱正则表达式" href="http://fanfou.com/regex" target="_blank">http://fanfou.com/regex/p.1</a> (http://fanfou.com/regex is in fact a shortcut for http://fanfou.com/regex/p.1; the full path is used here to show the general form). You can see that the useful messages sit inside this framework (the listing is fairly long):</p>
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"> <br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;ol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 代码非抄不能懂也。<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-03 12:07&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;/statuses/QD6qHiqUbeE&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-03 12:07<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过 <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://del.icio.us/fanfou/API%E5%BA%94%E7%94%A8&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; API<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 向自由的身心致敬！ - 早嗷嗷也盼~晚安安也盼~望穿安安双眼~~怎知道今日里打土匪进深山自己的队伍来哎到嗷~面安前安呐啊啊啊啊啊~~~ <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://fanfou.com/linkto/aHR0cDovL3d3dy5kb3ViYW4uY29tL2V2ZW50LzEwMjczNDg3Lw&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; http://www.douban.com/event/10273487/<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-06 14:07&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;/share/bd96z1U-gHw&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-06 14:07<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://help.fanfou.com/share_button.html&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 饭否分享<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;photo&quot;</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://fanfou.com/photo/8JsezhHM_VU&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;img</span> <span style="color: #000066;">src</span>=<span style="color: #ff0000;">&quot;http://photo.fanfou.com/m0/00/19/e2_36807.jpg&quot;</span> <span style="color: #000066;">alt</span>=<span style="color: #ff0000;">&quot;caixinceshi - no description&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 上传了新照片：caixinceshi - no description<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;time&quot;</span> <span style="color: #000066;">title</span>=<span style="color: #ff0000;">&quot;2008-10-03 11:33&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2008-10-03 11:33<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;span</span> <span style="color: #000066;">class</span>=<span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;a</span> <span style="color: #000066;">href</span>=<span style="color: #ff0000;">&quot;http://help.fanfou.com/mobile_mms.html&quot;</span> <span style="color: #000066;">target</span>=<span style="color: #ff0000;">&quot;_blank&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 彩信<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/a<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/span<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; # more <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/li<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> entries, at most 20 per page.<br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/ol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
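<p>Tip: in the Fanfou page source, all the messages sit on a single line, which is hard to read. Copy the code you need (taking care that opening and closing tags match up) into vim and run <tt class="string">:%s/&gt;/&gt;\r/g</tt> (which adds a line break after every &gt;), then press <tt class="string">ggvG</tt> to select everything and <tt class="string">=</tt> to reindent. All the code is then neatly indented and easy to read.</p>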
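<p>If vim is not at hand, the first step of that tip can be approximated in a few lines of Python. This is only a rough sketch: the function name <tt class="string">break_after_tags</tt> and the sample string are made up for illustration, and unlike vim's <tt class="string">=</tt> it does not reindent anything.</p>

```python
import re


def break_after_tags(html):
    """Insert a newline after every '>' so that each tag ends its own line."""
    return re.sub(r">", ">\n", html)


# Hypothetical one-line sample in the shape of the listing above.
sample = '<ol><li><span class="content">hello</span></li></ol>'
print(break_after_tags(sample))
```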
</p>
<p><a name="regex"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Parsing Fanfou Messages with regex</h2>
<p> </a></p>
<p>Below is the code that parses Fanfou messages with regex (copied directly from my earlier perl message-grabbing program).</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$ffmsg</span><span style="color: #339933;">=</span><span style="color: #000066;">qr</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+?</span><span style="color: #009900;">&#41;</span> &nbsp; # group 1: message text<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a\s+href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/(?:statuses|share)/([-_a-zA-Z0-9]{11})&quot;</span>\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;time&quot;</span>\s+title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;([-: 0-9]{16})&quot;</span><span style="color: #339933;">&gt;</span> &nbsp; # groups 2 and 3: uuid, time<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span\s+class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;method&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 通过<span style="color: #009900;">&#40;</span>网页<span style="color: #339933;">|</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*&lt;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&gt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&gt;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:&lt;/</span>a<span style="color: #339933;">&gt;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> &nbsp; # group 4: sending method<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;/</span>li<span style="color: #339933;">&gt;</span><br />
<span style="color: #009900;">&#125;</span>xi<span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
</p>
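<p>As you can see, a regular expression can mirror the original page markup quite faithfully. A few small points need explaining:</p>
<ul>
<li>In the first pair of parentheses I use <tt class="regex">([^&lt;]+?)</tt> to capture the message text. (A complete message breaks down into: the text; the time it was sent; the message uuid, for example QD6qHiqUbeE; the sending method; and the type, either MMS or text.) At first I used <tt class="regex">.*?</tt>, but that was imprecise, and two messages sometimes ended up merged into one match. <tt class="regex">([^&lt;]+?)</tt> instead captures everything from the current position up to the next &lt;. You might ask whether a &lt; inside the message text could interfere. It cannot, because Fanfou converts every &lt;, along with any other character that could interfere with parsing, into an entity such as &amp;lt;, so parsing is unaffected. In addition, <strong>a precise regular expression improves efficiency, because a pattern that cannot match fails as early as possible.</strong></li>
<li><tt class="regex">(?:statuses|share)</tt>. This alternation sits in front of the group that captures the Fanfou uuid. It lets the pattern capture not only messages posted in the ordinary ways (web page, SMS, mobile, API, IM tools and so on) but also messages posted with the "Fanfou Share" (饭否分享) tool. I am not fond of Fanfou Share. (Perhaps I will find time some day to write a post about its shortcomings?) Ordinary messages and Fanfou Share messages are discussed separately because their structures differ.</li>
<li><tt class="regex">(网页|(?:\s*&lt;[^&gt;]+&gt;)[^&lt;]+(?:&lt;/a&gt;))</tt>, which follows 通过 ("via"; 网页 means "web page"), uses both capturing and non-capturing parentheses. The non-capturing ones keep the program from getting too complicated and make numbered references easier ($1, $2 and so on: the more there are, the more confusing they become, and after the regex is edited they turn into a complete mess). They also save memory: if a program captures too much and does not release it promptly, it may exhaust resources. After all, we are not capturing just a few dozen bytes; a Fanfou user may have close to a hundred thousand messages. I mean prolific chatterboxes such as <a href="http://fanfou.com/appleice">苹果流冰</a>.</li>
<li>The <tt class="regex">xi</tt> flags: <tt class="regex">x</tt> makes the regex ignore whitespace and allows comments; <tt class="regex">i</tt> makes it case-insensitive. Because <tt class="regex">x</tt> ignores literal whitespace in the pattern, a space that must be matched has to be written as <tt class="regex">\s</tt>.</li>
</ul>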
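<p>The points above carry over to other regex engines. As a rough illustration only, here is a Python port of the idea using <tt class="regex">re.VERBOSE</tt>, Python's counterpart of perl's <tt class="regex">x</tt> flag. The one-line sample, the example.com link inside it and the name <tt class="regex">FFMSG</tt> are assumptions made up for this sketch. It is not the original perl program, and the group for the sending method is simplified.</p>

```python
import re

# Hypothetical one-line sample modelled on the message structure shown earlier.
SAMPLE = (
    '<li><span class="content">hello fanfou</span>'
    '<span class="stamp">'
    '<a href="/statuses/QD6qHiqUbeE" class="time" title="2008-10-03 12:07">'
    '2008-10-03 12:07</a>'
    '<span class="method">通过<a href="http://example.com/api" target="_blank">API</a></span>'
    '</span></li>'
)

FFMSG = re.compile(r"""
    <li>\s*
      <span\s+class="content">\s*
        ([^<]+?)                                  # group 1: message text
      \s*</span>\s*
      <span\s+class="stamp">\s*
        <a\s+href="/(?:statuses|share)/([-_a-zA-Z0-9]{11})"   # group 2: uuid
          \s+class="time"\s+title="([-: 0-9]{16})">           # group 3: time
          [^<]+
        </a>\s*
        <span\s+class="method">\s*
          通过\s*(?:<[^>]+>)?([^<]+)(?:</a>)?      # group 4: sending method
        \s*</span>\s*
      </span>\s*
    </li>
""", re.VERBOSE | re.IGNORECASE)

# In verbose mode a literal space in the pattern is dropped, which is why
# every required space above is written as \s+.
assert re.compile(r"span class", re.VERBOSE).fullmatch("spanclass")

match = FFMSG.search(SAMPLE)
text, uuid, sent_at, method = match.groups()
print(text, uuid, sent_at, method)
```

<p>Only four groups are numbered because the alternation and the optional tags use non-capturing parentheses, which is the same design choice the list above describes.</p>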
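<p>Extracting Fanfou message text with regular expressions involves many details. Overlook one and the program will embarrass you at run time. I will skip the analysis of the Fanfou MMS format; the principle is the same, so a brief mention is enough.</p>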
</p>
<p><a name="python"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Parsing Fanfou Messages with xml</h2>
<p></a></p>
<p>Now let us look at parsing Fanfou messages with xml in python. Note: this program draws on <a href="http://www.happysky.org/" target="_blank"><strong><span style="color: #ff008c;">ppip</span></strong></a>'s <a href="http://code.google.com/p/pyfan/" target="_blank">pyfan</a> program.<br />
 </p>
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span> <span style="color: #808080; font-style: italic;"># Python 2 standard library</span><br />
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">xml</span>.<span style="color: black;">dom</span> <span style="color: #ff7700;font-weight:bold;">import</span> minidom, Node <span style="color: #808080; font-style: italic;"># bring in the parsing tool: minidom</span><br />
node = minidom.<span style="color: black;">parse</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;http://fanfou.com/zhasm/p.1&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><br />
<span style="color: #808080; font-style: italic;"># fetch the whole of http://fanfou.com/zhasm/p.1 into the variable node</span><br />
l = node.<span style="color: black;">getElementsByTagName</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ol&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><br />
<span style="color: #808080; font-style: italic;"># keep the part that holds the Fanfou messages in the variable l</span><br />
number = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>l.<span style="color: black;">childNodes</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;"># assumption: one child node of l per list entry</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> c <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, number<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># skip list entries that carry a class attribute</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">hasAttribute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;class&quot;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">continue</span><br />
&nbsp; &nbsp; content = <span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># time:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>.\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; firstChild.<span style="color: black;">getAttribute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;title&quot;</span><span style="color: black;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># message text:</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">toxml</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">22</span>:-<span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># uuid</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>.<span style="color: black;">childNodes</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>.<span style="color: black;">firstChild</span>.\<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; getAttribute<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;href&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">10</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
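<p>In an xml document, matching opening and closing tags become distinctive features that the xml parsing functions can recognise easily, so the content you need can be pulled out.</p>
<ul>
<li><strong>childNodes[c].childNodes[0].toxml()[22:-7]</strong>: for each Fanfou message (childNodes[c]), this statement takes the first child node of the message (childNodes[0]), serialises it, and keeps the characters from the 23rd up to, but not including, the last 7. Which part is that? It is exactly the content shown as dots in each &lt;span class=&quot;content&quot;&gt;&#8230;&lt;/span&gt; pair: the opening tag is 22 characters long and the closing tag is 7.</li>
<li>The sending time, text and uuid of each message are stored in a tuple.</li>
</ul>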
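<p>To make the slicing concrete, here is a small self-contained sketch that runs without network access. The fragment is a made-up, simplified stand-in for a real Fanfou page, so the stamp span sits at index 1 here rather than index 5 as in the program above.</p>

```python
from xml.dom import minidom

# Hypothetical, well-formed fragment modelled on the structure shown earlier.
FRAGMENT = (
    '<ol><li>'
    '<span class="content">hello fanfou</span>'
    '<span class="stamp"><a class="time" title="2008-10-03 12:07" '
    'href="/statuses/QD6qHiqUbeE">2008-10-03 12:07</a></span>'
    '</li></ol>'
)

doc = minidom.parseString(FRAGMENT)
item = doc.getElementsByTagName("ol")[0].childNodes[0]  # the first <li>
span = item.childNodes[0]                               # <span class="content">

# toxml() serialises the node including its own opening and closing tags.
raw = span.toxml()
opening, closing = '<span class="content">', '</span>'
assert (len(opening), len(closing)) == (22, 7)

text = raw[22:-7]                          # drop the opening and closing tags
link = item.childNodes[1].firstChild       # <a class="time" ...>
uuid = link.getAttribute("href")[10:]      # drop the 10 characters "/statuses/"
record = (link.getAttribute("title"), text, uuid)
print(record)
```

<p>The fixed offsets only work while the opening tag stays exactly <tt class="string">&lt;span class=&quot;content&quot;&gt;</tt>; any extra attribute would shift them.</p>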
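<p>Once you have the content, how you cook it afterwards is entirely up to you.</p>
<p>It is worth mentioning that, while downloading Fanfou messages in bulk, I found the Fanfou pages unreachable more than once. I asked 郭万怀 of Fanfou, who replied that, to reduce server load, each IP address is allowed 100 page views per minute and is blocked automatically beyond that. In my own tests the limit turned out to be lower than 100 pages. A fairly safe interval is sleep(15) after each page extracted. That is a bit slow, but it cannot be helped. Of course, some people say they ran my earlier message-grabbing program, downloaded several hundred pages in one go, and never ran into an outage. All I can say is that your luck must be good.</p>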
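<p>The pause between pages can be sketched as follows. This is a minimal illustration, not my original grabber: <tt class="string">fetch_pages</tt> and its parameters are hypothetical names, the real download function is left to the caller, and the demo at the bottom uses a fake fetcher and a recording stand-in for sleep so that it runs instantly and offline.</p>

```python
import time


def fetch_pages(fetch, page_urls, delay=15.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(page_urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


urls = ["http://fanfou.com/regex/p.%d" % n for n in range(1, 4)]

# Demo with stand-ins: no network access and no real waiting.
pauses = []
pages = fetch_pages(lambda url: "<html>%s</html>" % url, urls, sleep=pauses.append)
print(len(pages), pauses)
```

<p>At 15 seconds per page, the 150 or so pages of a 3,000-message account would take a little under 40 minutes, which is the price of staying under the limit.</p>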
</p>
<p><a name="compare"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Comparing the Two</h2>
<p></a> </p>
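<p>In my view, xml compares with regex as follows:</p>
<ul>
<li><strong>Generality:</strong> the xml approach is general. It can parse not only Fanfou messages but also other well-formed html text with few code changes. A regular expression is special-purpose and cannot be applied everywhere. When Fanfou's interface or framework is tweaked, tools that parse with regular expressions will probably be the first to fall over.</li>
<li><strong>Readability:</strong> some say perl is a write-only language, and regex even more so. What they mean is that perl or regex code is written on a burst of inspiration, flows freely, and runs efficiently; but if the code is messily formatted and has no comments or documentation, reading it again days or months later is like reading hieroglyphics. Python code that parses with an xml library, by contrast, is neatly formatted and uses library functions whose names say what they do, so it is more readable. That is the general picture. Still, we can write code (even perl or regex) as tidily and readably as possible, especially given that perl supports the <tt class="regex">/x</tt> flag.</li>
<li><strong>Efficiency:</strong> a well-compiled regex should run faster than xml parsing. Using xml, however, saves programming time; using a regex costs some programming time and, in theory, gains a little efficiency. Interested readers can write a program that loops thousands of times and compare the average times.</li>
</ul>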
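<p>For readers who want to try the timing comparison, here is one possible sketch using Python's <tt class="string">timeit</tt> module. The synthetic 20-message page, the helper names and the run count are all assumptions chosen for illustration, and the numbers it prints will vary from machine to machine, so treat it as a starting point rather than a verdict.</p>

```python
import re
import timeit
from xml.dom import minidom

# Synthetic page: 20 simplified messages, the per-page maximum mentioned above.
PAGE = "<ol>" + "".join(
    '<li><span class="content">message %d</span></li>' % i for i in range(20)
) + "</ol>"

CONTENT_RE = re.compile(r'<span\s+class="content">([^<]*)</span>')


def with_regex(html):
    """Extract every message text with a precompiled regular expression."""
    return CONTENT_RE.findall(html)


def with_minidom(html):
    """Extract every message text by parsing the page with minidom."""
    doc = minidom.parseString(html)
    return [
        span.firstChild.data
        for span in doc.getElementsByTagName("span")
        if span.getAttribute("class") == "content"
    ]


# Both approaches must agree before their speed is worth comparing.
assert with_regex(PAGE) == with_minidom(PAGE)

runs = 200
t_regex = timeit.timeit(lambda: with_regex(PAGE), number=runs) / runs
t_dom = timeit.timeit(lambda: with_minidom(PAGE), number=runs) / runs
print("regex:   %.1f microseconds per page" % (t_regex * 1e6))
print("minidom: %.1f microseconds per page" % (t_dom * 1e6))
```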
<p>
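Having written this far, I find the comparison of two martial arts in Chapter 5 of Jin Yong's 《鹿鼎记》, "金戈运启驱除会，玉匣书留想象间", quite apt:<br />
The 大慈大悲千叶手 (Thousand-Leaf Hand of Great Compassion) has too many moves and is troublesome to memorise. The 八卦游龙掌 (Eight Trigrams Swimming Dragon Palm) has only eight times eight, sixty-four, moves, yet with its endless variations it can hold its own against the Thousand-Leaf Hand. So which is the stronger art? Both are first-rate palm techniques, and neither can be called the stronger. Whoever has the deeper skill and uses it more cleverly wins.
<p>In terms of this article, regex is like the Thousand-Leaf Hand of Great Compassion, with too many details to watch, while the xml approach is like the Eight Trigrams Swimming Dragon Palm with its mere sixty-four moves. Both tools are very useful.</p>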
</p>
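<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">A Plea for More Official Features</h2>
<p>I digress. This is just grumbling and has nothing to do with xml or regex. More than once, on Fanfou and on my blog, I have complained that downloading and parsing with a clumsy method like the one above is a last resort. The most convenient solution would be an official bulk-export tool: a single database query and export would do what we achieve only through roundabout means after hours of hard work. Perhaps the Fanfou staff are all busy enhancing and polishing 海内; Fanfou has been left to fend for itself and has gone a long time without updates, while jiwai.de, zuosa and others roll out one new feature after another. </p>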
</p>
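<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Extension</h2>
<p>The approach in this article applies equally to twitter. But twitter is getting slower and slower, and for a while it did not even seem to support viewing history pages.</p>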
</p>
<p><a name="xiangguan"><br />
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Related Reading</h2>
<p></a></p>
<ul>
<li><a href="http://iregex.org/blog/fanfou-private-message-format-analysis.html" target="_blank">Analysis of the Fanfou private message format</a></li>
<li><a href="http://zhasm.com/blog/fanfou-msg-grabber-limitation-and-suggestion-on-sharing-msg.html">Limits on bulk-downloading Fanfou messages, and suggestions for the Fanfou Share feature</a></li>
<li><a href="http://zhasm.com/blog/about-my-fanfou-applications.html">A few words about the Fanfou applications I have written</a></li>
<li><a href="http://zhasm.com/blog/comments-on-fanfou.html">饭否，尚能饭否？ (Fanfou: does it still have an appetite?)</a></li>
<li><a href="http://zhasm.com/blog/uuid-in-twitter-and-fanfou.html">uuid in twitter and fanfou</a></li>
<li><a href="http://zhasm.com/blog/fanfou-message-grabber.html">Bulk message-grabbing script: export all of your Fanfou messages in one go!</a></li>
<li><a href="http://zhasm.com/blog/fanfou-vs-twitter-base64-vs-tinyurl.html">fanfou vs twitter, base64 vs tinyurl?</a></li>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-message-extractor-regex-vs-xml.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Analysis of the Fanfou Private Message Format</title>
		<link>http://iregex.org/blog/fanfou-private-message-format-analysis.html</link>
		<comments>http://iregex.org/blog/fanfou-private-message-format-analysis.html#comments</comments>
		<pubDate>Sat, 31 May 2008 04:07:38 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[杂项]]></category>
		<category><![CDATA[fanfou]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[regex]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=13</guid>
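		<description><![CDATA[URL Fanfou private messages come in two kinds: those I have received and those I have sent. Private messages I received: http://fanfou.com/privatemsg/p.(1-N) Private messages I sent: http://fanfou.com/privatemsg/sent/p.(1-N) The addresses above contain no Fanfou ID; they need... ]]></description>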
			<content:encoded><![CDATA[<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">URL</h3>
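<p>Fanfou private messages come in two kinds: those I have received and those I have sent.</p>
<ul>
<li> Messages I received: http://fanfou.com/privatemsg/p.(1-N)</li>
<li> Messages I sent: http://fanfou.com/privatemsg/sent/p.(1-N)</li>
</ul>
<p>The addresses above do not contain a Fanfou ID; access requires cookie authentication.</p>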
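<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">End marker</h3>
<p>Once cookie authentication succeeds, each page of private messages can be fetched by its page number. When does it end? Suppose your inbox holds 1000 private messages at 20 per page: pages 1 through 50 carry content, so requesting http://fanfou.com/privatemsg/p.51 returns no messages. A program can detect this by looking for the following marker, an empty message list:</p>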
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #339933;">&lt;</span>ol class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;wa&quot;</span><span style="color: #339933;">&gt;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*&lt;/</span>ol<span style="color: #339933;">&gt;</span></div></td></tr></tbody></table></div>
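<p>The paging rule and the end marker can be sketched in Python as follows. This is only a minimal sketch: <code>fetch</code> is a hypothetical caller-supplied function that performs the cookie-authenticated request and returns the page body, and <code>page_url</code>, <code>iter_pages</code> and <code>max_pages</code> are names invented for illustration.</p>

```python
import re
from typing import Callable, Iterator

# Empty message list: the marker the article looks for on a page past the end.
END_MARKER = re.compile(r'<ol class="wa">\s*</ol>')


def page_url(page: int, sent: bool = False) -> str:
    """Build the inbox or outbox URL for a 1-based page number."""
    base = "http://fanfou.com/privatemsg/"
    return f"{base}sent/p.{page}" if sent else f"{base}p.{page}"


def iter_pages(fetch: Callable[[str], str], sent: bool = False,
               max_pages: int = 1000) -> Iterator[str]:
    """Yield page HTML until the empty-list marker appears.

    `fetch` is a hypothetical function supplied by the caller; it is
    assumed to send the login cookie and return the page body as text.
    `max_pages` is a safety cap in case the marker never shows up.
    """
    for page in range(1, max_pages + 1):
        html = fetch(page_url(page, sent))
        if END_MARKER.search(html):
            break
        yield html
```

<p>Passing the fetch function in keeps the paging logic separate from cookie handling, so the loop can be exercised with canned pages before any network code exists.</p>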
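<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Friend list</h3>
<p>In the page source, every page carries a “send a private message to XXX” combo box whose entries each consist of <font color="#ff0084">nickname + ID</font>. If you have many friends (500+) and each entry (nickname + ID) takes an estimated 20 bytes, that is 20 × 500 = 10,000 bytes, so every page fetched carries roughly 10 KB of extra data.</p>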
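<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Structure of inbox private messages</h3>
<p>Received messages come in two kinds: those that quote the message being replied to (回复原文: …) and those that do not. Let us start with the simpler case, messages without a quoted reply.<br />
All private messages sit inside <font color="#ff0084">&lt;ol class=&quot;wa&quot;&gt;…&lt;/ol&gt;</font>, each one wrapped in <font color="#ff0084">&lt;li&gt;&lt;/li&gt;</font>.</p>
<p>The item below, for example, is a typical private message (the sender-related parts are written as regular expressions):</p>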
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; title=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;avatar&quot;</span><span style="color: #339933;">&gt;&lt;</span>img src<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; alt=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> <span style="color: #339933;">/&gt;&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; 来自<span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot;&gt;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&lt;/</span>a<span style="color: #339933;">&gt;</span>：<br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span>没法比较啊<span style="color: #339933;">,</span>你得说个具体的值<span style="color: #339933;">,</span>比如<span style="color: #cc66cc;">100</span>条以下的算少<span style="color: #339933;">,</span><span style="color: #cc66cc;">1000</span>条以上的算多……<span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp time&quot;</span> title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;2008-05-30 17:25&quot;</span><span style="color: #339933;">&gt;</span>约 <span style="color: #cc66cc;">15</span> 小时前<span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;op&quot;</span><span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/privatemsg.reply/583520&quot;</span><span style="color: #339933;">&gt;</span>回复<span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/privatemsg.del/583520&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;post_act&quot;</span><span style="color: #339933;">&gt;</span>删除<span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
<span style="color: #339933;">&lt;/</span>li<span style="color: #339933;">&gt;</span></div></td></tr></tbody></table></div>
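<p>The fields worth recording are:</p>
<ul>
<li>sender name;</li>
<li>sender ID;</li>
<li>message content;</li>
<li>time;</li>
<li>message ID (useful for deleting or replying).</li>
</ul>
<p>With those requirements in mind, turn the markup above into a pattern with capture groups:</p>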
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot; title=&quot;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+)&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;avatar&quot;</span><span style="color: #339933;">&gt;&lt;</span>img src<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; alt=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> <span style="color: #339933;">/&gt;&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; 来自<span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot;&gt;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&lt;/</span>a<span style="color: #339933;">&gt;</span>：<br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp time&quot;</span> title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot;&gt;[^&lt;]+&lt;/span&gt;<br />
&nbsp; &nbsp; &lt;span class=&quot;</span>op<span style="color: #ff0000;">&quot;&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;a href=&quot;</span><span style="color: #339933;">/</span>privatemsg<span style="color: #339933;">.</span>reply<span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot;&gt;回复&lt;/a&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;a href=&quot;</span><span style="color: #339933;">/</span>privatemsg<span style="color: #339933;">.</span>del<span style="color: #339933;">/</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; class=&quot;</span>post_act<span style="color: #ff0000;">&quot;&gt;删除&lt;/a&gt;<br />
&nbsp; &nbsp; &lt;/span&gt;<br />
&lt;/li&gt;</span></div></td></tr></tbody></table></div>
<p>This yields:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$1</span><span style="color: #339933;">=</span>fanfou ID<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$2</span><span style="color: #339933;">=</span>fanfou name<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$3</span><span style="color: #339933;">=</span>private msg<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$4</span><span style="color: #339933;">=</span>msg <span style="color: #000066;">time</span><span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$5</span><span style="color: #339933;">=</span>msg ID<span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
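<p>The five captures above can be tried out in Python roughly as follows. This is a sketch under assumptions, not the author's program: the sender values in <code>SAMPLE</code> (<code>example_id</code>, <code>Example Name</code>, the image URL) are made up, the field names are invented, and the pattern swaps the literal line breaks for <code>\s*</code> and escapes the dots so that it tolerates whitespace differences between pages.</p>

```python
import re

# One inbox <li>; groups follow the article's $1..$5 order.
MESSAGE_RE = re.compile(
    r'<li>\s*'
    r'<a href="/([^"]+)" title="([^"]+)" class="avatar">'
    r'<img src="[^"]+" alt="[^"]+" /></a>\s*'
    r'来自<a href="[^"]+">[^<]+</a>：\s*'
    r'<span class="content">([^<]+)</span>\s*'
    r'<span class="stamp time" title="([^"]+)">[^<]+</span>\s*'
    r'<span class="op">\s*'
    r'<a href="/privatemsg\.reply/(\d+)">回复</a>\s*'
    r'<a href="/privatemsg\.del/\d+" class="post_act">删除</a>\s*</span>\s*'
    r'</li>'
)

# Hypothetical page fragment; the sender details are placeholders.
SAMPLE = '''<li>
    <a href="/example_id" title="Example Name" class="avatar"><img src="http://example.com/a.jpg" alt="Example Name" /></a>
    来自<a href="/example_id">Example Name</a>：
    <span class="content">没法比较啊,你得说个具体的值</span>
    <span class="stamp time" title="2008-05-30 17:25">约 15 小时前</span>
    <span class="op">
        <a href="/privatemsg.reply/583520">回复</a>
        <a href="/privatemsg.del/583520" class="post_act">删除</a></span>
</li>'''


def parse_inbox(html):
    """Return one dict per message found in the page HTML."""
    fields = ("sender_id", "sender_name", "content", "time", "msg_id")
    return [dict(zip(fields, m.groups())) for m in MESSAGE_RE.finditer(html)]


messages = parse_inbox(SAMPLE)
```

<p>Because the content group is <code>[^&lt;]+</code>, a message whose body contains markup (a link, for instance) would not match; that limitation is inherited from the pattern in the article.</p>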
<p>Now look at the structure of a message that contains a quoted reply (回复原文); parts of it have already been turned into regex:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot; title=&quot;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+)&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;avatar&quot;</span><span style="color: #339933;">&gt;&lt;</span>img src<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; alt=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> <span style="color: #339933;">/&gt;&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; 来自<span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot;&gt;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&lt;/</span>a<span style="color: #339933;">&gt;</span>：<br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp time&quot;</span> title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot;&gt;[^&lt;]+&lt;/span&gt;<br />
&nbsp; &nbsp; &lt;span class=&quot;</span>op<span style="color: #ff0000;">&quot;&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;a href=&quot;</span><span style="color: #339933;">/</span>privatemsg<span style="color: #339933;">.</span>reply<span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot;&gt;回复&lt;/a&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;a href=&quot;</span><span style="color: #339933;">/</span>privatemsg<span style="color: #339933;">.</span>del<span style="color: #339933;">/</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; class=&quot;</span>post_act<span style="color: #ff0000;">&quot;&gt;删除&lt;/a&gt;<br />
&nbsp; &nbsp; &lt;/span&gt;<br />
&nbsp; &nbsp; &lt;p class=&quot;</span>pm<span style="color: #339933;">-</span>parent<span style="color: #ff0000;">&quot;&gt;回复原文: 有兴趣就有动力&lt;/p&gt;<br />
&lt;/li&gt;</span></div></td></tr></tbody></table></div>
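<p>Compared with the previous case, the only addition is the <font color="#ff0084">&lt;p class=&quot;pm-parent&quot;&gt;.*?&lt;/p&gt;</font> segment. That is nothing to worry about: wrapping the whole segment in a group and appending the <font color="#ff0084">?</font> quantifier makes it optional, so a single pattern covers both cases. The combined pattern is:</p>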
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot; title=&quot;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+)&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;avatar&quot;</span><span style="color: #339933;">&gt;&lt;</span>img src<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; alt=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> <span style="color: #339933;">/&gt;&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; 来自<span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot;&gt;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&lt;/</span>a<span style="color: #339933;">&gt;</span>：<br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^&lt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp time&quot;</span> title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot;&gt;[^&lt;]+&lt;/span&gt;<br />
&nbsp; &nbsp; &lt;span class=&quot;</span>op<span style="color: #ff0000;">&quot;&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;a href=&quot;</span><span style="color: #339933;">/</span>privatemsg<span style="color: #339933;">.</span>reply<span style="color: #339933;">/</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot;&gt;回复&lt;/a&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;a href=&quot;</span><span style="color: #339933;">/</span>privatemsg<span style="color: #339933;">.</span>del<span style="color: #339933;">/</span><span style="color: #0000ff;">\d</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; class=&quot;</span>post_act<span style="color: #ff0000;">&quot;&gt;删除&lt;/a&gt;&lt;/span&gt;<br />
&nbsp; &nbsp; (?:&lt;p class=&quot;</span>pm<span style="color: #339933;">-</span>parent<span style="color: #ff0000;">&quot;&gt;([^&lt;]+)&lt;/p&gt;)?<br />
&lt;/li&gt;</span></div></td></tr></tbody></table></div>
<p>The resulting variables are:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000ff;">$1</span><span style="color: #339933;">=</span>fanfou ID<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$2</span><span style="color: #339933;">=</span>fanfou name<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$3</span><span style="color: #339933;">=</span>private msg<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$4</span><span style="color: #339933;">=</span>msg <span style="color: #000066;">time</span><span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$5</span><span style="color: #339933;">=</span>msg ID<span style="color: #339933;">;</span><br />
<span style="color: #0000ff;">$6</span><span style="color: #339933;">=</span>parent msg<span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;"># the quoted original message (回复原文)</span></div></td></tr></tbody></table></div>
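<p>To see how the optional group behaves, here is a small Python sketch. The helper <code>make_item</code> and its sender values are hypothetical and exist only to produce test input; as in the earlier patterns, <code>\s*</code> stands in for the literal line breaks. When the <code>pm-parent</code> paragraph is absent, the sixth group simply comes back empty (<code>None</code> in Python).</p>

```python
import re

# The article's five captures, plus an optional sixth for the quoted parent.
COMBINED_RE = re.compile(
    r'<li>\s*<a href="/([^"]+)" title="([^"]+)" class="avatar">'
    r'<img src="[^"]+" alt="[^"]+" /></a>\s*'
    r'来自<a href="[^"]+">[^<]+</a>：\s*'
    r'<span class="content">([^<]+)</span>\s*'
    r'<span class="stamp time" title="([^"]+)">[^<]+</span>\s*'
    r'<span class="op">\s*<a href="/privatemsg\.reply/(\d+)">回复</a>\s*'
    r'<a href="/privatemsg\.del/\d+" class="post_act">删除</a>\s*</span>\s*'
    r'(?:<p class="pm-parent">([^<]+)</p>\s*)?'
    r'</li>'
)


def make_item(parent=None):
    """Build a hypothetical inbox <li>, optionally with a quoted parent."""
    tail = f'\n    <p class="pm-parent">{parent}</p>' if parent else ''
    return (
        '<li>\n'
        '    <a href="/example_id" title="Example Name" class="avatar">'
        '<img src="http://example.com/a.jpg" alt="Example Name" /></a>\n'
        '    来自<a href="/example_id">Example Name</a>：\n'
        '    <span class="content">hello</span>\n'
        '    <span class="stamp time" title="2008-05-30 17:25">约 15 小时前</span>\n'
        '    <span class="op">\n'
        '        <a href="/privatemsg.reply/583520">回复</a>\n'
        '        <a href="/privatemsg.del/583520" class="post_act">删除</a>\n'
        '    </span>' + tail + '\n</li>'
    )


plain = COMBINED_RE.search(make_item())
reply = COMBINED_RE.search(make_item('回复原文: 有兴趣就有动力'))
```

<p>Note that the sixth group still includes the “回复原文:” prefix, exactly as the pattern in the article captures it; stripping the prefix is left to later processing.</p>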
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Structure of outbox private messages</h3>
<p>Here is the markup:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot; title=&quot;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+)&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;avatar&quot;</span><span style="color: #339933;">&gt;&lt;</span>img src<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; alt=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> <span style="color: #339933;">/&gt;&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; 发给<span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot;&gt;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&lt;/</span>a<span style="color: #339933;">&gt;</span>：<br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span>海内的像片是真的。<span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp time&quot;</span> title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;2008-05-28 20:24&quot;</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">2008</span><span style="color: #339933;">-</span>05<span style="color: #339933;">-</span><span style="color: #cc66cc;">28</span> <span style="color: #cc66cc;">20</span><span style="color: #339933;">:</span><span style="color: #cc66cc;">24</span><span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;op&quot;</span><span style="color: #339933;">&gt;&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/privatemsg.del/576827&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;post_act&quot;</span><span style="color: #339933;">&gt;</span>删除<span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
<span style="color: #339933;">&lt;/</span>li<span style="color: #339933;">&gt;</span></div></td></tr></tbody></table></div>
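<p>This is almost identical to the structure of a received message: “来自” (from) becomes “发给” (to), and the op span holds only the delete link, with no reply link.</p>
<p>As expected, a sent message that carries a quoted reply (回复原文) looks like this:</p>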
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009999;">&lt;li&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/([^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #ff0000;">&quot; title=&quot;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+)&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;avatar&quot;</span><span style="color: #339933;">&gt;&lt;</span>img src<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot; alt=&quot;</span><span style="color: #009900;">&#91;</span><span style="color: #339933;">^</span><span style="color: #ff0000;">&quot;]+&quot;</span> <span style="color: #339933;">/&gt;&lt;/</span>a<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; 发给<span style="color: #339933;">&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #ff0000;">&quot;&gt;[^&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">+&lt;/</span>a<span style="color: #339933;">&gt;</span>：<br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;content&quot;</span><span style="color: #339933;">&gt;</span>在自述部分显示的那个网上。<span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;stamp time&quot;</span> title<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;2008-05-29 17:00&quot;</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">2008</span><span style="color: #339933;">-</span>05<span style="color: #339933;">-</span><span style="color: #cc66cc;">29</span> <span style="color: #cc66cc;">17</span><span style="color: #339933;">:</span>00<span style="color: #339933;">&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>span class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;op&quot;</span><span style="color: #339933;">&gt;&lt;</span>a href<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;/privatemsg.del/579918&quot;</span> class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;post_act&quot;</span><span style="color: #339933;">&gt;</span>删除<span style="color: #339933;">&lt;/</span>a<span style="color: #339933;">&gt;&lt;/</span>span<span style="color: #339933;">&gt;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">&lt;</span>p class<span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;pm-parent&quot;</span><span style="color: #339933;">&gt;</span>回复原文<span style="color: #339933;">:</span> 我也要试一试。<span style="color: #339933;">&lt;/</span>p<span style="color: #339933;">&gt;</span><br />
<span style="color: #339933;">&lt;/</span>li<span style="color: #339933;">&gt;</span></div></td></tr></tbody></table></div>
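<p>That is all the information the private message markup gives us. Unfortunately, Fanfou includes the quoted original (回复原文) as plain text rather than as the ID of the original message. Recovering the link during post-processing by searching for that text is possible, but if Fanfou improved this, it would save whoever organizes the data a good deal of effort.</p>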
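<p>One way the text search might look, as a rough Python sketch. It assumes the quoted text is the complete, unmodified body of the original message, which the page markup does not guarantee, and it assumes messages are stored as dicts with hypothetical <code>msg_id</code> and <code>content</code> keys. Several messages can share the same text, so the function returns all candidates rather than a single ID.</p>

```python
def find_parent_candidates(parent_text, messages, prefix="回复原文:"):
    """Return IDs of stored messages whose content equals the quoted text.

    `messages` is assumed to be a list of dicts with "msg_id" and
    "content" keys (hypothetical storage format).
    """
    quoted = parent_text.strip()
    if quoted.startswith(prefix):
        quoted = quoted[len(prefix):].strip()
    return [m["msg_id"] for m in messages if m["content"].strip() == quoted]


# Hypothetical previously downloaded messages.
stored = [
    {"msg_id": "576827", "content": "有兴趣就有动力"},
    {"msg_id": "579918", "content": "我也要试一试。"},
]
candidates = find_parent_candidates("回复原文: 有兴趣就有动力", stored)
```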
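<p>That completes the analysis of Fanfou's private message markup. How to read and write cookies, how to write the code, and how to manage the downloaded data in a database all belong to the second stage, which I will implement step by step.</p>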
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/fanfou-private-message-format-analysis.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
