<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>我爱正则表达式 (I Love Regular Expressions) &#187; Applications</title>
	<atom:link href="http://iregex.org/blog/category/%e5%ba%94%e7%94%a8/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Sun, 27 Jun 2010 04:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/><atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/><atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/><atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/><atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/><atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/><atom:link rel="hub" href="http://rpc.pingomatic.com/"/>	
	<item>
		<title>Building your own regular-expression helper</title>
		<link>http://iregex.org/blog/diy-regexbuddy.html</link>
		<comments>http://iregex.org/blog/diy-regexbuddy.html#comments</comments>
		<pubDate>Wed, 12 May 2010 05:32:37 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[cgi]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[regexbuddy]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=115</guid>
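		<description><![CDATA[RegexBuddy is actually a fine tool, and I have been using it all along. Its usage and merits could fill many pages, and this site has covered it before. Still, there are reasons not to use it, which is also one reason for writing this post. I gave it some thought, spent a little time, and have already built... ]]></description>
			<content:encoded><![CDATA[<p>RegexBuddy is actually a fine tool, and I have been using it all along. Its usage and merits could fill many pages, and this site has covered it before. Still, there are reasons not to use it, which is also one reason for writing this post. I gave it some thought, spent a little time, and already have a prototype. I am posting my approach here to discuss it with everyone.</p>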
<p><span id="more-115"></span></p>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Motivation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Why I stopped using RegexBuddy</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
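<li>It is commercial software, and not cheap: $39.95. A Google search may turn up a pleasant surprise.</li>
<li>It only runs on Windows. On Ubuntu I do install wine, solely to run RegexBuddy.</li>
<li>RegexBuddy cannot be used on the Mac. I have recently started working on a Mac and no longer want to maintain a separate runtime environment just for Windows software, so RegexBuddy and I seem about to part ways. I searched <a href="http://search.macupdate.com/search.php?keywords=regex&#038;os=mac" title="我爱正则表达式 | Building your own regular-expression helper">here</a> and <a href="http://www.apple.com/search/?q=regex&#038;sec=downloads" title="我爱正则表达式 | Building your own regular-expression helper">here</a> and found very few programs, none of them impressive: most support only a plain regex flavor such as JavaScript's and lack support for multiple languages and options. RegexBuddy's excellent behavior has made me extremely picky about regex helper tools, and ordinary software no longer satisfies me.</li>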
</ul>
<p>With no ready-made solution, I started thinking about how to build one myself.</p>
</blockquote>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">My ideal regex helper</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
<li>Like RegexBuddy, it should support the following:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
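<li>Multiple regex flavors: at least Perl, Python, PHP, and JavaScript. I rarely use .NET (only when answering other people's questions, which does not count), so it can be ignored;</li>
<li>Match, replace, and split;</li>
<li>Code snippet generation. This matters a lot. I do not memorize things a computer can handle for me unless I use them often, and the things I use often gradually become muscle memory anyway.</li>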
</ol>
</blockquote>
<li>Beyond that, ideally it would also:</li>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ol>
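<li>Run on all common platforms, by which I mean Win/Lin/Mac.</li>
<li>Support each language natively. Frankly, I suspect RegexBuddy still uses Perl 5.8-style regexes; many novel and handy features of 5.10 are not yet supported in it. The likely reason is that RegexBuddy's author built the Perl and other regex engines from scratch, so they differ from the latest releases in details and versions. Speaking of languages, this reminds me of a piece of advice from Yu Sheng (余晟): when thinking about a regex problem, do not start from a particular language or version; keep a unified syntax in mind. I agree with him, but I also think that, once you know the seemingly universal basics, you should understand clearly the syntax details of the regex flavor you use most, and how it differs from the others, to avoid half-right assumptions. But I digress.</li>
<li>Open source, legitimate, and free. When we introduce regexes to other people, we need a tool we can recommend, right? Being free is not a hard requirement, though; good software deserves to be rewarded.</li>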
</ol>
</blockquote>
</ol>
<p>The question is: where do you find such software? And if it cannot be found and you want to build it from scratch, where do you start?</p>
</blockquote>
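<p>For readers unfamiliar with the Perl 5.10 additions mentioned above, here is a minimal sketch of three of them: named captures, <code>\K</code>, and possessive quantifiers. The sub names are made up for illustration; whether a given tool supports these constructs is something you would have to test yourself.</p>

```perl
use strict;
use warnings;
use 5.010;

# Named captures (quote-style syntax) and the %+ hash, new in Perl 5.10.
sub parse_date {
    my ($s) = @_;
    return unless $s =~ /(?'year'\d{4})-(?'month'\d{2})-(?'day'\d{2})/;
    return { year => $+{year}, month => $+{month}, day => $+{day} };
}

# \K drops everything matched so far from the text that gets replaced,
# so only the digits are substituted here.
sub bump_version {
    my ($s) = @_;
    (my $out = $s) =~ s/version=\K\d+/$& + 1/e;
    return $out;
}

# A possessive quantifier (a++) never gives back what it consumed,
# so no "a" is left for the final literal and the match fails.
sub possessive_demo {
    return ("aaa" =~ /^a++a/) ? "matched" : "no match";
}

say parse_date("posted 2010-05-12")->{year};   # 2010
say bump_version("version=7");                 # version=8
say possessive_demo();                         # no match
```

<p>Running the same patterns under an older perl, or in a tool with its own engine, is a quick way to see where the flavors diverge.</p>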
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">How my thinking evolved</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
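<li>Implement it in Objective-C. This idea was soon abandoned. Obj-C is certainly worth learning, but I cannot wait: I use software like RegexBuddy every day, and learning a language for this purpose would be digging a well only once I am already thirsty. Even after developing for the Mac, the code would still have to be compiled separately for Win/Lin, right? And with Obj-C, what about the regex engine? Write one from scratch? xiaofei says that building a usable regex engine takes an excellent team half a year. Of course, Obj-C can also call existing modules, which led to my current approach.</li>
<li>Make it a web application: the front end takes the user's input, and the back end uses CGI to call the server's native regex engines (perl, python), then shows the match or replacement result in the front end. Its biggest advantage is that the language support is 100% native. It works wherever there is a network and a browser, and even without a network it works on localhost, only faster. JavaScript and PHP do not need CGI at all; each can simply run in its own environment.</li>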
</ul>
<p>I chose the second option and set about implementing it.</p></blockquote>
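<p>The back-end half of that idea can be sketched roughly as follows. This is only an illustration under assumptions: <code>find_matches</code> and its return format are hypothetical, and unlike the CGI script later in this post it compiles the pattern with <code>qr//</code> instead of a string <code>eval</code>.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: run a user-supplied pattern on the server's own
# Perl regex engine and report every match with its start offset.
sub find_matches {
    my ($pattern, $flags, $text) = @_;
    $flags = '' unless defined $flags;

    # Accept only ordinary modifier letters.
    return +{ error => "unsupported flags: $flags" }
        unless $flags =~ /^[imsx]*$/;

    # Inline modifiers, e.g. (?i), carry the flags into the pattern.
    my $src = length $flags ? "(?$flags)$pattern" : $pattern;
    my $re  = eval { qr/$src/ };          # an invalid pattern dies here
    return +{ error => "$@" } unless defined $re;

    my @found;
    while ($text =~ /$re/g) {
        push @found, {
            start => $-[0],
            text  => substr($text, $-[0], $+[0] - $-[0]),
        };
    }
    return +{ matches => \@found };
}

my $result = find_matches('\d+', '', 'a1 b22 c333');
print "$_->{start}: $_->{text}\n" for @{ $result->{matches} };
```

<p>A CGI wrapper would read <code>regex</code>, <code>mode</code>, and <code>text</code> from the request, call a helper like this, and print the result as HTML for the jQuery front end to display.</p>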
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Current progress</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
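<li>I have drawn a simple interface with HTML+jQuery and implemented a CGI program for perl 5.10 that can match, replace, and split.</li>
<li>Not yet implemented: automatic code snippet generation, and versions for other languages.</li>
<li>For my own purposes it is basically usable already. I am eating my own dog food right now, using it and improving it as I go. Releasing it for everyone, though, will still take a long stretch of feature work and interface polishing.</li>
<li>See the screenshots at the end of the post.</li>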
</ul>
</blockquote>
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Perl CGI code and brief notes</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Code</h3>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #000000; font-weight: bold;">use</span> CGI<span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> cl <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">==</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;#0ff&quot;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">==-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> h_color<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #000066;">shift</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">*=-</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$color</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$counter</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">?</span> <span style="color: #ff0000;">&quot;#ff0&quot;</span> <span style="color: #339933;">:</span> <span style="color: #ff0000;">&quot;#0ff&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">&quot;&lt;span style='background-color:$color'&gt;&quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$a</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;&lt;/span&gt;&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">=</span>CGI<span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">die</span> <span style="color: #ff0000;">&quot;$!&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">header</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">-</span>type<span style="color: #339933;">=&gt;</span><span style="color: #ff0000;">&quot;text/html; charset=UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;regex&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;">#quit immediately if no $regex input</span><br />
<span style="color: #000066;">die</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$regex</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;text&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$mode</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;mode&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;space&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$action</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;action&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #0000ff;">$regex</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\s+//g</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$x</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;match&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'$text =~ s@$regex'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@&amp;h_color($&amp;)'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span><span style="color: #339933;">.=</span><span style="color: #ff0000;">'@eg'</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$mode</span><span style="color: #339933;">.</span><span style="color: #ff0000;">';'</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br /&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$text</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$replace</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$q</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">param</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;replace&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\$</span>text =~ s:<span style="color: #000099; font-weight: bold;">\$</span>regex:$replace:g;&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #ff0000;">&quot;$code&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">s</span><span style="color: #666666; font-style: italic;">#\n#&lt;br/&gt;#g;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;&lt;pre&gt;$text&lt;/pre&gt;&quot;</span> <span style="color: #b1b100;">unless</span> <span style="color: #0000ff;">$@</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">elsif</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$action</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;split&quot;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#@result=split(m@$regex@mode, $text);</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">=</span><span style="color: #ff0000;">&quot;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=split(m@$regex@'</span><span style="color: #339933;">.</span> <span style="color: #0000ff;">$mode</span> <span style="color: #339933;">.</span> <span style="color: #ff0000;">', $text);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'@result=grep /\S/, @result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'my $count=@result;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;font color=\&quot;#ff008c\&quot;&gt;$count&lt;/font&gt; record(s) returned:&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;ol&gt;&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;li&gt;&quot;.&amp;h_color($_).&quot;&lt;/li&gt;&quot; foreach (@result);'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$code</span> <span style="color: #339933;">.=</span> <span style="color: #ff0000;">'print &quot;&lt;/ol&quot;;'</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">eval</span> <span style="color: #0000ff;">$code</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
</blockquote>
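<p>The code is fairly concise. It receives the parameters and processes them lightly, then dynamically generates code for the requested action (match/replace/split), runs it with eval, and returns the output. The match and split branches also include a small highlighting feature. Thanks to Perl's compact, efficient code, the implementation is not long. Thanks to <a href="http://twitter.com/cnhacktnt">cnhacktnt</a> for the help.</p>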
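<p>The generate-then-eval step can be shown in isolation. The sketch below is not the script above: <code>mark_matches</code> is a hypothetical helper that wraps matches in square brackets instead of colored spans, and it adds one check the original does not have, namely that the modifier string contains only modifier letters, since that string is pasted directly into the generated source.</p>

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the "match" branch: build a substitution
# as source text, eval it, and report failure through $@.
sub mark_matches {
    my ($regex, $mode, $text) = @_;

    # $mode becomes part of the generated code, so allow letters only.
    return (undef, "bad mode") unless $mode =~ /^[imsx]*$/;

    # Single quotes keep $text, $regex and $& literal until eval runs.
    my $code = '$text =~ s@$regex@"[" . $& . "]"@eg' . $mode . '; 1';

    my $ok = eval $code;                  # undef if the regex is invalid
    return (undef, $@) unless $ok;
    return ($text, undef);
}

my ($out, $err) = mark_matches('\d+', '', 'a1 b22');
print defined $out ? "$out\n" : "error: $err\n";   # a[1] b[22]
```

<p>The trailing <code>; 1</code> makes a successful eval return a true value even when the substitution matches nothing, so an empty result is not mistaken for an error.</p>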
</blockquote>
<h2 style="background-color:#99CC00; border:1px solid #666666;color:#000000;font-size:21px;line-height:35px;padding-top:3px;text-indent:6px;">Screenshots</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<ul>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式 | Building your own regular-expression helper"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式 | Building your own regular-expression helper"><img src="http://i293.photobucket.com/albums/mm60/zhasm/match_cn.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式 | Building your own regular-expression helper"><img src="http://i293.photobucket.com/albums/mm60/zhasm/replace.png" border="0" alt="Photobucket"></a></li>
<li><a href="http://iregex.org/blog/diy-regexbuddy.html" target="_blank" title="我爱正则表达式 | Building your own regular-expression helper"><img src="http://i293.photobucket.com/albums/mm60/zhasm/split_cn.png" border="0" alt="Photobucket"></a></li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/diy-regexbuddy.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>better downloader in perl</title>
		<link>http://iregex.org/blog/better-downloader-in-perl.html</link>
		<comments>http://iregex.org/blog/better-downloader-in-perl.html#comments</comments>
		<pubDate>Sun, 14 Mar 2010 16:08:12 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[perl curl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=80</guid>
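		<description><![CDATA[Over the past two days I wrote a Perl program that takes a URL, downloads the file it points to, computes its MD5, and checks whether the file is a known virus; if there is no record, it calls another Perl script to upload the file to a website for detailed testing. What I want to say is that these features might... ]]></description>
			<content:encoded><![CDATA[<p>Over the past two days I wrote a Perl program that takes a URL, downloads the file it points to, computes its MD5, and checks whether the file is a known virus; if there is no record, it calls another Perl script to upload the file to a website for detailed testing. What I want to say is that these features might be more directly written in bash; using Perl to do bash's job always feels a bit like overstepping. But I am very interested in Perl right now and use it whenever I get the chance. Besides, posting immature code gives my future self a chance to look down on my present self :)</p>
<p>One small discovery: <a href="http://perldoc.perl.org/perlop.html#Quote-Like-Operators">qx</a> runs the enclosed text as a shell command and returns its output. The book also says that <a href="http://perldoc.perl.org/functions/eval.html">eval</a> is extremely useful, though I did not need it this time; I will look for a chance to try it next time.</p>
<p>I used shell commands rather than modules for curl, md5, and so on because the modules are not installed in the virtual machine I use, whereas the commands are standard executables under bash. Written this way, efficiency takes a small hit, but the script is self-contained and portable.</p>
<p>Enough talk; here is the code.</p>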
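<p>As a small illustration of <code>qx</code>, the sketch below captures a command's output and checks its exit status through <code>$?</code>. The helper names <code>run_cmd</code> and <code>sh_quote</code> are invented for this example, and <code>echo</code> and <code>printf</code> stand in for whatever commands the real script runs; a Unix-like shell is assumed.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: quote a string so the shell sees it as one word.
sub sh_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;                    # close, escape, reopen the quote
    return "'$s'";
}

# Hypothetical helper: run a command with qx and return its output
# without the trailing newline, or nothing if the command failed.
sub run_cmd {
    my ($cmd) = @_;
    my $out = qx($cmd);
    return if $? != 0;                    # non-zero status or failed start
    chomp $out;
    return $out;
}

print run_cmd("echo hello"), "\n";
print run_cmd("printf %s " . sh_quote("a b'c")), "\n";
```

<p>Quoting matters as soon as a URL or file name from outside is interpolated into the command string, because the shell would otherwise interpret characters such as <code>&amp;</code> and <code>?</code>.</p>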
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<br />
<span style="color: #666666; font-style: italic;">#this script offers a better download service</span><br />
<span style="color: #666666; font-style: italic;">#integrating virustotal info, send-sample.pl</span><br />
<span style="color: #666666; font-style: italic;">#by rex.zhang </span><br />
<span style="color: #666666; font-style: italic;">#on 03-11-2010 in Shanghai</span><br />
<span style="color: #666666; font-style: italic;">#updated @003122010;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$ARGV</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#it is assumed that the filename is the last part of the url, </span><br />
<span style="color: #666666; font-style: italic;">#just after the last / and before the $</span><br />
main<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #000000; font-weight: bold;">sub</span> main<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#print help message and quit if no url input;</span><br />
&nbsp; &nbsp; help<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$md5</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$size</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$name</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$url</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/\/([^\/]+)$/</span><span style="color: #339933;">;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get the filename from the url; </span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$name</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get filesize from the url; if no size got, quit.</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$size</span><span style="color: #339933;">=</span>get_filesize<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">exit</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$size</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#try 5 times at most to get the file.</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$try</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">5</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$ok</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$try</span><span style="color: #339933;">--</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$ok</span><span style="color: #339933;">=</span>download_file<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">last</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$ok</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$ok</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;can not download the file, quit!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#get the md5 locally</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$md5</span><span style="color: #339933;">=</span>get_md5<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$name</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#and the url link from $virustotal;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span>get_vt_link<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#and even the virus info;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$info</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span>get_vt_info<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$info</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; v_test<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$name</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>Sample has been sent to vtest. <span style="color: #000099; font-weight: bold;">\n</span>thanks for using.<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#return the md5 value of the file.</span><br />
<span style="color: #666666; font-style: italic;">#the filename is in the current directory</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_md5<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$filename</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$md5</span><span style="color: #339933;">=</span><span style="color: #ff0000;">`md5sum &quot;$filename&quot;`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$md5</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/^(\w{32})/</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>the md5 of the $filename is:<span style="color: #000099; font-weight: bold;">\t</span> $1.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#get the virustotal link with the given md5 value;</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_vt_link<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #ff0000;">`curl -s -e &quot;https://www.virustotal.com&quot; -d &quot;x=80&amp;y=23&amp;hash=$md5&quot; &quot;http://www.virustotal.com/vt/en/consultamd5&quot; | grep href`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$bool</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span> <span style="color: #0000ff;">$link</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/href=&quot;([^&quot;]+)&quot;/i</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$bool</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #ff0000;">&quot;http://www.virustotal.com&quot;</span><span style="color: #339933;">.</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#get the virus info from a VirusTotal report link</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_vt_info<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$sophos</span> <span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#if sophos has no detection, do the vtest.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">qx</span><span style="color: #009900;">&#123;</span>curl <span style="color: #339933;">-</span><span style="color: #000066;">s</span> <span style="color: #0000ff;">$url</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$line</span><span style="color: #339933;">.=</span><span style="color: #0000ff;">$_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #339933;">!&lt;</span>tr<span style="color: #009900;">&#91;</span><span style="color: #339933;">^&gt;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">*&gt;</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009999;">&lt;td&gt;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>Sophos<span style="color: #339933;">|</span>Symantec<span style="color: #339933;">|</span>TrendMicro<span style="color: #339933;">|</span>McAfee<span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;/</span>td<span style="color: #339933;">&gt;.*?&lt;/</span>tr<span style="color: #339933;">&gt;!</span>sig<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>No record in VirusTotal. Sending v-test...<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>virustotal record found as following:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@result</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$tmp</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$tmp</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(&lt;[^&gt;]+&gt;\s*)+/\t/sig</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$tmp</span><span style="color: #339933;">.</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$tmp</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">m/(?:Sophos[.\s0-9]+)(?!-)(\S+)\s*$/i</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>Sophos has detection as $sophos, no v-test needed.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;you can read the virus details here:<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>$url<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$sophos</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#get the filesize by using the curl -I option.</span><br />
<span style="color: #666666; font-style: italic;">#print the filesize if it is greater than a given value, 2MB by default.</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> get_filesize<span style="color: #009900;">&#123;</span> <br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$size</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$unit</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">qx</span><span style="color: #009900;">&#123;</span>curl <span style="color: #339933;">-</span><span style="color: #000066;">s</span> <span style="color: #339933;">-</span>I <span style="color: #0000ff;">$url</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/Content-Length:\s(\d+)/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$size</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/Accept-Ranges:\s(\w+)/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$unit</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$size</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;can not get the length of the file, exit!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$size</span><span style="color: #339933;">=</span><span style="color: #000066;">int</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$size</span> <span style="color: #339933;">/</span> <span style="color: #cc66cc;">1024</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$flag</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#if flag=1 it is too large ; </span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$size</span><span style="color: #339933;">&gt;</span><span style="color: #cc66cc;">1024</span> <span style="color: #339933;">*</span> <span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$flag</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;the file is ${size}KB and greater than 2MB!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;it is too large to be a virus. exit.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>The file size is $size kb, downloading...<span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$flag</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#simply download the file</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> download_file<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$url</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/([^\/]+)$/</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$filename</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #ff0000;">`curl -O &quot;$url&quot;`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #b1b100;">not</span> <span style="color: #339933;">-</span>e <span style="color: #0000ff;">$filename</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;no filename, retrying...<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">return</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>file is downloaded and saved as $filename.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#print the help message and exit if no url input</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> help<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;Usage: ./dl http://.../file.exe<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>the last part of the url is regarded as filename.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#send the v-test email, with md5 value in the subject.</span><br />
<span style="color: #000000; font-weight: bold;">sub</span> v_test<br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$file</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$md5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">`zip $file.zip $file`</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">`send-sample.pl -a $file.zip -s &quot;$file.zip URL MD5 $md5&quot;`</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
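<p>The <code>get_md5</code> routine above shells out to <code>md5sum</code>. As a hedged alternative, the sketch below computes the same digest with the core <code>Digest::MD5</code> module, which avoids passing the filename through a shell. The name <code>get_md5_native</code> is hypothetical and not part of the original script.</p>

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical replacement for get_md5: return the lowercase hex MD5 of a
# file, or undef (in scalar context) if the file cannot be opened.
sub get_md5_native {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or return;
    binmode($fh);    # hash the raw bytes, not text-mode data
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $md5;
}
```

<p>A caller could then write <code>my $md5 = get_md5_native($name)</code> and treat an undefined result as a failed download.</p>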
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/better-downloader-in-perl.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Rethinking How to Extract Regular Expressions from Plain Text</title>
		<link>http://iregex.org/blog/text-2-regular-expressions-again.html</link>
		<comments>http://iregex.org/blog/text-2-regular-expressions-again.html#comments</comments>
		<pubDate>Mon, 08 Mar 2010 18:32:19 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[array]]></category>
		<category><![CDATA[engine]]></category>
		<category><![CDATA[hash]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[recursive]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=79</guid>
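		<description><![CDATA[rex's note: Since finishing the previous article, I have kept thinking about how to actually implement deriving a regular expression from plain text. I took many detours and learned a good deal along the way, for example the complex data structures, anonymous hashes and arrays, and references from the Perl panther book... ]]></description>
			<content:encoded><![CDATA[<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">rex's note:</h3>
<p>Since finishing <a id="vt:b" title="A personal use case: from literal strings to regex" href="http://iregex.org/blog/literal-text-to-regex.html"><span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">the previous article</span></span></a><span class="Apple-style-span" style="font-style: normal;">, I have kept thinking about how to actually implement deriving a regular expression from plain text. I took many detours and learned a good deal along the way: the complex data structures, anonymous hashes and arrays, and references from the Perl panther book; state-machine construction from the purple dragon book; and graph theory from data-structures texts all proved useful. I also learned to use </span><a id="mk58" title="graphviz" href="http://www.graphviz.org/"><span class="Apple-style-span" style="font-style: normal;">graphviz</span></a><span class="Apple-style-span" style="font-style: normal;">. It used to seem mysterious, but once I tried it I found it quite intuitive. The figures in this article were drawn with </span><a id="a1la" title="online version of graphviz" href="http://graph.gafol.net/create"><span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">the online version of graphviz</span></span></a><span class="Apple-style-span" style="font-style: normal;">.</span></p>
<p>Besides the graph-based implementation described in this article, I also wrote a simpler one based on keywords. It reads the text line by line and splits each line on \s+ (or a similar delimiter) into an array. It then intersects all the arrays (using a hash) to find the keywords present in every line, and builds, from left to right, a regex of the form ^(.*?)keyword1(.*?)keyword2&#8230;keywordN(.*?)$. Matching that regex against each line yields a hash of hashes (or an array of arrays); transposing it and joining each column as alternatives gives a regex such as ^(option1|option2&#8230;)keyword1(option..)&#8230;$. Finally, as a check, the generated regex is tested against every line.</p>
<p>After this word-level pass, the text is split into individual letters to recursively process the <span class="Apple-style-span" style="color: #474747;"><span class="Apple-style-span" style="font-style: normal;">(option1|option2&#8230;) parts. Working at the word level first and the letter level second helps find the largest repeated content first; the coarse and fine passes follow the same logic and differ only in granularity.</span></span></p>
<p><span class="Apple-style-span" style="color: #474747;"> Beginners, please read with caution; experts, please share your advice. My approach is not necessarily correct or optimal.</span></p></blockquote>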
<p><span id="more-79"></span></p>
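<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>The keyword-based procedure from the note above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's actual code: the function name <code>regex_from_keywords</code> is hypothetical, keywords are taken in the order they appear in the first line, and <code>\b</code> word boundaries are added to the skeleton so that a short keyword does not match inside a longer word.</p>

```perl
use strict;
use warnings;

# Keyword-based sketch: words common to every line become fixed anchors,
# and whatever varies between them becomes an alternation.
sub regex_from_keywords {
    my @lines = @_;

    # Count, per word, how many lines contain it (at most once per line).
    my %seen_in;
    for my $line (@lines) {
        my %in_line = map { $_ => 1 } split ' ', $line;
        $seen_in{$_}++ for keys %in_line;
    }

    # Keywords are the words present in every line, in first-line order.
    my %used;
    my @keywords = grep { $seen_in{$_} == @lines && !$used{$_}++ }
                   split ' ', $lines[0];

    # Skeleton of the form ^(.*?)kw1(.*?)kw2...(.*?)$
    my $skeleton = '^(.*?)'
                 . join('', map { "\\b\Q$_\E\\b(.*?)" } @keywords) . '$';

    # Match each line and collect the distinct text seen in every slot.
    my @slots;
    for my $line (@lines) {
        my @parts = $line =~ /$skeleton/ or return;   # keyword order differs
        for my $i (0 .. $#parts) {
            push @{ $slots[$i] }, $parts[$i]
                unless grep { $_ eq $parts[$i] } @{ $slots[$i] || [] };
        }
    }

    # Rebuild: one alternation per slot, with the keywords in between.
    my @alts = map { '(?:' . join('|', map { quotemeta($_) } @$_) . ')' } @slots;
    my $re = '^' . shift(@alts);
    $re .= quotemeta(shift @keywords) . shift(@alts) while @alts;
    return $re . '$';
}

my @sample = ('this is a red fox', 'this is a blue firefox',
              'this is a pig', 'a red fox');
my $generated = regex_from_keywords(@sample);

# Final check, as in the description: the result must match every line.
for my $line (@sample) {
    print "$line: ", ($line =~ /$generated/ ? 'ok' : 'FAILED'), "\n";
}
```

<p>For these four sample lines the only word common to all of them is <code>a</code>, so the sketch produces one alternation before it and one after it. The letter-level recursive pass is not included here.</p>
</blockquote>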
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Problem</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>Given a text file text.txt with the following content:</div>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<div>this is a red fox</div>
<div>this is a blue firefox</div>
<div>this is a pig</div>
<div>a red fox</div>
</blockquote>
<div>Write a program that, based on the text, automatically constructs a (reasonably sensible) regular expression that matches <strong>every line</strong> in the file.</div>
</blockquote>
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">The Target Regex</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>Two extreme solutions are unacceptable:</p>
<ol>
<li><span class="Apple-style-span" style="color: #ff00ff;">^.*$</span></li>
<li><span class="Apple-style-span" style="color: #ff00ff;">^(this is a red fox|this is a blue firefox|this is a pig|a red fox)$</span></li>
</ol>
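<p>The first is too broad and the second too narrow. Too broad, and it lets everything through, matching any text at all; too narrow, and it is rigid and inflexible. A good regular expression is derived from the sample text (it extracts the pattern in the samples) yet goes beyond it (it also matches other text that follows the same pattern). What to match and what to exclude should both follow definite rules; as the saying goes, &quot;a gentleman knows what to do and what not to do&quot; (I seem to be digressing :)).</p>
<div>So what does a reasonably reliable regular expression look like? For the example above, it could be:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">^(this is )?a (red fox|blue firefox|pig)$</span></div>
<div>Let us now work toward this reference answer.</div>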
</blockquote>
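<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;">
<p>To make the difference between the three candidates concrete, here is a small check, written as a sketch. The helper names <code>matches_all</code> and <code>yes_no</code> and the two extra probe strings (<code>a pig</code>, <code>hello world</code>) are illustrative assumptions, not part of the original article.</p>

```perl
use strict;
use warnings;

my @lines = ('this is a red fox', 'this is a blue firefox',
             'this is a pig', 'a red fox');

my %candidates = (
    too_broad  => qr/^.*$/,
    too_narrow => qr/^(this is a red fox|this is a blue firefox|this is a pig|a red fox)$/,
    target     => qr/^(this is )?a (red fox|blue firefox|pig)$/,
);

# True if the regex matches every one of the given strings.
sub matches_all {
    my ($re, @strings) = @_;
    return !grep { $_ !~ $re } @strings;
}

sub yes_no { return $_[0] ? 'yes' : 'no' }

for my $name (sort keys %candidates) {
    my $re = $candidates{$name};
    printf "%-10s all samples: %-3s  'a pig': %-3s  'hello world': %s\n",
        $name,
        yes_no(matches_all($re, @lines)),
        yes_no(scalar('a pig' =~ $re)),          # same pattern, unseen line
        yes_no(scalar('hello world' =~ $re));    # unrelated text
}
```

<p>All three candidates match the four sample lines. Only the broad one also accepts <code>hello world</code>, and only the narrow one rejects <code>a pig</code>, a line that follows the same pattern but does not appear in the file.</p>
</blockquote>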
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Approach</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>Any complex circuit diagram can be broken down into three simple relationships: series, parallel, and short circuit. The same holds for regular expressions.</div>
<p>Since a single regex must match all of the text, that regex (call it <span class="Apple-style-span" style="color: #ff00ff;">$re</span>) must also match the first line.</p>
<p>The first line is this is a red fox. So what <span class="Apple-style-span" style="color: #ff00ff;">^this is a red fox$</span> matches should be a (proper) subset of what <span class="Apple-style-span" style="color: #ff00ff;">$re</span> matches. Its path is <span class="Apple-style-span" style="color: #ff00ff;">&quot;^&quot;-&gt;this-&gt;is-&gt;a-&gt;red-&gt;fox-&gt;&quot;$&quot;</span>. All the nodes are in series, simply arranged from left to right.</p>
<p>The diagram looks like this (click to view the full-size image; the same applies to the figures below):</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001329.png" target="_blank"><img style="margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001329.png" border="0" alt="Photobucket" /></a></p>
<p>Likewise, the second line should also be a subset of <span class="Apple-style-span" style="color: #ff00ff;">$re</span>. However, since the path <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span> already exists, a branch appears at a: <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;blue-&gt;firefox-&gt;$</span>.</p>
<p>Adding this path to the diagram gives:</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001747.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309001747.png" border="0" alt="Photobucket" /></a></p>
<p>Clearly, these two parallel branches start at a and end at $, so they can be joined with |.</p>
<p>Now let us summarize the rule:</p>
<div><span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;"><strong><span class="Apple-style-span"><span class="Apple-style-span" style="background-color: #6fa8dc;">Parallel</span></span></strong>: if A-&gt;B-&gt;C exists and A-&gt;D-&gt;C also exists, then B and D are in parallel. That is, the branches share the same start point and the same end point, and each has at least one node between the two. Parallel branches are written in parentheses and separated by |. For example, <span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;">A-&gt;B-&gt;C and A-&gt;D-&gt;C can be expressed as the regex A(B|D)C.</span></span></span></span></div>
<div><span class="Apple-style-span" style="font-family: arial,sans-serif;"><span class="Apple-style-span" style="color: #000000;">Why stress at least one node? I will keep you in suspense for now; please read on.</span></span></div>
<p>Next comes this is a pig. Likewise, we only need to add the branch <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;pig-&gt;$</span> to the existing graph. The diagram now looks like this:</p>
<p><a style="color: #0071bb; margin-left: 0px; margin-right: 0px;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309002851.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309002851.png" border="0" alt="Photobucket" /></a></p>
<div>
The last line is a red fox. It looks complicated, but it merely adds one new path, <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;a</span>; the existing path <span class="Apple-style-span" style="color: #ff00ff;">a-&gt;red-&gt;fox-&gt;$</span> can be reused. The complete diagram is now:</div>
<p><a style="color: #ed1e24; margin-left: 0px; margin-right: 0px; text-decoration: none;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" target="_blank"><img style="border-color: initial; border-style: initial; margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" border="0" alt="Photobucket" /></a></p>
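<p>At this point a new situation appears: there are two paths, <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;a</span> and <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span>. Recalling circuit diagrams from middle-school physics, we can call this a "short circuit": between the nodes ^ and a of the path <span class="Apple-style-span" style="color: #ff00ff;">^-&gt;this-&gt;is-&gt;a</span>, an unobstructed channel has been added that bypasses this and is entirely, which makes the path <span class="Apple-style-span" style="color: #ff00ff;">this-&gt;is</span> <strong>optional</strong>. To summarize the rule again:</p>
<p>If there is a path A-&gt;B-&gt;&#8230;-&gt;C-&gt;D and also an edge A-&gt;D, we say there is a short circuit between A and D. In that case B-&gt;&#8230;-&gt;C can be written as (B-&gt;&#8230;-&gt;C)? (the parentheses enclose the short-circuited part and the question mark makes it optional).</p>
<div>Between nodes A and D there can be at most one short circuit, but there may be one or more parallel branches.</div>
<div>That completes the analysis, which yields this regex:</div>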
<div><span class="Apple-style-span" style="color: #ff00ff;">^(this is )?a (red fox|blue firefox|pig)$</span></div>
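<p>This is also why the definition of parallel branches above stressed "at least one node": a branch with no intermediate node is a short circuit, not a parallel branch.</p>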
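<p>As a quick sanity check, the derived pattern can be tried against the four sample sentences. This is only a minimal sketch; the variable names are illustrative and not taken from the original program.</p>

```perl
use strict;
use warnings;

# The four sample sentences from the walkthrough.
my @samples = (
    'this is a red fox',
    'this is a blue firefox',
    'this is a pig',
    'a red fox',
);

# The pattern read off the graph: an optional "this is " prefix
# (the short circuit) followed by three parallel branches.
my $re = qr/^(this is )?a (red fox|blue firefox|pig)$/;

# Collect any sample the pattern fails to match.
my @unmatched = grep { $_ !~ $re } @samples;
print scalar(@unmatched) ? "unmatched: @unmatched\n" : "all samples match\n";
```

<p>Note that the pattern generalizes slightly beyond the input: it also accepts "a pig" and "a blue firefox", because the graph merges the shared node a.</p>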
<div>
<p>To refine the result further, we can apply the same analysis <strong><span class="Apple-style-span" style="color: #ff00ff;">recursively</span></strong> to <span class="Apple-style-span" style="color: #ff00ff;">red fox|blue firefox|pig</span>, which yields <span class="Apple-style-span" style="color: #ff00ff;">(red |blue fire)fox|pig</span>.</p></blockquote>
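<p>To see that the factored alternation accepts the same strings as the flat one, both can be run over a handful of probes. A minimal sketch; the probe list is my own and deliberately includes near misses.</p>

```perl
use strict;
use warnings;

# Flat alternation and the recursively factored form, both anchored.
my $flat     = qr/^(?:red fox|blue firefox|pig)$/;
my $factored = qr/^(?:(?:red |blue fire)fox|pig)$/;

# Probes: the three real branches plus a few strings that must fail.
my @probes = ('red fox', 'blue firefox', 'pig', 'red firefox', 'blue fox', 'fox');

# Keep every probe on which the two patterns disagree.
my @differ = grep {
    ($_ =~ $flat ? 1 : 0) != ($_ =~ $factored ? 1 : 0)
} @probes;

print scalar(@differ) ? "patterns differ on: @differ\n" : "patterns agree on all probes\n";
```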
<h2 style="background-color:#99CC00; font-size:14px; padding-bottom:3px; padding-left:10px; padding-top:3px;  line-height:1.5em; margin:1.5em 0 1em;">Implementation</h2>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><div>With the idea in place, the programming is straightforward. In Perl one could, admittedly, use a fairly concise nested hash to represent the links between nodes:</div>
<div>For example:</div>
<div>my $hash;</div>
<div style="margin-left: 0px; margin-right: 0px;">$$hash{"^"}{"this"}{"is"}{"a"}{"red"}{"fox"}{"\$"}="";</div>
<div>$$hash{"^"}{"this"}{"is"}{"a"}{"blue"}{"firefox"}{"\$"}="";</div>
<p>&#8230;</p>
<div>However, adding, deleting, and modifying nodes is all a hassle. (I was lost in the hash maze for a long time before climbing out.)</div>
<div>I took some time to brush up on <strong>directed graphs</strong> and found that the problem can be simplified as follows.</div>
<p><a style="color: #ed1e24; margin-left: 0px; margin-right: 0px; text-decoration: none;" href="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" target="_blank"><img style="margin-left: 0px; margin-right: 0px;" src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/20100309003225.png" border="0" alt="Photobucket" /></a></p>
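<p>The figure above is in fact a directed graph. We only need to record the set of all vertices and the set of all edges, then work out how the paths relate to one another; printing the result gives the regex we are after.</p>
<div>The vertex set is:</div>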
<div><span class="Apple-style-span" style="color: #ff00ff;">(^, this, is, a, red, fox, blue, firefox, pig, $);</span></div>
<div>The edge set is:</div>
<div><span class="Apple-style-span" style="color: #ff00ff;">(^-&gt;this, this-&gt;is,&#8230;)</span></div>
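<p>Both sets can be built in a single pass while reading the lines of the text file, which is not complicated. The key is determining the relationships.</p>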
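<p>Building the two sets in one pass might look like the sketch below. It is not the original program: the sentences are hard-coded instead of read from a file, and storing each edge as a "from-&gt;to" string key is my own choice.</p>

```perl
use strict;
use warnings;

# Sample input, hard-coded here instead of read from a file.
my @lines = (
    'this is a red fox',
    'this is a blue firefox',
    'this is a pig',
    'a red fox',
);

my (%vertex, %edge);    # both sets are stored as hash keys

for my $line (@lines) {
    # Wrap each sentence in the ^ and $ sentinels, then walk adjacent pairs.
    my @nodes = ('^', split(' ', $line), '$');
    $vertex{$_} = 1 for @nodes;
    for my $i (0 .. $#nodes - 1) {
        my $key = join('->', $nodes[$i], $nodes[$i + 1]);
        $edge{$key} = 1;
    }
}

print "vertices: ", join(', ', sort keys %vertex), "\n";
print "edges:    ", join(', ', sort keys %edge), "\n";
```

<p>For the four sample sentences this gives the ten vertices listed above, and the edge set includes the short-circuit edge from ^ straight to a.</p>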
<div>To summarize once more:</div>
<ul>
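<li>The N branches that leave a vertex A must eventually rejoin at some point M. (They do not always rejoin at the same node. The example in this article is the simplest case, so here we can assume they all rejoin at one node.)</li>
<li>The length of each of these N branches is the number of nodes it passes through. In the graph above, for example, there are two paths from ^ to a: the upper one has length 2 and the lower one has length 0.</li>
<li>The shortest branch determines the relationship among the N branches.</li>
<li>Between any two nodes there can be at most one edge of length 0.</li>
<li>If an edge of length 0 exists, the other branches at the same level are short-circuited.</li>
<li>The branches whose length is not 0 (N-1 of them when a zero-length edge exists) are parallel to one another.</li>
<li>The whole graph starts at ^ and ends at $.</li>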
</ul>
<div>Each of these conditions and checks can be turned into a function. The actual program is omitted here.</div>
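<p>As an illustration of how the short-circuit and parallel rules could become a function, here is a minimal sketch. The helper name combine_branches is hypothetical, and it assumes each branch is given as the literal text along it, with an empty string standing for a zero-length edge and no regex metacharacters in the text.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: given the branches between two nodes (each branch
# written as the literal text on it, '' for a zero-length edge), return
# the regex fragment for that stretch of the graph.
# Assumption: branch text contains no regex metacharacters.
sub combine_branches {
    my @branches = @_;
    my @nonempty = grep { length } @branches;
    my $shorted  = @nonempty != @branches;    # true if a zero-length edge exists
    return '' unless @nonempty;

    my $body = join '|', @nonempty;
    # Parallel branches need a group; so does anything made optional.
    $body = "($body)" if @nonempty > 1 or $shorted;
    return $shorted ? "$body?" : $body;
}

my $prefix = combine_branches('this is ', '');
my $tail   = combine_branches('red fox', 'blue firefox', 'pig');
my $regex  = "^${prefix}a $tail\$";
print "$regex\n";
```

<p>On the example graph this reproduces the pattern derived by hand. A fuller version would also have to find the rejoining node M for each fork and recurse into nested forks, which this sketch leaves out.</p>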
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/text-2-regular-expressions-again.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A Personal Tool: From Literal Strings to Regex</title>
		<link>http://iregex.org/blog/literal-text-to-regex.html</link>
		<comments>http://iregex.org/blog/literal-text-to-regex.html#comments</comments>
		<pubDate>Wed, 10 Feb 2010 08:50:15 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=78</guid>
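		<description><![CDATA[At work recently I needed to convert a certain kind of literal string into a simple regex. It can of course be done by hand, but large amounts of repetitive labor are best left to the machine. Last night I wrote a script for this and am posting it here. Because it handles my own work, the scr... ]]></description>
			<content:encoded><![CDATA[<p>At work recently I needed to convert a certain kind of literal string into a simple regex. It can of course be done by hand, but large amounts of repetitive labor are best left to the machine. Last night I wrote a script for this and am posting it here. Because it handles my own work, it is posted only as a record and an archive, and may be of little practical use to others. That said, converting existing literal strings into regexes should make a nice problem, and interested readers are welcome to dig deeper.</p>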
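<p>It is worth mentioning that the code uses <font color="#FF00FF">$&#038;, (?{})</font>, which are <font color="#FF00FF">Perl-only</font> features; they clarify the logic and simplify the code. Without them the code would be roughly <strong>five times longer</strong>. As for efficiency, I had heard that after <font color="#FF00FF">use English</font>, <font color="#FF00FF">$MATCH</font> is <strong>five times faster</strong> than <font color="#FF00FF">$&#038;</font> itself. In fact <font color="#FF00FF">$MATCH</font> is only an alias for <font color="#FF00FF">$&#038;</font>, and in Perl versions before 5.20 using either one slows down every regex match in the program. For a command-line program that runs once per invocation, though, <font color="#FF00FF">$&#038;</font> is good enough.</p>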
<p><span id="more-78"></span></p>
<p>A real-world example:</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #c20cb9; font-weight: bold;">perl</span> hash2re.pl H:aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA0.zip<span style="color: #000000; font-weight: bold;">/</span>H:aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span><span style="color: #000000; font-weight: bold;">/</span>aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span><span style="color: #000000; font-weight: bold;">/</span>aaa<span style="color: #000000; font-weight: bold;">/</span>Aaaaa<span style="color: #000000; font-weight: bold;">/</span>aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-<span style="color: #000000;">0</span>.exe<br />
RE <span style="color: #000000;">1</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>\.zip$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA0.zip&quot;</span><br />
<br />
RE <span style="color: #000000;">2</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0&quot;</span><br />
<br />
RE <span style="color: #000000;">3</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0&quot;</span><br />
<br />
RE <span style="color: #000000;">4</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa&quot;</span><br />
<br />
RE <span style="color: #000000;">5</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">4</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;Aaaaa&quot;</span><br />
<br />
RE <span style="color: #000000;">6</span>: &nbsp; ^<span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">6</span><span style="color: #7a0874; font-weight: bold;">&#125;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#91;</span>a-z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">7</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span 
style="color: #7a0874; font-weight: bold;">&#91;</span>A-Z<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #000000;">3</span><span style="color: #7a0874; font-weight: bold;">&#125;</span>-<span style="color: #7a0874; font-weight: bold;">&#91;</span><span style="color: #000000;">0</span>-<span style="color: #000000;">9</span><span style="color: #7a0874; font-weight: bold;">&#93;</span>\.exe$<br />
&nbsp; &nbsp; &nbsp; &nbsp; Matches: <span style="color: #ff0000;">&quot;aaa-Aaaa-AaaaAaaaaaaAaaaaaaa-AAA-0.exe&quot;</span></div></td></tr></tbody></table></div>
<p>Source code:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<br />
<span style="color: #666666; font-style: italic;"># &nbsp; by rex zhang </span><br />
<span style="color: #666666; font-style: italic;"># &nbsp; Feb 09 2010 in Shanghai</span><br />
<br />
<span style="color: #666666; font-style: italic;"># &nbsp; usage: split and regexize hashed filename</span><br />
<span style="color: #666666; font-style: italic;">#</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$lines</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$ARGV</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$lines</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><span style="color: #666666; font-style: italic;">#(C:[^/]+)#)</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$c</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$lines</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\Q$c\E//</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;ClearText Filename Ignored:<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\&quot;</span>$c<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@array</span><span style="color: #339933;">=</span><span style="color: #000066;">split</span><span style="color: #009900;">&#40;</span><span style="color: #000066;">m</span><span style="color: #339933;">!</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">?:</span>\<span style="color: #339933;">/|</span>H<span style="color: #339933;">:</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*!,</span> <span style="color: #0000ff;">$lines</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$counter</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$line</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@array</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">next</span> <span style="color: #b1b100;">if</span> <span style="color: #b1b100;">not</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$re</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$line</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">local</span> <span style="color: #0000ff;">$len</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<br />
<br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(?=[.\[\]()])/\\/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\?/./g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/0+(?{ $len=length($&amp;)})/[0-9]\{$len\}/g</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/A+(?{ $len=length($&amp;)})/[A-Z]\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/a+(?{ $len=length($&amp;)})/[a-z]\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/(.)\1+(?{ $len=length($&amp;)})/$1\{$len\}/g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/\{1\}//g</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #0000ff;">$re</span> <span style="color: #339933;">=</span> &nbsp;<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\^</span>$re<span style="color: #000099; font-weight: bold;">\$</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #0000ff;">$counter</span><span style="color: #339933;">++;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$line</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/$re/</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;RE $counter:<span style="color: #000099; font-weight: bold;">\t</span>$re<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>Matches: <span style="color: #000099; font-weight: bold;">\&quot;</span>$line<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">else</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;RE $counter:<span style="color: #000099; font-weight: bold;">\t</span>$re<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>Failed: <span style="color: #000099; font-weight: bold;">\&quot;</span>$line<span style="color: #000099; font-weight: bold;">\&quot;</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
&nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
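<p>For comparison, the run-length step can also be sketched without (?{}), by capturing each run and computing its length with the /e modifier. The helper name to_regex is hypothetical, and the sketch leaves out the script's handling of ? and of other repeated characters.</p>

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the core substitutions of the script,
# but computing each run length with s///e and length($1).
sub to_regex {
    my ($text) = @_;
    my $re = $text;
    $re =~ s/(?=[.\[\]()])/\\/g;                     # escape metacharacters
    $re =~ s/(0+)/'[0-9]{' . length($1) . '}'/ge;    # runs of 0 become digit classes
    $re =~ s/(A+)/'[A-Z]{' . length($1) . '}'/ge;    # runs of A become upper-case classes
    $re =~ s/(a+)/'[a-z]{' . length($1) . '}'/ge;    # runs of a become lower-case classes
    $re =~ s/\{1\}//g;                               # a count of 1 is redundant
    return "^$re\$";
}

print to_regex('Aaaaa'), "\n";
print to_regex('aaa-AAA0.zip'), "\n";
```

<p>The trade-off is one extra capture group per substitution in exchange for not touching $&#038; at all.</p>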
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/literal-text-to-regex.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Two Ways to Count Duplicate Lines of Text</title>
		<link>http://iregex.org/blog/get-duplicated-lines.html</link>
		<comments>http://iregex.org/blog/get-duplicated-lines.html#comments</comments>
		<pubDate>Sat, 06 Feb 2010 07:09:43 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=77</guid>
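		<description><![CDATA[Suppose the sample file a.txt contains the following: hello world! hello world! I love regex. hello world! I love regex. hello world! A quick look shows that hello world! appears on 4 lines and I love regex. on 2. How can we use regular expressions to write a program that counts... ]]></description>
			<content:encoded><![CDATA[<p>Suppose the sample file <font color="#FF00FF">a.txt</font> contains the following:</p>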
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br /></div></td><td><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">hello world!<br />
hello world!<br />
I love regex.<br />
hello world!<br />
I love regex.<br />
hello world!</div></td></tr></tbody></table></div>
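<p>A quick look shows that <font color="#FF00FF">hello world!</font> appears on 4 lines and <font color="#FF00FF">I love regex.</font> on 2. How can we write a program, using regular expressions, to gather these statistics? Real files that need counting are rarely ones you can tally by eye. I came up with two approaches. The first relies on regular expressions (otherwise this article would not be posted here); in the second, a hash table plays the lead and regular expressions play a supporting role.<span id="more-77"></span></p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">The regular-expression solution</h3>
<p>The idea: for each line of text, check whether an identical line appears anywhere after it, from zero lines later up to EOF. If so, record the line's content and count its occurrences, then delete all of those lines and move on to the next line. Finally, print the counts.</p>
<p>Below is the corresponding Perl program, with comments.</p>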
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br /></div></td><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl </span><br />
<span style="color: #666666; font-style: italic;">#usage: &nbsp;./dup_re.pl &lt;a.txt</span><br />
<br />
<span style="color: #000066;">undef</span> <span style="color: #0000ff;">$/</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># enable &quot;slurp&quot; mode</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$file</span> <span style="color: #339933;">=</span> <span style="color: #009999;">&lt;STDIN&gt;</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;"># whole file now here</span><br />
<br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$file</span> <span style="color: #339933;">=~</span> <span style="color: #000066;">m</span><br />
&nbsp; &nbsp; <span style="color: #339933;">/</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#for each line;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore the whitespaces at both ends; </span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">\S</span><span style="color: #339933;">.*?</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#capture the line content in $1; \S skips empty lines</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; \<span style="color: #000066;">s</span><span style="color: #339933;">*</span>$ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#trailing whitespace up to the end of the line</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">.*?</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#check if there is the same pattern of $1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #339933;">^</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span>\<span style="color: #cc66cc;">1</span>\<span style="color: #000066;">s</span><span style="color: #339933;">*</span>$ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#after 0 or more lines;</span><br />
&nbsp; &nbsp; <span style="color: #339933;">/</span>smx<span style="color: #009900;">&#41;</span> <br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$line</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$1</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br />
&nbsp; &nbsp; <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$count</span><span style="color: #339933;">=</span> <span style="color: #0000ff;">$file</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">s/^\s*\Q$line\E\s*$//mg</span><span style="color: #339933;">;</span> &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#delete the duplicated lines (whole lines only, matched literally)</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save the number to $count;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000066;">print</span> <span style="color: #0000ff;">$count</span><span style="color: #339933;">,</span><span style="color: #ff0000;">&quot; times:<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$line</span><span style="color: #339933;">,</span><span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Hash Table Solution</h3>
<p>This approach takes advantage of Perl's powerful built-in hash tables. The idea is as follows:</p>
<ul>
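<li>Create an empty hash table;</li>
<li>Read the file line by line;</li>
<li>Use the text of each line as the key and insert it into the table. On its first occurrence the value becomes 1; on every later occurrence the value is incremented.</li>
<li>Print the entries in the hash table whose value &gt;= 2.</li>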
</ul>
<p>Perl program:</p>
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl</span><br />
<span style="color: #666666; font-style: italic;">#usage: &nbsp;./dup_hash.pl a.txt</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%hash</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">&lt;&gt;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp;<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/^\s*(\S.*?)\s*$/</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#ignore whitespaces at both ends; </span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#123;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#ignore empty lines by using \S</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">++;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save the line to $1, and count the time it appears</span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<br />
<span style="color: #666666; font-style: italic;">#sort the hash by values; </span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$key</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">sort</span> <span style="color: #009900;">&#123;</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$b</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">&lt;=&gt;</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$a</span><span style="color: #009900;">&#125;</span> <span style="color: #009900;">&#125;</span> <span style="color: #000066;">keys</span> <span style="color: #0000ff;">%hash</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$key</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#only print lines that occur more than once;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#to print every line with its count, remove the 'if' test</span><br />
&nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #000066;">printf</span> <span style="color: #ff0000;">&quot;%d times:<span style="color: #000099; font-weight: bold;">\t</span>%s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$hash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$key</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$key</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
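<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Results</h3>
<p>Save the two programs above as dup_re.pl and dup_hash.pl respectively. Because they read the input file in different ways, they are also invoked differently; see the screenshot below:<br />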
<img src="http://public.bay.livefilestore.com/y1p84mh-sSb8s2jIOokB1tAnVJQnNdmS1ir1v9A0nRbWPPZ6AdIQV896FPpKr_LNzQvJ6kJQ-Ue94wHK8LVscG8uQ/20100206_144726.png" alt="我爱正则表达式|统计重复文本行的两种方法" /></p>
<h4>Update</h4>
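<p>It just occurred to me that the script would be more useful if it could ignore case and collapse runs of whitespace between words, so that <font color="#FF00FF">Hello world!</font> and <font color="#FF00FF">      　　hello　　       WORLd!   </font> are treated as duplicate lines. I tested this, and the regex did not let me down.</p>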
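<p>The idea in this update can be sketched in a few lines. This is a minimal Python illustration, not the Perl script that was actually tested (which is not shown here); the function names <code>normalize</code> and <code>duplicated_lines</code> are made up for this example.</p>

```python
from collections import Counter


def normalize(line):
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the full-width space U+3000, and drops empty pieces,
    # so runs of whitespace collapse to a single ASCII space.
    # casefold() makes the comparison case-insensitive.
    return " ".join(line.split()).casefold()


def duplicated_lines(lines):
    # Count every normalized line, skip blank lines, and keep only
    # the entries that occur at least twice, most frequent first.
    counts = Counter(normalize(line) for line in lines)
    counts.pop("", None)
    return [(text, n) for text, n in counts.most_common() if n >= 2]


sample = [
    "Hello world!",
    "      \u3000\u3000hello\u3000\u3000       WORLd!   ",
    "",
    "something else",
]
for text, n in duplicated_lines(sample):
    print("%d times:\t%s" % (n, text))
```

<p>Normalizing the key before counting keeps the counting logic identical to the hash table solution; only the notion of "the same line" changes.</p>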
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/get-duplicated-lines.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Generating Text from a Regular Expression: REExtractor</title>
		<link>http://iregex.org/blog/reextractor.html</link>
		<comments>http://iregex.org/blog/reextractor.html#comments</comments>
		<pubDate>Tue, 02 Feb 2010 09:12:35 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[REExtractor]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=76</guid>
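		<description><![CDATA[I came across a simple and fun regular-expression application: REExtractor. You enter a regular expression and it outputs the text that the expression describes. The author describes it as "Generate all possibilities of Regular Expression"... ]]></description>
			<content:encoded><![CDATA[<p>I came across a simple and fun regular-expression application: <a id="f-4f" href="http://re2form.appspot.com/" title="我爱正则表达式|由正则式反推文本">REExtractor</a>. You enter a regular expression and it outputs the text that the expression describes. The author describes it as<br />
"Generate all possibilities of Regular Expression". In theory that is achievable, but in practice there are limits.<span id="more-76"></span></p>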
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Some Limitations</h3>
<ol>
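<li>The platform is GAE and the language is Python, so it uses Python's regex flavor. You may need a proxy to access it.</li>
<li>Supported metacharacters and shorthands: <font class="Apple-style-span" color="#FF00FF">(), [],{m,n},{n},|,\w,\d</font>. To use any of these characters literally, escape it with a backslash. Note that \w here is equivalent to <font class="Apple-style-span" color="#FF00FF">[a-zA-Z0-9]</font>, one of 62 characters, rather than the usual <font class="Apple-style-span" color="#FF00FF">[_a-zA-Z0-9]</font>, one of 63 characters including the underscore. You can use <font class="Apple-style-span" color="#FF00FF">[_\w]</font> instead, which works fine.</li>
<li>Unsupported metacharacters: <font class="Apple-style-span" color="#FF00FF">.(dot),^,$,\b,\D,\W,\1&#8230; (backreferences), (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font>, among others.</li>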
<ul>
<li>If a <font class="Apple-style-span" color="#FF00FF">.</font> dot appears, it is output literally.</li>
<li>If you use <font class="Apple-style-span" color="#FF00FF">^, $, \b, \1, (?=&#8230;), (?!&#8230;), (?&lt;=&#8230;), (?&lt;!&#8230;)</font>, the program ignores them.</li>
<li>If you use <font class="Apple-style-span" color="#FF00FF">\D, \W, or [^]</font>, the program reports an error, because the range they cover is too broad.</li>
</ul>
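<li>Regular expressions with more than 1000 possible results are not supported. For example, <font class="Apple-style-span" color="#FF00FF">\w{2}</font> is rejected because it has 62×62 = 3844 possibilities, but you can use \w\d, because it has only 62×10 = 620.</li>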
</ol>
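<p>The arithmetic behind the 1000-result cap is easy to check. The following is a small Python sketch under stated assumptions: the names <code>WORD</code>, <code>DIGIT</code>, <code>possibilities</code>, and <code>exceeds_limit</code> are invented for illustration and are not part of REExtractor, and whether exactly 1000 results would be accepted is an assumption.</p>

```python
import string
from math import prod

WORD = string.ascii_letters + string.digits  # REExtractor's \w: 62 characters, no underscore
DIGIT = string.digits                        # \d: 10 characters
LIMIT = 1000                                 # result cap described in the list above


def possibilities(*alphabets):
    """Number of strings formed by picking one character from each alphabet in turn."""
    return prod(len(alphabet) for alphabet in alphabets)


def exceeds_limit(count):
    # Assumption: only counts strictly above LIMIT are rejected.
    return count > LIMIT


for label, count in [
    (r"\w{2}", possibilities(WORD, WORD)),
    (r"\w\d", possibilities(WORD, DIGIT)),
]:
    print(label, count, "rejected" if exceeds_limit(count) else "accepted")
```

<p>Because the count is a product of the sizes at each position, adding one more <code>\w</code> multiplies the total by 62, which is why even short patterns hit the cap quickly.</p>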
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">What It Can Do</h3>
<p><a href="http://iregex.org/blog/REExtractor.html" target="_blank" title="我爱正则表达式|由正则式反推文本"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100202170343.jpg" border="0" alt="我爱正则表达式|由正则式反推文本"></a><br />
Even with all these limitations, you can still put it to some fun uses. Here are two examples.</p>
<ul>
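<li>Generate some simple email addresses. Try this regex: <font class="Apple-style-span" color="#FF00FF">[abc]{3}\d@1(26|63).com</font> , which generates 540 email addresses.</li>
<li>Generate some personal names. Try this regex: <font class="Apple-style-span" color="#FF00FF">张[小大勇赞强战海][虎猫龙彪平]</font>. It generates 35 names. Yes, it supports Chinese, and each Chinese character is treated as a single character. If you are expecting a baby, you can line up some candidate characters, see which combinations look and sound best, and pick one of them.</li>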
</ul>
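<p>Both examples above amount to taking one alternative per position and joining the pieces. Here is a hedged Python sketch of that enumeration using <code>itertools.product</code>; it does not parse regular expressions and is not how REExtractor is implemented (its source is not shown here). The helper <code>expand</code> is a made-up name, and each pattern is written out by hand as a list of alternatives per position.</p>

```python
from itertools import product


def expand(*parts):
    """Yield every string formed by taking one alternative from each part, in order."""
    for combo in product(*parts):
        yield "".join(combo)


# [abc]{3}\d@1(26|63).com, spelled out position by position.
# A plain string contributes one alternative per character;
# a list contributes one alternative per element.
emails = list(expand("abc", "abc", "abc", "0123456789", ["@1"], ["26", "63"], [".com"]))

# 张[小大勇赞强战海][虎猫龙彪平]
names = list(expand(["张"], "小大勇赞强战海", "虎猫龙彪平"))

print(len(emails), emails[0])
print(len(names), names[0])
```

<p>The counts follow directly from the sizes of the parts: 3×3×3×10×2 = 540 addresses and 7×5 = 35 names.</p>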
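<p>In fairness, these little applications could of course be programmed directly, with fewer restrictions and more flexibility and power. But do you really need to fire up a compiler every time? This little program is fun to try. And the limitations mentioned in the previous section are actually quite reasonable: generating text from a regex has no use for most zero-width assertions (although backreferences such as <font class="Apple-style-span" color="#FF00FF">\1</font> would be rather useful, yet they are not supported). Just treat it as a small toy.</p>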
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/reextractor.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hotlinking MP3 Files from SkyDrive</title>
		<link>http://iregex.org/blog/skydrive-mp3-with-google-player.html</link>
		<comments>http://iregex.org/blog/skydrive-mp3-with-google-player.html#comments</comments>
		<pubDate>Sun, 10 Jan 2010 12:29:48 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[mp3]]></category>
		<category><![CDATA[skydrive]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=74</guid>
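		<description><![CDATA[Bloggers who run a self-hosted blog on shared hosting, as I do, sometimes want to upload mp3 files to their own space but worry about copyright (US hosts are getting stricter) and bandwidth (being crawled by Xunlei has serious consequences). After comparing options, I found SkyDrive quite good: 25 GB of space,... ]]></description>
			<content:encoded><![CDATA[<p>Bloggers who run a self-hosted blog on shared hosting, as I do, sometimes want to upload mp3 files to their own space but worry about copyright (US hosts are getting stricter) and bandwidth (being crawled by Xunlei has serious consequences). After comparing options, I found SkyDrive quite good: 25 GB of space, with hotlinking supported. Its one drawback is that it is cumbersome to use, and the usual methods make it hard to extract mp3 links in bulk. This afternoon I worked out a simple method that fetches the absolute URLs of the mp3 files in a public SkyDrive folder and generates Google Player embed code for them (so you no longer need to install any of the WordPress mp3 player plugins). The PHP source is included below for anyone who wants to study it. Discussion about <a title="我爱正则表达式" target="_blank" href="http://iregex.org" id="n2fe">regular expressions</a> is welcome in the comments; I am afraid I will not reply to other questions.<br />
<span id="more-74"></span><br />
The final result looks like this:</p>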
<p><a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110_190109.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
<p><a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110_183937.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
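<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Uploading</h3>
<p>Sign in <a title="Skydrive, 25G space!" target="_blank" href="http://skydrive.live.com/" id="nu96">here</a> with your Live ID, then create a new <span style="color: rgb(255, 0, 255);">public</span> folder. It has to be public because your mp3 files are meant to be played on your blog; if the folder is private, nobody else will be able to listen to them.</p>
<p>The screenshots show how to change the permissions:<br />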
<a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110_184249.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
<p><a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110_184319.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a><br />
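When uploading from IE, you will be prompted to install a plugin. I recommend installing it, because it lets you drag and drop files to upload them in bulk. Each file must be no larger than 50 MB. There is no limit on the total size of the files.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Specifying the Files to Hotlink</h3>
<p>You can generate code for a whole folder or for a single file. Either way, each file gets its own snippet; the files are not combined into a single playlist. First note the page URL of the file or folder, then generate the code from that URL.</p>
<p>Getting the URL of a single file:<br />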
<a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110_185001.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
<p>Getting the URL of a folder:<br />
<a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110_184937.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
<p>Copy the page URL and keep it handy.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Generating the Player Code</h3>
<p>Head over here:<br />
<a title="我爱正则表达式" target="_blank" href="http://zh-en.org/livemp3/" id="ikzb">http://zh-en.org/livemp3/</a></p>
<p><a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110201229.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
<p>Enter the page URL from the previous step and click OK. After about two seconds you will see something like this:</p>
<p>&nbsp;<br />
<a href="http://iregex.org/blog/skydrive-mp3-with-google-player.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/20100110201150.png" alt="我爱正则表式|mp3+Skydrive+GooglePlayer" border="0"></a></p>
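<p>Copy the generated source code into WordPress and the player will appear.</p>
<h3 style="color: #127ADB; font-size:14px; padding-bottom:3px; padding-top:3px; margin:1.5em 0 1em;">Source Code</h3>
<p>The program is simple: it takes the page URL, downloads the page with curl, uses regular expressions to extract the absolute URL, and then generates the player code. That is all there is to it. I found the Google Player code while reading the Pirate Radio feature of <a href="http://www.baibanbao.net/">白板报</a> in Google Reader.</p>
<p>If you are interested, you can extend this approach to use SkyDrive as an image host. The principle is the same, so I will not go into it.<br />
<br />
The PHP code is as follows:</p>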
<div class="codecolorer-container php mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="php codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #000000; font-weight: bold;">&lt;?php</span><br />
<span style="color: #666666; font-style: italic;">//use: &nbsp;get mp3, wav, wmv direct links from skydrive's public folder, and generate google player code for that.</span><br />
<span style="color: #666666; font-style: italic;">//author's email&amp;gtalk: &nbsp; rex [at] zhasm [dot] com</span><br />
<span style="color: #666666; font-style: italic;">//last edit: &nbsp; &nbsp;20100110 18:14</span><br />
<br />
<span style="color: #666666; font-style: italic;">//get the curl handle</span><br />
<span style="color: #666666; font-style: italic;">//</span><br />
<span style="color: #000000; font-weight: bold;">function</span> init_curl<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$ch</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_init</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_RETURNTRANSFER<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_BINARYTRANSFER<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_REFERER<span style="color: #339933;">,</span><span style="color: #0000ff;">&quot;http://skydrive.live.com/&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">//curl_setopt($ch, CURLOPT_POST, 1);</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #000088;">$ch</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #666666; font-style: italic;">// extract mp3 from the given root page;</span><br />
<span style="color: #000000; font-weight: bold;">function</span> get_list<span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; extract_mp3<span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">//echo $url;</span><br />
&nbsp; &nbsp; <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_URL<span style="color: #339933;">,</span> <span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$output</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_exec</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #666666; font-style: italic;">//trim the unnecessary parts, for safety</span><br />
<br />
&nbsp; &nbsp; <span style="color: #000088;">$links</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/^.*?(?=&lt;div id=[\'&quot;]tileView[\'&quot;] class=[\'&quot;]tvContainer[\'&quot;]&gt;)|&lt;div class=&quot;bpViewPermissionsLink&quot;.*$/si'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">,</span> <span style="color: #000088;">$output</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">//you can add your own music filter </span><br />
&nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/(?&lt;=&lt;a class=&quot;tvLink&quot;)[^&lt;&gt;]+href=&quot;([^&quot;]+)(?&lt;=mp3|wav|wmv)&quot;/si'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$links</span><span style="color: #339933;">,</span> <span style="color: #000088;">$result</span><span style="color: #339933;">,</span> PREG_PATTERN_ORDER<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$result</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$result</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$r</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; extract_mp3<span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span><span style="color: #000088;">$r</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
&nbsp;<br />
<span style="color: #666666; font-style: italic;">// extract mp3 from the given sub page, generate output code.</span><br />
<span style="color: #000000; font-weight: bold;">function</span> extract_mp3<span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span><span style="color: #000088;">$link</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_URL<span style="color: #339933;">,</span> <span style="color: #000088;">$link</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$output</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_exec</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #666666; font-style: italic;">//&nbsp; &lt;a id=&quot;spPreviewLink&quot; href=</span><br />
&nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'%(?&lt;=&lt;title&gt;)[^&gt;&lt;]+(?= - Windows Live&lt;/title&gt;)%'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$output</span><span style="color: #339933;">,</span> <span style="color: #000088;">$result</span><span style="color: #339933;">,</span> PREG_PATTERN_ORDER<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$title</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$result</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/(?&lt;=&lt;a\sid=&quot;spPreviewLink&quot;\shref=&quot;)[^&quot;]+(?=&amp;#63;download&quot;)/'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$output</span><span style="color: #339933;">,</span> <span style="color: #000088;">$result</span><span style="color: #339933;">,</span> PREG_PATTERN_ORDER<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$result</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span> <br />
&nbsp; &nbsp; <span style="color: #b1b100;">foreach</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$result</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$r</span><span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#123;</span> <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #000088;">$demo</span><span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot; &nbsp; &nbsp;&lt;div class=<span style="color: #000099; font-weight: bold;">\&quot;</span>audio-player-placeholder<span style="color: #000099; font-weight: bold;">\&quot;</span>&gt;<br />
&nbsp; &nbsp; &lt;embed classname=<span style="color: #000099; font-weight: bold;">\&quot;</span>audio-player-embed<span style="color: #000099; font-weight: bold;">\&quot;</span> type=<span style="color: #000099; font-weight: bold;">\&quot;</span>application/x-shockwave-flash<span style="color: #000099; font-weight: bold;">\&quot;</span> src=<span style="color: #000099; font-weight: bold;">\&quot;</span>https://www.google.com/reader/ui/3247397568-audio-player.swf?audioUrl=&quot;</span><span style="color: #339933;">.</span><span style="color: #000088;">$r</span><span style="color: #339933;">.</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\&quot;</span> allowscriptaccess=<span style="color: #000099; font-weight: bold;">\&quot;</span>never<span style="color: #000099; font-weight: bold;">\&quot;</span> allowfullscreen=<span style="color: #000099; font-weight: bold;">\&quot;</span>true<span style="color: #000099; font-weight: bold;">\&quot;</span> quality=<span style="color: #000099; font-weight: bold;">\&quot;</span>best<span style="color: #000099; font-weight: bold;">\&quot;</span> bgcolor=<span style="color: #000099; font-weight: bold;">\&quot;</span>#ffffff<span style="color: #000099; font-weight: bold;">\&quot;</span> wmode=<span style="color: #000099; font-weight: bold;">\&quot;</span>transparent<span style="color: #000099; font-weight: bold;">\&quot;</span> flashvars=<span style="color: #000099; font-weight: bold;">\&quot;</span>playerMode=embedded<span style="color: #000099; font-weight: bold;">\&quot;</span> pluginspage=<span style="color: #000099; font-weight: bold;">\&quot;</span>http://www.macromedia.com/go/getflashplayer<span style="color: #000099; font-weight: bold;">\&quot;</span> height=<span style="color: #000099; font-weight: bold;">\&quot;</span>27px<span style="color: #000099; font-weight: bold;">\&quot;</span> width=<span style="color: #000099; font-weight: bold;">\&quot;</span>400px<span style="color: 
#000099; font-weight: bold;">\&quot;</span>&gt;<br />
&nbsp; &nbsp; &lt;/div&gt;&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;File: &quot;</span><span style="color: #339933;">,</span><span style="color: #000088;">$title</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span><span style="color: #0000ff;">&quot;&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;Preview:&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #000088;">$demo</span><span style="color: #339933;">,</span><span style="color: #0000ff;">&quot;&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">&quot;Code:&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">'&lt;textarea cols=&quot;50&quot; rows=&quot;10&quot;&gt;'</span><span style="color: #339933;">,</span><span style="color: #000088;">$demo</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'&lt;/textarea&gt;'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">&quot;&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
&nbsp;<br />
<span style="color: #666666; font-style: italic;">//get user input</span><br />
<span style="color: #000088;">$url</span><span style="color: #339933;">=@</span><span style="color: #000088;">$_GET</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">&quot;url&quot;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #990000;">exit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span><br />
&nbsp;<br />
&nbsp;<br />
<span style="color: #000088;">$ch</span><span style="color: #339933;">=</span>init_curl<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
<span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">'&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html;charset=utf-8&quot; /&gt;'</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #000088;">$info</span><span style="color: #339933;">=</span>get_list<span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <br />
&nbsp;<br />
<span style="color: #000000; font-weight: bold;">?&gt;</span></div></td></tr></tbody></table></div>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/skydrive-mp3-with-google-player.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>A One-Line Command for Downloading the Images on a Page</title>
		<link>http://iregex.org/blog/download-images-with-single-line-command.html</link>
		<comments>http://iregex.org/blog/download-images-with-single-line-command.html#comments</comments>
		<pubDate>Mon, 07 Dec 2009 11:58:42 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[cmdline]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[wget]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=72</guid>
		<description><![CDATA[The command: curl -s $URL &#124;perl -nle &#34;print for m{http://[^\&#34;]+(?:jpg&#124;png&#124;gif)}g;&#34;&#124;sort -u &#124;xargs wget How it works: download the page that contains the image links (for example, http://www.flickr.com/photos/anyaanja/4165312465/sizes/o/ ... ]]></description>
			<content:encoded><![CDATA[<div>The command:<br />
</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">curl <span style="color: #660033;">-s</span> <span style="color: #007800;">$URL</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">perl</span> <span style="color: #660033;">-nle</span> <span style="color: #ff0000;">&quot;print for m{http://[^<span style="color: #000099; font-weight: bold;">\&quot;</span>]+(?:jpg|png|gif)}g;&quot;</span><span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sort</span> <span style="color: #660033;">-u</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">xargs</span> <span style="color: #c20cb9; font-weight: bold;">wget</span></div></td></tr></tbody></table></div>
</div>
<p><span id="more-72"></span><br />
How it works:</p>
<ul>
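<li>Download the page that contains the image links (for example, http://www.flickr.com/photos/anyaanja/4165312465/sizes/o/) so that the image URLs can be extracted. The command is curl -s $URL; replace $URL with the address you need. curl's -s option selects silent mode, which suppresses the progress meter and error messages while the page itself is still written to standard output.
</li>
<li>Use perl to parse the downloaded page and find image URLs that start with http and end with jpg, png, or gif. The list of image types is arbitrary and can be extended or trimmed using the same syntax. perl's -nle options mean: loop over the input line by line (-n), handle line endings automatically (-l), and run the code given on the command line (-e), which here prints every match found on each line. For details, see <a title="perl one liners" target="_blank" href="http://sial.org/howto/perl/one-liner/" id="x85m">perl one liners</a>.</li>
<li>perl does the page parsing here. awk should be able to do the same job, though in my experience awk's <a href="http://iregex.org/blog/download-images-with-single-line-command.html">regular expressions</a> are comparatively weak.
</li>
<li>Use sort -u to sort the resulting URLs and keep only one copy of any duplicates, so that nothing is downloaded twice.</li>
<li>Use wget to download the images into the current directory. wget does not read URLs from standard input by default, so xargs passes them along as arguments.</li>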
</ul>
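<p>For readers who prefer a script, the extraction and de-duplication steps above can be sketched in Python. This is only an illustration: the function name extract_image_urls and the sample HTML are made up for the example, the regular expression mirrors the one passed to perl, and the download step is left out.</p>

```python
import re

# Same pattern as the perl one-liner: an http URL with no double quote
# inside it, ending in one of the listed image extensions.
IMAGE_URL = re.compile(r'http://[^"]+(?:jpg|png|gif)')


def extract_image_urls(html):
    """Return the unique image URLs found in html, sorted (like sort -u)."""
    return sorted(set(IMAGE_URL.findall(html)))


# Hypothetical sample page, used only to exercise the function.
sample = (
    '<img src="http://example.com/a.jpg">'
    '<img src="http://example.com/b.png">'
    '<a href="http://example.com/a.jpg">again</a>'
    '<a href="http://example.com/page.html">not an image</a>'
)

for url in extract_image_urls(sample):
    print(url)
```

<p>Each printed URL could then be handed to a downloader, which is the role xargs wget plays in the shell version.</p>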
<p>
<span style="color: rgb(255, 0, 255);">2009120</span> update:</p>
<ul>
<li>The command</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">curl <span style="color: #660033;">-s</span> <span style="color: #007800;">$URL</span> <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">grep</span> <span style="color: #660033;">-o</span> <span style="color: #ff0000;">&quot;http://.*\?\(png\|jpg\)&quot;</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sort</span> <span style="color: #660033;">-u</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">xargs</span> <span style="color: #c20cb9; font-weight: bold;">wget</span></div></td></tr></tbody></table></div>
<p>or</p>
<div class="codecolorer-container bash mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">curl <span style="color: #660033;">-s</span> <span style="color: #007800;">$URL</span> <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">grep</span> <span style="color: #660033;">-E</span> <span style="color: #660033;">-o</span> <span style="color: #ff0000;">&quot;http://.*?(png|jpg)&quot;</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sort</span> <span style="color: #660033;">-u</span> <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">xargs</span> <span style="color: #c20cb9; font-weight: bold;">wget</span></div></td></tr></tbody></table></div>
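<p>does much the same job. Here, -o prints only the matching part rather than the whole line (the default); -E enables extended regular expressions, in which the question mark, parentheses, and vertical bar can be used directly without a preceding backslash. Note that neither grep syntax has lazy quantifiers, so the ? after .* does not make the match non-greedy: on a line containing several image URLs, grep returns one long match running from the first http:// to the last png or jpg on that line.</li>
<li>With perl, the regular expressions are more powerful, but the command is bulkier; grep is small and flexible, but may not be able to handle complex regular expressions.</li>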
</ul>
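<p>The difference between greedy and lazy matching is easy to see in Python, whose re module supports both. The sketch below uses a made-up input line; the greedy pattern behaves the way grep's .* does, while the lazy pattern is what a Perl-style .*? gives you.</p>

```python
import re

# Hypothetical line of HTML that contains two image URLs.
line = '<img src="http://example.com/a.png"> <img src="http://example.com/b.jpg">'

# Greedy: .* runs as far as it can, so both URLs end up in a single match.
greedy = re.findall(r'http://.*(?:png|jpg)', line)

# Lazy: .*? stops at the first extension it reaches, giving one match per URL.
lazy = re.findall(r'http://.*?(?:png|jpg)', line)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

<p>This is why the perl version excludes double quotes with a character class instead of relying on .* alone.</p>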
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/download-images-with-single-line-command.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Converting a Phone Contact List with Regular Expressions</title>
		<link>http://iregex.org/blog/convert-contact-format-from-htc-to-nokia.html</link>
		<comments>http://iregex.org/blog/convert-contact-format-from-htc-to-nokia.html#comments</comments>
		<pubDate>Sun, 05 Jul 2009 10:17:51 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[应用]]></category>
		<category><![CDATA[csv]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[手机]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=62</guid>
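		<description><![CDATA[I switched phones today, and the first order of business was restoring my contact list. The old phone is an HTC c720W and the new one a Nokia 5320XM, and their contact formats differ. Was I supposed to re-enter every contact one by one? Heaven forbid: with several hundred contacts, doing it by hand would be the death of... ]]></description>
			<content:encoded><![CDATA[<p>I switched phones today, and the first order of business was restoring my contact list. The old phone is an HTC c720W and the new one a Nokia 5320XM, and their contact formats differ. Was I supposed to re-enter every contact one by one? Heaven forbid: with several hundred contacts, doing it by hand would be the death of me.</p>
<p>I looked at my old backup of the HTC c720W contact list (a pat on the back here: backing up regularly is a good habit, because if disaster strikes one day you still have something to fall back on) and found that its format looks like this:</p>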
<div class="codecolorer-container xml mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br />48<br />49<br />50<br />51<br />52<br />53<br />54<br />55<br />56<br /></div></td><td><div class="xml codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;contacts</span> <span style="color: #000066;">location</span>=<span style="color: #ff0000;">&quot;0&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;item<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;oid<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>-2147483070<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/oid<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;location<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>0<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/location<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businessfaxnumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;companyname</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;department</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email1address</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;mobiletelephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;officelocation</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businesstelephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;jobtitle</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;hometelephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email2address</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email3address</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;home2telephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;homefaxnumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;categories</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;business2telephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;firstname<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>SomeBody<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/firstname<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;middlename</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lastname<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1355***9214<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/lastname<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businessaddresspostalcode</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;pagernumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;spouse</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;cartelephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;assistantname</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;assistanttelephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;children</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;webpage</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;radiotelephonenumber</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileas<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>***<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/fileas<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;title</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;suffix</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;homeaddressstreet</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;homeaddresscity</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;homeaddressstate</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;homeaddresspostalcode</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;homeaddresscountry</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;otheraddressstreet</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;otheraddresscity</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;otheraddressstate</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;otheraddresspostalcode</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;otheraddresscountry</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businessaddressstreet</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businessaddresscity</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businessaddressstate</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;businessaddresscountry</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;anniversary<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1899-12-30 00:00:00<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/anniversary<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;birthday<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1899-12-30 00:00:00<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/birthday<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/item<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;item<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
&nbsp; &nbsp; ...<br />
&nbsp; &nbsp; <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/item<span style="color: #000000; font-weight: bold;">&gt;</span></span></span><br />
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/contacts<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></div></td></tr></tbody></table></div>
<p>Nokia's address book, by contrast, is a CSV file with these column names:</p>
<div class="codecolorer-container csv mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;">&quot;名&quot;,&quot;姓&quot;,&quot;常用手机&quot;,&quot;常用电话&quot;,&quot;公司电话&quot;</div>
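<p>I had hoped to be lazy and find some xml2csv tool, but none of them worked well. Time to ask my old friend, the regular expression, for help!</p>
<p>Analyzing the XML file shows that only these 5 fields are useful:<br />
mobiletelephonenumber, businesstelephonenumber, hometelephonenumber, firstname, lastname. The corresponding CSV column names are:<br />
&quot;名&quot;,&quot;姓&quot;,&quot;常用手机&quot;,&quot;常用电话&quot;,&quot;公司电话&quot; (given name, family name, main mobile, main phone, company phone)</p>
<p>The remaining fields can safely be deleted.</p>
<p>Out came RegexBuddy, and I wrote this regular expression:</p>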
<div class="codecolorer-container python mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">result = <span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><br />
&nbsp; &nbsp; r<span style="color: #483d8b;">&quot;&quot;&quot;(?smx)(?#&quot;名&quot;,&quot;姓&quot;,&quot;常用手机&quot;,&quot;常用电话&quot;,&quot;公司电话&quot;)<br />
&nbsp; &nbsp; ^<span style="color: #000099; font-weight: bold;">\s</span>*&lt;item&gt;.*?<br />
&nbsp; &nbsp; &lt;mobiletelephonenumber/?&gt;(?P&lt;mobile&gt;[^<span style="color: #000099; font-weight: bold;">\s</span>&lt;]*).*?<br />
&nbsp; &nbsp; &lt;businesstelephonenumber/?&gt;(?P&lt;business&gt;[^<span style="color: #000099; font-weight: bold;">\s</span>&lt;]*).*?<br />
&nbsp; &nbsp; &lt;hometelephonenumber/?&gt;(?P&lt;home&gt;[^<span style="color: #000099; font-weight: bold;">\s</span>&lt;]*).*?<br />
&nbsp; &nbsp; &lt;firstname/?&gt;(?P&lt;first&gt;[^<span style="color: #000099; font-weight: bold;">\s</span>&lt;]*).*?<br />
&nbsp; &nbsp; &lt;lastname/?&gt;(?P&lt;lastname&gt;[^<span style="color: #000099; font-weight: bold;">\s</span>&lt;]*).*?<br />
&nbsp; &nbsp; &lt;/item&gt;&quot;&quot;&quot;</span>, <br />
&nbsp; &nbsp; r<span style="color: #483d8b;">'&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;first&gt;&quot;,&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;lastname&gt;&quot;,&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;mobile&gt;&quot;,&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;home&gt;&quot;,&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;business&gt;&quot;'</span>, subject<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
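<p>I wrote the regex in Python syntax because it supports named captures, which makes it easier to read. The replacement itself was actually done in RegexBuddy.</p>
<p>After the replacement I saved the result as a CSV file and imported it with Nokia's bundled software, and several hundred contacts were back in place. Very satisfying.</p>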
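<p>If you would rather run the conversion as a script than inside RegexBuddy, a minimal Python sketch might look like the following. It reuses the pattern above, but several things are assumptions made for illustration: the function name contacts_to_csv, the shortened sample record, and the choice to write the column names as a header row. Using finditer with csv.writer, rather than re.sub, means text outside the item elements never reaches the output.</p>

```python
import csv
import io
import re

# The pattern from the post: verbose mode, "." matches newlines, "^" per line.
ITEM = re.compile(
    r"""(?smx)
    ^\s*<item>.*?
    <mobiletelephonenumber/?>(?P<mobile>[^\s<]*).*?
    <businesstelephonenumber/?>(?P<business>[^\s<]*).*?
    <hometelephonenumber/?>(?P<home>[^\s<]*).*?
    <firstname/?>(?P<first>[^\s<]*).*?
    <lastname/?>(?P<lastname>[^\s<]*).*?
    </item>"""
)


def contacts_to_csv(xml_text):
    """Return CSV text with one fully quoted row per <item> element."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    # Assumption: the import tool expects the column names as a header row.
    writer.writerow(["名", "姓", "常用手机", "常用电话", "公司电话"])
    for m in ITEM.finditer(xml_text):
        writer.writerow([m.group("first"), m.group("lastname"),
                         m.group("mobile"), m.group("home"),
                         m.group("business")])
    return out.getvalue()


# Hypothetical, shortened record in the same shape as the HTC backup.
SAMPLE = """<contacts location="0">
    <item>
      <mobiletelephonenumber>13500000000</mobiletelephonenumber>
      <businesstelephonenumber/>
      <hometelephonenumber/>
      <firstname>Some</firstname>
      <middlename/>
      <lastname>Body</lastname>
    </item>
</contacts>"""

print(contacts_to_csv(SAMPLE), end="")
```

<p>Like the original pattern, the sketch captures each value only up to the first whitespace character, so names that contain spaces would need a looser character class.</p>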
<p>Screenshot:<br />
<a href="http://iregex.org/blog/convert-contact-format-from-htc-to-nokia.html" target="_blank"><img src="http://i293.photobucket.com/albums/mm60/zhasm/iregex/regex_contact_manager.png" border="0" alt="Photobucket"></a></p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/convert-contact-format-from-htc-to-nokia.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
