<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>I Love Regular Expressions &#187; root</title>
	<atom:link href="http://iregex.org/blog/tag/root/feed" rel="self" type="application/rss+xml" />
	<link>http://iregex.org</link>
	<description>Original, translated, and reposted articles about regular expressions</description>
	<lastBuildDate>Tue, 29 Mar 2011 05:04:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/>
	<atom:link rel="hub" href="http://superfeedr.com/hubbub"/>
	<atom:link rel="hub" href="http://www.feedsky.com/api/RPC2"/>
	<atom:link rel="hub" href="http://blogsearch.google.com/ping/RPC2"/>
	<atom:link rel="hub" href="http://blog.yodao.com/ping/RPC2"/>
	<atom:link rel="hub" href="http://www.xianguo.com/xmlrpc/ping.php"/>
	<atom:link rel="hub" href="http://www.zhuaxia.com/rpc/server.php"/>
	<atom:link rel="hub" href="http://rpc.technorati.com/rpc/ping"/>
	<atom:link rel="hub" href="http://rpc.pingomatic.com/"/>
		<item>
		<title>[Old Post Roundup] How to Extract Word Roots from English Sentences with Regular Expressions</title>
		<link>http://iregex.org/blog/%e8%80%81%e8%b4%b4%e6%95%b4%e7%90%86%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e6%ad%a3%e5%88%99%e5%bc%8f%e4%bb%8e%e8%8b%b1%e6%96%87%e5%8f%a5%e5%ad%90%e9%87%8c%e6%8f%90%e5%8f%96%e8%af%8d%e6%a0%b9.html</link>
		<comments>http://iregex.org/blog/%e8%80%81%e8%b4%b4%e6%95%b4%e7%90%86%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e6%ad%a3%e5%88%99%e5%bc%8f%e4%bb%8e%e8%8b%b1%e6%96%87%e5%8f%a5%e5%ad%90%e9%87%8c%e6%8f%90%e5%8f%96%e8%af%8d%e6%a0%b9.html#comments</comments>
		<pubDate>Fri, 25 Apr 2008 08:07:09 +0000</pubDate>
		<dc:creator>rex</dc:creator>
				<category><![CDATA[Q&A]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[root]]></category>

		<guid isPermaLink="false">http://iregex.org/?p=3</guid>
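		<description><![CDATA[I once answered a question like this on chinaunix, and the answer used regular expressions (which I think are the best fit for this kind of problem). Some example sentences for learning English each contain several words that share a root, for example: She swears to wear the pearls that appear ... ]]></description>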
			<content:encoded><![CDATA[<p>I once answered <a href="http://bbs.chinaunix.net/viewthread.php?tid=1021624" target="_blank">a question like this</a> on chinaunix, and the answer used regular expressions (which I think are the best fit for this kind of problem).</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>I have some example sentences for learning English. Each contains several words that share a root, for example: She swears to wear the pearls that appear to be pears. The root differs from sentence to sentence, though. I want to mark every word that contains the root. How should I write this?</p>
<p>The <strong>root</strong> here is not a root in the strict linguistic sense, just a shared sequence of letters. For example:</p>
<p>9. The dust in the industrial zone frustrated the industrious man.</p>
<p>The root is dust or ust.</p>
<p>10. The just budget judge just justifies the adjustment of justice.</p>
<p>The root is just.</p>
<p>11. I used to abuse the unusual usage, but now I&#8217;m not used to doing so.</p>
<p>The root is use, with variant forms.</p>
<p>12. The lace placed in the palace is replaced first, and displaced later.</p>
<p>The root is lace.</p>
<p>13. I paced in the peaceful spacecraft.</p>
<p>The root is pace.</p>
<p>14. Sir, your bird stirred my girlfriend&#8217;s birthday party.</p>
<p>The root is ir.</p></blockquote>
<p>If this problem interests you, think it through on your own before reading on to the solution given here.</p>
<p><span id="more-5"></span></p>
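<p>My reasoning: every sentence has the same structure, so a loop can handle them all, and analyzing a single sentence is enough. Within each sentence, every word has to be examined in turn.</p>
<p>Let's dissect the first sentence.</p>
<blockquote style="border-left:2px solid #DDDDDD; margin:15px 30px 0 10px; padding-left:20px;"><p>She swears to wear the pearls that appear to be pears.</p></blockquote>
<p>A human eye sees at once that ear is the root. But, as with the loaded equation 2 + 2 = ? in <em>1984</em>, the real problem is how to prove what the answer is.</p>
<p>I imagine myself as a regex robot. I can read the text one sentence at a time (Perl syntax: while(&lt;FILE&gt;)), then pick out each word for analysis (Perl syntax: \w+ matches one word). For every run of N consecutive letters in a word (N is at least 3 and at most the length of the word; call the run $matchstr), I check how often it occurs in the whole sentence and store this "root" together with its count in a hash. The hash is convenient here because incrementing a root that has no entry yet creates the entry and sets it to 1.</p>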
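<p>As a minimal sketch of the enumeration step just described, the snippet below lists every run of at least 3 consecutive letters in one word and counts each run in a hash. The helper name candidate_runs and the %count hash are hypothetical names chosen for illustration; they do not appear in the original script.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: every run of at least $min consecutive letters in $word.
sub candidate_runs {
    my ($word, $min) = @_;
    my $len = length $word;
    my @runs;
    for my $n ($min .. $len) {              # run length: $min up to the whole word
        for my $start (0 .. $len - $n) {    # every start position that still fits
            push @runs, substr($word, $start, $n);
        }
    }
    return @runs;
}

my %count;
# ++ on a key that does not exist yet creates it with the value 1.
$count{$_}++ for candidate_runs('wear', 3);

print join(', ', map { "$_=$count{$_}" } sort keys %count), "\n";
# prints: ear=1, wea=1, wear=1
```

<p>Words shorter than 3 letters, such as "to" and "be", yield no runs at all, which is why they never contribute a candidate root.</p>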
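<p>The plan is as follows.</p>
<ol>
<li>For each line:</li>
<li>For each word:</li>
<li>For every run of 3 to N consecutive letters in that word (N being the word's length), count how often it occurs in the line, M, and record it in the hash.</li>
<li>Sort the hash by value, take the largest entry, and print it.</li>
</ol>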
<div class="codecolorer-container perl mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;">#!/usr/bin/perl -w</span><br />
<span style="color: #0000ff;">$/</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">&lt;&gt;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@array</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%myhash</span><span style="color: #339933;">=</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;---------------------------<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;the line is :$_<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span><span style="color: #009966; font-style: italic;">/^\w+/</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #009966; font-style: italic;">s/^(\w+)\W+(.*)$/$2/</span><span style="color: #339933;">;</span><br />
<span style="color: #000066;">push</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@array</span><span style="color: #339933;">,</span><span style="color: #000066;">lc</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#save all the words(in lower case format) into array.</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #0000ff;">@b</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@array</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #666666; font-style: italic;">#copy this array to b, for checking</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$len</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$matchlen</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$item</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@array</span><span style="color: #009900;">&#41;</span><br />
<br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #0000ff;">$len</span><span style="color: #339933;">=</span><span style="color: #000066;">length</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$item</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<span style="color: #b1b100;">for</span><span style="color: #009900;">&#40;</span> <span style="color: #0000ff;">$matchlen</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$len</span><span style="color: #339933;">;</span><span style="color: #0000ff;">$matchlen</span><span style="color: #339933;">&gt;=</span><span style="color: #cc66cc;">3</span><span style="color: #339933;">;</span><span style="color: #0000ff;">$matchlen</span><span style="color: #339933;">--</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #b1b100;">for</span><span style="color: #009900;">&#40;</span> <span style="color: #0000ff;">$i</span><span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span><span style="color: #0000ff;">$i</span><span style="color: #339933;">&lt;=</span><span style="color: #0000ff;">$len</span><span style="color: #339933;">-</span><span style="color: #0000ff;">$matchlen</span><span style="color: #339933;">;</span><span style="color: #0000ff;">$i</span><span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #0000ff;">$matchstr</span><span style="color: #339933;">=</span><span style="color: #000066;">substr</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$item</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$i</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$matchlen</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> &nbsp;<span style="color: #666666; font-style: italic;">#define the matchstring.</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #0000ff;">$pig</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">@b</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #b1b100;">next</span> <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #0000ff;">$item</span> <span style="color: #b1b100;">eq</span> <span style="color: #0000ff;">$pig</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#the word can not match against itself.</span><br />
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #0000ff;">$pig</span> <span style="color: #339933;">=~</span> <span style="color: #009966; font-style: italic;">/$matchstr/</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #0000ff;">$myhash</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$matchstr</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">++;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#if matches, record them.</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">keys</span> <span style="color: #0000ff;">%myhash</span><span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;$_:$myhash{$_};<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">;</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;">#print all the successful match records.</span><br />
<span style="color: #009900;">&#125;</span><br />
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Note: this site uses the WP-CODEBOX plugin. You can follow <a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank">this format</a> to include code in your comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://iregex.org/blog/%e8%80%81%e8%b4%b4%e6%95%b4%e7%90%86%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e6%ad%a3%e5%88%99%e5%bc%8f%e4%bb%8e%e8%8b%b1%e6%96%87%e5%8f%a5%e5%ad%90%e9%87%8c%e6%8f%90%e5%8f%96%e8%af%8d%e6%a0%b9.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

