Trie
August 1st, 2010
Categories: 笔记
从《Effective Perl》上学习到一个module:Regexp::Trie。它属于正则优化类的module,具体说来,就是提取出备选项文本的公共部分,构造“检索树”,以便最大程度上减少回溯,提升效率。
试用了一下。对于简单的例子,它的效果很惊艳;对于复杂的例子,这个优化器的“机器人”的性格就显示出来了,感觉不如手工写得更高效,更有针对性。见例子:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | #!/usr/bin/perl #author: rex #blog: http://iregex.org #filename a.pl #created: 2010-08-01 use Regexp::Trie; my @simple=qw/foobar foobah fooxar foozap fooza/; my @complex=qw/AuthenticViagra BestDealsOnViagra BestMaleTablets GenuineOnlineViagra GreatViagraShop OriginalViagra TopViagraShop VIAGRA ViagraBestOnlineShop ViagraBrandReseller ViagraPills onlineshop/; #the simple one my $t_s=Regexp::Trie->new; foreach (@simple){ $t_s->add($_); } #the complex one my $t_c=Regexp::Trie->new; foreach (@complex){ $t_c->add($_); } print "simple regex trie: ", $t_s->regexp, "\n"; print "complex regex trie: ", $t_c->regexp, "\n"; |
其输出结果为:
1 2 | simple regex trie: (?-xism:foo(?:ba[hr]|xar|zap?)) complex regex trie: (?-xism:(?:AuthenticViagra|Best(?:DealsOnViagra|MaleTablets)|G(?:enuineOnlineViagra|reatViagraShop)|OriginalViagra|TopViagraShop|V(?:IAGRA|iagra(?:B(?:estOnlineShop|randReseller)|Pills))|onlineshop)) |
对于第2个例子,可以看出来,Trie是基于单个字符来优化的,而不是基于单词。虽然如此,它还是会极大地减少程序员的手工劳动。毕竟,未经优化的多选结构的效率之差不言而喻的,只要经过Trie的优化,其效率就会有质的飞跃;而手工的调教,只是在该飞跃的基础上再有一点点提升而已。对正则表达式比较头痛的程序员,更是可以无视自己的优化,直接使用该module而已。不过,使用Perl的程序员,应该不会对正则表达式头痛吧?
Perl 5.10 之后的版本,已经内置了类似于Trie的优化机制,因此没有手动优化的必要了。不过,我还是好奇于它的实现:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | package Regexp::Trie; use 5.008001; use strict; use warnings; our $VERSION = sprintf "%d.%02d", q$Revision: 0.2 $ =~ /(\d+)/g; # use overload q("") => sub { shift->regexp }; sub new{ bless {} => shift } sub add{ my $self = shift; my $str = shift; my $ref = $self; for my $char (split //, $str){ $ref->{$char} ||= {}; $ref = $ref->{$char}; } $ref->{''} = 1; # { '' => 1 } as terminator $self; } sub _regexp{ my $self = shift; return if $self->{''} and scalar keys %$self == 1; # terminator my (@alt, @cc); my $q = 0; for my $char (sort keys %$self){ my $qchar = quotemeta $char; if (ref $self->{$char}){ if (defined (my $recurse = _regexp($self->{$char}))){ push @alt, $qchar . $recurse; }else{ push @cc, $qchar; } }else{ $q = 1; } } my $cconly = !@alt; @cc and push @alt, @cc == 1 ? $cc[0] : '['. join('', @cc). ']'; my $result = @alt == 1 ? $alt[0] : '(?:' . join('|', @alt) . ')'; $q and $result = $cconly ? "$result?" : "(?:$result)?"; return $result; } sub regexp{ my $str = shift->_regexp; qr/$str/ } 1; |
大致读了一下,让我想起之前我写过的一篇《关于从普通文本提取正则表达式的再思考》。不过,我的是业余与Trie的专业的差距,呵呵。这两天学习一下这段代码。若有值得分享的心得,再贴出来。
Leave a comment
| Trackback