Trie

August 1st, 2010 Categories: 笔记

从《Effective Perl》上学习到一个module:Regexp::Trie。它属于正则优化类的module,具体说来,就是提取出备选项文本的公共部分,构造“检索树”,以便最大程度上减少回溯,提升效率。

试用了一下。对于简单的例子,它的效果很惊艳;对于复杂的例子,这个优化器的“机器人”的性格就显示出来了,感觉不如手工写得更高效,更有针对性。见例子:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#!/usr/bin/perl
#author:         rex
#blog:           http://iregex.org
#filename        a.pl
#created:        2010-08-01

use Regexp::Trie;

my @simple=qw/foobar foobah fooxar foozap fooza/;
my @complex=qw/AuthenticViagra BestDealsOnViagra BestMaleTablets GenuineOnlineViagra GreatViagraShop OriginalViagra TopViagraShop VIAGRA ViagraBestOnlineShop ViagraBrandReseller ViagraPills onlineshop/;
#the simple one
my $t_s=Regexp::Trie->new;

foreach (@simple){
    $t_s->add($_);
}

#the complex one
my $t_c=Regexp::Trie->new;
foreach (@complex){
    $t_c->add($_);
}

print "simple regex trie: ", $t_s->regexp, "\n";
print "complex regex trie: ", $t_c->regexp, "\n";

其输出结果为:

1
2
simple regex trie: (?-xism:foo(?:ba[hr]|xar|zap?))
complex regex trie: (?-xism:(?:AuthenticViagra|Best(?:DealsOnViagra|MaleTablets)|G(?:enuineOnlineViagra|reatViagraShop)|OriginalViagra|TopViagraShop|V(?:IAGRA|iagra(?:B(?:estOnlineShop|randReseller)|Pills))|onlineshop))

对于第2个例子,可以看出来,Trie是基于单个字符来优化的,而不是基于单词。虽然如此,它还是会极大地减少程序员的手工劳动。毕竟,未经优化的多选结构的效率之差不言而喻的,只要经过Trie的优化,其效率就会有质的飞跃;而手工的调教,只是在该飞跃的基础上再有一点点提升而已。对正则表达式比较头痛的程序员,更是可以无视自己的优化,直接使用该module而已。不过,使用Perl的程序员,应该不会对正则表达式头痛吧?

Perl 5.10 之后的版本,已经内置了类似于Trie的优化机制,因此没有手动优化的必要了。不过,我还是好奇于它的实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
package Regexp::Trie;
use 5.008001;
use strict;
use warnings;

our $VERSION = sprintf "%d.%02d", q$Revision: 0.2 $ =~ /(\d+)/g;

# use overload q("") => sub { shift->regexp };

sub new{ bless {} => shift }
sub add{
    my $self = shift;
    my $str  = shift;
    my $ref  = $self;
    for my $char (split //, $str){
        $ref->{$char} ||= {};
        $ref = $ref->{$char};
    }
    $ref->{''} = 1; # { '' => 1 } as terminator
    $self;
}
sub _regexp{
    my $self = shift;
    return if $self->{''} and scalar keys %$self == 1; # terminator
    my (@alt, @cc);
    my $q = 0;
    for my $char (sort keys %$self){
        my $qchar = quotemeta $char;
        if (ref $self->{$char}){
            if (defined (my $recurse = _regexp($self->{$char}))){
                push @alt, $qchar . $recurse;
            }else{
                push @cc, $qchar;
            }
        }else{
            $q = 1;
        }
    }
    my $cconly = !@alt;
    @cc and push @alt, @cc == 1 ? $cc[0] : '['. join('', @cc). ']';
    my $result = @alt == 1 ? $alt[0] : '(?:' . join('|', @alt) . ')';
    $q and $result = $cconly ? "$result?" : "(?:$result)?";
    return $result;
}
sub regexp{ my $str = shift->_regexp; qr/$str/ }

1;

大致读了一下,让我想起之前我写过的一篇《关于从普通文本提取正则表达式的再思考》。不过,我的是业余与Trie的专业的差距,呵呵。这两天学习一下这段代码。若有值得分享的心得,再贴出来。

Tags: ,

One Response to “Trie”

Leave a Comment

如何在本站贴代码?

可以使用任意语言名称代替“python”.