Fuzzy Highlight
high light Openfind O(kn) n k O(nm) m Knuth O(n) m Knuth Unix grep regular expression exact match Yahoo agrep fuzzy match Gais agrep Openfind gais
exact match fuzzy match fuzzy match O(kn) fuzzy match dynamic programming
1/3 fuzzy highlight FuzzyHighlight.zip highlight Perl, 0, :\>perl -s fuzzymatch5.pl " " " esult:<font color=red> </font> :\>perl -s fuzzymatch5.pl 1 esult: <font color=red> </font> :\>perl -s fuzzymatch5.pl CD-R cdr 1 esult: <font color=red>cd-r </font> :\>perl -s fuzzymatch5.pl "My name is 'sam Tseng'" "sam esult:my name is '<font color=red>sam Tseng</font>' Tseng"
highlight 2/3 D:\>perl -s fuzzymatch5.pl " ", " 1 Result: <font color=red> </font><font color=red> </font> D:\>perl -s fuzzymatch5.pl " ", " 1 Result: <font color=red> <font color=red> </font></font> highlight D:\>perl -s fuzzymatch5.pl " computer " " camputorization " Result:<font color=red> computer </font>
high light 3/3 D:\>perl -s fuzzymatch5.pl " computer " " camputorization" Result: computer high light D:\>perl -s fuzzymatch5.pl " computer " " camputorization" 1 Result:<font color=red> computer</font>
highlight substitution highlight Perl sub FuzzyHighLight { my($doc, $Query, $OSoundex) = @_; # given 3 arguments $Query =~ s/\s+(\w)/$1/g; # delete white spaces my($w, @Terms); # @Terms = FuzzyMatchTerms($Doc, $Query, $OSoundex); # High light longer term first, e.g. : 'DVD', 'DVD Player' # so that 'DVD player' can be correctly high lighted. foreach $w (sort {length($b)<=>length($a)} @Terms) { $Doc =~ s \Q$w\E <font color=red>$w</font> g; } # <font color=red> </font> return $Doc; } # End of &FuzzyHighLight()
Fuzzy Match: highlight fuzzy match highlight token token word token token token Computer white space Computer ( byte ) comput lowercase and stem Computer ( byte ) C513 ( C513 Knuth soundex code soundex code) ( Perl use Text::Soundex; soundex() ) dynamic programming
HTML HTML HTML tag <font-family: Times New Roman > highlight new party HTML tag HTML tag <...> space space In function &DocToken() of FuzzyHighlight.pm } elsif ($c =~ /\s/) { # if white space, do nothing $space.= $c; } elsif ($c eq '<') { # if an HTML tag push @Stack, $c; # Find next balanced '>' for ($wb=$i+1; $wb < $strlen; $wb++) { if (substr($str, $wb, 1) eq '<') { push @Stack, substr($str, $wb, 1); } elsif (substr($str, $wb, 1) eq '>') { pop @Stack; last if @Stack == 0 ; # if empty } } $space.= substr($str, $i, $wb-$i); $i = $wb - 1;
Dynamic Programming for Fuzzy Match DP algorithm: d[i, j] = min( d[i-1, j] + w(a[i], 0), d[i-1, j-1] + w(a[i], B[j]), d[i, j-1] + w(0, B[j]) ) d[i, j] : accumulated distance between sequence A and B w(a[i], B[j]) : cost of substituting A[i]with B[j] w(a[i], 0) : cost of inserting A[i] w(0, B[j]) : cost of deleting B[j] d[0, 0] = 0 (i-1,j-1) (i, j-1) d[i, 0] = d[i-1, 0] + w(a[i], 0), 1<= i <= m d[0, j] = d[0, j-1] + w(0, B[j]), 1<= j <= n Complexity: O(n * m), m : length of A, n : length of B (i-1, j) ( i, j )
Basic Dynamic Programming: Examples B=adecdecf A=adec d[4,4] 0 d[4,4] d[3,3] d[2,2] d[1,1], 0, match adec match adec B 0 DP 0 B A a d e c d e c f a 0 1 2 3 4 5 6 7 d 1 0 1 2 3 4 5 6 e 2 1 0 1 2 3 4 5 c 3 2 1 0 1 2 3 4 B A a c d a d e c f a 0 1 2 3 4 5 6 7 d 1 1 1 2 3 4 5 6 e 2 2 2 2 3 3 4 5 c 3 2 3 3 3 4 3 4
Improved Dynamic Programming 0 d[0, j] = d[0, j-1] + w(0, B[j]), 1<= j <= n d[0, j] = 0, 1<= j <= n d( m) = 0 m exp min m min min d(m) 0 1 m if ($min == $lenq) { $dm = 0; } else { $dm = 1/exp($min/($lenq-$min)); } if (($lenq < 5 && $dm < 0.7) or ($dm < 0.54)) { } #The No Match threshold 1 B a c d a d e c f A 0 0 0 0 0 0 0 0 a 0 1 1 0 1 1 1 1 d 1 1 1 1 0 1 2 2 e 2 2 2 2 2 0 1 2 c 3 3 3 3 3 1 0 1
fuzzymatch5.pl exact match high light match match match O(mn) O(n) fuzzy match