JAIST Reposi https://dspace.j
Title: Automatic Generation of Related Link Collections on the WWW (WWW における関連リンク集の自動生成)
Author(s): Tamura, Masaki (田村, 雅樹)
Citation:
Issue Date: 2006-03
Type: Thesis or Dissertation
Text version: author
URL: http://hdl.handle.net/10119/1979
Rights:
Description: Supervisor: Kiyoaki Shirai (白井清昭), School of Information Science, Master's
Japan Advanced Institute of Science and
WWW 410080 : 2006 2 Copyright © 2006 by Tamura Masaki
1 1.1 Yahoo Goo Google 1 1 1 Yahoo 1.2 1 2 1
1.3 2 3 4 3 5 2
2 2.1 [1] ( ) ( ) HTML ( ) HTML 4 (li) (dl dt dd) ( ) (2 ) ( ) (<br>) 5 : : : 2 3
: : 1.2 2.2 [2] Name Collector Contents Editor Organizer 3 Name Collector ( ) aquarium Waikiki Aquarium Steinhart Aquarium Monterey Bay Aquarium 2 URL Contents Editor URL Organizer 2 1 2 4 2 4
2.3 Zamir Suffix Tree Clustering(STC) [3] Suffix Tree ( S) 1. 2. 2 3. S ( ) 4. ( ) 5. S s s Suffix Tree STC Suffix Tree 10 STC STC Gooots [4] 5 (pdf doc ) 2 5
Gooots Gooots 2 Gooots Gooots googleapi 5 Vivísimo[5] Clusty.com ( Clusty.jp 2006 2 β ) Clusty Velocity 2.1 (2006/02/03 ) 2.1: Clusty ( ) 154 26 26 22 16 11.. Clusty 1 6
3 3.1 3.2 6 1. 2. 3. 4. 5. 6. 3.2.1 3.2.2 3.1 3.2.1 Goo 500 7
GNU Wget Goo ( 500 ) GNU Wget 3.1: Goo Goo PDF HTML Content-Type text/html application/xhtml+xml 8
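The collection step above distinguishes HTML pages from other retrieved files (such as PDF) by the Content-Type values text/html and application/xhtml+xml. A minimal sketch of such a check in Python follows; the function name and the sample header values are assumptions made for illustration and do not come from the original system.

```python
def is_html_content_type(header_value: str) -> bool:
    """Return True if a Content-Type header names an HTML or XHTML page."""
    # Drop parameters such as "; charset=Shift_JIS" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in {"text/html", "application/xhtml+xml"}


# Hypothetical header values, as a downloader might report them.
sample_headers = [
    "text/html; charset=Shift_JIS",
    "application/xhtml+xml",
    "application/pdf",
]
kept = [h for h in sample_headers if is_html_content_type(h)]
print(kept)
```

Comparing only the media type, without its parameters, keeps the check independent of the character encoding declared by each page.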
3.2.3 3.2.2 Goo 500 3.2 1 2... 3.2: 1. 3.3 2. 1 3. 4. 9
WikiPedia Perl 3.3 ( ) Perl 3.3: (WikiPedia > Perl ) 3.2.4 1. (3.3 ) 2. HTML 3. 80% 10
HTML 3.2.5 3.4 3.5 ( ) 3.4: 3.4 11
3.5: 3.2.6 3.3 HTML 1. 2. 3. 4. 1 3 12
3.3.1 (li ) 3.6 3.6: (TEX Wiki> ) (ul) (ol) <li> <ul> </ul> <li> <a> <a> href 13
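Section 3.3.1 works on list markup (ul, ol, li) and reads the href attribute of the a tags found there. The following is a rough sketch of that idea using Python's standard html.parser module. The class name, the sample HTML and the rule "collect every link inside a ul or ol" are assumptions; any further conditions the original method applies are not reproduced here.

```python
from html.parser import HTMLParser


class ListLinkExtractor(HTMLParser):
    """Collect href values of <a> tags that occur inside <ul> or <ol>."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0   # > 0 while inside at least one <ul>/<ol>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag == "a" and self.list_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_depth > 0:
            self.list_depth -= 1


# Hypothetical page fragment; the second <li> is left unclosed on purpose.
html_doc = """
<ul>
  <li><a href="http://example.org/a">A</a></li>
  <li><a href="http://example.org/b">B</a>
</ul>
<p><a href="http://example.org/other">not in a list</a></p>
"""
parser = ListLinkExtractor()
parser.feed(html_doc)
print(parser.links)
```

Tracking the depth of ul and ol rather than li keeps the sketch tolerant of pages that omit the closing li tag.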
2 3 (dl) (a) URL (b) URL URL (c) URL (a) URL./../ / URL ( mailto: ) http:// ftp:// (b) URL http://www.jaist.ac.jp/is/index-jp.html http://www.jaist.ac.jp/is/ Yahoo! http://www.geocities.jp/abc/ http://www.geocities.jp/def/ 3.3.2 (br ) 3.7 14
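The URL handling in steps (a) and (b) of Section 3.3.1 mentions relative references (./, ../, /), non-page schemes such as mailto:, and the pair http://www.jaist.ac.jp/is/index-jp.html and http://www.jaist.ac.jp/is/. A hedged sketch of two helpers in that spirit is given below, using urllib.parse from the Python standard library. Both function names and the exact rules (keep only http and ftp, cut the URL back to its directory) are assumptions, since the original wording of the steps is not available.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def absolutize(base_url, href):
    """Resolve href against base_url; return None for schemes other than http/ftp."""
    absolute = urljoin(base_url, href)
    if urlsplit(absolute).scheme not in {"http", "ftp"}:
        return None          # e.g. mailto: links are skipped
    return absolute


def directory_of(url):
    """Cut a URL back to its directory (assumed reading of step (b))."""
    parts = urlsplit(url)
    directory = parts.path[: parts.path.rfind("/") + 1] or "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


base = "http://www.jaist.ac.jp/is/index-jp.html"
print(absolutize(base, "../ks/index.html"))
print(absolutize(base, "mailto:someone@example.org"))
print(directory_of(base))
```

Under this reading, http://www.geocities.jp/abc/ and http://www.geocities.jp/def/ stay distinct even though they share a host.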
3.7: ( ) 15
href <a> </a> 1. 0 2. <br> <br> 1. 0 2. <a> 3 (font b i ) (em strong q ) (img) 1 <b><a href="url"> </a></b><br> 3.3.3 1 2 3.8 16
3.8: (shirabeyou.com > ) <table> </table> <tr> </tr> <a> <a> href <tr> 8 2 3 3.3.4 3 3.9 17
1 1 2 3 4 5 5 6 6 7 7 8 8 3.9: 3.4 3.4.1 ( ) 18
1 1. 2. 3. 4. [6] 1 1 ( ) 1 1 [ ] 3.10 ([ ], ) [7] 19
: ([ ], ) 3.10: 1 3.10 ([ ], ) p k mean(p, k) p k 1 (n p,n f ) mean(p, k) =(n p,n f ) p k k (n p,n f ) 1 mean(p, k) =(n p,n f ) p k 2 mean(p, k) 75% p x 3.1 ([ ], ) ([ ], ) 4 75%(3 ) ([ ], ) mean(p x, ) = (([ ], ), ([ ], )) 20
3.1: [ ] 4 1 [ ] 3 ( ) 6 3.2 Goo 2 ( ) 3.2: ([ ], [ ]) (003, 004,...) 211 ([ ], ) (002, 016,...) 102 ([ ], ) (006, 018,...) 23 ([ ], ) (240, 312,...) 22 ([ ], ) (043, 052,...) 15 ([ ], ) (005, 008,...) 7... ([ ], [ ]) 1. 5 21
10% 2. ([ ], [ ]) 3. [ ] ([ ], ) 3.2 1 5 ([ ], ) ([ ], [ ]) 211 10% ([ ], ) 4 ([ ], [ ]) ([ ], ) ([ ], ) ([ ], ) 2 ([ ], [ ]) 3 ([ ], ) ([ ], ) ([ ], ) 3 22
3.4.2 ( ) :... ([ ], [ ]) ([ ], ) :... ([ ], ) ([ ], ) :... (, ) ([ ], ) 1. 2. 3. 23
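(a) (b) 4. 2 3 2

Similarity between two vectors a and b:

\[ s(a, b) = \frac{a^{T} b}{\|a\| \, \|b\|} \tag{3.1} \]

\|x\|: norm of x [8]

3 0.4 50 ( ) 3.4.1 (Bag of words) TF (Term Frequency) IDF (Inverse Document Frequency) TF-IDF IDF ICF (Inverse Cluster Frequency) t ICF ( )

\[ \mathrm{icf}(t) = \ln\left( \frac{N_c}{\mathrm{cf}(t)} + 1 \right) \tag{3.2} \]

N_c: number of clusters
cf(t): number of clusters that contain term t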
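Equations (3.1) and (3.2) can be written as two small Python functions. This is a sketch only: the function names are made up for illustration, and returning 0 for a zero vector is an assumption that the text does not address.

```python
import math


def cosine_similarity(a, b):
    """Equation (3.1): inner product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def icf(n_clusters, cluster_freq):
    """Equation (3.2): ln(N_c / cf(t) + 1)."""
    return math.log(n_clusters / cluster_freq + 1)


print(round(icf(3, 1), 3), round(icf(3, 2), 3), round(icf(3, 3), 3))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Because of the +1 inside the logarithm, a term that occurs in every cluster still gets a positive weight (ln 2) rather than zero.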
Using ICF in place of IDF, the weight w(t, p) of term t in page p is

\[ w(t, p) = \mathrm{Normalize}(\mathrm{tf}(t, p)) \cdot \mathrm{icf}(t) = \frac{\mathrm{tf}(t, p)}{\sum_{t'} \mathrm{tf}(t', p)} \cdot \ln\left( \frac{N_c}{\mathrm{cf}(t)} + 1 \right) \tag{3.3} \]

In (3.3), TF is normalized within page p. The weight w(t, c) of term t in cluster c is

\[ w(t, c) = \sum_{p \in c} w(t, p) \tag{3.4} \]

Example with 5 pages (p_n, n = 1, ..., 5) and 5 terms (t_m, m = 1, ..., 5). Term frequencies tf(t, p):

        p1   p2   p3   p4   p5
  t1     3    1    2    0    1
  t2     0    0    1    0    0
  t3     1    0    1    0    0
  t4     0    1    0    2    1
  t5     0    0    0    2    3

p1 and p2 belong to cluster c1, and p5 belongs to cluster c2. With N_c = 3, the cluster frequencies are

\[ \mathrm{cf}(t) = (3, 1, 2, 3, 2)^{T} \]

and the ICF values are

\[ \mathrm{icf}(t) = \left( \ln(\tfrac{3}{3} + 1),\ \ln(\tfrac{3}{1} + 1),\ \ln(\tfrac{3}{2} + 1),\ \ln(\tfrac{3}{3} + 1),\ \ln(\tfrac{3}{2} + 1) \right)^{T} \approx (0.693,\ 1.386,\ 0.916,\ 0.693,\ 0.916)^{T} \]

TF-ICF: each entry is tf(t, p), divided by the column total of its page (4, 2, 4, 4, 5 for p1 to p5), times icf(t). For example, w(t1, p1) = 3/4 × 0.693 ≈ 0.520.

w(t, p) = tficf(t, p):

        p1      p2      p3      p4      p5
  t1   0.520   0.347   0.347   0       0.139
  t2   0       0       0.347   0       0
  t3   0.229   0       0.229   0       0
  t4   0       0.347   0       0.347   0.139
  t5   0       0       0       0.458   0.550

w(t, c), by (3.4):

        c1                        c2
  t1   0.520 + 0.347 = 0.866     0.139
  t2   0 + 0 = 0                 0
  t3   0.229 + 0 = 0.229         0
  t4   0 + 0.347 = 0.347         0.139
  t5   0 + 0 = 0                 0.550

Norms of p3, p4, c1, c2:

\[ \|p_3\| = \sqrt{0.347^2 + 0.347^2 + 0.229^2} \approx 0.542 \]
\[ \|p_4\| = \sqrt{0.347^2 + 0.458^2} \approx 0.575 \]
\[ \|c_1\| = \sqrt{0.866^2 + 0.229^2 + 0.347^2} \approx 0.961 \]
\[ \|c_2\| = \sqrt{0.139^2 + 0.139^2 + 0.550^2} \approx 0.584 \]

Similarities s(p, c) = p · c / (\|p\| \|c\|):

\[ s(p_3, c_1) = \frac{0.347 \times 0.866 + 0.229 \times 0.229}{0.542 \times 0.961} \approx 0.678 \]
\[ s(p_3, c_2) = \frac{0.347 \times 0.139}{0.542 \times 0.584} \approx 0.152 \]
\[ s(p_4, c_1) = \frac{0.347 \times 0.347}{0.575 \times 0.961} \approx 0.218 \]
\[ s(p_4, c_2) = \frac{0.347 \times 0.139 + 0.458 \times 0.550}{0.575 \times 0.584} \approx 0.894 \]

        c1      c2
  p3   0.678   0.152
  p4   0.218   0.894

Since s(p4, c2) = 0.894 > 0.4, p4 is assigned to c2. For the remaining page p3, the similarities to c1 and c2 are s(p3, c) = (0.678, 0.152). Since s(p3, c1) = 0.678 > 0.4, p3 is assigned to c1. As a result, p1, p2, p3 belong to c1, and p4, p5 belong to c2.
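The worked example of Section 3.4.2 can be recomputed with a short Python script. This is a sketch under stated assumptions: N_c and cf(t) are taken as given from the example rather than derived, every variable name is made up for illustration, and cluster vectors are not updated after a page is assigned (the example's own numbers do not change between the two assignments either). Unrounded arithmetic can differ from the rounded figures in the last digit.

```python
import math

# Term frequencies per page (terms t1..t5), read column-wise from the example.
TERM_FREQ = {
    "p1": [3, 0, 1, 0, 0],
    "p2": [1, 0, 0, 1, 0],
    "p3": [2, 1, 1, 0, 0],
    "p4": [0, 0, 0, 2, 2],
    "p5": [1, 0, 0, 1, 3],
}
INITIAL_CLUSTERS = {"c1": ["p1", "p2"], "c2": ["p5"]}
N_CLUSTERS = 3                  # N_c as given in the example
CLUSTER_FREQ = [3, 1, 2, 3, 2]  # cf(t) as given in the example
THRESHOLD = 0.4                 # similarity threshold used in the example


def tf_icf(tf_vector):
    """Equation (3.3): normalised TF times ICF, for one page."""
    total = sum(tf_vector)
    return [tf / total * math.log(N_CLUSTERS / cf + 1)
            for tf, cf in zip(tf_vector, CLUSTER_FREQ)]


def cosine(a, b):
    """Equation (3.1)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


page_vec = {page: tf_icf(tf) for page, tf in TERM_FREQ.items()}

# Equation (3.4): a cluster vector is the sum of its member page vectors.
cluster_vec = {
    name: [sum(column) for column in zip(*(page_vec[p] for p in members))]
    for name, members in INITIAL_CLUSTERS.items()
}

similarity = {}
assignment = {}
for page in ("p3", "p4"):       # the pages not yet in any cluster
    similarity[page] = {c: cosine(page_vec[page], v) for c, v in cluster_vec.items()}
    best = max(similarity[page], key=similarity[page].get)
    if similarity[page][best] > THRESHOLD:
        assignment[page] = best

for page, sims in similarity.items():
    print(page, {c: round(s, 3) for c, s in sims.items()}, "->", assignment.get(page))
```

The script places p3 in c1 and p4 in c2, in agreement with the example.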
4 4.1 4.1 5 3 4.1: ( ) 1 2 3 perl 4 5 29
4.2 4.2.1 3.3 1 15 15 (Error rate: E) (Precision: P) (Recall: R) 1

E = ( ) / ( )    (4.1)
P = ( ) / ( )    (4.2)
R = ( ) / ( )    (4.3)

4.2

Table 4.2:

  No.     Error rate (E)     Precision (P)      Recall (R)
  1        8.3% (1 / 12)     95.5% (21 / 22)    53.8% (21 / 39)
  2       42.1% (8 / 19)     52.9% (9 / 17)     34.6% (9 / 26)
  3       37.5% (9 / 24)     65.4% (17 / 26)    68.0% (17 / 25)
  4       34.8% (8 / 23)     50.0% (8 / 16)     57.1% (8 / 14)
  5       31.3% (5 / 16)     66.7% (10 / 15)    71.4% (10 / 14)
  Total   33.0% (31 / 94)    67.7% (65 / 96)    55.1% (65 / 118)
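The last row of Table 4.2 pools the counts of rows 1 to 5 (sum of numerators over sum of denominators) rather than averaging the five percentages. The sketch below checks this in Python. The dictionary layout and names are assumptions, the counts are copied from the table, and the definitions of E, P and R themselves are not reconstructed.

```python
# (numerator, denominator) pairs copied from Table 4.2, rows 1 to 5.
TABLE_4_2 = {
    1: {"E": (1, 12), "P": (21, 22), "R": (21, 39)},
    2: {"E": (8, 19), "P": (9, 17), "R": (9, 26)},
    3: {"E": (9, 24), "P": (17, 26), "R": (17, 25)},
    4: {"E": (8, 23), "P": (8, 16), "R": (8, 14)},
    5: {"E": (5, 16), "P": (10, 15), "R": (10, 14)},
}


def pooled(measure):
    """Sum numerators and denominators of one measure over all rows."""
    numerator = sum(row[measure][0] for row in TABLE_4_2.values())
    denominator = sum(row[measure][1] for row in TABLE_4_2.values())
    return numerator, denominator


for measure in ("E", "P", "R"):
    num, den = pooled(measure)
    print(f"{measure}: {num} / {den} = {100 * num / den:.1f}%")
```

The pooled counts come out as 31 / 94, 65 / 96 and 65 / 118, the fractions shown in the table's last row.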
URL URL JAIST ( http://www.jaist.ac.jp/ks/index.html ) ( http://www.jaist.ac.jp/is/index-jp.html ) ( http://www.jaist.ac.jp/ms/index.html ) JAIST http://www.jaist.ac.jp/ks/ http://www.jaist.ac.jp/is/ http://www.jaist.ac.jp/ms/ 3.3 3 ( ) (dt dd ) 4.2.2 3.4.1 2 31
15 3.2.3 15 30 15 30 4.3 4.3: 102 1 23 22 71 79 34 2 32 53 91 28 perl5 27 3 28 perl 39 78 4 80 25 99 5 73 148 56 1188 1 32
4.3 1 3 2 3 perl5 perl 3.4 1 perl5 perl perl5 perl 3 perl perl Perl P = ( ) ( ) (4.4) 4.4 1 = ( ) 2 1 2 ( ) 30 2 33
Table 4.4: 1 2 3 4 5

           53.3% (16 / 30)
           56.5% (13 / 23)
           81.8% (18 / 22)
           30.0% (9 / 30)
            6.7% (2 / 30)
           33.3% (10 / 30)
           20.0% (6 / 30)
           50.0% (15 / 30)
           23.3% (7 / 30)
            3.3% (1 / 28)
  perl5    92.6% (25 / 27)
           28.6% (8 / 28)
  perl     23.3% (7 / 30)
           50.0% (15 / 30)
           63.3% (19 / 30)
           24.0% (6 / 25)
           73.3% (22 / 30)
           13.3% (4 / 30)
           46.7% (14 / 30)
           90.0% (27 / 30)
  Total    42.6% (244 / 573)
3 perl5 perl perl perl5 perl5 perl 4 2 2 [ ] 5 2 2 2 3.2.4 15 3.2.3 15 4.5 1 2 perl5 Goo 35
Table 4.5: ( ) ( ) ( ) 1 2 3 4 5

            ( )                 ( )
            93.3% (14 / 15)     13.3% (2 / 15)
            91.7% (11 / 12)     18.2% (2 / 11)
           100.0% (4 / 4)       77.8% (14 / 18)
            60.0% (9 / 15)       0.0% (0 / 15)
            13.3% (2 / 15)       0.0% (0 / 15)
            57.1% (8 / 14)      12.5% (2 / 16)
            26.7% (4 / 15)      13.3% (2 / 15)
            46.7% (7 / 15)      53.3% (8 / 15)
            40.0% (6 / 15)       6.7% (1 / 15)
             6.7% (1 / 15)       0.0% (0 / 13)
  perl5    100.0% (1 / 1)       92.3% (24 / 26)
            29.6% (8 / 27)       0.0% (0 / 1)
  perl      24.1% (7 / 29)       0.0% (0 / 1)
            53.6% (15 / 28)      0.0% (0 / 2)
            72.0% (18 / 25)     20.0% (1 / 5)
            26.1% (6 / 23)       0.0% (0 / 2)
            73.3% (11 / 15)     73.3% (11 / 15)
           100.0% (4 / 4)        0.0% (0 / 26)
            66.7% (10 / 15)     26.7% (4 / 15)
            80.0% (8 / 10)      95.0% (19 / 20)
  Total     49.4% (154 / 312)   34.5% (90 / 261)
4.2.3 3.4.2 3.4.2 4.2.2 15 15 30 15 30 P = ( ) ( ) (4.5) 4.6 5 2/3 4.4 2 5 37
Table 4.6: 1 2 3 4 5

            24    87.5% (21 / 24)
             0    -
             0    -
             3    33.3% (1 / 3)
           150    10.0% (3 / 30)
           106     3.3% (1 / 30)
            20    30.0% (6 / 20)
            32    40.0% (12 / 30)
            85    76.7% (23 / 30)
             1     0.0% (0 / 1)
  perl5      0    -
             5    20.0% (1 / 5)
  perl      88    66.7% (20 / 30)
           102    50.0% (15 / 30)
            41    40.0% (12 / 30)
             1     0.0% (0 / 1)
            70    80.0% (24 / 30)
             1     0.0% (0 / 1)
           119    30.0% (9 / 30)
            77    93.3% (28 / 30)
  Total    925    49.6% (176 / 355)
2 4.2.2 15 3.2.3 15 4.7

Table 4.7: ( ) ( ) ( ) 1 2 3 4 5

            ( )                 ( )
            85.0% (17 / 20)    100.0% (4 / 4)
            -                   -
            -                   -
            50.0% (1 / 2)        0.0% (0 / 1)
            13.3% (2 / 15)       6.7% (1 / 15)
             6.7% (1 / 15)       0.0% (0 / 15)
            50.0% (4 / 8)       16.7% (2 / 12)
            36.8% (7 / 19)      45.5% (5 / 11)
            66.7% (10 / 15)     86.7% (13 / 15)
             0.0% (0 / 1)       -
  perl5     -                   -
            20.0% (1 / 5)       -
  perl      63.6% (14 / 22)     75.0% (6 / 8)
            70.6% (12 / 17)     23.1% (3 / 13)
            46.2% (12 / 26)      0.0% (0 / 4)
            -                    0.0% (0 / 1)
            60.0% (9 / 15)     100.0% (15 / 15)
            -                    0.0% (0 / 1)
            33.3% (5 / 15)      26.7% (4 / 15)
            50.0% (1 / 2)       96.4% (27 / 28)
  Total     48.7% (96 / 197)    50.6% (80 / 158)

30 30 30
5 5.1 55% 5 42.6% 5.2 4.2.1 3 3.4.1 41
50 TF-ICF 42
[1], Web. Master's thesis,, 2004.
[2] Satoshi Sato, Madoka Sato: Automatic Generation of Web Directories for Specific Categories. AAAI Workshop on Intelligent Information Systems, Orlando, July 18-19, 1999.
[3] Oren Zamir, Oren Etzioni: Web Document Clustering: A Feasibility Demonstration. SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 46-54, August 24-28, 1998.
[4],,, : Gooots. 67, 2005.
[5] Vivísimo, http://vivisimo.com/
[6] ChaSen's Wiki, http://chasen.naist.jp/hiki/chasen/
[7] syger.com - The English language stop-words, http://www.syger.com/jsc/docs/stopwords/english.htm
[8] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney: Impact of Similarity Measures on Web-page Clustering. AAAI 2000: Workshop of Artificial Intelligence for Web Search, pp. 58-64, July 2000.