Journal of South China Normal University (Natural Science Edition), 2016, 48(3): 53-58. doi: 10.6054/j.jscnun.2016.05.006

Research on Academic Semantic Search Using Word Vector Representations

CHEN Guohua(1), TANG Yong(2*), XU Yuying(2), HE Chaobo(3), XIAO Danyang(2)
(1. Network Center, South China Normal University, Guangzhou 510631, China; 2. School of Computer Science, South China Normal University, Guangzhou 510631, China; 3. School of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China)

CLC number: TP391.1  Document code: A  Article ID: 1000-5463(2016)03-0053-06

Abstract: Using the computer-science papers extracted from Scholat as the corpus, several word-vector training schemes built on the GloVe semantic toolkit are proposed, and their performance is compared and analyzed. A random projection method is then proposed for fast access to vectors in the large vector space. Finally, a semantic-vector computing scheme for whole academic documents is built on top of the word-vector representations. A series of experiments verifies the effectiveness of the proposed word-vector-based academic semantic search scheme. The scheme has been applied to the search function of Scholat, where it achieves satisfactory performance.

Key words: academic semantic computing; word vectors; random projection; Scholat

Received: 2016-04-24. Journal website: http://journal.scnu.edu.cn/n
Foundation items: National 863 Program of China (2013AA01A212); National Natural Science Foundation of China (61272067, 61502180); science and technology projects of Guangdong Province and Guangzhou City (2013B090800024, 2015A020209178, 2016A030303058, 2015A030310509, 2014A030310238, 2014J4300033)
*Corresponding author. Email: ytang4@qq.com

Since the 1990s, the semantics of document collections have mainly been modeled by Latent Semantic Analysis (LSA) [1-3], which factorizes the term-document matrix by Singular Value Decomposition, and by the probabilistic topic models of Blei et al. [4-5], notably Latent Dirichlet Allocation.
In recent years, deep learning [6-7] has renewed interest in distributed word representations, most prominently word2vec [8-10] and GloVe [11]. Mikolov et al. observed that the word vectors of a recurrent-neural-network language model exhibit linear semantic regularities, such as X(king) - X(man) ≈ X(queen) - X(woman) and X(shirt) - X(clothing) ≈ X(chair) - X(furniture) [10]. Whereas classical n-gram language models [12] predict a word only from the few words immediately preceding it, word2vec's CBOW architecture predicts the current word from its whole surrounding context window. GloVe, in turn, is a log-bilinear model trained on global word-word co-occurrence counts: it combines the global matrix statistics exploited by LSA with local context windows and, unlike topic models, directly yields a vector for every word.

To compose word vectors into vectors for larger text units, bag-of-words addition [13], the recursive neural networks of Socher et al. [14] and the Paragraph Vector model [15] have been proposed; here we use TF-IDF-weighted averaging of word vectors (section 2).

1 Word-vector-based academic semantic search
Our scheme proceeds as follows (a Python sketch is given at the end of this section):
(1) extract the corpus D of computer-science papers from Scholat [16];
(2) segment D with the ansj_seg toolkit [17];
(3) train GloVe on the segmented corpus to obtain a vocabulary V of word vectors;
(4) segment a user query Q and fix a similarity threshold θ;
(5) match the words of Q against V to retrieve semantically related words and, through them, related documents.
Ansj_seg [17] is an open-source Chinese word segmenter based on n-gram statistics; it processes roughly 2 000 000 characters per second with about 96% segmentation accuracy, fast enough to segment the whole paper corpus.
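Concretely, steps (3)-(5) amount to loading the GloVe text output and scanning the vocabulary for the nearest vectors. The following is a minimal Python sketch, not the Scholat implementation: the file name vectors.txt, the helper names and the analogy words are illustrative, and the scan is the exhaustive baseline whose cost is analyzed next.

import numpy as np

def load_vectors(path):
    # GloVe text output: one word per line, followed by its d vector components.
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    mat = np.vstack(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit rows: dot product = cosine
    return words, mat

def nearest_words(qvec, words, mat, theta=0.65, k=10):
    # Exhaustive scan: one dot product against each of the n vocabulary vectors.
    q = qvec / np.linalg.norm(qvec)
    sims = mat @ q
    top = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in top if sims[i] >= theta]

# Usage (illustrative):
# words, mat = load_vectors("vectors.txt")
# idx = {w: i for i, w in enumerate(words)}
# v = mat[idx["king"]] - mat[idx["man"]] + mat[idx["woman"]]  # analogy example above
# print(nearest_words(v, words, mat))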
Consider the cost of the matching step. If the query Q contains m words and the vocabulary V contains n words of dimension d, a brute-force scan costs O(mdn) per query, which is too slow for interactive search: the vocabulary trained from the Scholat corpus [16] contains 12 727 words. To accelerate the search we build N random projection trees over V: each tree recursively splits the vector space with random hyperplanes, so that nearby vectors tend to fall into the same leaf cell (Figure 1). The search procedure is given below; a Python sketch of it follows at the end of this section.

Figure 1  Split of random trees

Input: query vector v of dimension d; number of trees N
Output: candidate word set V
Procedure:
1. V = ∅
2. for i in 1..N:
3.    take the i-th random tree T_i;
4.    route v down T_i to the leaf cell S_j that contains it;
5.    let V_i be the set of words stored in S_j;
6.    V = V ∪ V_i;
7. return V
Exact cosine similarities are then computed only against the candidate set V, which is far smaller than the full vocabulary.

2 Experiments
The semantic vector of an academic document is computed from its title, abstract and keywords, weighted by TF-IDF: if the document contains words with vectors v_1, ..., v_N and TF-IDF weights w_1, ..., w_N, its vector is V = (w_1 v_1 + ... + w_N v_N) / (w_1 + ... + w_N). A Python sketch of this computation is also given below.
2.1 Effect of abstracts. Table 1 compares query performance when document vectors are built with the abstract (title + abstract + keywords) and without it (title + keywords).
2.2 Effect of keywords. Table 2 compares single-word query performance when document vectors are built with and without the keywords.
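The tree search above can be sketched in Python as follows, reusing the words/mat arrays of the earlier sketch. This is a minimal illustration that assumes random hyperplane splits at the median projection and a fixed leaf size; the paper's actual tree construction may differ in these details.

import numpy as np

class RPTree:
    # One random projection tree: each internal node splits the points that reach
    # it by a random hyperplane, as in Figure 1; leaves store small cells of word ids.
    def __init__(self, vectors, indices=None, leaf_size=32, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        idx = np.arange(len(vectors)) if indices is None else indices
        self.leaf, self.idx = True, idx
        if len(idx) <= leaf_size:
            return
        self.normal = self.rng.standard_normal(vectors.shape[1])  # random split direction
        proj = vectors[idx] @ self.normal
        self.threshold = np.median(proj)                          # balanced split
        left, right = idx[proj <= self.threshold], idx[proj > self.threshold]
        if len(left) == 0 or len(right) == 0:                     # degenerate split: stay a leaf
            return
        self.leaf = False
        self.left = RPTree(vectors, left, leaf_size, self.rng)
        self.right = RPTree(vectors, right, leaf_size, self.rng)

    def cell(self, v):
        # Route v to the single leaf cell S_j that contains it (steps 3-5).
        if self.leaf:
            return set(self.idx.tolist())
        child = self.left if v @ self.normal <= self.threshold else self.right
        return child.cell(v)

def candidate_set(trees, v):
    V = set()                  # step 1: V = ∅
    for t in trees:            # step 2: for i in 1..N
        V |= t.cell(v)         # step 6: V = V ∪ V_i
    return V                   # step 7

Exact similarities are then computed only over candidate_set(trees, q), with e.g. trees = [RPTree(mat) for _ in range(10)]; more trees raise recall at a proportional cost.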
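The document-vector computation of section 2 is equally short. In this sketch, word_vec (word to vector) and tfidf (word to TF-IDF weight) are hypothetical precomputed lookups, and the fallback weight of 1 for words missing from the TF-IDF table is an assumption of the sketch.

import numpy as np

def document_vector(tokens, word_vec, tfidf, d):
    # V = (w_1*v_1 + ... + w_N*v_N) / (w_1 + ... + w_N) over the segmented
    # words of the title, abstract and keywords.
    V = np.zeros(d, dtype=np.float32)
    total = 0.0
    for t in tokens:
        v = word_vec.get(t)
        if v is None:              # word not in the trained vocabulary
            continue
        w = tfidf.get(t, 1.0)      # assumed fallback weight for unseen words
        V += w * v
        total += w
    return V / total if total > 0 else V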
Table 1  Comparison of query performance with and without abstracts
(For two example queries, each table cell pairs a retrieved paper title with its cosine similarity to the query. With abstracts included, the similarities of the top-8 results lie between 0.93 and 0.57; without abstracts they drop to between 0.52 and 0.36, and the ranking changes substantially.)

Table 2  Comparison of single-word query performance with and without keywords
(With keywords the top-8 similarities again lie between 0.93 and 0.57; without keywords they drop to between 0.40 and 0.35 for the first query and between 0.51 and 0.38 for the second.)
2.3 Effect of TF-IDF weights. Table 3 compares query results with and without the TF-IDF weight information. The two settings retrieve the same documents, and their similarity scores agree to about eight decimal places; for these queries the weighting therefore leaves the ranking essentially unchanged.

Table 3  Comparison of query performance with and without TF-IDF weight information
(The same documents are retrieved in both settings; for example, the top result scores 0.926450286 in one setting and 0.926450288 in the other, and every remaining pair of scores likewise differs only from the ninth decimal place on.)

2.4 Query expansion. For a query Q, every vocabulary word whose cosine similarity to a word of Q exceeds the threshold θ = 0.65 is added to Q. Table 4 shows the expansion of the query "machine learning": besides the query word itself (similarity ≈ 1.0), three related words pass the threshold. Table 5 compares query performance with and without this expansion. A Python sketch of the expansion step follows.

Table 4  Related-word expansion of "machine learning"
(The query word itself scores 0.9999999999999996; the three expansion words admitted by the θ = 0.65 threshold score 0.706581424336596, 0.700522392797570 and 0.650479383139883.)
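A sketch of this expansion step, again over the unit-normalized words/mat arrays of the first sketch; θ = 0.65 as in Table 4, while the per-word candidate cap k is an assumption of the sketch (the paper states only the threshold).

import numpy as np

def expand_query(terms, words, mat, theta=0.65, k=5):
    # Add every vocabulary word whose cosine similarity to a query word exceeds theta.
    index = {w: i for i, w in enumerate(words)}
    expanded = list(terms)
    for t in terms:
        i = index.get(t)
        if i is None:
            continue
        sims = mat @ mat[i]              # rows are unit-normalized, so this is cosine
        for j in np.argsort(-sims)[:k]:
            w = words[j]
            if w != t and sims[j] >= theta and w not in expanded:
                expanded.append(w)       # related word above the threshold
    return expanded

For the query of Table 4, this would admit the three related words scoring 0.71, 0.70 and 0.65.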
Table 5  Comparison of query performance with and without query expansion
(Without expansion the top-8 similarities are 0.926, 0.686, 0.675, 0.658, 0.643, 0.642, 0.628 and 0.628; with expansion they become 0.884, 0.707, 0.654, 0.653, 0.651, 0.647, 0.639 and 0.636, so the expansion moderately reshapes the ranking while keeping the scores at a comparable level.)

3 Conclusion
Using word vectors trained with GloVe on the Scholat computer-science paper corpus, we composed TF-IDF-weighted document vectors, accelerated nearest-vector search with random projection trees, and expanded queries with semantically related words. A series of experiments verifies the effectiveness of the resulting word-vector-based academic semantic search scheme, which has been applied to the search function of Scholat with satisfactory performance.

References:
[1] DEERWESTER S, DUMAIS S T, FURNAS G W, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
[2] HOFMANN T. Unsupervised learning by probabilistic latent semantic analysis[J]. Machine Learning, 2001, 42(1/2): 177-196.
[3] DUMAIS S T. Latent semantic analysis[J]. Annual Review of Information Science and Technology, 2004, 38(1): 188-230.
[4] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
[5] BLEI D M, LAFFERTY J D. A correlated topic model of science[J]. The Annals of Applied Statistics, 2007, 1(1): 17-35.
[6] SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117.
[7] WESTON J, RATLE F, MOBAHI H, et al. Deep learning via semi-supervised embedding[J]. Lecture Notes in Computer Science, 2012, 7700: 1168-1175.
[8] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J/OL]. (2013-09-07)[2016-03-25]. http://www.oalib.com/paper/4057741#.Vx3Rz_mEAso.
[9] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
[10] MIKOLOV T, YIH W, ZWEIG G. Linguistic regularities in continuous space word representations[C]//Proceedings of NAACL-HLT. Atlanta: [s.n.], 2013: 746-751.
[11] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha: [s.n.], 2014: 1532-1543.
[12] BENGIO Y, SCHWENK H, SENÉCAL J S, et al. Neural probabilistic language models[M]//HOLMES D E, JAIN L C. Innovations in Machine Learning. Berlin: Springer, 2006: 137-186.
[13] MITCHELL J, LAPATA M. Composition in distributional models of semantics[J]. Cognitive Science, 2010, 34(8): 1388-1429.
[14] SOCHER R, LIN C C, MANNING C, et al. Parsing natural scenes and natural language with recursive neural networks[C]//Proceedings of the 28th International Conference on Machine Learning. Bellevue: [s.n.], 2011: 129-136.
[15] LE Q V, MIKOLOV T. Distributed representations of sentences and documents[C]//Proceedings of the 31st International Conference on Machine Learning. Beijing: [s.n.], 2014: 1188-1196.
[16] Scholat[Z/OL]. [2016-03-25]. http://www.scholat.com.
[17] NLPChina. Ansj[Z/OL]. [2016-04-10]. https://github.com/NLPchina/ansj_seg.