Computer Engineering and Design (计算机工程与设计), Vol. 34, No. 2, Feb. 2013

Study on multi-category of common words based on statistics and rules

XIA Jing, CHAI Yu-mei, ZAN Hong-ying
(College of Information Engineering, Zhengzhou University, Zhengzhou 450001, China)

Abstract: The multi-category (multiple syntactic category) problem of words is one of the key issues in Chinese part-of-speech tagging. This paper studies the recognition of the syntactic category of common multi-category words, taking into account the different features that affect recognition. Three statistical methods are applied: conditional random fields, maximum entropy and k-nearest neighbor. Based on the characteristics of the multi-category words themselves and their relations to the sentence context, different feature templates built from word information and part-of-speech information are used with each method to extract features from the training corpus, and good experimental results are obtained. For words whose recognition results are not satisfactory, a rule-based method is also tried: rules for the multi-category words are constructed and the rule base is improved through repeated testing. Under the same conditions, the rule-based method gives better results than the statistical methods.

Key words: Chinese information processing; multi-category word; conditional random fields; maximum entropy; k-nearest neighbor

CLC number: TP391    Document code: A    Article ID: 1000-7024(2013)02-0654-06

Received: 2012-04-09; revised: 2012-06-17
Grant numbers: 60970083; 104100510026
Author biographies (birth years): 1986-; 1964-; 1966-
E-mail: happydayxia@126.com
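Vol. 34, No. 2    XIA Jing, CHAI Yumei, ZAN Hongying: Study on multi-category of common words based on statistics and rules (基于统计和规则的常用词的兼类识别研究)    655

Methods named in this part of the text (NLP): HMM, SVM, ME, CRF and k-nearest neighbor.

1.1 Conditional random fields (CRF)

Conditional random fields (CRF) were proposed by J. Lafferty et al. in 2001 (see also [6]).

Table 1  Feature columns of a CRF training row (18 columns, context window of 4 words on each side)

Column   Feature      Column   Feature
0        w            9        w+1
1        w-4          10       p+1
2        p-4          11       w+2
3        w-3          12       p+2
4        p-3          13       w+3
5        w-2          14       p+3
6        p-2          15       w+4
7        w-1          16       p+4
8        p-1          17       p

Here w is the multi-category word, w-i and w+i are the i-th words to its left and right, p-i and p+i are their part-of-speech tags, and p is the tag of w.

Example row (POS columns): p-4 = a, p-3 = n, p-2 = wd, p-1 = c, p+1 = rz, p+2 = vl, p+3 = v, p+4 = v, p = p.

CRF feature template:

    # Unigram
    U01:%x[0,1]/%x[0,2]
    U02:%x[0,3]/%x[0,4]
    U03:%x[0,5]/%x[0,6]
    U04:%x[0,7]/%x[0,8]
    U05:%x[0,9]/%x[0,10]
    U06:%x[0,11]/%x[0,12]
    U07:%x[0,13]/%x[0,14]
    U08:%x[0,15]/%x[0,16]
    # Bigram
    B

1.2 Maximum entropy (ME)

The maximum entropy principle was proposed by E. T. Jaynes in 1957 [7]. The features (tokens) are the word W, the context words W-i and W+i, their part-of-speech tags P-i and P+i (i = 1, 2, 3, ...), and the tag P of the word itself.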
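The 18-column row of Table 1 and the unigram template lines can be illustrated with a short sketch. It assumes a sentence is given as a list of (word, POS) pairs, pads positions outside the sentence with a hypothetical "_" token, and expands %x[0,col] macros the way CRF++ describes them (row offset 0, absolute column index). The function names and the toy English sentence are illustrative and are not taken from the paper.

```python
import re

PAD = "_"      # hypothetical padding token for positions outside the sentence
WINDOW = 4     # context size used in Table 1

def build_row(sentence, index):
    """Return the 18 columns of Table 1 for the word at `index`.

    sentence: list of (word, pos) pairs. Column 0 is w, columns 1-8 are
    w-4 p-4 ... w-1 p-1, columns 9-16 are w+1 p+1 ... w+4 p+4, column 17 is p.
    """
    def at(i):
        return sentence[i] if 0 <= i < len(sentence) else (PAD, PAD)

    word, pos = sentence[index]
    row = [word]
    for offset in range(-WINDOW, 0):        # left context, farthest word first
        row.extend(at(index + offset))
    for offset in range(1, WINDOW + 1):     # right context, nearest word first
        row.extend(at(index + offset))
    row.append(pos)                         # label column
    return row

def expand_template(template, row):
    """Expand the %x[0,col] macros of one template line against a single row."""
    return re.sub(r"%x\[0,(\d+)\]", lambda m: row[int(m.group(1))], template)

# Toy sentence standing in for the Chinese examples (assumption).
sentence = [("we", "r"), ("walk", "v"), ("and", "c"), ("talk", "v")]
row = build_row(sentence, 2)
print(" ".join(row))
print(expand_template("U04:%x[0,7]/%x[0,8]", row))
```

Each unigram line of the template pairs a context word with its tag, so U04 yields the feature for position -1 (w-1/p-1).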
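Each ME training sample is written on one line in the form

    label f1=v1 f2=v2 ... fn=vn

Example with a window of 4, where w is a word, p its part-of-speech tag and wp the word joined with its tag (word forms shown as (word)):

    p w0=(word) w-4=(word) p-4=a wp-4=(word)a w-3=(word) p-3=n wp-3=(word)n w-2=(word) p-2=wd wp-2=(word)wd w-1=(word) p-1=c wp-1=(word)c w+1=(word) p+1=rz wp+1=(word)rz w+2=(word) p+2=vl wp+2=(word)vl w+3=(word) p+3=v wp+3=(word)v w+4=(word) p+4=v wp+4=(word)v

1.3 K-nearest neighbor (KNN)

The k-nearest neighbor (KNN) method was proposed by Cover and Hart in 1968. To classify a sample x, the K training samples nearest to x are found, and x is assigned to the class to which most of these K samples belong. The KNN experiments use weka, with each sample described by a label and the features f1, f2, ..., fn with values v1, v2, ..., vn.

1.4 Rule method

The rules are described in BNF [11-12]. The rule identifiers and their forms are:

    <ID>: F, M, L, R, N, E
    F <1> <2> a v n
    M <1> <2> a v n
    L <1> <2> a v n
    R <1> <2> a v n
    N <1> MYM <2> a v n
    E <1> <2> a v n

Example rules:

    @<c> N ^N
    @<d> R ^R v
    @<c> F ^F ~
    @<d> N ^N w * v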
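The k-nearest-neighbor decision described in Section 1.3 can be sketched as follows. The distance used here, the number of feature positions whose values differ, is an assumption chosen because the features are nominal (words and tags); the source names weka for its KNN runs, and this sketch does not reproduce weka's implementation. The training samples are invented.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count positions where two equal-length nominal feature vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_classify(train, query, k):
    """Return the majority label among the k training samples nearest to query.

    train: list of (features, label) pairs; features are tuples of strings.
    """
    neighbours = sorted(train, key=lambda item: overlap_distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy samples (assumption): (p-1, p+1) context tags -> tag of the target word.
train = [
    (("n", "v"), "d"),
    (("n", "a"), "d"),
    (("v", "n"), "c"),
    (("a", "n"), "c"),
    (("n", "n"), "c"),
]
print(knn_classify(train, ("n", "v"), k=3))
```

With real data the feature tuples would hold the window features of Section 1.2, and the value of k would be tuned on held-out data.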
[Example sentences from the POS-tagged corpus; tags include /n, /vn, /v, /c, /d, /p, /u, /a, /b, /f, /m, /t, /qv, /rz, /ul, /ud, /lv, /Vg, /Ng, /wd, /wu, /wf, /wyy, /wyz]

2.1 CRF experiments

The CRF experiments use CRF++ (Yet Another CRF Toolkit) [CP/OL], http://www.chasen.org/~taku/software/CRF++. The corpus dates from 2000. Three feature templates, a, b and c, are compared.

Table 2  CRF results with feature templates a, b and c (words identified by their candidate tags)

Candidate tags   a          b          c
p/c/jn           92.2386%   94.8619%   92.2015%
d/p/v/c          93.4507%   97.0684%   93.0528%
d/m              79.8860%   93.7766%   79.5183%
p/d/n/Ug         87.5148%   96.1427%   87.3855%
d/c              89.3519%   89.8148%   89.3519%
d/c              86.3281%   82.8125%   85.2500%
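The percentages in Table 2 are assumed here to be per-word tagging accuracy, that is, correctly tagged occurrences divided by all test occurrences. Assuming further that each test line ends with the gold tag followed by the predicted tag (the layout crf_test produces when it appends its prediction as a last column), the figure could be computed as sketched below. The sample lines are invented.

```python
def accuracy(lines):
    """Percentage of non-empty lines whose last two columns (gold, predicted) agree."""
    total = 0
    correct = 0
    for line in lines:
        cols = line.split()
        if len(cols) < 2:          # blank lines separate sequences
            continue
        total += 1
        if cols[-2] == cols[-1]:
            correct += 1
    return 100.0 * correct / total if total else 0.0

# Invented output lines: feature columns, then gold tag, then predicted tag.
sample = [
    "w1 n v d d",
    "w2 a n c d",
    "",
    "w3 v n c c",
    "w4 n n d d",
]
print(f"{accuracy(sample):.4f}%")
```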
2.2 Maximum entropy experiments

The ME experiments use the maxent toolkit by Zhang Le, http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html, with the feature templates a, b and c.

Table 3  ME results with feature templates a, b and c

Candidate tags   a          b          c
p/c/jn           93.3104%   93.7165%   93.3442%
d/p/v/c          95.0796%   95.8755%   94.5007%
d/m              82.8165%   93.0413%   83.5465%
p/d/n/Ug         88.6194%   94.1959%   88.8780%
d/c              89.3519%   87.0370%   89.3519%
d/c              87.1094%   79.6875%   87.1094%

2.3 k-nearest neighbor experiments

The KNN experiments are run with weka.

Table 4  KNN results with feature templates a, b and c

Candidate tags   a          b          c
p/c/jn           93.4959%   91.8699%   93.6807%
d/p/v/c          93.8857%   92.5832%   93.9580%
d/m              86.4469%   92.3077%   93.4066%
p/d/n/Ug         85.0843%   89.4293%   89.6239%
d/c              89.8148%   88.8889%   89.8148%
d/c              84.7656%   86.3281%   83.9844%

2.4 Comparison of results

Figure 1  Experimental results of the three statistical methods (ME, KNN, CRF)

Table 5  Rule method compared with the best statistical result

Candidate tags   Rule method   Statistical method   Model
d/c              92.5781%      87.1094%             ME
d/m              96.5909%      93.7766%             CRF
d/c              97.6851%      89.8148%             CRF
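The statistical column of Table 5 can be re-derived from Tables 2 to 4 by taking, for each word, the best accuracy over all models and templates. The sketch below does this for the three words that appear in Table 5. The labels "d/c #1" and "d/c #2" are hypothetical names for the two words sharing the candidate tags d/c, numbered in the order of their rows in Tables 2 to 4.

```python
# Accuracies (%) copied from Tables 2-4; tuple order is template a, b, c.
# "d/c #1" and "d/c #2" are hypothetical labels for the two d/c rows.
RESULTS = {
    "d/m":    {"CRF": (79.8860, 93.7766, 79.5183),
               "ME":  (82.8165, 93.0413, 83.5465),
               "KNN": (86.4469, 92.3077, 93.4066)},
    "d/c #1": {"CRF": (89.3519, 89.8148, 89.3519),
               "ME":  (89.3519, 87.0370, 89.3519),
               "KNN": (89.8148, 88.8889, 89.8148)},
    "d/c #2": {"CRF": (86.3281, 82.8125, 85.2500),
               "ME":  (87.1094, 79.6875, 87.1094),
               "KNN": (84.7656, 86.3281, 83.9844)},
}
# Rule-method accuracies (%) copied from Table 5.
RULE = {"d/m": 96.5909, "d/c #1": 97.6851, "d/c #2": 92.5781}

def best_statistical(word):
    """Return (accuracy, model, template) of the best statistical run for word.

    Ties are resolved in favour of the first run in CRF, ME, KNN order.
    """
    runs = [(acc, model, "abc"[i])
            for model, accs in RESULTS[word].items()
            for i, acc in enumerate(accs)]
    return max(runs, key=lambda run: run[0])

for word in RULE:
    acc, model, template = best_statistical(word)
    print(f"{word}: rule {RULE[word]:.4f} vs {model} ({template}) {acc:.4f}, "
          f"gain {RULE[word] - acc:.4f}")
```

For all three words the rule method exceeds the best statistical run, which agrees with the comparison reported in Table 5.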
References:

[1] ZHANG Yizhe, QU Weiguang, LIU Jinke. Research on disambiguation of multiple syntactic category words based on ensemble of classifiers [J]. Journal of Nanjing Normal University, 2010, 33(4): 144-147 (in Chinese).
[2] HONG Mingcai, ZHANG Kuo, TANG Jie. A Chinese part of speech tagging approach using conditional random fields [J]. Computer Science, 2006, 33(10): 148-151 (in Chinese).
[3] ZHANG Hu, ZHENG Jiaheng. Consistency check on POS tagging of Chinese corpus based on classification [J]. Computer Engineering, 2008, 34(8): 90-92 (in Chinese).
[4] Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data [C] // Proceedings of the 18th ICML-01, 2001: 282-289.
[5] Cohn T, Blunsom P. Semantic role labeling with tree conditional random fields [C] // Proceedings of the Ninth Conference on Computational Natural Language Learning. Ann Arbor, Michigan: Association for Computational Linguistics, 2005: 169-172.
[6] MIAO Xuelei. Chinese word sense disambiguation method based on conditional random fields [D]. Shenyang: Shenyang Aerospace University, 2007 (in Chinese).
[7] Jaynes E T. Information theory and statistical mechanics [J]. Physics Reviews, 1957.
[8] CHEN Xiaorong, QIN Jin. Maximum entropy-based Chinese word sense disambiguation [J]. Computer Science, 2005, 32(5): 174-176 (in Chinese).
[9] ZHANG Lei. Chinese POS tagging study based on maximum entropy [D]. Dalian: Dalian University of Technology, 2008 (in Chinese).
[10] PENG Qiwei. Classification of emotional tendency of the Chinese text based on statistical methods [D]. Taiyuan: Shanxi University, 2007 (in Chinese).
[11] ZAN Hongying, ZHANG Kunli, CHAI Yumei. The formal description of the modern Chinese adverb usage [C] // The 8th Chinese Lexical Semantics Workshop Proceedings. Hong Kong, 2007 (in Chinese).
[12] ZAN Hongying, ZHANG Kunli, CHAI Yumei. Studies on the functional word knowledge base of modern Chinese [J]. Journal of Chinese Information Processing, 2007, 21(5): 107-111 (in Chinese).