Vol. 40, No. 4  ACTA AUTOMATICA SINICA  April, 2014

Chinese Pinyin-to-character Conversion Based on Cascaded Reranking

LI Xin-Xin 1,2   WANG Xuan 1,2   YAO Lin 1,3   GUAN Jian 1,3

Abstract  The word n-gram language model is the most common approach to Chinese pinyin-to-character conversion. It is simple, efficient, and widely used in practice. However, in the decoding phase of the word n-gram model, the choice of a word depends only on its preceding words, so the model lacks long-distance grammatical and syntactic constraints. In this paper, we propose two reranking approaches to address this problem. The linear reranking approach uses minimum error learning to combine different sub-models, including word and character n-gram language models, a part-of-speech tagging model, and a dependency model. The averaged perceptron reranking approach reranks the candidates generated by the word n-gram model using features extracted from the word sequence, part-of-speech tags, and dependency tree. Experimental results on the Lancaster Corpus of Mandarin Chinese and the People's Daily corpus show that both reranking approaches efficiently exploit syntactic structure and outperform the word n-gram model. The perceptron reranking approach that takes the probability output of the linear reranking approach as its initial weight achieves the best performance.

Key words  Chinese pinyin-to-character conversion, reranking approach, minimum error learning, averaged perceptron

Citation  Li Xin-Xin, Wang Xuan, Yao Lin, Guan Jian. Chinese pinyin-to-character conversion based on cascaded reranking. Acta Automatica Sinica, 2014, 40(4): 624-634. DOI 10.3724/SP.J.1004.2014.00624
Manuscript received April 22, 2013; accepted September 22, 2013
Supported by Key Science and Technology Projects of the Ministry of National Science and Technology (2011ZX03002-004-01) and Shenzhen Basic Research Key Projects (JC201104210032A, JC201005260112A)
Recommended by Associate Editor DANG Jian-Wu
1. Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055  2. Shenzhen Applied Technology Engineering Laboratory for Internet Multimedia Application, Shenzhen 518055  3. Public Service Platform of Mobile Internet Application Security Industry, Shenzhen 518057

1 Introduction

Mandarin Chinese has only about 410 distinct pinyin syllables (ignoring tones) but more than 7,000 Chinese characters, of which about 3,500 are in common use; a single syllable such as "yi" can correspond to as many as 60 characters. Pinyin-to-character conversion, which maps an input pinyin sequence to the intended character sequence, is therefore highly ambiguous, and it is the core task of Chinese pinyin input methods.

The word n-gram language model is the most common approach to this task. Because no training corpus can cover all n-grams, smoothing techniques are needed to estimate the probabilities of unseen events [1]. Out-of-vocabulary (OOV) words further degrade conversion accuracy; class-based n-gram models [1-2] and adaptive language models [3-6] have been proposed to alleviate these problems.
However, in the decoding phase of the word n-gram model the choice of a word depends only on a few preceding words, so long-distance grammatical and syntactic constraints cannot be captured. Previous work on language modeling for this task includes faster and smaller n-gram implementations [7-8], variable n-grams [9], and structured probabilistic dependences [16]; syntax-based language models have also been explored [17-18]. Discriminative approaches apply maximum entropy models [10-11], class-based maximum entropy Markov models [12], support vector machines [13], and conditional random fields [14], while [15] mines conversion rules with rough sets. Yang et al. treat whole-sentence pinyin-to-character conversion as a machine translation problem [19]. Pinyin segmentation ambiguity [20-21] and error-tolerant input [22] have also been studied; conversion is the core technology of commercial input methods such as Sogou Pinyin (http://pinyin.sogou.com).

In this paper we propose two reranking approaches that introduce long-distance syntactic information into pinyin-to-character conversion. The linear reranking approach combines the word n-gram model with additional sub-models (a character n-gram model, a character-based conversion model, word-pinyin and word-tag occurrence models, a part-of-speech tagging model, a tag n-gram model, and a dependency model) using weights learned by minimum error training. The averaged perceptron reranking approach reranks the k-best candidates of the word n-gram model with features extracted from the word sequence, part-of-speech tags, and dependency tree. Experiments are conducted on the Lancaster Corpus of Mandarin Chinese (LCMC) and a People's Daily corpus.

2 Word n-gram language model

Let S = s_1, s_2, ..., s_n denote the input pinyin sequence, C = c_1, c_2, ..., c_n the corresponding character sequence, and W = w_1, w_2, ..., w_m (m <= n) the corresponding word sequence, where each word w_k covers the characters c_i, ..., c_j and the pinyin syllables (s_i, ..., s_j); Fig. 1 shows an example. Given S, conversion searches for the word sequence

W^* = \arg\max_W P(W|S)    (1)

which the word n-gram model rewrites by Bayes' rule as

W^* = \arg\max_W P(S|W)P(W) / P(S)    (2)
where

P(S|W) = \prod_{k=1}^{m} p((s_i, ..., s_j) | w_k)    (3)

P(W) = \prod_{k=1}^{m} p(w_k | w_1, ..., w_{k-1})    (4)

Fig. 1  An example of Chinese pinyin-to-character conversion

For a trigram model, P(W) = p(w_1) p(w_2|w_1) \prod_{k=3}^{m} p(w_k | w_{k-2}, w_{k-1}). Similarly, the character n-gram model searches for the character sequence

C^* = \arg\max_C P(C|S) = \arg\max_C P(S|C)P(C) / P(S)    (5)

where P(S|C) and P(C) are defined analogously to P(S|W) and P(W). Both n-gram models are decoded by beam search: at each position j, the decoder considers every word whose pinyin matches a span s_i, ..., s_j with i >= max(0, j - 20), extends the hypotheses ending at position i, and keeps only the k best hypotheses ending at position j.

3 Sub-models

Besides the word and character n-gram language models, the reranking approaches employ the following sub-models.

3.1 Character-based conversion model

The character-based discriminative model selects, among the candidate character sequences GEN(S) of a pinyin sequence S,

C^* = \arg\max_{C \in GEN(S)} P_char(C|S) = \arg\max_{C \in GEN(S)} \Phi(S, C) \cdot \bar{\alpha}    (6)

where \Phi(S, C) is the feature vector of the pair (S, C) and \bar{\alpha} is the weight vector, trained with the averaged perceptron [23-24]. The feature templates are listed in Table 1.

Table 1  Feature templates for the character-based model
1  s_n (n = -2, ..., 2)
2  s_n s_{n+1} (n = -1, ..., 0)
3  s_{-1} s_1

3.2 Word-pinyin occurrence model

Given S and a candidate W,

P_occur(W|S) = \prod_{k=1}^{m} p(w_k | (s_i, ..., s_j))    (7)

P_occur(S|W) = \prod_{k=1}^{m} p((s_i, ..., s_j) | w_k)    (8)

where p(w_k | (s_i, ..., s_j)) and p((s_i, ..., s_j) | w_k) describe how often the word w_k and the pinyin span (s_i, ..., s_j) occur together. Both are estimated from the training corpus with add-one smoothing:

p(w_k | (s_i, ..., s_j)) = (N(w_k, (s_i, ..., s_j)) + 1) / (N(s_i, ..., s_j) + N_{S_{j-i+1}})    (9)

p((s_i, ..., s_j) | w_k) = (N((s_i, ..., s_j), w_k) + 1) / (N(w_k) + N_{W_{j-i+1}})    (10)

where N(·) denotes a count in the training corpus, N_{S_{j-i+1}} is the number of distinct pinyin spans of length j-i+1, and N_{W_{j-i+1}} is the number of distinct words of length j-i+1.
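To make the beam-search decoding of the word n-gram model concrete, the following sketch converts a pinyin sequence with a word bigram model. The lexicon, the bigram probability function, and the parameter values here are illustrative assumptions, not the paper's implementation (which uses a trigram model, a maximum word length of 20 syllables, and the interpolated sub-model scores of Section 3).

```python
import math

def decode(syllables, lexicon, bigram_logp, beam=10, max_len=5):
    """Beam-search pinyin-to-character decoding with a word bigram model.

    `lexicon` maps a tuple of pinyin syllables to candidate words, and
    `bigram_logp(prev, word)` returns log p(word | prev); both are
    hypothetical stand-ins for the paper's trained resources.
    """
    n = len(syllables)
    # chart[j]: up to `beam` hypotheses covering syllables[:j],
    # each stored as (log-score, last word, output word list).
    chart = {0: [(0.0, "<s>", [])]}
    for j in range(1, n + 1):
        cands = []
        for i in range(max(0, j - max_len), j):
            span = tuple(syllables[i:j])
            for w in lexicon.get(span, []):
                for score, prev, out in chart.get(i, []):
                    cands.append((score + bigram_logp(prev, w), w, out + [w]))
        # prune to the k best hypotheses ending at position j
        chart[j] = sorted(cands, key=lambda h: -h[0])[:beam]
    return chart[n][0][2] if chart.get(n) else []
```

With a toy lexicon such as {("ni", "hao"): ["你好"], ("ni",): ["你"], ("hao",): ["好"]}, the decoder prefers the multi-syllable word whenever the language model assigns it a higher probability than the concatenation of shorter words.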
3.3 Part-of-speech tagging model

Given a word sequence W, the part-of-speech tagging model finds the best tag sequence T. Chinese POS tagging can be performed one word at a time or all at once, with word-based or character-based features [25]. Our tagger selects

T^* = \arg\max_{T \in GEN(W)} P(T|W) = \arg\max_{T \in GEN(W)} \Phi(T, W) \cdot \bar{\alpha}    (11)

where GEN(W) is the set of possible tag sequences for W and \Phi(T, W) is the feature vector of the pair (W, T), trained with the averaged perceptron [23, 26]. The feature templates, adapted from [26], are listed in Table 2; w and t denote words and tags, c_n the n-th character of w_0, start(w) and end(w) the first and last character of w, len(w) its length, and class(c) the character class. Trained and evaluated on the Chinese Treebank 5 (CTB5), the tagger reaches an F1 score of 95.26%.

Table 2  Feature templates for part-of-speech tagging
1  w_{-2} t_0                  11  end(w_{-1}) w_0 start(w_1) t_0, when len(w_0) = 1
2  w_{-1} t_0                  12  start(w_0) t_0
3  w_0 t_0                     13  end(w_0) t_0
4  w_1 t_0                     14  c_n t_0 (n = 1, ..., len(w_0) - 2)
5  w_2 t_0                     15  start(w_0) c_n t_0 (n as above)
6  t_{-1} t_0                  16  end(w_0) c_n t_0 (n as above)
7  t_{-2} t_{-1} t_0           17  c_n c_{n+1} t_0 (c_n = c_{n+1})
8  t_{-1} w_0                  18  class(start(w_0)) t_0
9  w_0 t_0 end(w_{-1})         19  class(end(w_0)) t_0
10 w_0 t_0 start(w_1)

3.4 Tag n-gram model

Analogously to the word n-gram model, the tag n-gram model scores a tag sequence T as

P(T) = \prod_{i=1}^{m} p(t_i | t_1, ..., t_{i-1})    (12)

where p(t_i | t_1, ..., t_{i-1}) is estimated in the same way as p(w_k | w_1, ..., w_{k-1}).

3.5 Word-tag occurrence model

Given W and its tag sequence T,

P_occur(W|T) = \prod_{i=1}^{m} p(w_i | t_i)    (13)

P_occur(T|W) = \prod_{i=1}^{m} p(t_i | w_i)    (14)

where p(w_i | t_i) and p(t_i | w_i) are estimated from the training corpus with smoothing, as in Eqs. (9) and (10).

3.6 Dependency model

The dependency tree D of a candidate (W, T) is produced by a transition-based dependency parser with rich non-local features [27]; trained on CTB5, the parser reaches an accuracy of about 86%. Fig. 2 shows an example of a dependency tree. The dependency model scores a tree as

P(D|W, T) = \sum_i w_i f_i(D|W, T)    (15)

where the f_i(D|W, T) are dependency features and the w_i their weights.

Fig. 2  An example of a dependency tree
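The POS sub-model scores of Eqs. (12)-(14) can be sketched as follows for a single candidate, using a tag bigram for P(T). The three probability functions are hypothetical inputs; in the paper they are estimated from the training corpus.

```python
import math

def pos_submodel_logps(words, tags, tag_bigram_p, word_given_tag_p, tag_given_word_p):
    """Log scores of the POS sub-models for one candidate (W, T):
    log P(T) with a tag bigram         (Eq. (12), truncated to bigrams),
    log P_occur(W|T) = sum log p(w|t)  (Eq. (13)),
    log P_occur(T|W) = sum log p(t|w)  (Eq. (14)).
    The three probability tables are assumed inputs."""
    prev = "<s>"
    lp_T = 0.0
    for t in tags:
        lp_T += math.log(tag_bigram_p(prev, t))
        prev = t
    lp_WT = sum(math.log(word_given_tag_p(w, t)) for w, t in zip(words, tags))
    lp_TW = sum(math.log(tag_given_word_p(t, w)) for w, t in zip(words, tags))
    return lp_T, lp_WT, lp_TW
```

In the linear reranking model of Section 4 these log scores enter the weighted combination alongside the n-gram and dependency scores.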
4 Linear reranking model

Given a pinyin sequence S, beam-search decoding with the word n-gram model generates the k-best candidate word sequences; each candidate W, with its character sequence C, tag sequence T, and dependency tree D, is then scored by all sub-models. Combining the sub-model scores is a form of feature-level fusion [28]. The combination weights are learned with the minimum error rate training method (MERT) [29], which has also been applied to cascaded joint segmentation and POS tagging [30]. The combined score of a candidate is

P_mert(W|S) = P_mert(W, C, T, D|S) = \sum_{i=0}^{9} w_i P_sub_i(W, C, T, D|S)
  = w_0 P(W) + w_1 P(C) + w_2 P_char(C|S) + w_3 P_occur(W|S) + w_4 P_occur(S|W)
  + w_5 P(T|W) + w_6 P(T) + w_7 P_occur(W|T) + w_8 P_occur(T|W) + w_9 P(D|W, T)    (16)

with \sum_{i=0}^{9} w_i = 1, where P(W) is the word n-gram probability, P(C) the character n-gram probability, P_char(C|S) the character-based conversion score, P_occur(W|S) and P_occur(S|W) the word-pinyin occurrence probabilities, P(T|W) the POS tagging score, P(T) the tag n-gram probability, P_occur(W|T) and P_occur(T|W) the word-tag occurrence probabilities, and P(D|W, T) the dependency score. The tags T and the tree D of each candidate are produced by the models of Section 3.

MERT optimizes one weight at a time. When w_j (0 <= j <= 9) is being optimized, the combined score can be written as

P_mert(W|S) = w_j P_j(W|S) + \sum_{i != j} w_i P_i(W|S)    (17)

where the second term is fixed with respect to w_j; the character error rate then changes only at a finite number of values of w_j, where the ranking of the candidates changes, so the optimal w_j can be found exactly [29, 31]. The weights are tuned iteratively until convergence.

5 Averaged perceptron reranking

The linear reranking model combines only sub-model scores and cannot exploit fine-grained features of the word sequence, the tag sequence, and the dependency tree. We therefore also use an averaged perceptron reranker, which selects

W^* = \arg\max_{W \in GEN(S)} P_rerank(W|S) = \arg\max_{W \in GEN(S)} (P_init(W|S) + \Phi(W, S) \cdot \bar{\alpha})    (18)

where GEN(S) is the k-best candidate list of S and P_init(W|S) is the initial score of a candidate, which can be the word n-gram probability P(W) or the linearly combined score P_mert(W|S) of Section 4; taking P_mert(W|S) as the initial score yields the cascaded reranking of the title. \Phi(W, S) is the feature vector described in Section 6. For each training sentence S with candidates D_i (i = 1, ..., k), the weight vector is updated as

\bar{\alpha} = \bar{\alpha} + \Phi(D_g) - \Phi(D_p)    (19)

where D_g is the candidate D_i closest to the gold-standard word sequence of S (which itself may not appear among the k candidates) and D_p is the candidate currently preferred by the model.
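A minimal sketch of the averaged-perceptron reranker of Eqs. (18)-(19), assuming each candidate carries an initial score and a feature extractor. The interfaces are illustrative: the real system uses the Table 6 features over words, tags, and dependency trees, and picks as D_g the candidate closest to the gold sequence.

```python
from collections import defaultdict

def train_perceptron_reranker(data, feats, init_score, epochs=5):
    """Averaged-perceptron training over k-best lists (Eqs. (18)-(19)).

    `data` is a list of (candidates, gold_index) pairs, where gold_index
    marks the candidate closest to the gold sequence; `init_score(cand)`
    is the candidate's initial score (e.g. the n-gram or MERT-combined
    probability) and `feats(cand)` returns a feature-count dict.
    """
    w = defaultdict(float)      # current weights
    acc = defaultdict(float)    # accumulated weights for averaging
    t = 0
    for _ in range(epochs):
        for cands, gold in data:
            # pick the highest-scoring candidate under Eq. (18)
            pred = max(range(len(cands)),
                       key=lambda i: init_score(cands[i]) +
                       sum(w[f] * v for f, v in feats(cands[i]).items()))
            if pred != gold:    # Eq. (19): promote D_g, demote D_p
                for f, v in feats(cands[gold]).items():
                    w[f] += v
                for f, v in feats(cands[pred]).items():
                    w[f] -= v
            for f in w:
                acc[f] += w[f]
            t += 1
    return {f: acc[f] / t for f in acc}
```

Averaging the weight vector over all updates reduces overfitting to the last few training sentences, which is the standard motivation for the averaged perceptron.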
6 Experiments

6.1 Data

The main experiments use the Lancaster Corpus of Mandarin Chinese (LCMC, http://www.lancs.ac.uk/fass/projects/corpus/lcmc/), a balanced corpus of 15 text categories; a People's Daily dataset of 50,000 sentences (40,000 for training and 5,000 each for development and testing, http://www.uniml.com/nlp/py2char/pddata.tar.gz) is used for comparison with previous work in Section 6.7. The LCMC text, 45,735 sentences in total, is split at punctuation marks such as commas, periods, exclamation marks, colons, and semicolons. Pinyin annotations are generated automatically with pinyin4j (http://pinyin4j.sourceforge.net); a manual check shows an annotation accuracy of about 99.7%. Statistics of the resulting LCMC training, development, and test sets (http://www.uniml.com/nlp/py2char/lcmcdata.tar.gz) are given in Table 3.

Table 3  Statistics of training, development, and test data
                                  Train        Dev       Test
Sentences                       112,859      8,235      8,623
Words                           875,397     62,961     63,608
Words (excluding punctuation)   723,424     52,039     52,263
Characters                    1,144,559     82,527     82,788
OOV words                        13,692      1,043      1,245
OOV rate (%)                       1.56       1.66       1.96

6.2 N-gram baselines

The word and character n-gram language models are trained on People's Daily corpora from 1998, 2006, 2007, 2009, and 2012. The lexicon contains 130,750 words, combining about 7,000 single characters, 56,064 words from the segmented training corpus, and 94,412 words from the Google Chinese 5-gram corpus (http://www.ldc.upenn.edu/catalog/catalogentry.jsp?catalogid=ldc2010t06); the character n-gram model uses the roughly 7,000 characters. All n-gram models are trained with the SRILM toolkit [32] using Kneser-Ney smoothing.

Conversion quality is measured by the character error rate (CER),

CER = (number of wrongly converted characters / total number of characters) × 100%    (20)

together with the in-vocabulary word error rate (IVWER) and the out-of-vocabulary word error rate (OOVWER). Table 4 compares the baselines on the development data, and Fig. 3 plots the oracle CER of the k-best lists of the word n-gram model for different k. With k = 1 the oracle equals the word n-gram CER; the oracle CER drops quickly as k grows, so the reranking experiments below use the 100-best candidates.

Table 4  Performance of different LMs on development data (%)
                         CER    IVWER   OOVWER
Character n-gram       12.92    11.25    15.04
Word n-gram            11.27     9.03    14.23
Word n-gram oracle      7.38     6.89    11.58
k-best oracle           2.01     4.01     5.94

Table 4 shows that the word n-gram model outperforms the character n-gram model on both in-vocabulary and out-of-vocabulary words, and that the oracle CER of the candidate list is far below both, leaving substantial room for reranking.
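Eq. (20) can be computed directly because pinyin-to-character conversion emits exactly one character per input syllable, so the output and the reference are character-aligned; a sketch (function name illustrative):

```python
def cer(converted, reference):
    """Character error rate of Eq. (20): the percentage of wrongly
    converted characters.  Assumes the two strings are aligned
    character-by-character, which holds for pinyin-to-character
    conversion (one character per pinyin syllable)."""
    if len(converted) != len(reference):
        raise ValueError("sequences must be character-aligned")
    wrong = sum(c != r for c, r in zip(converted, reference))
    return 100.0 * wrong / len(reference)
```

IVWER and OOVWER are computed the same way over word tokens, restricted to in-vocabulary and out-of-vocabulary reference words respectively.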
Fig. 3  Oracle CER of different k-best candidates on development data

6.3 Linear reranking

For each sentence S of the development data, the 100-best candidates and their structures are scored by the sub-models of Eq. (16), and the MERT weights are tuned to minimize CER. Table 5 reports the results, where A denotes the combination of all sub-models and A/i denotes the combination with the i-th sub-model removed.

Table 5  Experimental results of linear reranking model on development data
Model                                       CER (%)
A    (all sub-models)                         9.76
A/0  (without the word n-gram model)         10.47
A/1  (without the character n-gram model)    10.10
A/2                                           9.83
A/3                                          10.06
A/4                                           9.84
A/5                                           9.83
A/6                                           9.76
A/7  (without the tag n-gram model)           9.78
A/8                                           9.82

The linear combination clearly outperforms the word n-gram baseline (11.27% CER in Table 4). Removing the word n-gram model hurts most, followed by the character n-gram model, while each of the remaining sub-models contributes a small improvement. Fig. 4 compares the conversion performance of the individual sub-models on the development data; the word n-gram model is the strongest single model, which is why it also serves as the candidate generator.

Fig. 4  Performance of sub-models on development data
6.4 Perceptron reranking

The averaged perceptron reranker uses the feature templates of Table 6, extracted from each candidate's word sequence (group W), POS tags (groups T1 and T2), and dependency tree (groups D1, D2, and D3). In the templates, w_0 and t_0 denote the current word and its tag, and w_{±i}, t_{±i} the word and tag i positions to the right (+i) or left (-i). In the dependency features, S denotes the current word, F its father (head), G its grandparent, and L and R the left and right children of F; X_w and X_t are the word and tag of node X, d_{XY} the distance between nodes X and Y, and A_{XY} the direction of the arc between them.

Table 6  Features for the reranking model
W   w_{-2}w_{-1}w_0, w_{-1}w_0, w_0
T1  t_{-2}t_{-1}t_0, t_{-1}t_0, t_0, t_0 w_0
T2  t_{-1}t_0 w_0, w_{-1}t_{-1}t_0 w_0, t_{-2}w_{-1}t_{-1}t_0 w_0, w_{-2}t_{-2}w_{-1}t_{-1}t_0 w_0
D1  F_t d_{FS} A_{FS} S_t, F_w d_{FS} A_{FS} S_w S_t,
    F_w F_t d_{FS} A_{FS} S_t, F_w F_t d_{FS} A_{FS} S_w S_t
D2  G_t d_{GF} A_{GF} F_t d_{FS} A_{FS} S_t, G_t d_{GF} A_{GF} F_t d_{FS} A_{FS} S_w S_t,
    G_t d_{GF} A_{GF} F_w F_t d_{FS} A_{FS} S_t, G_w G_t d_{GF} A_{GF} F_t d_{FS} A_{FS} S_t,
    G_t d_{GF} A_{GF} F_w F_t d_{FS} A_{FS} S_w S_t, G_w G_t d_{GF} A_{GF} F_w F_t d_{FS} A_{FS} S_t,
    G_w G_t d_{GF} A_{GF} F_w F_t d_{FS} A_{FS} S_w S_t
D3  L_t d_{FL} A_{FL} R_t d_{FR} A_{FR} F_t, L_w L_t d_{FL} A_{FL} R_w R_t d_{FR} A_{FR} F_t,
    L_t d_{FL} A_{FL} R_t d_{FR} A_{FR} F_w F_t, L_w L_t d_{FL} A_{FL} R_w R_t d_{FR} A_{FR} F_w F_t

Fig. 5 compares the reranking models with different initial weights P_init(W|S) and the flat (word and POS) feature sets on the development data. Using the linearly combined score P_mert(W|S) as the initial weight consistently outperforms using the word n-gram probability P(W), and adding the POS feature groups helps in both settings. Table 7 extends the comparison to the dependency feature sets: with P(W) as the initial weight the dependency features hurt, while with P_mert(W|S) they bring a small further gain; the best configuration (W, T1, D1, D3) reaches 9.39% CER, 1.88 points below the word n-gram baseline.

Fig. 5  Results of reranking models with different initial weights and flat feature sets on development data

6.5 Results on test data

Table 8 compares the approaches on the test data. The linear reranking model achieves a CER of 10.96%, 1.07 points (8.89% relative) below the word n-gram model, and the cascaded perceptron reranking model further reduces the CER to 10.60%. Both clearly outperform the previous approaches [33-34].
Table 7  Results of reranking models with different initial weights and dependency feature sets on development data
Initial weight   Features          CER (%)
P(W)             W                   9.95
                 W, D1              11.26
                 W, D1, D2          11.37
                 W, D1, D3          11.46
P_mert(W|S)      W, T1               9.50
                 W, T1, D1           9.40
                 W, T1, D1, D2       9.40
                 W, T1, D1, D3       9.39

Table 8  Comparison of different approaches on test data
Approach              CER (%)
Word n-gram            12.03
Linear reranking       10.96
Cascaded reranking     10.60
[33]                   18.49
[33-34]                13.9

6.6 Error analysis

Table 9 breaks down the recovery rates of in-vocabulary words (IVR) and out-of-vocabulary words (OOVR) by word length on the development data. Single-character words are the hardest case: out-of-vocabulary single-character words are never recovered, since the decoder can only produce words from the lexicon. Both reranking models improve the in-vocabulary and out-of-vocabulary recovery rates at almost every word length. Fig. 6 further compares the models on sentences with different numbers of words.

Table 9  Error analysis of different approaches on development dataset (%)
              Word n-gram      Linear reranking   Cascaded reranking
Word length   IVR     OOVR     IVR     OOVR       IVR     OOVR
1             81.18    0       82.62    0         84.26    0
2             90.34   44.93    92.47   46.96      92.06   47.11
3             96.34   56.98    97.84   58.91      97.05   60.07
4             99.59   78.72    99.32   78.01      99.18   78.72
All           86.13   52.45    87.86   54.08      88.45   54.56

Fig. 6  Results of different models on sentences with different word numbers

6.7 Comparison on the People's Daily dataset

To compare with previous work, the models are also evaluated on the People's Daily dataset described in Section 6.1; the results are listed in Table 10.

Table 10  Comparison of different approaches on People's Daily dataset
Approach                          CER (%)
Maximum entropy [11]               10.86
Class-based MEMM [12]               5.28
Support vector machines [13]        7.06
Conditional random fields [14]     11.46
Machine translation [19]            4.45
Hybrid word lattice [35]            7.99
Word n-gram                         5.48
Linear reranking                    4.74
Cascaded reranking                  4.39
Adaptive model [3]                  4.98
Adaptive model [5]                  1.52
Adaptive model [6]                  4.44
Among the previous non-adaptive approaches [11-14, 19, 35], the machine translation approach [19] performs best. Our cascaded reranking model achieves a lower CER than [19] and reduces the CER of the word n-gram baseline from 5.48% to 4.39%. The adaptive language models [3, 5-6] update their statistics during testing and are therefore not directly comparable with the static models; even so, the cascaded model is competitive with most of them.

6.8 Complexity

Beam-search decoding, POS tagging, dependency parsing, and reranking all process a sentence left to right with constant beam sizes, so the overall decoding complexity of the cascaded system remains linear in the sentence length l, i.e., O(l).

7 Conclusion

This paper addressed the lack of long-distance grammatical and syntactic constraints in word n-gram decoding for Chinese pinyin-to-character conversion. We proposed a linear reranking approach, which combines word and character n-gram language models, a character-based conversion model, occurrence models, a part-of-speech tagging model, a tag n-gram model, and a dependency model with weights learned by minimum error training, and an averaged perceptron reranking approach, which reranks the k-best candidates of the word n-gram model with word, part-of-speech, and dependency features. Experiments on LCMC and People's Daily show that both approaches outperform the word n-gram model, and that the cascaded model, which uses the linearly combined score as the initial weight of the perceptron reranker, performs best.

References
1 Chen S F, Goodman J. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report, Computer Science Group, Harvard University, 1998
2 Brown P F, deSouza P V, Mercer R L, Della Pietra V J, Lai J C. Class-based n-gram models of natural language. Computational Linguistics, 1992, 18(4): 467-479
3 Huang J H, Powers D M. Adaptive compression-based approach for Chinese pinyin input. In: Proceedings of the 3rd SIGHAN Workshop on Chinese Language Processing. Barcelona, Spain: Association for Computational Linguistics, 2004. 24-27
4 Wei J, Li P X. Applying the word acquiring algorithm to the pinyin-to-character conversion. In: Proceedings of the 5th International Conference on Natural Computation. Washington, DC, USA: IEEE Computer Society, 2009. 17-21
5 Tang B Z, Wang X L, Wang X, Wang Y H. Frequency-based online adaptive n-gram models. In: Proceedings of the 2nd International Conference on Multimedia and Computational Intelligence. Wuhan, China: IEEE, 2010. 263-266
6 Huang J H, Powers D. Error-driven adaptive language modeling for Chinese pinyin-to-character conversion. In: Proceedings of the 2011 International Conference on Asian Language Processing. Penang, Malaysia: IEEE, 2011. 19-22
7 Pauls A, Klein D. Faster and smaller n-gram language models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, 2011. 258-267
8 Shan Yu-Xiang, Chen Xie, Shi Yong-Zhe, Liu Jia. Fast language model look-ahead algorithm using extended n-gram model. Acta Automatica Sinica, 2012, 38(10): 1618-1626
9 Siu M H, Ostendorf M. Variable n-grams and extensions for conversational speech language modeling.
IEEE Transactions on Speech and Audio Processing, 2000, 8(1): 63-75
10 Wang X, Li L, Yao L, Anwar W. A maximum entropy approach to Chinese pinyin-to-character conversion. In: Proceedings of the 2006 IEEE International Conference on Systems, Man, and Cybernetics. Taipei, China: IEEE, 2006. 2956-2959
11 Zhao Y, Wang X L, Liu B Q, Guan Y. Research of pinyin-to-character conversion based on maximum entropy model. Journal of Electronics, 2006, 23(6): 864-869
12 Xiao J H, Liu B Q, Wang X L. Exploiting pinyin constraints in pinyin-to-character conversion task: a class-based maximum entropy Markov model approach. Computational Linguistics and Chinese Language Processing, 2007, 12(3): 325-348
13 Jiang Wei, Guan Yi, Wang Xiao-Long, Liu Bin-Quan. Pinyin-to-character conversion model based on support vector machines. Journal of Chinese Information Processing, 2007, 21(2): 100-105
14 Li L, Wang X, Wang X L, Yu Y B. A conditional random fields approach to Chinese pinyin-to-character conversion. Journal of Communication and Computer, 2009, 6(4): 25-31
15 Wang X L, Chen Q C, Yeung D S. Mining pinyin-to-character conversion rules from large-scale corpus: a rough set approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2004, 34(2): 834-844
16 Ney H, Essen U, Kneser R. On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 1994, 8(1): 1-38
17 Wang Xuan, Wang Xiao-Long, Zhang Kai. Language model for speech recognition applications. Acta Automatica Sinica, 1999, 25(3): 309-315
18 Roark B. Probabilistic top-down parsing and language modeling. Computational Linguistics, 2001, 27(2): 249-276
19 Yang S H, Zhao H, Lu B L. A machine translation approach for Chinese whole-sentence pinyin-to-character conversion. In: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation. Bali, Indonesia: Universitas Indonesia, 2012. 333-342
20 Wen J, Wang X J, Xu W Z, Jiang H X. Ambiguity solution of pinyin segmentation in continuous pinyin-to-character conversion. In: Proceedings of the 2008 International Conference on Natural Language Processing and Knowledge Engineering. Beijing, China: IEEE, 2008. 1-7
21 Chen Z, Lee K F. A new statistical approach to Chinese pinyin input. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. Hong Kong: Association for Computational Linguistics, 2000. 241-247
22 Zheng Y B, Li C, Sun M S. CHIME: an efficient error-tolerant Chinese pinyin input method. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain: AAAI Press, 2011. 2551-2556
23 Collins M. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Philadelphia, PA, USA: Association for Computational Linguistics, 2002. 1-8
24 Li X X, Wang X, Yao L. Joint decoding for Chinese word segmentation and POS tagging using character-based and word-based discriminative models. In: Proceedings of the 2011 International Conference on Asian Language Processing (IALP). Washington, DC, USA: IEEE, 2011. 11-14
25 Ng H T, Low J K. Chinese part-of-speech tagging: one-at-a-time or all at once? word-based or character-based? In: Proceedings of the 2004 EMNLP. Barcelona, Spain: Association for Computational Linguistics, 2004. 277-284
26 Zhang Y, Clark S. Joint word segmentation and POS tagging using a single perceptron. In: Proceedings of ACL-08: HLT. Columbus, Ohio: Association for Computational Linguistics, 2008.
888-896
27 Zhang Y, Nivre J. Transition-based dependency parsing with rich non-local features. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, 2011. 188-193
28 Liu Di, Sun Dong-Mei, Qiu Zheng-Ding. Feature level fusion based on speaker verification via relation measurement fusion framework. Acta Automatica Sinica, 2011, 37(12): 1503-1513
29 Och F J. Minimum error rate training in statistical machine translation. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan: Association for Computational Linguistics, 2003. 160-167
30 Jiang W B, Huang L, Liu Q, Lü Y J. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In: Proceedings of ACL-08: HLT. Columbus, Ohio: Association for Computational Linguistics, 2008. 897-904
31 Zaidan O. Z-MERT: a fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 2009, 91(1): 79-88
32 Stolcke A. SRILM - an extensible language modeling toolkit. In: Proceedings of the 2002 International Conference on Spoken Language Processing. Denver, Colorado: IEEE, 2002. 901-904
33 Liu W, Guthrie L. Chinese pinyin-text conversion on segmented text. In: Proceedings of the 12th International Conference on Text, Speech and Dialogue. Berlin, Heidelberg: Springer-Verlag, 2009. 116-123
34 Zhou X H, Hu X H, Zhang X D, Shen X J. A segment-based hidden Markov model for real-setting pinyin-to-Chinese conversion. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM 2007). New York, NY, USA: ACM Press, 2007. 1027-1030
35 Zhang Sen. Solving the pinyin-to-Chinese-character conversion problem based on hybrid word lattice.
Chinese Journal of Computers, 2007, 30(7): 1145-1153

LI Xin-Xin  Ph.D. candidate at Harbin Institute of Technology Shenzhen Graduate School. His research interest covers natural language processing and network information processing. Corresponding author of this paper. E-mail: lixxin2@gmail.com

WANG Xuan  Professor at Harbin Institute of Technology Shenzhen Graduate School. His research interest covers artificial intelligence and network multimedia information processing. E-mail: wangxuan@insun.hit.edu.cn

YAO Lin  Lecturer at Harbin Institute of Technology. Her research interest covers network information processing and biology information processing. E-mail: yaolin@hit.edu.cn

GUAN Jian  Ph.D. candidate at Harbin Institute of Technology Shenzhen Graduate School. His research interest covers artificial intelligence and speech recognition. E-mail: guanjian2000@gmail.com