
ACTA AUTOMATICA SINICA, Vol. 40, No. 4, April 2014

Chinese Pinyin-to-character Conversion Based on Cascaded Reranking

LI Xin-Xin 1, 2   WANG Xuan 1, 2   YAO Lin 1, 3   GUAN Jian 1, 3

Abstract  The word n-gram language model is the most common approach to Chinese pinyin-to-character conversion. It is simple, efficient, and widely used in practice. However, in the decoding phase of the word n-gram model, the choice of a word depends only on the words before it, so the model lacks long-distance grammatical and syntactic constraints. In this paper, we propose two reranking approaches to solve this problem. The linear reranking approach uses the minimum error learning method to combine different sub-models, which include word and character n-gram language models, a part-of-speech tagging model, and a dependency model. The averaged perceptron reranking approach reranks the candidates generated by the word n-gram model using features extracted from the word sequence, the part-of-speech tags, and the dependency tree. Experimental results on the Lancaster Corpus of Mandarin Chinese and People's Daily show that both reranking approaches use syntactic information effectively and outperform the word n-gram model. The perceptron reranking approach that takes the probability output of the linear reranking approach as its initial weight achieves the best performance.

Key words  Chinese pinyin-to-character conversion, reranking approach, minimum error learning, averaged perceptron

Citation  Li Xin-Xin, Wang Xuan, Yao Lin, Guan Jian. Chinese pinyin-to-character conversion based on cascaded reranking. Acta Automatica Sinica, 2014, 40(4): 624-634

DOI  10.3724/SP.J.1004.2014.00624
Manuscript received April 22, 2013; accepted September 22, 2013
Supported by Key Science and Technology Projects of the Ministry of National Science and Technology (2011ZX03002-004-01) and Shenzhen Basic Research Key Project (JC201104210032A, JC201005260112A)
Recommended by Associate Editor DANG Jian-Wu
1. Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055
2. Shenzhen Applied Technology Engineering Laboratory for Internet Multimedia Application, Shenzhen 518055
3. Public Service Platform of Mobile Internet Application Security Industry, Shenzhen 518057

Mandarin has about 410 syllables when tone is ignored, whereas about 7 000 Chinese characters are in general use, some 3 500 of them common. A single pinyin syllable is therefore highly ambiguous: around 60 characters share the pinyin yi. The word N-gram model is the usual way to resolve this ambiguity. It relies on smoothing to cope with sparse data [1] and has difficulty with out-of-vocabulary (OOV) words [1-2]; adaptive methods have been proposed to address these limits [3-6].

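Prior work on pinyin-to-character conversion covers improved N-gram language models [7-9], machine-learning models [10-14], and rule-based approaches [15]. The experiments in this paper use the LCMC (Lancaster Corpus of Mandarin Chinese).

1 Related work

Work on the N-gram model itself includes faster and more compact N-gram language models [7-8], variable N-grams [9], and class-based N-grams [2]. Sogou Pinyin (http://pinyin.sogou.com) is an example of a widely used input method. Ney et al. study how to structure probabilistic dependences in N-gram language models [16], and other work brings syntactic structure into the language model [17-18]. Conversion rules can also be mined from a large corpus with rough sets [15]. Machine-learning approaches to the task include maximum entropy models [10-11], a class-based maximum entropy Markov model [12], support vector machines [13], conditional random fields [14], and machine translation [19]. Related problems are the segmentation of continuous pinyin input and typing errors [20-21] and error-tolerant input [22].

2 Word N-gram model

Let $S = s_1, s_2, \dots, s_n$ be the input pinyin sequence, $C = c_1, c_2, \dots, c_n$ the corresponding character sequence, and $W = w_1, w_2, \dots, w_m$ ($m \le n$) the word sequence. As Fig. 1 shows, a word $w_k$ consists of the characters $c_i, \dots, c_j$, whose pinyin is $(s_i, \dots, s_j)$; the indices $i$ and $j$ depend on $k$. Given $S$, the task is to find the most probable word sequence:

$$W^{*} = \arg\max_{W} P(W \mid S) \tag{1}$$

By Bayes' rule, the word N-gram model rewrites this as

$$W^{*} = \arg\max_{W} \frac{P(S \mid W)\,P(W)}{P(S)} \tag{2}$$

Because $P(S)$ is the same for every candidate $W$, it does not affect the maximization.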
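The maximization in Eq. (2) can be illustrated with a small beam search over word segmentations. The sketch below is not the authors' implementation: the lexicon, the bigram table, the back-off constant `FLOOR`, and the function name `decode` are hypothetical, and a bigram model stands in for the general N-gram model.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon: pinyin tuple -> {word: p(pinyin | word)}.
LEXICON = {
    ("zhong",): {"中": 1.0, "种": 1.0},
    ("guo",): {"国": 1.0, "过": 1.0},
    ("zhong", "guo"): {"中国": 1.0},
    ("ren",): {"人": 1.0},
}
# Hypothetical bigram probabilities p(word | previous word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "中国"): 0.4, ("<s>", "中"): 0.1, ("<s>", "种"): 0.05,
    ("中", "国"): 0.3, ("中", "过"): 0.05, ("种", "过"): 0.2, ("种", "国"): 0.01,
    ("中国", "人"): 0.5, ("国", "人"): 0.2, ("过", "人"): 0.1,
}
FLOOR = 1e-6  # crude stand-in for smoothing of unseen bigrams (assumption)


def decode(pinyin, beam=3, max_span=4):
    """Return the best word sequence for a list of pinyin syllables."""
    # beams[j] holds (log score, word list) hypotheses that cover pinyin[:j].
    beams = defaultdict(list)
    beams[0] = [(0.0, ["<s>"])]
    for j in range(1, len(pinyin) + 1):
        candidates = []
        for i in range(max(0, j - max_span), j):
            span = tuple(pinyin[i:j])
            for word, p_s_given_w in LEXICON.get(span, {}).items():
                for score, words in beams[i]:
                    p_lm = BIGRAM.get((words[-1], word), FLOOR)
                    new_score = score + math.log(p_s_given_w) + math.log(p_lm)
                    candidates.append((new_score, words + [word]))
        # Keep only the `beam` best hypotheses ending at position j.
        beams[j] = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam]
    best = beams[len(pinyin)]
    return best[0][1][1:] if best else []


print(decode(["zhong", "guo", "ren"]))
```

Scores are summed in log space, which is equivalent to multiplying $P(S \mid W)$ and $P(W)$; $P(S)$ is dropped because it is constant across candidates.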

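The two factors in Eq. (2) decompose over the words of $W$:

$$P(S \mid W) = \prod_{k=1}^{m} p\big((s_i, \dots, s_j) \mid w_k\big) \tag{3}$$

$$P(W) = \prod_{k=1}^{m} p(w_k \mid w_1, \dots, w_{k-1}) \tag{4}$$

Fig. 1 An example of Chinese pinyin-to-character conversion

For a trigram model, $P(W) = p(w_1)\,p(w_2 \mid w_1) \prod_{k=3}^{m} p(w_k \mid w_{k-2}, w_{k-1})$.

The character N-gram model is defined in the same way over the character sequence $C$:

$$C^{*} = \arg\max_{C} P(C \mid S) = \arg\max_{C} \frac{P(S \mid C)\,P(C)}{P(S)} \tag{5}$$

where $P(S \mid C)$ and $P(C)$ are computed like $P(S \mid W)$ and $P(W)$.

Both N-gram models are decoded with a beam search that keeps the $k$ best partial results, scored with $P(S \mid W)$ or $P(S \mid C)$ and the language model. At position $j$ of the pinyin sequence, the decoder considers the spans $s_i, \dots, s_j$ that start no earlier than $i = \max(0, j - 20)$ and retains the $k$ best hypotheses ending at $j$.

3 Sub-models

In the word N-gram model a word is determined only by the words before it. The sub-models in this section supply further information for reranking the candidates that the word N-gram model produces.

3.1 Character-based model

The character-based model is a discriminative model [23] whose feature templates are listed in Table 1, where $s_0$ is the current pinyin and $s_n$ ($s_{-n}$) is the $n$-th pinyin to its right (left).

Table 1 Feature templates for the character-based model

1   $s_n$ ($n = -2, \dots, 2$)
2   $s_n s_{n+1}$ ($n = -1, 0$)
3   $s_{-1} s_1$

For a pinyin sequence $S$, the model outputs

$$C^{*} = \arg\max_{C \in \mathrm{GEN}(S)} P_{\mathrm{char}}(C \mid S) = \arg\max_{C \in \mathrm{GEN}(S)} \Phi(S, C) \cdot \bar{\alpha} \tag{6}$$

where $P_{\mathrm{char}}(C \mid S)$ is the score of $C$, $\Phi(S, C)$ is the feature vector, and $\bar{\alpha}$ is the weight vector, trained as in [23-24].

3.2 Pinyin-word co-occurrence model

The co-occurrence model relates a pinyin sequence $S$ and a word sequence $W$ in both directions:

$$P_{\mathrm{occur}}(W \mid S) = \prod_{k=1}^{m} p\big(w_k \mid (s_i, \dots, s_j)\big) \tag{7}$$

$$P_{\mathrm{occur}}(S \mid W) = \prod_{k=1}^{m} p\big((s_i, \dots, s_j) \mid w_k\big) \tag{8}$$

Here $p(w_k \mid (s_i, \dots, s_j))$ and $p((s_i, \dots, s_j) \mid w_k)$ are estimated from how often the word $w_k$ and the pinyin $(s_i, \dots, s_j)$ occur together, with add-one smoothing:

$$p\big(w_k \mid (s_i, \dots, s_j)\big) = \frac{N\big(w_k, (s_i, \dots, s_j)\big) + 1}{N(s_i, \dots, s_j) + N_{S_{j-i+1}}} \tag{9}$$

$$p\big((s_i, \dots, s_j) \mid w_k\big) = \frac{N\big((s_i, \dots, s_j), w_k\big) + 1}{N(w_k) + N_{W_{j-i+1}}} \tag{10}$$

where $N(\cdot)$ denotes a count in the training data, and $N_{S_{j-i+1}}$ and $N_{W_{j-i+1}}$ are the numbers of pinyin sequences and of words of length $j - i + 1$, respectively.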
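The add-one smoothed estimates of Eqs. (9) and (10) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `build_cooccurrence` and the toy training pairs are hypothetical, and the normalizers are read as the numbers of distinct pinyin sequences and distinct words of the given length.

```python
from collections import Counter


def build_cooccurrence(pairs):
    """Build smoothed estimators from (pinyin_tuple, word) training pairs."""
    pairs = list(pairs)
    pair_count = Counter(pairs)                    # N(w_k, (s_i..s_j))
    pinyin_count = Counter(p for p, _ in pairs)    # N(s_i..s_j)
    word_count = Counter(w for _, w in pairs)      # N(w_k)
    # Assumed reading of the normalizers: distinct items per length.
    n_pinyin_by_len = Counter(len(p) for p in pinyin_count)
    n_word_by_len = Counter(len(w) for w in word_count)

    def p_word_given_pinyin(word, pinyin):
        # Eq. (9): add one to the joint count; unseen pairs get a small mass.
        return (pair_count[(pinyin, word)] + 1) / (
            pinyin_count[pinyin] + n_pinyin_by_len[len(pinyin)])

    def p_pinyin_given_word(pinyin, word):
        # Eq. (10): the same smoothing in the opposite direction.
        return (pair_count[(pinyin, word)] + 1) / (
            word_count[word] + n_word_by_len[len(pinyin)])

    return p_word_given_pinyin, p_pinyin_given_word


toy_pairs = [(("yi",), "一"), (("yi",), "一"), (("yi",), "以"), (("shi",), "是")]
p_w_s, p_s_w = build_cooccurrence(toy_pairs)
print(p_w_s("一", ("yi",)), p_w_s("意", ("yi",)), p_s_w(("yi",), "一"))
```

The sketch assumes that every queried length was seen in training; otherwise the denominator can be zero and a guard would be needed.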

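3.3 Part-of-speech tagging model

Given a word sequence $W$, the part-of-speech (POS) tagging model assigns a tag sequence $T$ [25]:

$$T^{*} = \arg\max_{T \in \mathrm{GEN}(W)} P(T \mid W) = \arg\max_{T \in \mathrm{GEN}(W)} \Phi(T, W) \cdot \bar{\alpha} \tag{11}$$

where $P(T \mid W)$ is the score of the pair $(W, T)$. The feature templates follow [23-26] and are listed in Table 2.

Table 2 Feature templates for part-of-speech tagging

1    $w_{-2} t_0$                            $\mathrm{end}(w_{-1})\, w_0\, \mathrm{start}(w_1)\, t_0$, when $\mathrm{len}(w_0) = 1$
2    $w_{-1} t_0$
3    $w_0 t_0$                               $\mathrm{start}(w_0)\, t_0$
4    $w_1 t_0$                               $\mathrm{end}(w_0)\, t_0$
5    $w_2 t_0$                               $c_n t_0$ ($n = 1, \dots, \mathrm{len}(w_0) - 2$)
6    $t_{-1} t_0$                            $\mathrm{start}(w_0)\, c_n t_0$ ($n$ as above)
7    $t_{-2} t_{-1} t_0$                     $\mathrm{end}(w_0)\, c_n t_0$ ($n$ as above)
8    $t_{-1} w_0$                            $c_n c_{n+1} t_0$ ($c_n = c_{n+1}$)
9    $w_0 t_0\, \mathrm{end}(w_{-1})$        $\mathrm{class}(\mathrm{start}(w_0))\, t_0$
10   $w_0 t_0\, \mathrm{start}(w_1)$         $\mathrm{class}(\mathrm{end}(w_0))\, t_0$

The model is trained on Chinese Treebank 5 (CTB5) with the data split of [26] and reaches an $F_1$ of 95.26 %.

3.4 Part-of-speech N-gram model

Like the word N-gram model, the POS N-gram model scores a tag sequence $T$ as

$$P(T) = \prod_{i=1}^{m} p(t_i \mid t_1, \dots, t_{i-1}) \tag{12}$$

where $p(t_i \mid t_1, \dots, t_{i-1})$ is estimated in the same way as $p(w_k \mid w_1, \dots, w_{k-1})$.

3.5 Word-POS co-occurrence model

For a word sequence $W$ and its tag sequence $T$, the co-occurrence probabilities are

$$P_{\mathrm{occur}}(W \mid T) = \prod_{i=1}^{m} p(w_i \mid t_i) \tag{13}$$

$$P_{\mathrm{occur}}(T \mid W) = \prod_{i=1}^{m} p(t_i \mid w_i) \tag{14}$$

where $p(w_i \mid t_i)$ and $p(t_i \mid w_i)$ are estimated from the co-occurrence of words and tags.

3.6 Dependency model

Given a tagged word sequence $(W, T)$, the dependency tree $D$ is produced by the parser of [27]. Trained on CTB5 as in [27], the parser has an accuracy of 86 %. Fig. 2 shows an example.

Fig. 2 An example of a dependency tree

The score of a dependency tree is

$$P(D \mid W, T) = \sum_{i} w_i\, f_i(D \mid W, T) \tag{15}$$

where the $f_i(D \mid W, T)$ are the features of the parser and the $w_i$ their weights.

4 Linear reranking model

The linear reranking model fuses the sub-models of Section 3 [28]. For an input pinyin sequence $S$, a beam search with the word N-gram model first produces the $k$ best candidate word sequences $W$.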

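The weights of the sub-models are trained with the minimum error training method (MERT) [29], as in the cascaded linear model of [30]. For a pinyin sequence $S$ and a candidate $W$ generated by the word N-gram model $P(W)$, the combined score is

$$P_{\mathrm{mert}}(W \mid S) = P_{\mathrm{mert}}(W, C, T, D \mid S) = \sum_{i=0}^{k} w_i\, P_{\mathrm{sub}_i}(W, C, T, D \mid S)$$
$$= w_0 P(W) + w_1 P(C) + w_2 P_{\mathrm{char}}(C \mid S) + w_3 P_{\mathrm{occur}}(W \mid S) + w_4 P_{\mathrm{occur}}(S \mid W) + w_5 P(T \mid W) + w_6 P(T) + w_7 P_{\mathrm{occur}}(W \mid T) + w_8 P_{\mathrm{occur}}(T \mid W) + w_9 P(D \mid W, T) \tag{16}$$

where $\sum_{i=0}^{9} w_i = 1$. The sub-models $P_{\mathrm{sub}}(W, C, T, D \mid S)$ are the word N-gram model $P(W)$, the character N-gram model $P(C)$, the character-based model $P_{\mathrm{char}}(C \mid S)$, the pinyin-word co-occurrence models $P_{\mathrm{occur}}(W \mid S)$ and $P_{\mathrm{occur}}(S \mid W)$, the POS tagging model $P(T \mid W)$, the POS N-gram model $P(T)$, the word-POS co-occurrence models $P_{\mathrm{occur}}(W \mid T)$ and $P_{\mathrm{occur}}(T \mid W)$, and the dependency model $P(D \mid W, T)$.

MERT tunes one weight $w_j$ ($0 \le j \le k$) at a time while the other weights stay fixed:

$$P_{\mathrm{mert}}(W \mid S) = w_j P_j(W \mid S) + \sum_{i \ne j} w_i P_i(W \mid S) \tag{17}$$

The first term, $w_j P_j(W \mid S)$, changes with $w_j$, whereas the second term, $\sum_{i \ne j} w_i P_i(W \mid S)$, is constant during the search for $w_j$. Details of the procedure are given in [29-31].

5 Averaged perceptron reranking model

The perceptron reranking model uses features extracted from the word sequence $W$, the POS tags $T$, and the dependency tree $D$. For a pinyin sequence $S$ it selects

$$W^{*} = \arg\max_{W \in \mathrm{GEN}(S)} P_{\mathrm{rerank}}(W \mid S) = \arg\max_{W \in \mathrm{GEN}(S)} \big(P_{\mathrm{init}}(W \mid S) + \Phi(W, S) \cdot \bar{\alpha}\big) \tag{18}$$

where $\mathrm{GEN}(S)$ is the set of candidates for $S$ and $P_{\mathrm{init}}(W \mid S)$ is the initial score of $W$, which is either the word N-gram probability $P(W)$ or the linear reranking score $P_{\mathrm{mert}}(W \mid S)$ of Section 4. The feature sets are described in Section 6.

In training, for each $S$ with $k$ candidates $D_i$ ($i = 1, \dots, k$), the weight vector $\bar{\alpha}$ is updated as

$$\bar{\alpha} = \bar{\alpha} + \Phi(\bar{\alpha}, D_g) - \Phi(\bar{\alpha}, D_p) \tag{19}$$

where $D_g$ is the candidate among the $D_i$ that is closest to the correct conversion of $S$ and $D_p$ is the candidate that the current model ranks first.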
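The scoring rule of Eq. (18) and the update of Eq. (19) can be sketched with a small averaged perceptron. The code is an illustration under assumptions rather than the authors' system: the function names, the toy candidates, and the word unigram and bigram features are hypothetical and far smaller than the feature sets used in the paper.

```python
from collections import defaultdict


def features(words):
    """Word unigram and bigram counts (an illustrative subset of features)."""
    feats = defaultdict(float)
    prev = "<s>"
    for w in words:
        feats[("w0", w)] += 1.0
        feats[("w-1w0", prev, w)] += 1.0
        prev = w
    return feats


def score(words, init, alpha):
    # Eq. (18): initial score plus the dot product of features and weights.
    return init + sum(alpha.get(f, 0.0) * v for f, v in features(words).items())


def best_index(candidates, alpha):
    return max(range(len(candidates)),
               key=lambda i: score(candidates[i][0], candidates[i][1], alpha))


def train_reranker(data, epochs=5):
    """data: list of (candidates, gold_index); a candidate is (words, init_score)."""
    alpha = defaultdict(float)
    total = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for candidates, gold in data:
            pred = best_index(candidates, alpha)
            if pred != gold:
                # Eq. (19): move toward the gold candidate, away from the prediction.
                for f, v in features(candidates[gold][0]).items():
                    alpha[f] += v
                for f, v in features(candidates[pred][0]).items():
                    alpha[f] -= v
            steps += 1
            for f, v in alpha.items():
                total[f] += v
    # Averaging the weights over all steps reduces overfitting to late updates.
    return {f: v / steps for f, v in total.items()}


toy_data = [([(["中", "国"], -1.0), (["中国"], -2.0)], 1)]
weights = train_reranker(toy_data)
print(best_index(toy_data[0][0], weights))
```

With empty weights the candidate with the higher initial score wins; after training, the learned features outweigh the initial score and the gold candidate is ranked first.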

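6 Experiments

6.1 Data

The first data set is the Lancaster Corpus of Mandarin Chinese (LCMC, http://www.lancs.ac.uk/fass/projects/corpus/lcmc/), which covers 15 text categories. The People's Daily data (http://www.uniml.com/nlp/py2char/pddata.tar.gz) consist of 50 000 sentences: 40 000 for training and 5 000 each for development and test. LCMC contains 45 735 sentences, of which 39 735 are used for training. The pinyin of each sentence is generated with pinyin4j (http://pinyin4j.sourceforge.net) and then corrected with the help of Sogou; the resulting pinyin annotation is 99.7 % accurate. Sentences are split at punctuation marks. Table 3 gives the statistics of the LCMC data (http://www.uniml.com/nlp/py2char/lcmcdata.tar.gz).

Table 3 Statistics of training, development, and test data

                 Training     Development   Test
                 112 859      8 235         8 623
                 875 397      62 961        63 608
                 723 424      52 039        52 263
                 1 144 559    82 527        82 788
OOV              13 692       1 043         1 245
OOV rate (%)     1.56         1.66          1.96

6.2 N-gram language models

The word N-gram and character N-gram models are trained on text from 1998, 2006, 2007, 2009, and 2012. The vocabulary of the word N-gram model contains 130 750 words, assembled from 7 000 single characters, a list of 56 064 words, and 94 412 words from the Google Chinese 5-gram corpus (http://www.ldc.upenn.edu/catalog/catalogentry.jsp?catalogid=ldc2010t06). The vocabulary of the character N-gram model contains 7 000 characters. Both models are trained with SRILM [32] using Kneser-Ney smoothing.

Performance is measured by the character error rate (CER):

$$\mathrm{CER} = \frac{\text{number of incorrectly converted characters}}{\text{total number of characters}} \times 100\,\% \tag{20}$$

Besides CER, the in-vocabulary word error rate (IVWER) and the out-of-vocabulary word error rate (OOVWER) are reported, so that errors on IV and OOV words can be examined separately.

Fig. 3 plots the oracle of the $k$-best candidates of the word N-gram model. With $k = 1$ the oracle CER equals the CER of the word N-gram model; as $k$ grows, the oracle CER falls. The experiments use $k = 100$.

Table 4 reports the character N-gram model, the word N-gram model, the oracle, and the $k$-best oracle on the development data.

Table 4 Performance of different LMs on development data (%)

                     CER      IVWER    OOVWER
Character N-gram     12.92    11.25    15.04
Word N-gram          11.27    9.03     14.23
Oracle               7.38     6.89     11.58
k-best oracle        2.01     4.01     5.94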
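The character error rate can be computed position by position, because pinyin-to-character conversion produces exactly one character per input syllable. The sketch below assumes that reading of CER (wrong characters divided by all characters); the function name and the toy sentences are hypothetical.

```python
def character_error_rate(predicted, reference):
    """CER in percent over parallel lists of converted and reference sentences."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of sentences")
    wrong = 0
    total = 0
    for pred, ref in zip(predicted, reference):
        if len(pred) != len(ref):
            # One character per syllable, so lengths must agree.
            raise ValueError("sentence lengths differ")
        wrong += sum(p != r for p, r in zip(pred, ref))
        total += len(ref)
    return 100.0 * wrong / total


print(character_error_rate(["中国人", "他门"], ["中国人", "他们"]))
```

An oracle CER over a k-best list can be obtained with the same function by first picking, for each sentence, the candidate with the fewest mismatches.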

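Fig. 3 Oracle of different k-best candidates on development data

The word N-gram model is better than the character N-gram model on both IV and OOV words, and the $k$-best oracle of the word N-gram model is far below its 1-best CER (Table 4), which leaves room for reranking.

6.3 Linear reranking

For each pinyin sequence $S$, the word N-gram model supplies the 100 best candidates $(W, D)$. The sub-model scores $P_{\mathrm{sub}}(W, C, T, D \mid S)$ are computed for every candidate, and the weights are tuned to minimize CER on the development data. Table 5 lists the results, where A is the full linear reranking model and A/i is A without the i-th sub-model. Fig. 4 shows the performance of the sub-models, indexed 0 to 8 as in Table 5.

Table 5 Experimental results of linear reranking model on development data

Model                  CER (%)
A                      9.76
A/0 (N-gram model)     10.47
A/1 (N-gram model)     10.10
A/2                    9.83
A/3                    10.06
A/4                    9.84
A/5                    9.83
A/6                    9.76
A/7 (N-gram model)     9.78
A/8                    9.82

Fig. 4 Performance of sub models on development data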
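The linear reranking model scores each candidate by a weighted sum of sub-model scores and tunes one weight at a time on development data. The sketch below illustrates that idea with a coarse grid search per weight; it is a simplification and not Och's exact line search, and the function names, the grid, and the toy data are hypothetical.

```python
def pick(candidates, weights):
    """Return the text of the candidate with the highest weighted score."""
    return max(candidates,
               key=lambda c: sum(w * s for w, s in zip(weights, c[1])))[0]


def dev_error(dev, weights):
    """Fraction of wrong characters when reranking the development set."""
    wrong = total = 0
    for candidates, gold in dev:
        best = pick(candidates, weights)
        wrong += sum(a != b for a, b in zip(best, gold))
        total += len(gold)
    return wrong / total


def tune_weights(dev, n_models, grid=(0.0, 0.25, 0.5, 0.75, 1.0), rounds=2):
    """Coordinate-wise search: vary one weight while the others stay fixed."""
    weights = [1.0] * n_models
    for _ in range(rounds):
        for j in range(n_models):
            best_w, best_err = weights[j], dev_error(dev, weights)
            for w in grid:
                trial = weights[:j] + [w] + weights[j + 1:]
                err = dev_error(dev, trial)
                if err < best_err:
                    best_w, best_err = w, err
            weights[j] = best_w
    return weights


# Each candidate is (text, [score of sub-model 0, score of sub-model 1]).
toy_dev = [([("ab", [-1.0, -3.0]), ("ac", [-3.0, -2.0])], "ac")]
tuned = tune_weights(toy_dev, 2)
print(tuned, pick(toy_dev[0][0], tuned))
```

Rescaling the tuned weights so that they sum to 1 does not change the ranking, provided the sum is positive.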

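6.4 Perceptron reranking

The perceptron reranking model draws its features from the candidate word sequence of $S$, its POS tags, and its dependency tree $D$, as listed in Table 6. In the flat feature sets W, T1, and T2, $w_0$ and $t_0$ are the current word and its POS tag, and $w_{\pm i}$ is the $i$-th word to the right ($+i$) or left ($-i$) of the current word. The dependency feature sets D1, D2, and D3 are built from the nodes G, F, S, L, and R of the dependency tree. In these templates $d$ is a binary (1 or 0) indicator and $A$ a second attribute of the arc named by its subscript, so $d_{FS}$ and $A_{FS}$ refer to the arc between F and S; the subscripts $t$ and $w$ select the POS tag and the word of a node, so $F_t$ is the tag of F and $S_w$ the word of S.

Table 6 Features for the reranking model

W    $w_{-2} w_{-1} w_0$, $w_{-1} w_0$, $w_0$
T1   $t_{-2} t_{-1} t_0$, $t_{-1} t_0$, $t_0$, $t_0 w_0$
T2   $t_{-1} t_0 w_0$, $w_{-1} t_{-1} t_0 w_0$, $t_{-2} w_{-1} t_{-1} t_0 w_0$, $w_{-2} t_{-2} w_{-1} t_{-1} t_0 w_0$
D1   $F_t\, d_{FS} A_{FS}\, S_t$, $F_w\, d_{FS} A_{FS}\, S_w S_t$,
     $F_w F_t\, d_{FS} A_{FS}\, S_t$, $F_w F_t\, d_{FS} A_{FS}\, S_w S_t$
D2   $G_t\, d_{GF} A_{GF}\, F_t\, d_{FS} A_{FS}\, S_t$,
     $G_t\, d_{GF} A_{GF}\, F_t\, d_{FS} A_{FS}\, S_w S_t$,
     $G_t\, d_{GF} A_{GF}\, F_w F_t\, d_{FS} A_{FS}\, S_t$,
     $G_w G_t\, d_{GF} A_{GF}\, F_t\, d_{FS} A_{FS}\, S_t$,
     $G_t\, d_{GF} A_{GF}\, F_w F_t\, d_{FS} A_{FS}\, S_w S_t$,
     $G_w G_t\, d_{GF} A_{GF}\, F_w F_t\, d_{FS} A_{FS}\, S_t$,
     $G_w G_t\, d_{GF} A_{GF}\, F_w F_t\, d_{FS} A_{FS}\, S_w S_t$
D3   $L_t\, d_{FL} A_{FL}\, R_t\, d_{FR} A_{FR}\, F_t$,
     $L_w L_t\, d_{FL} A_{FL}\, R_w R_t\, d_{FR} A_{FR}\, F_t$,
     $L_t\, d_{FL} A_{FL}\, R_t\, d_{FR} A_{FR}\, F_w F_t$,
     $L_w L_t\, d_{FL} A_{FL}\, R_w R_t\, d_{FR} A_{FR}\, F_w F_t$

Fig. 5 shows the results with the flat feature sets for the two choices of the initial score $P_{\mathrm{init}}(W \mid S)$: the word N-gram probability $P(W)$ and the linear reranking score $P_{\mathrm{mert}}(W \mid S)$.

Fig. 5 Results of reranking models with different initial weights and flat feature sets on development data

With $P(W)$ as the initial score, the dependency feature sets are added to W, giving W, D1 and W, D1, D3 among others. With $P_{\mathrm{mert}}(W \mid S)$ as the initial score, they are added to W, T1. Initializing with $P_{\mathrm{mert}}(W \mid S)$ gives better results than initializing with $P(W)$. Table 7 lists the results with dependency features; the best model lowers the CER of the word N-gram model by 1.88 percentage points.

6.5 Results on test data

Table 8 compares the approaches on the test data. Linear reranking reaches a CER of 10.96 %, which is 1.07 percentage points, or 8.89 % relative, below the word N-gram model. The table also lists the results of [33-34].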

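Table 7 Results of reranking models with different initial weights and dependency feature sets on development data

Initial score                      Feature sets        CER (%)
$P(W)$                             W                   9.95
$P(W)$                             W, D1               11.26
$P(W)$                             W, D1, D2           11.37
$P(W)$                             W, D1, D3           11.46
$P_{\mathrm{mert}}(W \mid S)$      W, T1               9.50
$P_{\mathrm{mert}}(W \mid S)$      W, T1, D1           9.40
$P_{\mathrm{mert}}(W \mid S)$      W, T1, D1, D2       9.40
$P_{\mathrm{mert}}(W \mid S)$      W, T1, D1, D3       9.39

Fig. 6 compares the word N-gram model with the reranking models on sentences with different numbers of words.

Fig. 6 Results of different models on sentences with different word numbers

Table 8 Comparison of different approaches on test data

Approach                 CER (%)
Word N-gram              12.03
Linear reranking         10.96
Perceptron reranking     10.60
[33]                     18.49
[33-34]                  13.9

6.6 Error analysis

Table 9 breaks the results on the development data down by word length, separately for IV and OOV words.

Table 9 Error analysis of different approaches on development dataset

Word length    Word N-gram        Linear reranking    Perceptron reranking
               IVR      OOVR      IVR      OOVR       IVR      OOVR
1              81.18    0         82.62    0          84.26    0
2              90.34    44.93     92.47    46.96      92.06    47.11
3              96.34    56.98     97.84    58.91      97.05    60.07
4              99.59    78.72     99.32    78.01      99.18    78.72
All            86.13    52.45     87.86    54.08      88.45    54.56

6.7 Comparison on People's Daily

Table 10 compares the proposed approaches with previous work on the People's Daily data set.

Table 10 Comparison of different approaches on People's Daily dataset

Group      Approach                 CER (%)
           [11]                     10.86
           [12]                     5.28
           [13]                     7.06
           [14]                     11.46
           [19]                     4.45
           [35]                     7.99
LCMC       Word N-gram              5.48
           Linear reranking         4.74
           Perceptron reranking     4.39
N-gram     [3]                      4.98
           [5]                      1.52
           [6]                      4.44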

In Table 10, the previous approaches are [11-14, 19, 35], of which [19] has the lowest CER. The adaptive approaches, such as [3], are listed separately.

6.8 Complexity

The cost of reranking grows as $O(l)$, where $l$ is the length of the input sequence.

7 Conclusion

The word N-gram model decides each word from the preceding words only. This paper proposed two reranking approaches, a linear reranking model and an averaged perceptron reranking model, that add part-of-speech and dependency information. Experiments on LCMC and People's Daily show that both approaches outperform the word N-gram model.

References

1 Chen S F, Goodman J. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report, Computer Science Group, Harvard University, 1998
2 Brown P F, deSouza P V, Mercer R L, Della Pietra V J, Lai J C. Class-based n-gram models of natural language. Computational Linguistics, 1992, 18(4): 467-479
3 Huang J H, Powers D M. Adaptive compression-based approach for Chinese pinyin input. In: Proceedings of the 3rd SIGHAN Workshop on Chinese Language Processing. Barcelona, Spain: Association for Computational Linguistics, 2004. 24-27
4 Wei J, Li P X. Applying the word acquiring algorithm to the pinyin-to-character conversion. In: Proceedings of the 5th International Conference on Natural Computation. Washington, DC, USA: IEEE Computer Society, 2009. 17-21
5 Tang B Z, Wang X L, Wang X, Wang Y H. Frequency-based online adaptive n-gram models. In: Proceedings of the 2nd International Conference on Multimedia and Computational Intelligence. Wuhan, China: IEEE, 2010. 263-266
6 Huang J H, Powers D. Error-driven adaptive language modeling for Chinese pinyin-to-character conversion. In: Proceedings of the 2011 International Conference on Asian Language Processing. Penang, Malaysia: IEEE, 2011. 19-22
7 Pauls A, Klein D. Faster and smaller n-gram language models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, 2011. 258-267
8 Shan Yu-Xiang, Chen Xie, Shi Yong-Zhe, Liu Jia. Fast language model look-ahead algorithm using extended n-gram model. Acta Automatica Sinica, 2012, 38(10): 1618-1626
9 Siu M H, Ostendorf M. Variable n-grams and extensions for conversational speech language modeling. IEEE Transactions on Speech and Audio Processing, 2000, 8(1): 63-75
10 Wang X, Li L, Yao L, Anwar W. A maximum entropy approach to Chinese pinyin-to-character conversion. In: Proceedings of the 2006 IEEE International Conference on Systems, Man, and Cybernetics. Taipei, China: IEEE, 2006. 2956-2959
11 Zhao Y, Wang X L, Liu B Q, Guan Y. Research of pinyin-to-character conversion based on maximum entropy model. Journal of Electronics, 2006, 23(6): 864-869
12 Xiao J H, Liu B Q, Wang X L. Exploiting pinyin constraints in pinyin-to-character conversion task: a class-based maximum entropy Markov model approach. Computational Linguistics and Chinese Language Processing, 2007, 12(3): 325-348
13 Jiang Wei, Guan Yi, Wang Xiao-Long, Liu Bin-Quan. Pinyin-to-character conversion model based on support vector machines. Journal of Chinese Information Processing, 2007, 21(2): 100-105
14 Li L, Wang X, Wang X L, Yu Y B. A conditional random fields approach to Chinese pinyin-to-character conversion. Journal of Communication and Computer, 2009, 6(4): 25-31
15 Wang X L, Chen Q C, Yeung D S. Mining pinyin-to-character conversion rules from large-scale corpus: a rough set approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2004, 34(2): 834-844
16 Ney H, Essen U, Kneser R. On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 1994, 8(1): 1-38
17 Wang Xuan, Wang Xiao-Long, Zhang Kai. Language model for speech recognition applications. Acta Automatica Sinica, 1999, 25(3): 309-315
18 Roark B. Probabilistic top-down parsing and language modeling. Computational Linguistics, 2001, 27(2): 249-276
19 Yang S H, Zhao H, Lu B L. A machine translation approach for Chinese whole-sentence pinyin-to-character conversion. In: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation. Bali, Indonesia: Universitas Indonesia, 2012. 333-342
20 Wen J, Wang X J, Xu W Z, Jiang H X. Ambiguity solution of pinyin segmentation in continuous pinyin-to-character conversion. In: Proceedings of the 2008 International Conference on Natural Language Processing and Knowledge Engineering. Beijing, China: IEEE, 2008. 1-7
21 Chen Z, Lee K F. A new statistical approach to Chinese pinyin input. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. Hong Kong: Association for Computational Linguistics, 2000. 241-247
22 Zheng Y B, Li C, Sun M S. CHIME: an efficient error-tolerant Chinese pinyin input method. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain: AAAI Press, 2011. 2551-2556
23 Collins M. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Philadelphia, PA, USA: Association for Computational Linguistics, 2002. 1-8
24 Li X X, Wang X, Yao L. Joint decoding for Chinese word segmentation and POS tagging using character-based and word-based discriminative models. In: Proceedings of the 2011 International Conference on Asian Language Processing (IALP). Washington, DC, USA: IEEE, 2011. 11-14
25 Ng H T, Low J K. Chinese part-of-speech tagging: one-at-a-time or all at once? word-based or character-based? In: Proceedings of the 2004 EMNLP. Barcelona, Spain: Association for Computational Linguistics, 2004. 277-284
26 Zhang Y, Clark S. Joint word segmentation and POS tagging using a single perceptron. In: Proceedings of ACL-08: HLT. Columbus, Ohio: Association for Computational Linguistics, 2008. 888-896
27 Zhang Y, Nivre J. Transition-based dependency parsing with rich non-local features. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, 2011. 188-193
28 Liu Di, Sun Dong-Mei, Qiu Zheng-Ding. Feature level fusion based on speaker verification via relation measurement fusion framework. Acta Automatica Sinica, 2011, 37(12): 1503-1513
29 Och F J. Minimum error rate training in statistical machine translation. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan: Association for Computational Linguistics, 2003. 160-167
30 Jiang W B, Huang L, Liu Q, Lü Y J. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In: Proceedings of ACL-08: HLT. Columbus, Ohio: Association for Computational Linguistics, 2008. 897-904
31 Zaidan O. Z-MERT: a fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 2009, 91(1): 79-88
32 Stolcke A. SRILM - an extensible language modeling toolkit. In: Proceedings of the 2002 International Conference on Spoken Language Processing. Denver, Colorado: IEEE, 2002. 901-904
33 Liu W, Guthrie L. Chinese pinyin-text conversion on segmented text. In: Proceedings of the 12th International Conference on Text, Speech and Dialogue. Berlin, Heidelberg: Springer-Verlag, 2009. 116-123
34 Zhou X H, Hu X H, Zhang X D, Shen X J. A segment-based hidden Markov model for real-setting pinyin-to-Chinese conversion. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM 2007). New York, NY, USA: ACM Press, 2007. 1027-1030
35 Zhang Sen. Solving the pinyin-to-Chinese-character conversion problem based on hybrid word lattice. Chinese Journal of Computers, 2007, 30(7): 1145-1153

LI Xin-Xin  Ph. D. candidate at Harbin Institute of Technology Shenzhen Graduate School. His research interest covers natural language processing and network information processing. Corresponding author of this paper. E-mail: lixxin2@gmail.com

WANG Xuan  Professor at Harbin Institute of Technology Shenzhen Graduate School. His research interest covers artificial intelligence and network multimedia information processing. E-mail: wangxuan@insun.hit.edu.cn

YAO Lin  Lecturer at Harbin Institute of Technology. Her research interest covers network information processing and biology information processing. E-mail: yaolin@hit.edu.cn

GUAN Jian  Ph. D. candidate at Harbin Institute of Technology Shenzhen Graduate School. His research interest covers artificial intelligence and speech recognition. E-mail: guanjian2000@gmail.com