
Likelihood Ratio Based Discriminant Analysis for Large Vocabulary Continuous Speech Recognition

Hung-Shin Lee
Department of Computer Science and Information Engineering, National Taiwan Normal University

Berlin Chen
Department of Computer Science and Information Engineering, National Taiwan Normal University

Abstract

This paper proposes a feature transformation for automatic speech recognition (ASR), named generalized likelihood ratio discriminant analysis (GLRDA), on the basis of the likelihood ratio test. GLRDA seeks a lower-dimensional feature subspace in which the most confusing situation, described by the null hypothesis, is as unlikely to happen as possible, without assuming homoscedasticity among classes. Linear discriminant analysis (LDA) and heteroscedastic linear discriminant analysis (HLDA) can be regarded as two special cases of GLRDA.

Keywords: automatic speech recognition (ASR), feature transformation, generalized likelihood ratio discriminant analysis (GLRDA), likelihood ratio test, heteroscedasticity, null hypothesis, linear discriminant analysis (LDA), heteroscedastic linear discriminant analysis (HLDA)

1. Introduction

In automatic speech recognition (ASR), feature transformations project an n-dimensional feature vector in R^n onto a lower-dimensional space R^d (d < n) [1][2]. Such transformations can be classifier-dependent or classifier-independent: classifier-dependent criteria, such as minimum phone error (MPE) [3] and minimum classification error (MCE) [4], tie the transform to the acoustic models, while classifier-independent criteria do not. Among the latter, linear discriminant analysis (LDA) maximizes class separability measured by the squared Mahalanobis distance [5], but it implicitly assumes that all classes share a common covariance. Heteroscedastic linear discriminant analysis (HLDA) relaxes this assumption within a maximum likelihood framework [6], and heteroscedastic discriminant analysis (HDA) [7] offers another generalization of LDA and HLDA. Further extensions adopt discriminative criteria such as maximum mutual information (MMI) and MCE [8], or make use of pairwise empirical error rates [9-11]; related discriminatively trained feature methods appear in [12].

From the viewpoint of pattern recognition, this paper proposes generalized likelihood ratio discriminant analysis (GLRDA), which derives its dimensionality-reduction criterion from the likelihood ratio test (LRT). Assuming Gaussian-distributed classes but not homoscedasticity, GLRDA seeks the feature subspace in which the most confusing situation, described by the null hypothesis, is as unlikely to happen as possible.

2. Generalized Likelihood Ratio Discriminant Analysis

2.1 The Likelihood Ratio Test

In statistical hypothesis testing [13], the likelihood ratio test (LRT) weighs a null hypothesis H0 against an alternative hypothesis H1. Let Ω denote the union of the parameter spaces of the two hypotheses and ω ⊂ Ω the parameter space under H0. Given a sample S, the likelihood ratio is defined as

  LR = sup_{θ∈ω} L(θ|S) / sup_{θ∈Ω} L(θ|S),    (1)

where each supremum is obtained by maximum likelihood estimation (MLE). LR can be interpreted as a confidence measure for H0: if H0 is true, sup_{θ∈ω} L(θ|S) is close to sup_{θ∈Ω} L(θ|S) and LR approaches one; if H0 is false, LR tends to be small. LRT-based measures have been applied, for example, to the analysis of phonetic confusions [14] and to voice activity detection (VAD) [15].

GLRDA casts the search for a dimensionality-reducing transform θ ∈ R^{n×d} (d < n) as such a test: the null hypothesis H0 describes the most confusing situation, in which the classes cannot be separated in the projected space, while H1 states that the classes are distinguishable. Generalized likelihood ratio discriminant analysis (GLRDA) then looks for the θ that minimizes

  J_GLRDA(θ) = LR_GLRDA(θ) = sup L(θ|H0) / sup L(θ|H1),    (2)

i.e., the projection under which the most confusing situation is as unlikely as possible.

2.2 The Homoscedastic Case

Under homoscedasticity, all C classes are assumed to share a common covariance matrix Σ in the projected space. The two hypotheses become

  H0^homo: μ_1 = μ_2 = ... = μ_C = μ, with Σ_1 = ... = Σ_C = Σ,
  H1^homo: the class means μ_c may differ, with Σ_1 = ... = Σ_C = Σ.

If H0^homo holds, the projected samples of all classes follow one common Gaussian distribution, which is exactly the situation LDA tries to avoid. The homoscedastic GLRDA criterion is

  J_GLRDA^homo(θ) = sup L(θ|H0^homo) / sup L(θ|H1^homo).    (3)

For comparison, LDA maximizes

  J_LDA(θ) = |θ^T S_B θ| / |θ^T S_W θ|,    (4)

where S_B ∈ R^{n×n} and S_W ∈ R^{n×n} are the between-class and within-class scatter matrices [16]. Taking the logarithm of (3) gives

  log J_GLRDA^homo(θ) = sup log L(θ|H0^homo) − sup log L(θ|H1^homo).    (5)

Let x_i denote a training sample, N the total number of samples, and n_c the number of samples in class c, and let m̂_θ and Ŝ_θ (resp. m̂_{c,θ} and Ŝ_{c,θ}) be the sample mean and scatter of all projected data (resp. of the projected data of class c). The log-likelihoods under the two hypotheses are

  log L(θ|H0^homo) = log p({θ^T x_i} | μ, Σ, θ)
    = g(N,d) − (N/2) [ (m̂_θ − μ)^T Σ^{−1} (m̂_θ − μ) + trace(Σ^{−1} Ŝ_θ) + log|Σ| ],    (6)

  log L(θ|H1^homo) = log p({θ^T x_i} | {μ_c}, Σ, θ)
    = g(N,d) − (1/2) Σ_{c=1}^{C} n_c [ (m̂_{c,θ} − μ_c)^T Σ^{−1} (m̂_{c,θ} − μ_c) + trace(Σ^{−1} Ŝ_{c,θ}) + log|Σ| ],    (7)

where g(N,d) = −(Nd/2) log(2π) and d is the target dimensionality.

Maximizing (6) with respect to μ and Σ (setting the corresponding derivatives to zero) yields the estimates

  μ̂_0^homo = θ^T m̄  and  Σ̂_0^homo = θ^T S_T θ,    (8)-(9)

where m̄ is the global sample mean and S_T = (1/N) Σ_i (x_i − m̄)(x_i − m̄)^T is the total scatter matrix [16]. Substituting the estimates back into (6) gives

  sup log L(θ|H0^homo) = max_{μ,Σ} log p({θ^T x_i} | μ, Σ, θ) = g(N,d) − (N/2) log|θ^T S_T θ| − Nd/2.    (10)

In the same manner, maximizing (7) with respect to the class means gives

  μ̂_c^homo = θ^T m̄_c,    (11)

where m̄_c is the sample mean of class c.

Maximizing (7) with respect to the shared covariance then gives

  Σ̂^homo = θ^T S_W θ,    (12)

with S_W = (1/N) Σ_c Σ_{i∈c} (x_i − m̄_c)(x_i − m̄_c)^T, so that

  sup log L(θ|H1^homo) = g(N,d) − (N/2) log|θ^T S_W θ| − Nd/2.    (13)

Substituting (10) and (13) into (5),

  log J_GLRDA^homo(θ) = (N/2) log|θ^T S_W θ| − (N/2) log|θ^T S_T θ|
            = −(N/2) log( |θ^T (S_B + S_W) θ| / |θ^T S_W θ| ).    (14)

Since the logarithm is monotonically increasing, minimizing (14) is equivalent to maximizing |θ^T S_T θ| / |θ^T S_W θ|, and because S_T = S_B + S_W this coincides with the LDA criterion (4). In other words, under the homoscedastic hypothesis pair, GLRDA reduces exactly to LDA; a small numerical sketch of this special case is given below.
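To make the homoscedastic special case concrete, the following minimal sketch (our illustration, not the authors' implementation; numpy and scipy are assumed available, and S_W is assumed nonsingular) computes the projection that maximizes (4), and hence minimizes (14), by solving the generalized eigenvalue problem S_B v = λ S_W v:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, d):
    """Solve the LDA criterion (4): maximize |theta^T S_B theta| / |theta^T S_W theta|.

    X: (N, n) feature matrix, y: (N,) class labels, d: target dimensionality.
    Returns theta with shape (n, d). Assumes S_W is positive definite.
    """
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    n = X.shape[1]
    S_W = np.zeros((n, n))
    S_B = np.zeros((n, n))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - mean_all)[:, None]
        S_B += len(Xc) * (diff @ diff.T)        # between-class scatter
    # Generalized eigenproblem S_B v = lambda S_W v; top-d eigenvectors span theta.
    w, V = eigh(S_B, S_W)
    return V[:, np.argsort(w)[::-1][:d]]
```

The same θ maximizes |θ^T S_T θ| / |θ^T S_W θ|, since S_T = S_B + S_W.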

2.3 The Heteroscedastic Case

In practice the class covariances are rarely identical [17]. GLRDA therefore also admits a heteroscedastic hypothesis pair:

  H0^heter: μ_1 = ... = μ_C = μ, with each class keeping its own covariance Σ_c,
  H1^heter: each class c has its own mean μ_c and its own covariance Σ_c.

The criterion is again the log likelihood ratio

  log J_GLRDA^heter(θ) = sup log L(θ|H0^heter) − sup log L(θ|H1^heter),    (15)

with the two log-likelihoods, analogous to (6) and (7),

  log L(θ|H0^heter) = log p({θ^T x_i} | μ, {Σ_c}, θ)
    = g(N,d) − (1/2) Σ_{c=1}^{C} n_c [ (m̂_{c,θ} − μ)^T Σ_c^{−1} (m̂_{c,θ} − μ) + trace(Σ_c^{−1} Ŝ_{c,θ}) + log|Σ_c| ],    (16)

  log L(θ|H1^heter) = log p({θ^T x_i} | {μ_c}, {Σ_c}, θ)
    = g(N,d) − (1/2) Σ_{c=1}^{C} n_c [ (m̂_{c,θ} − μ_c)^T Σ_c^{−1} (m̂_{c,θ} − μ_c) + trace(Σ_c^{−1} Ŝ_{c,θ}) + log|Σ_c| ].    (17)

Under H0^heter, for a given common mean μ_0^heter, maximizing (16) with respect to each Σ_c, in the same way as in (8)-(9), yields

  Σ̂_{0,c}^heter = θ^T (S_c + B_{0,c}) θ,  with  B_{0,c} = (m̄_c − μ_0^heter)(m̄_c − μ_0^heter)^T,    (18)-(19)

where S_c is the sample covariance matrix of class c. (No closed-form MLE exists for the common mean μ_0^heter itself; practical approximations are discussed in Section 4.) Substituting (18)-(19) into (16) gives

  sup log L(θ|H0^heter) = g(N,d) − (1/2) Σ_{c=1}^{C} n_c log|θ^T (S_c + B_{0,c}) θ| − Nd/2.    (20)

Under H1^heter, the MLE of each class mean is

  μ̂_c^heter = θ^T m̄_c.    (21)

Maximizing (17) with respect to each Σ_c further gives

  Σ̂_c^heter = θ^T S_c θ,    (22)

so that

  sup log L(θ|H1^heter) = g(N,d) − (1/2) Σ_{c=1}^{C} n_c log|θ^T S_c θ| − Nd/2.    (23)

Substituting (20) and (23) into (15),

  log J_GLRDA^heter(θ) = −(1/2) Σ_{c=1}^{C} n_c [ log|θ^T (S_c + B_{0,c}) θ| − log|θ^T S_c θ| ]
    = −(1/2) Σ_{c=1}^{C} n_c log| I + (θ^T S_c θ)^{−1} θ^T B_{0,c} θ |
    = −(1/2) Σ_{c=1}^{C} n_c log( 1 + (m̄_c − μ_0^heter)^T θ (θ^T S_c θ)^{−1} θ^T (m̄_c − μ_0^heter) ),    (24)

where the last equality follows from the rank-one structure of B_{0,c}. Minimizing (24) is therefore equivalent to maximizing

  G(θ) = Σ_{c=1}^{C} (n_c/2) log( 1 + (m̄_c − μ_0^heter)^T θ (θ^T S_c θ)^{−1} θ^T (m̄_c − μ_0^heter) ),    (25)

which intuitively pushes every projected class mean away from the common mean of the null hypothesis, scaled by that class's own projected covariance. An equivalent form of (25) in terms of log|θ^T (S_c + B_{0,c}) θ| − log|θ^T S_c θ| is used for gradient-based optimization (26).
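A minimal numerical sketch of evaluating the criterion (24)-(25) follows (our illustration; the common mean mu0 is assumed to be supplied by one of the approximations discussed later, and S_c denotes the class covariance matrix):

```python
import numpy as np

def glrda_heter_objective(theta, class_stats, mu0):
    """Evaluate G(theta) of (25), i.e., -log J_GLRDA^heter(theta); larger is better.

    theta: (n, d) projection; mu0: (n,) common mean under H0^heter;
    class_stats: list of (n_c, m_c, S_c) with class size, mean, covariance.
    """
    score = 0.0
    for n_c, m_c, S_c in class_stats:
        Sc_proj = theta.T @ S_c @ theta                  # theta^T S_c theta
        diff = theta.T @ (m_c - mu0)                     # projected mean offset
        # Rank-one identity: |I + Sc_proj^{-1} diff diff^T| = 1 + diff^T Sc_proj^{-1} diff
        score += 0.5 * n_c * np.log1p(diff @ np.linalg.solve(Sc_proj, diff))
    return score
```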

The hypothesis pairings and the criteria they induce can be summarized as follows (each criterion is −log J(θ), up to additive constants):

- H0^homo vs. H1^homo: (N/2) log|θ^T S_T θ| − (N/2) log|θ^T S_W θ|, i.e., LDA, as shown in (14);
- H0^homo vs. H1^heter: (N/2) log|θ^T S_T θ| − Σ_c (n_c/2) log|θ^T S_c θ|, which turns out to be HLDA;
- H0^heter vs. H1^heter: Σ_c (n_c/2) log|θ^T (B_{0,c} + S_c) θ| − Σ_c (n_c/2) log|θ^T S_c θ|, the full GLRDA criterion of (24)-(25).

To see the connection with HLDA [6], recall the HLDA objective

  J_HLDA(θ) = −(N/2) log|θ_{(n−d)}^T S_T θ_{(n−d)}| − Σ_{c=1}^{C} (n_c/2) log|θ_d^T S_c θ_d| + N log|θ|,    (27)

where θ = [θ_d, θ_{(n−d)}] is a full-rank n×n matrix whose first d columns θ_d form the dimensionality-reducing transform [18]. For a full-rank θ, the determinant of θ^T S_T θ factors as

  log|θ^T S_T θ| = log|θ_{(n−d)}^T S_T θ_{(n−d)}| + log|θ_d^T S̃_T θ_d|,    (28)

where S̃_T = S_T − S_T θ_{(n−d)} (θ_{(n−d)}^T S_T θ_{(n−d)})^{−1} θ_{(n−d)}^T S_T is the corresponding Schur complement; moreover, since |θ^T S_T θ| = |θ|^2 |S_T|,

  N log|θ| = (N/2) log|θ^T S_T θ| − (N/2) log|S_T|.    (29)

Substituting (28) and (29) into (27) and simplifying yields

  J_HLDA(θ) = −Σ_{c=1}^{C} (n_c/2) log|θ_d^T S_c θ_d| + (N/2) log|θ_d^T S_T θ_d| − (N/2) log|S_T|
       = sup log L(θ_d|H1^heter) − sup log L(θ_d|H0^homo) − (N/2) log|S_T|.    (30)

Since (N/2) log|S_T| does not depend on θ, maximizing the HLDA objective is equivalent to minimizing the GLRDA likelihood ratio built from the homoscedastic null hypothesis H0^homo and the heteroscedastic alternative H1^heter; HLDA is thus a special case of GLRDA. Because GLRDA with H0^heter models the covariance of each class even under the null hypothesis, it can be expected to be more robust than HLDA when the classes are truly heteroscedastic.

3. Incorporating Empirical Confusion Information

The null hypothesis need not treat all classes as equally confusable. To verify this, we inspected the recognition confusions of the right-context-dependent (RCD) INITIAL/FINAL models on the MATBN Mandarin broadcast news corpus. The ten most frequently confused model pairs are listed below:

  1. in / ing  66,353
  2. an / eng  42,550
  3. i / sil  3,796
  4. u / sil  29,082
  5. sc_e / sil  26,34
  6. sc_i / sil  25,709
  7. ing / sil  2,629
  8. g_u / sil  9,97
  9. an / e  7,22
  10. sc_i / i  7,022

For example, the most confusable pair, in and ing, accounts for 66,353 confusions; such a small number of highly confusable pairs covers a large share of the recognition errors.

(Figure: an example of confusable clusters, in which classes C1-C9 are grouped into clusters G1, G2, and G3.)

To exploit such confusion information, the classes are organized into confusable clusters: each class is a vertex of a graph, and an edge connects two classes whenever they are frequently confused with each other. Every connected subgraph of this graph, found with a flood-fill algorithm [19], then forms one confusable cluster. Incorporating these clusters into GLRDA gives the confusion-information-based GLRDA (CI-GLRDA). In what follows, the set of clusters is denoted G = {G_l}, l = 1, ..., K.
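A sketch of this clustering step follows (our illustration; the pair counts and the threshold are hypothetical inputs):

```python
from collections import defaultdict, deque

def confusable_clusters(pair_counts, threshold):
    """Group classes into confusable clusters.

    pair_counts: dict mapping (class_a, class_b) -> confusion count.
    Pairs at or above `threshold` become edges; each connected subgraph,
    found by flood-fill (BFS) traversal, is one confusable cluster.
    """
    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    clusters, seen = [], set()
    for start in list(graph):
        if start in seen:
            continue
        queue, cluster = deque([start]), set()   # flood-fill from this vertex
        while queue:
            v = queue.popleft()
            if v in cluster:
                continue
            cluster.add(v)
            queue.extend(graph[v] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters
```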

CI-GLRDA replaces the global null hypothesis by a cluster-wise one:

  H0^CI: μ_i = μ_j for all classes C_i, C_j belonging to the same cluster G_l,
  H1^CI: the class means are distinct.

In analogy with (24)-(25), the CI-GLRDA criterion accumulates the null-hypothesis scatter cluster by cluster,

  G_CI(θ) = Σ_{G_l∈G} Σ_{C_i∈G_l} (n_i/2) log( 1 + (m̄_i − μ_l)^T θ (θ^T S_i θ)^{−1} θ^T (m̄_i − μ_l) ),    (31)

where μ_l is the common mean of cluster G_l, and the corresponding between-class statistic

  B_CI = Σ_{G_l∈G} Σ_{C_i∈G_l} n_i (m̄_i − μ_l)(m̄_i − μ_l)^T    (32)

replaces the global B_{0,c}. In this way the criterion concentrates on separating classes that are genuinely confusable, rather than all classes uniformly.

4. Experimental Setup

The experiments were conducted on the MATBN Mandarin broadcast news corpus [20], with held-out data used as the development set. The acoustic features are Mel-frequency cepstral coefficients (MFCCs), and the acoustic models are hidden Markov models (HMMs) over right-context-dependent (RCD) INITIAL/FINAL units plus a silence model, trained in a lightly supervised fashion [21] with the expectation-maximization (EM) algorithm. Word bigram and trigram language models were trained on Central News Agency (CNA) newswire text.

The language models were built with the SRI Language Modeling Toolkit (SRILM) [22]. With the MFCC front-end, the baseline character accuracy is 72.23%. For LDA, HLDA, HDA, and GLRDA, each input is a 162-dimensional super-vector (n = 162) obtained by concatenating 9 consecutive frames of 18 Mel-frequency filter-bank outputs, projected down to 39 dimensions (d = 39). The class labels are HMM states obtained by forced alignment. A maximum likelihood linear transformation (MLLT) [23] can additionally be applied to the projected features to decorrelate them.

For GLRDA, the objective (25) requires the null-hypothesis mean μ_0^heter, for which (18) admits no closed-form MLE. Two approximations are compared: the weighted mean

  μ_0^heter = ( Σ_{c=1}^{C} n_c S_c^{−1} )^{−1} Σ_{c=1}^{C} n_c S_c^{−1} m̄_c,    (33)

and the arithmetic mean

  μ_0^heter = (1/N) Σ_{i=1}^{N} x_i.    (34)

The same choices apply to CI-GLRDA at the cluster level.

(Table: character accuracy (%) of GLRDA using the weighted mean (33) versus the arithmetic mean (34), with and without MLLT.)

5. Experimental Results

(Table: character accuracy (%) of LDA, HLDA, HDA, heteroscedastic GLRDA, and CI-GLRDA, with and without MLLT.)

With MLLT, GLRDA is also compared against HDA, whose objective is

  J_HDA(θ) = −Σ_{c=1}^{C} n_c log|θ^T S_c θ| + N log|θ^T S_B θ|.    (35)

LDA is typically paired with MLLT, since MFCC-like features processed by LDA alone remain correlated [24-25]. Across the board, the heteroscedastic GLRDA achieves a character accuracy of 74.88%, outperforming LDA, HLDA, and HDA, and CI-GLRDA with K confusable clusters yields further gains. The results confirm that formulating discriminant analysis through the likelihood ratio test (LRT) is beneficial for large vocabulary continuous speech recognition.

6. Acknowledgments

This work was supported in part by the National Science Council, Taiwan (NSC grants, -E- -MY3 and -S- series).

References

[1] B. D. Ripley, Pattern Recognition and Neural Networks. New York: Cambridge University Press, 1996.
[2] X. Wang and K. K. Paliwal, "Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition," Pattern Recognition, vol. 36, pp. 2429-2439, 2003.

[3] D. Povey, et al., "fMPE: discriminatively trained features for speech recognition," in Proc. ICASSP, 2005.
[4] X.-B. Li, et al., "Dimensionality reduction using MCE-optimized LDA transformation," in Proc. ICASSP, 2004.
[5] R. A. Fisher, "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938.
[6] N. Kumar and A. G. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, pp. 283-297, 1998.
[7] G. Saon, et al., "Maximum likelihood discriminant feature spaces," in Proc. ICASSP, 2000.
[8] K. Demuynck, et al., "Optimal feature sub-space selection based on discriminant analysis," in Proc. Eurospeech, 1999.
[9] H.-S. Lee and B. Chen, "Linear discriminant feature extraction using weighted classification confusion information," in Proc. Interspeech, 2008.
[10] H.-S. Lee and B. Chen, "Improved linear discriminant analysis considering empirical pairwise classification error rates," in Proc. ISCSLP, 2008.
[11] H.-S. Lee and B. Chen, "Empirical error rate minimization based linear discriminant analysis," in Proc. ICASSP, 2009.
[12] X. Cui, et al., "Stereo-based stochastic mapping with discriminative training for noise robust speech recognition," in Proc. ICASSP, 2009.
[13] W. J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective. New York: Oxford University Press, 1988.
[14] Y. Liu and P. Fung, "Acoustic and phonetic confusions in accented speech recognition," in Proc. Interspeech, 2005.
[15] J. M. Górriz, et al., "Generalized LRT-based voice activity detector," IEEE Signal Processing Letters, vol. 13, 2006.
[16] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic Press, 1990.
[17] N. A. Campbell, "Canonical variate analysis with unequal covariance matrices - generalizations of the usual solution," Mathematical Geology, vol. 16, 1984.
[18] M. Sakai, et al., "Linear discriminant analysis using a generalized mean of class covariances and its application to speech recognition," IEICE Trans. Information and Systems, vol. E91-D, 2008.
[19] J. D. Foley, et al., Computer Graphics: Principles and Practice in C, 2nd ed. Addison-Wesley, 1995.
[20] H.-M. Wang, et al., "MATBN: A Mandarin Chinese broadcast news corpus," International Journal of Computational Linguistics and Chinese Language Processing, vol. 10, 2005.
[21] B. Chen, et al., "Lightly supervised and data-driven approaches to Mandarin broadcast news transcription," in Proc. ICASSP, 2004.
[22] A. Stolcke, SRI Language Modeling Toolkit (version 1.5.2).
[23] R. A. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP, 1998.
[24] L. Wood, et al., "Improved vocabulary-independent sub-word HMM modelling," in Proc. ICASSP, 1991.
[25] G. Yu, et al., "Discriminant analysis and supervised vector quantization for continuous speech recognition," in Proc. ICASSP, 1990.


Wavelet Energy-Based Support Vector Machine for Noisy Word Boundary Detection With Speech Recognition Application

Chia-Feng Juang, Chun-Nan Cheng and Chiu-Chuan Tu
Department of Electrical Engineering, National Chung-Hsing University, Taichung, 402 Taiwan, R.O.C.

Abstract

Word boundary detection in variable noise-level environments by a support vector machine (SVM) using Low-band Wavelet Energy (LWE) and Zero Crossing Rate (ZCR) features is proposed in this paper. The wavelet energy is derived from the wavelet transform; it can reduce the effect of noise in a speech signal. With the inclusion of ZCR, word boundaries can be detected robustly and effectively from noise with only two features. For the detector design, a Gaussian-kernel SVM is used. The proposed detection method is applied to detecting word boundaries for an isolated-word recognition system in variable noisy environments. Experiments with different types of noise and various signal-to-noise ratios are performed. The results show that good performance is achieved using the LWE- and ZCR-based SVM. Comparison with another robust detection method also verifies the performance of the proposed method.

Keywords: Speech detection, word boundary detection, support vector machine, wavelet transform, noisy speech recognition.

1. INTRODUCTION

For speech recognition, the detection of speech affects recognition performance. A robust word boundary detection method in the presence of variable-level noise is necessary and is studied in this paper. Depending on the characteristics of speech, a variety of parameters have been proposed for boundary detection. They include the time energy (the magnitude in the time domain), the zero crossing rate (ZCR) [1], and pitch information [2]. These parameters usually fail to detect word boundaries when the signal-to-noise ratio (SNR) is low. Parameters derived in the frequency domain have also been proposed recently. The time-frequency (TF) parameter [3], which sums the time-domain energy and the frequency energy, was presented along this line. The TF-based algorithm may work well for fixed-level background noise; however, its detection performance degrades for background noise of varying levels. To address this problem, some modified TF parameters were proposed [4]. In [5], the idea of using wavelet transform features for speech detection was proposed. In this paper, we present a new Low-band Wavelet Energy (LWE) parameter which separates speech from noise in the wavelet transform domain. Computation of the LWE parameter is easier than that of the modified TF parameters, and the experiment section shows that better detection performance is achieved.

After the features for detection have been extracted, the next step is to determine thresholds and decision rules. Many decision methods based on computational intelligence techniques have been proposed, such as fuzzy neural networks (FNNs) [4] and neural networks (NNs) [6]. Generalization performance may be poor when FNNs and NNs are over-trained. To cope with this low generalization ability, the Support Vector Machine (SVM) has been proposed [7, 8]. SVM is a learning method whose formulation is based on the principle of structural risk minimization: instead of minimizing an objective function based on the training error, SVM attempts to minimize a bound on the generalization error. SVM has gained wide acceptance due to its high generalization ability over a wide range of applications. For this reason, this paper uses an SVM as the detector.

The rest of the paper is organized as follows. Section 2 introduces the derivation and analysis of the LWE and ZCR parameters. Section 3 describes the SVM detector. Experiments on word boundary detection for noisy speech recognition are reported in Section 4. Finally, Section 5 draws conclusions.

2. ROBUST DETECTION PARAMETERS

The wavelet transform (WT) is a technique for time-frequency analysis that is well suited to non-stationary signals [9]. For short-time analysis of a discrete speech signal, the discrete-time WT (DTWT) is used. Let s(k, i) denote the amplitude of the k-th point in the i-th frame of a noisy speech signal, and let N be the frame length in samples. The DTWT of the i-th speech frame is

  DTWT(m, n) = (1/√(a_0^m)) Σ_{k=0}^{N−1} s(k, i) ψ(a_0^{−m} k − n τ_0),    (1)

where ψ(·) is a wavelet basis function, a_0 is the scale, and τ_0 is a translation parameter, set to a_0^{−m} in this paper. The commonly used value a_0 = 2 is adopted, resulting in a binary dilation. Thus, Eq. (1) can be written as

  DTWT(m, n) = (1/√(2^m)) Σ_{k=0}^{N−1} s(k, i) ψ[2^{−m}(k − n)].    (2)

The Haar wavelet is used in Eq. (2), where

  ψ[2^{−m}(k − n)] =  1,  if 0 ≤ 2^{−m}(k − n) < 1/2,
                    = −1,  if 1/2 ≤ 2^{−m}(k − n) < 1,
                    =  0,  otherwise.    (3)

Generally, the DTWT is computed at scales a_0^m for, theoretically, all m. The output of the DTWT can be regarded as the output of a bank of band-pass filters, where different scales correspond to different band-pass filters.

Fig. 1. (a) The LWEs of clean speech. (b) The LWEs of speech with white noise added at SNR5.

The outputs of the DTWT at different scales contain different amounts of speech and noise information, and only the crucial scale(s) that carry maximum word-signal information while remaining robust to noise should be used. Therefore, the energy at the crucial scale is adopted as the detection parameter for distinguishing speech from noise in this paper. To find the crucial scale, observations on the effect of additive noise were made at different DTWT scales. It was found that at scale m = 1 the distribution of the DTWT amplitudes matches the speech intervals well. After computing the DTWT of each time frame of a speech signal at this scale, the next step is to find an energy parameter that represents the amount of word-signal information at the scale. The speech sections correspond to large DTWT amplitude values, so the summation of the amplitudes over n can serve as such a parameter. It is also found that the noise amplitudes tend to become larger when the translation index n exceeds 0.8N; the summation is therefore performed only from n = 0 to n = 0.8N. This novel detection parameter is called the low-band wavelet energy (LWE).

The LWE of frame i is computed as

  LWE_i = Σ_{n=0}^{0.8N} | (1/√2) Σ_{k=0}^{N−1} s(k, i) ψ[2^{−1}(k − n)] |.    (4)

For illustration, a clean speech signal and the corresponding LWE parameter of each frame are shown in Fig. 1(a); the same speech with white noise added at SNR5 is shown in Fig. 1(b). This example shows that the LWE parameter robustly represents the energy of the speech signal at different SNRs.

In addition to the LWE parameter, which measures speech energy, the other parameter used for speech detection is the Zero Crossing Rate (ZCR). The ZCR is particularly suitable for unvoiced-speech detection owing to the high-frequency nature of the majority of fricatives. Figure 2 shows the distributions of speech/non-speech frames in the LWE-ZCR plane for noise levels SNR = 20, 15, 10, and 5. The results show that the speech frames are located in a certain region of the two-dimensional feature space.

Fig. 2. Distributions of speech/non-speech frames in the LWE-ZCR plane with noise ranging from SNR20 to SNR0, where '.' and '+' denote non-speech and speech, respectively.
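A compact sketch of the two detection features follows (our illustration of (3)-(4) with numpy; the frame is assumed to be a 1-D array of samples, and scale m = 1 follows the crucial-scale choice above):

```python
import numpy as np

def haar_dtwt_scale1(frame):
    """DTWT of one frame at scale m = 1 with the Haar wavelet of (3).

    At this dilation, correlating with the Haar prototype reduces to a
    scaled difference of neighbouring samples: (s[n] - s[n+1]) / sqrt(2).
    """
    N = len(frame)
    out = np.zeros(N)
    out[:-1] = (frame[:-1] - frame[1:]) / np.sqrt(2.0)
    return out

def lwe(frame):
    """Low-band wavelet energy (4): sum of |DTWT(1, n)| for n up to 0.8N."""
    coeffs = haar_dtwt_scale1(frame)
    return np.abs(coeffs[: int(0.8 * len(frame))]).sum()

def zcr(frame):
    """Zero crossing rate of one frame."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)
```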

3. SUPPORT VECTOR MACHINE DETECTOR

SVM is based on the statistical learning theory developed by Vapnik [7]. SVM first maps the input points into a high-dimensional feature space and finds a separating hyperplane that maximizes the margin between two classes in this space. Suppose we are given a labeled training set S = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^n and y_i ∈ {+1, −1}. Considering that the training data may be linearly non-separable, the goal of SVM is to find an optimal hyperplane such that

  y_i (w^T x_i + b) ≥ 1 − ξ_i,  i = 1, ..., N,    (5)

where w ∈ R^n, b ∈ R, and ξ_i ≥ 0 is a slack variable. For ξ_i > 1, the datum is misclassified. Finding an optimal hyperplane amounts to solving the constrained optimization problem

  min_{w, ξ} (1/2) w^T w + C Σ_{i=1}^{N} ξ_i
  subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i,    (6)

where C is a user-defined positive cost parameter and Σ ξ_i is an upper bound on the number of training errors. After solving Eq. (6), the final hyperplane decision function is obtained:

  f(x) = sgn(w^T x + b) = sgn( Σ_{i=1}^{N} y_i α_i ⟨x_i, x⟩ + b ) = sgn( Σ_{i∈SV} y_i α_i ⟨x_i, x⟩ + b ),    (7)

where α_i is a Lagrange multiplier, and the training samples for which α_i ≠ 0 are the support vectors (SVs). A detailed derivation can be found in [8]. The linear SVM extends readily to a nonlinear classifier by first applying a nonlinear operator Φ to map the input data into a higher-dimensional feature space. Replacing x in Eqs. (6) and (7) with Φ(x) and solving the constrained optimization problem yields the decision function

  f(x) = sgn( Σ_{i=1}^{N} y_i α_i ⟨Φ(x_i), Φ(x)⟩ + b ) = sgn( Σ_{i=1}^{N} y_i α_i K(x_i, x) + b ) = sgn( Σ_{i∈SV} y_i α_i K(x_i, x) + b ),    (8)

where K(x_i, x_j) = Φ(x_i) · Φ(x_j) is called a kernel function. This paper uses a Gaussian-kernel SVM with K(x_i, x_j) = exp(−‖x_i − x_j‖² / γ), where γ is the width of the Gaussian kernel. The two-dimensional input of the Gaussian-kernel SVM detector is the (LWE, ZCR) pair. The SVM has a single output whose desired value is +1 if the input frame is speech and −1 if it is non-speech. During testing, the SVM output indicates whether or not the input frame is speech.
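An equivalent detector can be assembled with an off-the-shelf SVM implementation (a sketch, not the authors' code; scikit-learn is assumed, the toy feature values are made up, and only C = 40 follows the experiments reported below):

```python
import numpy as np
from sklearn.svm import SVC

# Each row is one frame's [LWE, ZCR] pair; labels are +1 (speech) / -1 (non-speech).
X_train = np.array([[12.5, 0.05], [0.8, 0.30], [10.1, 0.08], [0.5, 0.25]])
y_train = np.array([1, -1, 1, -1])

# The Gaussian kernel K(x, x') = exp(-||x - x'||^2 / gamma) of (8) corresponds to
# sklearn's RBF kernel with gamma_sklearn = 1 / gamma (here a placeholder value).
detector = SVC(C=40.0, kernel="rbf", gamma=1.0)
detector.fit(X_train, y_train)

frame_is_speech = detector.predict(np.array([[11.0, 0.06]]))  # -> array([1])
```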

4. EXPERIMENTS

The speech wave files are recorded at an 11.025 kHz sample rate, mono channel, and 16-bit resolution. For SVM training, the training sequence, shown in Fig. 3, is 3 seconds long. It consists of 20 words and is corrupted by white noise whose energy level increases from the start until SNR = 0 and then decreases till the end of the sequence. For testing, the speech database is built from sequences of transcriptions by the same male speaker, where each sequence consists of the ten isolated Mandarin digit words 0, 1, ..., 9. A total of 50 test sequences play the judicial role in the performance comparison. The noise added to each speech sequence is of variable level over the course of the sequence.

Fig. 3. The sequence of speech used for SVM training.

Figure 4(a) shows the flowchart of training the LWE-based SVM, and Fig. 4(b) shows the flow of testing it.

Fig. 4. (a) Flowchart of SVM training. (b) Flowchart of LWE-based SVM for test data.

Fig. 5. Training performance of C in the range [1, 85], where the range is spaced into 7 equal scales.

Fig. 6. Word boundary results by LWE-based SVM and RTF-based RSONFIN in a variable noise-level environment.

The classification rate defined in Eq. (9) is used as the training performance index:

  Classification rate = (correctly detected frame number) / (total frame number in the training sequence).    (9)

For SVM, the value of C influences the training performance. Fig. 5 shows the training performance for C in the range [1, 85], where the range is spaced into 7 equal scales. The cost value C is set to 40 in the following experiments.

Fig. 7. Noisy speech recognition results by different word boundary detection methods: (a) white noise, (b) factory noise.

With this setting, there are a total of 3 SVs in the trained SVM. To get a quick view of the test performance, some illustrative examples are shown in Fig. 6, where white noise with sharp amplitude variation is added to the clean speech. Most word boundaries are correctly detected. For comparison, Fig. 6 also shows the performance of the refined time-frequency (RTF) feature based recurrent self-organizing neural fuzzy inference network (RTF-based RSONFIN) [4] detection method. RTF-based RSONFIN fails to detect most of the words, while the LWE-based SVM performs much better.

Next, the ten Mandarin digit words in each test sequence are to be recognized, with the words detected by the two methods respectively. When the duration of successive frames detected as speech is longer than 0.1 second, we regard them as a word for recognition; otherwise these frames are discarded. The number of words detected in each sequence may therefore be larger or smaller than exactly ten. Considering this phenomenon, we define the recognition rate

  recognition rate = (T − E − U − S) / T × 100%,    (10)

where T is the total number of words in the reference transcriptions, E is the number of words recognized incorrectly, U is the number of un-detected words of the reference transcriptions, and S is the number of surplus words relative to the reference transcriptions. For the recognizer, the hierarchical singleton-type recurrent neural fuzzy network (HSRNFN) [10], trained with SNR20 white-noise data, is used. The reason for using HSRNFN is that it achieves a high recognition rate and is robust to different types of noise under different SNRs. With the HSRNFN recognizer, the recognition results by hand segmentation,

LWE-based SVM, and RTF-based RSONFIN under white and factory noise are shown in Fig. 7. The results show that the recognition rate of the LWE-based SVM method is slightly lower than that of hand segmentation, but much higher than that of the RTF-based RSONFIN method.

5. CONCLUSIONS

Two research results on robust speech detection in variable noise-level environments have been presented in this paper: one is the robust LWE-based parameters, and the other is detector design by SVM. Variable-level rather than fixed-level noise is added to each test sequence. The distributions of the LWE-based parameters in the two-dimensional feature space for different SNRs show that the LWE-based parameters are feasible for speech detection over variable-level noise, and the LWE-based SVM can be applied to a speech recognition system, as demonstrated in the experiments.

REFERENCES

[1] M. H. Savoji, "A robust algorithm for accurate end-pointing of speech signals," Speech Communication, vol. 8, no. 1, 1989.
[2] J. Rouat, Y. C. Liu, and D. Morissette, "Pitch determination and voiced/unvoiced decision algorithm for noisy speech," Speech Communication, vol. 21, no. 3, 1997.
[3] J. C. Junqua, B. Mak, and B. Reaves, "A robust algorithm for word boundary detection in the presence of noise," IEEE Trans. Speech and Audio Processing, vol. 2, 1994.
[4] G. D. Wu and C. T. Lin, "A recurrent neural fuzzy network for word boundary detection in variable noise-level environments," IEEE Transactions on Systems, Man, and Cybernetics, vol. 31, no. 1, 2001.
[5] J. F. Wang and S. H. Chen, "A C/V segmentation algorithm for Mandarin speech signal based on wavelet transforms," in Proc. ICASSP, vol. 1, March 1999.
[6] Y. Qi and B. R. Hunt, "Voiced-unvoiced-silence classification of speech using hybrid features and a network classifier," IEEE Trans. Speech and Audio Processing, vol. 1, 1993.
[7] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[8] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[9] Y. T. Chan, Wavelet Basics, Kluwer Academic Publishers, 1995.
[10] C. F. Juang, C. T. Chou, and C. L. Lai, "Hierarchical singleton-type recurrent neural fuzzy networks for noisy speech recognition," IEEE Trans. Neural Networks, vol. 18, no. 3, May 2007.


Noise-Robust Speech Features Based on Cepstral Time Coefficients

Jia-Zang Yeh
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan

Chia-Ping Chen
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan

Abstract

In this paper, we investigate the noise-robustness of features based on cepstral time coefficients (CTC). By cepstral time coefficients, we mean the coefficients obtained by applying the discrete cosine transform to the commonly used mel-frequency cepstral coefficients (MFCC). Furthermore, we apply the temporal filters used for computing delta and acceleration dynamic features to the CTC, resulting in delta and acceleration features in the frequency domain. We experiment with five different variations of such CTC-based features. The evaluation is done on the Aurora 3 noisy digit recognition tasks in four different languages. The results show that, with one exception, the proposed feature sets actually lead to performance gains; the best feature set achieves an improvement of 25% over the baseline feature set of MFCC.

Keywords: MFCC, CTC, delta, robust feature.

1. Introduction

A front-end of a speech recognition system may consist of several stages of noise-robustness processing to achieve good performance. In the early, spectral-domain stage, well-known methods such as spectral subtraction [1] and Wiener filtering [2] may be applied. In the middle, cepstral-domain stage, the mel-frequency cepstral coefficients (MFCC) are commonly used as the static feature set. In the post-processing stage, there may be normalization, temporal information integration, and transformation modules. It has been observed that simple normalization approaches, such as cepstral mean subtraction (CMS) [3], cepstral variance normalization (CVN) [4], and histogram equalization (HEQ) [5], can lead to significant improvement in recognition accuracy in noisy environments. Apparently such methods are capable of alleviating the mismatch between clean and noisy data.

In this paper we investigate novel features based on simple transformation methods. Specifically, we insert a window of static cepstral vectors into a matrix and then apply the discrete cosine transform (DCT) along the temporal axis. The coefficients after the DCT are called the cepstral time coefficients (CTC).

Figure 1: The block diagram of the proposed feature transformation methods.

The resultant matrix is called the cepstral time matrix (CTM) [6, 7]. After the CTM of each frame is extracted, we further apply normalization and the delta/acceleration extraction routines to the cepstral time coefficients. The transformed features are combined with the static MFCC features to form the final feature vector.

This paper is organized as follows. Section 2 defines the cepstral time matrix and introduces the investigated feature transformations. The experimental setup and recognition results are described in Section 3. In Section 4, we draw conclusions.

2. Feature Transformations

Our feature extraction and transformation process is illustrated in Figure 1. We begin with a review of the cepstral time matrix, followed by the mathematical definitions of the proposed additive transformation methods.

2.1. Cepstral Time Coefficients

We first insert a fixed number of adjacent feature vectors into a matrix:

  C^t = [ C^t_{11} C^t_{12} ... C^t_{1T}
          ...
          C^t_{K1} C^t_{K2} ... C^t_{KT} ]
      = [ f_t  f_{t+1}  ...  f_{t+T−1} ].    (1)

Here K is the feature vector dimension, f_t is the feature vector of frame t, and C^t is the matrix whose column vectors are the T consecutive feature vectors starting from frame t. The cepstral time matrix at frame t, D^t, is related to C^t by the discrete cosine transform: each row of D^t is the discrete cosine transform of the corresponding row of C^t. That is,

  D^t_{i:} = DCT(C^t_{i:}),    (2)

where D^t_{i:} is the i-th row of matrix D^t. We call D^t_{in} the n-th cepstral time coefficient (CTC) of channel i at frame t; D^t is also called the cepstral time matrix (CTM). It represents the spectral information of cepstral coefficient i in an analysis window of T frames.

Since our matrix index starts from 1 instead of 0, the DCT here reads

  D^t_{in} = Σ_{τ=1}^{T} C^t_{iτ} cos( (2τ − 1)(n − 1)π / (2T) ).    (3)

2.2. CTC-Based Features

In this paper, we apply 5 different transforms to the CTC, each leading to a different feature vector. (In general, we use the notation A_{i:} to denote the i-th row vector and A_{:j} the j-th column vector of a matrix A.)

2.2.1. Method E

The first transform divides the first column of D^t by the number of frames T, leaving the other columns unchanged. Let E^t be the new feature matrix; we have

  E^t_{:1} = D^t_{:1} / T,
  E^t_{:n} = D^t_{:n},  n ≠ 1.    (4)

Note that E^t_{:1} has a physical meaning: according to (3), it is the mean of the cepstral coefficients within the analysis window (while D^t_{:1} is their sum). We then compute a novel feature set based on E^t. Specifically, we treat the columns of E^t as a temporal sequence and apply the delta and acceleration feature extraction steps:

  Ĕ^t_{:2} = E^t_{:2} − E^t_{:1},
  Ĕ^t_{:3} = E^t_{:3} − 2 E^t_{:2} + E^t_{:1}.    (5)

We add Ĕ^t_{:2} and Ĕ^t_{:3} to the static MFCCs, resulting in the feature vector

  Ẽ^t = [ C^t_{:1} ; Ĕ^t_{:2} ; Ĕ^t_{:3} ].    (6)

2.2.2. Method F

An alternative transform normalizes the values in the first column to the range [−1, 1]. This is achieved by dividing D^t_{:1} by the maximum magnitude of the first column. Let F^t be defined by

  F^t_{:1} = D^t_{:1} / N^t,
  F^t_{:n} = D^t_{:n},  n ≠ 1,    (7)

where N^t = max_d |D^t_{d1}| is the maximum magnitude in the first column. The remaining operations are the same as in Method E. That is,

  F̆^t_{:2} = F^t_{:2} − F^t_{:1},
  F̆^t_{:3} = F^t_{:3} − 2 F^t_{:2} + F^t_{:1}.    (8)
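The CTM construction of (1)-(3) can be sketched as follows (our illustration, not the authors' code; `features` is assumed to be the matrix of static feature vectors, one row per frame):

```python
import numpy as np

def cepstral_time_matrix(features, t, T=5):
    """Cepstral time matrix D^t of (2)-(3).

    features: (num_frames, K) array of static MFCC vectors f_t.
    Returns D of shape (K, T): row i holds the CTCs of channel i at frame t.
    """
    C = features[t : t + T].T                      # C^t of (1), shape (K, T)
    tau = np.arange(1, T + 1)                      # 1-based indices as in (3)
    n = np.arange(1, T + 1)
    basis = np.cos((2 * tau[None, :] - 1) * (n[:, None] - 1) * np.pi / (2 * T))
    return C @ basis.T                             # D_{in} = sum_tau C_{i,tau} cos(...)
```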

We add F̆^t_{:2} and F̆^t_{:3} to the static MFCCs, resulting in the feature vector

  F̃^t = [ C^t_{:1} ; F̆^t_{:2} ; F̆^t_{:3} ].    (9)

2.2.3. Method G

In Method G, we add the first and second columns of the CTM, which represent the zeroth and first cepstral time coefficients, to the static MFCC vector:

  G^t = [ C^t_{:1} ; D^t_{:1} ; D^t_{:2} ].    (10)

2.2.4. Method H

In Method H, we add the second and third columns of the CTM, which represent the first and second cepstral time coefficients, to the static MFCC vector:

  H^t = [ C^t_{:1} ; D^t_{:2} ; D^t_{:3} ].    (11)

2.2.5. Method I

In Method I, we no longer use the MFCCs. Instead, we simply use the zeroth, first, and second cepstral time coefficients:

  I^t = [ D^t_{:1} ; D^t_{:2} ; D^t_{:3} ].    (12)

2.2.6. Method B

For completeness, we describe our baseline features as Method B. The baseline simply uses the 12 MFCCs (c_1, ..., c_12), the log energy, and the delta and delta-delta features. The feature vector therefore has a dimension of 39, which agrees with the other methods. Furthermore, our baseline results agree with the Aurora 3 baseline results [8, 9].

3. Experiments

3.1. Experimental Database

We evaluate the proposed CTC-based speech features on the Aurora 3 noisy-digit recognition tasks [8, 9]. Aurora 3 is a multi-lingual speech database consisting of digit-string utterances in Danish, German, Finnish, and Spanish. It provides a platform for fair comparison between systems with different front-ends. All the results reported in this paper follow the Aurora 3 evaluation guidelines.
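Before turning to the results, here is a sketch of the Method H assembly of (11) as a companion to the formulas above (our illustration, reusing the cepstral_time_matrix function from the previous sketch):

```python
import numpy as np

def method_h_vector(features, t, T=5):
    """Feature vector H^t of (11): static MFCCs plus CTM columns 2 and 3.

    With K = 13 static coefficients this yields a 39-dimensional vector,
    matching the baseline dimensionality of Method B.
    """
    D = cepstral_time_matrix(features, t, T)   # D^t, shape (K, T)
    static = features[t]                       # C^t_{:1}, the static part f_t
    return np.concatenate([static, D[:, 1], D[:, 2]])
```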

3.2. Results

We first evaluate the number of vectors to be included in C^t and decide to use T = 5. For the static features we use 12 MFCC features and the log energy, making K = 13. Therefore, the initial matrix C^t is of size 13 × 5. Table 1 lists the experimental results on the Aurora 3 database; the entries in the table are the averaged relative improvements of the word error rates over the baseline.

Consistent behavior across the different methods is observed in the experiments. Specifically, Method H achieves the best performance, while Method G yields the worst performance, in all languages. Given that Method G and Method H differ only in the cepstral time coefficients they include in the final feature vector, it is fair to say that the zeroth cepstral time coefficient is detrimental to recognition accuracy. Methods E, F, and I yield mixed results: in Finnish, Method E outperforms Methods F and I, while in Spanish and Danish, Method F outperforms Methods I and E. Methods E and F are similar in the sense that the first column (the zeroth cepstral time coefficients) is normalized and then used in procedures similar to delta and acceleration feature extraction, in the frequency domain rather than in the time domain, so it is not surprising that they reach similar performance levels.

Table 1: The overall (averaged over conditions) relative improvements of the word error rates in the Aurora 3 tasks.

  Method | German | Spanish | Finnish | Danish
  E
  F
  G
  H
  I

The comparison of Methods G and H shows that the zeroth CTC, which corresponds to the first column of the CTM, is detrimental to recognition accuracy. Therefore, in Methods E and F we try schemes that normalize the first column of the CTM: in Method E we divide the first column by T, and in Method F we normalize its values to the range −1 to 1. The performance of E and F given in Table 1 is better than the baseline. Lastly, we also try Method I, which uses only CTCs and excludes the MFCCs; its recognition accuracy is also better than the baseline.

Figure 2 plots the temporal sequences of the fifth dimension of the third column (Dimension 31 out of 39) of the feature vectors of Methods B, F, and H for a pair of Danish utterances. The pair consists of an utterance from Channel 0 (the cleaner instance) and an utterance from Channel 1 (the noisier instance). Specifically, using the previously defined notation, Figure 2(B) is the plot of Δ²f^t_5 (the acceleration of the fifth static coefficient), Figure 2(F) is the plot of F̆^t_{53}, and Figure 2(H) is the plot of H^t_{53}. The difference between Channel 0 and Channel 1 appears smaller in cases (F) and (H) than in case (B); the mismatch is thus reduced.

Table 2 lists the experimental results of Method H on the Aurora 3 database, given as percent word error rate (WER). These results include the four Aurora 3.0 languages (Finnish, Spanish, German, and Danish) and the Well-Matched (WM), Medium-Matched (MM), and Highly-Mismatched (HM) training/testing cases.

Table 2: Our most recent Aurora 3.0 results using Method H, given as percent word error rate (WER). These results include the four Aurora 3.0 languages (Finnish, Spanish, German, and Danish) and the Well-Matched (WM), Medium-Matched (MM), and Highly-Mismatched (HM) training/testing cases.

  Aurora 3 Reference Word Error Rate: German | Spanish | Finnish | Danish (rows: WM, MM, HM)
  Aurora 3 Word Error Rate, Method H: German | Spanish | Finnish | Danish (rows: Well, Mid, High)
  Aurora 3 Relative Percentage Improvement: German | Spanish | Finnish | Danish | Avg. (rows: Well, Mid, High, overall)

4. Conclusion and Future Work

In this paper, we evaluate five different feature sets based on the cepstral time coefficients. Methods E and F, which first normalize the first column and then apply the delta and delta-delta operations on the first three columns of the CTM, lead to performance gains over the baseline. Methods G and H, which combine different sets of CTM columns with the raw MFCC vector, lead to mixed results. Method I, which uses only cepstral time coefficients, leads to improvement. Overall, the combination of the raw MFCCs with the second and third columns of the CTM yields the best results among all experimented feature sets.

5. References

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, April 1979.
[2] A. Berstein and I. Shallom, "An hypothesized Wiener filtering approach to noisy speech recognition," in Proc. ICASSP-91, 1991.

Figure 2: Plot of Dimension 31 (out of 39) of a Danish utterance recorded in two mismatched channels. (B) is Δ²f^t_5, (F) is F̆^t_{53}, and (H) is H^t_{53}. The horizontal axis is the frame index and the vertical axis is the feature value. The dotted line ('.') represents Channel 0 and the starred line ('*') represents Channel 1.

[3] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254-272, 1981.
[4] O. Viikki, D. Bye, and K. Laurila, "A recursive feature vector normalization approach for robust speech recognition in noise," in Proc. ICASSP, vol. 2, 1998.
[5] A. de la Torre, A. Peinado, J. Segura, J. Perez-Cordoba, M. Benitez, and A. Rubio, "Histogram equalization of speech representation for robust speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, 2005.

[6] B. Milner, "Inclusion of temporal information into features for speech recognition," in Proc. ICSLP 96, vol. 1, 1996.
[7] B. Milner, "A comparison of front-end configurations for robust speech recognition," in Proc. ICASSP '02, vol. 1, 2002.
[8] Motorola, Au/374/0, "Small vocabulary evaluation: Baseline mel-cepstrum performances with speech endpoints," October 2001.
[9] A. Moreno, B. Lindberg, C. Draxler, G. Richard, K. Choukri, S. Euler, and J. Allen, "Speechdat-Car: A large speech database for automotive environments," in Proceedings of the 2nd LREC Conference, 2000.

A Study of Sub-band Modulation Spectrum Compensation for Robust Speech Recognition

Sheng-yuan Huang
Dept. of Electrical Engineering, National Chi Nan University, Taiwan, Republic of China

Wen-hsiang Tu
Dept. of Electrical Engineering, National Chi Nan University, Taiwan, Republic of China

Jeih-weih Hung
Dept. of Electrical Engineering, National Chi Nan University, Taiwan, Republic of China

Abstract

In this paper, we propose a novel scheme for performing feature statistics normalization techniques for robust speech recognition. In the proposed approach, the processed temporal-domain feature sequence is first converted into the modulation spectral domain. The magnitude part of the modulation spectrum is decomposed into non-uniform sub-band segments.

Each sub-band segment is then individually processed by the well-known normalization methods, like mean normalization (MN), mean and variance normalization (MVN), and histogram equalization (HEQ). Finally, we reconstruct the feature stream from all the modified sub-band magnitude spectral segments and the original phase spectrum using the inverse DFT. With this process, the components that correspond to more important modulation spectral bands in the feature sequence can be processed separately. For the Aurora-2 clean-condition training task, the newly proposed sub-band spectral MVN and HEQ provide relative error rate reductions of 18.66% and 23.58% over the conventional temporal MVN and HEQ, respectively.

1. Introduction

The performance of automatic speech recognition (ASR) [1] is easily degraded by acoustic variation, such as environmental mismatch, speaker variation, and pronunciation variation. A family of robustness techniques normalizes the statistics of the feature streams in the temporal domain, including cepstral mean normalization (CMN) [2], cepstral mean and variance normalization (CMVN) [3], and histogram equalization (HEQ) [4]; spectrum histogram equalization (SHE) [5] instead equalizes the probability distribution of the magnitude modulation spectrum. Temporal filtering approaches [6][7], together with studies on the relative importance of different modulation frequencies for recognition (roughly up to 16 Hz) [8], suggest that different modulation-frequency bands should not be treated uniformly, which motivates the sub-band extension of SHE and related methods proposed in this paper.

2. The Proposed Approach

The proposed scheme processes the magnitude modulation spectrum [5] of each feature stream in sub-bands, using MN, MVN, or HEQ, through the following steps.

Step 1. Let {x_m[n]; 1 ≤ n ≤ N, 1 ≤ m ≤ M} be the M temporal-domain feature sequences of an utterance with N frames, where x_m[n] is the m-th feature component at frame n (2-1). Each sequence is processed independently, so the subscript m is dropped below.

Step 2. The sequence x[n] is converted by the discrete Fourier transform (DFT) into its modulation spectrum:

  X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πnk/N},  0 ≤ k ≤ N−1.    (2-2)

Since x[n] is sampled at the frame rate F_s (in Hz), X[k] covers the modulation frequency range [0, F_s/2]. In polar form,

  X[k] = |X[k]| e^{jθ[k]} = A[k] e^{jθ[k]},    (2-3)

where A[k] is the magnitude spectrum and θ[k] the phase spectrum of x[n].

Step 3. The one-sided magnitude spectrum {A[k]; 0 ≤ k ≤ N/2} is decomposed into L non-uniform, octave-like sub-bands, with band edges at the modulation frequencies

  f ∈ [0, F_s/2^L] for the first band, and f ∈ (F_s/2^{L−l+2}, F_s/2^{L−l+1}] for bands l = 2, 3, ..., L,    (2-4)

so that the lowest band is the narrowest.

Step 4. Let A'_l[k] denote the magnitude samples of sub-band l. Each sub-band segment is individually normalized. For sub-band mean normalization (MN),

  Ã'_l[k] = A'_l[k] − μ_{l,s} + μ_{l,a},    (2-5)

where μ_{l,s} is the mean of sub-band l of the single utterance and μ_{l,a} the corresponding mean over all training data. For sub-band mean and variance normalization (MVN),

  Ã'_l[k] = (σ_{l,a}/σ_{l,s}) (A'_l[k] − μ_{l,s}) + μ_{l,a},    (2-6)

where σ_{l,s} and σ_{l,a} are the standard deviations of sub-band l for the single utterance and for all training data, respectively. For sub-band histogram equalization (HEQ),

  Ã'_l[k] = F_{l,a}^{−1}( F_{l,s}(A'_l[k]) ),    (2-7)

where F_{l,s} is the cumulative distribution function of the utterance's sub-band magnitudes and F_{l,a} is the reference distribution estimated from all training data.

Step 5. The updated sub-band segments {Ã'_l[k]} are concatenated back into a full magnitude spectrum {Ã[k]; 0 ≤ k ≤ N/2}, and the new feature sequence is obtained by the inverse DFT (IDFT) with the original phase:

  x̃[n] = (1/N) Σ_{k=0}^{N−1} Ã[k] e^{jθ[k]} e^{j2πnk/N},  0 ≤ n ≤ N−1,    (2-8)

where the symmetries Ã[N−k] = Ã[k] and θ[N−k] = −θ[k] are used for N/2 < k ≤ N−1.

Following [5], the resulting three methods are called sub-band spectral mean normalization (SB-SMN), sub-band

spectral mean and variance normalization (SB-SMVN), and sub-band spectral histogram equalization (SB-SHE). The full-band SHE of [5] is denoted FB-SHE, and FB-SMN and FB-SMVN are defined analogously. The number of sub-bands L is set to 2, 3, or 4, with the band edges placed as in (2-4) so that the low modulation-frequency region, which is the most informative for recognition, is split most finely.
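Steps 2-5 with the MVN variant (2-6) can be summarized in a short sketch (our illustration; the reference statistics ref_stats would in practice be estimated over all training utterances, and the band-edge formula follows (2-4)):

```python
import numpy as np

def sub_band_mvn(x, L=4, ref_stats=None):
    """Sub-band spectral mean and variance normalization (SB-SMVN).

    x: one temporal-domain feature stream (e.g., the c1 sequence) of length N.
    L: number of octave-like sub-bands over the magnitude modulation spectrum.
    ref_stats: per-band (mean, std) reference pairs; per-utterance stats if None.
    """
    N = len(x)
    X = np.fft.rfft(x)                       # (2-2): one-sided modulation spectrum
    mag, phase = np.abs(X), np.angle(X)      # (2-3): polar form
    K = len(mag)
    # Non-uniform (octave) band edges per (2-4): the lowest band is the narrowest.
    edges = [0] + [int(K / 2 ** (L - l)) for l in range(1, L + 1)]
    for l in range(L):
        band = slice(edges[l], edges[l + 1])
        mu_s, sigma_s = mag[band].mean(), mag[band].std() + 1e-8
        mu_a, sigma_a = ref_stats[l] if ref_stats else (mu_s, sigma_s)
        mag[band] = sigma_a / sigma_s * (mag[band] - mu_s) + mu_a    # (2-6)
    # (2-8): reconstruct with the original phase spectrum.
    return np.fft.irfft(mag * np.exp(1j * phase), n=N)
```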

3. Observations on the Processed Modulation Spectra

Before the recognition experiments, we examine the power spectral density (PSD) of feature streams processed by the temporal-domain methods (CMN, CMVN, HEQ) and by the proposed spectral methods, for the Aurora-2 task [9] under babble noise at various SNRs. For MN, the figure below shows the PSD curves of the first cepstral coefficient c1: (a) unprocessed; (b) processed by temporal CMN; (c)-(f) processed by MN on the modulation spectrum with one (full-band), 2, 3, and 4 bands, i.e., FB-SMN, SB-SMN (L=2), SB-SMN (L=3), and SB-SMN (L=4). The PSD curves of SB-SMN (L=4) for clean and noisy speech lie closest to each other, indicating that finer sub-band processing reduces the mismatch.

(Figure: PSD curves of c1. (a) unprocessed; (b) temporal CMN; (c) FB-SMN; (d) SB-SMN (L=2); (e) SB-SMN (L=3); (f) SB-SMN (L=4).)

The same comparison is made for MVN: the PSD curves of c1 are shown (a) unprocessed, (b) processed by temporal CMVN, and (c)-(f) processed by FB-SMVN and SB-SMVN with L = 2, 3, 4. Again the curves of SB-SMVN (L=4) exhibit the smallest mismatch between clean and noisy conditions.

The MVN-processed PSDs thus behave similarly to the MN-processed ones.

(Figure: PSD curves of c1. (a) unprocessed; (b) temporal CMVN; (c) FB-SMVN; (d) SB-SMVN (L=2); (e) SB-SMVN (L=3); (f) SB-SMVN (L=4).)

For HEQ, the PSD of c1 processed by temporal HEQ is compared with the full-band and sub-band spectral versions in the same way as for MN and MVN (panels (c) FB-SMN, (c) FB-SMVN, and (c) FB-SHE). The full-band FB-SHE produces PSD curves similar to those of FB-SMN and FB-SMVN.

Following [5], the full-band SHE (FB-SHE) is extended here to the sub-band SHE (SB-SHE). As in the MN and MVN cases, panels (c)-(f) below correspond to the full-band method and to the sub-band versions with L = 2, 3, and 4; the PSD curves of SB-SHE (L=4) for clean and noisy speech again match most closely.

(Figure: PSD curves of c1. (a) unprocessed; (b) temporal HEQ; (c) FB-SHE; (d) SB-SHE (L=2); (e) SB-SHE (L=3); (f) SB-SHE (L=4).)

4. Experimental Setup and Evaluation Measures

(1) The recognition experiments follow the Aurora-2 task defined by the European Telecommunication Standards Institute (ETSI) [9]. The test material covers several noise types (e.g., babble) at signal-to-noise ratios (SNRs) of clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB, together with the G.712 and MIRS channel characteristics defined by the International Telecommunication Union (ITU) [10].

(2) The baseline features are MFCCs derived from 8 kHz speech with a frame size of 25 ms (200 samples), a frame shift of 10 ms, a Hamming window, a 256-point DFT, and a 23-channel Mel filter bank. Each feature vector consists of 13 MFCCs (c1-c12 plus log-energy) together with their delta and acceleration coefficients, 39 dimensions in total. The acoustic models are hidden Markov models (HMMs) [11] for the digit words (zero, one, two, ..., nine, oh) plus silence.

(3) Performance is reported via the absolute error rate reduction (AR), the relative error rate reduction (RR) with respect to the baseline, and a second relative error rate reduction (RR2) with respect to the corresponding temporal-domain method:

  AR = accuracy_new% − accuracy_baseline%,    (3-1)
  RR = (accuracy_new% − accuracy_baseline%) / (100% − accuracy_baseline%) × 100%,    (3-2)
  RR2 = (accuracy_new% − accuracy_compared%) / (100% − accuracy_compared%) × 100%.    (3-3)
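The three measures (3-1)-(3-3) translate directly into code (a trivial sketch; all accuracies are in percent):

```python
def evaluation_measures(acc_new, acc_baseline, acc_compared):
    """Measures of (3-1)-(3-3), all in percentage points.

    AR : absolute accuracy improvement over the MFCC baseline.
    RR : relative error rate reduction over the baseline.
    RR2: relative error rate reduction over a compared method
         (e.g., the temporal-domain CMVN or HEQ counterpart).
    """
    ar = acc_new - acc_baseline
    rr = 100.0 * (acc_new - acc_baseline) / (100.0 - acc_baseline)
    rr2 = 100.0 * (acc_new - acc_compared) / (100.0 - acc_compared)
    return ar, rr, rr2
```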

5. Experimental Results

Two general observations can be made. First, in terms of RR, the three sub-band methods SB-SMN, SB-SMVN, and SB-SHE all improve over their full-band counterparts, with SB-SHE performing best. Second, allocating more sub-bands to the low modulation-frequency region (roughly below 16 Hz) helps: with L = 4, the RR2 figures are 8.3% for SB-SMN, 32.64% for SB-SMVN, and 7.56% for SB-SHE.

Table: recognition accuracy (%) on Sets A, B, and C, with AR, RR, and RR2.

  Method | Set A | Set B | Set C | average | AR | RR | RR2
  Baseline
  FB-SMN / SB-SMN (L=2) / SB-SMN (L=3) / SB-SMN (L=4)
  FB-SMVN / SB-SMVN (L=2) / SB-SMVN (L=3) / SB-SMVN (L=4)
  FB-SHE / SB-SHE (L=2) / SB-SHE (L=3) / SB-SHE (L=4)

The proposed methods can also be cascaded with temporal-domain processing of the MFCCs, namely CMN, CMVN, MVA (mean and variance normalization followed by ARMA filtering), and HEQ. For SMN:

(1) SB-SMN (L=4) cascaded with CMN achieves 88.02% accuracy, higher than the 81.66% of CMN alone and the 78.99% of SB-SMN (L=4) alone; even SB-SMN (L=2) improves on CMN.

(2) In terms of RR2, cascading SMN helps every temporal-domain method: SB-SMN (L=4) improves CMN, CMVN, MVA, and HEQ alike, as shown in the table below.

(3) Cascading SB-SMN (L=4) with CMVN, MVA, or HEQ brings the accuracy to around 90.00%, indicating that the sub-band spectral processing and the temporal-domain methods are complementary.

Table: recognition accuracy (%) of SMN cascaded with temporal-domain methods.

  Method | Set A | Set B | Set C | average | AR | RR | RR2
  Baseline
  CMN + {FB-SMN, SB-SMN (L=2), SB-SMN (L=3), SB-SMN (L=4)}
  CMVN + {FB-SMN, SB-SMN (L=2), SB-SMN (L=3), SB-SMN (L=4)}
  MVA + {FB-SMN, SB-SMN (L=2), SB-SMN (L=3), SB-SMN (L=4)}
  HEQ + {FB-SMN, SB-SMN (L=2), SB-SMN (L=3), SB-SMN (L=4)}

For SMVN:

(1) SB-SMVN (L=4) cascaded with CMVN achieves 89.87%, higher than the 83.23% of CMVN alone and the 86.36% of SB-SMVN (L=4) alone.

(2) In terms of RR2, cascading SMVN likewise improves CMN, CMVN, MVA, and HEQ.

Table: recognition accuracy (%) of SMVN cascaded with temporal-domain methods (CMN, CMVN, MVA, HEQ), in the same layout as the previous table.

For SHE:

(1) SB-SHE (L=4) cascaded with CMVN achieves 90.8%, compared with 83.23% for CMVN alone; SB-SHE thus clearly benefits CMVN-processed MFCCs.

(2) In terms of RR2, cascading SHE improves CMN, CMVN,

MVA, and HEQ, with an RR2 of 6.9% over HEQ.

Table: recognition accuracy (%) of SHE cascaded with temporal-domain methods (CMN, CMVN, MVA, HEQ), in the same layout as the previous tables.

6. Conclusion

The proposed sub-band modulation spectrum compensation treats the low modulation-frequency region (up to about 16 Hz), which carries the most linguistically relevant information, separately from the rest of the spectrum. It consistently improves the recognition accuracy of both the full-band spectral methods and the temporal-domain normalization methods.

References

[1] (In Chinese), 2004.
[2] Sadaoki Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254-272, 1981.
[3] Olli Viikki and Kari Laurila, "Cepstral domain segmental feature vector normalization for noise robust speech recognition," Speech Communication, vol. 25, pp. 133-147, 1998.
[4] Ángel de la Torre, Antonio M. Peinado, José C. Segura, José L. Pérez-Córdoba, Ma Carmen Benítez, and Antonio J. Rubio, "Histogram equalization of speech representation for robust speech recognition," IEEE Trans. on Speech and Audio Processing, 2005.
[5] Liang-che Sun, Chang-wen Hsu and Lin-shan Lee, "Modulation spectrum equalization for robust speech recognition," in Proc. IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007.
[6] Hynek Hermansky and Nelson Morgan, "RASTA processing of speech," IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994.
[7] Hynek Hermansky and Petr Fousek, "Multi-resolution RASTA filtering for TANDEM-based ASR," in Proc. Interspeech, 2005.
[8] Noboru Kanedera, Takayuki Arai, Hynek Hermansky, and Misha Pavel, "On the importance of various modulation frequencies for speech recognition," in Proc. Eurospeech, 1997.
[9] David Pearce and Hans-Günter Hirsch, "The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions," in Proc. ISCA ITRW ASR2000, Paris, France, pp. 181-188, 2000.
[10] ITU recommendation G.712, "Transmission performance characteristics of pulse code modulation channels," Nov. 1996.
[11] Henry Stark and John W. Woods, Probability and Random Processes with Applications to Signal Processing, 3rd Edition, Prentice-Hall, 2002.

Web Mining for Unsupervised Classification

Wei-Yen Day
Department of Computer Science and Information Engineering, National Taiwan University

Chun-Yi Chi
Department of Computer Science and Information Engineering, National Taiwan University

Ruey-Cheng Chen
Department of Computer Science and Information Engineering, National Taiwan University

Pu-Jen Cheng
Department of Computer Science and Information Engineering, National Taiwan University

Pei-Sen Liu
Institute for Information Industry

Abstract

Data acquisition is a major concern in text classification. The excessive human effort required by conventional methods to build up a quality training collection might not always be available to research workers. In this paper, we look into possibilities to automatically collect training data by sampling the Web with a set of given class names. The basic idea is to populate appropriate keywords and submit them as queries to search engines for acquiring training data. Two methods are presented in this study: one based on sampling the common concepts among the classes, and the other based on sampling the discriminative concepts for each class. A series of experiments were carried out independently on two different datasets, and the results show that the proposed methods significantly improve classifier performance even without using manually labeled training data.

Our strategy for retrieving Web samples, we find, is substantially helpful to conventional document classification in terms of accuracy and efficiency.

Keywords: Unsupervised classification, text classification, Web mining.

1. Introduction

Document classification has been extensively studied in the fields of data mining and machine learning. Conventionally, document classification is a supervised learning task [1, 2] in which adequately labeled documents should be given so that various classification models, i.e., classifiers, can be learned accordingly. However, this requirement of supervised text classification has its limitations in practice. First, the cost of manually labeling a sufficient amount of training documents can be high. Secondly, the quality of the labeling work is questionable, especially when the annotator is unfamiliar with the topics of the given classes. Thirdly, in certain applications, such as email spam filtering, the prototypes of documents considered spam might change over time, and a need emerges for dynamic training corpora specifically tailored to this kind of application. Automatic methods for data acquisition, therefore, can be very important in real-world classification work and require further exploration.

Previous work on automatic acquisition of training sets can be divided into two types. The first type focuses on augmenting a small number of labeled training documents with a large pool of unlabeled documents. The key idea is to train an initial classifier to label the unlabeled documents and use the newly-labeled data to retrain the classifier iteratively. Although classifying unlabeled data is efficient, human effort is still involved at the beginning of the training process. The second type focuses on collecting training data from the Web. As more data is put on the Web every day, there is great potential to exploit the Web and devise algorithms that automatically fetch effective training data for diverse topics. A major challenge for Web-based methods is how to locate quality training data by sending effective queries, e.g., class names, to search engines. Work of this type can be found in [3, 4, 5, 6], which present approaches that assume the search results initially returned for a class name are relevant to the class. The search results are then treated as auto-labeled, and additional terms associated with the class names are extracted from the labeled data. By sending the class names together with the associated terms, appropriate training documents can be retrieved automatically. Although generating queries is more convenient than manually collecting training data, the quality of the initial search results may not always be good, especially when the given classes convey multiple concepts.

The goal of this paper is, given a set of concept classes, to automatically acquire a training corpus based merely on the names of the given classes.

Similar to our previous attempts, we employ a technique that produces keywords by expanding the concepts encompassed in the class names, queries the search engines, and uses the returned snippets as training instances in the subsequent classification tasks. Two issues may arise with this technique. First, the given class names are usually very short and ambiguous, making search results less relevant to the classes. Secondly, the expanded keywords generated from different classes may be very close to each other, so that the corresponding search-result snippets have little discrimination power to distinguish one class from the others. We present two concept expansion methods to deal with these problems, respectively.

The first method, expansion by common concepts, aims at alleviating the problem of ambiguous class names. The method utilizes the relations among the classes to discover their common concepts; for example, "company" could be one of the common concepts of the classes "Apple" and "Microsoft". Combined with the common concepts, training documents relevant to the given classes can be retrieved. The second method, expansion by discriminative concepts, aims at finding discriminative concepts among the given classes. For example, "iPod" could be one of the unique concepts of class "Apple". Combined with the discriminative concepts, effective training documents that distinguish one class from another can be retrieved.

Our methods are tested under two different experimental setups, the CS-paper and Web-page classification tasks. The proposed methods are effective in retrieving quality training data by querying search engines. Moreover, the results show that the obtained Web training data and manually labeled training data are complementary: our methods can significantly improve classification accuracy when only a little manually labeled training data is available.

The contributions of our work can be summarized as follows. We propose an automatic way to sample the Web and collect training data of good quality. In contrast to previous work, our methods are fully automatic, reliable, and robust, achieving 81% accuracy in text classification tasks; with the help of a small number of labeled data added into the scene, the classification accuracy can be as high as 90%. Several experimental results are also reported to support the investigation and realization of automatic Web sampling methods, and the difficulties encountered are presented in detail.

The sections are organized as follows. In Sections 2 and 3, we present our basic idea and the two methodologies, respectively. The experiments are introduced in Section 4. In Section 5, we discuss the related work. Finally, in Section 6, we give discussions and conclusions.

2. The Basic Idea

Suppose that we are given a set of classes C = (c_1, c_2, ..., c_n), where c_i is the name of the i-th class. We plan to generate keywords based on the classes C, form a few queries, and send them off to search engines so as to collect training instances. The methods presented in this paper are independent of classification models; that is, any model can be incorporated with our methods.

To carefully examine the feasibility of querying search engines for acquiring training data, we did an evaluation with different search engines, search-result types (snippet or document), and numbers of search results. 5 CS-related classes were taken into account: Architecture, IR, Network, Programming, and Theory. Each class name c_i was sent to 3 search engines: Google (1), Yahoo! (2), and Live Search (3). The top 100 snippets were extracted as training data. We also gathered research papers from the conferences corresponding to the 5 classes as the testing documents.

Table 1 shows the performance of the different search engines. Querying by the class names achieves classification accuracy starting from as low as 0.35. More specifically, the three search engines perform well on Programming and Theory but poorly on the other classes on average. This arises from the fact that irrelevant documents may be located for classes with ambiguous names; querying by the class names alone is thus not reliable (the first challenge). For example, the word "architecture" is widely used in CS, art, and construction. Based on these results, we select Google as our backend search engine in this paper.

We further explore whether classification performance can be improved by downloading whole Web pages for training. The result is shown in Table 2. It reveals that Web pages may introduce more noise than snippets do, while snippets summarize the Web pages and capture the concepts of the classes C through their context. Moreover, downloading Web pages is time-consuming. Our methods, therefore, retrieve only snippets as the training source.

Table 1. Accuracy of different search engines for classification of CS papers.

  Engine | Architecture | IR | Network | Programming | Theory | Avg.
  Google
  Yahoo!
  Live Search

Intuitively, collecting more snippets or documents might enhance the performance. Table 3 shows the results of changing the training data size from 100 to 900. Classification accuracy does not increase noticeably once the numbers of snippets and documents reach 200 and 300, respectively. This is because most of the relevant information is already present in the top-ranked search results, while noise is unavoidably included from longer lists. Hence, simply fetching a large number of snippets or documents from a single search result cannot achieve satisfactory performance. Even if we expand the queries, i.e., the class names, using pseudo-relevance feedback (PRF) [7, 8, 9], the improvement is still minor, since the expanded keywords cannot effectively discriminate between different classes (our second challenge). The performance comparison between our methods and PRF is given in Section 4.2.

(1) The Google search engine. (2) The Yahoo! search engine. (3) The MSN Live search engine.

To collect good training corpora and help classifiers learn more quickly (querying search engines is costly), two methods are proposed in this paper. The first method, expansion by common concepts, aims at alleviating the ambiguity problem. Generally, a short class name easily conveys multiple meanings; for example, class "Apple" may refer to a fruit or a company. We find that a class c is context-aware if its context C − {c} provides relevant information about c: if "Apple" and "Microsoft" are put together, "Apple" is most likely a company, whereas given "apple" and "banana", "apple" more likely refers to a fruit. Our first method, described in Section 3.1, tries to discover such common concepts among the classes C, i.e., "company" and "fruit", from the Web, and uses them as constraints to expand the original queries C.

Table 2. Accuracy of different training types in CS-paper classification.

  Source | Architecture | IR | Network | Programming | Theory | Avg.
  Snippets
  Documents

Table 3. Average accuracy of different training sizes in CS-paper classification.

  # of items | Snippets | Documents

Common concepts can help us collect documents more relevant to each class, but they cannot discriminate one class from the others. The latter matters because classification is inherently about distinguishing between classes. Our second method therefore focuses on finding discriminative concepts among the classes C. Consider the previous example: "PowerPoint" and "iPod" are possible discriminative concepts, because "PowerPoint" relates only to Microsoft while "iPod" relates only to Apple. Different from PRF, our second method, expansion by discriminative concepts, described in Section 3.2, aims at acquiring Web training data that is not only relevant to each class but also effective in distinguishing one class from another.

3. The Proposed Methods

In this section, we describe the two training data acquisition methods: sampling the Web by common-concept expansion and by discriminative-concept expansion, respectively.

3.1 Expansion by Common Concepts

The goal of our first method is to collect training data by sampling the Web with the common concepts discovered among the given classes C. Expanding the class names C by their common concepts is helpful in obtaining more suitable training data from search engines. An intuitive way to discover the concepts is to look them up in well-known topic hierarchies on the Web, such as the Open Directory Project (DMOZ) (4), which is one of the largest and most comprehensive human-edited directories. To obtain common concepts, we first search DMOZ with each class name c_i.

DMOZ by each class name c_i and get a set of nodes relevant to c_i in the DMOZ directory. Suppose the set of nodes is N(c_i). The least common ancestors (LCA) of all of the node sets N(c_i) are viewed as the common concepts of C. The LCAs are the shared ancestors of the N(c_i) located farthest from the root of the DMOZ directory. For example, the shared DMOZ ancestor categories obtained by querying the class names would be the common concepts between the classes.

Although the Open Directory Project covers diverse topics and is very precise, sometimes we might get few or even no common concepts among C. The problem is serious for classes that are not popular, such as names of persons or organizations. For example, if we query the class Cornell, we only get 5 paths for the class, which contain few candidate concepts to select and expand. To deal with this problem, we extract terms co-occurring with each class c_i in Web pages, cluster the terms, and treat the representative term of each cluster as one of the common concepts. More specifically, all of the classes {c_1, c_2, …, c_n} ∈ C are combined into one query c_1 + c_2 + … + c_n and then submitted to a search engine. After stemming and removing stopwords, we extract 20 high-frequency terms as candidates for common concepts from the top 100 snippets returned by the search engine. To group these candidates, we send them separately to the search engine and generate corresponding feature vectors based on their top 100 snippets. Uni- and bi-grams are adopted as feature terms, and TF-IDF is used to calculate the feature weights. Next, a graph G = (V, E) is constructed, where v ∈ V represents one candidate term and e ∈ E is weighted by the cosine similarity between two feature vectors. Finally, we perform the star clustering algorithm [10] to choose the star centers, which are the common concepts among C. In this paper, we adopt both the LCAs from DMOZ and the co-occurring terms from Web pages as our common concepts among C. After the common concepts are generated, we can either use them to sample the Web and acquire good training data, or utilize them while the discriminative concepts are being generated.
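A minimal sketch of the co-occurring-term grouping step, using scikit-learn for TF-IDF and cosine similarity; the 0.3 edge threshold for the star clustering graph is an assumed value, not one given in the paper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def star_centers(candidate_terms, snippet_texts, threshold=0.3):
    """Group candidate concept terms with the star clustering algorithm [10]
    and return the star centers as common-concept candidates.
    snippet_texts[i] is the concatenation of the top-100 snippets retrieved
    for candidate_terms[i]; the edge threshold is an assumption."""
    # Uni- and bi-gram TF-IDF features, as in the paper.
    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(snippet_texts)
    sim = cosine_similarity(X)           # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)
    adj = sim >= threshold               # graph edges above the threshold
    remaining = set(range(len(candidate_terms)))
    centers = []
    while remaining:
        # pick the remaining vertex with the most remaining neighbors
        star = max(remaining, key=lambda v: adj[v, list(remaining)].sum())
        centers.append(candidate_terms[star])
        # remove the star and its satellites from further consideration
        remaining -= {star} | {u for u in remaining if adj[star, u]}
    return centers
```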

3.2 Expansion by Discriminative Concepts
A discriminative concept is a concept that can help distinguish one particular concept class of interest (say, c) from all the other classes (c' ∈ C). Such a concept contributes more relevance to one specific class than to the others. Unlike common concepts, which are shared by all the concept classes in C, a discriminative concept has a specific concept class to contribute relevance to, called the host. Let f_c be the feature vector of concept class c, and let the similarity between any two concepts x and y be denoted as σ(x, y) = cos(f_x, f_y). An ideal discriminative concept k for a concept class c must satisfy both of the following constraints:
1. Concept k should exhibit high similarity to its host c.
2. The similarity between k and c should be significantly greater than that between k and any one of the other concept classes.
These constraints loosely define the criteria that we can use to discover discriminative concepts, and they also rule out the possibility that a concept k has two or more hosts. Based on the second constraint, a plausible decision criterion for discriminative concepts is given below. Let Φ_c denote the set of all discriminative concepts hosted by concept class c. We have:

k ∈ Φ_c  ⇔  σ(c, k) > β · σ(c', k)  for all c' ≠ c

In this criterion, the right-hand-side inequality needs to be satisfied for all other concept classes c'; in other words, we have a multiple-constraint-satisfaction problem. For every concept-class pair (c, c'), the ratio σ(c, k)/σ(c', k) describes the degree of deviation in the similarities exhibited by k toward the two classes. When the value is greater than 1, we say that c is more likely to be the host; when it lies between 0 and 1, we say that c' is more likely. The parameter β determines the tightness of the decision boundary. Since we expect higher similarity between k and c than between k and the other c's, it would normally be set to a value greater than 1. Generally, a fixed boundary value is easier to train and suitable for general cases, but we also find that this type of setup can cause problems in extreme cases. Suppose we have two classes c_1 and c_2 which are relevant topics or extraordinarily similar to each other (in terms of the similarity between their feature vectors). The number of discriminative concepts for either c_1 or c_2 may drastically decrease because, for most k's, the deviation in similarities toward the two classes becomes even more subtle and harder to detect with a simple constant. An obvious solution to this problem, we find, is to adopt a function in place of the constant, one that assigns a high threshold value for general cases and a low threshold value for the aforementioned extreme cases. In real practice, we use a very simple form of decision criterion to identify discriminative concepts:

k ∈ Φ_c  ⇔  σ(c, k)/σ(c', k) > λ(σ(c, c'))  for all c' ≠ c

where λ is the discrimination coefficient. With the new criterion, a number of 5 to 40 discriminative concepts (for each class) can still be discovered even when two concept classes exhibit high class-to-class similarity to each other.

The next step is to apply this technique to each concept class in C so as to populate new training instances from the Web. Let SR_c be the set of search-result snippets obtained by sending concept c as a query to the search engine. Assume that the training set for concept class c before and after the expansion is denoted as D_c and D'_c, respectively. Given Φ_c, we can form a set of new queries by concatenating c with each k ∈ Φ_c, send them off to the search engine, and obtain a new set of training instances, i.e., {SR_{c⊕k} | k ∈ Φ_c}. In formalism, we have:

D'_c = D_c ∪ (∪_{k∈Φ_c} SR_{c⊕k})
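The decision criterion can be expressed directly in code. The sketch below assumes dense feature vectors and an assumed, purely illustrative form for the discrimination coefficient λ; the paper does not fix λ's exact functional form:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two feature vectors."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(np.dot(u, v) / (nu * nv)) if nu and nv else 0.0

def discriminative_concepts(candidates, class_vecs, lam):
    """Assign each candidate concept k to host class c whenever
    cos(c,k)/cos(c',k) > lam(cos(c,c')) holds for every other class c'.
    `candidates` maps concept name -> feature vector, `class_vecs` maps
    class name -> feature vector, and `lam` is the discrimination
    coefficient function."""
    hosts = {c: [] for c in class_vecs}
    for k, fk in candidates.items():
        for c, fc in class_vecs.items():
            # multiplied-out form of the ratio test, avoiding division
            if all(cos(fc, fk) > lam(cos(fc, fc2)) * cos(fc2, fk)
                   for c2, fc2 in class_vecs.items() if c2 != c):
                hosts[c].append(k)
    return hosts

# Assumed example: lower the boundary when two classes are very similar.
lam = lambda s: 1.2 * (1.0 - s)
```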

This procedure can be repeated several times for each class so that the total number of discovered training instances reaches our expectation. However, certain changes to the definitions and notations required by this adaptation need to be clarified in advance. First, we can no longer expect the feature vector of a concept class c to remain the same throughout multiple iterations: in each iteration, when new training instances are added to the collection, the feature vector of c actually changes. Second, the set of discriminative concepts discovered in each iteration will vary, as the similarity measure is affected by the change as well. Certain modifications to the definitions should be made so as to fit the aforementioned criteria seamlessly into the framework, while treating the notations for a concept and for a concept class differently might clutter up the framework. For simplicity, no explicit treatment will be given to the equations in the following text. Readers should take caution that, whenever class-to-concept or class-to-class similarity computation is considered in the discussion, the feature vector of a concept class is in fact derived from its current training set (rather than SR_c). On the other hand, we will use Φ_c(i) in place of plain Φ_c in the rest of the work.

Assume that the algorithm repeats t times. The initial training data for concept class c is denoted as D_c(0), and in each iteration i ∈ [1, t], a new training set for c is produced and represented by D_c(i). Consider a simple framework as follows. For all c ∈ C, we have:

D_c(0) = SR_c
D_c(i) = D_c(i−1) ∪ (∪_{k∈Φ_c(i)} SR_{c⊕k}),  i > 0

where D_c(i) and Φ_c(i) are the set of training instances and the set of discovered discriminative concepts, respectively, for concept class c at iteration i. Eventually, the algorithm stops after the t-th iteration. The content of D_c(t) serves as the final training data for each concept class c.

Since it is practically infeasible to populate the entire set Φ_c, several heuristics are involved in the creation of the set: 1) We look for terms with high discrimination power (specifically, unigrams and bigrams) in the context of D_c(i−1) using commonly-used information-theoretic measures, such as information gain and inverse document frequency. These candidates are then examined with the decision criterion, and disqualified ones are discarded immediately; 2) candidates that survive the test are ranked by the score function, which is a simple rewrite of the criterion that indicates the average degree of deviation for a candidate k:

score_c(k) = Σ_{c'∈C, c'≠c} [ σ(c, k)/σ(c', k) − λ(σ(c, c')) ]

The summands will not cancel out, since the score is calculated only for candidates satisfying the criterion. Generally, testing all candidate terms may result in an extremely inefficient procedure. In practice, we set up a strict threshold on the information gain and IDF to reduce the number of candidates, and we test only the top m concepts selected by this filter. The value m is set to 5 throughout this work.
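Putting the pieces together, a minimal sketch of the iterative framework (with t = 7, the value used in Section 4.1); search_snippets and find_discriminative are assumed helpers standing in for the search-engine call and the criterion of Section 3.2:

```python
def expand_training_data(classes, search_snippets, find_discriminative, t=7):
    """Iterative training-data expansion by discriminative concepts.
    `search_snippets(q)` returns the snippet set SR_q for query q, and
    `find_discriminative(c, data)` returns the discriminative-concept
    set for class c given the current training sets; both are assumed
    helpers, not APIs defined in the paper."""
    # D_c(0) = SR_c: seed each class with its own search results.
    data = {c: set(search_snippets(c)) for c in classes}
    for i in range(1, t + 1):
        for c in classes:
            # Discover discriminative concepts from the current D_c(i-1).
            for k in find_discriminative(c, data):
                # Query "c k" and merge the new snippets into D_c(i).
                data[c] |= set(search_snippets(c + " " + k))
    return data
```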

4. Experiments
We evaluate the performance of our methods on CS paper and Web page classification.

4.1 Experimental Setup
We conduct experiments on two datasets. The first is a set of papers from several CS-related conferences, with five classes used for training, including Architecture, IR, Network, Programming, and Theory, as shown in Table 4. For the paper dataset, there are about 500 papers in each class. The other dataset consists of Web pages collected from four universities, which can be downloaded from the WebKB project 5. The classes for these universities are Cornell, Texas, Washington, and Wisconsin. As the original Web page dataset is imbalanced, we randomly choose 827 Web pages (i.e., the minimum size of the original data among the 4 classes) from each class for classification. Please note that the two datasets are our testing data, since the training data are fully collected from the Web. Moreover, the two datasets are quite different in document length and quality: the papers are often longer, well written, and rich in useful information about the CS-related classes, while the Web pages might contain more noise and shorter contents.

Our first method, expansion by common concepts, is denoted as CM; the second method, expansion by discriminative concepts, is denoted as DM. CM+DM is the combination of both methods, where CM is applied first so that the search results are more relevant, and DM is then used to extract discriminative concepts from the relevant search results. With the common concepts extracted by CM, we can use these concepts plus the class name as a whole query and perform DM to iteratively collect the discriminative concepts.

Table 4. The information about the dataset of CS papers for classification.
Class | # papers | From Conferences
Architecture | 490 | SIGARCH (04-08), DAC (00-07)
IR | 484 | SIGIR (02-07), CIKM (02-07)
Network | 446 | SIGCOMM (02-08), IPSN (04-07), MOBICOM (03-07), MOBIHOC (00-07), IMC (01-07)
Programming | 505 | POPL (02-08), PLDI (00-07), ICFP (02-07), OOPSLA (00-07)
Theory | 471 | SODA (01-08), STOC (00-07)

To compare our performance with a state-of-the-art Web-based method, LiveClassifier [3], denoted as LC, has been implemented, where the concept hierarchies are taken from DMOZ. For the CS-related classes, these concepts are computers, reference, business, software, and science; for the classes of the four universities, they are society, people, university, school, education, sports, and United States. The discrimination coefficient λ and the iteration number t used in DM are set to 0.85 and 7, respectively, in both datasets. We use the Rainbow tool 6 and the VSM (Vector Space Model) classification model for all the experiments.

5 The WebKB project:
6 The Rainbow tool:

4.2 Text Classification
Table 5 compares the classification accuracy of the different methods. For each class, the baseline method is to use only the class name as the query, submit it to the search engine, and collect the snippets as training data. The query expansion (QE) method uses pseudo-relevance feedback (PRF) for each class, where the terms with high TF-IDF values in the snippets are selected as expansion terms for acquiring training data. LC is the method implemented based on LiveClassifier [3]: with the concept hierarchy of each class, all the concepts are combined with the class name as a query, which is then submitted to the search engine to collect the snippets as training data.

From Table 5, we find that merely sending class names as queries cannot retrieve quality training data; this achieves an accuracy of only 0.57 on average. Even if we expand the class names with PRF (the QE method), the accuracy is not improved in this case. This is because PRF is a general solution to the keyword mismatching problem and thus is not well suited to our classification problem. LC manually labels some concepts of the class names, e.g., the concept of Architecture is about computer architecture, so that training data more relevant to the original classes can be fetched; LC achieves a higher accuracy of about 0.7. But such manually labeled concepts are often few. Our CM method can discover more useful common concepts, as shown in Table 6, for fetching relevant data for each class when combined with the original class name. CM, on the other hand, might introduce noisy keywords that often co-occur with the class names. With more common concepts, we collect more quality training data by CM, whose accuracy comes to 0.76, much better than LC and QE.

The accuracy of some classes is already high under the baseline method. To understand this result, we checked the process of acquiring training data and found that the snippets from the search engine are already suitable and of good quality for these classes. This is related to a characteristic of the Web: in our experience, the documents ranked highly by search engines tend to be relevant to the fields of computers and networks.

The results of Web page classification, shown in Table 7, are similar to the previous experimental results, except that the average accuracy obtained here is lower in general due to the noisy Web pages, which are unreliable for testing. Moreover, sometimes the concepts derived from the Web are not effective for classification even when they are correct. For example, from the Web, DM learns the two discriminative concepts ut and milwaukee for the classes Texas and Wisconsin, respectively, but their impact on classifying our testing data is futile. Although there is some noise in the Web page dataset and some of the discovered concepts are correct but less useful, CM+DM achieves an improvement when training with the snippets retrieved for these concepts, and surpasses the performance of LiveClassifier (LC). With our methods, quality training data are fetched and the classifier learns better, thus becoming more robust for text classification.

Table 5. Accuracy of classification of CS papers (rows: Baseline, QE, LC, CM, DM, CM+DM; columns: Architecture, IR, Network, Programming, Theory, Avg.). Baseline: class names; QE: query expansion; LC: LiveClassifier; CM: common concept method; DM: discriminative concept method.

Table 6. Extracted common concepts and discriminative concepts of the CS classes.
CM (shared by all classes) — by DMOZ: computers; by the star algorithm: conference, proceeding, web.
DM — Architecture: architecture, architectural, architects, arts, history, contemporary, design; IR: infrared, satellite, visible, image, weather, thermal, cameras; Network: usa, health, action, sports, food, monitor, global; Programming: tutorial, java, example, using, oriented, download, linux; Theory: graph, mathematical, problems, literary studies, political, number.
CM+DM — Architecture: isca, energy, oriented, adaptive, adede, ecaade, architectural; IR: investor, infrared, retrieval, ecir, trec, sigir; Network: development, wireless, shows, server, first, email, mail; Programming: ferment, oriented, programmers, final, siggraph, graphics, animation; Theory: number, university, critical, math, computational, theoretical, complexity.

Table 7. Accuracy of classification of Web pages (rows: Baseline, QE, LC, CM, DM, CM+DM; columns: Cornell, Texas, Washington, Wisconsin, Avg.). Baseline: class names; QE: query expansion; LC: LiveClassifier; CM: common concept method; DM: discriminative concept method.

4.3 Combination of Web Training Data for Text Classification
In this experiment, we want to understand how our methods help conventional supervised text classification. We conduct an experiment to compare the performance of labeled data plus training data from the Web against the labeled data alone.

Both datasets are divided into training and testing data. We combine the training data collected from the Web with a sample of p% of the labeled training data as a new training corpus. For comparison, we also train a classifier with just those p% of the labeled data. The value p varies from 1 to 100; p = 100 means using all the labeled data from the original training split, which reduces to supervised learning. 5-fold cross-validation is used to evaluate the classification accuracy.

Figures 1 and 2 show the two experimental results, respectively. We find that the Web corpus helps; in other words, when the labeled data is insufficient or of low quality, sampling training data from the Web can substantially complement the manually-labeled data. The performance of both classifiers increases as more labeled data are included. However, as more and more manually-labeled data are given, the improvement from the Web becomes less obvious or even slightly negative. This is because, once we add more data from the Web, we also introduce noise, so the accuracy grows slowly once enough quality documents are available. In CS paper classification, the performance reaches the level obtained with all training data when 50% of the labeled data is added. For Web page classification, we match the performance of supervised learning using only 20% of the labeled data, and the performance even exceeds that of supervised learning when 40% of the labeled data is joined. This indicates that more quality data from the Web is helpful for classification. The result also shows that using all the labeled data is not always as good as expected, because some of the labeled data are not of good quality.

Fig. 1. Accuracy of training data from the Web plus p% labeled data for CS paper classification.
Fig. 2. Accuracy of training data from the Web plus p% labeled data for Web page classification.

This experiment shows that the training data from the Web does help to improve text classification. Moreover, we can use a very small amount of labeled data, plus the Web corpus our methods collect, to train a desirable classifier. The Web helps the classifiers learn unseen concepts that are absent due to the insufficiency or unreliable quality of the labeled data. It also tells us that suitable training data may change over time, so the original labeled data can perform worse on new classification tasks. From these results, we believe that our methods can benefit the task of text classification as well as other advanced applications.

5. Related Work
Text classification has been extensively studied in many research fields for a long time. Conventionally, supervised learning is applied to text classification [1, 2]. Our work focuses on the problem of how to adequately acquire and label documents automatically for classification models. For the problem of automatically acquiring training data, previous studies follow two directions. One direction focuses on augmenting a small number of labeled training documents with a large pool of unlabeled documents [11, 12, 13, 14, 15, 16, 17, 18]. Such work trains an initial classifier to label the unlabeled documents and uses the newly-labeled data to retrain the classifier iteratively. [11], proposed by Nigam et al., uses the EM clustering algorithm and the naive Bayes classifier to learn from labeled and unlabeled documents simultaneously. [12, 13], proposed by Yu et al., efficiently compute an accurate classification boundary of a class from positive and unlabeled data. In [15], Li et al. use positive and unlabeled data to train a classifier and solve the problem of lacking labeled negative documents. Fung et al. in [17] study the problem of building a text classifier using positive examples and unlabeled examples, where the unlabeled examples are mixed with both positive and negative examples. [14], proposed by Nigam et al., starts from a small amount of labeled data, employs a bootstrapping method to label the rest of the data, and then retrains the classifier. In [18], Shen et al. propose a method that uses the n-multigram model to help the automatic text classification task; this model can automatically discover the latent semantic sequences contained in the document set of each category. Yu et al. in [16] present a framework, called positive example based learning (PEBL), for Web page classification, which eliminates the need for manually collecting negative training examples in preprocessing. Although classifying unlabeled data is efficient, human effort is still involved at the beginning of the training process. In this paper, we propose a process for acquiring training data from the Web that is fully automatic. The method trains a classifier well for document classification without labeled data, which is the main difference from the previous work. Moreover, our experiments show that the Web can help conventional text classification: the training data acquired from the Web expand the coverage of the classifier, which substantially enhances the performance when labeled data are lacking or their quality is not good enough.

The other direction focuses on gathering training data from the Web [3, 4, 5, 6]. In [3], Huang et al. propose a system, called LiveClassifier, which combines relevant class names as queries based on a user-defined topic hierarchy so that documents more relevant to the classes can be found on the Web. [4, 5, 6], proposed by Hung et al., present an approach that assumes the search results initially returned for a class name are relevant to the class; these search results are treated as auto-labeled, and additional terms associated with the class names are extracted from the labeled data. Although these previous works are similar to our methods, all of them involve human intervention. In this paper, we propose a method which automatically finds the associated concepts for the related classes and trains a desirable classifier. The main contribution is that our method utilizes the relationships among classes and samples the Web in an automatic way for the key concepts of each class, and thus further finds the

quality training data from the Web. Without labeled data or associated terms given by humans, our methods perform well and classify documents accurately for the text classification problem.

6. Discussions and Conclusions
In this paper, we propose two methods to automatically sample the Web and find quality training data for text classification. We first examine the effects of different search engines, retrieved data types, and sizes of retrieved data. Moreover, from the experiments on document subsets and the method using associated terms, we know that sampling the Web for the concepts of classes and fetching training data can substantially improve the performance of classification. It can be hard to distinguish classes with ambiguous names and close relationships without labeled data. By discovering common concepts and discriminative concepts, the ambiguity of class names is eliminated, and more relevant concepts are utilized for sampling suitable, quality training data from the Web. Several experiments conducted in this work show that our methods are useful and robust for classifying documents and Web pages. Furthermore, our experiments show that the training data sampled from the Web helps conventional supervised classification, which needs quality labeled data. The results demonstrate that the quality of labeled data might not always be desirable due to the lack of useful key concepts, and we can provide proper training data from the Web to further improve the results of text classification. In addition, two datasets with different characteristics are used in our experiments, and the analysis across the different datasets is carefully conducted in this paper.

Compared to previous works, the advantage of our methods is the fully automatic process of concept expansion and training data collection. Our methods are independent of classification models, so existing models can be incorporated with the proposed methods. However, our work has some limitations. The classes we choose are related to each other; in other words, the performance would be better when the classes are at the same level in a hierarchy of topic classes. With the relationships between the classes, our methods can perform the context-aware technique among the classes to acquire more relevant documents, making the classifiers robust. Going a step further, there are more challenges in choosing the quality documents in the training corpus sampled from the Web. We could also sample good training documents when a pool of unlabeled data is provided. We believe that these challenges are worth studying and will be the directions of our future work.

References
[1] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the 22nd Annual International ACM SIGIR Conference, 1999.
[2] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, pp. 69-90, 1999.
[3] C.-C. Huang, S.-L. Chuang, and L.-F. Chien, "LiveClassifier: Creating hierarchical text classifiers through web corpora," in Proceedings of the 13th International World Wide Web Conference, 2004.

[4] C.-C. Huang, K.-M. Lin, and L.-F. Chien, "…training corpora acquisition through web…," in Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2005.
[5] C.-M. Hung and L.-F. Chien, "…," in Proceedings of the Asia Information Retrieval Symposium, 2004.
[6] C.-M. Hung and L.-F. Chien, "Web-based text classification in the absence of manually labeled training documents," Journal of the American Society for Information Science and Technology, pp. 88-96.
[7] C. Carpineto, R. de Mori, G. Romano, and B. Bigi, "An information-theoretic approach to automatic query expansion," ACM Transactions on Information Systems, vol. 19(1), pp. 1-27, 2001.
[8] "…," in Proceedings of the 16th Annual International ACM SIGIR Conference, 1993.
[9] J. Xu and W. B. Croft, "Query expansion using local and global document analysis," in Proceedings of the 19th Annual International ACM SIGIR Conference, 1996.
[10] J. A. Aslam, K. Pelekhov, and D. Rus, "A practical clustering algorithm for static and dynamic information organization," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.
[11] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39(2/3), pp. 103-134, 2000.
[12] "…," in Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003.
[13] "…," in Proceedings of the 12th Annual International ACM Conference on Information and Knowledge Management, 2003.
[14] "…," in Proceedings of the Workshop for Unsupervised Learning in Natural Language Processing, 1999.
[15] "…," in Proceedings of the International Joint Conference on Artificial Intelligence.
[16] H. Yu, J. Han, and K. C.-C. Chang, "PEBL: Web page classification without negative examples," IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 70-81, 2004.
[17] "…," in Proceedings of the 21st International Conference on Data Engineering, 2005.
[18] D. Shen et al., "…improved through…," in Proceedings of the 22nd International Conference on Data Engineering, 2006.


Query Formulation by Selecting Good Terms
Chia-Jung Lee, Yi-Chun Lin, Ruey-Cheng Chen
Department of Computer Science and Information Engineering
National Taiwan University
{cjlee00, y.crystal,
Pei-Sen Liu
Institute for Information Industry
Pu-Jen Cheng
Department of Computer Science and Information Engineering
National Taiwan University

Abstract
It is difficult for users to formulate appropriate queries for search. In this paper, we propose an approach to query term selection by measuring the effectiveness of a query term in IR systems based on its linguistic and statistical properties in document collections. Two query formulation algorithms are presented for improving IR performance. Experiments on the NTCIR-4 and NTCIR-5 ad-hoc IR tasks demonstrate that the algorithms can significantly improve retrieval performance by 9.2% on average, compared to the performance of the original queries given in the benchmarks. Experiments also show that our method can be applied to query expansion and works satisfactorily in selecting good expansion terms.
Keywords: Query Formulation, Query Term Selection, Query Expansion.

1. Introduction
Users are often expected to give effective queries so that the results returned by an information retrieval (IR) system match their information needs. One major challenge they face is what terms should be generated when formulating the queries. The general assumption of previous work [14] is that nouns or noun phrases are more informative than other parts of speech (POS), and that longer queries can provide more information about the underlying information need. However, are the query terms that users believe to be well-performing really effective in IR? Consider the following description of the information need of a user, which is an example description query in NTCIR-4: "Find articles containing the reasons for NBA Star Michael Jordan's retirement and what effect it had on the Chicago Bulls." Removing stop words is a common way to form a query from such a description. It appears obvious that the terms containing and had carry relatively little information about the topic. Thus, we take the remaining content terms to form the query.

When carefully analyzing these terms, one can find that the meaning of Michael Jordan is more precise than that of NBA Star, and indeed we improve MAP by 4% by removing NBA Star. Yet, interestingly, the performance after removing Michael Jordan is not as bad as we might think; this may result from the fact that Michael Jordan is a famous NBA star of the Chicago Bulls. However, what if other terms such as reason and effect are excluded? There is no explicit clue to help users determine which terms are effective in an IR system, especially when they lack experience in searching documents in a specific domain. Without comprehensively understanding the document collection to be searched, it is difficult for users to generate appropriate queries. As the effectiveness of a term in IR depends not only on how much information it carries in a query (subjectivity from users) but also on what documents there are in a collection (objectivity from corpora), it is therefore important to measure the effectiveness of query terms in an automatic way. Such measurement is useful for selecting effective and ineffective query terms, which can benefit many IR applications such as query formulation and query expansion.

Conventional methods of retrieval modeling, query reformulation and expansion [13] attempt to learn a weight for each query term, which in some sense corresponds to the importance of the query term. Unfortunately, such methods cannot explain what properties make a query term effective for search. Our work resembles some previous works aimed at selecting effective terms. [1,3] focus on discovering key concepts from noun phrases in verbose queries with different weightings. Our work focuses on how to formulate appropriate queries by selecting effective terms or dropping ineffective ones; no weight assignments are needed, and thus conventional retrieval models can be easily incorporated. [4] uses a supervised learning method for selecting good expansion terms from a number of candidate terms generated by a pseudo-relevance feedback technique. However, we differ in that (1) [4] selects specific features so as to emphasize the relation between the original query and the expansion terms, without considering linguistic features, and (2) our approach does not introduce extra terms for query formulation. Similarly, [10] attempts to predict which words in a query should be deleted based on query logs. Moreover, a number of works [2,5,6,7,9,15,16,18,19,20] pay attention to predicting the quality or difficulty of queries, and [11,12] try to find optimal sub-queries by using a maximum spanning tree with mutual information as the weight of each edge. However, their focus is to evaluate the performance of a whole query, whereas we consider units at the level of terms.

Given a set of possible query terms that a user may use to search documents relevant to a topic, the goal of this paper is to formulate appropriate queries by selecting effective terms from the set. Since exhaustively examining all candidate subsets is not feasible at a large scale, we reduce the problem to a simplified one that iteratively selects effective query terms from the set. We are interested in realizing (1) what characteristics of a query term make it effective or ineffective in search, and (2) whether or not the effective query terms (if we are able to predict them) can improve IR performance. We propose an approach to automatically measure the effectiveness of query terms in IR, wherein a regression model learned from training data is applied to predict the term effectiveness on testing data.
Based on this measurement, two algorithms are presented, which formulate queries by selecting effective terms and by dropping ineffective terms from the given set, respectively. The merit of our approach is that we consider various aspects that may influence retrieval performance, including the linguistic properties of a query term and the statistical relationships between terms in a document collection, such as co-occurrence and context dependency. Their impacts on IR have been carefully examined. Moreover, we have conducted extensive experiments on the NTCIR-4 and NTCIR-5 ad-hoc IR tasks to evaluate the performance of the

proposed approach. Based on term effectiveness prediction and the two query formulation algorithms, our method significantly improves MAP by 9.2% on average, compared to the performance of the original queries given in the benchmarks.

In the rest of this paper, we describe the proposed approach to term selection and query formulation in Section 2. The experimental results of retrieval performance are presented in Section 3. Finally, in Section 4, we give our discussion and conclusions.

2. Term Selection Approach for Query Formulation
2.1 Observation
When a user desires to retrieve information from document repositories to learn about a topic, many possible terms may come to mind to form various queries. We call such a set of possible terms the query term space T = {t_1, …, t_n}. A query typically consists of a subset of T. Each query term t_i ∈ T is expected to convey some information about the user's information need. It is, therefore, reasonable to assume that each query term has a different degree of effectiveness in retrieving relevant documents. To explore the impact of one query term on retrieval performance, we start the discussion with a degeneration process, defined as a mapping function that takes the set of terms T as input and produces the set {{t_1}, {t_2}, …, {t_n}} as output. Mathematically, the mapping function is defined as DeGen(T) = {{x} | x ∈ T}. By applying the degeneration process to the given n terms in T, we can construct a set of n queries q = {q_1, q_2, …, q_i, …, q_n}, where q_i = {t_1, …, t_{i−1}, t_{i+1}, …, t_n} stands for the query formed by removing t_i from the original terms T.

Suppose the query term space T well summarizes the description of the user's information need. Intuitively, we believe that the removal of a term (especially an important one) from T may result in a loss of information that harms retrieval effectiveness. To realize how much such information loss may influence IR performance, we conduct an experiment on the NTCIR-4 description queries. For each query, we construct its query term space T by dropping stop words. T is treated as a hypothetical user information need. The remaining terms in the description queries are individually, one at a time, selected to be removed to obtain q_i. Three formulas are used to measure the impact of removing terms, defined as:

g_min(T) = min_i (pf(q_i) − pf(T)) / pf(T)
g_max(T) = max_i (pf(q_i) − pf(T)) / pf(T)
g_avg(T) = (1/|T|) Σ_i (pf(q_i) − pf(T)) / pf(T)

where pf(x) is a performance measurement for query x, g(T) computes the ratio of performance variation, measuring the minimum, maximum, and average performance gain due to the removal of one of the terms from T, and |T| is the number of query terms in T.
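A direct transcription of the three measures, assuming pf is any retrieval-performance function (MAP in our experiments) evaluated over a term set:

```python
def degeneration_gains(T, pf):
    """Compute g_min, g_max, and g_avg for a query term space T, where
    pf(q) is a retrieval-performance measure for the query formed by the
    term set q, and q_i is T with the single term t_i removed."""
    base = pf(T)
    # relative performance change caused by deleting each single term
    gains = [(pf([t for t in T if t != ti]) - base) / base for ti in T]
    return min(gains), max(gains), sum(gains) / len(gains)
```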

We use Okapi as the retrieval model and mean average precision (MAP) as the performance measurement pf(x) in this experiment. The experimental results are shown in Figure 1. When we remove one term from each of the 50 topics {T}, on average, 46 topics show negative influence, i.e., g_avg(T) < 0. This means that deleting one term from T mostly leads to a negative impact on MAP, compared to the original T. On the other hand, g_max(T) > 0 shows that the removal of at least one term positively improves MAP; by removing such terms we can obtain better performance. This phenomenon appears in 35 out of 50 topics, which statistically suggests that noisy terms exist in most user-constructed queries. In short, removing different terms from each topic T causes MAP variation at different levels: some query terms are highly information-bearing, while others might hurt MAP. It is worth mentioning that we conducted the same experiment with the Indri and TFIDF retrieval models using the Lemur toolkit [21]; the results are quite consistent across the different models. This characteristic makes it possible for the effectiveness of a query term in IR to be learned and applied to query formulation.

Fig. 1. MAP gain by removing terms from original NTCIR-4 description queries.

2.2 Problem Specification
When a user desires to retrieve information from document repositories to learn about a topic, many possible terms may come to her mind to form various queries. We call such a set of the possible terms the query term space T = {t_1, …, t_n}. A query typically consists of a subset of T. Each query term t_i ∈ T is expected to convey some information about the user's information need. It is, therefore, reasonable to assume that each query term has a different degree of effectiveness in document retrieval. Suppose Q denotes all subsets of T, that is, Q = PowerSet(T) and |Q| = 2^n. The problem is to choose the best subset q* among all candidates in Q such that the performance gain between the retrieval performance of T and that of q (q ∈ Q) is maximized:

q* = arg max_{q∈Q} {(pf(q) − pf(T)) / pf(T)},   (1)

where pf(x) denotes a function measuring retrieval performance with x as the query. The higher the score pf(x), the better the retrieval performance achieved. An intuitive way to solve the problem is to exhaustively examine all candidate subsets in Q and design a method to decide which q* is best. However, since an exhaustive search is not appropriate for applications at a large scale, we reduce the problem

to a simplified one that chooses the most effective query term t* (t ∈ T) such that the performance gain between T and T − {t} is maximized:

t* = arg max_{t∈T} {(pf(T) − pf(T − {t})) / pf(T)}.   (2)

Once the best t* is selected, q* can be approximated by iteratively selecting effective terms from T. Similarly, the simplified problem could instead choose the most ineffective term from T such that the performance gain is minimized; q* is then approximated by iteratively removing ineffective or noisy terms from T. Our goals are: (1) to find a function r: T → R which ranks {t_1, …, t_n} based on their effectiveness in performance gain (MAP is used as the performance measurement in this paper), where the effective terms are selected as candidate query terms, and (2) to formulate a query from the candidates selected by the function r.

2.3 Effective Term Selection
To rank the terms t_i in a given query term space T based on the function r, we use a regression model to compute r directly, which predicts a real value from observed features of t_i. The regression function r: T → R is generated by learning from each t_i with training examples of the form <f(t_i), (pf(T) − pf(T − {t_i})) / pf(T)> for all queries in the training corpus, where f(t_i) is the feature vector of t_i, which will be described in Section 2.5.

The regression model we adopt is Support Vector Regression (SVR), a regression analysis technique based on SVM [17]. The aim of SVR is to find the most appropriate hyperplane w that predicts the distribution of data points accurately. Thus, r can be interpreted as a function that seeks the least dissimilarity between the ground truth y_i = (pf(T) − pf(T − {t_i})) / pf(T) and the predicted value r(t_i), and r is required to be of the form w·f(t_i) + b. Finding the function r is therefore equivalent to solving the convex optimization problem:

min_{w, b, ξ_{i,1}, ξ_{i,2}}  (1/2)‖w‖² + C Σ_i (ξ_{i,1} + ξ_{i,2})   (3)

subject to:

y_i − (w·f(t_i) + b) ≤ ε + ξ_{i,1},   (4)
(w·f(t_i) + b) − y_i ≤ ε + ξ_{i,2},  ∀i: ξ_{i,1}, ξ_{i,2} ≥ 0,   (5)

where C determines the trade-off between the flatness of r and the amount up to which deviations larger than ε are tolerated, ε is the maximum acceptable difference between the predicted and actual values we wish to maintain, and ξ_{i,1} and ξ_{i,2} are slack variables that cope with otherwise infeasible constraints of the optimization problem. We use the SVR implementation of LIBSVM [8] to solve the optimization problem. Ranking the terms in query term space T = {t_1, …, t_n} according to their effectiveness is then equivalent to applying the regression function to each t_i; hence, we are able to sort the terms t_i ∈ T into an ordered sequence of effectiveness or ineffectiveness by r(t_i).
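For illustration, the regression function r can be fitted with any epsilon-SVR implementation; the paper uses LIBSVM, while the sketch below uses scikit-learn's wrapper, and the C and ε values shown are placeholders rather than the paper's settings:

```python
from sklearn.svm import SVR

def train_term_effectiveness(features, gains, C=1.0, epsilon=0.1):
    """Fit the regression function r: f(t) -> (pf(T) - pf(T-{t}))/pf(T)
    with epsilon-SVR. `features` holds one vector f(t_i) per training
    term; `gains` holds the corresponding performance-gain labels."""
    r = SVR(kernel="rbf", C=C, epsilon=epsilon)
    r.fit(features, gains)
    return r

# Ranking terms by predicted effectiveness:
# order = sorted(zip(terms, r.predict(features)), key=lambda p: -p[1])
```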

2.4 Generation and Reduction Algorithms
Generation and Reduction, as shown in Fig. 2, formulate queries by greedily selecting effective terms or dropping ineffective terms from space T based on the function r. When formulating a query from query term space T, the Generation algorithm computes a measure of effectiveness r(t_i) for each term t_i ∈ T, includes the most effective term t*, and repeats the process until k terms are chosen (where k is an empirical value given by users). Note that T changes during the selection process, so the statistical features should be re-estimated according to the new T. The selection of the best candidate term ensures that the currently selected term t* is the most informative one among those not yet selected. Compared to Generation, the Reduction algorithm always removes the most ineffective term from the current T in each iteration. Since users may introduce noisy terms into query term space T, Reduction aims to remove such ineffective terms and repeats the process until |T| − k terms have been removed.

Algorithm Generation
Input: T = {t_1, t_2, …, t_n} (query term space), k (# of terms to be selected)
  q ← {}
  for i = 1 to k do
    t* ← arg max_{t∈T} r(t)
    q ← q ∪ {t*}
    T ← T − {t*}
  end
Output q

Algorithm Reduction
Input: T = {t_1, t_2, …, t_n} (query term space), k (# of terms to be selected)
  q ← {t_1, t_2, …, t_n}
  for i = 1 to n − k do
    t* ← arg min_{t∈q} r(t)
    q ← q − {t*}
    T ← T − {t*}
  end
Output q

Fig. 2. The Generation Algorithm and the Reduction Algorithm
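The two algorithms translate directly into code. In the sketch below, extract_features(t, T) is an assumed helper that recomputes the statistical features of t against the current term space (as the algorithms require), and r is the fitted SVR model from Section 2.3:

```python
def generation(T, k, r, extract_features):
    """Greedy Generation: iteratively move the predicted-most-effective
    term from T into the query q until k terms are selected. Features
    are re-estimated against the shrinking T on every round."""
    T, q = list(T), []
    for _ in range(k):
        best = max(T, key=lambda t: r.predict([extract_features(t, T)])[0])
        q.append(best)
        T.remove(best)
    return q

def reduction(T, k, r, extract_features):
    """Greedy Reduction: iteratively drop the predicted-least-effective
    term until only k terms remain."""
    q = list(T)
    for _ in range(len(T) - k):
        worst = min(q, key=lambda t: r.predict([extract_features(t, q)])[0])
        q.remove(worst)
    return q
```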

2.5 Features Used for Term Selection
Linguistic and statistical features provide important clues for the selection of good query terms from the viewpoints of users and collections, and we use them to train the function r.

Linguistic Features: Terms with certain linguistic properties are often viewed as semantics-bearing and informative for search. The linguistic features of query terms mainly consist of parts of speech (POS) and named entities (NE). In our experiments, the POS features comprise noun, verb, adjective, and adverb; the NE features include person names, locations, organizations, and time; and the other linguistic features contain acronym, size (i.e., the number of words in a term), and phrase, all of which have shown their importance in many IR applications. The values of these linguistic features are binary, except for the size feature. POS and NE are labeled manually for high-quality training data; alternatively, they can be tagged automatically for efficiency.

Statistical Features: Statistical features of a term t_i refer to the statistical information about the term in a document collection. This information could be about the term itself, such as term frequency (TF) and inverse document frequency (IDF), or about the relationship between the term and the other terms in the space T. We present two methods for estimating such term relationships. The first method depends on co-occurrences of terms t_i and t_j (t_j ∈ T, t_i ≠ t_j) and co-occurrences of the term t_i and T − {t_i} in the document collection; the former is called the term-term co-occur feature, while the latter is called the term-topic co-occur feature. The second method extracts so-called context vectors as features from the search results of t_i, t_j, and T − {t_i}, respectively. The term-term context feature computes the similarity between the context vectors of t_i and t_j, while the term-topic context feature computes the similarity between the context vectors of t_i and T − {t_i}.

Term-term & term-topic co-occur features: These features measure whether query term t_i itself could be replaced by another term t_j (or by the remaining terms T − {t_i}) in T, and how strong that tendency is. A term without substitutes is supposed to be important in T. Point-wise mutual information (PMI), chi-square statistics (χ²), and the log-likelihood ratio (LLR) are used to measure the co-occurrences between t_i and Z, where Z is either t_j or T − {t_i} in this paper. Suppose that N is the number of documents in the collection and a is the number of documents containing both t_i and Z, denoted as a = #d(t_i, Z). Similarly, we denote b = #d(t_i, ¬Z), c = #d(¬t_i, Z), and d = #d(¬t_i, ¬Z), i.e., d = N − a − b − c.

PMI is a measure of how much term t_i tells us about Z:

PMI(t_i, Z) = log[ p(t_i, Z) / (p(t_i) p(Z)) ] ≈ log[ aN / ((a + b)(a + c)) ]   (6)

χ² compares the observed frequencies with the frequencies expected under independence:

χ²(t_i, Z) = N(ad − bc)² / [ (a + b)(a + c)(b + d)(c + d) ]   (7)

LLR is a statistical test for deciding between the hypotheses of dependency and independency based on the value of this ratio:

LLR(t_i, Z) = 2 [ a log( aN / ((a+b)(a+c)) ) + b log( bN / ((a+b)(b+d)) ) + c log( cN / ((c+d)(a+c)) ) + d log( dN / ((c+d)(b+d)) ) ]   (8)

We make use of average, minimum, and maximum metrics to summarize the term-term co-occur features over all possible pairs (t_i, t_j), for any j ≠ i:

X_avg(t_i) = (1/(n−1)) Σ_{j≠i} X(t_i, t_j),   (9)
X_max(t_i) = max_{j≠i} X(t_i, t_j),  X_min(t_i) = min_{j≠i} X(t_i, t_j),   (10)

where X is PMI, LLR, or χ². Moreover, given T = {t_1, …, t_n} as a training query term space, we sort all terms t_i according to their X_avg, X_max, or X_min values, and their rankings, varying from 1 to n, are treated as additional features. The term-topic co-occur features are nearly identical to the term-term co-occur features, except that they measure the relationship between t_i and the query topic T − {t_i}. The co-occur features can be computed quickly from the indices of IR systems with caches.

Term-term & term-topic context features: The co-occurrence features are reliable for estimating the relationship between high-frequency query terms. Unfortunately, a term t_i may not co-occur with T − {t_i} in the document collection at all. The context features are hence helpful for low-frequency query terms that share common contexts in search results. More specifically, we generate context vectors from the search results of t_i and t_j (or T − {t_i}), respectively. A context vector is composed of a list of pairs <document ID, relevance score>, which can be obtained from the search results returned by the IR system. The relationship between t_i and t_j (or T − {t_i}) is captured by the cosine similarity between their context vectors. Note that extracting the context features requires retrieving documents: the retrieval performance may affect the quality of the context features, and the process is time-consuming.
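A minimal sketch of the co-occurrence scores of Eqs. (6)-(8), computed from the four document counts (counts are assumed nonzero; real use would require smoothing), together with the context-vector cosine just described:

```python
import math

def cooccurrence_scores(a, b, c, d):
    """PMI, chi-square, and LLR from document counts: a = #d(t,Z),
    b = #d(t,~Z), c = #d(~t,Z), d = #d(~t,~Z); N = a + b + c + d.
    All counts are assumed positive (otherwise apply smoothing)."""
    N = a + b + c + d
    pmi = math.log(a * N / ((a + b) * (a + c)))                      # Eq. (6)
    chi2 = N * (a * d - b * c) ** 2 / (
        (a + b) * (a + c) * (b + d) * (c + d))                       # Eq. (7)
    llr = 2 * (a * math.log(a * N / ((a + b) * (a + c)))             # Eq. (8)
               + b * math.log(b * N / ((a + b) * (b + d)))
               + c * math.log(c * N / ((c + d) * (a + c)))
               + d * math.log(d * N / ((c + d) * (b + d))))
    return pmi, chi2, llr

def context_similarity(res_x, res_y):
    """Cosine similarity between two context vectors, each a dict
    mapping document ID -> relevance score from the search results."""
    dot = sum(res_x[doc] * res_y[doc] for doc in set(res_x) & set(res_y))
    nx = sum(v * v for v in res_x.values()) ** 0.5
    ny = sum(v * v for v in res_y.values()) ** 0.5
    return dot / (nx * ny) if nx and ny else 0.0
```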

3. Experiments
3.1 Experiment Settings

Table 1. Adopted dataset after data cleaning (rows: #(query topics), #(distinct terms), #(terms/query); columns: NTCIR-4 <desc>, NTCIR-5 <desc>).

Table 2. Number of training instances. (x : y) shows that the numbers of positive and negative MAP-gain instances are x and y, respectively.
| Indri | TFIDF | Okapi
Original | 674 (156:518) | 702 (222:480) | 687 (224:463)
Upsample | 1036 (518:518) | 960 (480:480) | 926 (463:463)
Train | 828 (414:414) | 768 (384:384) | 740 (370:370)
Test | 208 (104:104) | 192 (96:96) | 186 (93:93)

We conduct extensive experiments on the NTCIR-4 and NTCIR-5 English-English ad-hoc IR tasks. Table 1 shows the statistics of the data collections. We evaluate our methods with description queries, whose average length is 4.9 query terms. Both queries and documents are stemmed with the Porter stemmer, and stop words are removed. The remaining query terms of each query topic form a query term space T. Three retrieval models, the vector space model (TFIDF), the language model (Indri), and the probabilistic model (Okapi), are constructed using the Lemur Toolkit [21] to examine the robustness of our methods across different frameworks. MAP is used as the evaluation metric over the top 1000 retrieved documents. To ensure the quality of the training dataset, we remove the poorly-performing queries whose average precision falls below a small threshold. As different retrieval models produce different MAP on the same queries, there are different numbers of training and test instances for different models. We up-sample the positive instances by repeating them up to the same number as the negative ones. Table 2 summarizes the settings of the training instances.

3.2 Performance of Regression Function
We use 5-fold cross-validation for training and testing our regression function r. To avoid inside testing due to up-sampling, we ensure that all the instances in the training set are different from those in the test set. The R² statistic (R² ∈ [0, 1]) is used to evaluate the prediction accuracy of our regression function r:

R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²,   (11)

where R² explains the variation between the true label y_i = (pf(T) − pf(T − {t_i})) / pf(T) and the fitted value ŷ_i = w·f(t_i) + b for each test query term t_i ∈ T, as explained in Section 2.2, and ȳ is the mean of the ground truth.

Table 3. R² of regression model r with multiple combinations of training features (L: linguistic features; C1: co-occurrence features; C2: context features). Columns cover one group of features (L, C1, C2), two groups (L&C1, L&C2, C1&C2), three groups (L&C1&C2), four groups (3+m-CL, 3+m-SCS), and all features; rows are the Indri, TFIDF, and Okapi models and their average.

Table 3 shows the R² values of different feature combinations over the different retrieval models, where two additional features are taken into account for comparison. Content load (CL) [14] gives unequal importance to words with different POS; our modified content load (m-CL) sets the weight of a noun to 1 and the weights of adjectives, verbs, and participles to 0.47 for IR. Our m-SCS extends the simplified clarity score (SCS) [9] as a feature by calculating the relative entropy between the query terms and the collection language model (unigram distributions). It can be seen that our function r is quite independent of the retrieval models. The performance of the statistical features is better than that of the linguistic features, because the statistical features reflect the statistical relationships between query terms in the document collections. Combining both outperforms either alone, which reveals that the two kinds of features are complementary. The improvement from m-CL and m-SCS is not clear due to their similarity to the other features. Combining all features achieves the best R² value on average, which guarantees a large portion of explainable variation in y, and hence our regression model r is reliable.
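The R² statistic of Eq. (11) is straightforward to compute from the true labels and the model's fitted values:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 of Eq. (11): the proportion of variation in the true labels
    y_i that is explained by the fitted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```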

3.3 Correlation between Features and MAP
Yet another interesting aspect of this study is to find the set of key features that play important roles in document retrieval, that is, the set of features that explain most of the variance of the function r. This task can usually be done in ways fully addressed in regression diagnostics and subset selection, each with varying degrees of complexity. One common method is to apply correlation analysis over the response and each predictor, and look for highly-correlated predictor-response pairs. Three standard correlation coefficients are involved, including Pearson's product-moment correlation coefficient, Kendall's tau, and Spearman's rho. The results are given in Fig. 3, where the x-coordinate denotes the features and the y-coordinate the value of the correlation coefficient.

Fig. 3. Three correlation values between features and MAP on the Okapi retrieval model.

From Fig. 3, the two context features are positively and highly correlated (> 0.5) with MAP under Pearson's coefficient. The correlation between the term-term context feature (cosine) and MAP even climbs up to 0.8. For any query term, a high context feature value indicates a high deviation in the result set caused by removal of the term from the query topic. The findings suggest that the drastic changes incurred in document ranking by the removal of a term can be a good predictor. The trade-off is the high cost of feature computation, because retrieval processing is required. The co-occurrence features such as PMI, LLR, and χ² also show obvious correlation to MAP. The minimum value of LLR correlates more strongly to MAP than the maximum one does, which means that the independence between query terms is a useful feature.

The size feature shows a positive, medium-degree correlation (between 0.3 and 0.5) with MAP. Intuitively, a longer term might naturally be more useful as a query term than a shorter one; this may not always be the case, but generally a shorter term is believed to be less informative due to the ambiguity it encompasses. The same rationale also applies to the phrase feature, because the terms of noun phrases usually refer to a real-world event, which might turn out to be the key of the topic. Nouns and verbs show stronger influence on MAP than the other POS types do, showing high concordance with a common thought in NLP that nouns and verbs are more informative than other types of words. To our surprise, the NE features do not show as high concordance as the others, which might result from insufficient training data. The m-SCS feature, whose correlation is highly notable, has a positive impact. These observations support that the statistical features have higher correlation values than the linguistic ones.
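The three coefficients are available in SciPy; a sketch of the per-feature analysis behind Fig. 3:

```python
from scipy.stats import pearsonr, kendalltau, spearmanr

def feature_map_correlations(feature_values, map_gains):
    """Compute the three correlation coefficients used in Fig. 3 for a
    single feature column against the per-term MAP-gain labels."""
    return {
        "pearson": pearsonr(feature_values, map_gains)[0],
        "kendall": kendalltau(feature_values, map_gains)[0],
        "spearman": spearmanr(feature_values, map_gains)[0],
    }
```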

3.4 Evaluation on Information Retrieval
In this section, we devise experiments to test the proposed query formulation algorithms. The benchmark collections are NTCIR-4 and NTCIR-5. The experiments can be divided into two parts: the first part is a 5-fold cross-validation on the NTCIR-4 dataset, and in the second part we train the models on NTCIR-4 and test them on NTCIR-5. As the two parts differ only in the assignment of training/test data, we stick with the details of the first half (cross-validation) in the following text.

The results are given in Table 4; the evaluation results on NTCIR-4 and NTCIR-5 are presented in the upper and lower halves of the table, respectively. We offer two baseline methods: the first (BL1) uses all the terms of the query term space, while the second (BL2) considers only nouns as query terms, since nouns are claimed to be more informative in several previous works. Besides, an upper bound UB is presented in the benchmark: for each topic, we permute all sub-queries and take the sub-query with the highest MAP. As term selection can also be treated as a classification problem, we use the same features as our regression function r to train two SVM classifiers, Gen-C and Red-C; Gen-C selects terms classified as effective, while Red-C drops terms classified as ineffective. Gen-R and Red-R denote our Generation and Reduction algorithms, respectively. The retrieval results are presented in terms of MAP, and gain ratios in MAP with respect to the two baseline methods are given with the average results. We use the two-tailed t-distribution in the significance test for each method (against BL1) by viewing the AP values obtained in all query sessions as data points, with p < 0.01 marked ** and p < 0.05 marked *.

Table 4. MAP of the baseline and proposed methods on NTCIR-4 and NTCIR-5 <desc> queries with the regression model; rows cover UB, BL1, BL2, Gen-C, Gen-R, Red-C, and Red-R for the Indri, TFIDF, Okapi models and their average. (+x, +y) shows the improvement percentage in MAP over BL1 and BL2: on NTCIR-4, Gen-C (+8.38%, +10.2%), Gen-R (+8.00%, +9.90%), Red-C (+5.60%, +7.46%), Red-R (+5.94%, +7.80%); on NTCIR-5, Gen-C (+9.42%, +8.65%), Gen-R (+11.9%, +11.1%), Red-C (+7.5%, +6.76%), Red-R (+7.89%, +7.3%). The TFIDF and Okapi models have PRF involved; the Indri model does not. The best MAP of each retrieval model is marked bold for both collections.

From Table 4, the MAP difference between the two baseline methods is small; this might be because some nouns are still noisy for IR. The four generation and reduction methods

significantly outperform the baseline methods. We improve the baseline methods by 5.60% to 11.9% in the cross-validation runs and on the NTCIR-5 data. This result shows the robustness and reliability of the proposed algorithms. Furthermore, all the methods show significant improvements when applied to certain retrieval models, such as Indri and TFIDF; the performance gain with the Okapi model is less significant on the NTCIR-5 data, especially when the Reduction algorithm is called for. The regression methods generally achieve better MAP than the classification methods, because the regression methods always select the most informative terms or drop the most ineffective terms among those not yet selected. The encouraging evaluation results show that, despite the additional cost of iterative processing, the proposed algorithms are effective across different benchmark collections, and, given a query term space T, they are capable of suggesting better ways to form a query.

[4] proposed a method for selecting good expansion terms (GET) based on an SVM classifier. Our approach is also applicable to the selection of query expansion terms. Given the same set of candidate expansion terms, generated by conventional approaches such as TF and IDF, GET-C runs the Gen-C method whereas GET-R runs Gen-R on the expansion set (with the NTCIR-4 5-fold cross-validation regression model). Table 5 shows the MAP results of the two methods and the baseline method (BL), which adds all expansion terms to the original queries. From Table 5, GET-R outperforms GET-C under different retrieval models and datasets, and both methods improve MAP by 1.76% to 3.44% compared to the baseline. Moreover, although extra terms are introduced for query formulation, we can see that certain MAP results in Table 4 still outperform those in Table 5 (marked italic). It can therefore be inferred that it is still important to filter out noisy terms in the original query even when good expansion terms are selected. Finally, note that we use the NTCIR-4 5-fold cross-validation regression model, which is trained to fit the target performance gain on the NTCIR-4 dataset rather than on instances from the query expansion term set. Nevertheless, the results in Table 5 show that this model works satisfactorily in the selection of good expansion terms, which ensures that our approach is robust in different environments and applications such as query expansion.

Table 5. MAP of query expansion based on GET-C and GET-R; rows cover BL, GET-C, and GET-R for the Indri, TFIDF, Okapi models and their average on both collections. (%) shows the improvement percentage in MAP over BL; the significance test is against the baseline results. On NTCIR-4, GET-C reaches 0.280** on TFIDF and GET-R 0.260** on Indri; on NTCIR-5, GET-R reaches 0.1880*, 0.198*, and 0.1945* on Indri, TFIDF, and Okapi, respectively.

We further investigate the impact of various ranking schemes on our proposed algorithms. The ranking scheme in the Generation algorithm (or the Reduction algorithm) refers to the internal ranking mechanism that decides which term shall be included (or discarded). Three types of ranking schemes are tested based on our regression function r: max-order always returns the term that is most likely to contribute relevance to the query topic, min-order returns the term that is most likely to bring in noise, and random-order returns a randomly-chosen term. Figure 4 shows the MAP curve for each scheme, drawn by connecting the dots at (1, MAP(1)), …, (n, MAP(n)), where MAP(i) is the MAP obtained at iteration i. It shows that the performance curves in the generation process share an interesting tendency: the curves keep going up in the first few iterations, and after the maximum (local to each method) is reached, they begin to go down rapidly. These findings might informally establish the validity of our assumption that a longer query topic encompasses more information but also more noise. The same pattern does not look so obvious in the reduction process; however, if we take the derivative of the curve at each iteration (i.e., the performance gain/loss ratio), we find that it resembles the pattern we have discovered. We may also find that, in the generation process, different ranking schemes come with varying performance: max-order consistently provides the largest performance boost, as opposed to the other two schemes, while in the reduction process, max-order offers the most drastic performance drop of the three schemes. Generally, in the generation process, the best MAP value for each setting might take place somewhere between iteration n/2 and 2n/3, given that n is the size of the query topic.

Fig. 4. MAP curves based on the regression model for description queries of NTCIR-4 on the Indri, TFIDF, and Okapi models, each with the three selection orders (max-order, min-order, random). The X coordinate is the number of query terms; the Y coordinate is MAP.

4. Discussions and Conclusions

In this paper, we propose an approach to measuring and predicting the impact of query terms, based on linguistic, co-occurrence, and contextual features that are analyzed through their correlation with MAP. Experimental results show that our query formulation approach significantly improves retrieval performance. The proposed method is robust, and the experimental results are consistent across different retrieval models and document collections. In addition, an important aspect of this paper is that we are able to capture certain characteristics of query terms that are highly effective for IR. Aside from the intuitive observations that informative terms are often lengthy and tagged as nouns in their POS category, we have found that statistical features are more likely than linguistic ones to decide the effectiveness of query terms. We also observe that context features are the most correlated with MAP and thus the most powerful for term difficulty prediction; however, such post-retrieval features require much higher cost than pre-retrieval features, in terms of time and space.

The proposed approach selects a locally optimal query term during each iteration of generation or reduction. The reason for this greedy strategy is that it is inappropriate to exhaustively enumerate all sub-queries for online applications such as search engines. Further, it is challenging to automatically determine the value of the parameter k in our algorithms, which is selected to optimize the MAP of each query topic. Also, when applying our approach to web applications, we need a web corpus to calculate the statistical features for training models.

5. References

[1] Allan, J., Callan, J., Croft, W. B., Ballesteros, L., Broglio, J., Xu, J., Shu, H.: INQUERY at TREC-5. In: Fifth Text REtrieval Conference (TREC-5) (1997)
[2] Amati, G., Carpineto, C., Romano, G.: Query Difficulty, Robustness, and Selective Application of Query Expansion. In: 26th European Conference on IR Research, UK (2004)
[3] Bendersky, M., Croft, W. B.: Discovering Key Concepts in Verbose Queries. In: 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008)

[4] Cao, G., Nie, J. Y., Gao, J. F., Robertson, S.: Selecting Good Expansion Terms for Pseudo-Relevance Feedback. In: 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008)
[5] Carmel, D., Yom-Tov, E., Soboroff, I.: SIGIR Workshop Report: Predicting Query Difficulty - Methods and Applications. SIGIR Workshop Session (2005)
[6] Carmel, D., Yom-Tov, E., Darlow, A., Pelleg, D.: What Makes a Query Difficult? In: 29th Annual International ACM SIGIR Conference (2006)
[7] Carmel, D., Farchi, E., Petruschka, Y., Soffer, A.: Automatic Query Refinement Using Lexical Affinities with Maximal Information Gain. In: 25th Annual International ACM SIGIR Conference (2002)
[8] Chang, C. C., Lin, C. J.: LIBSVM: a library for support vector machines (2001)
[9] He, B., Ounis, I.: Inferring Query Performance Using Pre-retrieval Predictors. In: 11th International Conference on String Processing and Information Retrieval (2004)
[10] Jones, R., Fain, D. C.: Query Word Deletion Prediction. In: 26th Annual International ACM SIGIR Conference (2003)
[11] Kumaran, G., Allan, J.: Effective and Efficient User Interaction for Long Queries. In: 31st Annual International ACM SIGIR Conference (2008)
[12] Kumaran, G., Allan, J.: Adapting Information Retrieval Systems to User Queries. Information Processing and Management (2008)
[13] Kwok, K. L.: A New Method of Weighting Query Terms for Ad-hoc Retrieval. In: 19th Annual International ACM SIGIR Conference (1996)
[14] Lioma, C., Ounis, I.: Examining the Content Load of Part of Speech Blocks for Information Retrieval. In: COLING/ACL 2006 Main Conference Poster Sessions (2006)
[15] Mandl, T., Womser-Hacker, C.: Linguistic and Statistical Analysis of the CLEF Topics. In: Third Workshop of the Cross-Language Evaluation Forum CLEF (2002)
[16] Mothe, J., Tanguy, L.: ACM SIGIR 2005 Workshop on Predicting Query Difficulty - Methods and Applications (2005)
[17] Vapnik, V. N.: Statistical Learning Theory. John Wiley & Sons (1998)
[18] Yom-Tov, E., Fine, S., Carmel, D., Darlow, A., Amitay, E.: Juru at TREC 2004: Experiments with Prediction of Query Difficulty. In: 13th Text REtrieval Conference (2004)
[19] Zhou, Y., Croft, W. B.: Query Performance Prediction in Web Search Environments. In: 30th Annual International ACM SIGIR Conference (2007)
[20] Zhou, Y., Croft, W. B.: Ranking Robustness: A Novel Framework to Predict Query Performance. In: 15th ACM International Conference on Information and Knowledge Management (2006)
[21] The Lemur Toolkit


The growing volume of patent documents processed by offices such as the European Patent Office (EPO) motivates the construction of bilingual patent resources for cross-language retrieval and translation. Approaches to cross-language text retrieval are commonly classified as dictionary-based, thesaurus-based, knowledge-based, or corpus-based [7]; corpus-based methods build on statistical natural language processing techniques [6]. This work follows the corpus-based approach and constructs an English-Chinese parallel corpus.

Several preprocessing issues arise when aligning English and Chinese text. Multi-word expressions such as "of course" in "Of course, I will help you." should be handled as units rather than word by word, and lemmatization maps inflected forms such as "ate" to "eat". Paragraph alignment is carried out first; the English side is processed with StanfordLexParser-1.6, and HowNet supplies bilingual lexical knowledge. Ma [5] proposed Champollion in 2006, which aligns a source sentence with a target sentence by scoring lexical matches; cosine similarity between sentence vectors has also been used for this purpose [6].

2. Related Work

Sentence alignment maps sentences of a source language onto sentences of a target language. Champollion [5] weights matched word pairs with a tf-idf style scheme [8]. In 2007, Utiyama and Isahara [9] constructed a large Japanese-English parallel corpus from patent retrieval data used in NTCIR, exploiting the International Patent Classification (IPC) to pair comparable documents.


4.2 Bilingual dictionaries

HowNet and the Dr.eye dictionary [12] are used to obtain translation candidates [14]. For example, the dictionary entries retrieved for the term "chloride" include:

=============================
: chloride shift
=============================
: chloride silver
=============================
: chloride stre
=============================
: chloride stre corrosion crack

Chinese entries are segmented into words before lookup, using HowNet [13] together with standard Chinese word segmentation methods [14][20].

4.3 Parsing and lemmatization with StanfordLexParser-1.6

The English sentences are parsed with StanfordLexParser-1.6 to obtain parse trees, and the parser output is also used for lemmatization, so that English tokens can be matched against dictionary headwords. StanfordLexParser-1.6 is implemented in Java. For example, for the sentence

Jim always plays baseball with his friends.

the parser produces

(S (NP (NNP Jim)) (ADVP (RB always)) (VP (VBZ plays) (NP (NN baseball)) (PP (IN with) (NP (PRP$ his) (NNS friends)))) (. .))

which is lemmatized to

Jim always play baseball with he friend.

mapping "plays" to "play" and "his friends" to "he friend". The lemmatized English tokens are then passed to the Champollion-style alignment stage [5].

4.4 Lexical matching

Multi-word translation candidates such as "chip connection" may appear in the target text in inflected or partially overlapping forms, so candidate terms are matched against the text using the longest common subsequence.

4.4.2 Similarity weighting

Champollion [5] borrows the tf-idf (term frequency - inverse document frequency) weighting from information retrieval [8]. With N documents, n(w) the frequency of word w, and nd(w) the number of documents containing w, the weights are

tf(w) = n(w) / Σ_w' n(w')   (1)
idf(w) = log( N / nd(w) )   (2)
tf-idf(w) = tf(w) × idf(w)   (3)

For sentence alignment, Ma [5] replaces tf with stf, the segment-wide term frequency of an English word e within the aligned segment pair (e, c), Eq. (4), and replaces df with idtf, the inverse document term frequency, Eq. (5), where T is the total number of words in the document; frequent words such as "really" thereby receive low weights, while more specific terms such as "really good" receive higher ones:

stf(e, c) = frequency of e within the segment pair (e, c)   (4)
idtf(e) = T / n(e)   (5)

The combined stf-idtf weight of e is, following [5],

stf-idtf(e, c) = stf(e, c) × idtf(e)   (6)

and a log-scaled form is used in practice,

w(e, c) = log( stf-idtf(e, c) )   (7)

Each segment is represented by a k-dimensional vector of such weights (zero for absent terms), and the similarity of two segments is the cosine of their weight vectors:

cos(a, b) = Σ_i a_i b_i / ( sqrt(Σ_i a_i^2) × sqrt(Σ_i b_i^2) )   (8)
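The weighting and similarity of Eqs. (1)-(3) and (8) can be illustrated compactly; the following sketch assumes a precomputed document-frequency table df and corpus size n_docs (both names are ours, not the paper's):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """tf-idf weights as in Eqs. (1)-(3): tf(w) = n(w)/sum n(w'),
    idf(w) = log(N/nd(w)), tf-idf(w) = tf(w) * idf(w)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (c / total) * math.log(n_docs / df.get(w, 1))
            for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two sparse weight vectors, Eq. (8)."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```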

A segment pair whose cosine in Eq. (8) falls to 0 contributes nothing to the alignment score. This paper further modifies the weighting: tf is replaced by ctf, Eq. (9), where SO(w) is the score assigned to word w and L is the segment length, and df is replaced by dcf, Eq. (10), where Sn is the number of segments and n(w) the frequency of w; their product gives the ctf-dcf weight, Eq. (11):

ctf(w) = SO(w) / L   (9)
dcf(w) = log( Sn / n(w) )   (10)
ctf-dcf(w) = ctf(w) × dcf(w)   (11)

A practical difference from tf-idf and the stf-idtf weighting of [5] is the treatment of stop words: under ctf-dcf, stop words naturally receive weights near 0, so no hand-crafted stop word list is required. For example, for the segment pair

E = { [Increasing] [LED] [directionality] [makes] [LEDs] [more] [attractive] [for certain] [applications] [such as] [projectors] }
C = { [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] }

the ctf-dcf weights of the target-side tokens are

ctf-dcf = { 0, 0, 0, 0, 0.3, 0.3, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8 }

while the corresponding tf-idf style weights are

{ 0.0, 0.06, 0.09, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3 }

Tokens with a ctf-dcf weight of 0 on either side are skipped when the English segment E is matched against the Chinese segment C over their vocabularies V_E and V_C. Segment pairs scored this way are combined into a global alignment by dynamic programming, following [5]. Let S(i, j) be the best score of aligning the first i source sentences with the first j target sentences, and let s(i, j; a, b) be the similarity score of linking the a source sentences ending at i with the b target sentences ending at j. Allowing 1:1, 1:2, 2:1, 1:3, 3:1, 1:4 and 4:1 links, the recurrence is

S(i, j) = max { S(i-1, j-1) + s(i, j; 1, 1),
                S(i-1, j-2) + s(i, j; 1, 2),
                S(i-2, j-1) + s(i, j; 2, 1),
                S(i-1, j-3) + s(i, j; 1, 3),
                S(i-3, j-1) + s(i, j; 3, 1),
                S(i-1, j-4) + s(i, j; 1, 4),
                S(i-4, j-1) + s(i, j; 4, 1) }   (12)

Following [9], Eq. (13) defines the final score used to rank the extracted sentence pairs.
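A sketch of the recurrence in Eq. (12), with the pair score supplied as a function s(i, j, a, b); only 1:n and n:1 links up to span 4 are enumerated, matching the cases above (the backtracking table is omitted for brevity):

```python
def align_score(n_src, n_tgt, s, max_span=4):
    """Fill the DP table of Eq. (12). s(i, j, a, b) scores linking the a
    source sentences ending at i with the b target sentences ending at j."""
    NEG = float("-inf")
    S = [[NEG] * (n_tgt + 1) for _ in range(n_src + 1)]
    S[0][0] = 0.0
    for i in range(n_src + 1):
        for j in range(n_tgt + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            for a in range(1, max_span + 1):
                for b in range(1, max_span + 1):
                    if a > 1 and b > 1:
                        continue  # only 1:n and n:1 links are allowed
                    if a > i or b > j or S[i - a][j - b] == NEG:
                        continue  # out of range or unreachable predecessor
                    best = max(best, S[i - a][j - b] + s(i, j, a, b))
            S[i][j] = best
    return S
```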


5. Experiments

The evaluation data are drawn from the Trends in International Mathematics and Science Study (TIMSS) 2003 test items [8]: a mathematics set (M-2003), a science set (S-2003), and their union (MS-2003). Alignment quality is measured by precision and recall. Seven configurations, A through G, are compared; several of them are variants of Chang's method extended with different combinations of the components described above [9], and Champollion [5] is included as a reference aligner [8]. The TIMSS 2003 items were preprocessed with StanfordLexParser as described above.

The distribution of alignment outcomes for systems A-G (%; each column sums to 100):

       A     B      C      D      E      F      G
       -     20.6   3      3.3    4.3    1.8    0.5
       -     58.4   70     76.1   81.6   82.5   69.4
       -     11.1   23.3   9.6    11.3   4.3    4.6
       -     5.2    2      10     2.1    11.4   2.8
       -     0.7    0.7    0      0      0      0.5
       -     4.0    1.0    1.0    0.7    0      2.2

A further comparison of the seven systems with Google output and with Champollion follows in the next section.

Compared with Champollion, which considers alignment links only up to 1:3/3:1, the proposed method also recovers the longer 1:n/n:1 pairs present in the data. Translation quality is further assessed with BLEU and NIST scores on the TIMSS 2003 data, using Google translations as a point of comparison:

                 M-2003          S-2003          MS-2003
                 NIST    BLEU    NIST    BLEU    NIST    BLEU
A                -       -       -       -       -       -
B                -       -       -       -       -       -
C                -       -       -       -       -       -
D                -       -       -       -       -       -
E                -       -       -       -       -       -
F                -       -       -       -       -       -
G                -       -       -       -       -       -
Google           -       -       -       -       -       -

In the BLEU and NIST evaluations, system G, which combines all of the proposed components, compares favorably against the Google reference on the TIMSS 2003 data, and the proposed weighting improves over the stf-idtf weighting of [5].

Acknowledgment: This work was partially supported by the National Science Council under grant NSC -E- -MY2.

References

[1] (Chinese reference)
[2] (Chinese reference)
[3] (Chinese reference)
[4] (Chinese reference)
[5] (Chinese reference)
[6] (Chinese reference)
[7] (Chinese reference)
[8] (Chinese reference on TIMSS)
[9] (Chinese reference)
[10] (Chinese reference)

[11] (Chinese reference)
[12] (Chinese reference)
[13] HowNet. Visited on 8 August 2009.
[14] Y. Liu, Q. Tan and K. X. Shen, Modern Chinese Word Segmentation Specification and Automatic Segmentation Methods for Information Processing (in Chinese), Beijing: Qinghua University and Nanning: Guangxi Science and Technology Press, 1994.
[15] X. Ma, "Champollion: A Robust Parallel Text Sentence Aligner," Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2006.
[16] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
[17] D. W. Oard, "Alternative Approaches for Cross-Language Text Retrieval," Working Notes of the American Association for Artificial Intelligence Spring Symposium on Cross-Language Text and Speech Retrieval, 1997.
[18] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1986.
[19] M. Utiyama and H. Isahara, "A Japanese-English Patent Parallel Corpus," Proceedings of the Eleventh Machine Translation Summit, 2007.
[20] P. K. Wong and C. Chan, "Chinese Word Segmentation based on Maximum Matching and Word Binding Force," Proceedings of the Sixteenth International Conference on Computational Linguistics, 1996.


A Study on Identification of Opinion Holders

{cylee,

Abstract

The identification of opinion holders aims to extract entities that express opinions in opinion sentences. In this paper, the task of opinion holder identification is divided into two subtasks: the identification of the author's opinions and the labeling of opinion holders. A support vector machine is adopted to identify the author's opinions, and a conditional random field model (CRF) is utilized to label opinion holders. The proposed method achieves an F-score in the NTCIR-7 MOAT task at the traditional Chinese side that is the best among the participants who adopted machine learning methods, and this performance is close to the best performance in the task. In addition, the ambiguous markings of opinion holders are analyzed, and the best way to utilize training instances with ambiguous markings is proposed.

Keywords: opinion holder identification, opinion mining, CRF, SVM.

1. Introduction

With the growth of Web 2.0, opinion mining has become an active research topic. Kim and Hovy [1] in 2004 decomposed opinion analysis into determining the opinion polarity, the opinion strength, the opinion holder, and the opinion target.

In some opinion sentences, the holder is an anaphor (for instance a pronoun) whose antecedent must be resolved before the true holder can be extracted (Example 2). Opinion holders can also appear in a nested structure, where one holder's opinion embeds another holder and its opinion (Example 3), and a single sentence may contain several holder mentions separated by delimiters. Pang and Lee [2] give a broad survey of opinion mining and sentiment analysis; early holder-identification work relied on heuristic rules. Yohei Seki et al. [3] identified multilingual opinion holders using author and authority viewpoints. Xu and

Wong [4] combined coarse- and fine-grained opinion mining with prefix and suffix features of candidate holder strings, and evaluated their system in NTCIR-7. Machine learning methods applied to this task include maximum entropy models, the support vector machine algorithm, and conditional random field models. Kim and Hovy [5] extracted opinion holders with opinion words and semantic role labeling. Kim et al. [6] identified opinion holders in online newspaper text, and in [7] used syntactic features to extract topic-related opinions and their targets in NTCIR-7. Wu et al. [8] applied an L2-norm-based transductive learning approach to Chinese sentiment analysis. Breck et al. [9] identified expressions of opinion in context, and Choi et al. [10] combined conditional random fields with extraction patterns; dictionary-based cues and dependency relations are commonly exploited. Meng and Wang [11] detected opinionated sentences by extracting operator and context information, and Liu and Zhao [12] used position and contextual features [13].

3. Proposed Method

(1) Identification of the author's opinions. Deciding whether a sentence expresses the author's own opinion is treated as a binary classification problem, and the LIBSVM implementation of Chang and Lin [14] is adopted. The sentence-level features are:

- fHasI, fHasWe, fNumI, fNumWe: presence and counts of first-person singular/plural pronouns
- fHasPronoun, fHasManPronoun, fNumPronoun, fNumManPronoun: presence and counts of pronouns
- fHasPer, fHasLoc, fHasOrg: presence of person, location, and organization named entities
- fHasNa, fHasNb, fHasNc: presence of words with POS tags Na, Nb, Nc
- fNumLoc, fNumOrg, fNumPer, fNumNa, fNumNb, fNumNc: the corresponding counts
- fHasExclamation (!), fHasQuestion (?), fHasColon (:), fHasLeftQuotation, fHasRightQuotation: punctuation indicators
- fNumChar, fNumWord, fNumSubSen: numbers of characters, words, and sub-sentences
- fOperator: presence of operator words
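A minimal sketch of this classification stage using scikit-learn's SVC in place of the LIBSVM command-line tools (the paper used LIBSVM itself); the feature extractor below implements only a handful of the listed features, with simplified, assumed tokenization:

```python
from sklearn.svm import SVC

def sentence_features(sent):
    """A small subset of the sentence-level features listed above."""
    return [
        int("我" in sent),                       # fHasI: first-person singular
        int("我們" in sent),                     # fHasWe: first-person plural
        int("!" in sent or "!" in sent),         # fHasExclamation
        int("?" in sent or "?" in sent),         # fHasQuestion
        int(":" in sent or ":" in sent),         # fHasColon
        len(sent),                                # fNumChar
        sent.count(",") + sent.count(",") + 1,   # fNumSubSen (rough clause count)
    ]

# y = 1 marks sentences that express the author's own opinion
train_sents = ["我認為這個政策是正確的!", "報導指出天氣將轉晴。"]
y = [1, 0]
clf = SVC(kernel="rbf", gamma="scale")
clf.fit([sentence_features(s) for s in train_sents], y)
print(clf.predict([sentence_features("我們相信結果是可靠的。")]))
```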

(2) Labeling of opinion holders. For each word of an opinion sentence, the following token-level features are used:

- fWord, fPOS: the word itself and its POS tag
- fIsPronoun, fIsNoun, fIsPer, fIsLoc, fIsOrg: word-class and named-entity indicators
- fAfterParen, fBeforeColon (:): position relative to parentheses and colons
- fNearSenStart, fSenLen, fWordOrder, fWordPerc: position of the word within the sentence
- fNearVerb, fNearVerbPOS, fDistNearVerb: the nearest verb, its POS tag, and its distance
- fHasOpKW, fHasPosKW, fHasNegKW, fHasNeuKW: opinion / positive / negative / neutral keyword indicators
- fNearOpKW, fNearPosKW, fNearNegKW, fNearNeuKW: the nearest such keywords
- fNearOpKWPOS, fNearPosKWPOS, fNearNegKWPOS, fNearNeuKWPOS: their POS tags
- fDistOpKW, fDistPosKW, fDistNegKW, fDistNeuKW: their distances

Operator words play an important role in locating holders, since they often govern the opinion expression. Besides the SVM, a decision tree algorithm is tested for the author-opinion subtask: the RapidMiner toolkit of Mierswa et al. [15] provides CHAID, which grows the tree with chi-square tests and prunes it (a pruned decision tree), outputting YES/NO for whether the sentence expresses the author's opinion.

Opinion holder labeling is formulated as a sequential labeling problem and solved with the conditional random field model of Lafferty et al. [16], using Kudo's CRF++ toolkit [17]. To exploit additional data, the co-training idea of Blum and Mitchell [18], a semi-supervised learning method, is also considered. Each word of a sentence is tagged with one of five tags, (H), (I), (T), (S), and (O) - the HITSO scheme - from which the holder strings are recovered. The CHAID decision and the CRF labels are then combined: CHAID decides whether the holder is the author (YES/NO), and when CHAID answers NO, the CRF-labeled in-sentence holders are used.
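As a sketch of the sequential labeling stage, the snippet below uses the sklearn-crfsuite package as a stand-in for CRF++ (which the paper actually used); the feature names echo the token-level list above, and the tiny training set is purely illustrative:

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Token-level features echoing fWord, fPOS, fIsPer, fNearSenStart, ..."""
    word, pos, ne = sent[i]
    return {
        "fWord": word,
        "fPOS": pos,
        "fIsPer": ne == "PER",
        "fIsLoc": ne == "LOC",
        "fIsOrg": ne == "ORG",
        "fNearSenStart": i < 3,
        "fWordOrder": i,
    }

# Each token is (word, POS, NE); labels follow the HITSO scheme.
sents = [[("政府", "Nc", "ORG"), ("表示", "VE", "O"), ("不滿", "VH", "O")]]
labels = [["H", "O", "O"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```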

CRF output is then post-processed with POS-based rules: consecutive nouns (Na, Nb, Nc) are merged into a single holder string; noun sequences connected by a pause mark (PAUSECATEGORY) or a conjunction (Caa) are joined into one holder; and words labeled H, I, or T by the CRF that cannot form a valid holder under these patterns are relabeled O. Example 4 illustrates the corrections.

4. Experiments

4.1 Ambiguous markings. In the training data, some holder markings are ambiguous: the same holder string may be marked differently across annotations (see Examples (1), (3) and (6)). Two strategies, A and B, for using such training instances are therefore compared.

4.2 Data. (1) NTCIR-7: the experiments use the traditional Chinese data of the NTCIR-7 Multilingual Opinion Analysis Task (MOAT). NTCIR (NII Test Collection for IR Systems) is organized by the National Institute of Informatics (NII), and the task is described by Seki et al. [19]. The NTCIR-7 Chinese data include 4,665 sentences, 2,174 of them opinionated. As additional training data, the corpus of the NTCIR-6 Opinion Analysis Pilot Task is used: 29 topics with 9,240 sentences, 5,453 of them opinionated.

(2) Sentiment dictionary: the NTU Sentiment Dictionary (NTUSD) of Ku and Chen [20] supplies the opinion, positive, negative, and neutral keyword lists used in the features.

(3) Evaluation: results are reported on the NTCIR-7 test set; models are trained on NTCIR-6, NTCIR-7, or both.

Performance is measured by precision (P), recall (R), F-score (F), and accuracy. The distributions of author-opinion sentences differ considerably between NTCIR-6 and NTCIR-7, which affects cross-corpus training. Training on NTCIR-6 alone yields an F-score of 73.24%, which rises to 81.04% under a matched setting; training on the combined NTCIR-6 and NTCIR-7 data yields 79.98% on the NTCIR-7 test set. Detailed results of the three training settings:

Training data           Precision   Recall    F-score   Accuracy
NTCIR-6 + NTCIR-7       69.68%      93.85%    79.98%    83.49%
-                       64.87%      95.94%    77.40%    80.31%
-                       50.52%      91.53%    65.10%    77.28%

The three settings reach F-scores of 79.98%, 77.40%, and 65.10%, respectively, so the model trained on both corpora is used in the following experiments.

(2) Holder labeling is evaluated with the set F-score of NTCIR-7: for each sentence, the extracted holder set is compared with the annotated set, and with set precision setP and set recall setR the score is

setF = 2 × setP × setR / (setP + setR)   (1)

Three configurations are compared: CHAID alone, CRF alone, and the combined CRF+CHAID. The CRF alone achieves a set F-score of 69.89%, and adding CHAID improves the score by 2.73%: the CHAID decision on author-held opinions complements the CRF's H-tagged outputs, so the combined CRF+CHAID system performs best.
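A worked version of Eq. (1) over hypothetical extracted and annotated holder sets (set precision and recall are computed from the intersection):

```python
def set_f_score(extracted, annotated):
    """Set F-score of Eq. (1): harmonic mean of set precision and recall."""
    extracted, annotated = set(extracted), set(annotated)
    hit = len(extracted & annotated)
    if hit == 0:
        return 0.0
    p = hit / len(extracted)
    r = hit / len(annotated)
    return 2 * p * r / (p + r)

# e.g. one correct holder out of two extracted, two annotated:
print(set_f_score({"記者", "政府"}, {"政府", "民眾"}))  # P = 0.5, R = 0.5, F = 0.5
```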

An error analysis of the combined system (set F-score 67.83% in this configuration, 254 error instances) classifies the errors into several types, one of which accounts for 8.1%. Simplifying the tag set from HITSO to HIO raises the set F-score to 70.57%, and adding the post-processing rules of Section 3 raises it further to 72.03%, a 1.46% gain. For the ambiguous markings, strategy B reaches a set F-score of 73.40%, exceeding strategy A by 2.48%.

Set F-scores for strategies A and B were compared under both tag sets, and strategy B is consistently better.

(3) Comparison with NTCIR-7 participants. On the NTCIR-7 formal evaluation, our method reaches a lenient set F-score of 73.40% against 82.30% for the best run, with reported differences of 8.90%, 5.09%, and 5.6% across the evaluation settings. The lenient and strict results of the runs under the two settings (upper and lower blocks) are:

Lenient P   Lenient R   Lenient F   Strict P   Strict R   Strict F
82.30%      82.30%      82.30%      19.88%     49.52%     28.38%
57.84%      57.84%      57.84%      13.03%     40.53%     19.72%
54.9%       54.9%       54.9%       8.22%      52.95%     14.23%
48.6%       48.6%       48.6%       8.14%      44.90%     13.78%
73.40%      73.40%      73.40%      12.38%     68.31%     20.97%   (proposed)

82.54%      82.54%      82.54%      29.92%     43.05%     35.31%
58.72%      58.72%      58.72%      20.51%     36.84%     26.35%
56.47%      56.47%      56.47%      16.78%     40.02%     23.65%
50.3%       50.3%       50.3%       14.43%     53.73%     22.75%
70.40%      70.40%      70.40%      19.80%     63.11%     30.15%   (proposed)

Our system thus ranks among the best NTCIR-7 MOAT participants on the traditional Chinese side in terms of set F-score.

5. Conclusions

This paper divided opinion holder identification into author-opinion identification and opinion holder labeling, handled with an SVM and a CRF respectively, analyzed the ambiguous holder markings in the training data, and showed how best to use such instances. The approach achieves results close to the best NTCIR-7 MOAT performance while relying purely on machine learning.

References

[1] S. M. Kim and E. Hovy. Determining the sentiment of opinions. Proceedings of the COLING Conference, 2004.
[2] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, Vol. 2, pp. 1-135, 2008.
[3] Y. Seki, N. Kando and M. Aono. Multilingual opinion holder identification using author and authority viewpoints. Journal of Information Processing and Management, 2009.
[4] R. Xu and K. F. Wong. Coarse-fine opinion mining - WIA in NTCIR-7 MOAT task. Proceedings of the Seventh NTCIR Workshop, 2008.
[5] S. M. Kim and E. Hovy. Extracting opinions, opinion holders, and topics expressed in online news media text. Proceedings of the Workshop on Sentiment and Subjectivity in Text at the joint COLING-ACL Conference, pp. 1-8, 2006.
[6] Y. Kim, Y. Jung and S. H. Myaeng. Identifying opinion holders in opinion text from online newspapers. International Conference on Granular Computing, 2007.

[7] Y. Kim, S. Kim and S. H. Myaeng. Extracting topic-related opinions and their targets in NTCIR-7. Proceedings of the Seventh NTCIR Workshop, 2008.
[8] Y. C. Wu, L. W. Yang, J. Y. Shen, L. Y. Chen and S. T. Wu. Tornado in multilingual opinion analysis: a transductive learning approach for Chinese sentimental polarity recognition. Proceedings of the Seventh NTCIR Workshop, 2008.
[9] E. Breck, Y. Choi and C. Cardie. Identifying expressions of opinion in context. Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[10] Y. Choi, C. Cardie, E. Riloff and S. Patwardhan. Identifying sources of opinions with conditional random fields and extraction patterns. Proceedings of the EMNLP Conference, 2005.
[11] X. Meng and H. Wang. Detecting opinionated sentences by extracting context information. Proceedings of the Seventh NTCIR Workshop, 2008.
[12] K. Liu and J. Zhao. NLPR at Multilingual Opinion Analysis Task in NTCIR7. Proceedings of the Seventh NTCIR Workshop, 2008.
[13] (Chinese reference), 2008.
[14] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001.
[15] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz and T. Euler. YALE: rapid prototyping for complex data mining tasks. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[16] J. Lafferty, A. McCallum and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the International Conference on Machine Learning, 2001.
[17] T. Kudo. CRF++: yet another CRF toolkit.
[18] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. Conference on Computational Learning Theory, 1998.
[19] Y. Seki, D. K. Evans, L. W. Ku, L. Sun, H. H. Chen and N. Kando. Overview of multilingual opinion analysis task at NTCIR-7. Proceedings of the Seventh NTCIR Workshop, 2008.
[20] L. W. Ku and H. H. Chen. Mining opinions from the web: beyond relevance retrieval. Journal of the American Society for Information Science and Technology, 2007.

Tonal effects on voice onset time: Stops in Mandarin and Hakka

Ju-Feng Peng, Li-mei Chen, Yi-Yun Lin
Department of Foreign Languages & Literature
National Cheng Kung University
leemay@mail.ncku.edu.tw

Abstract

This study examines the influence of lexical tone upon voice onset time (VOT) in Mandarin and Hakka. Examination of VOT values for Mandarin and Hakka word-initial stops /p, t, k, p h, t h, k h / followed by the three vowels /i, u, a/ in different lexical tones revealed that lexical tone has a significant influence on the VOTs. The result is important because it suggests that future studies should take this influence into account when studying VOT values for stops in tonal languages. In Mandarin, stops' VOTs, ordered from the longest to the shortest, are in Tone 2, Tone 3, Tone 1, and Tone 4; this sequence is the same as in the results of Liu, Ng, Wan, Wang, and Zhang (2008) [1]. However, it was later found that the sequence results from the existence of non-words: in order to produce non-words correctly, participants tended to pronounce them at a lower speed, especially those in Tone 2. We therefore further examined the data without non-words, in which no clear sequence was found. For Hakka, post hoc tests (Scheffe) show that aspirated stops in Tones 4 and 8 have significantly shorter VOT values than in other tones.

Keywords: voice onset time, Mandarin tones, Hakka stops, Mandarin stops.

1. Introduction

The aim of this paper is to explore whether lexical tones influence the VOT values of word-initial stops. This issue is important because VOT is considered a reliable phonetic feature for differentiating stop consonants ([2], [3], [4], [5], [6], [7]), and recently it has been used to study the language production of patients with language deficits or disorders ([8], [9]). Among the languages investigated, some are tone languages, e.g. Mandarin, Cantonese, and Taiwanese. In a tonal language, the duration of each lexical tone is slightly different; consequently, it is possible that lexical tone affects a stop's voice onset time. However, few studies have taken this factor into consideration when studying tone languages. It is hoped that, with data from Mandarin and Hakka, we can establish the groundwork for future studies of VOT in tonal languages. If lexical tone does have an influence on VOT, it should be taken into account when creating stimulus words for tonal languages in future studies, thereby rendering those studies more valid and reliable.

1.1 Voice onset time

Lisker and Abramson (1964) [2] defined voice onset time (VOT) as the

temporal interval from the release of an initial stop to the onset of glottal pulsing for a following vowel. It has been considered a reliable phonetic cue for categorizing stop consonants, i.e. voiced vs. voiceless or unaspirated vs. aspirated, in various languages ([2], [3], [4], [5], [6], [7], [10]). Additionally, by comparing VOT values for stops produced by native and non-native speakers of specific languages, researchers have provided suggestions for language learning and teaching ([6], [11], [12]). Moreover, researchers have recently studied the production deficits of aphasia, apraxia and stuttering patients by observing their VOT values for stops ([8], [9]).

1.2 Factors affecting voice onset time

When investigating stops, researchers found that the VOT values for stops vary in relation to the place of articulation. Cho and Ladefoged (1999) [4], sorting out previous findings, claimed that the further back the closure, the longer the VOT ([2], [4], [6], [13]). That is, velar stops have the longest VOT values, alveolar stops intermediate values, and bilabial stops the shortest values. However, there are some exceptions: alveolar stops in Tamil, Cantonese, Eastern Armenian, Hungarian, Japanese, and Mandarin have shorter VOTs than bilabial stops ([2], [3], [5], [7], [12], [14]).

Liu et al. (2008) [1] speculated that VOT durations may be affected by tone, because different tones have different fundamental frequencies and pitch levels, which are determined mainly by the tension of the vibrating structure. To achieve different levels of tension, different amounts of time might be needed; consequently, VOT values may vary across lexical tones. Only a few studies have tried to examine whether lexical tone influences VOT values. For example, Liu et al. (2008) [1] studied the effect of tonal changes on VOTs between normal laryngeal and superior esophageal speakers of Mandarin Chinese, and reported that for normal laryngeal speakers there are significant differences in VOT values caused by lexical tones; in addition, stops in Tone 4 have significantly shorter mean VOT values than stops in Tones 2 and 3. The study by Liu et al. [1] is a pioneering piece of work in this field, but more evidence is still needed. Therefore, by carrying out a systematic study of the influence of lexical tone on stops' VOT in two tonal languages, Mandarin and Hakka, we try to verify the previous findings and provide references for future linguistic studies of tonal languages.

1.3 The features of Mandarin and Hakka

Mandarin Chinese and Hakka are tonal languages, in which a word's meaning can be changed by the tone in which it is pronounced. Chao (1967) [15] suggested a numerical notation for lexical tones, dividing a speaker's pitch range into four equal intervals by five points: 1 low, 2 half-low, 3 middle, 4 half-high, and 5 high. The numerical notation indicates how the pitch of a lexical tone changes; for example, the numerical notation for Tone 2 in Mandarin is 35, which represents a pitch going from middle to high. Table 1 gives the numerical notation for each lexical tone in Mandarin and Hakka. In Mandarin, there are four contrasting lexical tones: Tone 1 (high level), Tone 2 (mid-rising), Tone 3 (falling-rising), and Tone 4 (high-falling). Sixian Hakka has six contrasting lexical tones: Tone 1 (24), Tone 2 (31), Tone 3 (55), Tone 4 (32), Tone 5 (11), and Tone 8 (55). The pitch values of Tone 3 and Tone 7 are the same, so Tone 7 is omitted. Although there are regional differences within Hakka, Sixian Hakka was chosen as it is the most widely used Hakka dialect in Taiwan.

113 Table. The numercal notatons for lexcal tones n Mandarn [5] and Hakka [6]. Lexcal Tone Numercal Mandarn notaton Hakka (55) 55 Note: Those whch are underlned represent ptches that are short and rapd. Mandarn Chnese and Hakka have ther specfc tone sandh rules and one example from each language s lsted below. In Mandarn, Tone 3, whch has the longest duraton among the four lexcal tones, wll become Tone 2 whle t s followed by another Tone 3 [7]. The tone sandh rule for Sxan Hakka s as follows: Tone wll become Tone 5, when t precedes Tone, Tone 3, or Tone 8 [8]. Therefore, tone sandh rules are taken nto consderaton when makng stmulus words, and the combnatons that mght cause tonal change wll be avoded. Mandarn Tone 3 Tone 2 / Sxan Hakka Tone Tone 5 / Tone 3 Tone, Tone 3, Tone 8 2. Methodology Mandarn and Hakka word-ntal stops, unasprated /p, t, k/ and asprated /p h, t h, k h /, n combnaton wth three vowels /, u, a/ were studed. Except for partcpants and stmulus words, the methodology employed for both languages was the same. 2. Partcpants Mandarn and Hakka partcpants were dfferent. For Mandarn, there were ffteen male and ffteen female natve speakers recruted from college students and staff from an elementary school n Tanan Cty. All the partcpants grew up n Tawan, wth no hearng and speech defects. Ther ages ranged from 23 to 33 years (mean = 27.2 years). As for the Hakka, there were twenty-one partcpants, eleven men and ten women, from Maol, Pngtung, and Taoyuan County. Ther average age was ffty-one, the oldest beng eghty, and the youngest thrty-sx. As t was not easy to fnd fluent Hakka speakers ther age range was qute wde. 2.2 Stmul and procedure The speech stmul n both language were combnaton of sx stops /p, t, k, p h, t h, k h / and three vowels /, u, a/,.e., 8 combnatons. They were /p/, /pu/, /pa/, /t/, /tu/, /ta/, /k/, /ku/, /ka/, /p h /, /p h u/, /p h a/, /t h /, /t h u/, /t h a/, /k h /, /k h u/, and /k h a/. For Mandarn there were four contrastng lexcal tones, thus 72 monosyllabc words were created n total. Among them, 8 combnatons do not exst n Mandarn. As for Hakka, there were sx contrastng lexcal tones n Sxan Hakka, hence there were 08 monosyllabc words obtaned. Among these stmulus words, 2 words do not actually exst n Hakka. Chen et al. (2007) [4] has clamed that dsyllabc words can create a more natural-lke context for partcpants. Therefore, n order to make speakers produce the words more naturally, all the stmulus words were followed by another word and would become meanngful dsyllables. For example, Mandarn word, /p/, was followed by another word, /p h uo/ to become the exstng dsyllable, /p p h uo/ 7

(force). Some stimulus words in Hakka were tri-syllabic, because no meaningful disyllables could be found. The stimulus words were arranged randomly, and the participants were asked to read them out loud at a normal speed. After finishing, the participants were asked to read the words a second time, so two groups of data were gathered from each participant. All speech was recorded with a 24-bit WAV recorder connected to an AKG head-worn cardioid condenser vocal microphone positioned approximately 10.5 cm from the participant's mouth in a quiet room.

2.3 Data measurement and analysis

After recording, the data were edited into individual files and analyzed using the Praat software. VOT, measured in milliseconds (ms), was obtained by measuring the temporal interval between the beginning of the release burst and the onset of the following vowel, as shown in Figure 1. The values of both the waveform and the spectrogram were recorded, but the VOTs were determined primarily through waveform analysis, with the values in the spectrogram provided as references. If the value in the waveform differed from the value in the spectrogram by more than five milliseconds, the data were re-measured to verify accuracy.

Figure 1. The spectrogram and waveform for the Mandarin word /pu iau/ 'don't want'. The values in the circle are the starting and end points of the VOT in the spectrogram.

When analyzing the data, the VOT values of mispronounced words were excluded, and the data for Hakka /pi/ in Tone 8 were not analyzed because of wrong word choice. An ANOVA test was used to examine whether lexical tone has a significant influence on the stops' voice onset time. In addition, the differences between the

examined targets were analyzed with post hoc tests (Scheffe). The measurements of the stops' VOT values were made by the same investigator. Furthermore, a randomly selected 10% of each recording was re-measured by another investigator to verify the reliability of the results: 7 Mandarin words and 11 Hakka words from each recording were re-measured. Inter-rater reliability was then examined by Pearson's product-moment correlations.

3. Results

Pearson's product-moment correlations indicated high inter-rater agreement for both the Mandarin and Hakka data (Mandarin: r = .995, p < .001; Hakka: r = .978, p < .001), indicating that the measurements were reliable throughout. It was found that the mean VOTs of Mandarin stops become longer due to the existence of non-words; the data excluding non-words were therefore examined as well to verify the results. For Hakka, there is no comparable difference, because most of the non-words were pronounced incorrectly and hence most Hakka non-word values were not included in the analysis.

3.1 Lexical tone and VOT in Mandarin

The mean VOT values and standard deviations of Mandarin stops in each lexical tone are shown in Table 2. ANOVA tests reveal that lexical tone has a significant influence on the VOTs of stops (F(3,1040) = 2.68, p < .05 for unaspirated stops; F(3,1040) = 8.934, p < .001 for aspirated stops). When the data with non-words are examined, both unaspirated and aspirated stops in Tone 2 have the longest mean VOTs and stops in Tone 4 the shortest; the VOT values, ordered from the longest to the shortest, are in Tone 2, Tone 3, Tone 1, and Tone 4. Post hoc tests revealed that aspirated stops in Tone 4 have significantly shorter mean VOTs than stops in Tone 2 and Tone 3 (p < .05).

Table 2. Mandarin stops' mean VOT values in individual lexical tones. All measurements are in milliseconds (ms); standard deviations in parentheses.

         With non-words                      Without non-words
         Unaspirated     Aspirated           Unaspirated      Aspirated
Tone 1   - (11.90)       - (25.53)           17.7 (9.95)      - (20.4)
Tone 2   - (12.68)       110.02 (30.2)       13.99 (6.03)     - (23.3)
Tone 3   - (13.35)       - (27.75)           17.00 (10.98)    - (23.49)
Tone 4   - (9.94)        89.4 (25.72)        16.32 (9.07)     - (24.8)

The results were verified by examining the data without non-words. In Figures 2 and 3, it can be seen that the values for the data without non-words are shorter than those for the data with non-words. They additionally show that unaspirated stops in Tone 1 have the longest mean VOT values and unaspirated stops in Tone 2 the shortest, while aspirated stops in Tone 3 have the longest mean VOTs and those in Tone 4 the shortest. ANOVA tests still indicate that lexical tone has a significant influence on the VOT values of stops (F(3,1692) = 4.800, p < .01 for unaspirated stops; F(3,1779) = 2.953, p < .05 for aspirated stops). Furthermore, post hoc tests show that unaspirated stops in Tone 2 have significantly shorter mean VOTs than stops in Tone 1

and Tone 3 (p < .05), and aspirated stops in Tone 4 have significantly shorter mean VOTs than stops in Tone 3 (p < .05).

Figure 2. The mean VOTs for Mandarin unaspirated stops in individual lexical tones (with vs. without non-words); X axis: Tone 1 to Tone 4, Y axis: VOT (ms).

Figure 3. The mean VOTs for Mandarin aspirated stops in individual lexical tones (with vs. without non-words); X axis: Tone 1 to Tone 4, Y axis: VOT (ms).

3.2 Lexical tone and VOT in Hakka

The mean VOT values and standard deviations for Hakka stops in each lexical tone are shown in Table 3. ANOVA tests show that lexical tone has a significant influence on the stops' VOTs (F(5,1943) = 3.52, p < .01 for unaspirated stops; F(5,1900) = 37.365, p < .001 for aspirated stops). As shown in Figures 4 and 5, unaspirated and aspirated stops in Tone 1 and Tone 5 have longer mean VOTs than stops in the other tones, and the shortest mean VOTs for both unaspirated and aspirated stops are in Tone 8. Post hoc tests revealed that aspirated stops in Tone 4 and Tone 8 have significantly shorter mean VOTs than in Tone 1, Tone 2, Tone 3, and Tone 5 (p < .001).

Table 3. Hakka stops' mean VOT values in individual lexical tones. All measurements are in milliseconds (ms); standard deviations in parentheses.

         Unaspirated stops   Aspirated stops
Tone 1   20 (11.56)          - (25.8)
Tone 2   - (8)               - (26.56)
Tone 3   - (11.02)           81.32 (23.73)
Tone 4   - (9.44)            - (18.36)
Tone 5   - (11.43)           - (27.08)
Tone 8   16.1 (7.98)         61.53 (20.36)

Figure 4. The mean VOTs for Hakka unaspirated stops in individual lexical tones (X axis: Tones 1, 2, 3, 4, 5, 8; Y axis: VOT in ms).

Figure 5. The mean VOTs for Hakka aspirated stops in individual lexical tones (X axis: Tones 1, 2, 3, 4, 5, 8; Y axis: VOT in ms).
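The ANOVA step described above generalizes directly to any grouped VOT measurements; a minimal sketch with SciPy and illustrative numbers (not the paper's data) is shown below. SciPy has no built-in Scheffe test, so the post hoc comparisons would need, for example, statsmodels' multiple-comparison utilities.

```python
from scipy import stats

# VOT values (ms) grouped by lexical tone -- illustrative numbers only
vot_by_tone = {
    "Tone 1": [18.2, 16.9, 17.5, 19.1],
    "Tone 2": [14.1, 13.2, 15.0, 13.8],
    "Tone 3": [17.3, 16.5, 18.0, 17.1],
    "Tone 4": [15.9, 16.4, 15.2, 16.8],
}
f_stat, p_value = stats.f_oneway(*vot_by_tone.values())
print(f"one-way ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")
```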

4. Discussion and Conclusion

In the current study, ANOVA tests reveal that lexical tone has a significant influence on the VOT values of Mandarin and Hakka stops. In Mandarin, both unaspirated and aspirated stops have the longest mean VOT values in Tone 2 and the shortest in Tone 4; the VOTs, ordered from the longest to the shortest, are in Tone 2, Tone 3, Tone 1, and Tone 4. This sequence is the same as in Liu et al.'s (2008) [1] results. It is worth noting, however, that in both studies some of the stimulus words are non-words. It was later found that the sequence results from the existence of non-words: in order to produce non-words correctly, participants tended to pronounce them at a lower speed, especially those in Tone 2. We therefore examined the data without non-words, in which no clear sequence was found. In general, ANOVA tests revealed that lexical tones have a significant influence on stops' VOTs, and post hoc tests show that unaspirated stops in Tone 2 have significantly shorter mean VOTs than in Tones 1 and 3, while aspirated stops in Tone 4 have significantly shorter mean VOTs than in Tone 3.

As for the Hakka stops, the existence of non-words does not have a significant impact. Post hoc tests show that aspirated stops in Tones 4 and 8 have significantly shorter VOT values than stops in other tones. Hakka words in Tones 4 and 8 share similar phonetic characteristics: they are short, rapid, and end in a stop. This may explain why Hakka stops in Tones 4 and 8 are shorter than stops in other tones.

The results of this study indicate that lexical tone has a significant influence on VOT. It is therefore suggested that future studies take the effect of lexical tone into consideration when creating stimulus words for tonal languages and analyzing the VOT values of stops, in order to reduce the risk of introducing experimental errors. Exactly how tone affects the VOT values of stops, however, requires further study.

References

[1] H. Liu, M. L. Ng, M. Wan, S. Wang, and Y. Zhang, "The effect of tonal changes on voice onset time in Mandarin esophageal speech," Journal of Voice, 2008, vol. 22, no. 2.
[2] L. Lisker and A. S. Abramson, "A cross-language study of voicing in initial stops: acoustical measurements," Word, 1964, vol. 20.
[3] B. L. Rochet and Y. Fei, "Effect of consonant and vowel context on Mandarin Chinese VOT: production and perception," Canadian Acoustics, 1991, vol. 19, no. 4.
[4] T. Cho and P. Ladefoged, "Variation and universals in VOT: evidence from 18 languages," Journal of Phonetics, 1999, vol. 27.
[5] M. Gósy, "The VOT of the Hungarian voiceless plosives in words and in spontaneous speech," International Journal of Speech Technology, 2001, vol. 4.
[6] X.-R. Zheng and Y.-H. Li, "A contrastive study of VOT of English and Korean stops," Journal of Yanbian University, 2005, vol. 38, no. 4.
[7] T. J. Riney, N. Takagi, K. Ota, and Y. Uchida, "The intermediate degree of VOT in Japanese initial voiceless stops," Journal of Phonetics, 2007, vol. 35.
[8] L. Jäncke, "Variability and duration of voice onset time and phonation in stuttering and nonstuttering adults," Journal of Fluency Disorders, 1994, vol. 19, no. 1.

[9] P. Auzou, C. Ozsancak, R. J. Morris, M. Jan, F. Eustache, and D. Hannequin, "Voice onset time in aphasia, apraxia of speech and dysarthria: a review," Clinical Linguistics & Phonetics, 2000, vol. 14, no. 2.
[10] P. A. Keating, W. Linker, and M. Huffman, "Patterns in allophone distribution for voiced and voiceless stops," Journal of Phonetics, 1983, vol. 11.
[11] T. J. Riney and N. Takagi, "Global foreign accent and voice onset time among Japanese EFL speakers," Language Learning, 1999, vol. 49, no. 2.
[12] S. J. Liao, "Interlanguage production of English stop consonants: a VOT analysis," M.A. thesis, National Kaohsiung Normal University, Kaohsiung, Taiwan.
[13] B. S. Rosner, L. E. López-Bascuas, J. E. García-Albea, and R. P. Fahey, "Voice-onset times for Castilian Spanish initial stops," Journal of Phonetics, 2000, vol. 28.
[14] L.-M. Chen, K.-Y. Chao, and J.-F. Peng, "VOT productions of word-initial stops in Mandarin and English: a cross-language study," Proceedings of the 19th Conference on Computational Linguistics and Speech Processing, 2007.
[15] Y. R. Chao, Mandarin Primer, Cambridge: Harvard University Press.
[16] S.-S. He, "A contrastive study of Taiwan Hakka and Mandarin phonemes," in G.-S. Gu, Ed., Introduction to Taiwan Hakka, Taipei: Wu-Nan Book Inc.
[17] C.-C. Cheng, A Synchronic Phonology of Mandarin Chinese, The Hague: Mouton.
[18] R.-F. Chung, An Introduction to Taiwan Hakka Phonology, Taipei: Wu-Nan Book Inc.
[19] P. Boersma and D. Weenink, Praat: doing phonetics by computer (Version 5.1.2) [computer program].


Latent Prosody Model-Assisted Mandarin Accent Identification

Yuan-Fu Liao, Shuan-Chen Yeh, Ming-Feng Tsai, Wei-Hsiung Ting, and Sen-Chia Chang
Department of Electronic Engineering, National Taipei University of Technology
Advanced Technology Center, Information and Communications Research Laboratories, Industrial Technology Research Institute
yfliao@ntut.edu.tw, chang@itri.org.tw

Abstract

A two-stage latent prosody model-language model (LPM-LM)-based approach is proposed to identify two Mandarin accent types spoken by native speakers in Mainland China and Taiwan. The frontend LPM tokenizes an utterance and jointly models the affections of speaker, tone and prosody state. The backend LM takes the decoded prosody state sequences and builds n-grams to model the prosodic differences of the two accent types. Experimental results on a mixed TRSC and MAT database show that fusing the proposed LPM-LM with an SDC/GMM+PPR-LM+UPR-LM baseline system further reduces the average accent identification error rate from 20.7% to 16.2%. The proposed LPM-LM method is therefore a promising approach.

Keywords: accent recognition, latent prosody model, Mandarin, Taiwan.

1. Introduction

Over the past decades, many approaches have been proposed to deal with language identification (LID) tasks. They try to capture the specific characteristics of different languages, which roughly fall into three categories: the phonetic repertoire, the phonotactics, and the prosody. The mainstream systems (as shown in the NIST language recognition evaluation (LRE) 2007) [1] are usually based on the fusion of multiple acoustic and phonotactic subsystems.

Although LID has been extensively studied, less work has been done on accent identification (AID), especially for native speakers, as with American and Indian English, Mainland China and Taiwan Mandarin, Hindi and Urdu Hindustani, and Caribbean and non-Caribbean Spanish. Compared with LID, AID of native speakers is more challenging because (1) some linguistic knowledge, such as syllable structure, may be of little use since native speakers seldom make such mistakes, and (2) the difference among native speakers is relatively smaller than

that among foreign (non-native) speakers. In other words, the capabilities of the popular acoustic and phonotactic approaches may be limited in this case.

Many approaches have recently been proposed to model the prosodic differences between languages, dialects or accents [2]. Most of them are based on direct modeling of surface prosodic features, i.e., the raw prosodic features. For example, frame-level pitch flux features and GMMs were proposed in [3]; segment-level pitch features were extracted using Legendre polynomials and modeled by an ergodic Markov model in [4]; and supra-segmental prosodic features were captured by n-grams in [5].

Figure 1. The block diagram of the proposed LPM-LM-based Mandarin accent identification system.

Figure 2. The block diagram of the proposed LPM framework (the speaker factor is omitted to simplify the figure).

However, surface prosodic features are often affected by many non-prosodic latent factors, such as channel, speaker, phonetic context, and so on. It is therefore necessary to apply feature normalization methods [6] to alleviate the unwanted affections. To absorb those unwanted affections, a two-stage latent prosody model-language model (LPM-LM)-based approach, shown in Figs. 1 and 2, is proposed in this study. The aim is to discriminate two Mandarin accent types spoken by native speakers in Mainland China and Taiwan.

In this approach, the frontend LPM [7] tokenizes (with the help of automatic speech recognizers (ASRs)) an input utterance into smaller prosodic units (sub-syllables in our case) and artificially introduces latent prosody states to represent the prosodic status of each token. It then jointly models the affections of speaker, tone and prosody state on the surface prosodic features in order to decode a more precise prosody state sequence for the utterance. The backend LM then takes the decoded prosody state sequences and builds an n-gram to model the supra-segmental prosodic characteristics of each accent type.

In more detail, the LPM shown in Fig. 2 (1) adopts a two-level hierarchical structure of speech prosody [8] with prosodic states and state transition probabilities, and (2) describes the joint affections of the latent factors in a state by a variable-parameter probability density function whose parameters vary as a function of the latent-factor-dependent parameters. The purpose is to explain the variance due to speaker, phonetic context and, especially, tone. It is worth noting that (1) the proposed LPM-LM framework is similar to the popular parallel phone recognizer (PPR)-LM approach, with the phone recognizers replaced by automatic prosodic state tokenizers/labelers, and (2) the LPM module can be trained in an unsupervised way, avoiding any human annotation effort.

This paper is organized as follows. Section 2 reviews the LPM framework. Section 3 discusses the application of LPM-LM to Mandarin AID. Section 4 reports the experimental results on a Mainland China and Taiwan Mandarin corpus. Conclusions are given in the last section.

2. Latent Prosody Model of Speech Prosody

Based on the proposed LPM framework shown in Fig. 2, an input training utterance is first tokenized into a sequence of smaller prosodic units (sub-syllables in this case), including voiced and unvoiced segments. For each token, a segment-level prosodic feature vector x_n is extracted (coefficients of the log-pitch and log-energy trajectories and the duration of the segment); the trajectory coefficients are computed with Legendre polynomials from the raw log-pitch and log-energy contours. The speech prosody of an input utterance is thus represented by a sequence of segment-level prosodic feature vectors X = {x_n, n = 1, ..., N}.

To explain the variation of the observed sequence X, several latent factors are introduced, including the speaker s, the tone sequence T = {t_n, n = 1, ..., N} (or major/minor stress in a toneless language) and the prosody state sequence Q = {q_n, n = 1, ..., N} (phonetic context is ignored in this study). The probability of X is defined as

P(X) = Σ_{s,T,Q} P(X | s, T, Q) P(s, T, Q)   (1)

Assuming that each observed x_n depends only on the local prosodic state q_n and tone t_n (and the speaker s), the first term on the right-hand side of Eq. (1) is approximated as

P(X | s, T, Q) ≈ Π_{n=1}^{N} p(x_n | s, t_n, q_n)   (2)

Assuming that the speaker, prosodic state and tone sequences are all independent, and that the probabilities of the speaker s and the tone sequence T are uniform, the last term on the right-hand side of Eq. (1) is approximated as

P(s, T, Q) ∝ p(q_1) Π_{n=2}^{N} p(q_n | q_{n-1})   (3)

Finally, the distribution of the surface prosodic feature vector x_n is modeled by the following linearly additive [9] formulation:

x_n = y_n + β_s + β_{t_n} + β_{q_n}   (4)

where y_n is the prosodic feature vector representing the normalized prosodic contour of the n-th syllable in an utterance, and β_s, β_{t_n} and β_{q_n} are the contributions of the speaker s, the tone t_n and the prosody state q_n, respectively. The normalized contour y_n is approximated by a zero-mean Gaussian distribution N(y; 0, Σ) (where Σ is a diagonal matrix); equivalently, the observed prosodic feature vector x_n is modeled by

p(x_n | s, t_n, q_n) = N(x_n; β_s + β_{t_n} + β_{q_n}, Σ)   (5)

In this way, the likelihood of an utterance given an LPM is

L(X) = Π_{n=1}^{N} p(x_n | s, t_n, q_n) · p(q_1) Π_{n=2}^{N} p(q_n | q_{n-1})   (6)

Moreover, the optimal prosody state sequence Q̂ of an utterance can be automatically labeled using a Viterbi search (with or without tone tags given) that maximizes the likelihood L(X), i.e.,

Q̂ = argmax_Q { Σ_{n=1}^{N} log p(x_n | s, t_n, q_n) + log p(q_1) + Σ_{n=2}^{N} log p(q_n | q_{n-1}) }   (7)

3. LPM-based Mandarin Accent Identification

Mandarin spoken in Taiwan exhibits several major prosodic differences from the Mandarin spoken in Mainland China [10]. In particular, people from Taiwan usually speak more slowly and with a lower voice, and they sound soft and gentle, while Mainlanders have more ups and downs in their intonation, and their voices are higher and faster. These characteristics are likely

attributable, at least in part, to influence from the Southern Fujianese dialect widely spoken throughout Taiwan.

Since there are prosodic differences between Mainlander and Taiwanese Mandarin, an LPM-based accent identification system is built to identify these two Mandarin accent types. The following subsections describe the tokenization front-end, the speaker normalization part, and the training procedure of the proposed LPM-based approach in detail.

3.1. Tokenization front-end

The operation of the tokenization front-end is shown in Fig. 3. It first extracts the raw prosodic contours (log-pitch and log-energy) of an input utterance. The pitch and energy contours are then segmented by an ASR engine, and the output is a sequence of voiced and unvoiced segments.

Figure 3. A typical segmentation result of the tokenization front-end (from top to bottom: spectrum, syllable and sub-syllable segmentations, log-pitch and log-energy contours).

For each voiced segment, six-dimensional prosodic features are extracted, including the coefficients of a 3rd-order Legendre polynomial approximating the log-pitch contour, the log-energy mean, and the duration of the segment. For each unvoiced segment, only the log-energy mean and duration are utilized.
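A minimal sketch of this six-dimensional segment feature with NumPy's Legendre utilities (argument names and shapes are ours): the order-3 fit yields four coefficients, to which the log-energy mean and segment duration are appended.

```python
import numpy as np

def voiced_segment_features(log_f0, log_energy, order=3):
    """Six-dimensional prosodic feature vector of a voiced segment:
    order-3 Legendre coefficients of the log-pitch contour, the
    log-energy mean, and the segment duration (in frames)."""
    t = np.linspace(-1.0, 1.0, len(log_f0))                   # Legendre domain
    coeffs = np.polynomial.legendre.legfit(t, log_f0, order)  # 4 coefficients
    return np.concatenate([coeffs, [np.mean(log_energy), float(len(log_f0))]])

# Example: a rising pitch contour over 20 frames
feat = voiced_segment_features(np.linspace(4.6, 5.0, 20), np.full(20, -2.0))
print(feat.shape)  # (6,)
```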

3.2. LPM training algorithm

To estimate the parameters of the LPM, an unsupervised sequential optimization procedure based on the maximum likelihood criterion is adopted. The procedure alternately decodes the latent prosody state sequences using Eq. (7) and updates the affecting factors (i.e., tone and prosody state) to increase the likelihood in Eq. (6). In more detail, the following steps are executed until convergence; each step updates a subset of the LPM parameters.

Step 0: Initialization
- Derive the initial affecting factors β_s and β_t of speakers and tones by averaging the prosodic feature vectors x_n of each speaker and over the whole training set, respectively.
- Cluster and label the prosody state of each segment by vector quantization (VQ) on the residual feature vectors x_n - β_s - β_{t_n}, and derive the initial prosody state factors β_q.
- Derive the initial covariance matrix Σ.
- Derive the initial prosody state transition probabilities from the statistics of the labeled prosody states.

Step 1: Re-labeling
- Re-label the prosody state sequences of all utterances using Eq. (7).

Step 2: Re-estimation
- Update the affecting factors β_s of speakers, β_t of tones and β_q of prosody states, with all other parameters fixed.
- Update the covariance matrix Σ and the prosody state transition probabilities.

Step 3: Iteration
- Repeat Steps 1 to 2 until the likelihood function of Eq. (6) converges.
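The re-labeling step, Eq. (7), is a Viterbi search over the prosody states under the additive model of Eqs. (4)-(5); a sketch follows, with all parameter names ours (beta_s, beta_t, beta_q are the speaker, tone and state factors, inv_cov the inverse of the shared diagonal covariance):

```python
import numpy as np

def decode_states(X, beta_s, beta_t, beta_q, inv_cov, log_init, log_trans):
    """Viterbi decoding of the latent prosody state sequence, Eq. (7).
    X: (N, D) observed vectors; beta_s: (D,); beta_t: (N, D) tone factors
    per segment; beta_q: (K, D) state factors; log_init: (K,) initial
    log-probabilities; log_trans: (K, K) transition log-probabilities."""
    N, K = len(X), len(beta_q)

    def emit(n, q):  # log p(x_n | s, t_n, q) up to an additive constant
        r = X[n] - beta_s - beta_t[n] - beta_q[q]
        return -0.5 * r @ inv_cov @ r

    delta = np.array([log_init[q] + emit(0, q) for q in range(K)])
    back = np.zeros((N, K), dtype=int)
    for n in range(1, N):
        cand = delta[:, None] + log_trans           # (K_prev, K_next)
        back[n] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.array([emit(n, q) for q in range(K)])
    path = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):                   # backtrack
        path.append(int(back[n][path[-1]]))
    return path[::-1]
```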

4. Experimental Results

4.1. Corpus

To evaluate the proposed LPM approach, two telephone speech corpora were mixed: Mandarin Across Taiwan (MAT) [11], released by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taiwan, and the 500-people telephone reading speech corpus (TRSC) [12], released by the Chinese Corpus Consortium (CCC), China. There are about 4500 (MAT-2000 + MAT-2500) Taiwanese speakers in MAT and 500 Mainlander speakers in TRSC. The mixed corpus was randomly divided into a training, a development and a test set; the speaker and utterance statistics are listed in Table 1. Evaluation is executed utterance by utterance, and the average length of an utterance is about 5 seconds.

Table 1. Details of the MAT and TRSC corpora, including numbers of speakers (spk) and utterances (utt).

        Training       Development    Test
        spk    utt     spk    utt     spk    utt
MAT     -      -       -      -       -      -
TRSC    -      -       -      -       -      -

4.2. LPM training results

For all LPM experiments, the number of prosody states was empirically set to 11 (8 for voiced, 3 for unvoiced states), and there are 5 different tones in Mandarin.

Figure 4. The learning curves of the LPMs trained on the MAT and TRSC training sets (left: MAT, right: TRSC).

First of all, the learning curves of the LPMs were examined. Fig. 4 shows the likelihood functions on the MAT and TRSC training sets as the number of training iterations increases. The LPMs converge quickly, especially on the TRSC set.

Figure 5. The learned tone affecting patterns on the MAT and TRSC corpora (top 5 panels: MAT, bottom 5 panels: TRSC).

After LPM training converged, the learned affecting patterns of the 5 tones for Taiwanese and Mainlander Mandarin were drawn in Fig. 5. The major tone differences between Taiwan and Mainland China lie in the patterns of tones 3 and 5, which is consistent with common linguistic knowledge [10]. These results suggest that LPMs can automatically learn the accent-specific characteristics of Taiwanese and Mainlander Mandarin; the LPM-LM-based approach can therefore be expected to discriminate these two Mandarin accents.

4.3. Acoustic and phonotactic baselines

To set up a reference baseline, two popular phonotactic approaches and one acoustic approach were first tested: (1) PPR-LM, (2) universal phone recognizer (UPR)-LM and (3) shifted delta cepstra (SDC)/Gaussian mixture model (GMM). For PPR-LM and UPR-LM, 39-dimensional mel-frequency cepstral coefficient (MFCC) feature vectors were used to train the front-end phone recognizers. There are in total 50 phonemes in Mandarin for PPR-LM, but for UPR-LM the phoneme set is extended to 63 to reflect the major pronunciation differences (retroflex and nasal-ending sounds) between Mainlander and Taiwanese Mandarin. All MFCCs were pre-processed by cepstral normalization (CN) to partially compensate the channel and database mismatch, and tri-gram LM backends were adopted for both PPR-LM and UPR-LM. The parameters of the SDC were set empirically, and the number of mixtures in the GMMs was 512.

Table 2. Experimental results of the individual acoustic, phonotactic and prosodic approaches and their fusion on the mixed TRSC and MAT database.

Approach         Error (%)    System fusion         Error (%)
(1) PPR-LM       -            (5) (1)+(2)           21.84
(2) UPR-LM       -            (6) (1)+(3)           -
(3) SDC-GMM      29.1         (7) (1)+(2)+(3)       20.68
(4) LPM-LM       31.34        (8) (7)+(4)           16.18

Table 2 shows the performance of the individual systems and their fusion results. The fusion was done with a softmax-output multi-layer perceptron (MLP) trained on the development sets. From Table 2, it is found that (1) PPR-LM and UPR-LM work better than SDC/GMM, and (2) the best baseline performance, a 20.68% error rate, is achieved by fusing the PPR-LM, UPR-LM and SDC/GMM systems.

4.4. Prosodic approach

The proposed LPM-LM approach was then evaluated. In the training phase the correct tone tags were given, but in the testing phase MLP-based tone recognizers provided estimated tone tags online [7]. Table 2 shows the performance of the proposed LPM-LM and of its fusion with the acoustic and phonotactic baseline, done with the same softmax-output MLP trained on the development sets. Unlike the acoustic features, the prosodic features capture complementary characteristics (for example, tone). From Table 2, the LPM-LM alone is comparable with the SDC/GMM but worse than the full acoustic and phonotactic baseline, since it relies on prosodic rather than strong acoustic features. However, fusing the LPM-LM with the acoustic and phonotactic baseline further reduces the error rate from 20.68% to 16.18%, which suggests that the methods are complementary.

5. Conclusions

In this paper, an LPM-LM-based approach was proposed to identify two Mandarin accent types spoken by native speakers in Mainland China and Taiwan. Experimental results on a mixed TRSC and MAT database showed that fusing the proposed LPM-LM with an SDC/GMM+PPR-LM+UPR-LM baseline system further reduced the average accent identification error rate from 20.7% to 16.2%. The proposed LPM method is therefore a promising approach.

6. Acknowledgement

This work was supported by the National Science Council, Taiwan, under the project with contract NSC -E- -MY2, and is a partial result of Project 8353C4220 conducted by ITRI under sponsorship of the Ministry of Economic Affairs, Taiwan, R.O.C.

References

[1] Language Recognition Evaluation, National Institute of Standards and Technology.
[2] Jean-Luc Rouas, "Automatic Prosodic Variations Modeling for Language and Dialect Discrimination," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, Aug. 2007.
[3] Bin Ma, Donglai Zhu, and Rong Tong, "Chinese Dialect Identification Using Tone Features Based on Pitch Flux," in Proc. ICASSP 2006, Toulouse, France, May 2006.
[4] Chi-Yueh Lin and Hsiao-Chuan Wang, "Language Identification Using Pitch Contour Information in the Ergodic Markov Model," in Proc. ICASSP 2006, Toulouse, France, May 2006.
[5] Y. Obuchi and N. Sato, "Language Identification Using Phonetic and Prosodic HMMs with Feature Normalization," in Proc. ICASSP 2005, Philadelphia, Mar. 2005.
[6] Najim Dehak, Pierre Dumouchel, and Patrick Kenny, "Modeling Prosodic Features With Joint Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, Sept. 2007.
[7] Chen-Yu Chiang, Xiao-Dong Wang, Yuan-Fu Liao, Yih-Ru Wang, Sin-Horng Chen, and Keikichi Hirose, "Latent Prosody Model of Continuous Mandarin Speech," in Proc. ICASSP 2007, Hawaii, Apr. 2007, pp. IV-625-IV-628.
[8] Chiu-yu Tseng, Shao-huang Pin, Yehlin Lee, Hsin-min Wang, and Yong-cheng Chen, "Fluent speech prosody: Framework and modeling," Speech Communication, vol. 46:3-4, Mar. 2005.
[9] Sin-Horng Chen, Wen-Hsing Lai, and Yih-Ru Wang, "A statistics-based pitch contour model for Mandarin speech," Journal of the Acoustical Society of America, 117 (2), Feb. 2005.
[10] Chin-Chin Tseng, "Prosodic Properties of Intonation in Two Major Varieties of Mandarin Chinese: Mainland China vs. Taiwan," in International Symposium on Tonal Aspects of Languages: With Emphasis on Tone Languages, Beijing, China, Mar. 2004.
[11] Hsiao-Chuan Wang, Frank Seide, Chiu-Yu Tseng, and Lin-Shan Lee, "MAT - Design, Collection, and Validation of a Mandarin 2000-Speaker Telephone Speech Database," in ICSLP 2000, Beijing, China, Oct. 2000.
[12] 500-People TRSC (Telephone Read Speech Corpus), Chinese Corpus Consortium, China.


Sample-based Phone-like Unit Automatic Labeling in Mandarin Speech

You-Yu Lin
Institute of Communication Engineering, National Chiao Tung University
ross0927.cm97g@g2.nctu.edu.tw

Yih-Ru Wang
Institute of Communication Engineering, National Chiao Tung University
yrwang@mail.nctu.edu.tw

Abstract

This paper presents a sample-based phone boundary detection algorithm which can improve the accuracy of phone boundary labeling in speech signals. Conventional phone labeling adopts a frame-based approach: acoustic features such as MFCCs are extracted frame by frame, and statistical methods locate the phone boundaries from these frame-based features, the HMM-based forced alignment being the most frequently used method. The main drawback of the frame-based approach lies in its inability to model rapid changes in the speech signal; moreover, its time resolution is too coarse for some applications. To overcome this problem, a sample-wise phone boundary detection framework is proposed in this study. First, sample-wise acoustic features are proposed which can properly model the variation of the speech signal. The sample-based spectral KL distance is employed to pre-select boundary candidates, in order to reduce the complexity of the sample-based method. Then, a supervised neural network is trained for phone boundary detection. Finally, the effectiveness of the proposed framework is validated on the automatic labeling of the TCC-300 speech corpus.

Keywords: phone boundary segmentation, sub-band signal envelope, sample-based spectral KL distance, supervised neural network.

1. Introduction

Automatic phone boundary labeling methods, often evaluated on TIMIT [1], can be divided into model-based and metric-based approaches. In model-based approaches, maximum likelihood-trained hidden Markov models (ML-trained HMMs) are used for forced alignment; roughly 90% of the boundaries they produce fall within 20 ms of the manual labels (inclusion rate). Beyond maximum likelihood (ML) training, minimum boundary error (MBE) training of HMMs has been proposed [2]: on TIMIT, the MBE-trained HMM reaches 79.75% boundary accuracy within a 10 ms tolerance, improving over the ML-trained HMM. Support vector machines (SVMs) [3] and neural networks (NNs) [4] have also been used to refine HMM segmentation results.

In metric-based approaches, Dusan and Rabiner [5] studied the relation between maximum spectral transition positions and phone boundaries, reporting a 23.1% missed detection rate (MD) and a 22.0% false alarm rate (FA) at a 20 ms tolerance on TIMIT. Kotropoulos et al. [6] combined the Kullback-Leibler (KL) distance with the Bayesian information criterion (BIC) in the DISTBIC framework, reporting 25.7% MD and 23.3% FA on NTIMIT. Both model-based and metric-based methods, however, are

frame-based: they operate on time-spectrum features such as mel-frequency cepstral coefficients (MFCCs) computed over fixed-length frames, so the exact boundary position within a frame remains uncertain and rapid signal changes are smoothed out. Related work on acoustic-phonetic landmark detection uses articulation parameters (APs) to locate events such as voice onsets, suggesting that boundary-related information can be extracted below the frame level. In this study a sample-based framework is therefore adopted: sample-wise features are extracted, boundary candidates are pre-selected, and a supervised multi-layer perceptron (MLP) makes the final sample-based boundary decisions, which can also refine HMM-based forced alignment in the way SVM-based refinement does. The proposed method is evaluated on the TCC-300 corpus against HMM-based forced alignment.

2. HMM-based phone-like unit alignment of TCC-300

The reference labeling of TCC-300 is produced at the phone-like unit level with HMMs. Phone models are first trained with speaker adaptive training (SAT) using constrained MLLR (CMLLR) transforms; speaker adaptation (SA) with MLLR then yields speaker-adapted HMMs, which approximate speaker-dependent HMMs.

The SAT+SA HMMs are used for forced alignment of the TCC-300 corpus, and the resulting phone-like unit boundaries serve as the reference labels for training the supervised MLP boundary detector described below.

3. Sample-based acoustic features

With the HMM alignment providing reference boundaries, the sample-based detector works on boundary candidates pre-selected from sample-wise features; the candidates are then scored by the MLP (against a target function), and a Viterbi search produces the final boundaries. The sample-based features, which follow the articulation parameter (AP) idea, are the sub-band signal envelopes, their rate of rise (ROR), the spectral entropy, the sample-based spectral KL distance, and the spectral flatness [7].

(1) Sub-band signal envelope: following [7], the speech signal is split by a filter bank into six sub-bands. The envelope of each band signal x[n] is obtained with an envelope detector based on the Hilbert transform,

H(x[n]) = x[n] * h[n],  h[n] = 0 for n even,  h[n] = 2 / (nπ) for n odd   (1)

(2) Rate of rise: analogous to the frame-based delta-term, the ROR of a sample-wise feature x[n] is computed over a window of half-width w as

ROR_x[n] = ( Σ_{i=-w}^{w} i · x[n+i] ) / ( Σ_{i=-w}^{w} i^2 )   (2)

3. Spectral entropy [9-10]: with E_i[n] denoting the energy of the i-th of the 16 sub-band envelopes at sample n, the sample-based spectral entropy is

H_s = - sum_{i=1..16} E~_i[n] log E~_i[n],  (3)

where the normalized band energies are

E~_i[n] = E_i[n] / sum_{j=1..16} E_j[n].  (4)

4. Sample-based spectral KL distance: the spectral KL distance d_x(m, n) between samples m and n is

d_x(m, n) = sum_{i=1..16} ( E_i[n] - E_i[m] ) log( E_i[n] / E_i[m] ).  (5)

5. Spectral flatness [11]: the flatness F is the ratio of the geometric to the arithmetic mean of the sub-band energies,

F = ( prod_{i=1..16} E_i[n] )^{1/16} / ( (1/16) sum_{i=1..16} E_i[n] ).  (6)

Silence and short-pause segments yield high flatness, so thresholds on F are used to exclude them. These sample-based features are computed over the TCC-300 corpus.
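The three band-energy measures of Eqs. (3)-(6) are simple to compute once the 16 sub-band energies at a sample are available; the following minimal sketch assumes exactly that (the small epsilon guards are ours, added to avoid log(0)).

    import numpy as np

    def spectral_entropy(E):                  # E: (16,) band energies at one sample
        p = E / np.sum(E)
        return -np.sum(p * np.log(p + 1e-12))

    def spectral_kl_distance(E_n, E_m):       # symmetric KL between samples n and m
        return np.sum((E_n - E_m) * np.log((E_n + 1e-12) / (E_m + 1e-12)))

    def spectral_flatness(E):                 # geometric mean / arithmetic mean
        return np.exp(np.mean(np.log(E + 1e-12))) / (np.mean(E) + 1e-12)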

3. Sample-based feature analysis
The SAT (speaker adaptation training, feature-space MLLR) and SA (speaker adaptation, MLLR) HMMs provide the phone-like unit alignment of TCC-300 against which the sample-based features are examined. The phone-like units are grouped into broad classes:

Stop: b, p, d, t, g, k
Nasal: m, n, (n_n), (ng)
Fricative: f, s, x, h, sh
Affricate: q, j, c, z, zh, ch
Liquid: l, r
Vowel: all others

Around the HMM boundaries, the spectral flatness behaves as expected: it is high in silence and short pauses and low inside phones, so flatness thresholds separate speech from non-speech regions (16 sub-bands are used throughout).

The sample-based spectral KL distance is then examined around the HMM boundaries. Whereas frame-based measures smear rapid spectral changes, the sample-based KL distance exhibits sharp peaks at phone transitions, and across the 16 sub-bands these peaks lie close to the boundaries given by the HMM alignment.

4. Sample-based boundary detection with a supervised MLP
Boundary candidates are pre-selected from peaks of the sample-based spectral KL distance. Sample n is kept as a candidate when

d_x(n-1, n) < d_x(n, n+1), d_x(n, n+1) > d_x(n+1, n+2), and d_x(n, n+1) > Th,  (7)

yielding a candidate set {c_j ; j = 1, ..., n_c} that divides the utterance into segments. For a candidate c_k between the segments [c_{k-1}, c_k] and [c_k, c_{k+1}], the segment energy of band i over [c_k, c_{k+1}] is

ES_i(k) = ( sum_{n=c_k+1..c_{k+1}} E_i[n] ) / ( c_{k+1} - c_k ).  (8)

Each candidate c_k is then represented by a feature vector collecting the sub-band quantities and their RORs:

{ fb_i[k], ror_fb_i[k], d_k(n, n+1), F(k), env_i(k), ror_env_i(k), H_s(k), ror_H_s(k) }; i = 1, ..., 16.  (9)
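A minimal sketch of the pre-selection rule in Eq. (7); the threshold value Th is an assumption and would be tuned on held-out data.

    def preselect_candidates(d, th):
        """d[n] = spectral KL distance between samples n and n+1 (Eq. 5)."""
        cands = []
        for n in range(1, len(d) - 1):
            # Keep n when d[n] is a local peak exceeding the threshold (Eq. 7).
            if d[n - 1] < d[n] > d[n + 1] and d[n] > th:
                cands.append(n)
        return cands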

Two further features, the mean sub-band energies of the segments preceding and following each candidate c_k, are added:

{ ESp_i(k) = ( sum_{n=c_{k-1}+1..c_k} E_i[n] ) / ( c_k - c_{k-1} ), ESn_i(k) = ( sum_{n=c_k+1..c_{k+1}} E_i[n] ) / ( c_{k+1} - c_k ) }; i = 1, ..., 16.  (10)

The supervised MLP classifies each candidate among 9 classes built from four segment types, short pause (IS), consonant (IC), vowel (IV), and nasal endings (IN), plus the transition classes (SC), (CV), (VN), (VS), and (CP). The MLP likelihoods are combined with the HMM alignment as follows: (1) the HMM provides an initial boundary region; (2) MLP-based scores are computed for the candidates within it; (3) a Viterbi search over the MLP-based likelihoods selects the boundary sequence; (4) steps (2) and (3) refine the frame-based HMM boundaries into sample-based ones.

5. Experiments
The proposed sample-based MLP refinement of frame-based HMM alignment is evaluated on the TCC-300 corpus.

Speech in TCC-300 is sampled at 16 kHz and split into 16 sub-bands. The SAT and SA HMMs provide the phone alignment of TCC-300, and the MLP is trained with the NICO Toolkit [12]. Boundary candidates are pre-selected with the sample-based spectral KL distance and the spectral flatness, and the 16-band features described above feed the MLP. In refinement, the sample-based search is restricted to a window of about 100 ms around each HMM boundary; a Viterbi search over the MLP scores within this 100 ms window then selects the final sample-based boundaries, which are compared against the original HMM boundaries.

The boundary errors of the HMM baseline concentrate in the 10-20 ms and 20-30 ms ranges, and the sample-based refinement reduces them. Percentages of boundaries with errors beyond the 10 ms tolerance, by manner class:

HMM forced alignment: Stop 4.96, Nasal 5.95, Fricative 11.3, Affricate 8.92, Liquid 6.23
After MLP refinement: Stop 2.62, Nasal 4.46, Fricative 8.75, Affricate 7.3, Liquid 2.70

(The baseline is the SAT/SA HMM phone alignment of TCC-300.)

6. Conclusion
An MLP-based, sample-based refinement of frame-based HMM phone alignment has been presented and validated on TCC-300; future work will extend the evaluation to TIMIT.

References
[1] Toledano, D. T., L. A. H. Gomez, and L. V. Grande, "Automatic phonetic segmentation," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, Nov. 2003.
[2] J.-W. Kuo and H.-M. Wang, "Minimum Boundary Error Training for Automatic Phonetic Segmentation," in Proc. of the Ninth International Conference on Spoken Language Processing (Interspeech - ICSLP), September 2006.
[3] J.-W. Kuo, H.-Y. Lo, and H.-M. Wang, "Improved HMM/SVM methods for automatic phoneme segmentation," in Proc. Interspeech, Antwerp, Belgium, 2007.
[4] K.-S. Lee, "MLP-based phone boundary refining for a TTS database," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, 2006.
[5] Sorin Dusan and Lawrence Rabiner, "On the Relation between Maximum Spectral Transition Positions and Phone Boundaries," in Proc. Interspeech 2006.
[6] Almpanidis, G., Kotti, M., and Kotropoulos, C., "Robust Detection of Phone Boundaries Using Model Selection Criteria With Few Observations," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, Feb. 2009.
[7] Sharlene A. Liu, "Landmark detection for distinctive feature-based speech recognition," J. Acoust. Soc. Am., 100 (5), November 1996.
[8] Hasegawa-Johnson et al., "Landmark-Based Speech Recognition: Report of the 2004 Johns Hopkins Summer Workshop," in Proc. ICASSP 2005, vol. 1, March 18-23, 2005.
[9] H. Misra, S. Ikbal, H. Bourlard, and H. Hermansky, "Spectral entropy based feature for robust ASR," in Proc. ICASSP 2004.
[10] Jia-lin Shen, Jeih-weih Hung, Lin-shan Lee, "Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments," Proc. ICSLP 1998.
[11] J. D. Markel and A. H. Gray, "A spectral-flatness measure for studying the autocorrelation method of linear prediction speech analysis," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-22, June 1974.
[12] NICO Toolkit. Available online.


A Discrete-cepstrum Based Spectrum-envelope Estimation Scheme and Its Application to Voice Transformation

Hung-Yan Gu and Song-Fong Tsai


2. Discrete-cepstrum spectrum-envelope estimation
2.1 Least-squares formulation
Let X(k), k = 0, ..., N-1, be the DFT of a windowed frame. Its log magnitude spectrum can be expanded in cepstral coefficients c_n, log|X(k)| = sum_n c_n e^{j 2 pi k n / N}, and the symmetry of the real spectrum gives c_k = c_{N-k}. Truncating the expansion at order p yields the envelope model

log S(f) = c_0 + 2 sum_{n=1..p} c_n cos(2 pi n f),

with f normalized to [0, 0.5]. Given L spectral peaks with frequencies f_k and log amplitudes a_k, k = 1, ..., L, the coefficients are chosen to minimize the weighted squared error

eps = sum_{k=1..L} w_k ( a_k - log S(f_k) )^2,

where w_k weights peak k.

In matrix form, with a = (a_1, ..., a_L)^T, c = (c_0, c_1, ..., c_p)^T, W = diag(w_1, ..., w_L), and M the L x (p+1) matrix whose k-th row is (1, 2 cos(2 pi f_k), ..., 2 cos(2 pi p f_k)), the minimizer satisfies the normal equations

( M^T W M ) c = M^T W a.

2.2 Regularization
When L is small or the peak frequencies f_k cluster, M^T W M is ill-conditioned and the envelope oscillates between the peaks. A smoothness penalty R[S] is therefore added to the weighted error sum_{k=1..L} w_k ( a_k - log S(f_k) )^2.

The penalty is the integrated squared derivative of the log envelope,

R[S] = integral ( d log S(f) / df )^2 df = c^T U c,

where U is the corresponding quadratic-form matrix in the cepstral coefficients. The regularized solution becomes

( M^T W M + lambda U ) c = M^T W a,

which trades fidelity to the L peaks (f_k, a_k) against smoothness of S(f). (Figure 3: an order-40 envelope estimated with regularization.)
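The regularized normal equations above reduce to a single linear solve. The sketch below assumes peak frequencies normalized to [0, 0.5] and uses a diagonal penalty matrix U that grows with coefficient order as a stand-in for the exact quadratic form of R[S]; the value of lambda and the weights are illustrative.

    import numpy as np

    def discrete_cepstrum(freqs, amps, p, lam=2e-4, weights=None):
        """Fit log S(f) = c0 + 2*sum_n c_n cos(2*pi*n*f) to L (f_k, a_k) peaks
        by weighted, regularized least squares; returns c (length p+1)."""
        L = len(freqs)
        W = np.diag(np.ones(L) if weights is None else weights)
        n = np.arange(p + 1)
        M = np.cos(2 * np.pi * np.outer(freqs, n))   # L x (p+1) basis matrix
        M[:, 1:] *= 2.0
        U = np.diag(n.astype(float) ** 2)            # smoothness penalty (c0 free)
        A = M.T @ W @ M + lam * U
        b = M.T @ W @ np.log(amps)                   # a_k = log peak amplitudes
        return np.linalg.solve(A, b)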

(Figure 4: an order-40 spectrum-envelope estimate on a linear frequency axis.)

(Figure 5: envelope-fit error versus cepstrum order p.) The fit is measured by the average squared error over Nr analysis frames,

Es = (1/Nr) sum_{t=1..Nr} (1/L_t) sum_{k=1..L_t} ( a_{t,k} - log S_t(f_{t,k}) )^2,

where frame t has L_t peaks with log amplitudes a_{t,k} at frequencies f_{t,k} and estimated envelope S_t(f).

(Figure 7: the frequency-warping function f' = warp(f) applied to the peak frequencies f_k. Figure 8: an order-40 envelope after warping.)

(Figure 9: an order-40 envelope under the warping f' = warp(f).)

5. Experiments

(Figure 10: spectrum envelopes estimated over four frequency ranges: (i) 0-2,000 Hz, (ii) 0-4,000 Hz, (iii) 0-6,000 Hz, (iv) 0-11,025 Hz.)

5.2 Envelope conversion
The conversion relates the source spectrum envelope vs(f), the converted envelope vc(f), and the target example envelope ve(f): vc(f) is obtained by moving vs(f) toward ve(f).

The converted envelope vc(f) is compared against ve(f) and the unconverted vc estimates (Figure 14: converted envelopes; pitch about 250 Hz).

5.4 Synthesis
For synthesis, the harmonic amplitudes a_k, k = 1, ..., L, are obtained by sampling the converted envelope at the harmonic frequencies f_k.

Between consecutive analysis frames t and t+1, each harmonic k has frequency f_k(t) and amplitude a_k(t); the instantaneous amplitude and frequency are interpolated sample by sample over the N samples of the frame shift before the harmonics are summed for synthesis. Section 5.5 reports the listening evaluation.



{g964, chaolin, s944, s9538, ...}

Consider the pair "I put a book on a table." and "I put a table on a book.": the same words in a different order express a very different meaning, so restoring a plausible word order for a scrambled sentence is a well-defined and useful task [1][2], and context further constrains the choices [3].

Recovering deranged (scrambled) sentences has been studied from several angles. Becker et al. [4] treated long-distance scrambling with Tree Adjoining Grammars (TAG) [5] and extensions such as Multi-Component TAG and Free Order TAG. Our approach instead combines anchoring with CYK-style parsing: among the n words of a scrambled sentence, multi-word phrases are identified as anchors, leaving m ordering units (m <= n), so only m! orderings (rather than n!) need to be considered. The Stanford Parser [6] is used to parse candidate orderings, and the Cocke-Younger-Kasami algorithm (CYK) [7][8] underlies the search. Sentences (a), (b), and (c) below, which all preserve the phrase "good students", illustrate the idea.

A sentence of n words admits n! orderings. Consider:

(a) I have never seen such good students.
(b) Such good students I have never seen.
(c) Never have I seen such good students.

Each is a grammatical ordering of the same words, whereas a scrambled version such as "Students good such seen never have I." is not. Phrases such as adjective phrases (ADJP), noun phrases (NP), and prepositional phrases (PP) tend to survive scrambling intact: "good students" appears as a unit in (a), (b), and (c). Treating "good students" as an anchor reduces the orderings of (a) to be examined from 7! (5040) to 6! (720), as the sketch after this paragraph shows. Parsing (a) (with the Penn Treebank tag set [9]) identifies the ADJP, NP, and PP constituents, and counting the constituents at each level of the parse tree gives the number of ordering units.
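The combinatorial saving from anchoring can be checked directly; in this minimal sketch (illustrative names only), the phrase "good students" is kept as one unit, reducing sentence (a) from 7! word orderings to 6! unit orderings.

    from itertools import permutations

    words = ["I", "have", "never", "seen", "such", "good", "students"]
    units = ["I", "have", "never", "seen", "such", ("good", "students")]

    def orderings(tokens):
        for perm in permutations(tokens):
            # Flatten anchored phrases back into their fixed internal order.
            yield [w for t in perm for w in (t if isinstance(t, tuple) else (t,))]

    print(sum(1 for _ in permutations(words)))  # 7! = 5040 word orderings
    print(sum(1 for _ in orderings(units)))     # 6! = 720 unit orderings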

The anchored units are extracted from the parse tree by a simple procedure: starting from an empty set E, the tree T is traversed and every ADJP, NP, or PP constituent of T is added to E; in sentence (a), for instance, the NP "good students" becomes a unit. The token sequence "I have never seen such good students", with its 8 tokens and 8! (40320) orderings, is thereby reduced to 6! unit orderings. Because the computational complexity of CYK parsing still grows quickly with n, a second bottom-up procedure over the leaf nodes, bounded by tree depth, builds the final m ordering units: it maintains a unit set K and a relation R of adjacency ("near") constraints and merges units u accordingly (steps 01-07; Figure 4).

For example, the constituent (NP (JJ good) (NNS students)) is marked "near"; combining it with (JJ such) yields (PP (JJ such) (NP (JJ good) (NNS students))), so "such good students" enters K as a single unit and the orderings of "I have never seen such good students" shrink further to 5!. The relation R records adjacency: for a production A -> B C, B and C are near each other (near-1 when their order is fixed, near-2 when either order is allowed). A coordination such as ((green) (and) (pink)) illustrates near-2: "green and pink" and "pink and green" are both acceptable, while orders like "and green pink" are not.

2.3 Scoring with a PCFG
A probabilistic context-free grammar (PCFG, also called a stochastic context-free grammar, SCFG) [7][8] scores the candidate orderings.

The recovery procedure of Section 2.2 was evaluated over scrambling proportions from 20% up to 70% of the words (3,528 test orderings in all); error rates of 9.8% at 20% scrambling, and of 5.9% and 3.9% at 30% and 40%, were observed, with the trend continuing up to 70% (Table below).

Recovery results (T = recovered, F = not recovered) for the test sentences under the four system configurations:

1.1  T T T T
1.2  T T T F
2.1  T T T T
2.2  T F F F
2.3  T T T T
2.4  T F F F
3.1  T T T T
3.2  T T T T
3.3  T F F F
4.1  T T T T
4.2  T T T F
5.1  T T T T
5.2  T T T T
6.1  T T T T
6.2  T T T F
6.3  T T T T
7.1  T T T T
7.2  T T F F
7.3  T F F F

The remaining items were not recovered under any configuration (all F).

Failure rates rise with the scrambling proportion: at 30% and 40% scrambling (the F-marked items above), proportions of 26.3%, 36.8%, and 47.4% were observed in one setting, and 2.1%, 3.6%, and 47.4% in another.

3. Generating orderings with CYK parsing
3.1 Grammar rules are read off treebank parses: from sentence (a) one obtains rules such as VP -> VBP ADVP and VP -> VBP ADVP VP, and frequent treebank rules include NP -> PRP (4329 occurrences), S -> NP VP (3079), NP -> DT NN, and PP -> IN NP. Short examples such as "I love you." and "an apple in the morning" illustrate rule application.

Experiments ran on a Pentium-class PC with 2 GB RAM under Windows XP. The CYK generator can overgenerate [10]: with rules such as A -> a c, B -> a c, and C -> a c d, a string like "a b c" can be derived in several ways, and orders such as "b a c" or "a c b" may be spuriously licensed; likewise, NP VP analyses under both S and SQ admit duplicate derivations. Constraining the CYK chart (e.g., requiring n or n-1 words to be covered) controls overgeneration. The next table relates sentence length n and anchored-unit count m to the results.

Recovery (T/F) and candidate statistics for the second sentence set under four configurations (counts and percentages as printed in the original):

1.1  T T T T   (78 %; 96 3%)
1.2  T T T T
2.1  T T T T
2.2  F F F F   (20 7%; 20 7%; 20 7%)
2.3  T T T T
2.4  T T T T
3.1  T T T T
3.2  T T T T   (48 40%; 48 40%; 72 60%)
3.3  T T T T
3.4  T T T T
4.1  T T T T
4.2  F F F F   (6 25%; 6 25%; 6 25%)
4.3  F F F F
5.1  T T T T
5.2  F F F F   (20 7%; 20 7%; 20 7%)
5.3  F F F F
6.1  F F F F   (0 0%; 0 0%; 0 0%)
6.2  F F F F
7.1  T T T T   (92 27%)
7.2  F F F F
8.1  T T T T   (8 5%; 24 20%; 24 20%)
8.2  F F F F
9.1  T T T T   (20 7%; 20 7%; 20 7%)
9.2  F F F F

With m anchored units, the CYK search considers m! orderings; the T/F outcomes above show how this interacts with recovery.

Pronoun pairs such as (I, you) can be confused, and the CYK search may still rank wrong orderings highly among its candidates.

4. Refining the CYK search
Function words such as "have" in (a) and (c), and the bigram "have never" in (a), provide additional constraints (Figures 8 and 9). The refined search keeps a set S of chart analyses whose start symbol S spans the whole sentence, together with the candidate set A and the adjacency relation R, accepting (a), (b), and (c) while pruning implausible orders.

For example, indexing the units of (a) as "I have / never seen / such / good / students" shows how the chart entries combine; constraints such as "never" attaching to the auxiliary [3] prune the search further (Figure 9).


This research was supported in part by the National Science Council under grant NSC E MY2.

References
[1] (Last visited on 2009/10/04).
[2] (Last visited on 2009/10/04).
[3] M.-S. Liu, Y.-C. Wang, J.-H. Lin, C.-L. Liu, Z.-M. Gao, and C.-Y. Chang. Supporting the translation and authoring of test items with techniques of natural language processing, Journal of Advanced Computational Intelligence and Intelligent Informatics, 12(3).
[4] T. Becker, A. K. Joshi, and O. Rambow. Long distance scrambling and Tree Adjoining Grammars, Proceedings of the Fifth Conference of the European Chapter of the Association for Computational Linguistics, 21-26, 1991.
[5] A. K. Joshi. An Introduction to Tree Adjoining Grammars, Mathematics of Language, 1987.
[6] The Stanford Natural Language Processing Group. The Stanford Parser: A statistical parser. (Last visited on 2009/05/01).
[7] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, the MIT Press, 1999.
[8] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall.
[9] Penn Treebank Tags. (Last visited on 2009/08/19).
[10] D. Lin. Principle-based parsing without overgeneration, Proceedings of the Thirty-First Annual Meeting of the Association for Computational Linguistics, 112-120, 1993.

Appendix: test sentences

1.1 She can neither sing well nor dance beautifully.
1.2 She can neither dance beautifully nor sing well.
2.1 They are sometimes late for work.
2.2 Sometimes they are late for work.
2.3 They are late for work sometimes.
2.4 They sometimes are late for work.
3.1 Only I saw the cake yesterday.
3.2 I only saw the cake yesterday.
3.3 I saw only the cake yesterday.
3.4 I saw the cake only yesterday.
4.1 Solemnly the minister addressed her congregation.
4.2 The minister solemnly addressed her congregation.
4.3 The minister addressed her congregation solemnly.
5.1 I have never seen such good students.
5.2 Such good students I have never seen.
5.3 Never have I seen such good students.
6.1 One of my favorite hobbies is reading.
6.2 Reading is one of my favorite hobbies.
7.1 He goes to the library every Sunday.
7.2 Every Sunday he goes to the library.
8.1 The hotel is next to a movie theater.
8.2 A movie theater is next to the hotel.
9.1 I can agree in neither case.
9.2 In neither case can I agree.

1.1 Fashion goes hand in hand with compassion for life.
1.2 Compassion for life goes hand in hand with fashion.
2.1 Girls are born with more sensitive hearing than boys.
2.2 Boys are born with more sensitive hearing than girls.
2.3 Born with more sensitive hearing than boys are girls.
2.4 Born with more sensitive hearing than girls are boys.
3.1 Boys and girls should be educated in different ways.
3.2 Girls and boys should be educated in different ways.
3.3 In different ways should boys and girls be educated.
4.1 Neither he nor I want to attend the meeting.
4.2 Neither I nor he wants to attend the meeting.
5.1 He finally passed the exam because he studied hard.
5.2 Because he studied hard he finally passed the exam.
6.1 Here is some good food for you to try.
6.2 Here some good food is for you to try.
6.3 Here some good food for you to try is.
7.1 A brown and white dog is at your doorsteps.
7.2 A white and brown dog is at your doorsteps.
7.3 At your doorsteps is a brown and white dog.

On the Use of Topic Models for Large-Vocabulary Continuous Speech Recognition

Kuan-Yu Chen, Department of Computer Science and Information Engineering, National Taiwan Normal University
Berlin Chen, Department of Computer Science and Information Engineering, National Taiwan Normal University

This paper studies topic models as a complement to conventional N-gram language models in large-vocabulary continuous speech recognition, comparing probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the word topic model (WTM), and proposing a word vicinity model (WVM) that is interpolated with the N-gram model.

1. Introduction
A speech recognizer converts acoustic feature vectors into word sequences by combining an acoustic model with a language model. The N-gram language model [1] approximates the probability of a word given its history by conditioning only on the preceding N-1 words, treating each position as a draw from a multinomial distribution; in practice N is small (N = 2 or 3), because higher orders suffer severely from data sparseness. Many extensions address this limitation. Word classes yield the class-based N-gram model [2] and the aggregate Markov model (AMM) [3], and various long-distance dependency models have been studied [4, 5, 6]. Another line of research captures semantic information: latent semantic analysis (LSA) [7], probabilistic latent semantic analysis (PLSA) [8, 9, 10], and latent Dirichlet allocation (LDA) [11, 12, 13, 14].

The word topic model (WTM) [15, 16] associates each word with a word pseudo-document, casting topic modeling as a model of word co-occurrence. Building on these topic models [17], this paper proposes the word vicinity model (WVM), which directly models the vicinity (co-occurrence) of word pairs and can be combined with N-gram models for recognition.

2. Probabilistic Latent Semantic Analysis (PLSA)
PLSA can be viewed as a probabilistic counterpart of latent semantic analysis (LSA) [5], which relies on the singular value decomposition (SVD). In PLSA [9], the probability of word w given a search history H is decomposed over K latent topics:

P_PLSA(w | H) = sum_{k=1..K} P(w | T_k) P(T_k | H),  (1)

where P(w | T_k) is the probability of w under topic T_k and P(T_k | H) is the weight of topic T_k for history H. The topic distributions P(w | T_k) are trained with the expectation-maximization (EM) algorithm [18], while P(T_k | H) is estimated on-line for each history H.

For recognition, PLSA is linearly interpolated with the N-gram model:

P(w | H) = lambda * P_PLSA(w | H) + (1 - lambda) * P_Ngram(w | H), 0 <= lambda <= 1.  (2)

PLSA training is prone to overfitting and generalizes poorly to unseen events, and EM reaches only a local rather than the global maximum; remedies include multiple random initializations and unsupervised clustering [7].
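A minimal sketch of Eqs. (1)-(2): the PLSA probability is a K-topic mixture, then linearly interpolated with the N-gram probability. The matrix P_w_T (holding P(w|T_k)) and the vector p_T_H (holding P(T_k|H)) are assumed to come from EM training and on-line inference, which are not shown.

    import numpy as np

    def plsa_prob(w_id, P_w_T, p_T_H):
        """Eq. (1): P_PLSA(w|H) = sum_k P(w|T_k) * P(T_k|H)."""
        return float(P_w_T[w_id] @ p_T_H)

    def interpolated_prob(p_plsa, p_ngram, lam=0.6):
        """Eq. (2): linear interpolation with the N-gram probability;
        lam = 0.6 follows the experimental setting reported later."""
        return lam * p_plsa + (1.0 - lam) * p_ngram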

3. Latent Dirichlet Allocation (LDA)
LDA [11] treats the per-document topic mixture itself as a random variable drawn from a Dirichlet prior. For a corpus D of M documents d_1, ..., d_M, document d having N_d words and a topic T_{d,n} drawn for each word position n, the corpus likelihood is

P_LDA(D) = prod_{d=1..M} integral P(theta_d) ( prod_{n=1..N_d} sum_{T_{d,n}} P(w_{d,n} | T_{d,n}) P(T_{d,n} | theta_d) ) d theta_d,  (3)

where theta_d is the topic mixture of document d. Exact inference is intractable; training uses variational Bayesian expectation maximization [17, 18] or Gibbs sampling [19, 10]. For language modeling, the topic weights of a history H are inferred and combined with the topic-word distributions as in PLSA.

4. Word Topic Model (WTM)
The WTM [16] builds, for each word w_j, a word pseudo-document M_{w_j} from the words co-occurring with w_j, and models the probability of w given w_j as

P_WTM(w | M_{w_j}) = sum_{k=1..K} P(w | T_k) P(T_k | M_{w_j}),  (4)

with shared topic distributions P(w | T_k) and pseudo-document-specific weights P(T_k | M_{w_j}). For a history H, the weights P(T_k | H_{w_j}) are estimated as in [15].

The history weights P(T_k | H_{w_j}) play the role of P(T_k | M_{w_j}) at recognition time.

5. Word Vicinity Model (WVM)
Whereas the WTM models P(w | M_{w_j}) through a pseudo-document, the WVM directly models the joint probability of a word pair (w_i, w_j) as a topic mixture:

P_WVM(w_i, w_j) = sum_{k=1..K} P(w_i | T_k) P(T_k) P(w_j | T_k).  (5)

Comparing (5) with (4), the WVM replaces the pseudo-document weights P(T_k | M_{w_j}) of the WTM by a shared topic prior P(T_k), so fewer parameters are needed. The conditional probability follows by normalization:

P_WVM(w_i | w_j) = ( sum_k P(w_i | T_k) P(T_k) P(w_j | T_k) ) / ( sum_k P(T_k) P(w_j | T_k) ).  (6)

For a history H = w_1, w_2, ..., the probability P_WVM(w | H) is obtained by combining P_WVM(w | w_j) over the history words w_j (7); equivalently, a history topic posterior is formed as

P(T_k | H) = sum_j lambda_j P(T_k | w_j), with P(T_k | w_j) = P(w_j | T_k) P(T_k) / sum_{k'=1..K} P(w_j | T_{k'}) P(T_{k'}),  (8)

where the weights lambda_j reflect the history words.
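A minimal sketch of the WVM quantities in Eqs. (5), (6), and (8), assuming a trained topic-word matrix P_w_T of P(w|T_k) (shape V x K) and a topic prior vector p_T of P(T_k).

    import numpy as np

    def wvm_joint(i, j, P_w_T, p_T):
        """Eq. (5): P(w_i, w_j) = sum_k P(w_i|T_k) P(T_k) P(w_j|T_k)."""
        return float(np.sum(P_w_T[i] * p_T * P_w_T[j]))

    def wvm_conditional(i, j, P_w_T, p_T):
        """Eq. (6): P(w_i|w_j) = P(w_i, w_j) / sum_w P(w, w_j)."""
        joint = wvm_joint(i, j, P_w_T, p_T)
        norm = float(np.sum(P_w_T * (p_T * P_w_T[j])))   # sums over all w and k
        return joint / norm

    def topic_posterior(j, P_w_T, p_T):
        """Inner term of Eq. (8): P(T_k|w_j) proportional to P(w_j|T_k) P(T_k)."""
        post = P_w_T[j] * p_T
        return post / post.sum()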

The predictive probability is then

P_WVM(w | H) = sum_{k=1..K} P(w | T_k) P(T_k | H).  (9)

For training, each word pair (w_i, w_j) in the vocabulary V is weighted by a co-occurrence count n(w_i, w_j) collected within a context window (closer words receiving larger weight), and the parameters maximize the log-likelihood

L_WVM = sum_{(w_i, w_j) in V x V} n(w_i, w_j) log P_WVM(w_i, w_j),  (10)

again via EM. The graphical models are shown in the figure below, beginning with (a) PLSA and (b) WTM.

(Figure, continued: plate diagrams for (c) LDA and (d) WVM, with K topics, M documents or pseudo-documents, and N words per document.)

Notation: a vocabulary V = {w_1, w_2, ..., w_N} of N words; a corpus D = {d_1, d_2, ..., d_M} of M documents; K topics; and parameter sets P(T_k | d_m) (of size M x K), P(w_n | T_k) (N x K), P(T_k | M_{w_n'}) (N x K), and P(T_k) (K), with P(w_n | T_k) shared across the models.

Figure: graphical models of (a) PLSA, (b) WTM, (c) LDA, and (d) WVM.

3. Experiments
3.1 Setup
The compared language models are PLSA, WTM, LDA, and the proposed WVM (Eqs. (7) and (6)). The acoustic front-end extracts Mel-frequency filter bank features, decorrelated with heteroscedastic linear discriminant analysis (HLDA)

and maximum likelihood linear transformation (MLLT). The acoustic units are Mandarin INITIALs and 38 FINALs, plus a silence model [19]; each INITIAL-FINAL unit is modeled by hidden Markov models, trained first with maximum likelihood estimation (MLE) and then with minimum phone error (MPE) discriminative training [20]. Experiments are conducted on the MATBN Mandarin broadcast-news corpus [21]. The baseline trigram language model is trained with the SRI Language Modeling Toolkit [22]. Performance is measured by perplexity (PP) and character error rate (CER). Each topic model is interpolated with the trigram as in Eq. (2), with the interpolation weight set empirically to 0.6, and the weight of each history word in Eqs. (7) and (8) decays with its distance from the predicted word (Eq. (11)).

Table 1 summarizes the sizes of the MATBN training and test sets and the NOWnews text collection. Table 2 lists, for a 32-topic model, the highest-weight word/weight pairs of Topics 8, 13, 14, and 23, and Table 3 reports the trigram baseline in CER (%) and PP on the MATBN sets and NOWnews. Two estimates of the model parameters are compared throughout, denoted (ML) and (ED). To select representative words for each topic, a topic score is used [23]:

TS(w, T_k) = ( sum_{m=1..M} c(w, d_m) P(T_k | d_m) ) / ( sum_{m=1..M} c(w, d_m) ),  (12)

where c(w, d_m) is the count of w in training document d_m and P(T_k | d_m) is the posterior probability of topic T_k given d_m.
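A minimal sketch of the topic score of Eq. (12), assuming a word-by-document count matrix and per-document topic posteriors from the trained model.

    import numpy as np

    def topic_score(counts, P_T_d):
        """counts: (V, M) word-by-document counts; P_T_d: (K, M) posteriors
        P(T_k|d_m). Returns a (V, K) matrix TS[w, k] as in Eq. (12)."""
        num = counts @ P_T_d.T                 # sum_m c(w, d_m) * P(T_k|d_m)
        return num / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)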

Table 4 lists the ten highest-scoring words of Topics 8, 13, 14, and 23. In perplexity (PP), the topic models clearly improve over the trigram baseline on MATBN: WTM(ED) obtains relative PP reductions of 23.1% and 21.4% on the two sets, and WVM(ML) obtains 26.1% and 24.2%. In recognition, CERs of 20.22% and 20.08% are reached on the two MATBN sets, corresponding to relative CER reductions of 5.2% for WTM(ML), 3.9% for WTM(ED), 5.5% for WVM(ML), and 4.0% for WVM(ED), about 5.0% overall; 64 topics give the best WTM(ED) result.

Detailed tables report CER (%) and PP on the two MATBN sets for PLSA, LDA, WTM(ML), WTM(ED), WVM(ML), and WVM(ED) across topic counts from 8 upward, as well as the effect of the history length L on MATBN CER (%) for the same topic counts; longer histories help up to a point, after which the gain saturates. Finally, to test contemporaneity, language models adapted with the (contemporary) NOWnews collection are also evaluated against MATBN.

A further table gives perplexities on NOWnews for PLSA, LDA, WTM(ML), WTM(ED), WVM(ML), and WVM(ED) at the same topic counts. On NOWnews, WVM(ML) and WTM(ML) are strongest: perplexity reductions of 7.4% (WVM(ML)) and 6.8% (WTM(ML)) are observed, with figures of 1.3% and 9.9% across other settings.

Conclusions
A word vicinity model (WVM) has been proposed for language modeling in Mandarin large-vocabulary continuous speech recognition; it substantially reduces perplexity and lowers the character error rate by about 5.0% relative to the trigram baseline.

Acknowledgments
This work was supported by the National Science Council under grants NSC E MY3 and NSC S.

References
[1] F. Jelinek, "Up from trigrams! - the struggle for improved language models," in Proc. of Eurospeech, 1991.
[2] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, "Class-based n-gram models of natural language," Computational Linguistics, 18(4):467-479, December 1992.

189 December, 992 [3] -order Markov models for statstcal laproc. of EMNLP, 997. [4] R. Iyer and M. Ostendorf, Modelng long dstance dependence n language: topc Transactons, 999. [5] Speech Communcaton, [6] Proc. of the IEEE, [7] J.R. Bellegarda, Latent Semantc Mappng: Prncples and Applcatons. Morgan and Claypool, [8] n Proc. of SIGIR, 999. [9] Machne Learnng, 200. [0] - Proc. of Eurospeech, 999. [] Journal of Machne Learnng Research, [2] model adaptaton usng varatonal Bayes n Proc. of Interspeech, [3] Techncal Report. [4] n Proc. of the Natonal Academy of Scences, [5] H.- Word topcal mxture models for dynamc language model adaptaton n Proc. of ICASSP, [6] B. Chen, "Latent topc modelng of word co-occurrence nformaton for spoken document retreval," n Proc. of ICASSP, [7] McNamara, S. Denns, & W. Kntsch (Eds.), Handbook of Latent Semantc Analyss. Hllsdale, NJ: Erlbaum. [8] Journal of Royal Statstcal Socety, 977. [9] B. Chen, J.-W. Kuo, and W.--Drven Approaches Proc. of ICASSP, [20] -Based Dscrmnatve Tranng of Acoustc Models for Mand Proc. of ROCLING, (n Chnese) [2] H.-M. Wang, B. Chen, J.-W. Kuo and S.- Internatonal Journal of Computatonal Lngustcs & Chnese Language Processng, [22] A. Stolcke, SRI Language Modelng Toolkt, [23] T.-H. L, M.-H. Lee, B. Chen and L.- Herarchcal Topc Organzaton and Vsual Presentaton of Spoken Documents Usng Probablstc Latent Semantc Analyss (PLSA) for Effcent Retreval/Browsng Applcatons n Proc. of Eurospeech,


Improving Translation Fluency with Search-Based Decoding and a Monolingual Statistical Machine Translation Model for Automatic Post-Editing

Jing-Shin Chang and Sheng-San Lin
Department of Computer Science & Information Engineering, National Chi Nan University
1 Univ. Road, Puli, Nantou 545, TAIWAN
jshin@csie.ncnu.edu.tw, s @ncnu.edu.tw

Abstract
The BLEU scores and translation fluency of current state-of-the-art SMT systems based on the IBM models are still too low for publication purposes. The major issue is that stochastically generated sentence hypotheses, produced through a stack decoding process, may not strictly follow the natural target language grammar, since the decoding process is directed by a highly simplified translation model and n-gram language model, and a large number of noisy phrase pairs may introduce significant search errors. This paper proposes a statistical post-editing (SPE) model, based on a special monolingual SMT paradigm, to translate disfluent sentences into fluent sentences. Instead of conducting a stack decoding process, sentence hypotheses are searched from fluent target sentences in a large target language corpus or on the Web to ensure fluency. Phrase-based local editing, if necessary, is then applied to correct the weakest phrase alignments between the disfluent and searched hypotheses using fluent target language phrases; such phrases are segmented from a large target language corpus with a global optimization criterion that maximizes the likelihood of the training sentences, instead of using noisy phrases combined from bilingually word-aligned pairs. With such search-based decoding, the absolute BLEU scores are much higher than those of automatic post-editing systems that conduct a classical SMT decoding process, and a significant number of disfluent sentences can be fully corrected into completely fluent versions. The evaluation shows that on average 46% of translation errors can be fully recovered, and the BLEU score can be improved by about 26%.

Keywords: Translation Fluency, Fluency-Based Decoding, Search-Based Decoding, Statistical Machine Translation, Automatic Post-Editing

1 Introduction and Motivation
1.1 Fluency Problems with Statistical Machine Translation
Translation fluency is a serious issue in current SMT research. Despite the research efforts of the past tens of years, performance is still far from satisfactory. In translating English to Chinese, for instance, BLEU scores [16] range only between 0.2 and 0.29 [22, 5, 7], depending on the test sets and the number of reference translations. Such translation quality is extremely disfluent for human readers. We therefore propose a statistical post-editing (SPE) model, based on a special monolingual SMT framework, for improving the fluency and adequacy of translated sentences.

The classical IBM SMT models [1, 2] formulate the translation problem of a source sentence F as finding the best translation E* from some stack-decoded hypotheses E, such that:

E* = argmax_E Pr(E | F) = argmax_E Pr(F | E) * Pr(E),  (1)

where E is the target sentence, F the source sentence, Pr(F | E) the translation model (TM), and Pr(E) the language model (LM).

The argmax operation implies generating candidate target sentences E of F so that the SMT model can score each one, based on the TM and LM scores, and select the best candidate. This candidate generation is known as the decoding process. The conventional decoding process is significantly affected by the TM and LM scores; only those candidates that satisfy the underlying criteria of the TM and LM will receive high scores. Unfortunately, to make SMT computationally feasible, the TM and LM are highly simplified. Therefore, the candidates are not really generated based on the target language grammar, but based on the model constraints. For instance, the classical SMT model does not prefer word re-ordering with long-distance movement; such candidates are then not generated, regardless of the possibility that the target grammar might prefer them.

1.2 LM and Decoding
There are three directions for improving translation fluency within the classical SMT model, Equation (1). Firstly, we can improve the translation model (TM) to fit the source-target transfer process. Secondly, we can improve the language model (LM) to respect the target language grammar. Finally, we could try to generate better and much more fluent candidates in the decoding process, so that the TM and LM can select the real best one from fluent candidates rather than from junk sentences. The research community normally focuses on the TM and LM components, assuming that there are good ways to generate good candidates for scoring. Actually, most attention is paid to the translation model; the LM and decoding have not gained the same weight. In particular, people tend to think that the candidate generation process guided by the highly simplified TM and LM will eventually generate good candidates. Unfortunately, to make the computation feasible, the classical SMT models have very low expressive power in the TM and LM components: the TM is formulated in terms of the fertility probability, lexical translation probability, and distortion probability [1, 2], and a word-based 3-gram model is usually used as the language model, since longer n-grams come at higher training cost and severe data sparseness. In fact, the candidates of the target sentence, which are hidden in the argmax operator, are generated as a stochastic process in most SMT systems today. Starting from a particular state, the next word is predicted based on a local n-gram window within a distance allowed by the distortion criterion; the possible paths are explored using stack decoding, beam search, or other searching algorithms. The candidates generated in this way thus may be only piecewise consistent with the target language grammar, but may not be really globally

grammatical or fluent. This means that the TM and LM are not scoring a complete sentence but some segments pasted together by the n-gram LM, which is then not likely to be fluent all the time. This decoding process therefore sometimes falls into a garbage-in, garbage-out situation. No matter how well-formulated the TM and LM may be, if the stochastically generated candidates do not include the correct and fluent translation, the system will eventually deliver a garbage output, that is, a disfluent sentence, as the best one. This kind of error is known as search error. Because the TM and LM have limited expressive power to describe the real criteria that drive the generation process, the decoding process might only generate noisy sentence segments, and thus disfluent sentences, for scoring. This can lead to bad performance in terms of BLEU score or human judgments.

Phrase-based SMT has partially resolved the expressive-power issue of the TM and LM by using longer word sequences. However, the acquisition of phrases has its own problems. In particular, most phrase-based SMT acquires phrase pairs by conducting bilingual word alignment first. Adjacent words are then connected in some heuristic ways [12, 13, 14, 15], which have no direct link with the source or target grammar, to form the phrases. The phrases generated in this way normally do not satisfy any global optimization criterion related to the target grammar, such as maximizing the likelihood of target language sentences. The quality of such phrases is therefore greatly affected by the word alignment accuracy, and the phrases on the target language side may not really respect the target grammar. Under such circumstances, a huge number of noisy phrases will be introduced, significantly enlarging the search space. The stochastically generated phrase sequences thus may not correspond to good candidate sentences either.

To summarize, the application of word-for-word or phrase-to-phrase translation (with noisy phrases), plus a little bit of local word/phrase re-ordering, in classical SMT might not generate fluent target sentences that respect the target grammar. In particular, many target-specific lexical items and morphemes cannot be generated through this kind of model, or, if they are, they may be generated in very special ways. This could be a significant reason why SMT models have not worked well after such a long period of research.

The implication is that we might have to examine the argmax operation, that is, the decoding or searching process, in the classical SMT models more carefully. We should try decoding methods that respect the target grammar more, instead of following the criteria set forth by the TM and LM of the SMT model, which encode a highly simplified version of the target grammar. Only with a decoding process that respects the target grammar will the system generate fluent candidates in the first place, before submitting the candidates to the TM and LM for scoring. Furthermore, a phrase-based language model, instead of a word-based n-gram model for the target side, may improve the fluency of machine translation further, since more context words can be consulted, provided the phrases are not noisy. To avoid a huge number of noisy source-dependent phrases that might be harmful for fluency and searching, such phrases may better be trained from a target corpus, instead of being acquired from bilingually word-aligned chunks.

1.3 Statistical Post-Editing Model Based on Monolingual SMT
Instead of developing new models for the TM and LM, an alternative way to improve translation fluency is to cascade an automatic post-editing (APE) module onto the translation output of an MT/SMT system. While the classical SMT models may not be suitable for directly generating fluent translations, due to the limited expressive power of the TM and LM and the search errors of the decoding process, an SMT or a variant of it may be sufficient for re-ranking hypotheses for automatic post-editing purposes, if an appropriate hypothesis generation mechanism is available. Actually, we can regard a post-editing process as a translation process from disfluent sentences to fluent sentences. This is particularly true if the disfluency is limited to local editing operations like insertion of target-specific morphemes, deletion of source-specific function words, and lexical substitution among many possible lexical choices. These kinds of errors are often seen in MT/SMT systems.

Inspired by the above ideas, this paper proposes a statistical post-editing (SPE) model based on a monolingual SMT paradigm for improving the translation fluency of an MT system, instead of improving the TM directly. In this SPE model, the searching or decoding is a fluency-based search: we search for fluent translations, based on the lexical hints of the disfluent sentence, in a large target text corpus or on the Web, so all candidates will be fluent ones. The hypothesis re-ranked best by the SPE model then serves as the post-edited version of the disfluent sentence.

Sometimes, a searched sentence may not have a high enough translation score to justify itself as an appropriate translation. For instance, the target sentence pattern may be correct but different lexical choices may have been made. In this case, automatic local editing is applied to the weakest alignments to incrementally patch the target sentence pattern with the right target lexical items. By combining the grammatical (and fluent) sentence pattern of the searched sentence and the right lexical items from the disfluent sentence, the disfluent translation can be repaired into a fluent one incrementally. This may include local insertion, deletion, and lexical substitution operations over phrase pairs that are unlikely to be translations of each other. To really improve fluency incrementally, the local editing process is applied in a manner that monotonically increases the likelihood of the incrementally repaired sentence. To respect the target grammar further, the repair is phrase-based: a phrase-based n-gram language model (n=1) is used in the translation score, so that the likelihood of the repaired target sentence increases monotonically during local editing.

In parallel with the development of our work, a few APE systems have been proposed [7, 20, 21, 8] with good results. Publicly available SMT systems (like Portage PBMT, Moses, etc.) are used directly as the post-editing module. They are trained on human post-edited target sentences paired with their un-edited MT outputs to learn the translation knowledge between disfluent (source) and fluent (target) sentences [20]. Alternatively, they may be trained on standard parallel corpora (Europarl, News Commentary, Job Bank, Hansard, etc.) where the disfluent sentences are generated by a rule-based MT system (like SYSTRAN) or another SMT [21]. These works therefore require substantial human post-editing costs to train the SMT, or they need a sizable parallel corpus for training, which may not be available for many language pairs. In addition, they require an RBMT or SMT pre-trained for translating the source corpus, which may not be available for many language pairs either. Most importantly, these frameworks use

the same decoding process as well as the TM and LM of the original SMT to generate their post-editing hypotheses. Therefore, the previously discussed performance issues that apply to classical SMT also apply to such APE modules. Cascading an SMT as an APE module might imply using a system with low BLEU performance to correct outputs with low BLEU scores; the improvement could thus be substantially limited. This may be seen from the fact that the contribution of the APE becomes negligible as the training data is increased [21]. In contrast, we discard the stochastic decoding process, which might generate disfluent hypotheses, and instead search a large corpus for sentences highly similar to the disfluent sentence, thus obtaining raw hypotheses with high BLEU scores. Additional local editing further improves the fluency. Furthermore, our proposal can generate interesting error patterns automatically using the target language corpus alone. Therefore, the APE module can be constructed without a real MT system (although it would be better to have one in order to correct the specific errors of a specific system). The following sections discuss the formulation in more detail.

2 Problem Formulation for SPE
In our work, we adopt a statistical post-editing (SPE) model to translate disfluent sentences into fluent ones. As will be seen later, it can be trained with a monolingual SMT model. Given a disfluent sentence E' translated from a source sentence F, the automatic post-editing problem can be formulated as finding the most fluent sentence E* from some candidate sentences E such that:

E* = argmax_E Pr(E | E') = argmax_E Pr(E' | E) * Pr(E).  (2)

As usual, we refer to Pr(E' | E) as the translation model (TM), and Pr(E) as the language model (LM), of the SPE model. We thus encounter the same SMT problems of formulating the TM, the LM, and the decoding (or searching) process.

2.1 Order-Preserved Translation Model
The automatic post-editing problem is intuitively easier than SMT, since we can assume that the disfluency is due to some local editing errors, such as mis-insertion or mis-deletion of function words and wrong lexical choices. Under this assumption, we can formulate the TM as:

Pr(E' | E) = sum_A Pr(E', A | E) ~ max_A Pr(E', A | E) ~ Pr(E', A_s | E) = prod_{<Ep', Ep> in A_s} Pr(Ep' | Ep),  (3)

where A ranges over generic alignment patterns and A_s is the most likely, order-preserved alignment.

In Eqn. (3), the aligned phrase pairs are represented by Ep' and Ep for the disfluent and fluent versions, respectively. We assume that the most likely alignment A_s between E' and E, among all generic alignment patterns A, is an order-preserved (sequential) alignment between their constituents. We further assume that this most likely alignment has a much higher probability than other alignments, so that we do not have to sum over all generic alignment patterns. In the post-editing context, this assumption is reasonable if the disfluency results from simple local editing operations. In particular, if we use phrase-based alignment, the word order within phrases can be ignored, which makes the order-preservation assumption even more reasonable. We therefore take the TM to be the product of the probabilities of sequentially aligned target phrase pairs. The phrase segmentation model for dividing E' or E into phrases is detailed below together with the target phrase-based LM. Given the segmented phrases, the best sequential alignment can easily be found with a standard dynamic programming algorithm for finding the shortest path.

The TM of the SPE model is special in that its training corpus can easily be derived from a large monolingual corpus of fluent target sentences: automatically generating a disfluent version of the fluent monolingual corpus, based on some error model of the translation process, makes this possible. One can then acquire the model parameters for translating disfluent sentences into fluent ones through a training process similar to that of a standard SMT. In comparison with standard SMT training, which requires a parallel bilingual corpus, the monolingual corpus is much easier to acquire.

2.2 Target Phrase-Based Language Model
To respect the fluency of the target language in the decoding process, the language model score Pr(E) should be evaluated over long target language phrases Ep, instead of target words. The phrases should also be defined independently of the source language, so as not to introduce the huge number of noisy phrases that PBSMT normally does. The proposed LM for the current SPE, which is responsible for selecting fluent target segments, is therefore a phrase-based unigram model, instead of the widely used word-based n-gram model. In other words, we have

Pr(E) = prod_{Ep in E} Pr(Ep).

To avoid source-language dependency, we also decided not to define target phrases in terms of chunks of bilingually aligned words. Instead, the best target phrases are directly trained from the monolingual target corpus by optimizing the phrase-based unigram model. In other words, the best phrase sequence p* for an n-word sentence w_1..w_n is the sequence, among all possible phrase segmentations p_1..p_m, such that:

p* = argmax_{p_1..p_m} Pr(p_1..p_m | w_1..w_n) = argmax_{p_1..p_m} Pr(p_1..p_m).

Fortunately, extracting monolingual phrases with the phrase-based unigram model can be done easily. The training method is just like the word-based unigram word segmentation model [4], which has frequently been used in Chinese word segmentation tasks, and unsupervised training is straightforward. Upon convergence, a set of well-formed phrases is acquired. (This set of phrases will be called a phrase example base, PEB; phrases in the PEB will be used later in the Local Editing Algorithm for post-editing.)
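A minimal sketch (ours, not the authors' code) of decoding with the phrase-based unigram model: a Viterbi-style dynamic program picks the segmentation maximizing the product of phrase unigram probabilities. The dictionary phrase_logp of log-probabilities and the maximum phrase length are assumed inputs from the unsupervised training described above; unseen single words get a floor score so segmentation always succeeds.

    import math

    def segment(words, phrase_logp, max_len=5):
        n = len(words)
        best = [0.0] + [-math.inf] * n        # best[j]: best log-prob of words[:j]
        back = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(max(0, j - max_len), j):
                span = tuple(words[i:j])
                # Floor score (assumed value) keeps single words always usable.
                p = phrase_logp.get(span, -20.0 if j - i == 1 else None)
                if p is not None and best[i] + p > best[j]:
                    best[j], back[j] = best[i] + p, i
        phrases, j = [], n                    # recover the best segmentation
        while j > 0:
            phrases.append(tuple(words[back[j]:j]))
            j = back[j]
        return list(reversed(phrases))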

Since a phrase trained in this way can be longer than a 3-gram pattern, the modeling error can be reduced to some extent. Furthermore, the number of such phrases is much smaller than the number of randomly combined phrases acquired from word-aligned chunks, so the estimation error due to data sparseness is significantly reduced as well. Unlike rare parallel bilingual training corpora, such target language corpora are extremely large, so fluent phrases can be extracted easily. With phrases as the basic lexical unit, the SPE model reduces to

E* = argmax_E prod_{<Ep', Ep> in A_s} Pr(Ep' | Ep) * Pr(Ep).  (4)

Since a phrase can cover more than 3 words, the selected phrases might be more fluent than word trigrams; such phrases fit the target grammar better and therefore prefer more fluent target sentences in general.

2.3 Search-Based Decoding for Fluency
One key source of disfluency is the decoding process used in classical SMT. Most decoding processes regard target sentence generation as a stochastic process, and only a local context of finite window length is consulted while decoding; the target sentences generated this way are usually not fluent. Our work proposes to search for fluent translation candidates in a huge target sentence base or in Web documents, instead of using traditional decoding methods to generate the translation candidates. Since the large corpus and the Web documents are produced by native speakers, the target sentences thus searched are most likely fluent, with high BLEU scores.

Our current work simply uses a heuristic matching score to extract a set of candidate sentences for a disfluent sentence. The candidates are then re-ranked using the translation score defined by the SPE model. The best candidate is regarded as the post-edited version of the disfluent sentence if its translation score is higher than a threshold; otherwise, it is locally edited to incrementally increase its translation score. The matching score is simply the number of identical word tokens in the two sentences, normalized by the average length of the two sentences; in other words, it is the percentage of word matches between the two sentences. We searched for candidate translations in the Academia Sinica Word Segmentation Corpus, ASWSC-2001 [6], as well as in Chinese webpages indexed by Google. (We assume that the target language is Chinese.) Different query strings result in different returned pages. In total, we tried 4 models for searching: (1) Model C: search the corpus (only) for the Top-N hypotheses (N=20), where the length difference must not be greater than two words. (2) Model C+W: search the corpus and the Web for additional N hypotheses by submitting the complete disfluent target sentence as-is to Google. (3) Model C+W+P: include partial matches against substrings of the disfluent target sentence, where L-1 words of the disfluent sentence are successively deleted and the remainders submitted as query strings to the search engine (L: number of words in the disfluent sentence). (4) Model C+W+Q: adjacent words in the deleted disfluent sentence are quoted as a single query token before submission, so that the search engine matches more exactly. Even with such a heuristic search, a substantial number of fluent sentences similar to the disfluent sentences can be found for re-ranking and local editing.
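The heuristic matching score and Model C retrieval are straightforward; here is a minimal sketch under the stated definitions (token overlap normalized by average length, top-N retrieval with a length-difference filter).

    from collections import Counter

    def match_score(disfluent, candidate):
        """Identical-token count normalized by the average sentence length."""
        overlap = sum((Counter(disfluent) & Counter(candidate)).values())
        return overlap / ((len(disfluent) + len(candidate)) / 2.0)

    def search_corpus(disfluent, corpus, n=20, max_len_diff=2):
        """Model C: top-N corpus sentences within the length-difference bound."""
        pool = [s for s in corpus if abs(len(s) - len(disfluent)) <= max_len_diff]
        return sorted(pool, key=lambda s: match_score(disfluent, s),
                      reverse=True)[:n]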

2.4 Local Editing
If an exact translation is found during searching, the searching process itself is exactly a perfect translation process. If highly similar sentences are found, simple lexical substitution or automatic post-editing [9, 11] might patch the searched fluent sentences into correct translations. Some previous works on automatic post-editing have been restricted to special function words, such as the English articles the/a [9, 10], the Japanese case markers, and the Chinese classifiers or the particle de [8]. The automatic post-editing model here is intended to resolve general editing errors that are frequently made by a machine translation system.

Briefly, the best sentence E*_eb among the searched candidates is output as the translation of the disfluent translation E' if the translation score associated with the SPE model is higher than a threshold. (The set of candidate translation sentences is called its example base, hence the subscript eb.) Otherwise, the automatic local editing algorithm finds the weakest phrase alignments and fixes them one by one to maximize the translation score. An aligned phrase pair <Ep', Ep> is said to be weak if its local alignment score Pr(Ep' | Ep) x Pr(Ep) is small and thus contributes little to the global translation score of the sentence pair <E', E>. When the weakest pair (Ep'-, Ep-), with the lowest local alignment score, is identified, we should try to replace Ep-, the most questionable phrase in the fluent (yet incorrect) example sentence E, with some candidate that makes the patched example sentence more likely to be the translation of E'.

There are several reasons why the alignment (Ep'-, Ep-) may be the weakest. First, Ep- might not be the right phrase, and should be replaced by Ep'- to make the fluent sentence E also the correct translation of E'. Second, Ep'- might not be the correct translation of some source phrase; in this case, the most likely translations of Ep'-, called Ep+, should replace Ep-. Third, Ep- might already be a more appropriate phrase than Ep+; in this case it should be retained and the next weakest alignment pair repaired. As a result, the potential candidates for replacing Ep- include Ep'-, Ep+, and Ep- itself. The best substitution is the phrase that maximizes Pr(Ep' | Ep) x Pr(Ep). Actually, many phrases in the PEB can serve as a more fluent version of Ep'-; currently, the 20 best matches play the role of Ep+ during local editing. The local editing algorithm successively edits weaker alignments until the (monotonically increasing) translation score is above some threshold. The algorithm is outlined as follows.

Local Editing Algorithm
Input: E' and E*_eb.
Step 1: Find the weakest alignment entry Ep'- in E' from the <E', E*_eb> alignment:
  Ep'- = argmin_{Ep' in E'} Pr(Ep' | Ep) x Pr(Ep).
Step 2: Identify Ep-, the phrase in E*_eb aligned with Ep'-: Ep- = align(Ep'-).
Step 3: Find the fluent counterpart phrases Ep+ of Ep'- from the PEB: Ep+ = PEB(Ep'-).
Step 4: Select the best substitution Eps among Ep'-, Ep+, and Ep-, maximizing the translation score:
  Eps = argmax_{Ep in {Ep'-, Ep+, Ep-}} Pr(E' | E) x Pr(E).
Step 5: Cut Ep- from E*_eb and paste Eps into E*_eb:
  E*_eb <- E*_eb - (Ep-) + (Eps).
(Repeat until the translation score Pr(E' | E) x Pr(E) reaches some threshold.)

Constrained Decoding
Note that local editing is applied only to a local region of the example sentence, based on the disfluent sentence. Intuitively, sentences searched from a text corpus or from the Web are much more fluent than stochastically combined sentences from an SMT decoding module. Even if local editing is required, the repair is quite local, and the search space for repairing is significantly constrained by the words of the most likely example sentence. Such a searching and local editing combination can thus be regarded as constrained decoding. The search error can be reduced significantly in comparison with the large search space of the decoding process of a typical SMT.
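A minimal procedural sketch of the Local Editing Algorithm above. The scoring function (Pr(Ep'|Ep) x Pr(Ep) per aligned pair), the candidate generator (returning Ep'-, the Ep+ matches from the PEB, and Ep- itself), and the threshold are all assumed to be supplied; scores are taken to be log-domain so that summing corresponds to the product Pr(E'|E) x Pr(E).

    def local_edit(aligned_pairs, tm_lm_score, candidates, threshold,
                   max_steps=10):
        """aligned_pairs: [(Ep_disfluent, Ep_fluent), ...] for <E', E*_eb>."""
        pairs = list(aligned_pairs)
        for _ in range(max_steps):
            scores = [tm_lm_score(dp, fp) for dp, fp in pairs]
            if sum(scores) >= threshold:          # translation score high enough
                break
            k = min(range(len(pairs)), key=scores.__getitem__)  # weakest link
            dp, fp = pairs[k]
            # Try Ep'-, the PEB matches Ep+, and Ep- itself; keep the best.
            best = max(candidates(dp, fp), key=lambda c: tm_lm_score(dp, c))
            pairs[k] = (dp, best)                 # cut the weak phrase, paste
        return [fp for _, fp in pairs]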

2.5 Generating Faulty Sentences
The TM parameters can actually be trained from an E'-to-E monolingual machine translation system, where E' is derived by applying to E some editing operations commonly found in the SMT translation process: insertion of target-specific lexical items, deletion of source-specific lexical items, local reordering of words, and substitution of lexical items. In the current work, we apply three kinds of editing operations to the fluent sentences of a monolingual corpus to simulate errors frequently found in an MT system. The fluent and disfluent versions are then phrase-segmented so that the sentences are represented by phrase tokens (instead of word tokens). Such fluent-disfluent (E-E') target sentence pairs are then trained using the GIZA++ alignment tools [12, 13, 14, 15]. Upon convergence, the translation model between the sentences to be post-edited and their correct translations can readily be acquired.

The three editing operations are as follows. (1) Insertion: insertion errors occur when an MT system translates a source word into a target word when it should not be translated at all. For instance, the English infinitive to need not be translated into any Chinese word most of the time, but a bilingual dictionary may indicate the possibility of translating it into (chu); we therefore automatically insert such Chinese words to simulate this error. (2) Deletion: deletion errors occur when a target-specific word is not generated in the translation. For instance, Chinese classifiers have no correspondence in English; we therefore delete classifiers from fluent Chinese sentences to create instances with deletion errors. (3) Substitution: when a translation system chooses a wrong lexical item, a typical substitution error occurs. To simulate substitution errors, Chinese words in the fluent sentences are looked up in an English-Chinese dictionary, and Chinese words that are also translations of the same English word are substituted; for instance, the English word problem has several Chinese translations, and these are used to simulate the substitution errors. In our simulation, the top-30 most frequently used Chinese words are adopted for substitution errors. With disfluent sentences created from fluent sentences through these frequently encountered translation errors, an automatic statistical post-editing model can readily be trained using state-of-the-art alignment tools.
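A minimal sketch of the error-simulation step; the tiny pinyin word lists are illustrative placeholders for the paper's actual insertion, classifier-deletion, and dictionary-based substitution tables.

    import random

    INSERTABLE = ["chu"]                    # over-translations of source words
    DELETABLE = {"ge", "zhi", "tiao"}       # stand-ins for Chinese classifiers
    SUBSTITUTIONS = {"wenti": ["nanti", "yiwen"]}   # same-English-word rivals

    def corrupt(sentence, seed=0):
        """Apply insertion/deletion/substitution errors to a fluent sentence."""
        rng = random.Random(seed)
        out = []
        for w in sentence:
            if w in DELETABLE and rng.random() < 0.5:
                continue                                  # deletion error
            if w in SUBSTITUTIONS and rng.random() < 0.5:
                w = rng.choice(SUBSTITUTIONS[w])          # substitution error
            out.append(w)
            if rng.random() < 0.05:
                out.append(rng.choice(INSERTABLE))        # insertion error
        return out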

3 Experiments
To assess the performance of the current SMT-based SPE model, about 300,000 word-segmented Chinese sentences from Academia Sinica [6] were used as the target sentence corpus. The corpus has about 2,450,000 word tokens, and the vocabulary size is about 83,000 word types. 10% of the sentences are used as the test set and 90% for training. The 3 types of errors are applied to the test sentences independently; for each error type, 100 sentences are randomly selected for evaluating automatic post-editing. Performance is evaluated with two criteria. The first criterion is the number (percentage) of fully corrected disfluent sentences in the test set; by fully corrected, we mean that the sentence corrected by the statistical post-editing (SPE) system is completely identical to its original fluent version. Table 1 shows the performance in terms of error correction capability.

Table 1. Number of fully corrected sentences with different searching models (N=100); rows: Substitution, Deletion, Insertion, Average; columns: searching models C, C+W, C+W+P, C+W+Q.

Note that, even with this very simple-minded searching method, the SPE was able to correct, on average, about 48% of the faulty sentences to their fluent versions when the search space is sufficiently large (with the C+W+Q searching model). The performance increases with the search space, and is increased at most by 62%, 2%, and 7.5%, respectively, for the substitution, deletion, and insertion errors when the Web corpus is included in the search space. Obviously, substitution is the hardest to resolve, while insertion errors seem to be the easiest.

The second evaluation criterion is the improvement in BLEU score with respect to the uncorrected test sentences. Table 2 shows the BLEU scores for the various searching models; the first column, labeled E'(ts), lists the BLEU scores of test sentences that have not been post-edited. By searching for fluent translations and applying local editing, the BLEU scores improve with increasing search space. The best performance increases the BLEU scores by 15%, 38%, and 26%, respectively, for the three types of errors; on average, the improvement is about 26%, which is substantial. In absolute terms, the changes are 9.4, 22.8, and 6.9 points in BLEU score, respectively.

Table 2. BLEU scores for the various searching models; rows: Substitution, Deletion, Insertion; columns: E'(ts), C, C+W, C+W+P, C+W+Q.

Note that, with search-based decoding, the absolute BLEU scores are much higher than those of automatic post-editing systems that simply cascade a classical SMT module onto the output of an MT/SMT [20, 21, 8]. Although the experimental settings are not the same and thus cannot be compared directly, higher absolute BLEU scores are to be expected, since the searched sentences are almost always fluent, whether post-edited or not. Obviously, with the same training corpus, the search space and the searching method play important roles in improving performance, and the inclusion of the Web corpus improves performance significantly. It was reported in [19] that well-formulated query strings can effectively improve search accuracy; therefore, with a better searching strategy, part of the translation problem for fluent translation might be resolved as a searching and automatic post-editing problem. Currently, a statistical searching model specific to fluency-based decoding is being developed.

4 Concluding Remarks
In this paper, we propose not to generate sentence hypotheses for APE systems using the conventional SMT decoding process, since such a decoding process tends to lead to an open-ended search space, and it is not easy to generate fluent sentence hypotheses under such circumstances due to the large search error. We propose to search for sentence hypotheses in a large target text corpus or on the Web, based on the words of the disfluent translations, since the potential candidates will mostly be fluent. A statistical post-editing model is also proposed to re-rank the searched sentences, and a local editing algorithm is proposed to automatically recover the translation errors when the searched sentence is not a good

translation. With the SPE, the local editing algorithm tries to maximize the translation score with each local edit; it therefore improves the translation fluency incrementally. Since the TM can be trained from an automatically generated fluent-disfluent parallel corpus, training such a system is easy. The evaluation shows that, on average, 46% of translation errors can be fully recovered, and the BLEU score can be improved by about 26%. The absolute BLEU is also high with the search-based decoding process, in comparison with the conventional decoding process.

References
[1] Brown, Peter F., J. Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, "A statistical approach to machine translation," Computational Linguistics, 16(2):79-85, 1990.
[2] Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, 19(2):263-311, 1993.

[3] Chang, Jing-Shin and Chun-Kai Kung, "A Chinese-to-Chinese Statistical Machine Translation Model for Mining Synonymous Simplified-Traditional Chinese Terms," Proceedings of Machine Translation Summit XI, Copenhagen, Denmark, 10-14 September 2007.
[4] Chiang, Tung-Hui, Jing-Shin Chang, Ming-Yu Lin and Keh-Yih Su, "Statistical Models for Word Segmentation and Unknown Word Resolution," Proceedings of ROCLING-V, Taipei, Taiwan, R.O.C., 1992.
[5] Chiang, David, "A Hierarchical Phrase-Based Model for Statistical Machine Translation," Proc. ACL-2005, 2005.
[6] CKIP 2001, Academia Sinica Word Segmentation Corpus, ASWSC-2001, Chinese Knowledge Information Processing Group, Academia Sinica, Taipei, Taiwan, ROC, 2001.
[7] Dugast, L., J. Senellart, P. Koehn, "Statistical Post-Editing on SYSTRAN's Rule-Based Translation System," Proceedings of the Second Workshop on Statistical Machine Translation (2nd WSMT), Prague, Czech Republic, June 2007.
[8] Isabelle, P., C. Goutte, M. Simard, "Domain Adaptation of MT Systems Through Automatic Post-Editing," Proceedings of MT Summit XI, Copenhagen, Denmark, 10-14 Sept. 2007.
[9] Knight, Kevin, and Ishwar Chander, "Automated Postediting of Documents," in Proceedings of the Twelfth National Conference on Artificial Intelligence, CA, USA, 1994.
[10] Lee, J., "Automatic Article Restoration," in Proc. HLT-NAACL 2004 Student Research Workshop, Boston, MA, May 2004.
[11] Llitjós, Ariadna Font, "Automatic Post-Editing to Improve MT," Automated Post-Editing Workshop, AMTA, Boston, USA, August 2006.
[12] Och, Franz Josef, Christoph Tillmann, and Hermann Ney, "Improved Alignment Models for Statistical Machine Translation," in Proc. EMNLP/WVLC, 1999.
[13] Och, Franz Josef and Hermann Ney, "A comparison of alignment models for statistical machine translation," Proc. COLING, Saarbrücken, Germany, August 2000.
[14] Och, Franz Josef and Hermann Ney, "Improved statistical alignment models," Proceedings of the 38th Annual Meeting of the ACL, 2000.
[15] Och, Franz Josef and Hermann Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, 30:417-449, 2004.
[16] Papineni, K., S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of ACL-2002, 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[17] Shen, W. et al., "The MIT-LL/AFRL IWSLT-2006 MT System," Proc. of the International Workshop on Spoken Language Translation (IWSLT) 2006, pp. 71-76, Kyoto, Japan, 27 November 2006.
[18] Shia, Min-Shiang, "Using Phrase Structure and Fluency to Improve Statistical Machine Translation," Master Thesis, Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC, June 2007.
[19] Shih, Shu-Fan, "A Query Augmentation Model for Answering Well-Defined Questions," Master Thesis, Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan, ROC, July 2007.
[20] Simard, M., C. Goutte, P. Isabelle, "Statistical Phrase-Based Post-Editing," Proceedings of NAACL-HLT 2007, Rochester, NY, April 2007.
[21] Simard, M., N. Ueffing, P. Isabelle, R. Kuhn, "Rule-Based Translation with Statistical Phrase-Based Post-Editing," Proceedings of the Second Workshop on Statistical Machine Translation (2nd WSMT), Prague, Czech Republic, June 2007.
[22] Zhou, Yu, Chengqing Zong, and Bo Xu, in Proceedings of IEEE International Conference on Systems, Man & Cybernetics (SMC 2004), Hague, Netherlands, 2004.


Minimally Supervised Question Classification and Answering based on WordNet and Wikipedia

Joseph Chang, Tzu-Hsi Yen, Richard Tzong-Han Tsai*
Department of Computer Science and Engineering, Yuan Ze University, Taiwan
{s95533, ...} (*corresponding author)

Abstract
In this paper, we introduce an automatic method for classifying a given question using broad semantic categories in an existing lexical database (i.e., WordNet) as the class tagset. For this, we also constructed a large-scale entity supersense database that maps over 1.5 million entities to the 25 WordNet lexicographer's files (supersenses) from the titles of Wikipedia entries. To show the usefulness of our work, we implement a simple redundancy-based system that takes advantage of the large-scale semantic database to perform question classification and named entity classification for open-domain question answering. Experimental results show that the proposed method outperforms the baseline of not using question classification.

Keywords: question answering, question classification, semantic category, WordNet, Wikipedia.

1. Introduction
Question classification is considered crucial to the question answering task due to its ability

to eliminate answer candidates irrelevant to the question. For example, answers to person questions (e.g., Who wrote Hamlet?) should always be a person (e.g., William Shakespeare). Common classification strategies include semantic categorization and surface pattern identification. In order to fully benefit from question classification techniques, answer candidates should be classified the same way as questions.

Surface pattern identification methods classify questions into sets of word-based patterns; answers are then extracted from retrieved documents using these patterns. Without the help of external knowledge, surface pattern methods suffer from a limited ability to exclude answers in irrelevant semantic classes, especially when using smaller or heterogeneous corpora. Another common approach uses external knowledge to classify questions into semantic types. In some previous QA systems that deploy question classification, named entity recognition (NER) techniques are used for selecting answers from classified candidates. State-of-the-art NER systems produce near-human performance; good results are often achieved by handcrafted complex grammar models or large amounts of hand-annotated training data. However, most high-performance NER systems deal with a specific domain, focus on homogeneous corpora, and support a small set of NE types. For example, in the Message Understanding Conference 7 (MUC-7) NER task, the domain is airplane crashes and rocket/missile launches, using news reports as the corpus, and there are only three NE classes containing seven subclasses: ORG, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT. Notice that of the seven subclasses, only three are NEs of physical objects; the others are number-based entities. This is apparently insufficient for candidate filtering in general question answering. Owing to the need for wider-range NE types, some later-proposed NE taxonomies consist of up to 200 subclasses, but NER systems targeting such fine-grained NE classes may not be precise enough to achieve high performance.

The number of supported classification types greatly influences the performance of QA systems. A coarse-grained classification, achieving higher precision, may still be weak in excluding improper answers from further consideration. A fine-grained classification may seem a good approach, but the cost of high-precision classification may be too high to produce actual gains in QA systems. Moreover, in open-domain QA, answers are not necessarily NEs, nor can they always be captured by simple surface patterns; using a small set of NE types to classify questions has its limits. We randomly analyzed 100 question/answer pairs from the Quiz-zone Web site: only 70% of the answers are NEs. This shows that being able to classify common nouns is still very important in developing QA systems.

In order to support more general question answering, where the answer can be an NE or a common noun, we take the approach of using finer-grained semantic categories in an existing lexical database (i.e., WordNet). WordNet is a large-scale, hand-crafted lexical ontology database widely used in natural language processing tasks. It provides a taxonomy of word senses and relations for 155,327 basic vocabulary items that can be used as a semantic taxonomy for entity classification.
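For in-WordNet terms, the 25-way supersense of an entity can be read off the lexicographer file of its senses; a minimal sketch with NLTK is shown below (NLTK and its WordNet data are assumed installed, and out-of-WordNet entities are exactly the gap the Wikipedia-based extension described later addresses).

    from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data

    def supersense(entity):
        """Return the lexicographer file (supersense) of the first noun sense."""
        synsets = wn.synsets(entity.replace(" ", "_"), pos=wn.NOUN)
        return synsets[0].lexname() if synsets else None

    print(supersense("Rocky Mountains"))   # a location-like supersense
    print(supersense("hard disk"))         # e.g. 'noun.artifact'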

answer validation, and that more entities, especially NEs, are needed to achieve reasonable coverage for answer candidate filtering.

With this in mind, we turn to Wikipedia, an online encyclopedia compiled by millions of volunteers all around the world, consisting of articles of all kinds. It has become one of the largest reference tools ever. It is only natural that many researchers have used Wikipedia to help perform the QA task. Using WordNet semantic categories and the rich information in Wikipedia, we propose a minimally supervised question classification method targeting the 25 WordNet lexicographer's files. Experimental results show promising precision and recall rates. The method involves extending WordNet coverage, producing the training data automatically from question/answer pairs, and training a maximum entropy model to perform the classification.

The rest of the paper is organized as follows. In the next section, we review related work in question classification and question answering. In Section 3 we explain the proposed method in detail. Then, in Section 4 we report experimental results, and we conclude in Section 5.

2. Related Work

The Text REtrieval Conference (TREC) has been one of the major active research conferences in the field of question answering. The early tasks in the question answering track of TREC focused on finding documents that contain the answer to the input question; no further extraction of exact answers from the retrieved documents was required. In an effort to foster more advanced research, the TREC 2005 QA Task focused on systems capable of returning exact answers rather than just the documents containing answers. Three types of questions are given: FACTOID, LIST, and OTHER. For every set of questions a target text is also given as the context of the set. LIST questions require multiple answers for the topic, while FACTOID questions require only one correct answer; therefore, many consider LIST questions easier. More recent TREC QA tasks focus on complex, interactive question answering systems (cQA). In cQA tasks, fixed-format template questions are given (e.g., What evidence is there for transport of [drugs] from [Mexico] to [the U.S.]?). Complex questions are answerable with several sentences or clauses (e.g., United States arrested 67 people - including 26 Mexican bankers). The design of an interactive query interface is also a part of this task.

In this paper, we focus on the issue of classifying questions in order to effectively identify potential answers to FACTOID and LIST questions. More specifically, we focus on the first part of the question answering task, namely identifying the semantic classes of the question (and answer) that can be used to formulate an effective query for document retrieval and to extract answers from the retrieved documents.

The body of QA research most closely related to our work focuses on the framework of representing types of questions and automatically determining question types from the given question. Ravichandran and Hovy [2002] proposed a question classification method that does not rely on external semantic knowledge, but rather classifies a question into different sets of

surface patterns, e.g., "ENTITY was born in ANSWER", which requires ENTITY as an anchor phrase from the given question and imposes no constraint on the semantic type of ANSWER. In contrast, we use a sizable set of question and answer pairs to learn how to classify a given question into a small number of types drawn from the broad semantic types in the existing lexical knowledge base of WordNet.

In a study more closely related to our work, Ciaramita and Johnson [2003] used WordNet for tagging out-of-vocabulary terms with supersenses for question answering and other tasks. They discovered that it is necessary to augment WordNet by employing complex inferences involving world knowledge. We propose a similar method, WikiSense, that uses Wikipedia titles to automatically create a database that extends WordNet by adding new Wikipedia titles tagged with supersenses.¹ Our method, which we describe in the next section, uses a different machine learning strategy and contextual setting under the same representational framework.

Once the classes of the given questions have been determined, typical QA systems attempt to formulate and expand the query for each type of question or on a question-by-question basis. Kwok et al. [2001] proposed a method that matches the given question heuristically against a semi-automatically constructed set of question types in order to transform the question into effective queries, and then extracts potential answers from retrieved documents. Agichtein, Lawrence, and Gravano [2004] used question phrases (e.g., "what is a" in the question "What is a hard disk?") to represent the question types and learned query expansion rules for each question type. Prager et al. [2002] describe an automatic method for identifying the semantic type of expected answers. In general, query expansion is effective in bringing more relevant documents into the top-ranked list; however, its contribution to the overall question answering task might be only marginal. In contrast to the previous work, we do not use question types to expand queries, but rather use question types to filter and re-rank potential answers, which may contribute more directly to the performance of question answering.

Indeed, effective explicit question classification is crucial for pinpointing and ranking answers in the final stage of answer extraction. Ravichandran and Hovy [2002] proposed a method for learning untyped, anchored surface patterns in order to extract and rank answers for a given question type. However, as they pointed out, without external semantic information, surface classification suffers from extracting answers of the improper class. For example, a where-is question (e.g., Where is the Rocky Mountains?) may be classified to the pattern "ENTITY in ANSWER" ("Rocky Mountains in ANSWER"), but with the retrieved text "...took photos of Rocky Mountains in the background when visiting...", the system may mistakenly identify "background" as the answer. Intuitively, by imposing a semantic type of LOCATION on answers, we can filter out such noise ("background" belongs to the type COGNITION according to WordNet). In contrast, we do not rely on anchor phrases to extract answers but rather use question types and redundancy to filter potential answers.

Another effective approach to extracting and ranking answers is based on redundancy. Brill, Lin, Banko, Dumais and Ng [2001] proposed a method that uses redundancy in two ways. First, relevant relation patterns (linguistic formulations) are identified in the retrieved documents and their redundancies are counted. Second, answer redundancy is used to extract relevant

¹ The data of WikiSense will be made available to the public in the near future.

answers. The distance between answer candidates and query terms is also considered in the proposed method through re-weighting. In our QA system, we use a similar answer-redundancy approach as our baseline.

In contrast to the previous research in question classification for QA systems, we present a system that automatically learns to assign multiple types to a given question, with the goal of maximizing the probability of extracting answers to the given question. We exploit the inherent regularity of questions and the more or less unambiguous answers in the training data, and use semantic information in WordNet augmented with rich named entities from Wikipedia.

3. Proposed Methods

In this section, we describe the proposed method for supersense tagging of Wikipedia article titles, minimally supervised question classification, and a simple redundancy-based QA system for evaluation.

3.1 Problem Statement and Datasets

We focus on deploying question classification to develop an open-domain, general-purpose QA system. Wikipedia titles, Wikipedia categories, and YAGO are used in the process of generating WikiSense. For question classification, the 25 lexicographer's files in WordNet (supersenses) are used as the target class tagset. Both WordNet and WikiSense are used to generate the training data for classifying questions.

person, cognition, time, event, feeling, communication, possession, attribute, quantity, shape, artifact, location, object, motive, plant, act, substance, process, animal, relation, food, state, phenomenon, body, group

Table 1. The 25 lexicographer's files in WordNet, or supersenses.

At run time, we continue to use both WikiSense and WordNet for answer candidate filtering. The Web is used as the corpus, and Google is used as the information retrieval engine.

1) Generation of a large semantic category from Wikipedia titles (WikiSense) (Section 3.2.1)
2) Training of the question classifier using WikiSense and WordNet (Section 3.2.2)
3) Redundancy QA system with question classification (Section 3.3)

Fig 1. Outline of the proposed method for QA system construction.

3.2 Training Stage

The training stage of the proposed QA system consists of two main steps: generation of a large-scale semantic category using Wikipedia (WikiSense) and training of a fine-grained question classifier using WikiSense and WordNet. Figure 1 shows the steps of our training process and QA system.

3.2.1 Automatic Generation of a Large Scale Semantic Category from Wikipedia

In the first stage of the training process (Step (1) in Figure 1), we generate a large-scale, finer-grained supersense semantic database from Wikipedia. Wikipedia currently consists of over 2,900,000 articles. Every article in Wikipedia is hand-tagged by volunteers with up to a few dozen categories. There are 363,164 different categories in Wikipedia; some are used in many articles, while many are used in only a handful of articles. These categories are a mixed bag of subject areas, attributes, hypernyms, and editorial notes. In order to utilize the information provided in Wikipedia categories, Suchanek, Kasneci, and Weikum [2007] developed YAGO as an ontology with links from Wikipedia categories to WordNet senses, thereby resolving the ambiguities that exist in category terms (e.g., "Capitals in Asia" is related to capital city, while "Venture Capital" is related to fund). Although YAGO only covers 50% (182,945) of the Wikipedia categories, these categories cover a substantial part of Wikipedia articles. Using this characteristic in combination with YAGO, we use voting to heuristically determine which of the 25 WordNet lexicographer files each title belongs to. Figure 2 shows the algorithm for categorizing Wikipedia titles using their Wikipedia categories and YAGO.

procedure WikiSense(Wikipedia, YAGO, WordNet)
    Declare Tags as list
    Declare Results as list
    for each Article in Wikipedia:
        Title := title of Article                          (1)
        Initialize Vote as an empty dictionary
        for each Category in Article:                      (2)
            if Category is supported by YAGO:              (3a)
                WordNetSense = YAGO(Category)
                Append WordNetSense to Tags
                WordNetSuperSense = WordNet(WordNetSense)  (3b)
                Vote[WordNetSuperSense]++                  (4)
        Class := supersense with most votes in Vote        (5)
        append <Title, Class, Tags> to Results             (6)
    return Results

Fig 2. Generation of WikiSense using Wikipedia titles/categories and YAGO.

For every article in Wikipedia, we use a dictionary to keep track of which supersense has the highest number of votes (Step (1)). In Step (2), every category in the article is checked for whether it is supported by YAGO. Supported categories are then transformed into WordNet senses through YAGO in Step (3a). The transformed senses are then mapped by WordNet to their corresponding supersenses in Step (3b), and each such supersense receives one vote (Step (4)). Once all categories have been checked, the title and the supersense with the highest number of votes are recorded; we also record all the transformed WordNet senses for future use (Step (5)). After all the articles in Wikipedia are processed, all the recorded results are returned in Step (6). In the entire process, WordNet is only used to transform a word sense into its supersense (lexicographer file).

We show the classification process and results for three example titles in Wikipedia in Table 2. None of these titles are in the WordNet vocabulary.

Wiki Title: Zenith Electronics
Categories: Consumer_electronics_brands, Electronics_companies_of_the_United_States, Companies_based_in_Lake_County_Illinois, Amateur_radio_companies, Companies_established_in_1918, Goods_manufactured_in_the_United_States
Senses: company#1 (3), electronics_company#1 (1), good#1 (1), trade_name#1 (1)
Supersense: noun.group (4), noun.attribute (1), noun.communication (1)

Wiki Title: Paul Jorion
Categories: Consciousness_researchers_and_theorists, Artificial_intelligence_researchers, Belgian_writers, Belgian_sociologists, Belgian_academics
Senses: research_worker#1 (2), writer#1 (1), sociologist#1 (1), academician#3 (1)
Supersense: noun.person (5)

Wiki Title: Hsinchu
Categories: Cities_in_Taiwan
Senses: city#1 (1)
Supersense: noun.location (1)

Table 2. Examples of Wikipedia title classification for generating WikiSense.
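As a concrete companion to Figure 2 and Table 2, the following minimal Python sketch implements the voting step. The plain-dictionary inputs (yago_map, supersense_of) and the toy Hsinchu example are assumptions made for illustration; they are not the authors' actual data structures.

    from collections import Counter

    def wikisense(articles, yago_map, supersense_of):
        """Tag each Wikipedia title with the supersense that wins the category vote.

        articles      -- iterable of (title, [category, ...]) pairs
        yago_map      -- dict: Wikipedia category -> WordNet sense (from YAGO)
        supersense_of -- dict: WordNet sense -> lexicographer file (supersense)
        """
        results = []
        for title, categories in articles:
            votes, tags = Counter(), []
            for category in categories:
                sense = yago_map.get(category)      # steps (2)/(3a): keep only YAGO-supported categories
                if sense is None:
                    continue
                tags.append(sense)
                votes[supersense_of[sense]] += 1    # steps (3b)/(4): map to supersense and vote
            if votes:                               # step (5): majority supersense wins
                results.append((title, votes.most_common(1)[0][0], tags))
        return results

    # toy usage with made-up mappings
    yago = {"Cities_in_Taiwan": "city#1"}
    super_of = {"city#1": "noun.location"}
    print(wikisense([("Hsinchu", ["Cities_in_Taiwan"])], yago, super_of))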

3.2.2 Minimally Supervised Question Classification

In the second and final stage of the training process (Step (2) in Figure 1), we use WordNet and the previously introduced WikiSense to automatically create training data. Figure 3 shows the training algorithm for constructing the question classifier. We use the Maximum Entropy Model to construct a single classifier with multiple outcomes (Step (1)). The input of this stage includes a semantic database that determines the outcomes and a set of question/answer pairs. For each question/answer pair, we first determine whether the answer is listed in the input semantic database; unsupported question/answer pairs are neglected (Step (2)). In Step (3), a listed answer is transformed into its supersense using the semantic database as the outcome (Step (3a)), and features are extracted from the question (Step (3b)). Finally, the extracted features and the transformed outcome are used as an event to train the classifier in Step (4). After all the listed question/answer pairs have been processed, the trained classifier is returned.

procedure QCTrain(SemanticCategory, QASet)
    Declare Classifier as Maximum Entropy Model        (1)
    for each <Q, A> in QASet:
        if A is not supported by SemanticCategory:     (2)
            continue
        Outcome := SemanticCategory(A)                 (3a)
        Features := ExtractFeatures(Q)                 (3b)
        Classifier.AddEvent(Features, Outcome)         (4)
    Classifier.Train()
    return Classifier                                  (5)

Fig 3. Minimally supervised training method of the question classifier.

Most of the concepts in WordNet are basic vocabulary items; only a few named entities can be found in WordNet, whereas Wikipedia contains a large number of NEs. For instance, NEs like Charles Dickens (writer) are in both the WikiSense and WordNet vocabularies, while Elton John (singer), Brothers in Arms (song), and Ben Nevis (mountain) can only be found in WikiSense. However, WordNet, being handcrafted, still has much higher accuracy on basic words and phrases. Therefore we use both WikiSense and WordNet to cover common nouns as well as NEs.
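The training loop of Figure 3 can be sketched with an off-the-shelf maximum entropy learner. The rendering below uses scikit-learn's logistic regression as the MaxEnt model; the feature extractor and the two toy question/answer pairs are invented stand-ins, not the features or data of the actual system.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def extract_features(question):
        # crude stand-ins for the paper's features: a question phrase plus a bag of words
        words = question.lower().rstrip("?").split()
        feats = {"qphrase=" + "-".join(words[:2]): 1.0}
        feats.update({"w=" + w: 1.0 for w in words})
        return feats

    def qc_train(semantic_category, qa_pairs):
        events, outcomes = [], []
        for q, a in qa_pairs:
            outcome = semantic_category.get(a)   # step (2): skip unsupported answers
            if outcome is None:
                continue
            events.append(extract_features(q))   # step (3b)
            outcomes.append(outcome)             # step (3a)
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression as the MaxEnt model
        clf.fit(vec.fit_transform(events), outcomes)  # step (4)
        return vec, clf

    # toy usage with an invented two-pair training set
    cat = {"William Shakespeare": "noun.person", "London": "noun.location"}
    vec, clf = qc_train(cat, [("Who wrote Hamlet?", "William Shakespeare"),
                              ("Where were Prince Charles and Princess Diana married?", "London")])
    print(clf.predict(vec.transform([extract_features("Who painted the Mona Lisa?")])))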

There are three main features used in the training stage: (1) the supersenses of the NEs found in the given question, (2) the question phrase of the given question, and (3) the words in the given question.

Question: In kilometres, how long is the Suez Canal? / Named Entity Class: noun.artifact / Question Phrase: how-long
Question: The action in the film "A View To A Kill" features which bridge? / Named Entity Class: noun.communication / Question Phrase: which-bridge
Question: Which famous author was married to Anne Hathaway? / Named Entity Class: noun.person / Question Phrase: which-author

Table 3. Example questions and features.

At runtime, classification outcomes with probability higher than a threshold are retrieved. The values of the thresholds are set to multiples of the uniform-distribution probability. In Section 4, we show experimental results of the proposed method at different thresholds.

3.3 Redundancy Based Question Answering System

We use Google as our document retrieval engine to search the entire Web. Only the snippets of the top 64 retrieval results are used. After retrieving snippet passages, we take advantage of the large amount of retrieved text to extract candidates and rely on redundancy to produce the answer. Previous work shows that answer redundancy is an effective technique for the QA task (Brill et al. [2001]).

Once answer candidates are extracted and their redundancy counted, the candidates are re-ranked based on the question classification results. We retain and make use of several predicted question types (those with probability higher than a threshold); in other words, a given question may be classified into multiple classes. This is reasonable due to the characteristics of our class tagset. Consider the question "Where were Prince Charles and Princess Diana married?". It may be answered with either the name of a city (London) or the name of a church (St Paul's Cathedral); therefore the question type could be either LOCATION or ARTIFACT. After the passages are retrieved, answer candidates are extracted and classified using WordNet and WikiSense. Finally, we re-rank the 20 most frequent candidates by ordering them in descending order of question type probability, and then by frequency count, and produce the top n candidates as output.

4. Experimental Results and Evaluation

In this section, we describe the experimental settings and evaluation results. In Section 4.1, we describe in detail the experimental settings and evaluation metrics. Then the evaluation results and analysis of WikiSense and question classification are discussed in Section 4.2 and Section 4.3. Finally, we report the performance of the classifier in a simple redundancy-based QA system and evaluate its effectiveness in Section 4.4.
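Before turning to the experiments, the re-ranking rule of Section 3.3 (question-type probability first, redundancy count second) can be made concrete with a short sketch; the function name and the toy counts below are illustrative assumptions, not the system's code.

    def rerank(candidates, counts, cand_type, type_prob):
        """Order the 20 most frequent candidates by question-type probability, then frequency.

        candidates -- list of answer strings extracted from retrieved snippets
        counts     -- dict: candidate -> redundancy count
        cand_type  -- dict: candidate -> supersense (looked up in WordNet/WikiSense)
        type_prob  -- dict: supersense -> probability assigned by the question classifier
        """
        top20 = sorted(set(candidates), key=counts.get, reverse=True)[:20]
        return sorted(top20,
                      key=lambda c: (type_prob.get(cand_type.get(c), 0.0), counts[c]),
                      reverse=True)

    # toy usage: a location question where 'background' is frequent but wrongly typed
    counts = {"London": 7, "background": 9, "St Paul's Cathedral": 3}
    cand_type = {"London": "noun.location", "background": "noun.cognition",
                 "St Paul's Cathedral": "noun.artifact"}
    type_prob = {"noun.location": 0.41, "noun.artifact": 0.18}
    print(rerank(list(counts), counts, cand_type, type_prob))

In this toy run the mistyped but most frequent candidate "background" drops below the two candidates whose supersenses match the retained question types, which is exactly the filtering effect described for the Rocky Mountains example.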

4.1 Experimental Setting and Evaluation Metrics

In the first experiment, we explain and analyze the results and coverage of WikiSense, which is then used in the second experiment to classify questions in addition to WordNet.

We collected 5,676 question/answer pairs as training data from the Quiz-zone Web site, an online quiz service with popular-culture and general-knowledge questions designed to be answered by humans. To evaluate our method, one tenth of the question/answer pairs was separated from the training data as the evaluation data. The correct classes of the questions were labeled by human judges in order to evaluate the performance of question classification. We then used the proposed minimally supervised training method to generate two question classifiers based on different database settings. In the first experiment, we used only WordNet to generate data to train the first classifier (the baseline), and then compared this classifier with a second classifier trained on both WordNet and WikiSense. The purpose is to show the amount of improvement contributed by WikiSense, if any. Since WordNet is constructed by humans, we consider it to have higher precision; therefore, WordNet is used when conflicts arise between WikiSense and WordNet. The results of both classifiers are presented and compared in terms of recall and precision rates.

4.2 WikiSense

An implementation of the proposed method classifies about 55% of all titles in Wikipedia, resulting in a large-scale, finer-grained supersense semantic category containing 1,158,865 entities. Unclassified titles are usually caused by articles with few or no categories, so that their semantic type cannot be accurately determined. However, this does not imply that the classification method has low coverage. Unlike most offline encyclopedias, Wikipedia is an ongoing collaborative work: thousands of new and unfinished articles are created by volunteers or robots daily. The Wikipedia editorial principles state that every Wikipedia article should belong to at least one category, so uncategorized titles usually belong to articles still in the early stage of development (called stubs in the Wikipedia community).

4.3 Question Classification

In this section, we report the evaluation results of using the trained classifiers to classify questions. Figure 4 shows the results of the two classifiers in terms of recall and retrieval size at different levels of the threshold (in multiples of 0.04, the uniform probability over the 25 classes). At the same recall, a lower retrieval size results in higher precision. As Figure 5 shows, higher precision is achieved with a higher threshold, trading off recall. Notice that the recall of both classifiers decreases gradually as the threshold increases from one to five times the uniform probability; above a threshold of 5, the recall of both classifiers decreases rapidly. Considering that recall is crucial to the question classification task in order to prevent early elimination of correct answer candidates, we focus our analysis on thresholds lower than 5.
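A small sketch of the thresholding described above, assuming the classifier returns one probability per supersense; the function name and the toy probabilities are invented for illustration.

    def retained_classes(probs, k, n_classes=25):
        """Keep every outcome whose probability exceeds k times the uniform probability.

        probs -- dict: supersense -> classifier probability (should sum to ~1)
        k     -- threshold expressed as a multiple of the uniform probability 1/n_classes
        """
        threshold = k / n_classes   # uniform probability over 25 classes is 1/25 = 0.04
        return {c: p for c, p in probs.items() if p > threshold}

    # toy usage: at k = 2 only classes above 0.08 survive
    print(retained_classes({"noun.location": 0.41, "noun.artifact": 0.09,
                            "noun.person": 0.05}, k=2))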

We can see that the precision increases for both classifiers as the threshold increases. The combined classifier was able to achieve slightly higher recall and a higher precision of 91% at a threshold of 2 times the uniform probability.

Fig 5. Performance in terms of precision and recall at different thresholds.

4.4 Question Answering

In this experiment, we first run our QA system without any question classification as our baseline. We then run the same system on the same evaluation dataset using two different question classifiers, one trained on WordNet and the other trained on WordNet plus WikiSense.

Table 4. Top1 precision and MRR of deploying the two classifiers.
(a) WordNet: Baseline Top1 34%, MRR 0.45; rows at thresholds 1-5 follow.
(b) WikiSense + WordNet: Baseline Top1 34%, MRR 0.45; rows at thresholds 1-5 follow.
[Only the baseline row of Table 4 survives in this transcription; the per-threshold numeric entries are lost.]

Table 4 lists the Top1 precision and MRR of our baseline system and of the system with the two classifiers at varying thresholds. As we can see, by including question classification, both systems performed better than the baseline. With the enhancement of WikiSense, the results in Table 4(b) achieve significantly higher MRR and Top1 precision compared to the system with a classifier trained on WordNet only (see Table 4(a)). The best performance in both MRR and Top1 precision was achieved by the system using both WikiSense and WordNet: at a threshold of 2.25, the MRR was higher than the baseline by 0.06, and the Top1 precision was higher by 9%.

5. Conclusions

Many future research directions present themselves: for example, expanding the coverage of WikiSense using other characteristics of Wikipedia, such as the internal link structure, article contents, information boxes, and Wikipedia templates; minimally supervised training for automatic supersense tagging of Wikipedia titles; and a more complex QA system that takes full advantage of the finer-grained classification.

In summary, we have introduced a method of minimally supervised training for fine-grained question classification using an automatically generated supersense category (WikiSense) and WordNet. The method involves supersense tagging of answers to generate training data and using a Maximum Entropy model to build question classifiers. We have implemented and evaluated the proposed methods using a simple redundancy-based QA system. The results show that the method substantially outperforms the baseline of not using question classification.

References

[1] E. Agichtein, S. Lawrence, and L. Gravano, "Learning to Find Answers to Questions on the Web," ACM Transactions on Internet Technology (TOIT), vol. 4, 2004.
[2] E. Brill, J. Lin, M. Banko, S. Dumais, and A. Ng, "Data-Intensive Question Answering," in Proceedings of the Tenth Text REtrieval Conference (TREC), 2001.
[3] M. Ciaramita and M. Johnson, "Supersense Tagging of Unknown Nouns in WordNet," Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 168-175, 2003.
[4] C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, May 1998.
[5] C. Kwok, O. Etzioni, and D. S. Weld, "Scaling question answering to the web," ACM Transactions on Information Systems (TOIS), vol. 19, no. 3, 2001.
[6] MetaWeb Technologies, Freebase Wikipedia Extraction (WEX), version June 6, 2009.

[7] J. Prager, J. Chu-Carroll, and K. Czuba, "Statistical answer-type identification in open-domain question answering," in Proceedings of the Second International Conference on Human Language Technology Research, 2002.
[8] F. M. Suchanek, G. Kasneci, and G. Weikum, "YAGO - A Core of Semantic Knowledge," in 16th International World Wide Web Conference (WWW), 2007.
[9] D. Ravichandran and E. Hovy, "Learning Surface Text Patterns for a Question Answering System," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 41-47, July 2002.


Wei-Bin Liang, Yu-Cheng Hsiao, and Chung-Hsien Wu
Department of Computer Science and Information Engineering, National Cheng Kung University
E-mail: ...

[The Chinese title and abstract of this paper survive only in fragments; the abstract reports 48.1% dialogue act detection accuracy for the semantic-slot baseline versus 81.9% for the proposed approach, a 33.8% improvement.]

Abstract

This paper presents a dialogue act detection approach using sentence structures and partial pattern trees to generate candidate sentences (CSs). A syntactic parser is utilized to convert the CSs into sentence grammar rules (SRs). To avoid confusion between dialogue intentions, the K-means algorithm is adopted to cluster the sentence structures of the same dialogue intention based on the SRs. Finally, the relationship between these SRs and the intentions is modeled by a latent dialogue act matrix. Moreover, for the application to a travel information dialogue system, optimal dialogue strategies are trained using the partially observable Markov decision process (POMDP) for robust dialogue management. In the evaluation, compared to the semantic slot-based method, which achieves 48.1% dialogue act detection accuracy, the proposed approach achieves 81.9% accuracy, a 33.8% improvement.

Keywords: Dialogue act, partial pattern tree, sentence structure, POMDP

1. Introduction

Spoken dialogue systems have been developed by many research groups, including MIT [2],

AT&T [3], NICT [4], and Philips [5], among others [6][7][8]. Machine-learning approaches have been applied to dialogue act analysis [9], as in the work of Choi et al. [10]; finite-state-machine-based dialogue management is used in [4], and the partially observable Markov decision process (POMDP) [11] has been adopted for robust dialogue management, with slot-filling remaining a common representation of user goals.

2. Corpus and Annotation

2.1 Corpus

The dialogue corpus was recorded with an audio-technica AT-series microphone and contains 1,586 dialogue turns, each annotated with a dialogue act (DA).

2.2 Annotation

Each turn is annotated with its semantic classes (semantic slots, in the slot-filling sense), its dialogue act (DA), and the corresponding system action.

The dialogue acts are defined following the speech act framework of Traum [12], taking the task structure and semantic slots of the travel-information domain into account, and the annotation follows the MHMC travel corpus annotation guidelines [13]. [Tables listing the DA inventory and the system actions appeared here; their contents are lost in this transcription.]

The system processes a user utterance U in three stages: input processing, dialogue management (DM), and output processing. In input processing, an automatic speech recognizer (ASR) converts U into a word sequence W, and a spoken language understanding (SLU) module derives the dialogue act DA_C from W. In the DM stage, a POMDP-trained strategy selects the system action a_t based on the SLU output, the dialogue act history DA_H, and the travel information database content Context_t. In output processing, a text-to-speech synthesizer (TTS) generates the response to the user.

3. Spoken Language Understanding (SLU)

3.1 Detection Criterion

Given the utterance U and the history DA_H, the SLU selects the dialogue act DA* by

    DA* = argmax_DA P(DA_C | U, DA_H)                                        (2)

Introducing the hypothesized word sequence W = Ŵ and maximizing over it,

    DA* = argmax_DA max_W P(DA_C, W | U, DA_H)                               (3)
        = argmax_DA max_W P(DA_C | W, U, DA_H) P(W | U, DA_H)                (4)

where W is the word sequence recognized from U.

Assuming that, given W, the dialogue act does not depend on U or the earlier history, and applying Bayes' decision rule to P(DA_C | W),

    DA* = argmax_DA max_W [P(W | DA_C) P(DA_C) / P(W)] P(DA_C | DA_H) P(W | U)    (5)(6)

With equal priors P(W) and P(DA_C), this reduces to

    DA* ≈ argmax_DA max_W P(W | U) P(W | DA_C) P(DA_C | DA_H)                     (7)

where P(W | U) is the ASR score of W given the utterance U, P(W | DA_C) is the probability of the word sequence given the dialogue act (the core of DA detection), and P(DA_C | DA_H) models the dialogue history.

3.2 Dialogue Act Detection

Following Eq. (7), each candidate sentence is converted into sentence rules (SRs) using the Stanford parser [14]; the relationship between SRs and DAs is then induced (Section 3.6), and the K-means algorithm clusters the sentence structures of each DA to avoid confusion between intentions (Section 3.7).
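A minimal sketch of the decision rule of Eq. (7) in the log domain, assuming the three component scores are available as lookup tables; the table layout and the toy travel-domain values are assumptions for illustration only.

    import math

    def detect_da(hypotheses, lm_given_da, da_given_hist, prev_da):
        """Pick the dialogue act maximizing Eq. (7) in the log domain.

        hypotheses    -- list of (word_string, p_w_given_u) from the ASR word graph
        lm_given_da   -- dict: (word_string, da) -> P(W | DA_C)
        da_given_hist -- dict: (da, prev_da) -> P(DA_C | DA_H)
        prev_da       -- most recent dialogue act in the history
        """
        best_da, best_score = None, float("-inf")
        das = {da for _, da in lm_given_da}
        for da in das:
            for w, p_w_u in hypotheses:          # max over W
                score = (math.log(p_w_u)
                         + math.log(lm_given_da.get((w, da), 1e-12))
                         + math.log(da_given_hist.get((da, prev_da), 1e-12)))
                if score > best_score:
                    best_da, best_score = da, score
        return best_da

    # toy usage with two ASR hypotheses and two candidate acts
    hyps = [("when does the bus leave", 0.6), ("where does the bus leave", 0.3)]
    lm = {("when does the bus leave", "ask_time"): 0.02,
          ("where does the bus leave", "ask_location"): 0.01}
    hist = {("ask_time", "greeting"): 0.3, ("ask_location", "greeting"): 0.2}
    print(detect_da(hyps, lm, hist, "greeting"))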

3.3 Partial Pattern Trees (PPT)

A sentence is divided into optional phrases (OPs) and a main phrase (MP); together they form partial patterns (PPs). Because the MP carries the core intention, a PP always retains the MP while OPs may be deleted, which makes the detection robust against deletion errors. Hesitations and disfluencies are frequent in conversational speech [15][16], and the PPs allow the intended pattern to be recovered [17][18]. A transcription Trans is represented by the optional phrases before and after the main phrase:

    Trans = { OP_1, OP_2, ..., OP_NB, MP, OP_NB+1, ..., OP_NB+NA }            (8)

where NB and NA are the numbers of OPs before and after the MP. A filler model Filler (F) replaces deleted or unreliable OPs. For example, for a sentence with structure ABC whose MP is B, the partial patterns are ABC, ABF, FBC, and FBF, so an input such as AC can still be matched through the filler.

3.4 Filler Spotting

The ASR word graph [19] is rescored to spot fillers for the SLU. To measure the diversity of a word's score distribution, a chi-square test with mean μ and standard deviation σ is applied:

    χ² = Σ_k (x_k − μ)² / σ²                                                   (9)

and words whose scores deviate strongly are treated as fillers.

3.5 Sentence Rule Generation

The Stanford parser [14], a PCFG (Probabilistic Context-Free Grammar) parser, is used to generate the sentence rules; its PCFG serves as a stochastic language model (SLM) over parses.
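Looking back at the partial patterns of Section 3.3, the expansion of a segmented sentence into its partial patterns (e.g., ABC into ABC, ABF, FBC, FBF) can be sketched as follows; the function signature is an illustrative assumption.

    from itertools import product

    def partial_patterns(before, mp, after, filler="F"):
        """Enumerate partial patterns: each optional phrase may be kept or replaced by a filler.

        before/after -- lists of optional phrases (OPs) preceding/following the main phrase
        mp           -- the main phrase, which is always kept (cf. Eq. (8))
        """
        slots = [[op, filler] for op in before] + [[mp]] + [[op, filler] for op in after]
        return ["".join(choice) for choice in product(*slots)]

    print(sorted(partial_patterns(["A"], "B", ["C"])))   # ['ABC', 'ABF', 'FBC', 'FBF']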

The parser models follow [20][21]. For example, the parse of a candidate sentence yields grammar rules such as (1) Root → IP, (2) IP → VP, (3) VP → ADVP VP, (4) ADVP → AD, (5) VP → VV NP, and (6) NP → NN; a rule such as NP → NN may occur three times within one parse.

3.6 Induction

For L distinct rules and Q dialogue acts, the relation between rules and DAs is represented by an L × Q matrix

    Φ = [φ_{l,q}],  1 ≤ l ≤ L, 1 ≤ q ≤ Q                                       (10)

where the entry for rule Rule_l and act DA_q is

    φ_{l,q} = (1 − ε_l) P(Rule_l | DA_q)                                        (11)

    P(Rule_l | DA_q) = C(Rule_l, DA_q) / Σ_k C(Rule_k, DA_q)                    (12)

Here C(Rule_l, DA_q) is the count of Rule_l occurring with DA_q, and (1 − ε_l) is an entropy-based weight with

    ε_l = −(1 / log Q) Σ_{q=1..Q} [C(Rule_l, DA_q) / Σ_{q'} C(Rule_l, DA_{q'})] log [C(Rule_l, DA_q) / Σ_{q'} C(Rule_l, DA_{q'})]    (13)

so that rules spread evenly over all DAs receive a low weight.
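The entropy-weighted rule scores of Eqs. (11)-(13) can be computed directly from the count table, as in the following sketch; the count layout and the toy values are assumptions.

    import math

    def rule_weights(counts):
        """Compute phi[l][q] = (1 - eps_l) * P(Rule_l | DA_q) from a count table.

        counts -- 2-D list, counts[l][q] = C(Rule_l, DA_q)
        """
        L, Q = len(counts), len(counts[0])
        col_sum = [sum(counts[l][q] for l in range(L)) for q in range(Q)]
        phi = [[0.0] * Q for _ in range(L)]
        for l in range(L):
            row_sum = sum(counts[l])
            if row_sum == 0:
                continue
            # eps_l: normalized entropy of the rule's distribution over DAs, Eq. (13)
            eps = -sum((c / row_sum) * math.log(c / row_sum)
                       for c in counts[l] if c) / math.log(Q)
            for q in range(Q):
                p = counts[l][q] / col_sum[q] if col_sum[q] else 0.0   # Eq. (12)
                phi[l][q] = (1.0 - eps) * p                            # Eq. (11)
        return phi

    # toy usage: rule 0 is concentrated on act 0, rule 1 is spread evenly (weight -> 0)
    print(rule_weights([[8, 1], [3, 3]]))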

3.7 Sentence Clustering

Sentences of the same DA may exhibit several distinct structures (e.g., one DA covering sentence groups {1,2,3} and {4,5,6}, another covering {3,4,7}), which confuses the detection. The K-means algorithm is therefore used to cluster the sentence structures within each DA. For DA_q and transcription Trans, the sentence vector is

    Φ_q^Trans = (φ_{1,q} δ_1, φ_{2,q} δ_2, ..., φ_{L,q} δ_L)                    (14)

where δ_l is 1 if Trans contains Rule_l and 0 otherwise. K-means selects the partition (G_1, G_2, ..., G_K) that maximizes the within-cluster similarity

    (G_1, ..., G_K)* = argmax Σ_{k=1..K} Σ_{Φ_i, Φ_j ∈ G_k} Similarity(Φ_i, Φ_j)    (15)

with the cosine measure

    Similarity(Φ_i, Φ_j) = (Φ_i · Φ_j) / (|Φ_i| |Φ_j|)                          (16)

After clustering, the resulting clusters serve as latent dialogue acts, which are modeled by the latent DA matrix introduced next.
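A minimal NumPy sketch of the cosine-similarity K-means of Eqs. (14)-(16); the random initialization and the toy rule-indicator vectors are illustrative assumptions, not the paper's exact procedure.

    import numpy as np

    def cosine_kmeans(X, k, iters=20, seed=0):
        """Cluster rows of X into k groups by maximizing cosine similarity to centroids."""
        rng = np.random.default_rng(seed)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize sentence vectors
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            sims = X @ centroids.T                          # cosine similarity, Eq. (16)
            labels = sims.argmax(axis=1)
            for j in range(k):                              # recompute centroids
                if (labels == j).any():
                    c = X[labels == j].mean(axis=0)
                    centroids[j] = c / np.linalg.norm(c)
        return labels

    # toy usage: two obvious groups of rule-indicator vectors (cf. Eq. (14))
    X = np.array([[1., 1., 0., 0.], [1., .9, 0., 0.],
                  [0., 0., 1., 1.], [0., .1, .9, 1.]])
    print(cosine_kmeans(X, k=2))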

3.8 Latent DA Model (LDAM)

The relation between the L rules and the M latent DAs obtained from clustering is modeled by an L × M latent DA matrix [ν_{l,m}] with

    ν_{l,m} = (1 − ε_l) P(Rule_l | DA_m)                                        (17)(18)

    P(Rule_l | DA_m) = C(Rule_l, DA_m) / Σ_k C(Rule_k, DA_m)                    (19)

where C(Rule_l, DA_m) is the count of rule l with latent act m, and the entropy weight (1 − ε_l) is computed as in Eq. (13) with the number of latent classes C (from K-means) in place of Q:

    ε_l = −(1 / log C) Σ_{c=1..C} [C(Rule_l, DA_c) / Σ_{c'} C(Rule_l, DA_{c'})] log [C(Rule_l, DA_c) / Σ_{c'} C(Rule_l, DA_{c'})]    (20)

Each column of the LDAM corresponds to one latent DA c and is used by the SLU at detection time.

3.9 Scoring

For Eq. (7), the probability of the word sequence W given DA_c is approximated by its rule score and its semantic-class score:

    P(W | DA_c) ≈ P(Rule_W | DA_c) P(SC_W | DA_c)                               (21)

where Rule_W is the rule vector of W. P(Rule_W | DA_c) is computed with the cosine measure between Rule_W and the LDAM column of DA_c:

    P(Rule_W | DA_c) = (Rule_W · DA_c) / (|Rule_W| |DA_c|)                      (22)

and the semantic-class score is taken as the product, over the words w_1 ... w_N of W, of the probabilities of their semantic classes SC_j:

    P(SC_W | DA_c) = Π_n P(SC_j^{w_n} | DA_c)                                   (23)

3.10 Dialogue History

Assuming a first-order Markov dependence on the DA history, the history term in Eq. (7) is

    P(DA_t = DA_C | DA_1, DA_2, ..., DA_{t−1} = DA_H) = P(DA_t | DA_{t−1})      (24)

4. Dialogue Management Using POMDP

A POMDP treats the user goal as a hidden variable and maintains a belief function b over states. It is specified by the tuple {S, A, R, T, O}: the state set S, the action set A (the system dialogue acts), the reward R(s,a) = r obtained by taking action a in state s, the transition function T(s,a,s') = P(s_{t+1} = s' | s_t, a_t), and the observation set O with observation function P(o | s, a). The POMDP dialogue manager (Fig. 4) follows [11]: after taking action a and observing o', the belief is updated as

    b'(s') = P(s' | o', a, b) = k P(o' | s', a) Σ_{s ∈ S} P(s' | s, a) b(s)     (25)

where k is a normalizing constant, and the optimal value function satisfies

    V*(b) = max_{a ∈ A} [ Σ_{s ∈ S} r(s,a) b(s) + γ Σ_{o',s,s'} P(o' | s', a) P(s' | s, a) b(s) V(b'(s')) ]    (26)

4.2 Model Definition

The states are built from the semantic slots and dialogue acts, the actions are the system dialogue acts (Fig. 5), and the state transitions are estimated from the annotated dialogues.
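To make the belief update of Eq. (25) concrete, the following sketch performs one update on a toy two-state model; the array layout of the transition and observation models is an assumption for illustration.

    import numpy as np

    def belief_update(b, T, O, a, o):
        """POMDP belief update, Eq. (25): b'(s') ∝ P(o'|s',a) * sum_s P(s'|s,a) b(s).

        b -- belief over states, shape (S,)
        T -- transition model, T[a, s, s'] = P(s'|s,a)
        O -- observation model, O[a, s', o'] = P(o'|s',a)
        """
        predicted = T[a].T @ b          # sum_s P(s'|s,a) b(s)
        unnorm = O[a][:, o] * predicted
        return unnorm / unnorm.sum()    # k is the normalizing constant

    # toy usage: two dialogue-act states, one action, two observations
    T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
    O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
    b = np.array([0.5, 0.5])
    print(belief_update(b, T, O, a=0, o=0))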

The reward function takes the values

    r = +10, −10, −5, −100, or +100                                            (27)

depending on the dialogue situation; the conditions attached to each value are lost in this transcription. The observation function is approximated at the DA level:

    P(o' | s', a) ≈ P(DA_o' | DA_s', a)                                         (28)

    P(DA_o' | DA_s') = (1 − p_errC) P(DA_C | U, DA_H) + p_errC (1 − P(DA_C | U, DA_H))    (29)

where p_errC is the DA detection error rate and P(DA_C | U, DA_H) is the SLU score. The POMDP policy is trained with the solver of [22].

5. Experiments

5.1 Speech Recognizer

The ASR front end performs feature extraction, and the acoustic models (AM) and the language model are trained with the HMM Tool Kit (HTK); the TCC300 corpus seeds the AM. Right-context-dependent initial models and 38 right-context-independent final models are used, with 3-5 states per model and up to 32 mixtures per state, followed by adaptation. The features are 39-dimensional MFCCs with a pre-emphasis factor of 0.97, following the HTK Book. On TCC300, the recognizer achieves 84.33% (uni-gram) and 93.2% (bi-gram) accuracy. Figure 6(a) shows the distribution of the DAs

and Figure 6(b) the distribution of dialogue turns in the corpus; the POMDP strategy is trained on the annotated DA sequences.

5.2 Dialogue Act Detection Results

Four SLU settings are compared: (1) the semantic slot (SS) baseline; (2) SS plus Stanford Parser (SP) sentence rules with the LDAM; (3) SS + SP plus the partial pattern tree (PPT) with the LDAM; and (4) SS + SP + PPT plus sentence clustering (SC) with the LDAM. The four settings achieve 49.6%, 76.2%, 81.6%, and 82.9% DA detection accuracy, respectively (Figure 7). An error analysis over the DA inventory discusses which DAs remain confusable and how the PPT and the LDAM with sentence clustering reduce the confusions; the detailed discussion is lost in this transcription. The POMDP dialogue manager is then evaluated with the four SLU settings.

Figure 7 compares the four settings, (1) SS, (2) SS + SP, (3) SS + SP + PPT, and (4) SS + SP + PPT + SC, within the POMDP dialogue manager. With POMDP-based management, the proposed approach achieves 81.9% DA detection accuracy versus 48.1% for the semantic-slot baseline, a 33.8% improvement.

6. Conclusion

This paper presented a DA detection approach based on sentence structures, partial pattern trees, and a latent DA matrix, combined with POMDP-based dialogue management. Three directions of future work are listed in the original Chinese conclusion, including further work on the POMDP dialogue management; the other items are lost in this transcription.

References

[1] X.-D. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Prentice-Hall, Inc., 2001.
[2] J.-J. Liu, Y.-S. Xu, S. Seneff, and V. Zue, "Citybrowser II: A multimodal restaurant guide in Mandarin," in Proc. International Symposium on Chinese Spoken Language Processing, 2008.
[3] AT&T (2002), How May I Help You? [Online].
[4] C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura, "Dialog Management using Weighted Finite-State Transducers," Interspeech, 2008.
[5] S. Bennacef, L. Devillers, S. Rosset, and L. Lamel, "Dialogue in the RAILTEL Telephone-Based System," in Proc. ICSLP '96, vol. 1, 1996.
[6] C.-J. Lee, E.-F. Huang, and J.-K. Chen, "A Multi-keyword Spotter for the Application of the TL

Phone Directory Assistant Service," in Proc. Workshop on Distributed System Technologies & Applications, 1997.
[7] (Chinese-language reference; details lost in transcription), 2005.
[8] T.-H. Chang, C.-M. Peng, Y.-C. Lin, H.-M. Wang, and S.-C. Chieh, "The Design of a Mandarin Chinese Spoken Dialogue System," in Proc. COTEC '98, Taipei, 1998.
[9] (Chinese-language reference; details lost in transcription), in Proc. ROCLING XV, Hsinchu, Taiwan.
[10] W.-S. Choi, H. Kim, and J.-Y. Seo, "An Integrated Dialogue Analysis Model for Determining Speech Acts and Discourse Structures," IEICE Transactions, 2005.
[11] J. D. Williams and S. Young, "Partially Observable Markov Decision Processes for Spoken Dialog Systems," Computer Speech and Language, 2007.
[12] D. R. Traum, "Speech Acts for Dialogue Agents," Kluwer Academic Publishers, 1999.
[13] Y.-C. Xiao, MHMC Annotation of MHMC Travel Corpus [Online].
[14] Stanford Parser [Online].
[15] E. Shriberg and A. Stolcke, "Word Predictability After Hesitations: A Corpus-Based Study," in Proc. International Conference on Spoken Language Processing (ICSLP), 1996.
[16] M. Siu, M. Ostendorf, and H. Gish, "Modeling Disfluencies in Conversational Speech," in Proc. International Conference on Spoken Language Processing (ICSLP), vol. 1, 1996.
[17] T. R. Niesler and P. C. Woodland, "Variable-Length Category N-gram Language Models," Computer Speech and Language, vol. 12, 1999.
[18] J. S. Hamaker, "Towards Building a Better Language Model for Switchboard: the POS Tagging Task," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999.
[19] F. Wessel, R. Schluter, K. Macherey, and H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 3, 2001.
[20] D. Klein and C. D. Manning, "Fast Exact Inference with a Factored Model for Natural Language Parsing," in Advances in Neural Information Processing Systems, 2002.
[21] D. Klein and C. D. Manning, "Accurate Unlexicalized Parsing," in Proc. 41st Annual Meeting of the Association for Computational Linguistics, pp. 423-430, 2003.
[22] POMDP solver [Online].

Mandarin/English Mixed-Lingual Speech Recognition System on Resource-Constrained Platforms

Wei-Tyng Hong, Department of Communications Engineering, Yuan Ze University
Hong-Ci Chen, Department of Communications Engineering, Yuan Ze University
I-Bin Liao, Telecommunication Laboratories, Chunghwa Telecom
Wern-Jun Wang, Telecommunication Laboratories, Chunghwa Telecom

[The Chinese abstract survives only in fragments; it describes a keyword-based Mandarin/English mixed-lingual recognizer using RASTA processing [1] for robustness on resource-constrained platforms.]

1. Introduction

Research related to this work includes phone-set generation and fusion for multilinguality (2006-2007) [2-3], mixed-language recognition on PDAs (2008) [4], and monolingual recognition (2004) [5]. For keyword spotting, Higgins and Wohlford [6] proposed keyword recognition using template concatenation with filler templates in 1985; Rohlicek et al. [7] applied continuous HMMs to speaker-independent word spotting in 1989; Rose and Paul [8] presented an HMM-based keyword recognition system in 1990; Caminero et al. [9] introduced on-line garbage models with utterance verification in 1996; Soong et al. [10] applied N-best search in the 1990s; and Xie Lingyun et al. [11] proposed an efficient Viterbi beam search with dynamic pruning in 2004. Our system organizes the vocabulary into prefix words and main words, compiles it into a tree lexicon, and decodes with a frame-synchronous beam search.

Figure 2-1 shows the system architecture: input speech passes through feature extraction and a beam search over the tree lexicon, the hypotheses are sorted, and the best sentence is output. The acoustic models are trained from a Mandarin database (MAT-2000) and English databases (EAT + TWENG), and the tree lexicon is built in the training stage; recognition runs on a Linux PC.

Fig 2-1. System architecture: feature extraction, beam search, and sorting on the recognition side; acoustic model training from the Mandarin (MAT-2000) and English (EAT + TWENG) databases and tree lexicon construction on the training side.

2.1 Acoustic Modeling with Laplacian Distributions

Instead of Gaussian mixture models, the state observation distributions use the Laplacian distribution, whose probability density function is

    Lap(x; u, v) = (1 / 2v) exp(−|x − u| / v)                                   (1)

where u is the location parameter and v is the scale parameter. For a D-dimensional feature x and state j with mixtures m, the state likelihood is approximated by the best mixture:

    b_j(x) ≈ max_m Π_{d=1..D} Lap(x_d; u_{j,m,d}, v_{j,m,d})
           = max_m Π_{d=1..D} (1 / 2 v_{j,m,d}) exp(−|x_d − u_{j,m,d}| / v_{j,m,d})    (2)

where u_{j,m,d} and v_{j,m,d} are the location and scale parameters of dimension d of mixture m in state j. Taking logarithms,

    log b_j(x) ≈ max_k [ −Σ_{d=1..D} |x_d − u_{j,k,d}| / v_{j,k,d} + C_{j,k} ]    (3)(4)

where C_{j,k} = −Σ_d log(2 v_{j,k,d}) is a per-mixture constant of state k that can be precomputed. For a fixed-point implementation, the features x_d, the parameters u_{j,k,d} and the reciprocals of v_{j,k,d}, and the constants C_{j,k} are quantized to 16 bits (x', u', v', C') to avoid overflow. Because the accumulated log-likelihood array of the Viterbi beam search is 32 bits wide and can overflow, the weighted distance sum is right-shifted by s bits:

    log b'_j(x) ≈ C'_{j,k} − ( Σ_{d=1..D} v'⁻¹_{j,k,d} |x'_d − u'_{j,k,d}| ) >> s    (5)
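A sketch of the fixed-point state-likelihood computation of Eqs. (3)-(5); Python integers stand in for the 16-bit and 32-bit registers, and the quantized toy values and the shift amount are illustrative assumptions.

    def laplacian_logscore_fixed(x_q, means_q, inv_scales_q, consts_q, shift=4):
        """Approximate log b_j(x) of Eq. (5) with integer arithmetic only.

        x_q          -- quantized feature vector (16-bit integers)
        means_q      -- per-mixture quantized location parameters u'
        inv_scales_q -- per-mixture quantized reciprocal scale parameters (1/v)'
        consts_q     -- per-mixture quantized constants C'_{j,k} of Eq. (4)
        shift        -- right bit-shift keeping the accumulator inside 32 bits
        """
        best = None
        for u, iv, c in zip(means_q, inv_scales_q, consts_q):   # max over mixtures, Eq. (3)
            acc = 0
            for d in range(len(x_q)):
                acc += iv[d] * abs(x_q[d] - u[d])               # weighted L1 distance
            score = c - (acc >> shift)                          # Eq. (5)
            best = score if best is None else max(best, score)
        return best

    # toy usage: two mixtures, three dimensions, invented quantized values
    x = [1200, -300, 50]
    means = [[1100, -250, 40], [0, 0, 0]]
    inv_scales = [[3, 2, 5], [1, 1, 1]]
    consts = [-500, -800]
    print(laplacian_logscore_fixed(x, means, inv_scales, consts))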

2.2 Prefix Words and Main Words

The vocabulary is organized into prefix words and main words (e.g., the carrier phrase before "Email" is a prefix word and "Email" is a main word), and the words are compiled into a tree lexicon T.

Fig 2-2. Tree lexicon with roots Root 1 and Root 2; e.g., the word "Email" is expanded into the phone sequence IY-M-EY-L and attached under the tree.

The phone models follow an initial/final structure (Fig 2-3): an initial model has 2 states (In1, In2), a final model has 4 states (Fn1-Fn4), and a silence model sl of one to two states is attached at both ends of a syllable sequence.

The beam search over the tree lexicon prunes hypotheses with a beam width; its procedure is summarized as pseudocode below.

The search proceeds frame-synchronously (t is the frame index): Step 1 initializes the tokens at t = 0; Steps 2-3 expand the active tree nodes at t = 1; Step 4 iterates for t = 2, ..., T, repeating Steps 5-7 for each frame, where Step 5 propagates tokens along the tree arcs and accumulates the frame log-likelihoods, Step 6 prunes tokens that fall outside the beam, and Step 7 records tokens reaching word ends (Root 1, Root 2, ...; states S0, S1, S2). [The exact wording of the steps is garbled in this transcription; Figures 2-4 and 2-5 illustrated the procedure.]
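The frame-synchronous beam search can be sketched as token passing over the tree lexicon; the toy graph below (a two-word lexicon sharing a silence root) and the scoring interface are assumptions for illustration, not the authors' engine.

    import math

    def beam_search(frames, arcs, start, finals, beam=5):
        """Frame-synchronous Viterbi beam search over a (tree-structured) lexicon graph.

        frames -- list of observations (here: dicts state -> frame log-likelihood)
        arcs   -- dict: state -> list of successor states
        start  -- initial state; finals -- set of accepting (word-end) states
        """
        tokens = {start: 0.0}                            # state -> best accumulated log score
        for obs in frames:
            expanded = {}
            for s, score in tokens.items():
                for nxt in arcs.get(s, []):              # step 5: propagate tokens along arcs
                    cand = score + obs.get(nxt, -1e9)    # add frame log-likelihood
                    if cand > expanded.get(nxt, -math.inf):
                        expanded[nxt] = cand
            # step 6: prune, keeping only the `beam` best tokens (the beam width)
            tokens = dict(sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)[:beam])
        return max(((s, sc) for s, sc in tokens.items() if s in finals),
                   key=lambda kv: kv[1], default=None)   # step 7: best word-end token

    # toy usage: a two-word tree lexicon sharing the root state 'sil'
    arcs = {"sil": ["e1", "a1"], "e1": ["e2"], "a1": ["a2"], "e2": ["e2"], "a2": ["a2"]}
    frames = [{"e1": -1.0, "a1": -2.0}, {"e2": -0.5, "a2": -3.0}]
    print(beam_search(frames, arcs, "sil", finals={"e2", "a2"}))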


Fig 2-7. Keyword recognition network: each keyword path (Keyword 1, Keyword 2, ..., Keyword k) is placed between silence models, with a bypass arc and a garbage model in parallel so that out-of-vocabulary speech around a keyword such as "Email" is absorbed; right-context-dependent (RCD) phone models are used on the keyword paths.

Figs 2-8 and 2-9. Examples of decoding the keyword "Email" through the bypass and garbage branches of the network. [Only the figure labels survive in this transcription.]

Fig 2-10. Time alignment of a recognized keyword within the utterance.

For utterance verification, duration-normalized log-likelihood scores are defined over the aligned segments:

    L'_E = L_E / E                                                              (6)

where L_E is the accumulated log-likelihood of the keyword segment and E its length in frames, and similarly

    L'_n = L_n / n,    L'_fn = L_fn / fn                                        (7)(8)

for the competing and filler/garbage segments. [The exact definitions are partly lost in this transcription.] The normalized scores are compared to verify or reject the hypothesized keyword.

where L'_n and L'_fn are the normalized scores of the competing and filler models.

3. Databases and Experimental Setup

The Mandarin acoustic models are trained on the MAT-2000 corpus (Mandarin speech database Across Taiwan, 2000 speakers) [12], sampled at 8 kHz. The English models are trained on the English Across Taiwan (EAT) corpus [13] together with the TWENG corpus, collected over PSTN and GSM channels at 8 kHz; an additional ME_Speech corpus is used for mixed-lingual testing. From 8 kHz speech, the front end extracts per frame 12 Mel-scale cepstral coefficients, 12 delta cepstral coefficients, a delta log energy, and a delta-delta log energy, for 26 dimensions, with RASTA filtering applied. The acoustic model set consists of right-context-dependent initial models, final models, and silence, using Laplacian mixtures, and the beam search runs on a PDA-class platform as well as a PC. [Tables 3-1 and 3-2 summarized the corpora and model configurations; their numeric contents are lost in this transcription.]

[Tables 3-3 through 3-5 reported the keyword recognition accuracies; the numeric entries are lost in this transcription.]

[Tables 3-6 through 3-9 and the accompanying discussion reported the mixed-lingual keyword spotting and verification results; of the percentages, only fragments such as 55.4% survive in this transcription.]

4. Conclusion

The paper presented a Mandarin/English mixed-lingual keyword recognition system for resource-constrained platforms, based on Laplacian acoustic models in fixed-point arithmetic, a tree lexicon with prefix and main words, and a beam search with bypass and garbage models. [The detailed Chinese conclusion is lost in this transcription.]

References

[1] H. Hermansky and N. Morgan, "RASTA Processing of Speech," IEEE Transactions on Speech and Audio Processing, vol. 2, Oct. 1994.
[2] C.-L. Huang and C.-H. Wu, "Phone Set Generation Based on Acoustic and Contextual Analysis for Multilingual Speech Recognition," ICASSP, vol. 4, 2007.
[3] C.-L. Huang and C.-H. Wu, "Generation of Phonetic Units for Mixed-Language Speech Recognition Based on Acoustic and Contextual Analysis," IEEE Transactions on Computers, vol. 56, 2007.
[4] P.-Y. Shih, J.-F. Wang, and H.-P. Lee, "Acoustic and Phoneme Modeling Based on Confusion Matrix for Ubiquitous Mixed-Language Speech Recognition," SUTC, June 2008.
[5] (Chinese-language reference; details lost in transcription), TEPS.
[6] A. L. Higgins and R. E. Wohlford, "Keyword Recognition Using Template Concatenation," ICASSP, vol. 10, 1985.
[7] J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish, "Continuous Hidden Markov Modeling for Speaker Independent Word Spotting," ICASSP, 1989.
[8] R. Rose and D. Paul, "A Hidden Markov Model Based Keyword Recognition System," ICASSP, vol. 1, 1990.
[9] J. Caminero, C. de la Torre, L. Villarrubia, C. Martin, and L. Hernandez, "On-line Garbage Modeling with Discriminant Analysis for Utterance Verification," ICSLP, vol. 4, Oct. 1996.
[10] T. Svendsen, F. K. Soong, and H. Purnhagen, "Optimizing Baseforms for HMM-Based Speech Recognition," in Proceedings of EuroSpeech, 1995.
[11] X. Lingyun and D. Limin, "Efficient Viterbi Beam Search Algorithm Using Dynamic Pruning," ICOSP, vol. 1, 2004.
[12] H. C. Wang, F. Seide, C. Y. Tseng, and L. S. Lee, "MAT2000 - Design, Collection, and Validation of a Mandarin 2000-speaker Telephone Speech Database," ICSLP, Beijing, China, 2000.
[13] English Across Taiwan (EAT) speech database. (Entry lost in transcription.)

A Study of Sub-band Feature Statistics Compensation Techniques Based on a Discrete Wavelet Transform for Robust Speech Recognition

[The author list and the Chinese title and abstract of this paper are lost in this transcription.]


2. Discrete Wavelet Transform

A one-level discrete wavelet transform passes the input signal x[n] through a low-pass analysis filter H_L(z) and a high-pass analysis filter H_H(z), each followed by downsampling by 2; the synthesis side upsamples by 2 and applies the synthesis filters G_L(z) and G_H(z) to reconstruct y[n]. The analysis filters form a quadrature mirror pair, |H_H(e^{jω})| = |H_L(e^{j(π−ω)})|, and perfect reconstruction holds when the synthesis filters satisfy the usual alias-cancellation conditions, e.g., G_L(z) = H_H(−z) and G_H(z) = −H_L(−z). [The exact filter relations on this page are garbled; the relations stated are the standard two-channel filter-bank conditions.]

Cascading the low-pass branch yields a multi-level decomposition: with H_H and H_L followed by downsampling at each level, a length-N signal x[n] covering the band [0, F_s/2] is split, for a three-level decomposition, into the sub-band signals

    x_1[n]: N/2 samples, band [F_s/4, F_s/2]   (Level 1 high-pass)
    x_2[n]: N/4 samples, band [F_s/8, F_s/4]   (Level 2 high-pass)
    x_3[n]: N/8 samples, band [F_s/16, F_s/8]  (Level 3 high-pass)
    x_4[n]: N/8 samples, band [0, F_s/16]      (Level 3 low-pass)

where F_s is the sampling rate.

Formally, with analysis impulse responses h_L and h_H, each level k computes

    x_k[n] = Σ_m h[2n − m] x_{k−1}[m]

i.e., filtering followed by decimation by 2, where h is h_H for the detail band and h_L for the approximation band. On the synthesis side, with impulse responses g_L and g_H corresponding to G_L(z) and G_H(z), the signal is rebuilt level by level (Level 3 → Level 2 → Level 1) by upsampling and filtering:

    x̂_{k−1}[n] = Σ_m g_L[n − 2m] x_k^L[m] + Σ_m g_H[n − 2m] x_k^H[m]

so that the N/8-, N/8-, N/4-, and N/2-point sub-band signals are mapped back to the N-point signal x[n]. [The equations on this page are reconstructed from garbled fragments.]

3. Sub-band Feature Statistics Compensation

Let x_m[n], 0 ≤ n < N, 0 ≤ m < M, denote the time sequence of the m-th feature component over an utterance of N frames, where M is the feature dimension. An L-level DWT decomposes each x_m[n] into L + 1 sub-band sequences: on the modulation-frequency axis of the feature stream, the lowest band covers [0, F_s / 2^L] and the f-th band covers [F_s / 2^f, F_s / 2^{f−1}] for f = 1, 2, ..., L. The statistics of each sub-band sequence (means, variances, or full histograms) are then compensated separately, so the temporal structure of the features is modified band by band rather than globally. [Parts of this derivation are garbled in this transcription.]

Figure (p. 254). Processing flow: the feature sequence x[n] is decomposed by the analysis filters H_L(z) (low-pass) and H_H(z) (high-pass) with downsamplers into sub-band sequences x_1[n], ..., x_4[n] (levels 1-3); a normalization process is applied to each sub-band; the upsamplers and synthesis filters G_L(z) (low-pass) and G_H(z) (high-pass) then reconstruct the compensated sequence. In short: DWT decomposition → feature statistics normalization (MVN, HEQ) → IDWT reconstruction.
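The flow above (DWT decomposition, per-band statistics normalization, IDWT reconstruction) can be sketched end to end. The sketch below uses a hand-rolled Haar wavelet so that no wavelet library is assumed, applies only MVN (HEQ is omitted), and uses the three-level setting of the decomposition described earlier; it is an illustrative rendering, not the authors' implementation.

    import numpy as np

    def haar_dwt(x):
        """One-level Haar analysis: return (approximation, detail) at half rate."""
        x = x[: len(x) // 2 * 2]
        return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

    def haar_idwt(a, d):
        """One-level Haar synthesis, the inverse of haar_dwt."""
        y = np.empty(2 * len(a))
        y[0::2], y[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        return y

    def mvn(s, eps=1e-8):
        """Mean-variance normalization of one sub-band sequence."""
        return (s - s.mean()) / (s.std() + eps)

    def subband_mvn(x, levels=3):
        """Decompose a feature time-series, apply MVN per sub-band, reconstruct."""
        details, a = [], np.asarray(x, dtype=float)
        for _ in range(levels):
            a, d = haar_dwt(a)
            details.append(mvn(d))   # normalize each detail sub-band
        a = mvn(a)                    # and the final approximation band
        for d in reversed(details):
            a = haar_idwt(a, d)
        return a

    print(subband_mvn(np.arange(16.0)))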

[The experimental results and conclusions of this paper (pp. 255-256) are lost in this transcription.]


More information

THE APPLICATION OF ISOTOPE RATIO ANALYSIS BY INDUCTIVELY COUPLED PLASMA MASS SPECTROMETER A Dissertation Presented By Chaoyong YANG Supervisor: Prof.D

THE APPLICATION OF ISOTOPE RATIO ANALYSIS BY INDUCTIVELY COUPLED PLASMA MASS SPECTROMETER A Dissertation Presented By Chaoyong YANG Supervisor: Prof.D 10384 070302 9825042 UDC 2001.6. 2001.7. 20016 THE APPLICATION OF ISOTOPE RATIO ANALYSIS BY INDUCTIVELY COUPLED PLASMA MASS SPECTROMETER A Dissertation Presented By Chaoyong YANG Supervisor: Prof.Dr. Xiaoru

More information

Microsoft PowerPoint - Aqua-Sim.pptx

Microsoft PowerPoint - Aqua-Sim.pptx Peng Xie, Zhong Zhou, Zheng Peng, Hai Yan, Tiansi Hu, Jun-Hong Cui, Zhijie Shi, Yunsi Fei, Shengli Zhou Underwater Sensor Network Lab 1 Outline Motivations System Overview Aqua-Sim Components Experimental

More information

Microsoft PowerPoint - ATF2015.ppt [相容模式]

Microsoft PowerPoint - ATF2015.ppt [相容模式] Improving the Video Totalized Method of Stopwatch Calibration Samuel C.K. Ko, Aaron Y.K. Yan and Henry C.K. Ma The Government of Hong Kong Special Administrative Region (SCL) 31 Oct 2015 1 Contents Introduction

More information

S S symmetry Article Anomaly Detection Based on Mining Six Local Data Features and BP Neural Network Yu Zhang 1, Yuanpeng Zhu 2, *, Xuqiao Li 2, Xiaol

S S symmetry Article Anomaly Detection Based on Mining Six Local Data Features and BP Neural Network Yu Zhang 1, Yuanpeng Zhu 2, *, Xuqiao Li 2, Xiaol S S symmetry Artcle Anomaly Detecton Based on Mnng Sx Local Data Features BP Neural Networ Yu Zhang, Yuanpeng Zhu 2, *, Xuqao L 2, Xaole Wang 2 Xutong Guo 2 School Mechancal & Automotve Engneerng, South

More information

Microsoft Word - TIP006SCH Uni-edit Writing Tip - Presentperfecttenseandpasttenseinyourintroduction readytopublish

Microsoft Word - TIP006SCH Uni-edit Writing Tip - Presentperfecttenseandpasttenseinyourintroduction readytopublish 我 难 度 : 高 级 对 们 现 不 在 知 仍 道 有 听 影 过 响 多 少 那 次 么 : 研 英 究 过 文 论 去 写 文 时 作 的 表 技 引 示 巧 言 事 : 部 情 引 分 发 言 该 生 使 在 中 用 过 去, 而 现 在 完 成 时 仅 表 示 事 情 发 生 在 过 去, 并 的 哪 现 种 在 时 完 态 成 呢 时? 和 难 过 道 去 不 时 相 关? 是 所 有

More information

ms JF12 1] ms.. ( ) ] 3] 4-5] 6-7]. ( ) Hz. 2. 8] ( ). ( ). 9-11] ]. ( ) 14].. 15].. (JF12) km 5 9

ms JF12 1] ms.. ( ) ] 3] 4-5] 6-7]. ( ) Hz. 2. 8] ( ). ( ). 9-11] ]. ( ) 14].. 15].. (JF12) km 5 9 48 1 Vol. 48 No. 1 2016 1 Chnese Journal of Theoretcal and Appled Mechancs Jan. 2016 1) 2) ( 100190)...... O326 A do 10.6052/0459-1879-15-152 THEORETICAL INVESTIGATION ON AERODYNAMIC FORCE MEASUREMENT

More information

2005硕士论文模版

2005硕士论文模版 基 于 输 入 法 用 户 词 库 和 查 询 日 志 的 若 干 研 究 Some Research based on User Dictionary of Input Method and Query Log ( 申 请 清 华 大 学 工 学 硕 士 学 位 论 文 ) 培 养 单 位 : 计 算 机 科 学 与 技 术 系 学 科 : 计 算 机 科 学 与 技 术 研 究 生 : 王 鹏

More information

國 立 屏 東 教 育 大 學 中 國 語 文 學 系 碩 士 班 碩 士 論 文 國 小 國 語 教 科 書 修 辭 格 分 析 以 南 一 版 為 例 指 導 教 授 : 柯 明 傑 博 士 研 究 生 : 鄺 綺 暖 撰 中 華 民 國 一 百 零 二 年 七 月 謝 辭 寫 作 論 文 的 日 子 終 於 畫 下 了 句 點, 三 年 前 懷 著 對 文 學 的 熱 愛, 報 考 了 中

More information

untitled

untitled LBS Research and Application of Location Information Management Technology in LBS TP319 10290 UDC LBS Research and Application of Location Information Management Technology in LBS , LBS PDA LBS

More information

20

20 37 92 19 40 19 20 21 1 7 22 1/5 6/30 5/3030 23 24 25 26 1 2 27 1 2 28 29 30 5 8 8 3 31 32 33 34 35 36 37 38 39 A Study Investigating Elementary School Students Concept of the Unit in Fraction in Northern

More information

报 告 1: 郑 斌 教 授, 美 国 俄 克 拉 荷 马 大 学 医 学 图 像 特 征 分 析 与 癌 症 风 险 评 估 方 法 摘 要 : 准 确 的 评 估 癌 症 近 期 发 病 风 险 和 预 后 或 者 治 疗 效 果 是 发 展 和 建 立 精 准 医 学 的 一 个 重 要 前

报 告 1: 郑 斌 教 授, 美 国 俄 克 拉 荷 马 大 学 医 学 图 像 特 征 分 析 与 癌 症 风 险 评 估 方 法 摘 要 : 准 确 的 评 估 癌 症 近 期 发 病 风 险 和 预 后 或 者 治 疗 效 果 是 发 展 和 建 立 精 准 医 学 的 一 个 重 要 前 东 北 大 学 中 荷 生 物 医 学 与 信 息 工 程 学 院 2016 年 度 生 物 医 学 与 信 息 工 程 论 坛 会 议 时 间 2016 年 6 月 8 日, 星 期 三,9:30 至 16:00 会 议 地 址 会 议 网 址 主 办 单 位 东 北 大 学 浑 南 校 区 沈 阳 市 浑 南 区 创 新 路 195 号 生 命 科 学 大 楼 B 座 619 报 告 厅 http://www.bmie.neu.edu.cn

More information

中文模板

中文模板 ISSN 000-9825, CODEN RUXUEW E-mal jos@scasaccn Journal of Software, Vol8, No4, Aprl 2007, pp905 98 http//wwwjosorgcn DOI 0360/jos80905 Tel/Fax +86-0-62562563 2007 by Journal of Software All rghts reserved

More information

ISSN (2004) vol. 1, no. 2, page 155~172 * ** *** * ** ***

ISSN (2004) vol. 1, no. 2, page 155~172 * ** *** *   **   *** ISSN 82-8572 (2004) vol., no. 2, page 55~72 * ** *** * e-al: cl.huang@sa.hnet.net ** e-al: bcchen@ntut.edu.tw *** e-al: 9225002@cat.hfu.edu.tw (Support Vector Machne) UCI 92 5 6 3 4 92 [] - 55 - (Data

More information

Improved Preimage Attacks on AES-like Hash Functions: Applications to Whirlpool and Grøstl

Improved Preimage Attacks on AES-like Hash Functions: Applications to Whirlpool and Grøstl SKLOIS (Pseudo) Preimage Attack on Reduced-Round Grøstl Hash Function and Others Shuang Wu, Dengguo Feng, Wenling Wu, Jian Guo, Le Dong, Jian Zou March 20, 2012 Institute. of Software, Chinese Academy

More information

DOI /j.cnki.cjhd MPS,,, , MLParticle-SJTU MLParticle-SJTU MLParticle-SJTU U661.1 A Numerical

DOI /j.cnki.cjhd MPS,,, ,   MLParticle-SJTU MLParticle-SJTU MLParticle-SJTU U661.1 A Numerical DOI10.16076/j.cnk.cjhd.2015.02.004 MPS,,, 200240, Emal: yangyaqang2013@sna.com MLPartcle-SJTU MLPartcle-SJTU MLPartcle-SJTU U661.1 A Numercal study on lqud sloshng n horzontal baffled tank by MPS method

More information

66 臺 中 教 育 大 學 學 報 : 人 文 藝 術 類 Abstract This study aimed to analyze the implementing outcomes of ability grouping practice for freshman English at a u

66 臺 中 教 育 大 學 學 報 : 人 文 藝 術 類 Abstract This study aimed to analyze the implementing outcomes of ability grouping practice for freshman English at a u 臺 中 教 育 大 學 學 報 : 人 文 藝 術 類 0 年,(),-0 65 私 立 科 技 大 學 四 技 大 一 新 生 英 文 能 力 分 級 教 學 成 效 分 析 An Analysis of the Implementing Outcomes of Ability Grouping of Freshman English in a University of Technology 溫

More information

168 健 等 木醋对几种小浆果扦插繁殖的影响 第1期 the view of the comprehensive rooting quality, spraying wood vinegar can change rooting situation, and the optimal concent

168 健 等 木醋对几种小浆果扦插繁殖的影响 第1期 the view of the comprehensive rooting quality, spraying wood vinegar can change rooting situation, and the optimal concent 第 31 卷 第 1 期 2013 年 3 月 经 济 林 研 究 Nonwood Forest Research Vol. 31 No.1 Mar. 2013 木醋对几种小浆果扦插繁殖的影响 健 1,2 杨国亭 1 刘德江 2 (1. 东北林业大学 生态研究中心 黑龙江 哈尔滨 150040 2. 佳木斯大学 生命科学学院 黑龙江 佳木斯 154007) 摘 要 为了解决小浆果扦插繁殖中生根率及成活率低等问题

More information

~ 10 2 P Y i t = my i t W Y i t 1000 PY i t Y t i W Y i t t i m Y i t t i 15 ~ 49 1 Y Y Y 15 ~ j j t j t = j P i t i = 15 P n i t n Y

~ 10 2 P Y i t = my i t W Y i t 1000 PY i t Y t i W Y i t t i m Y i t t i 15 ~ 49 1 Y Y Y 15 ~ j j t j t = j P i t i = 15 P n i t n Y * 35 4 2011 7 Vol. 35 No. 4 July 2011 3 Population Research 1950 ~ 1981 The Estimation Method and Its Application of Cohort Age - specific Fertility Rates Wang Gongzhou Hu Yaoling Abstract Based on the

More information

<4D6963726F736F667420576F7264202D20B3C2B9FAD5D7C2DBCEC4C5C5B0E62E646F63>

<4D6963726F736F667420576F7264202D20B3C2B9FAD5D7C2DBCEC4C5C5B0E62E646F63> 七 年 制 专 业 学 位 论 文 论 文 题 目 正 常 人 群 上 肢 骨 与 关 节 周 围 解 剖 数 据 库 的 建 立 及 意 义 研 究 生 姓 名 指 导 教 师 姓 名 陈 国 兆 张 长 青 专 业 名 称 临 床 医 学 ( 本 硕 连 读 ) 研 究 方 向 外 科 学 ( 骨 外 科 学 ) 论 文 提 交 日 期 2013 年 6 月 正 常 人 群 上 肢 骨 与 关

More information

穨6街舞對抗中正紀念堂_林伯勳張金鶚_.PDF

穨6街舞對抗中正紀念堂_林伯勳張金鶚_.PDF ( ) 115 115140 Journal of City and Planning(2002) Vol.29, No.1, pp.115140 90 10 26 91 05 20 2 3 --- ( ) 1. 2. mag.ryan@msa.hinet.net 3. jachang@nccu.edu.tw 1018-1067/02 2002 Chinese Institute of Urban

More information

194 8 DING Feng Syste dentfcaton Part H Coupled dentfcaton concept and ethods IEEE Transactons on Autoatc Control C-LS

194 8 DING Feng Syste dentfcaton Part H Coupled dentfcaton concept and ethods IEEE Transactons on Autoatc Control C-LS 1674-7070 2012 03-0193-20 8 1 2 3 FIR CAR CARMA CARAR CARARMA OEMA OEAR TP273 A 0 2012-06-04 60973043 fdng@ jangnan edu cn 1 214122 2 214122 3 214122 1853 1928 Lorentz transforaton specal relatvty 1879

More information

: 23 S00017242 1 -----------------------------------------------------------------------------1 -----------------------------------------------------------------------------3 -------------------------------------------------------------------7

More information

201515

201515 ,() Computer Engneerng and Applcatons 计 算 机 工 程 与 应 用 多 智 能 体 布 谷 鸟 算 法 的 网 络 计 划 资 源 均 衡 优 化 宋 玉 坚, 叶 春 明,, 黄 佐 钘 SONG Yujan, YE Chunmng, HUANG Zuoxng,. 上 海 理 工 大 学 管 理 学 院, 上 海. 上 海 期 货 交 易 所, 上 海.Bussness

More information

Thesis for the Master degree in Engineering Research on Negative Pressure Wave Simulation and Signal Processing of Fluid-Conveying Pipeline Leak Candi

Thesis for the Master degree in Engineering Research on Negative Pressure Wave Simulation and Signal Processing of Fluid-Conveying Pipeline Leak Candi U17 10220 UDC624 Thesis for the Master degree in Engineering Research on Negative Pressure Wave Simulation and Signal Processing of Fluid-Conveying Pipeline Leak Candidate:Chen Hao Tutor: Xue Jinghong

More information

基于VC++6.0和数据库的案例推理认知引擎

基于VC++6.0和数据库的案例推理认知引擎 Hans Journal of Wreless Communcatons 无 线 通 信, 2013, 3, 110-114 http://dx.do.org/10.12677/hjwc.2013.35017 Publshed Onlne October 2013 (http://www.hanspub.org/journal/hjwc.html) Cogntve Engne of Case-Based

More information

IP TCP/IP PC OS µclinux MPEG4 Blackfin DSP MPEG4 IP UDP Winsock I/O DirectShow Filter DirectShow MPEG4 µclinux TCP/IP IP COM, DirectShow I

IP TCP/IP PC OS µclinux MPEG4 Blackfin DSP MPEG4 IP UDP Winsock I/O DirectShow Filter DirectShow MPEG4 µclinux TCP/IP IP COM, DirectShow I 2004 5 IP TCP/IP PC OS µclinux MPEG4 Blackfin DSP MPEG4 IP UDP Winsock I/O DirectShow Filter DirectShow MPEG4 µclinux TCP/IP IP COM, DirectShow I Abstract The techniques of digital video processing, transferring

More information

untitled

untitled 5 5. 5-5- G= V E V={ 2 } E V V 5- servce center CPU I/O E 83 5 routng probablty sngle class queueng networks CPU I/O CPU I/O mult-class mult-chan 5.. (open networks) (closed networks) (mxed networks) 5-5-

More information

Microsoft Word - 論文封面-980103修.doc

Microsoft Word - 論文封面-980103修.doc 淡 江 大 學 中 國 文 學 學 系 碩 士 在 職 專 班 碩 士 論 文 指 導 教 授 : 呂 正 惠 蘇 敏 逸 博 士 博 士 倚 天 屠 龍 記 愛 情 敘 事 之 研 究 研 究 生 : 陳 麗 淑 撰 中 華 民 國 98 年 1 月 淡 江 大 學 研 究 生 中 文 論 文 提 要 論 文 名 稱 : 倚 天 屠 龍 記 愛 情 敘 事 之 研 究 頁 數 :128 校 系 (

More information

1-34

1-34 BIBLID 0254-4466(2000)18:2 pp. 1-34 18 2 89 12 * 1 2 1 2 1 38 1981.6 854 2 11 1982.6 15-34 1992.4 232-250 3 3 3 1965.6 20 5 60 983-984 4 4 5 6 7 4 1980 20 1388 15005 5 1994.11 10 23 6 1980 11 387 8276

More information

Microsoft Word - Rocling 2012 Yu.doc

Microsoft Word - Rocling 2012 Yu.doc Skp N-gram Modelng for Near-Synonym Choce Shh-Tng Chen Department of Informaton Management Yuan Ze Unversty s996222@mal.yzu.edu.tw We-Cheng He Department of Informaton Management Yuan Ze Unversty s1006250@mal.yzu.edu.tw

More information

Abstract Due to the improving of living standards, people gradually seek lighting quality from capacityto quality. And color temperature is the important subject of it. According to the research from aboard,

More information

<4D6963726F736F667420576F7264202D203338B4C12D42A448A4E5C3C0B34EC3FE2DAB65ABE1>

<4D6963726F736F667420576F7264202D203338B4C12D42A448A4E5C3C0B34EC3FE2DAB65ABE1> ϲ ฯ र ቑ ጯ 高雄師大學報 2015, 38, 63-93 高雄港港史館歷史變遷之研究 李文環 1 楊晴惠 2 摘 要 古老的建築物往往承載許多回憶 也能追溯某些歷史發展的軌跡 位於高雄市蓬 萊路三號 現為高雄港港史館的紅磚式建築 在高雄港三號碼頭作業區旁的一片倉庫 群中 格外搶眼 這棟建築建成於西元 1917 年 至今已將近百年 不僅躲過二戰戰 火無情轟炸 並保存至今 十分可貴 本文透過歷史考證

More information

WTO

WTO 10384 200015128 UDC Exploration on Design of CIB s Human Resources System in the New Stage (MBA) 2004 2004 2 3 2004 3 2 0 0 4 2 WTO Abstract Abstract With the rapid development of the high and new technique

More information

第1章

第1章 复 杂 环 境 下 说 话 人 确 认 鲁 棒 性 研 究 ( 申 请 清 华 大 学 工 学 博 士 学 位 论 文 ) 培 养 单 位 : 计 算 机 科 学 与 技 术 系 学 科 : 计 算 机 科 学 与 技 术 研 究 生 : 王 军 指 导 教 师 : 郑 方 研 究 员 二 一 五 年 四 月 复 杂 环 境 下 说 话 人 确 认 鲁 棒 性 研 究 王 军 Research on

More information

國立中山大學學位論文典藏.PDF

國立中山大學學位論文典藏.PDF The Study on the New Pension Scheme for Civil Servants Evidence from Kaohsiung County I II 1. III Thesis Abstract Title of Thesis The Study on the New Pension Scheme for Civil Servants: Evidence from Kaohsiung

More information

東吳大學

東吳大學 律 律 論 論 療 行 The Study on Medical Practice and Coercion 林 年 律 律 論 論 療 行 The Study on Medical Practice and Coercion 林 年 i 讀 臨 療 留 館 讀 臨 律 六 礪 讀 不 冷 療 臨 年 裡 歷 練 禮 更 老 林 了 更 臨 不 吝 麗 老 劉 老 論 諸 見 了 年 金 歷 了 年

More information

untitled

untitled Sascs & Daa Analyss 7 ar II Zhu Huaqu @eng Unversy 7.4 Analyss Dscrmnan Cluserng Analyss .. Q.. R R 7.4. 74 n =,,; =,,n X X X n X n X n X n,..., n s n,..., R ma mn,..., n ,,..., ;,,..., n n,..., 0 s s

More information

Untitled-3

Untitled-3 SEC.. Separable Equations In each of problems 1 through 8 solve the given differential equation : ü 1. y ' x y x y, y 0 fl y - x 0 fl y - x 0 fl y - x3 3 c, y 0 ü. y ' x ^ y 1 + x 3 x y 1 + x 3, y 0 fl

More information

摘 要 張 捷 明 是 台 灣 當 代 重 要 的 客 語 兒 童 文 學 作 家, 他 的 作 品 記 錄 著 客 家 人 的 思 想 文 化 與 觀 念, 也 曾 榮 獲 多 項 文 學 大 獎 的 肯 定, 對 台 灣 這 塊 土 地 上 的 客 家 人 有 著 深 厚 的 情 感 張 氏 於

摘 要 張 捷 明 是 台 灣 當 代 重 要 的 客 語 兒 童 文 學 作 家, 他 的 作 品 記 錄 著 客 家 人 的 思 想 文 化 與 觀 念, 也 曾 榮 獲 多 項 文 學 大 獎 的 肯 定, 對 台 灣 這 塊 土 地 上 的 客 家 人 有 著 深 厚 的 情 感 張 氏 於 玄 奘 大 學 中 國 語 文 學 系 碩 士 論 文 客 家 安 徒 生 張 捷 明 童 話 研 究 指 導 教 授 : 羅 宗 濤 博 士 研 究 生 : 黃 春 芳 撰 中 華 民 國 一 0 二 年 六 月 摘 要 張 捷 明 是 台 灣 當 代 重 要 的 客 語 兒 童 文 學 作 家, 他 的 作 品 記 錄 著 客 家 人 的 思 想 文 化 與 觀 念, 也 曾 榮 獲 多 項 文

More information