Mandarin Pronunciation Variation Modeling
Thomas Fang Zheng
Center of Speech Technology, State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology, Tsinghua University
fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/
NCMMSC'01, 20-22 Nov 2001, Shenzhen, China
Center of Speech Technology, Tsinghua University
Slide 2: Motivation
❑ In spontaneous speech, the pronunciations of individual words vary. There are often
  ❖ Sound changes, and
  ❖ Phone changes.
  ❖ Changes include insertion, deletion, and substitution.
  ❖ For Chinese:
    ➢ an additional accent problem, even when people are speaking Mandarin, due to different dialect backgrounds (Chinese has 7 major dialects)
    ➢ colloquialism, grammar, style
❑ Goal: modelling the pronunciation variations
  ❖ Establishing a corpus with spontaneous phenomena, because we need to know what the canonical phones change to
  ❖ Finding solutions to pronunciation modelling, both theoretically and practically
Slide 3: Overview

| Authors | Paper (Source) | Database | Method | WER |
|---|---|---|---|---|
| T. Fukada, Y. Sagisaka (ATR, Japan) | Automatic generation of a pronunciation dictionary based on a pronunciation network (EuroSpeech'97) | Japanese Spontaneous | ANN Prediction | 75.54 % / 67.44 % |
| M-K Liu, Bo Xu (NLPR, China) | Mandarin accent adaptation based on CI/CD pronunciation modeling (ICASSP'2000) | Shanghai Accent (Intel) | Confusion Matrix | 45.13 % / 40.24 % |
| M. Saraclar (CLSP, JHU), H. Nock (CUED, Cam., UK) | Pronunciation modeling by sharing Gaussian densities across phonetic models (EuroSpeech'99) | Switchboard | Gaussian Sharing | 50.10 % / 48.70 % |
| K. Ma, G. Zavaliagkos (GTE / BBN, USA) | Pronunciation modeling for large vocabulary conversational speech recognition (ICSLP'98) | Switchboard, Callhome | Lexical Adaptation | 54.60 % / 53.49 % |
| M. Riley (AT&T Labs), W. Byrne (CLSP, JHU) | Stochastic pronunciation modelling from hand-labelled phonetic corpora (Speech Communication, 1999 (29)) | TIMIT + ICSI | Decision Tree | 44.66 % / 44.05 % |
| D. Povey, P.C. Woodland (CUED, Cambridge, UK) | Improved discriminative training techniques for large vocabulary continuous speech recognition (ICASSP'2001) | NAB, Switchboard | Discriminant Training | 46.60 % / 44.30 % |
| T. Hain, P.C. Woodland (CUED, Cambridge, UK) | New features in the CU-HTK system for transcription of conversational telephone speech (ICASSP'2001) | NIST Hub5E (Telephone) | VTLN, MMIE | 51.60 % / 47.00 % |
Slide 4: Necessity of establishing a new annotated spontaneous speech corpus
❑ The existing databases (incl. Broadcast News, CallHome, CallFriend, ...) do not cover all the Chinese spoken language phenomena
  ❖ Sound changes: voiced, unvoiced, nasalization, ...
  ❖ Phone changes: retroflexed, OOV-phoneme, ...
❑ The existing databases do not contain pronunciation variation information for use in bootstrap training
❑ A Chinese Annotated Spontaneous Speech (CASS) Corpus was established before WS00 on LSP at JHU
  ❖ Completely spontaneous (discourses, lectures, ...)
  ❖ Considerable background noise, accent background, ...
  ❖ Recorded onto tapes and then digitized
Slide 5: Chinese Annotated Spontaneous Speech (CASS) Corpus
❑ CASS w/ Five-Tier Transcription
  ❖ Character Level: base form
  ❖ Syllable (or Pinyin) Level (w/ tone): base form
  ❖ Initial/Final (IF) Level: w/ time boundary for base form
  ❖ SAMPA-C Level: surface form
  ❖ Miscellaneous Level: used for garbage modeling
    ➢ Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese
  ❖ Example ("Let's get to know a few more people"):
    Character:      我 们 多 认 识 点 人
    Syllable:       wo3 men0 duo1 ren4 shi0 dian3 ren2
    CASS Syllable:  wo3 men0 duo1 ren4 shi0 dianr3 ren2
    IF:             uo m @_n t uo z` @_n s` i` t iE_n z` @_n
    GIF:            uo @_n t_v uo z` @_n s`_v t_v ia` z` @_n
    Misc:           noise mum
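As a rough illustration of the five tiers, the example utterance can be held in a small record type. This is only a sketch: the class name `CassUtterance`, its field names, and the helper `changed_syllables` are hypothetical, and the time boundaries that the IF tier carries in the real corpus are left out because the slide does not show them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CassUtterance:
    """One utterance with the five CASS tiers (hypothetical layout)."""
    characters: List[str]        # Character tier (base form)
    syllables: List[str]         # canonical pinyin with tone (base form)
    cass_syllables: List[str]    # pinyin as actually spoken
    if_tokens: List[str]         # canonical Initial/Final units in SAMPA-C
    gif_tokens: List[str]        # surface-form units in SAMPA-C
    misc: List[str] = field(default_factory=list)  # garbage-model labels

    def changed_syllables(self) -> List[Tuple[int, str, str]]:
        """Return (index, base, surface) where the spoken syllable differs."""
        return [
            (i, base, surf)
            for i, (base, surf) in enumerate(zip(self.syllables, self.cass_syllables))
            if base != surf
        ]


# The example from the slide, tokenised on whitespace.
utt = CassUtterance(
    characters=list("我们多认识点人"),
    syllables="wo3 men0 duo1 ren4 shi0 dian3 ren2".split(),
    cass_syllables="wo3 men0 duo1 ren4 shi0 dianr3 ren2".split(),
    if_tokens="uo m @_n t uo z` @_n s` i` t iE_n z` @_n".split(),
    gif_tokens="uo @_n t_v uo z` @_n s`_v t_v ia` z` @_n".split(),
    misc=["noise", "mum"],
)

print(utt.changed_syllables())
```

In this example the only syllable-level difference is the retroflexed `dianr3`. The IF and GIF tiers differ in more places (13 canonical units against 11 surface units), which is the kind of detail the SAMPA-C tier is meant to capture.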
Slide 6: SAMPA-C: Machine Readable IPA
❑ Phonologic Consonants: 23
❑ Phonologic Vowels: 9
❑ Initials: 21
❑ Finals: 38
❑ Retroflexed Finals: 38
❑ Tones and Silences
❑ Sound Changes
❑ Spontaneous Phenomenon Labels
Slide 7: Key Points in PM (1)
❑ Choosing and generating the speech recognition unit (SRU) set
  ❖ The set should describe the phone changes and sound changes well
  ❖ Could be syllable, semi-syllable, or INITIAL/FINAL
❑ Constructing a multi-pronunciation lexicon (MPL)
  ❖ A syllable-to-SRU lexicon that reflects the relation between the grammatical units and the acoustic models
❑ Acoustically modeling spontaneous speech
  ❖ Theoretical framework
  ❖ CD modeling; confusion matrix; data-driven
Slide 8: Key Points in PM (2)
❑ Customizing the decoding algorithm according to the new lexicon
  ❖ Improved time-synchronous search algorithm to reduce the path expansion (caused by CD modeling)
  ❖ A*-based tree-trellis search algorithm to score multiple pronunciation variations simultaneously in the path
❑ Modifying the statistical language model
  (X: acoustic observations; W: base-form sequence; V: surface-form sequence with W = Baseform(V))

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)$$

$$\hat{W} = \arg\max_{W = \mathrm{Baseform}(V)} P(X \mid V)\, P(V)$$

$$\hat{W} = \arg\max_{W = \mathrm{Baseform}(V)} P(X \mid V)\, P(V \mid W)\, P(W)$$
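The last criterion above can be illustrated with a toy single-syllable decoder that maximises P(X|V) P(V|W) P(W) jointly over base forms W and their surface variants V. This is a sketch under stated assumptions, not the search algorithm described on the slide: the function name `decode` is hypothetical, the acoustic log-likelihoods, the language-model probabilities, and the "zhang" entries are invented for illustration, and only the two "chang" pronunciation weights are taken from the probabilistic lexicon example in this deck.

```python
import math
from typing import Dict, Optional, Tuple


def decode(
    acoustic_loglik: Dict[str, float],       # log P(X | V) per surface form V
    pron_prob: Dict[str, Dict[str, float]],  # P(V | W) per base form W
    lm_prob: Dict[str, float],               # P(W)
) -> Tuple[str, str, float]:
    """Return (W, V, log score) maximising P(X|V) * P(V|W) * P(W)."""
    best: Optional[Tuple[str, str, float]] = None
    for word, variants in pron_prob.items():
        for pron, p_v_given_w in variants.items():
            score = (
                acoustic_loglik[pron]
                + math.log(p_v_given_w)
                + math.log(lm_prob[word])
            )
            if best is None or score > best[2]:
                best = (word, pron, score)
    if best is None:
        raise ValueError("empty lexicon")
    return best


# Hypothetical toy inputs (log-likelihoods and LM values are made up).
acoustic_loglik = {"ts`_h AN": -12.0, "ts`_v AN": -9.0, "ts` AN": -11.0}
pron_prob = {
    "chang": {"ts`_h AN": 0.7850, "ts`_v AN": 0.0280},
    "zhang": {"ts` AN": 0.90, "ts`_v AN": 0.10},  # invented weights
}
lm_prob = {"chang": 0.6, "zhang": 0.4}

word, pron, score = decode(acoustic_loglik, pron_prob, lm_prob)
print(word, pron, round(score, 3))
```

With these made-up numbers the acoustically best surface form is the voiced variant, but its low P(V|W) under both base forms lets the canonical pronunciation of "zhang" win overall, which is the effect the P(V|W) term is meant to have.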
Slide 9: Establishment of Multi-Pron. Lexicon
❑ Two major approaches
  ❖ Defined by linguists and phoneticians
  ❖ Data-driven: confusion matrix, rewrite rules, decision tree, ...
❑ Our method:
  ❖ Find all possible pronunciations in SAMPA-C from the database
  ❖ Reduce the size of the lexicon according to occurrence frequencies
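The two steps of the method can be sketched as counting followed by pruning. This is a minimal sketch, assuming the corpus has already been reduced to (syllable, surface form) pairs. The function names, the `min_count` threshold, and the rule of always keeping the most frequent variant are assumptions, since the slide does not state the exact pruning criterion. The toy pairs are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def collect_pronunciations(
    pairs: Iterable[Tuple[str, str]]
) -> Dict[str, Counter]:
    """Count every observed SAMPA-C surface form for each syllable."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for syllable, surface in pairs:
        counts[syllable][surface] += 1
    return dict(counts)


def prune_by_frequency(
    counts: Dict[str, Counter], min_count: int = 2
) -> Dict[str, Dict[str, int]]:
    """Drop rare variants; always keep the most frequent one per syllable."""
    pruned: Dict[str, Dict[str, int]] = {}
    for syllable, variants in counts.items():
        kept = {s: n for s, n in variants.items() if n >= min_count}
        if not kept:
            top_surface, top_count = variants.most_common(1)[0]
            kept = {top_surface: top_count}
        pruned[syllable] = kept
    return pruned


# Invented toy observations, not corpus data.
pairs = (
    [("chang", "ts`_h AN")] * 5
    + [("chang", "ts`_h_v AN")] * 2
    + [("chang", "iAN")]
    + [("men", "@_n")]
)

lexicon = prune_by_frequency(collect_pronunciations(pairs), min_count=2)
print(lexicon)
```

A count threshold is only one possible choice. A relative-frequency cutoff or a fixed maximum number of variants per syllable would fit the same description.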
Slide 10: Surface Form for IF and Syllable
❑ Learning pronunciations
  ❖ Definition of Generalized Initial-Finals (GIFs): collect all observed forms and choose the most frequent ones as GIFs
    (pinyin unit, SAMPA-C form: description)
    ➢ z  ts      : canonical
    ➢ z  ts_v    : voiced
    ➢ z  ts`     : changed to 'zh'
    ➢ z  ts`_v   : changed to voiced 'zh'
    ➢ e  7       : canonical
    ➢ e  7`      : retroflexed or changed to 'er'
    ➢ e  @       : changed
  ❖ Definition of Generalized Syllables (GSs), the lexicon: define them according to the GIF set
    ➢ chang [0.7850] ts`_h AN
    ➢ chang [0.1215] ts`_h_v AN
    ➢ chang [0.0280] ts`_v AN
    ➢ chang [0.0187] AN
    ➢ chang [0.0187] z` AN
    ➢ chang [0.0093] iAN
    ➢ chang [0.0093] ts_h AN
    ➢ chang [0.0093] ts`_h UN
  ❖ Probabilistic lexicon: each entry is weighted by $P([\mathrm{GIF}_i]\ \mathrm{GIF}_f \mid \mathrm{Syllable})$
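The lexicon weights can be read as relative frequencies of each generalized syllable given its base syllable. The sketch below normalises counts into such weights. The counts are hypothetical: they are not given in the slides and were chosen only because, with a total of 107, they round to the probabilities listed for "chang". The function name `to_probabilistic_lexicon` is likewise an assumption.

```python
from typing import Dict


def to_probabilistic_lexicon(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn surface-form counts for one syllable into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations for this syllable")
    return {surface: n / total for surface, n in counts.items()}


# Hypothetical counts for "chang" (assumed, chosen to match the listed weights).
chang_counts = {
    "ts`_h AN": 84,
    "ts`_h_v AN": 13,
    "ts`_v AN": 3,
    "AN": 2,
    "z` AN": 2,
    "iAN": 1,
    "ts_h AN": 1,
    "ts`_h UN": 1,
}

chang_lexicon = to_probabilistic_lexicon(chang_counts)
for surface, prob in chang_lexicon.items():
    print(f"chang [{prob:.4f}] {surface}")
```

Because the weights for one syllable sum to one, they can be used directly as the pronunciation term in a decoder that scores several variants of the same base form.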