Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

Mark Hasegawa-Johnson (jhasegaw@uiuc.edu)

Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)
What are Landmarks?
• Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)); see the sketch below)
• Acoustic events with similar importance in all languages, and across all speaking styles
• Acoustic events that can be detected even in extremely noisy environments

Where do these things happen?
• Syllable Onset ≈ Consonant Release
• Syllable Nucleus ≈ Vowel Center
• Syllable Coda ≈ Consonant Closure

I(phone; acoustics) experiment: Hasegawa-Johnson, 2000
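The mutual-information criterion in the first bullet is straightforward to estimate on phonetically labeled data. Below is a minimal Python sketch, not the code from the 2000 experiment: the function name, array shapes, and the quantile-based quantization are all illustrative assumptions. It scores each frequency band of a spectrogram by I(phone label; quantized energy); bands (or, with landmark-relative time offsets, time-frequency bins) where this peaks are the candidate landmark regions.

```python
# Toy estimate of I(phone; acoustics(f)) -- an assumption-laden sketch,
# not the original experiment's code.
import numpy as np
from sklearn.metrics import mutual_info_score

def landmark_salience(spectrogram, phone_labels, n_bins=8):
    """spectrogram: (n_frames, n_freqs) log-energies;
    phone_labels: (n_frames,) integer phone codes.
    Returns (n_freqs,) mutual information in nats."""
    n_frames, n_freqs = spectrogram.shape
    mi = np.zeros(n_freqs)
    for f in range(n_freqs):
        # Quantize the continuous acoustics so a discrete MI estimator applies.
        edges = np.quantile(spectrogram[:, f], np.linspace(0, 1, n_bins + 1)[1:-1])
        quantized = np.digitize(spectrogram[:, f], edges)
        mi[f] = mutual_info_score(phone_labels, quantized)
    return mi  # peaks mark the bands most informative about the phone label
```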
Landmark-Based Speech Recognition

[Figure: a first-pass word lattice (edges labeled with words, times, and scores) whose hypothesis "… backed up …" is expanded into pronunciation variants (… backed up …, … backtup …, … back up …, … backt ihp …, … wackt ihp …), each aligned to syllable structure: ONSET NUCLEUS CODA, ONSET NUCLEUS CODA.]
Talk Outline

Overview
1. Acoustic Modeling
 – Speech data and acoustic features
 – Landmark detection
 – Estimation of real-valued "distinctive features" using support vector machines (SVMs)
2. Pronunciation Modeling
 – A Dynamic Bayesian Network (DBN) implementation of Articulatory Phonology
 – A discriminative pronunciation model implemented using Maximum Entropy (MaxEnt)
3. Technological Evaluation
 – Rescoring of word lattice output from an HMM-based recognizer
 – Errors that we fixed: channel noise, laughter, etc.
 – New errors that we caused: pronunciation models trained on 3 hours can't compete with triphone models trained on 3000 hours
 – Future plans
Overview
• History
 – Research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04
• Scientific Goal
 – To use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments
• Technological Goal
 – Long-term: to create a better speech recognizer
 – Short-term: lattice rescoring, applied to word lattices produced by SRI's NN/HMM hybrid
Overview of Systems to be Described

[System diagram, from signal to rescored lattice; a toy score combiner follows below:]
1. Acoustic features: MFCC (5 ms & 1 ms frame period), formants, phonetic & auditory model parameters; 4-15 frames concatenated
2. Acoustic model: SVMs output p(landmark | SVM)
3. Pronunciation model (DBN or MaxEnt): computes p(SVM | word), given word labels with start & end times from the first-pass ASR word lattice
4. Rescoring: log-linear score combination of p(MFCC,PLP | word), p(word | words), and the landmark-based scores
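To make the log-linear combination at the top of the diagram concrete, here is a hedged sketch. The weight values, argument names, and per-edge decomposition are assumptions for illustration, not SRI's or the workshop's actual settings; in practice the weights would be tuned on a development set.

```python
# Log-linear rescoring of one lattice edge -- an illustrative sketch.
def rescore_edge(log_p_acoustic, log_p_lm, log_p_landmark,
                 w_acoustic=1.0, w_lm=1.0, w_landmark=0.5):
    """Weighted sum of log-scores: equivalent to
    p_ac**w_ac * p_lm**w_lm * p_lmk**w_lmk up to a normalizer,
    which cancels when comparing paths within one lattice."""
    return (w_acoustic * log_p_acoustic
            + w_lm * log_p_lm
            + w_landmark * log_p_landmark)

# The winning hypothesis is the path maximizing the summed edge scores, e.g.:
# best = max(paths, key=lambda p: sum(rescore_edge(*e) for e in p.edges))
```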
I. Acoustic Modeling
• Goal: learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.
• Methods:
 – Large input vector space (many acoustic feature types)
 – Regularized binary classifiers (SVMs)
 – SVM outputs "smoothed" using dynamic programming
 – SVM outputs converted to posterior probability estimates once per 5 ms using a histogram (sketched below)
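The last step, histogram-based conversion of raw SVM discriminants to posteriors, can be sketched as follows. This is an assumption-level illustration, not the workshop code: the bin count, quantile-based edges, and the 0.5 fallback for empty bins are all choices made here for the example. Edges and per-bin posteriors would be fit on held-out data, then applied to each 5 ms frame at test time.

```python
# Histogram calibration of SVM scores into posteriors -- a hedged sketch.
import numpy as np

def fit_histogram_posterior(svm_scores, labels, n_bins=40):
    """svm_scores: held-out discriminant values; labels: 1 = positive class.
    Returns (bin_edges, per-bin posterior estimates)."""
    labels = np.asarray(labels, dtype=float)
    edges = np.quantile(svm_scores, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(svm_scores, edges[1:-1]), 0, n_bins - 1)
    post = np.array([labels[idx == b].mean() if np.any(idx == b) else 0.5
                     for b in range(n_bins)])
    return edges, post

def posterior(score, edges, post):
    # Look up the calibrated posterior for one frame's SVM output.
    b = int(np.clip(np.digitize(score, edges[1:-1]), 0, len(post) - 1))
    return post[b]
```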
Speech Databases

Database           Size     Phonetic Transcr.   Word Lattices
NTIMIT             14 hrs   manual              –
WS96&97            3.5 hrs  manual              –
SWB1 WS04 subset   12 hrs   auto (SRI)          BBN
Eval01             10 hrs   –                   BBN & SRI
RT03 Dev           6 hrs    –                   SRI
RT03 Eval          6 hrs    –                   SRI
Acoustic and Auditory Features
• MFCCs, 25 ms window (standard ASR features; the MFCC stream is sketched below)
• Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond
• Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
• Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
• Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
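For concreteness, here is a sketch of only the standard MFCC stream and the 4-15 frame concatenation fed to the SVMs (see the systems overview). The use of librosa, the sampling rate, and the function name are assumptions; the MUSIC formant tracker, acoustic-phonetic parameters, and auditory model are not reproduced here.

```python
# MFCC front end with +/- context stacking -- an illustrative sketch.
import numpy as np
import librosa

def mfcc_context_features(wav_path, context=7, sr=8000):
    """Returns (n_frames, 13 * (2*context + 1)) stacked MFCC features."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms analysis window, 5 ms hop, as on the slide.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.005 * sr)).T  # (frames, 13)
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode='edge')
    # Stack each frame with its neighbors (context=7 gives 15 frames total).
    return np.hstack([padded[i:i + len(mfcc)]
                      for i in range(2 * context + 1)])
```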
What are Distinctive Features? What are Landmarks?
• Distinctive feature =
 – a binary partition of the phonemes (Jakobson, 1952)
 – … that compactly describes pronunciation variability (Halle)
 – … and correlates with distinct acoustic cues (Stevens)
• Landmark = change in the value of a manner feature
 – e.g. [+sonorant] to [–sonorant], [–sonorant] to [+sonorant]
 – 5 manner features: [sonorant, consonantal, continuant, syllabic, silence]
• Place and voicing features: SVMs are only trained at landmarks (see the detector sketch below)
 – Primary articulator: lips, tongue blade, or tongue body
 – Features of primary articulator: anterior, strident
 – Features of secondary articulator: nasal, voiced
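The landmark definition above reduces, per manner feature, to finding sign changes in a per-frame score track. The sketch below assumes signed per-frame SVM discriminants as input; it is an illustration of the definition, not the workshop's detector, which additionally smoothed the tracks with dynamic programming (see Section I).

```python
# Landmarks as manner-feature sign flips -- a hedged sketch.
import numpy as np

def find_landmarks(manner_scores):
    """manner_scores: (n_frames,) signed score for one manner feature,
    > 0 meaning [+feature]. Returns frame indices where the sign flips."""
    signs = np.sign(manner_scores)
    # Each +/- transition, e.g. [+sonorant] -> [-sonorant], is a landmark.
    return np.where(np.diff(signs) != 0)[0] + 1

# At each returned frame, the place- and voicing-feature SVMs
# (lips/blade/body, anterior, strident, nasal, voiced) are then applied.
```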