Automatic Speech Recognition
YING SHEN
SCHOOL OF SOFTWARE ENGINEERING, TONGJI UNIVERSITY

Outline
Introduction
Speech recognition based on HMM
• Acoustic processing
• Acoustic modeling: Hidden Markov Model
• Language modeling
1/28/2021 HUMAN COMPUTER INTERACTION

What is speech recognition?
Automatic speech recognition (ASR) is the process by which a computer maps an acoustic speech signal to text.
Challenges for researchers
• Linguistic factors
• Physiological factors
• Environmental factors

Classification of speech recognition systems
Users
• Speaker-dependent system
• Speaker-independent system
• Speaker-adaptive system
Vocabulary
• small vocabulary: tens of words
• medium vocabulary: hundreds of words
• large vocabulary: thousands of words
• very large vocabulary: tens of thousands of words
Word pattern
• isolated-word system: single words at a time
• continuous speech system: words are connected together

How do humans do it?
Articulation produces sound waves, which the ear conveys to the brain for processing.
[Figure: anatomy of the ear and the auditory pathway; legible labels include middle ear, Eustachian tube, cochlea, cochlear nucleus, superior olive, and cortex]

How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
[Figure: acoustic waveform, acoustic signal, speech recognition]

Outline
Introduction
Speech recognition based on HMM
• Acoustic processing
• Acoustic modeling: Hidden Markov Model
• Language modeling
• Statistical approach

Acoustic processing
A wave for the words “speech lab” looks like:
[Figure: waveform segmented as s p ee ch l a b, with a close-up of the “l” to “a” transition]
Graphs from Simon Arnfield’s web tutorial on speech, Sheffield: http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/

Acoustic sampling
• 10 ms frame (ms = millisecond = 1/1000 second)
• ~25 ms window around each frame to smooth signal processing
• Result: acoustic feature vectors
[Figure: overlapping 25 ms windows advancing in 10 ms steps over the waveform, yielding feature vectors a1, a2, a3, ...; the slide also lists raw sample values of the waveform]

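The framing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `frame_signal` and `hamming` are hypothetical, the Hamming taper is one common choice of smoothing window (the slide does not name one), and each window here starts at the frame boundary rather than being centered on it.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=10.0, window_ms=25.0):
    """Cut a signal into overlapping windows.

    A new window starts every `frame_ms` and is `window_ms` long,
    so consecutive windows overlap by 15 ms with the slide's values.
    """
    hop = int(round(sample_rate * frame_ms / 1000.0))      # samples per 10 ms step
    width = int(round(sample_rate * window_ms / 1000.0))   # samples per 25 ms window
    frames = []
    for start in range(0, len(samples) - width + 1, hop):
        frames.append(samples[start:start + width])
    return frames


def hamming(width):
    """Hamming taper (an assumed choice) that fades each window toward its edges."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (width - 1))
            for n in range(width)]


# Synthetic input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate)
          for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
taper = hamming(len(frames[0]))
windowed = [[s * w for s, w in zip(frame, taper)] for frame in frames]

print(len(frames), "windows of", len(frames[0]), "samples each")
```

Each windowed frame would then be passed to a feature extractor to produce one acoustic feature vector per 10 ms step.
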
Spectral analysis
Frequency gives pitch; amplitude gives volume
• sampling at ~8 kHz (phone) or ~16 kHz (microphone) (kHz = 1000 cycles/sec)
Fourier transform of wave yields a spectrogram
• darkness indicates energy at each frequency
• hundreds to thousands of frequency samples
[Figure: waveform and spectrogram of “speech lab”, segmented as s p ee ch l a b]

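As a rough illustration of how a Fourier transform turns framed audio into a spectrogram, the sketch below computes magnitude spectra with a naive discrete Fourier transform from the Python standard library. The names `magnitude_spectrum` and `spectrogram` are hypothetical, no smoothing window is applied, and a real system would use an FFT routine instead of this O(N²) loop.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT: magnitude at each frequency bin from 0 up to half the sample rate."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


def spectrogram(samples, sample_rate, frame_ms=10.0, window_ms=25.0):
    """One magnitude spectrum per 10 ms step, each taken over a 25 ms window."""
    hop = int(round(sample_rate * frame_ms / 1000.0))
    width = int(round(sample_rate * window_ms / 1000.0))
    rows = []
    for start in range(0, len(samples) - width + 1, hop):
        rows.append(magnitude_spectrum(samples[start:start + width]))
    return rows


# Synthetic input: 0.1 s of a 1000 Hz tone at the telephone rate of 8 kHz.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate)
          for n in range(800)]

spec = spectrogram(signal, sample_rate)

# Each bin spans sample_rate / window_length Hz (40 Hz here).
width = int(round(sample_rate * 25.0 / 1000.0))
bin_hz = sample_rate / width
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * bin_hz
print("strongest frequency in first frame:", peak_hz, "Hz")
```

In a plotted spectrogram, each row of `spec` becomes one vertical slice, and larger magnitudes are drawn darker.
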