
"Artificial Intelligence: A Modern Approach" teaching resources (lecture slides, English): chapter15b-6pp



Speech recognition (briefly)
Chapter 15, Section 6

Outline
♦ Speech as probabilistic inference
♦ Speech sounds
♦ Word pronunciation
♦ Word sequences

Speech as probabilistic inference
"It's not easy to wreck a nice beach"
Speech signals are noisy, variable, ambiguous
What is the most likely word sequence, given the speech signal?
I.e., choose Words to maximize P(Words|signal)
Use Bayes' rule: P(Words|signal) = αP(signal|Words)P(Words)
I.e., decomposes into acoustic model + language model
Words are the hidden state sequence, signal is the observation sequence

Phones
All human speech is composed from 40–50 phones, determined by the configuration of articulators (lips, teeth, tongue, vocal cords, air flow)
Form an intermediate level of hidden states between words and signal
⇒ acoustic model = pronunciation model + phone model
ARPAbet designed for American English:
  [iy] beat     [b] bet     [p] pet
  [ih] bit      [ch] Chet   [r] rat
  [ey] bait     [d] debt    [s] set
  [ao] bought   [hh] hat    [th] thick
  [ow] boat     [hv] high   [dh] that
  [er] Bert     [l] let     [w] wet
  [ix] roses    [ng] sing   [en] button
  ...           ...         ...
E.g., "ceiling" is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

Speech sounds
Raw signal is the microphone displacement as a function of time; processed into overlapping 30ms frames, each described by features
[Figure: analog acoustic signal → sampled, quantized digital signal → frames with feature vectors]
Frame features are typically formants (peaks in the power spectrum)

Phone models
Frame features in P(features|phone) summarized by
– an integer in [0 ... 255] (using vector quantization); or
– the parameters of a mixture of Gaussians
Three-state phones: each phone has three phases (Onset, Mid, End)
E.g., [t] has silent Onset, explosive Mid, hissing End
⇒ P(features|phone, phase)
Triphone context: each phone becomes n² distinct phones, depending on the phones to its left and right
E.g., [t] in "star" is written [t(s,aa)] (different from "tar"!)
Triphones are useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions
E.g., [t] in "eighth" has tongue against front teeth
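To make the Bayes-rule decomposition above concrete, here is a minimal Python sketch of choosing among candidate transcriptions; the two hypotheses and all probability values are invented for illustration and are not from the slides.

```python
# Choose Words to maximize P(Words|signal) = alpha * P(signal|Words) * P(Words).
# The acoustic scores P(signal|Words) and language-model priors P(Words) below
# are made-up toy numbers, purely for illustration.
acoustic = {                      # P(signal | Words): fit to the audio
    "recognize speech":   1e-5,
    "wreck a nice beach": 2e-5,   # acoustically a slightly better fit
}
language = {                      # P(Words): prior from a language model
    "recognize speech":   1e-6,
    "wreck a nice beach": 1e-8,   # far less likely as English text
}

def posterior(words):
    # alpha (the normalizing constant) is the same for every hypothesis,
    # so it can be dropped when we only need the argmax
    return acoustic[words] * language[words]

best = max(acoustic, key=posterior)
print(best)   # -> "recognize speech": the language model overrules the acoustics
```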
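As a sketch of the second option for summarizing P(features|phone, phase), the mixture-of-Gaussians density below uses made-up weights, means, and diagonal variances; the feature dimensionality and number of components are also assumptions, not values from the slides.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    return np.prod(np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

def mixture_likelihood(x, weights, means, variances):
    """P(features | phone, phase) as a weighted sum of Gaussian components."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy model for one (phone, phase) pair: 2 components over 3-dimensional features.
weights   = [0.6, 0.4]
means     = [np.array([1.0, 0.5, -0.2]), np.array([0.2, 1.5, 0.3])]
variances = [np.array([0.5, 0.5, 0.5]),  np.array([1.0, 1.0, 1.0])]

frame = np.array([0.8, 0.7, 0.0])   # feature vector for one 30ms frame
print(mixture_likelihood(frame, weights, means, variances))
```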


Phone model example
Phone HMM for [m]:
[Figure: three-state HMM with states Onset, Mid, End and an exit to FINAL; self-loop / advance probabilities 0.3 / 0.7 (Onset), 0.9 / 0.1 (Mid), 0.4 / 0.6 (End → FINAL)]
Output probabilities for the phone HMM:
  Onset: C1: 0.5  C2: 0.2  C3: 0.3
  Mid:   C3: 0.2  C4: 0.7  C5: 0.1
  End:   C4: 0.1  C6: 0.5  C7: 0.4

Word pronunciation models
Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model
[Figure: pronunciation HMM for "tomato": [t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow]; all other transitions 1.0]
P([t ow m ey t ow]|"tomato") = P([t ow m aa t ow]|"tomato") = 0.1
P([t ah m ey t ow]|"tomato") = P([t ah m aa t ow]|"tomato") = 0.4
Structure is created manually, transition probabilities learned from data

Isolated words
Phone models + word models fix the likelihood P(e1:t|word) for an isolated word
P(word|e1:t) = αP(e1:t|word)P(word)
Prior probability P(word) obtained simply by counting word frequencies
P(e1:t|word) can be computed recursively: define ℓ1:t(xt) = P(Xt = xt, e1:t), use the recursive update ℓ1:t+1 = Forward(ℓ1:t, et+1), and then P(e1:t|word) = Σxt ℓ1:t(xt)
Isolated-word dictation systems with training reach 95–99% accuracy

Continuous speech
Not just a sequence of isolated-word recognition problems!
– Adjacent words are highly correlated
– Sequence of most likely words ≠ most likely sequence of words
– Segmentation: there are few gaps in speech
– Cross-word coarticulation, e.g., "next thing"
Continuous speech systems manage 60–80% accuracy on a good day

Language model
Prior probability of a word sequence is given by the chain rule:
  P(w1 ... wn) = ∏_{i=1..n} P(wi | w1 ... wi−1)
Bigram model: P(wi | w1 ... wi−1) ≈ P(wi | wi−1)
Train by counting all word pairs in a large text corpus
More sophisticated models (trigrams, grammars, etc.) help a little bit

Combined HMM
States of the combined language+word+phone model are labelled by the word we're in + the phone in that word + the phone state in that phone
Viterbi algorithm finds the most likely phone state sequence
Does segmentation by considering all possible word sequences and boundaries
Doesn't always give the most likely word sequence, because each word sequence is the sum over many state sequences
Jelinek invented A* in 1969 as a way to find the most likely word sequence, where the "step cost" is −log P(wi|wi−1)
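A small check of the "tomato" pronunciation model: the probability of each phone sequence is simply the product of the branch probabilities taken through the HMM. The two branch distributions below are read off the figure (0.2/0.8 for [ow]/[ah], 0.5/0.5 for [ey]/[aa]); the rest of the sketch is glue code.

```python
from itertools import product

# Branch probabilities from the "tomato" pronunciation HMM;
# all unbranched transitions have probability 1.0.
first_vowel  = {"ow": 0.2, "ah": 0.8}
middle_vowel = {"ey": 0.5, "aa": 0.5}

for (v1, p1), (v2, p2) in product(first_vowel.items(), middle_vowel.items()):
    phones = " ".join(["t", v1, "m", v2, "t", "ow"])
    print(f"P([{phones}] | tomato) = {p1 * p2:.1f}")
# -> 0.1 for the two [t ow ...] variants, 0.4 for the two [t ah ...] variants,
#    matching the probabilities stated on the slide.
```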
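The recursive Forward update from the "Isolated words" slide can be written directly. The sketch below reuses the output probabilities of the [m] phone HMM; the initial distribution, the exact transition matrix (with the FINAL exit simplified away), and the observation sequence are assumptions for illustration.

```python
import numpy as np

# Three-state phone HMM; End is treated as absorbing in this sketch
# (an assumption -- the slide's model exits to a FINAL state instead).
states = ["Onset", "Mid", "End"]
T = np.array([[0.3, 0.7, 0.0],     # transition probabilities (assumed here)
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])

# Output probabilities P(feature label | phase) from the phone model example;
# the labels C1..C7 are vector-quantization codes.
O = {"Onset": {"C1": 0.5, "C2": 0.2, "C3": 0.3},
     "Mid":   {"C3": 0.2, "C4": 0.7, "C5": 0.1},
     "End":   {"C4": 0.1, "C6": 0.5, "C7": 0.4}}

def forward(evidence):
    """Return l1:t(xt) = P(Xt, e1:t) after folding in all of evidence."""
    obs = lambda e: np.array([O[s].get(e, 0.0) for s in states])
    ell = np.array([1.0, 0.0, 0.0]) * obs(evidence[0])   # start in Onset
    for e in evidence[1:]:
        ell = obs(e) * (T.T @ ell)     # l1:t+1 = Forward(l1:t, e_t+1)
    return ell

ell = forward(["C1", "C3", "C4", "C4", "C6"])
print(ell.sum())    # P(e1:t | word) = sum over xt of l1:t(xt)
```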
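A minimal sketch of the bigram language model: P(wi|wi−1) is estimated by counting adjacent word pairs. The two-sentence corpus is made up; a real system would use a large text corpus plus smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by counting adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words[:-1])                  # counts of the left word
        bigrams.update(zip(words[:-1], words[1:]))   # counts of adjacent pairs
    return lambda w, prev: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

# Tiny illustrative corpus (not from the slides).
corpus = ["recognize speech with a microphone",
          "wreck a nice beach with a shovel"]
p = train_bigram(corpus)
print(p("speech", "recognize"))   # P(speech | recognize) = 1.0 in this toy corpus
print(p("nice", "a"))             # P(nice | a) = 1/3
```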
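Finally, a generic Viterbi sketch of the kind the combined HMM uses to find the most likely state sequence; the two-state transition and emission tables are invented purely to exercise the code.

```python
import numpy as np

def viterbi(prior, T, emit, evidence):
    """Most likely hidden-state sequence for a generic HMM (illustrative sketch)."""
    best = prior * emit[:, evidence[0]]      # best path probability ending in each state
    back = []                                # backpointers for path recovery
    for e in evidence[1:]:
        scores = best[:, None] * T           # extend every best path by one transition
        back.append(scores.argmax(axis=0))
        best = scores.max(axis=0) * emit[:, e]
    path = [int(best.argmax())]              # follow backpointers from the best final state
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

# Toy 2-state model with 2 observation symbols (all numbers invented).
prior = np.array([0.6, 0.4])
T     = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit  = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
print(viterbi(prior, T, emit, [0, 0, 1, 1]))   # -> [0, 0, 1, 1]
```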


DBNs for speech recognition
[Figure: dynamic Bayesian network for speech recognition with variables for the phoneme index, phoneme, phoneme transition, articulators (tongue, lips), the acoustic observation, and an end-of-word observation that is certain exactly when the phoneme index reaches its final value (P(OBS | index = 2) = 1, P(OBS | index ≠ 2) = 0); the index and transition dynamics are deterministic and fixed, the articulator and observation models are stochastic and learned]
Also easy to add variables for, e.g., gender, accent, speed
Zweig and Russell (1998) show up to 40% error reduction over HMMs

Summary
Since the mid-1970s, speech recognition has been formulated as probabilistic inference
Evidence = speech signal, hidden variables = word and phone sequences
"Context" effects (coarticulation etc.) are handled by augmenting state
Variability in human speech (speed, timbre, etc., etc.) and background noise make continuous speech recognition in real settings an open problem

