Making Full Use of Chinese Speech Corpora

Thomas Fang Zheng
Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Tsinghua University
http://sp.cs.tsinghua.edu.cn/
Beijing d-Ear Technologies Co., Ltd.
http://www.d-Ear.com

Oct. 2, 2003
O-COCOSDA, Oct. 1-3, 2003, Sentosa, Singapore
Your Partner in the Century of Speech

Outline
❑ Purpose of speech corpora
❑ Factors to be considered in data creation
❑ Data creation
❑ Data transcription
❑ Learning from corpora
❑ Chinese Corpus Consortium (CCC)
Purpose of Speech Corpora

Item | Description | Percentage
1. Speech/speaker recognition | system development, evaluation, sentence comprehension and summarization, speech recognition, speaker recognition | 73%
2. Speech synthesis | system development, prosodic analysis | 11%
3. Acoustic analysis | acoustic analysis, speech coding | 9%
4. Sentence analysis | syntactic and semantic analysis | 5%
5. Speech/language education | speech and language education | 2%
Factors to be considered in data creation (1)
❑ The language
  ❖ Language: e.g., Chinese or English
  ❖ Dialectal background (e.g., for Chinese):
    ▪ Putonghua or standard Chinese (普通话);
    ▪ Mandarin (官话, Northern China);
    ▪ Wu (吴语, Southern Jiangsu, Zhejiang, and Shanghai);
    ▪ Yue (粤语, Guangdong; Hong Kong; Nanning, Guangxi);
    ▪ Min (闽南话, Fujian; Shantou, Guangdong; Haikou, Hainan; Taipei, Taiwan);
    ▪ Hakka (客家话, Meixian, Guangdong; Hsin-Chu, Taiwan);
    ▪ Xiang (湘, Hunan);
    ▪ Gan (赣, Jiangxi);
    ▪ Hui (徽, Anhui); and
    ▪ Jin (晋, Shanxi).
  ❖ Specific to Chinese:
    ▪ Simplified Chinese
    ▪ Traditional Chinese
[Figure: Map of the distribution of Mandarin (官话) dialects (官话方言分布图)]
[Figure: Map of modern Wu dialect subdivisions (现代吴语方言分区图). Labeled areas: Taihu (太湖片), Taizhou (台州片), Oujiang (瓯江片), Chuqu (处衢片), and Xuanzhou (宣州片); Taihu sub-areas Su-Hu-Jia (苏沪嘉小片), Hangzhou (杭州小片), and Lin-Shao (临绍小片); neighboring Jianghuai Mandarin (江淮官话) and Hui (徽语).]
Factors to be considered in data creation (2)
❑ Speaking style:
  ❖ Read: for ASR in earlier research, or for TTS
  ❖ Spontaneous/conversational: for ASR nowadays
❑ Recording channel:
  ❖ Depends on the goal of the task or application, or on the application environment:
    ▪ Close-talk microphones: for personal computers (PCs)
    ▪ Telephone and/or cellular phone: for telephony applications
    ▪ Specific channel: for embedded applications (PDA, digital recorder, ...), or for broadcast news and TV news
  ❖ Normally a mono channel rather than a stereo channel
  ❖ However, a microphone array may be used for some research purposes
Factors to be considered in data creation (3)
❑ Sampling rate:
  ❖ 8 kHz: for the telephone/mobile-phone channel, where the bandwidth is about 3.4 kHz
  ❖ 16 kHz: for the close-talk microphone (PC) channel, although the microphone's bandwidth is higher than 8 kHz
❑ Sampling precision:
  ❖ 16 bits, normally
  ❖ 8-bit A-law or μ-law (13-bit linear after A-law expansion, 14-bit after μ-law expansion)
❑ Signal-to-noise ratio (SNR) level:
  ❖ Data was, and still is, often collected in a good environment (clean speech database)
  ❖ For noise-related research, noisy data is obtained by either:
    ▪ mixing noises (NOISEX-92) with clean speech; or
    ▪ collecting in real-world noisy environments
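The first way of obtaining noisy data, mixing recorded noise into clean speech, amounts to scaling the noise so that the mixture reaches a chosen SNR. The sketch below is one minimal way to do this, not a method taken from the slides. The function names (`power`, `mix_at_snr`, `measured_snr_db`) are hypothetical, and a synthetic sine wave and Gaussian noise stand in for real clean speech and NOISEX-92 recordings.

```python
import math
import random

def power(samples):
    """Mean power of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(clean, noise, snr_db):
    """Add scaled noise to clean speech so the mixture has the requested SNR (dB)."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = power(clean)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")
    # SNR_dB = 10 * log10(P_clean / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

def measured_snr_db(clean, mixture):
    """SNR of a mixture, given the clean signal it was built from."""
    residual = [m - c for c, m in zip(clean, mixture)]
    return 10 * math.log10(power(clean) / power(residual))

# Synthetic stand-ins (assumptions): 1 s of a 440 Hz tone at 16 kHz, Gaussian noise.
SAMPLE_RATE = 16000
random.seed(0)
clean = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
noise = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]

mixed = mix_at_snr(clean, noise, snr_db=10.0)
print(round(measured_snr_db(clean, mixed), 3))
```

Scaling the noise rather than the speech keeps the speech level identical across SNR conditions, which makes the conditions easier to compare. Data collected in real noisy environments cannot be reproduced this way, because speakers change how they talk in noise.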
Factors to be considered in data creation (4)
❑ Number of speakers and speaker balance:
  ❖ The more, the better, with good speaker diversity according to:
    ▪ Gender;
    ▪ Age;
    ▪ Education;
    ▪ Birthplace (or dialectal background);
    ▪ Occupation;
    ▪ and so on.
❑ Corpus size:
  ❖ Measured by the number of speakers, the length of valid speech in hours, or both
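Both measures of corpus size, and a rough view of speaker balance, can be computed from per-speaker metadata. The following is a small sketch under assumed names: the `Speaker` record, its fields, and the sample entries are hypothetical and do not describe any corpus mentioned in this talk.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Speaker:
    speaker_id: str
    gender: str
    age: int
    birthplace: str
    speech_seconds: float  # length of valid speech recorded for this speaker

def corpus_size(speakers):
    """Return (number of speakers, hours of valid speech)."""
    hours = sum(s.speech_seconds for s in speakers) / 3600
    return len(speakers), hours

def balance(speakers, attribute):
    """Fraction of speakers for each value of one attribute, e.g. 'gender'."""
    counts = Counter(getattr(s, attribute) for s in speakers)
    return {value: n / len(speakers) for value, n in counts.items()}

# Hypothetical metadata for illustration only.
speakers = [
    Speaker("s001", "F", 24, "Beijing", 1800.0),
    Speaker("s002", "M", 31, "Shanghai", 1800.0),
    Speaker("s003", "F", 45, "Guangzhou", 1800.0),
    Speaker("s004", "M", 19, "Beijing", 1800.0),
]

n_speakers, hours = corpus_size(speakers)
print(n_speakers, hours)
print(balance(speakers, "gender"))
print(balance(speakers, "birthplace"))
```

Reporting both numbers together is useful because a corpus with many speakers and little speech per speaker behaves differently from one with few speakers and many hours each.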