University of Macau (Universidade de Macau)
Domain Adaptation for Statistical Machine Translation
Master Defense
By Longyue WANG, Vincent
MT Group, NLP2CT Lab, FST, UM
Supervised by Prof. Lidia S. Chao, Prof. Derek F. Wong
20/08/2014
Research Scope (2/84)
Figure 1: Our Research Scope [1][2]. Diagram labels: Computational Linguistics; Machine Translation; Domain Adaptation; Speech Translation; Text Translation; Rule-based MT; Hybrid MT; Domain-Specific Statistical MT.
[1] Daniel Jurafsky and James Martin (2008). An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition. Prentice Hall.
[2] Wikipedia, http://en.wikipedia.org/wiki/Machine_translation.
Agenda (3/84)
◼ Introduction
◼ Proposed Method I: New Criterion
◼ Proposed Method II: Combination
◼ Proposed Method III: Linguistics
◼ Domain-Specific Online Translator
◼ Conclusion
Part I: Introduction (4/84)
The First Question (5/84)
WHAT IS STATISTICAL MACHINE TRANSLATION?
Statistical Machine Translation (6/84)
Figure 2: Phrase-based SMT Framework
◼ SMT generates translations from statistical models whose parameters are estimated from text corpora [3].
◼ Phrase-based SMT is currently the most successful SMT approach; its basic translation unit is a phrase, i.e. a sequence of consecutive words (an n-gram).
[3] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics. 19:263–311.
Statistical Machine Translation (7/84)
Figure 2: Phrase-based SMT Framework (components shown: Parallel Corpus, Monolingual Corpus; the slide also shows an English-Chinese parallel text excerpt)
◼ A corpus is a collection of texts, e.g. the IWSLT2012 official corpus.
◼ A bilingual (parallel) corpus is a collection of texts paired with their translations into another language.
◼ A monolingual corpus contains text in a single language, usually the target language.
◼ Corpora may come from different genres, topics, etc.
Statistical Machine Translation (8/84)
Figure 2: Phrase-based SMT Framework (components shown: Word Alignment, Translation Table, Reordering Model)
◼ Word alignments are learned from the parallel corpus with the EM algorithm.
◼ Phrase pairs are then extracted from the word alignments to build the translation table.
◼ The distance-based reordering model penalizes changes in the position of translated phrases.
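The slide does not say which alignment model is trained with EM. As a minimal sketch, the block below assumes IBM Model 1 (without a NULL word) and a three-sentence toy German-English corpus; the function names `ibm_model1` and `align` and the toy data are illustrative, not taken from the thesis.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(f|e) with EM. Assumption: IBM Model 1 without a NULL word."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (f, e) link counts
        total = defaultdict(float)  # expected counts per target word e
        # E-step: distribute each source word over the target words of its sentence.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def align(fs, es, t):
    """Link each source position to its most probable target position."""
    return [(j, max(range(len(es)), key=lambda i: t[(fs[j], es[i])]))
            for j in range(len(fs))]

# Hypothetical toy parallel corpus (source words, target words).
pairs = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = ibm_model1(pairs)
alignment = align(["das", "Haus"], ["the", "house"], t)
```

Phrase pairs consistent with such alignments would then be collected into the translation table; that extraction step is not shown here.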
Statistical Machine Translation (9/84)
Figure 2: Phrase-based SMT Framework (component shown: Language Model)
◼ An n-gram language model assigns a probability to a sequence of words [4]. For a sentence s = w_1 ... w_l, each word is conditioned on the preceding n-1 words:
\[ p_{LM}(s) = \prod_{i=1}^{l} p(w_i \mid w_{i-n+1}^{i-1}) \qquad (1) \]
[4] F. Song and W. B. Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280.
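Equation (1) can be illustrated with a small n-gram model estimated by relative frequency. This is only a sketch: it uses no smoothing, pads sentences with hypothetical `<s>` and `</s>` boundary markers (so the product also includes an end-of-sentence term), and the names `train_ngram_lm` and `sentence_prob` are illustrative.

```python
from collections import Counter

def train_ngram_lm(sentences, n=2):
    """Return p(word | history) estimated by relative frequency (no smoothing)."""
    ngrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngrams[history + (padded[i],)] += 1
            contexts[history] += 1

    def prob(word, history):
        history = tuple(history)
        if contexts[history] == 0:
            return 0.0  # unseen history: an unsmoothed model gives no mass
        return ngrams[history + (word,)] / contexts[history]

    return prob

def sentence_prob(words, prob, n=2):
    """Multiply the conditional probabilities of Equation (1) over the sentence."""
    padded = ["<s>"] * (n - 1) + words + ["</s>"]
    p = 1.0
    for i in range(n - 1, len(padded)):
        p *= prob(padded[i], padded[i - n + 1:i])
    return p

# Hypothetical two-sentence monolingual corpus.
corpus = [["the", "house"], ["the", "book"]]
prob = train_ngram_lm(corpus, n=2)
p_seen = sentence_prob(["the", "house"], prob)    # p(the|<s>) * p(house|the) * p(</s>|house)
p_unseen = sentence_prob(["house", "the"], prob)  # contains an unseen bigram
```

An unsmoothed model assigns zero probability to any sentence containing an unseen n-gram, which is why practical systems add smoothing.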
Statistical Machine Translation (10/84)
Figure 2: Phrase-based SMT Framework (components shown: Source Text, Decoding, Searching Translation Candidates, Target Text)
◼ The decoding function combines three components: the phrase translation table, which ensures that foreign phrases are matched to target phrases; the reordering model, which reorders the phrases appropriately; and the language model, which ensures that the output is fluent.
\[ e_{best} = \arg\max_{e} \prod_{i=1}^{I} \phi(f_i \mid e_i) \, d(start_i - end_{i-1} - 1) \prod_{i=1}^{|e|} P_{LM}(e_i \mid e_1 \ldots e_{i-1}) \qquad (2) \]
Here \phi is the phrase translation probability, d is the distance-based reordering penalty, and P_{LM} is the language model probability.
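To make Equation (2) concrete, the sketch below scores a single translation candidate in log space. Several things here are assumptions rather than details from the slides: the reordering penalty is taken to be d(x) = alpha^|x|, the phrase table and the constant-probability language model are toy stand-ins, and the search over candidates is not implemented.

```python
import math

def reordering_cost(start, prev_end, alpha=0.5):
    """Assumed distance-based penalty d(x) = alpha ** |x|, x = start - prev_end - 1."""
    return alpha ** abs(start - prev_end - 1)

def score_candidate(phrase_pairs, phrase_table, lm_prob, alpha=0.5):
    """Log of the Equation (2) product for one candidate.

    phrase_pairs: (src_start, src_end, src_phrase, tgt_phrase) tuples in
    target order, with 1-based source positions.
    """
    log_score = 0.0
    prev_end = 0   # so a first phrase starting at position 1 has distance 0
    history = []   # target words produced so far
    for start, end, src, tgt in phrase_pairs:
        log_score += math.log(phrase_table[(src, tgt)])
        log_score += math.log(reordering_cost(start, prev_end, alpha))
        for word in tgt.split():
            log_score += math.log(lm_prob(word, history))
            history.append(word)
        prev_end = end
    return log_score

# Hypothetical toy models for the source sentence "das Haus".
phrase_table = {("das", "the"): 0.8, ("Haus", "house"): 0.7}

def uniform_lm(word, history):
    return 0.1  # stand-in language model: every word equally likely

monotone = [(1, 1, "das", "the"), (2, 2, "Haus", "house")]
swapped = [(2, 2, "Haus", "house"), (1, 1, "das", "the")]
score_monotone = score_candidate(monotone, phrase_table, uniform_lm)
score_swapped = score_candidate(swapped, phrase_table, uniform_lm)
```

With a uniform language model the two candidates differ only in the reordering term, so the monotone order scores higher. A real decoder would search over many segmentations and orderings and return the best-scoring one.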