
--Mengxue Zhang, Qingyang Li


1 Chris Callison-Burch
--Mengxue Zhang, Qingyang Li

2 Chris Callison-Burch --- Timeline
Studied in the Symbolic Systems Program (SSP) at Stanford University, which focuses on computers and minds: artificial and natural systems that use symbols to communicate and to represent information
Associate Professor in the Computer and Information Science Department at the University of Pennsylvania
General Chair for ACL 2017
Secretary-Treasurer for SIGDAT (the group that organizes EMNLP)
Program Co-Chair for EMNLP 2015
Sloan Research Fellow
Received tenure in June 2017

3 Chris Callison-Burch --- Development
PPDB --- the paraphrase database (a resource with 169 million paraphrases)
Joshua --- an open-source decoder for statistical machine translation (uses synchronous context-free grammars and extracts linguistically informed translation rules)
Moses --- an open-source toolkit for statistical machine translation
Synchronous context-free grammars: rules in these grammars apply to two languages at the same time, capturing grammatical structures that are each other's translations.
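The idea of a synchronous rule applying to two languages at once can be sketched in a few lines. This is a toy illustration, not Joshua's actual grammar format; the rule and phrases are invented.

```python
# Toy sketch of a synchronous context-free grammar (SCFG) rule.
# A shared non-terminal index ("X1") links the source and target sides,
# so reordering (e.g. German verb-final order) is captured by one rule.

rules = {
    # LHS -> (source-side RHS, target-side RHS)
    # German "X1 gesehen" <-> English "seen X1" (note the reordering)
    "X": (["X1", "gesehen"], ["seen", "X1"]),
}

def apply_rule(lhs, src_fill, tgt_fill):
    """Substitute the linked non-terminal X1 on both sides at once."""
    src_rhs, tgt_rhs = rules[lhs]
    src = [src_fill if sym == "X1" else sym for sym in src_rhs]
    tgt = [tgt_fill if sym == "X1" else sym for sym in tgt_rhs]
    return " ".join(src), " ".join(tgt)

src, tgt = apply_rule("X", "den Mann", "the man")
print(src)  # den Mann gesehen
print(tgt)  # seen the man
```

Because both sides are rewritten in lockstep, the grammar derives a source sentence and its translation simultaneously, which is what lets Joshua model structural divergence between languages.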

4 Research Interests: Natural Language Understanding via Paraphrasing
Developed a method that extracts paraphrases from bilingual parallel corpora.
Paraphrasing with Bilingual Parallel Corpora, ACL 2005
Paraphrasing and Translation, PhD thesis
Extended his bilingual pivoting methodology to syntactic representations of translation rules.
Semantically-Informed Syntactic Machine Translation, AMTA 2010
Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation, EMNLP 2011
Then used the bilingual pivoting technique to create the paraphrase database.
PPDB: The Paraphrase Database, NAACL 2013
Made several advances to PPDB:
Semantics: add an interpretable semantics to PPDB. Adding Semantics to Data-Driven Paraphrasing, ACL 2015
Domain adaptation: language is used differently in different domains; an algorithm automatically adapts paraphrases to suit a particular domain. Domain-Specific Paraphrase Extraction, ACL 2015
Natural language generation: applied to text simplification. Problems in Current Text Simplification Research, TACL 2015; Optimizing Statistical Machine Translation for Text Simplification, 2016
He developed a method that extracts paraphrases from bilingual parallel corpora by identifying equivalent English expressions that share a foreign phrase; sharing a translation is evidence that their meanings are similar. The Joshua decoder is useful for translating between languages with different word orders: instead of pivoting over foreign phrases, it pivots over foreign synchronous context-free grammar rules.
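The bilingual pivoting method described above can be sketched as a small probability computation. The phrase-table entries and probabilities below are invented for illustration; only the pivoting formula reflects the actual technique.

```python
# Toy sketch of bilingual pivoting: two English phrases are paraphrase
# candidates if they align to the same foreign phrase, and
#   p(e2 | e1) = sum over foreign pivots f of  p(e2 | f) * p(f | e1).
# All entries below are made-up examples, not real phrase-table data.

from collections import defaultdict

# p(f | e): foreign translations of each English phrase
e2f = {
    "under control": {"unter kontrolle": 1.0},
    "in check":      {"unter kontrolle": 0.8, "in schach": 0.2},
}
# p(e | f): English translations of each foreign phrase
f2e = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "in schach":       {"in check": 1.0},
}

def paraphrase_probs(e1):
    """Marginalize over shared foreign pivot phrases."""
    probs = defaultdict(float)
    for f, p_f in e2f[e1].items():
        for e2, p_e2 in f2e[f].items():
            if e2 != e1:          # don't paraphrase a phrase as itself
                probs[e2] += p_e2 * p_f
    return dict(probs)

print(paraphrase_probs("under control"))  # {'in check': 0.3}
```

Scaling this marginalization up to very large parallel corpora is essentially how PPDB's millions of paraphrase pairs were scored.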

5 Research Interests: Statistical Machine Translation
Goal: build statistical machine translation systems without parallel corpora.
He used a bilingual lexicon induction method to estimate the parameters of phrase-based statistical machine translation systems.
A Comprehensive Analysis of Bilingual Lexicon Induction, Computational Linguistics 2016
Joshua: An Open Source Toolkit for Parsing-Based Machine Translation, WMT 2009
Combined a diverse set of monolingually-derived signals of translation equivalence.
Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals, NAACL 2013
His goal is to go beyond simply expanding bilingual dictionaries, and to use bilingual lexicon induction techniques to produce full translation systems.
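Combining monolingually-derived signals can be sketched as a weighted score over candidate word pairs. This is an illustrative sketch only: the two signals shown (context similarity and orthographic similarity) are in the spirit of the NAACL 2013 work, but the feature set, weights, and data here are invented.

```python
# Rough sketch of bilingual lexicon induction from monolingual signals:
# score each (source word, target word) pair by combining cues that can
# be computed without a parallel corpus. Weights/data are illustrative.

import difflib

def context_similarity(src_vec, tgt_vec):
    """Cosine similarity of (already projected) context count vectors."""
    keys = set(src_vec) | set(tgt_vec)
    dot = sum(src_vec.get(k, 0) * tgt_vec.get(k, 0) for k in keys)
    n1 = sum(v * v for v in src_vec.values()) ** 0.5
    n2 = sum(v * v for v in tgt_vec.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def orthographic_similarity(w1, w2):
    """String-overlap signal; useful for detecting cognates."""
    return difflib.SequenceMatcher(None, w1, w2).ratio()

def translation_score(src_word, tgt_word, src_vec, tgt_vec,
                      w_ctx=0.7, w_orth=0.3):
    # In the supervised setting, these weights would be learned
    # from a seed bilingual dictionary rather than fixed by hand.
    return (w_ctx * context_similarity(src_vec, tgt_vec)
            + w_orth * orthographic_similarity(src_word, tgt_word))

score = translation_score("nacion", "nation",
                          {"people": 3, "state": 2},
                          {"people": 2, "state": 2})
```

Ranking all target words by this score for each source word yields an induced translation lexicon, which can then seed a phrase-based system's parameters.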

6 Research Interests: Crowdsourcing
Showed that the quality of Urdu-English translations produced by non-professional translators can be made to approach the quality of professional translation at a fraction of the cost.
Crowdsourcing Translation: Professional Quality from Non-Professionals, ACL 2011
Used crowdsourcing to create a wide range of new NLP datasets.
The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content, ACL 2011
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing, WMT 2012
Translations of the CALLHOME Egyptian Arabic Corpus for Conversational Speech Translation, IWSLT 2014
etc.
Beyond NLP, designed tools to help crowd workers find better, higher-paying work.
Crowd-Workers: Aggregating Information Across Turkers to Help Them Find Higher Paying Work, HCOMP 2014
His third research focus is crowdsourcing. Using crowdsourcing to create annotated data for NLP applications is a relatively new topic.

7 Important Work: Moses
Moses: Open Source Toolkit for Statistical Machine Translation, ACL 2007. Citations: 3905.
Motivation: phrase-based statistical machine translation had become dominant, but the field lacked openness.
An implementation of the data-driven approach to machine translation.
Automatically trains translation models for any language pair.
Supports multiple translation types:
Phrase-based machine translation
Syntax-based translation
Factored machine translation
Supports multiple language models.
Chris's most influential work is the paper Moses: Open Source Toolkit for Statistical Machine Translation, which was accepted to ACL in 2007. The paper now has 3905 citations, which is quite high. The motivation is that phrase-based statistical machine translation had become dominant in machine translation research, but most work in the field was in-house research; there was a lack of openness. They therefore implemented this open-source toolkit for statistical machine translation. The toolkit can automatically train translation models for any language pair; all you need is a parallel corpus. It supports multiple translation types, including phrase-based machine translation, syntax-based translation, and factored translation, and it also supports multiple language models.

8 Important Work: Moses
Consists of all the components needed to preprocess data and train the language models and translation models.
Contains tools for tuning these models.
Uses standard external tools for some tasks:
GIZA++ (Och and Ney, 2003) for word alignments
SRILM for language modeling
Two main components:
Training pipeline: turns raw data into a machine translation model; mainly Perl, some C++.
Decoder: translates the source sentence into the target language; a single C++ program.
Moses consists of all the components needed to preprocess data and train the language models and translation models, and it also contains tools for tuning these models. It uses standard external tools for some tasks, for example GIZA++ for word alignments and SRILM for language modeling. The two main components of Moses are the training pipeline and the decoder: the training pipeline is a collection of tools that take raw data and turn it into a machine translation model, and the decoder translates the source sentence into the target language.

9 Important Work: Moses --- Novel Contributions
Support for linguistically motivated factors, such as POS tags or lemmas.
Besides being an open-source toolkit, the paper makes its own novel contributions. The first is support for linguistically motivated factors, such as part-of-speech tags, which turn out to be informative in translation. The left picture shows a phrase table with no factors, and the right picture shows an augmented phrase table that contains POS-tag factors.
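The contrast between a plain and a factored phrase table can be sketched with toy dictionaries. The entries, probabilities, and helper function below are invented for illustration and do not reflect Moses's actual file formats.

```python
# Illustrative sketch of factored translation: the phrase table carries
# extra factors (here, POS tags) alongside surface forms, letting the
# model condition on linguistic annotation. All entries are made up.

# Plain phrase table: surface form -> [(translation, probability), ...]
plain_table = {
    "house": [("haus", 0.8), ("gebaeude", 0.2)],
}

# Factored phrase table: (surface, POS) on both sides.
factored_table = {
    ("house", "NN"): [(("haus", "NN"), 0.8), (("gebaeude", "NN"), 0.2)],
}

def best_translation(word, pos):
    """Pick the highest-probability factored translation, if any."""
    options = factored_table.get((word, pos), [])
    if not options:
        return None
    return max(options, key=lambda t: t[1])[0]

print(best_translation("house", "NN"))  # ('haus', 'NN')
```

The benefit in real factored models is that generation can decompose over factors (e.g. translate lemma and POS separately, then generate the surface form), improving generalization to unseen inflected forms.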

10 Novel Contributions: Confusion Network Decoding
A weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes.
Allows multiple, ambiguous input hypotheses.
Improves spoken language translation, which is prone to speech recognition errors.
Efficient data formats for translation models and language models.
The second novel contribution is confusion network decoding. In spoken language translation, the input may be noisy and ambiguous. To address this issue, they include confusion network decoding in Moses. A confusion network is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes, as shown in the picture. This contribution allows multiple, ambiguous input hypotheses and therefore improves spoken language translation. The paper also introduces an efficient data structure for translation models and language models, which reduces memory use while maintaining translation speed.
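A confusion network's structure can be sketched as a sequence of slots, each holding alternative words with weights. The words and probabilities below are invented; in Moses the decoder would explore all paths jointly with the translation model rather than just taking the single best one as done here.

```python
# Toy confusion network: a sequence of "slots" of alternative words
# with posterior probabilities (e.g. from a speech recognizer). Since
# every path passes through every slot, the single best input path is
# simply the per-slot argmax. All values below are illustrative.

confusion_network = [
    {"i": 0.9, "eye": 0.1},
    {"saw": 0.6, "so": 0.4},
    {"him": 0.7, "hymn": 0.3},
]

def best_path(cn):
    """Return the highest-probability path and its probability."""
    words, prob = [], 1.0
    for slot in cn:
        w = max(slot, key=slot.get)
        words.append(w)
        prob *= slot[w]
    return " ".join(words), prob

sent, p = best_path(confusion_network)
print(sent, round(p, 3))  # i saw him 0.378
```

Passing the whole network to the decoder, instead of only this best path, is what lets the translation and language models recover from recognition errors ("so" vs. "saw") when the alternatives translate more fluently.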

11 Thanks & Q&A

