Eliciting Features from Minor Languages

Presentation transcript:

Alison Alvarez, Lori Levin, Robert Frederking, Erik Peterson
Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA

Jeff Good (MPI Leipzig)
Max Planck Institute for Evolutionary Anthropology, Deutscher Platz, Leipzig

Overview

This research is part of the AVENUE Machine Translation Project. AVENUE is supported by the US National Science Foundation, NSF grant number IIS

In the field of machine translation, fully aligned and tagged translation corpora are considered one of the most valuable resources for automatically training translation systems. However, such resources are hard to find for minority languages. This obstacle can be overcome with techniques inspired by field linguistics, that is, by drawing on bilingual informants to translate and align given sentences.

Field linguists have relied on questionnaires that have remained relatively static over a number of years. We want the flexibility to change the questionnaire to reflect different semantic domains, different goals for machine translation systems, different levels of detail, etc. We also want the questionnaire to be available in multiple languages. For example, we would want a Spanish version of the questionnaire for use by Latin American minority language speakers. We also want flexibility in lexical selection in order to avoid cultural bias and to choose appropriate lexical items for the major language.

This paper looks at methods for specifying the scope and depth of an elicitation corpus, as well as methods for quick design and implementation of elicitation corpora. The results can also be used as a test suite to explore existing machine translation systems or to design far-reaching corpora for studying low-resource languages.

The Elicitation Tool

The elicitation tool provides a simple interface for bilingual informants with no linguistic training and limited computer skills to translate and word-align a corpus in some source language. The output of the elicitation tool is a text file containing triplets of eliciting sentence, elicited sentence, and alignment. The elicitation tool can produce bilingual glossaries based on the aligned corpus. It also has a simple "auto-align" option to add alignments for unambiguous word pairs in the same file.

Feature Structure Design

A control language is used to define the size and scope of the set of feature structures that GenKit will use to generate the corpus.

np-my-number num-sg num-pl num-dual
Notes for analysis of data: CS, page 38, seem to imply that some combinations of numbers are more expected than others.

Feature Specification

((subj ((np-my-general-type pronoun-type common-noun-type)
        (np-my-person person-first person-second person-third)
        (np-my-number num-sg num-pl)
        (np-my-biological-gender bio-gender-male bio-gender-female)
        (np-my-function fn-predicatee)))
 {[(predicate ((np-my-general-type common-noun-type)
               (np-my-definiteness definiteness-minus)
               (np-my-person person-third)
               (np-my-function predicate)))
   (c-my-copula-type role)]
  [(predicate ((adj-my-general-type quality-type)))
   (c-my-copula-type attributive)]
  [(predicate ((np-my-general-type common-noun-type)
               (np-my-person person-third)
               (np-my-definiteness definiteness-plus)
               (np-my-function predicate)))
   (c-my-copula-type identity)]}
 (c-my-secondary-type secondary-copula)
 (c-my-polarity #all)
 (c-my-function fn-main-clause)
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense past present future)
 (c-v-my-phase-aspect durative))

Callouts on the specification:
“Use all values of polarity” (the #all value of c-my-polarity)
“Multiply out by these lists of values” (features that list several values)
Disjoint set of copula types and their predicates (the bracketed alternatives inside the braces)
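The "multiply out" and "disjoint set" callouts in the Feature Specification section can be read as a Cartesian product over the listed values, combined with a choice among mutually exclusive alternatives. The Python sketch below illustrates that reading only. It is not GenKit or the AVENUE control language: the dictionary representation, the `expand` function, the `ALL_VALUES` inventory, and the value `polarity-negative` are assumptions made for the example, and only a few of the features from the specification are included.

```python
from itertools import product

# Assumed inventory used to resolve "#all"; "polarity-negative" is a
# hypothetical value, only "polarity-positive" appears in the source.
ALL_VALUES = {"c-my-polarity": ["polarity-positive", "polarity-negative"]}


def expand(spec, alternatives, inventory):
    """Multiply out a flat specification into fully specified feature sets.

    spec maps a feature name to a list of allowed values or to "#all".
    alternatives is a list of mutually exclusive feature bundles, standing
    in for the braces-and-brackets block of the specification.
    """
    names = list(spec)
    value_lists = [
        inventory[name] if spec[name] == "#all" else spec[name]
        for name in names
    ]
    results = []
    for combo in product(*value_lists):
        base = dict(zip(names, combo))
        for alternative in alternatives:
            # Each alternative contributes its own bundle of features.
            results.append({**base, **alternative})
    return results


# A reduced version of the specification (illustrative subset).
spec = {
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
    "c-my-polarity": "#all",
    "c-v-my-absolute-tense": ["past", "present", "future"],
}
alternatives = [
    {"c-my-copula-type": "role"},
    {"c-my-copula-type": "attributive"},
    {"c-my-copula-type": "identity"},
]

structures = expand(spec, alternatives, ALL_VALUES)
# 3 persons * 2 numbers * 2 polarities * 3 tenses * 3 copula types
print(len(structures))  # 108
```

Even this reduced specification yields 108 feature structures, which suggests why a control language for limiting size and scope is useful before generation.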
Feature Structures

Feature structures are multi-level sets of feature-value pairs that are used to reflect the grammatical structures intended for elicitation.

((subj ((np-my-general-type pronoun-type)
        (np-my-person person-third)
        (np-my-number num-sg)
        (np-my-biological-gender bio-gender-male)
        (np-my-function fn-predicatee)
        (np-my-animacy anim-human)
        (np-my-info-function info-neutral)
        (np-d-my-distance-from-speaker distance-neutral)
        (np-pronoun-reflexivity reflexivity-n/a)
        (np-my-emphasis emph-no-emph)
        (np-my-semantic-class NEED_VALUES)
        (np-pronoun-exclusivity exclusivity-n/a)
        (np-pronoun-antecedent-function antecedent-n/a)))
 (predicate ((np-my-general-type common-noun-type)
             (np-my-person person-third)
             (np-my-function predicate)
             (np-my-animacy anim-human)
             (np-my-info-function info-neutral)
             (np-d-my-distance-from-speaker distance-neutral)
             (np-pronoun-reflexivity reflexivity-n/a)
             (np-my-emphasis emph-no-emph)
             (np-my-number num-sg)
             (np-my-semantic-class NEED_VALUES)
             (np-pronoun-exclusivity exclusivity-n/a)
             (np-pronoun-antecedent-function antecedent-n/a)))
 (c-my-copula-type role)
 (c-my-secondary-type secondary-copula)
 (c-my-polarity polarity-positive)
 (c-my-function fn-main-clause)
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense past)
 (c-v-my-phase-aspect durative)
 (c-my-imperative-degree imp-degree-n/a)
 (c-my-ynq-type ynq-n/a)
 (c-my-actor's-sem-role actor-sem-role-neutral)
 (c-my-minor-type minor-n/a)
 (c-my-headedness-rc rc-head-n/a)
 (c-my-answer-type ans-n/a)
 (c-my-restrictivess-rc rc-restrictive-n/a)
 (c-my-focus-rc focus-n/a)
 (c-my-actor's-status actor-neutral)
 (c-my-gaps-function gap-n/a)
 (c-my-relative-tense relative-n/a))

When paired with an English grammar and lexicon, the feature structure above generates ‘He was a teacher.’

Our Goals

1. Tools for semi-automated corpus design:
   Test suite for MT
   Structured corpus for input to machine learning
2. A user interface for producing high-quality, word-aligned parallel corpora (Elicitation Tool)
3. Automated learning of morpho-syntax for low-resource languages

Feature Detection

Steps: Sentence Selection, Translation/Alignment, Mapping, Minimal Pair Linking, Difference Detection.

Sentence Selection, Translation/Alignment, Mapping

((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
 (Obj ((person third) (animacy human) (identifiability -) …))
 (tense past) …)

The slides break this structure down into its Subj, Obj, and tense parts and then into individual features: (person first), (num sg), (animacy human), (person third), (identifiability -).

I was a teacher
watashi wa sensei deshita

Minimal Pair Linking

((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
 (Obj ((person third) (animacy human) (identifiability -) …))
 (tense past) …)
“I was a teacher”
Watashi wa sensei deshita

((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
 (Obj ((person third) (animacy human) (identifiability -) …))
 (tense present) …)
“I am a teacher”
Watashi wa sensei desu

Difference Detection

The two feature structures differ only in (tense past) versus (tense present).

watashi wa sensei deshita
watashi wa sensei desu
=       =  =      ≠

Substitution mismatch
Difference is found on ME
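The minimal-pair comparison described under Difference Detection can be sketched in Python as two small comparisons: one over the feature structures and one over the elicited sentences. This is an illustration under stated assumptions, not the AVENUE implementation. The flattened keys such as `subj.person`, the function names `feature_diff` and `token_diff`, and the position-by-position token comparison are all hypothetical; a real system would need proper word alignment for sentences of different lengths.

```python
def feature_diff(fs_a, fs_b):
    """Return the features whose values differ between two flat feature structures."""
    keys = sorted(set(fs_a) | set(fs_b))
    return {
        key: (fs_a.get(key), fs_b.get(key))
        for key in keys
        if fs_a.get(key) != fs_b.get(key)
    }


def token_diff(sentence_a, sentence_b):
    """Compare two sentences token by token, marking each position '=' or '≠'.

    Assumes both sentences have the same number of whitespace-separated
    tokens, which holds for the substitution example from the slides.
    """
    tokens_a, tokens_b = sentence_a.split(), sentence_b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError("position-wise comparison needs equal-length sentences")
    return [
        (a, b, "=" if a == b else "≠")
        for a, b in zip(tokens_a, tokens_b)
    ]


# Flattened versions of the two feature structures from the slides
# (the elided "…" parts are simply left out).
past = {
    "subj.person": "first",
    "subj.num": "sg",
    "subj.animacy": "human",
    "obj.person": "third",
    "obj.animacy": "human",
    "obj.identifiability": "-",
    "tense": "past",
}
present = {**past, "tense": "present"}

differing_features = feature_diff(past, present)
marks = [mark for _, _, mark in token_diff(
    "watashi wa sensei deshita", "watashi wa sensei desu")]

print(differing_features)  # {'tense': ('past', 'present')}
print(" ".join(marks))     # = = = ≠
```

When exactly one feature and exactly one token position differ, as here, the pair suggests that the differing token carries the differing feature: in this example, that deshita versus desu marks tense.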