Language Technologies Institute School of Computer Science Carnegie Mellon University NSF, August 6, 2001 Machine Translation for Indigenous Languages.


1 Machine Translation for Indigenous Languages
Jaime Carbonell, Lori Levin, Alon Lavie
Language Technologies Institute, Carnegie Mellon University
{jgc, lsl, alavie}@cs.cmu.edu

2 Context: Project NICE
Very low-density languages (e.g. Mapudungun, Iñupiaq, Siona,…)
Minimal amount of parallel text (< 100K words)
No standard orthography/spelling
No available trained linguists
Access to native informants possible
Minimize development time and cost
Target: functional but rudimentary MT

3 Two Technical Approaches
Generalized EBMT
–Parallel text 50K-2MB (uncontrolled corpus)
–Rapid implementation
–Proven for major languages with reduced data
Transfer-rule learning
–Elicitation (controlled) corpus to extract grammatical properties
–Seeded version-space learning

4 Architecture Diagram
[Diagram. Labeled components: User; Learning Module (Elicitation Process, SVS Learning Process, Transfer Rules); Run-Time Module (SL Input, SL Parser, Transfer Engine, TL Generator, EBMT Engine, Unifier Module, TL Output)]

5 EBMT Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
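The recombination step in this example can be sketched as a toy fragment lookup. This is only an illustration under stated assumptions, not the Generalized EBMT engine itself: the fragment table is hand-aligned from the two example pairs on the slide, and the name `translate_by_fragments` and the greedy longest-match strategy are hypothetical choices.

```python
# Toy fragment table, hand-aligned from the slide's two example pairs.
# Assumption: a real EBMT system would induce such alignments from the
# parallel corpus rather than have them written by hand.
FRAGMENTS = {
    ("i", "would", "like", "to", "meet"): "Ayükefun trawüael",
    ("the", "tallest", "man"): "Chi doy fütra chi wentru",
}


def translate_by_fragments(sentence, fragments=FRAGMENTS):
    """Greedy longest-match recombination of stored example fragments."""
    words = sentence.lower().rstrip(".").split()
    output, i = [], 0
    while i < len(words):
        # Try the longest candidate span starting at position i first.
        for length in range(len(words) - i, 0, -1):
            key = tuple(words[i:i + length])
            if key in fragments:
                output.append(fragments[key])
                i += length
                break
        else:
            # No stored fragment covers this word: pass it through, marked.
            output.append("<" + words[i] + ">")
            i += 1
    return " ".join(output)


print(translate_by_fragments("I would like to meet the tallest man"))
```

The sketch reproduces the slide's "new" output by plain concatenation. The slide's "correct" form differs from it, which is the gap the transfer-rule approach is meant to address.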

6 Elicitation of Data for Seeded Version Space Learning

7 Example: Elicitation Corpus
English | Spanish | Mapudungun
I fell. | Caí | Tranün
I am falling. | Estoy cayendo | Tranmeken
You (John) fell. | Tu (Juan) caiste | Eymi tranimi (Kuan)
You (John) are falling. | Tu (Juan) estás cayendo | Eimi (Kuan) tranmekeymi
You (Mary) fell. | Tu (María) caiste | Eymi tranimi (Maria)
You (Mary) are falling. | Tu (María) estás cayendo | Eimi tranmekeymi (Maria)
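One way to see why such minimal pairs are useful is to compare the elicited forms directly. The sketch below is an assumption-laden illustration, not the project's feature-detection module: the function name `contrast` is hypothetical, and it only isolates where two surface strings diverge, leaving the linguistic interpretation to the learner or informant.

```python
import os


def contrast(form_a, form_b):
    """Split two elicited forms into their shared prefix and differing remainders."""
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file paths.
    shared = os.path.commonprefix([form_a, form_b])
    return shared, form_a[len(shared):], form_b[len(shared):]


# "I fell." vs. "I am falling." from the elicitation example above.
print(contrast("Tranün", "Tranmeken"))
# "You fell." vs. "You are falling." (verb forms only).
print(contrast("tranimi", "tranmekeymi"))
```

Both pairs share a stem-like prefix and differ only in what follows it, which hints that the past/progressive contrast is expressed on the verb form rather than by a separate word.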

8 The Elicitation Corpus
List of sentences in a major language
–English
–Spanish
Dynamically adaptable
–Different sentences are presented depending on what was previously elicited
Compositional
–Joe, Joe’s brother, I saw Joe’s brother, I told you that I saw Joe’s brother, etc.
Aim for typological completeness
–Cover all types of languages
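The "dynamically adaptable" and "compositional" properties can be sketched as prompts gated on what has already been detected. Everything here is hypothetical: the `PROMPTS` structure, the feature labels, and the function `next_prompts` are illustrative names, and the slides do not describe how the actual elicitation tool selects sentences.

```python
# Hypothetical prompt list: each prompt names the features that must already
# have been detected before it is worth presenting (labels are illustrative).
PROMPTS = [
    {"text": "Joe", "requires": set()},
    {"text": "Joe's brother", "requires": {"proper_noun"}},
    {"text": "I saw Joe's brother", "requires": {"proper_noun", "possessive"}},
]


def next_prompts(prompts, detected, already_asked=()):
    """Return prompts whose prerequisites are met and that were not asked yet."""
    return [
        p["text"]
        for p in prompts
        if p["requires"] <= detected and p["text"] not in already_asked
    ]


print(next_prompts(PROMPTS, set()))
print(next_prompts(PROMPTS, {"proper_noun"}, already_asked=["Joe"]))
```

Larger sentences become available only after their component constructions have been elicited, mirroring the Joe / Joe's brother / I saw Joe's brother progression.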

9 Version Space Learning
Symbolic learning from + and – examples
Invented by Mitchell, refined by Hirsh
Builds generalization lattice implicitly
Bounded by G and S sets
Worst-case exponential complexity (in size of G and S)
Slow convergence rate

10 Seeded Version Spaces
Generate concept seed from first + example
–Generalization-level hypothesis (POS + feature agreement for T-rules in NICE)
Generalization/specialization level bounds
–Up to k-levels generalization, and up to j-levels specialization
Implicit lattice explored seed-outwards
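The seed-outward exploration can be sketched for the upward (generalization) direction only. This is a minimal illustration under assumptions: hypotheses are fixed-length tuples, the single generalization operator replaces one slot with a wildcard, and the names `covers`, `generalizations`, and `seeded_search` are hypothetical. The NICE transfer-rule hypothesis space is richer, and the downward search bounded by j would be analogous.

```python
from itertools import combinations

WILDCARD = "?"


def covers(hypothesis, example):
    """A hypothesis covers an example if every non-wildcard slot matches."""
    return all(h == WILDCARD or h == e for h, e in zip(hypothesis, example))


def generalizations(seed, k):
    """Yield hypotheses at most k wildcard steps above the seed, nearest first."""
    for level in range(k + 1):
        for slots in combinations(range(len(seed)), level):
            yield tuple(WILDCARD if i in slots else v for i, v in enumerate(seed))


def seeded_search(positives, negatives, k):
    """Keep bounded generalizations of the first positive example that are
    consistent with all positives and exclude all negatives."""
    seed = positives[0]
    return [
        h
        for h in generalizations(seed, k)
        if all(covers(h, p) for p in positives)
        and not any(covers(h, n) for n in negatives)
    ]


positives = [("det", "noun", "sg"), ("det", "noun", "pl")]
negatives = [("det", "verb", "sg")]
print(seeded_search(positives, negatives, k=2))
```

Because the search never moves more than k steps from the seed, the number of hypotheses examined is bounded in advance instead of depending on the full lattice.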

11 Complexity of SVS
O(g^k) upward search, where g = # of generalization operators
O(s^j) downward search, where s = # of specialization operators
Since j and k are constants, the SVS runs in polynomial time of order max(j,k)
Convergence rates bounded by F(j,k)
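The upward bound can be checked with a short worked sum. The assumptions here are not stated on the slide: each hypothesis has at most g immediate generalizations, duplicates are not merged, and g is at least 2.

```latex
\[
\sum_{i=0}^{k} g^{i}
  \;=\; \frac{g^{k+1}-1}{g-1}
  \;<\; \frac{g}{g-1}\,g^{k}
  \;\le\; 2\,g^{k}
  \qquad (g \ge 2),
\]
\[
\text{so the upward search visits } O(g^{k}) \text{ hypotheses, and by the same argument the downward search visits } O(s^{j}).
\]
```

With j and k fixed in advance, the combined cost is at most proportional to g^k + s^j, a polynomial in g and s whose degree is max(j,k).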

12 NICE Partners
Language | Country | Institutions
Mapudungun (in place) | Chile | Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education
Iñupiaq (advanced discussion) | US (Alaska) | Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center
Siona (discussion) | Colombia | OAS-CICAD, Plante, Department of the Interior

13 Pilot Version of Elicitation Corpus
Approximately 800 sentences
Tested on Swahili and Mapudungun
Vocabulary
–Include a variety of semantic classes, e.g., animate, inanimate, man-made objects, natural objects, etc.
Noun phrases
–Detect number, gender, types of possessives, classifiers, etc.
Basic sentences
–Detect agreement between verb and subject and/or object, basic word order, problems with indefinite or inanimate subjects, etc.
Complex constructions
–Currently relative clauses. Later, comparatives, questions, embedded clauses, etc.

14 Detection of Grammatical Features
Each language uses a different inventory of grammatical features: tense, number, person, agreement.
Swahili
–The hunter kill-ed the animal
–Mwindaji a-li-mu-ua mnyama
–a: class-one subject; li: past tense; mu: class-one object; ua: kill
Fox (Algonquian)
–Ne-waapam-aa-wa: I-see-direct-him
–Ne-waapam-ek-wa: me-see-indirect-he

15 Mapudungun Data for EBMT
Spanish-Mapudungun parallel corpora
–Total words: 223,366
–Bilingual newspaper, 4 issues
–Ultimas Familias (memoirs)
–Memorias de Pascual Coña
–A publishable version of a historical text with a new translation into Spanish
–35 hours transcribed speech (will be translated into Spanish)
–80 hours recorded speech
Spanish-Mapudungun glossary
–About 5500 entries

16 NICE/Mapudungun: Other Products
Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
Training for spoken language collection: In January 2001, native speakers of Mapudungun were trained in the recording and transcription of spoken data.

17 Summary of Results: iRBMT
Preliminary design and implementation of transfer rule formalism for machine translation.
Design and pilot testing of prototype elicitation corpus.
First prototype of feature detection.
Morphological processing in PC Kimmo covering about 40 Mapudungun morphemes.
Preliminary version of new parser for run-time translation component.

18 Next Steps (original plan)
Lexical and phrasal generalization for EBMT
Complete implementation of transfer-rule interpreter
Implementation of SVS to learn transfer rules
Extend elicitation corpus for evaluation
Evaluate first on Mapudungun MT

19 DARPA Redirection for NICE
Focus on technology for rapid deployment of MT for new (low-density) languages.
Not interested in indigenous endangered languages
–Somali, Kirgistani, Bahasa => yes
–Siona, US-indigenous, Mapudungun => no
First focus on limited-data evaluation for major languages, such as Chinese & Arabic
Statistical methods favored over linguistic.

