Hebrew-to-English XFER MT Project - Update Alon Lavie June 2, 2004.


The Team
Alon Lavie
Shuly Wintner (Faculty at Haifa Univ.)
Yaniv Eytani (MS student at Haifa Univ.)
Erik Peterson and Kathrin Probst…

Hebrew Input: בשורה הבאה
Preprocessing
Morphology
Transfer Engine
Transfer Rules:
  {NP1,3}
  NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
  ((X3::Y1)
   (X1::Y2)
   ((X1 def) = +)
   ((X1 status) =c absolute)
   ((X1 num) = (X3 num))
   ((X1 gen) = (X3 gen))
   (X0 = X1))
Translation Lexicon:
  N::N |: ["$WR"] -> ["BULL"]
  ((X1::Y1)
   ((X0 NUM) = s)
   ((Y0 lex) = "BULL"))
  N::N |: ["$WRH"] -> ["LINE"]
  ((X1::Y1)
   ((X0 NUM) = s)
   ((Y0 lex) = "LINE"))
Translation Output Lattice:
  (0 1 (1 1 (2 2 (1 2 "THE (0 2 "IN (0 4 "IN THE NEXT
Decoder
English Language Model
English Output: in the next line
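The NP1 rule above can be read as "a definite noun phrase followed by H and an adjective becomes adjective plus noun phrase in English, provided number and gender agree". The following is a minimal Python sketch of that reading, not the XFER engine itself. The function name, the dictionary layout and the feature values in the usage example are assumptions made for illustration. Every equation is treated as a plain equality test, which is stricter than the unification the rule formalism implies.

```python
from typing import Optional


def apply_np_adj_rule(np: dict, adj: dict) -> Optional[list]:
    """Sketch of NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1].

    `np` is X1 and `adj` is X3; the literal "H" (X2) has no target side.
    Returns the English constituents in target order, or None if a
    constraint fails. Hypothetical helper, not part of any XFER release.
    """
    # ((X1 def) = +) and ((X1 status) =c absolute)
    if np.get("def") != "+" or np.get("status") != "absolute":
        return None
    # ((X1 num) = (X3 num)) and ((X1 gen) = (X3 gen))
    if np.get("num") != adj.get("num") or np.get("gen") != adj.get("gen"):
        return None
    # (X3::Y1) (X1::Y2): the adjective comes first on the English side
    return [adj["english"], np["english"]]


# Illustrative feature values (assumed, not taken from the analyzer output)
noun = {"english": "LINE", "def": "+", "status": "absolute", "num": "s", "gen": "f"}
adjective = {"english": "NEXT", "num": "s", "gen": "f"}
print(apply_np_adj_rule(noun, adjective))
```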

Main Tasks in Month-1
Hebrew Encoding Issues
Hebrew Language Resources:
  – H-to-E Translation Lexicon
  – Morphological Analyzer
Putting together a front-end to the XFER engine: morphology, format conversions
Elicitation for Hebrew (two versions of EC)
Installing system on local server in Haifa

Main Tasks in Month-2
Improving Hebrew Language Resources:
  – H-to-E Translation Lexicon: “full” spelling, reverse dict, compounds, enhanced English side
  – Morphological Analyzer: all analyses, lattice representation
Manual Transfer Grammar
Collecting development and testing data (and their reference translations)
Development based on small dev-set
Evaluation on test data

Hebrew Encoding Issues
Input texts are (mostly) in standard Windows encoding for Hebrew
Morphology analyzer and other resources are already set to work in an “ASCII-like” representation
  -> Converter script converts the input into the ASCII representation
All further processing is done in the ASCII representation
Lexicon and grammar rules are also in the ASCII representation
Elicitation is done in UTF-8 Hebrew; output is converted to the ASCII representation
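The conversion step can be sketched as follows. This is a minimal illustration rather than the project's converter script. It assumes that "standard Windows encoding" means code page 1255. The mapping table covers only the letters whose ASCII forms can be read off the examples in these slides (B$WRH, $LH, KTIB XSR); the rest of the alphabet is left out rather than guessed, and the function name is hypothetical.

```python
# Hebrew letter -> ASCII symbol, limited to letters attested in the slides.
HEBREW_TO_ASCII = {
    "\u05d1": "B",  # bet
    "\u05d4": "H",  # he
    "\u05d5": "W",  # vav
    "\u05d7": "X",  # het
    "\u05d9": "I",  # yod
    "\u05db": "K",  # kaf
    "\u05dc": "L",  # lamed
    "\u05e1": "S",  # samekh
    "\u05e8": "R",  # resh
    "\u05e9": "$",  # shin
    "\u05ea": "T",  # tav
}


def to_ascii_repr(raw: bytes, encoding: str = "cp1255") -> str:
    """Decode Windows-encoded Hebrew bytes and transliterate to ASCII.

    Characters that are already ASCII (spaces, digits, punctuation) pass
    through unchanged; an unmapped Hebrew letter raises ValueError so that
    gaps in the partial table are noticed instead of silently dropped.
    """
    out = []
    for ch in raw.decode(encoding):
        if ch in HEBREW_TO_ASCII:
            out.append(HEBREW_TO_ASCII[ch])
        elif ch.isascii():
            out.append(ch)
        else:
            raise ValueError(f"no ASCII mapping for {ch!r}")
    return "".join(out)


# The input word from the morphology example, as Windows-1255 bytes
sample = "\u05d1\u05e9\u05d5\u05e8\u05d4".encode("cp1255")
print(to_ascii_repr(sample))
```

Input arriving as UTF-8, as in the elicitation path, could go through the same function with `encoding="utf-8"`.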

Translation Lexicon
“Dahan” H-to-E and E-to-H dictionary available to us
Excel spreadsheet format (from a previous project)
Coverage is not great but not bad
  – H-to-E is about 15K translation pairs
  – E-to-H is about 7K translation pairs
POS information on both sides
No proper names or named entities
Issue with spelling convention “KTIB XSR”

Translation Lexicon
Yaniv wrote scripts that
  – Extract the relevant fields from the Excel file
  – Extract words in “deficient spelling” and transform them into “full spelling”
  – Extract compound nouns and give them special treatment
  – Merge with added lexicons (i.e. names)
  – Sort and remove duplicate entries
  – Convert to the XFER lexicon format
Kathrin adapted a script that “enhances” the lexicon for English generation (plurals of nouns, tensed verb forms)
[Show portion of full lexicon…]
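The last two script steps (sort, remove duplicates, convert to the XFER lexicon format) can be sketched as below. The output format is copied from the two lexicon entries shown on the architecture slide. The function names and the input tuple layout are assumptions, and the real scripts presumably handle more features than the NUM value used here.

```python
def format_entry(pos: str, hebrew: str, english: str, num: str = "s") -> str:
    """Render one translation pair in the lexicon format shown in the slides."""
    return (
        f'{pos}::{pos} |: ["{hebrew}"] -> ["{english}"] '
        f'((X1::Y1) ((X0 NUM) = {num}) ((Y0 lex) = "{english}"))'
    )


def build_lexicon(pairs):
    """Sort the (pos, hebrew, english) tuples, drop duplicates, format each."""
    return [format_entry(*pair) for pair in sorted(set(pairs))]


# Toy input with one duplicate; both pairs come from the slides
pairs = [
    ("N", "$WRH", "LINE"),
    ("N", "$WR", "BULL"),
    ("N", "$WR", "BULL"),
]
lexicon_lines = build_lexicon(pairs)
for line in lexicon_lines:
    print(line)
```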

Morphological Analyzer
Morphology is a big deal for Hebrew
Not just inflections and derivations, but also
  – Different words due to omission of vowels from the script
  – Attached prefixes for conj, det, prepositions, and some attached possessive suffixes
Analyzer program from an MS student at Technion is already available; works on Windows and, with minimal adaptation, on Linux
Coverage is reasonable…
Produces all analyses or a disambiguated analysis for each word
Entire sentence passed as input to morpher (not word-by-word)

Morphological Processing
Split attached prefixes and suffixes into separate words for translation
Produce f-structures as output
Convert feature-value codes to our conventions
Install morpher as a server running on our Linux machines
Yaniv wrote java scripts to handle input-output from the morpher
Erik integrated a wrapper for running morpher as a server on our Linux machines
“All analyses mode”: all possible analyses for each input word are returned, represented in the form of an input lattice

Morphology Example
Input word: B$WRH
| B$WRH |
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|

Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
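To see what the lattice encodes, the arcs Y0 through Y7 can be chained end to end: an arc may follow another when its SPANSTART equals the previous arc's SPANEND, and a full analysis runs from position 0 to position 4. The sketch below enumerates those chains. It is a minimal illustration of reading the lattice, not the analyzer or the XFER engine; the tuple layout and function name are assumptions, and only the span, LEX and POS values are carried over from the slide.

```python
# (SPANSTART, SPANEND, LEX, POS) for arcs Y0..Y7, in slide order
ARCS = [
    (0, 4, "B$WRH", "N"),
    (0, 2, "B", "PREP"),
    (1, 3, "$WR", "N"),
    (3, 4, "$LH", "POSS"),
    (0, 1, "B", "PREP"),
    (1, 2, "H", "DET"),
    (2, 4, "$WRH", "N"),
    (0, 4, "B$WRH", "LEX"),
]


def segmentations(arcs, start, end):
    """Return every arc sequence that covers the span start..end exactly."""
    if start == end:
        return [[]]
    paths = []
    for arc_start, arc_end, lex, pos in arcs:
        if arc_start == start:
            # Extend this arc with every way of covering the remainder
            for rest in segmentations(arcs, arc_end, end):
                paths.append([(lex, pos)] + rest)
    return paths


all_paths = segmentations(ARCS, 0, 4)
for path in all_paths:
    print(" ".join(f"{lex}/{pos}" for lex, pos in path))
```

Each printed line is one candidate reading of the input word that the transfer engine would have to consider.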

Manual Transfer Grammar
Written by Alon in a couple of days…
Current grammar has 36 rules:
  – 21 NP rules
  – one PP rule
  – 6 (verb complexes and VP) rules
  – 8 higher-phrase and sentence-level rules
Captures the most common (mostly local) structural differences between Hebrew and English
[show portion of grammar…]

Elicitation for Hebrew
Erik made sure the Elicitation Tool works for Hebrew
Various versions of EC used:
  – Two reduced versions of full EC
  – Two versions of Structural EC
Shuly and Yaniv translated and aligned a substantial portion of both
Kathrin trained an initial learned grammar

Decoding
Strong Decoder for H-to-E:
  – Kathrin and Alon adapted a script for running Stephan’s decoder.
  – No real amounts of parallel text, so no translation model scores for the edges…
  – Kathrin constructed a new English LM for decoding the Hebrew-to-English system
      160 million words
      Includes the English side of our translation lexicon
[show portion of lattice…]
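With no translation model scores on the edges, choosing among lattice paths falls to the English LM alone. The sketch below shows that idea in its simplest form: a best-path search over a toy lattice scored by a unigram LM. It is not Stephan's decoder. The edges, the log-probabilities and the floor value for unknown words are all invented for illustration, and a real LM would score word sequences rather than isolated words.

```python
def best_path(edges, final_node, logprob, floor=-10.0):
    """Return (score, segments) for the highest-scoring path from node 0.

    `edges` holds (start, end, english_text) tuples with start < end.
    An edge's score is the sum of the unigram log-probabilities of its
    words; words missing from `logprob` get the `floor` score.
    """
    best = {0: (0.0, [])}
    # Sorting by start node guarantees every edge ending at a node has
    # been relaxed before any edge leaving that node is processed.
    for start, end, text in sorted(edges):
        if start not in best:
            continue  # node not reachable from the start
        score = best[start][0] + sum(logprob.get(w, floor) for w in text.split())
        if end not in best or score > best[end][0]:
            best[end] = (score, best[start][1] + [text])
    return best.get(final_node)


# Toy lattice and toy LM scores (all values are made up)
toy_edges = [
    (0, 1, "IN"),
    (1, 2, "THE"),
    (2, 3, "NEXT"),
    (3, 4, "LINE"),
    (3, 4, "BULL"),
]
toy_lm = {"IN": -1.0, "THE": -1.0, "NEXT": -2.0, "LINE": -3.0, "BULL": -5.0}

result = best_path(toy_edges, 4, toy_lm)
print(" ".join(result[1]))
```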

Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Evaluation Results
Test set of 62 sentences from Haaretz newspaper, 2 reference translations
System     BLEU   NIST   P   R   METEOR
No Gram
Learned
Manual
(The scores are missing from this transcript.)

Further Issues
Transfer: XFER engine cannot handle the construction of full lattices anymore (too many entries) -> we need a pruning mechanism
Further improvements in the translation lexicon and morphological analyzer
Decoding:
  – Adding a source-language LM
  – Can we train a translation model?
Manual Grammar development…
Improved grammar learning…