Improving Translation Selection using Conceptual Vectors LIM Lian Tze Computer Aided Translation Unit School of Computer Sciences Universiti Sains Malaysia.

Presentation Overview
- Problem Background & Motivation
- Research Objectives
- Methodology
- Advantages & Contributions

Natural Language is Ambiguous bank ??

Word Sense Disambiguation
Given:
- a list of meanings/senses of words (dictionaries)
- input text containing occurrences of ambiguous words
Assign the correct sense to each particular instance of an ambiguous word in context. Also known as “sense-tagging”.
Example:
- bank#1: a financial institution that accepts deposits and channels the money into lending activities
- bank#2: sloping land (especially the slope beside a body of water)
- …withdraw money from the bank... → bank#1

Disambiguation in Machine Translation (1)
Senses and their Malay translations:
- bank#1: a financial institution that accepts deposits and channels the money into lending activities → bank
- bank#2: sloping land (especially the slope beside a body of water) → tebing
Steps:
- English input: …withdraw money from the bank...
- Sense-tag (WSD): …withdraw money from the bank#1...
- Select translation word, Malay output: …mengeluarkan wang dari bank...
That worked well…

Disambiguation in Machine Translation (2)
Sense and its Malay translations:
- circulation#6: the spread or transmission of something (as news or money) to a wider group or area
- Malay translations: edaran (money), penyebaran (berita, i.e. news)
Steps:
- English input: …50 ringgit notes in circulation...
- Sense-tag (WSD): …50 ringgit notes in circulation#6...
- Translate, Malay output: …duit kertas 50 ringgit dalam edaran?? penyebaran?...
That DIDN’T work well…
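The contrast between the two examples can be made concrete with a small sketch. It assumes a toy sense-to-translation table; the name SENSE_TRANSLATIONS and both helper functions are hypothetical and are not part of the presented system. It only shows that a sense tag settles the translation for bank but still leaves two candidates for circulation#6.

```python
# Hypothetical toy lexicon: each sense ID maps to its Malay translation candidates.
# The entries mirror the two slide examples; the structure itself is an assumption.
SENSE_TRANSLATIONS = {
    "bank#1": ["bank"],
    "bank#2": ["tebing"],
    "circulation#6": ["edaran", "penyebaran"],
}


def candidate_translations(sense_tag):
    """Return the translation candidates that remain after sense-tagging."""
    return SENSE_TRANSLATIONS.get(sense_tag, [])


def is_resolved(sense_tag):
    """True when the sense tag alone determines a single translation word."""
    return len(candidate_translations(sense_tag)) == 1


if __name__ == "__main__":
    for tag in ("bank#1", "circulation#6"):
        print(tag, candidate_translations(tag), is_resolved(tag))
```

For circulation#6 the sense tag is correct, yet a further selection step is still needed to pick between the two Malay words.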

Optimising WSD for MT
Input word, sense number, translation word: select (Lee and Kim 2002)

Main Objective
Existing MT system:
- Selects fragments (translation units) from previously translated examples
- Re-combines selected translation units to produce translation output for new input text
Improve the translation quality of this MT system by adapting a WSD algorithm specifically for MT purposes.

Need semantic knowledge about…
Word senses
- Use dictionary definitions
Pairs of translation words
- From a bilingual knowledge bank (BKB) made up of pairs of sentences that are translations of each other
- Corresponding words in each translation sentence pair are explicitly marked
Need a model to capture semantic knowledge of lexical items
- Conceptual Vectors (CVs) (Lafourcade 2001)
- Using a selection of concepts or themes
- Construct mathematical vectors from concepts
- Thematic similarity between lexical items ≡ angle between CVs
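The last point, thematic similarity as the angle between CVs, can be illustrated with a short sketch. The three concept axes and all vector values below are made up for illustration; only the idea of measuring the angle between two vectors comes from the slide.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two conceptual vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a vector with no active concept is treated as unrelated.
        return math.pi / 2
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


# Hypothetical concept axes: (MONEY, INFORMATION, WATER)
finance_bank = [1.0, 0.0, 0.0]
river_bank = [0.0, 0.0, 1.0]
withdraw_money = [0.9, 0.1, 0.0]

if __name__ == "__main__":
    print(angular_distance(withdraw_money, finance_bank))
    print(angular_distance(withdraw_money, river_bank))
```

A smaller angle means the two lexical items share more active concepts, so here the context "withdraw money" sits closer to the financial sense than to the riverside sense.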

Need to:
Compile CVs for word meanings on 2 levels:
- Word sense (from dictionary)
- Word/phrase translation unit (from BKB), using data compiled from the previous step
Use the compiled information at translation run time to select the correct translation units

Brief Outline
Data Preparation Phase
- Dictionary / Lexicon: word senses, tagged with Concept Category Labels (word → sense number level knowledge)
- BKB Examples: translation units, tagged to give a Translation Unit Profile (word → translation level knowledge)
EBMT Run-time Phase
- Input Text “clues”
- matching, comparison, selection
- selected translation units
- Translated Text

Concept Hierarchy Example: GoiTaikei
noun
  concrete
    agent: person, organisation
    place: facility, region, nature
    object: animate, inanimate
  abstract
    abstract thing: mental state, action
    event: human activity, phenomenon, natural phenomenon
    relation: existence, categorisation system, relation, characteristic, state, form, numerical, location, time

Definition CVs for Word Senses
circulation#6: the spread or transmission of something (such as news or money) to a wider group or area
Concepts: INFORMATION, TRANSMISSION_OF_INFORMATION, SPREAD_MOVEMENT, MONEY
[Figure: activation level plotted against concepts]
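A definition CV of this kind could be assembled roughly as follows. This is a minimal sketch: the four concept labels come from the slide, but the extra WATER axis, the activation values, and the unit-length normalisation are assumptions made for illustration.

```python
import math

# Fixed concept order for the vector. The first four labels appear on the slide;
# WATER is a hypothetical extra axis that stays inactive for this sense.
CONCEPTS = [
    "INFORMATION",
    "TRANSMISSION_OF_INFORMATION",
    "SPREAD_MOVEMENT",
    "MONEY",
    "WATER",
]


def definition_cv(activations):
    """Build a unit-length vector from a {concept: activation level} mapping."""
    vec = [float(activations.get(concept, 0.0)) for concept in CONCEPTS]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Hypothetical activation levels for circulation#6 (the slide gives no numbers).
circulation_6 = definition_cv({
    "INFORMATION": 1.0,
    "TRANSMISSION_OF_INFORMATION": 1.0,
    "SPREAD_MOVEMENT": 1.0,
    "MONEY": 1.0,
})

if __name__ == "__main__":
    print(dict(zip(CONCEPTS, circulation_6)))
```

Keeping the concept order fixed is what lets vectors for different senses be compared component by component.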

Sense-tagging Translation Examples (English)
E: … number [n] of [prep] one [num_card] ringgit [n] coins [n] in [prep] circulation [n].
E (sense-tagged): … number [n]#2 of [prep] one [num_card]#1 ringgit [n]#1 coins [n]#1 in [prep] circulation [n]#6.
M: … bilangan [n] syiling [n] seringgit [n] dalam [prep] edaran [n].

CVs of Translation Pairs
Translation pair σ: circulation, peredaran (BKB examples 2299, 2306, 2309)
BKB Examples:
- 2299: The circulation#5 of air through the pipes… / Peredaran udara melalui paip-paip…
- 2306: … one ringgit coins in circulation#6. / … syiling seringgit dalam peredaran.
- 2309: … dollar note… withdrawn from circulation#6. / Wang kertas … ditarik daripada peredaran.
Vectors:
- V_context(σ) is combined from V_context(σ, 2299), V_context(σ, 2306) and V_context(σ, 2309)
- V_lex_def(σ) is combined from V_lex_def(σ, 2299), V_lex_def(σ, 2306) and V_lex_def(σ, 2309)
- V_profile(σ) is combined from V_context(σ) and V_lex_def(σ)
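The way a profile is built up from per-example vectors might look like the sketch below. The combining operator is an assumption: a normalised component-wise sum is used here as a stand-in, and the function names and the two-dimensional example vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; zero or empty vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(vectors):
    """Normalised component-wise sum (an assumed stand-in for the combining operator)."""
    summed = [sum(column) for column in zip(*vectors)]
    return normalize(summed)


def build_profile(examples):
    """Build V_profile from (context CV, lexical-definition CV) pairs, one per BKB example."""
    v_context = combine([context for context, _ in examples])
    v_lex_def = combine([lex_def for _, lex_def in examples])
    return combine([v_context, v_lex_def])


# Hypothetical two-concept vectors standing in for examples 2299, 2306 and 2309.
examples = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]

if __name__ == "__main__":
    print(build_profile(examples))
```

Whatever operator is actually used, the profile ends up carrying both the contexts in which the translation pair occurred and the dictionary senses it was tagged with.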

During Translation
Data Preparation Phase
- Dictionary / Lexicon: word senses, tagged with Concept Category Labels (word → sense number level knowledge)
- BKB Examples: translation units, tagged to give a Translation Unit Profile (word → translation level knowledge)
EBMT Run-time Phase
- Input Text “clues”
- matching, comparison, selection
- selected translation units
- Translated Text

Some Results
Translating ‘circulation’ to Malay: edaran or penyebaran
- TS: proposed translation selection using CVs
- BS: baseline strategy, which chooses
  - the translation that co-occurs with the same input words (and same structure) as in the BKB,
  - or the most frequently occurring translation
Results (input sentence, then the translation chosen):
- We will stop the circulation of that magazine. TS: edaran; BS: penyebaran
- We will stop the circulation of that rumour. penyebaran
- We will stop the circulation of that newspaper. TS: edaran; BS: penyebaran
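The difference between the two strategies can be sketched as follows. Everything numeric here is invented for illustration: the concept axes, the profile vectors, the context vectors and the BKB counts are all hypothetical, and only the frequency fallback of the baseline is modelled, not its co-occurrence rule.

```python
import math

# Hypothetical concept axes: (PUBLICATION, MONEY, NEWS)
PROFILES = {
    "edaran": [0.7, 0.7, 0.1],
    "penyebaran": [0.1, 0.0, 1.0],
}
# Hypothetical counts of how often each translation occurs in the BKB.
BKB_COUNTS = {"edaran": 3, "penyebaran": 5}


def angular_distance(u, v):
    """Angle in radians between two vectors; zero vectors count as unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return math.pi / 2
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def select_ts(context_cv, profiles=PROFILES):
    """TS-style choice: the translation whose profile is closest in angle to the input context."""
    return min(profiles, key=lambda t: angular_distance(context_cv, profiles[t]))


def select_bs(counts=BKB_COUNTS):
    """Baseline fallback: the most frequently occurring translation."""
    return max(counts, key=counts.get)


magazine_context = [1.0, 0.0, 0.2]
rumour_context = [0.0, 0.0, 1.0]

if __name__ == "__main__":
    print(select_ts(magazine_context), select_ts(rumour_context), select_bs())
```

Because the frequency fallback ignores the input context, it returns the same word for every sentence, whereas the angle-based choice can change with the surrounding concepts.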

Advantages and Weaknesses
Pros:
- optimized for EBMT: focuses on translation selection, bypassing intermediate WSD at run time
- handles many-to-many mapping of source word ↔ sense ↔ translation words
- allows for bi-directional translation with sense-tagging for 1 language
- mathematical operations on vectors are easy to implement
- avoids a combinatorial effect when the input contains multiple ambiguous words
Cons:
- not all ambiguities can be solved using co-occurring concepts
- does not handle translation selection of function words
- manual work required in data preparation

Research Contributions
- Adaptation of a WSD approach for the specific aim of translation selection
- Proposal of specific guidelines for assigning related concepts for word meanings from dictionaries
- Production of knowledge about word meanings on two levels:
  - Word senses as in dictionaries
  - Translations as in parallel text

Summary
WSD can be customized for different NLP applications
- Different requirements
- Increase efficiency
WSD and related tasks based on concepts common to co-occurring word senses can be facilitated using the conceptual vector model
- Requires a concept category hierarchy and word sense list
- Concepts related to a word sense are modelled as a mathematical vector
- Conceptual similarity = angular distance between vectors
Future work
- Automating data preparation tasks
- Investigating suitable weights or normalizing factors during CV manipulation
- Integration with other WSD or translation selection strategies

Future Work
- Automate tagging tasks that are currently done manually
- Investigate different weight values for CVs for different syntactic relations or word classes
- Integrate with other WSD/translation selection tasks

Thank You