Extracting bilingual terminologies from comparable corpora By: Ahmet Aker, Monica Paramita, Robert Gaizauskas CS671: Natural Language Processing Prof.


Extracting bilingual terminologies from comparable corpora
By: Ahmet Aker, Monica Paramita, Robert Gaizauskas
CS671: Natural Language Processing, Prof. Amitabha Mukerjee
Presented by: Ankit Modi (10104)

Introduction
» Bilingual terminologies are important for various applications of human language technologies
» Earlier studies can be distinguished by whether they work on parallel or comparable corpora
» Focusing on comparable corpora is crucial because parallel corpora are hard to find for all language pairs

Task
Extract bilingual terminologies from comparable corpora.
Comparable corpora: a collection of source-target language document pairs that are not direct translations of each other but are topically related.

Method
» Pair each term extracted from S with each term extracted from T (S and T being the source-language and target-language documents of a pair)
Term: a contiguous sequence of words (no particular syntactic restriction)

Method
» Pair each term extracted from S with each term extracted from T
» Treat term alignment as a binary classification task
» Extract features for each potential S-T term pair
» Decide whether to classify the pair as term-equivalent or not (SVM binary classifier with a linear kernel)
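The pairing step above can be sketched in a few lines. This is only an illustration under stated assumptions: the term lists are invented examples, `candidate_pairs` is a hypothetical helper name, and the slides do not say how terms are extracted in the first place.

```python
from itertools import product

def candidate_pairs(source_terms, target_terms):
    """Return every (source term, target term) combination as a candidate pair."""
    return list(product(source_terms, target_terms))

# Invented example terms, not taken from the presentation.
source_terms = ["renewable energy", "energy policy"]
target_terms = ["erneuerbare Energie", "Energiepolitik", "Binnenmarkt"]

pairs = candidate_pairs(source_terms, target_terms)
print(len(pairs))  # 2 source terms x 3 target terms = 6 candidates
```

Every candidate pair is later turned into a feature vector and passed to the classifier, so the number of classifier calls grows with the product of the two term counts.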

Feature Extraction
» Dictionary Based Features
1. isFirstWordTranslated (binary feature)
2. isLastWordTranslated
3. percentageOfTranslatedWords
4. percentageOfNotTranslatedWords

Feature Extraction
» Dictionary Based Features
5. longestTranslatedUnitInPercentage
6. longestNotTranslatedUnitInPercentage
7. averagePercentageOfTranslatedWords
» The first 6 features are computed in both directions (S -> T and T -> S), giving 2 x 6 + 1 = 13 Dictionary Based Features in total
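The slides list the dictionary features by name only, so the sketch below fills in plausible definitions as assumptions: the first/last-word features compare the first (last) words of the two terms, a word counts as translated if any dictionary translation of it occurs in the other term, and the percentages are taken over the length of the term being examined. The toy dictionaries and the helper names are hypothetical.

```python
def has_translation(word, other_term, dictionary):
    """True if any dictionary translation of `word` occurs in `other_term`."""
    return any(t in other_term for t in dictionary.get(word, ()))

def longest_run(flags, value):
    """Length of the longest contiguous run of `value` in `flags`."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag == value else 0
        best = max(best, run)
    return best

def dictionary_features(term, other_term, dictionary):
    """Six directional features for term -> other_term (both are word lists)."""
    flags = [has_translation(w, other_term, dictionary) for w in term]
    n = len(term)
    return {
        "isFirstWordTranslated": int(other_term[0] in dictionary.get(term[0], ())),
        "isLastWordTranslated": int(other_term[-1] in dictionary.get(term[-1], ())),
        "percentageOfTranslatedWords": sum(flags) / n,
        "percentageOfNotTranslatedWords": (n - sum(flags)) / n,
        "longestTranslatedUnitInPercentage": longest_run(flags, True) / n,
        "longestNotTranslatedUnitInPercentage": longest_run(flags, False) / n,
    }

# Toy dictionaries (assumed), one per direction.
en_de = {"renewable": {"erneuerbare"}, "energy": {"energie"}}
de_en = {"erneuerbare": {"renewable"}, "energie": {"energy"}}
src = ["renewable", "energy", "sources"]
tgt = ["erneuerbare", "energie"]

s2t = dictionary_features(src, tgt, en_de)
t2s = dictionary_features(tgt, src, de_en)

# 6 features per direction plus the direction-independent average = 13.
features = {f"{k}_S2T": v for k, v in s2t.items()}
features.update({f"{k}_T2S": v for k, v in t2s.items()})
features["averagePercentageOfTranslatedWords"] = (
    s2t["percentageOfTranslatedWords"] + t2s["percentageOfTranslatedWords"]
) / 2
print(len(features))  # 13
```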

Feature Extraction
» Cognate Based Features
1. Longest Common Subsequence Ratio (LCSR). Ex: LCSR('dollar', 'dolari') = 5/6
2. Longest Common Substring Ratio (LCSTR). Ex: LCSTR('dollar', 'dolari') = 3/6
3. Dice Similarity: Dice = 2 * LCST / (len(X) + len(Y)), where LCST is the length of the longest common substring of X and Y
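The first three cognate measures can be checked against the 'dollar' / 'dolari' example with a short sketch. The slides do not state the denominator of the two ratios; dividing by the length of the longer string is assumed here, which is consistent with the 5/6 and 3/6 values. Function names are hypothetical and both strings are assumed to be non-empty.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcst_length(x, y):
    """Length of the longest common substring (characters must be adjacent)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else 0)
            best = max(best, curr[j])
        prev = curr
    return best

def lcsr(x, y):
    return lcs_length(x, y) / max(len(x), len(y))

def lcstr(x, y):
    return lcst_length(x, y) / max(len(x), len(y))

def dice(x, y):
    return 2 * lcst_length(x, y) / (len(x) + len(y))

print(lcsr("dollar", "dolari"))   # 5/6: common subsequence "dolar"
print(lcstr("dollar", "dolari"))  # 3/6: common substring "dol"
print(dice("dollar", "dolari"))   # 2 * 3 / 12
```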

Feature Extraction
» Cognate Based Features
4. Needleman-Wunsch Distance (NWD): NWD = LCST / min[len(X), len(Y)]
5. Levenshtein Distance, normalized to a similarity: LDn = 1 - (LD / max[len(X), len(Y)])
» We have 5 Cognate Based Features
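A minimal sketch of the remaining two cognate measures follows. It reads the NWD denominator as the smaller of the two string lengths, which is an assumption about the formula on the slide, and it uses the standard dynamic-programming edit distance for LD. Function names are hypothetical and both strings are assumed to be non-empty.

```python
def lcst_length(x, y):
    """Length of the longest common substring of x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else 0)
            best = max(best, curr[j])
        prev = curr
    return best

def levenshtein(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # delete cx
                            curr[j - 1] + 1,            # insert cy
                            prev[j - 1] + (cx != cy)))  # substitute or match
        prev = curr
    return prev[-1]

def nwd(x, y):
    return lcst_length(x, y) / min(len(x), len(y))

def ld_normalized(x, y):
    return 1 - levenshtein(x, y) / max(len(x), len(y))

print(levenshtein("dollar", "dolari"))    # 2: delete one 'l', insert 'i'
print(nwd("dollar", "dolari"))            # 3 / 6
print(ld_normalized("dollar", "dolari"))  # 1 - 2/6
```

Note that LDn is a similarity rather than a distance: identical strings score 1 and completely different strings of equal length score 0.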

Feature Extraction
» Cognate based features with term matching
Applicable to language pairs whose alphabets do not share a common character set
A mapping is performed from a source term to the target writing system, or vice versa
The same cognate features as before are calculated in both directions
» We have 2 x 5 = 10 such features
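The mapping step can be pictured as a character-level table that rewrites a term into the other writing system before the cognate measures are applied. The sketch below is an assumption-heavy illustration: the Greek-to-Latin table is invented for this example and covers only the letters it needs, and it is not the mapping used by the authors.

```python
import unicodedata

# Illustrative table (assumed); a real mapping would cover the whole alphabet.
GREEK_TO_LATIN = str.maketrans(
    {"δ": "d", "ο": "o", "λ": "l", "α": "a", "ρ": "r", "ι": "i"}
)

def to_latin(term):
    """Map a Greek term into Latin script, dropping accents first."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.translate(GREEK_TO_LATIN)

mapped = to_latin("δολάριο")
print(mapped)  # dolario
```

Once the term is in the same script as its counterpart, string measures such as LCSR or the normalized Levenshtein score can compare the two directly. The reverse direction would use a second table from the target script back to the source script.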

Feature Extraction
» Combined Features
1. isFirstWordCovered (coverage = translation + transliteration)
2. isLastWordCovered
3. percentageOfCoverage
4. percentageOfNonCoverage
5. difBetweenCoverageAndNonCoverage
» Calculated in both directions: 2 x 5 = 10 Combined Features
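One way to read the combined features is that a word counts as covered when it is either translated by the dictionary or is spelled closely enough to a word in the other term. The sketch below follows that reading; the similarity test uses `difflib.SequenceMatcher` with a 0.8 threshold purely as a stand-in for a transliteration check, and the threshold, dictionary and helper names are all assumptions.

```python
from difflib import SequenceMatcher

def is_covered(word, other_term, dictionary, threshold=0.8):
    """Covered = has a dictionary translation in other_term, or looks like a cognate."""
    if any(t in other_term for t in dictionary.get(word, ())):
        return True
    return any(SequenceMatcher(None, word, o).ratio() >= threshold for o in other_term)

def combined_features(term, other_term, dictionary):
    """Five directional coverage features for term -> other_term (word lists)."""
    flags = [is_covered(w, other_term, dictionary) for w in term]
    coverage = sum(flags) / len(term)
    return {
        "isFirstWordCovered": int(flags[0]),
        "isLastWordCovered": int(flags[-1]),
        "percentageOfCoverage": coverage,
        "percentageOfNonCoverage": 1 - coverage,
        "difBetweenCoverageAndNonCoverage": coverage - (1 - coverage),
    }

# Toy example (assumed): "european" is covered by the dictionary,
# "parliament" by its spelling similarity to "parlament", "building" by neither.
en_de = {"european": {"europäisches"}}
src = ["european", "parliament", "building"]
tgt = ["europäisches", "parlament"]

combined = combined_features(src, tgt, en_de)
print(combined)
```

Running the same function with the arguments swapped and a target-to-source dictionary gives the other five features of the ten.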

Feature Extraction
» We have 38 features in total
Dictionary based features: 13
Cognate based features: 5
Cognate based features with term matching: 10
Combined features: 10

Evaluation 1
» A set of positive and negative examples is created
» Precision, recall and F-score are calculated
» Precision ranges from 67 to 100 percent
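For reference, the three scores are computed from the counts of true positives, false positives and false negatives. The sketch below uses made-up labels purely to show the arithmetic; they are not results from the presentation, and the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Scores for the positive class; gold and predicted are parallel 0/1 lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Invented labels: 1 = term pair is equivalent, 0 = it is not.
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)  # 2 TP, 1 FP, 1 FN
```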

Evaluation 2
» Manual evaluation
» Human assessors are asked to assign each term pair to one of four categories: Equivalence, Inclusion, Overlap and Unrelated
» Over 80 percent of the term pairs were assessed as belonging to the first category, Equivalence

Dataset
» Training data taken from the EUROVOC thesaurus
» English-German term-tagged comparable corpora used for manual evaluation

Thank You

Manual Evaluation
» Equivalence: The terms are exact translations/transliterations of each other
» Inclusion: An exact translation/transliteration of one term is contained within the other
» Overlap: The terms share at least one translated/transliterated word
» Unrelated: No word in either term is a translation/transliteration of a word in the other

Error
» The error percentage was generally low
» Reason for errors: words with very similar spellings but completely different meanings

SVM Binary Classifier
» Pair each term extracted from S with each term extracted from T
» Treat term alignment as a binary classification task
» Linear kernel
» Trade-off parameter between training error and margin: c = 10
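To make the role of the trade-off parameter concrete, here is a small linear SVM trained by batch subgradient descent on the objective 0.5 * ||w||^2 + C * (sum of hinge losses). This trainer is swapped in for illustration only: the slides do not say which SVM toolkit was used, and the learning rate, epoch count and two-feature toy data are all assumptions.

```python
def train_linear_svm(X, y, C=10.0, lr=0.001, epochs=2000):
    """Minimise 0.5*||w||^2 + C * sum(hinge losses) by batch subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regulariser 0.5*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # margin violated: the hinge term is active
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (term pair is equivalent) or -1 (it is not)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy feature vectors (assumed): [percentageOfTranslatedWords, LCSR].
X = [[1.0, 0.9], [0.8, 0.8], [0.0, 0.2], [0.2, 0.1]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y, C=10.0)
print([predict(w, b, x) for x in X])
```

A larger C penalises training errors more heavily relative to the margin width, so the classifier fits the training pairs more tightly. In the setting described on the slides each vector would hold all 38 features rather than two.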

Future Work
» Investigate the usefulness of the extracted term pairs in application scenarios such as machine translation