2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.


Outline
- Motivation
- System Setup and Changes
- Monolingual Experiments
- Crosslingual Experiments
- SMT system
- Training data
- Translation results
- OOV Reduction
- FAQ Retrieval Results
- Conclusions and Future Work

Motivation
- Task: given an SMS query, find FAQ documents answering the query
- Last year's DCU system:
  - SMS correction and normalisation
  - In-domain (ID) retrieval: three approaches (SOLR, Lucene, term overlap)
  - Out-of-domain (OOD) detection: three approaches (term overlap, normalized BM25 scores, ML)
  - Combination of ID retrieval and OOD results

Motivation
- This year's system:
  - Same SMS correction and normalisation, plus one more spelling correction resource (manually created)
  - Single retrieval approach: Lucene with BM25 retrieval model
  - Single OOD detection approach: IB-1 classification using Timbl (machine learning), with additional features for term overlap and normalized BM25 scores
  - Trained statistical machine translation system for document translation (Hindi to English)
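The two OOD features named above (term overlap and a normalized BM25 score) and a nearest-neighbour decision can be sketched as follows. This is a minimal illustration, not the DCU system: the slides do not say how BM25 scores were normalized, so dividing the top score by the number of distinct query terms is an assumption, and the plain Euclidean 1-NN in `one_nn` only stands in for Timbl's IB-1 classifier. All function names are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def term_overlap(query, doc):
    """Fraction of distinct query terms that also occur in the document."""
    q = set(query)
    return len(q & set(doc)) / len(q) if q else 0.0

def ood_features(query, docs):
    """Return (term overlap, normalized BM25) for the top-ranked document."""
    scores = bm25_scores(query, docs)
    best = max(range(len(docs)), key=lambda i: scores[i])
    # Assumption: normalize by the number of distinct query terms.
    n_terms = len(set(query)) or 1
    return (term_overlap(query, docs[best]), scores[best] / n_terms)

def one_nn(features, training):
    """Label of the closest training example (Euclidean 1-NN stand-in for IB-1)."""
    return min(training, key=lambda ex: math.dist(features, ex[0]))[1]
```

A query whose feature vector lies closest to an OOD training example would be answered with "NONE" instead of a ranked FAQ list.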

Questions
Investigate
- the influence of OOD detection on system performance
- the influence of out-of-vocabulary (OOV) words on crosslingual performance

Collection Statistics
Language         | Documents | Training (rel/non_rel) | Test (rel/non_rel)
English          | [...]     | [...] (3047/1429)      | 1733 (726/1007)
Hindi            | [...]     | [...] (173/381)        | 579 (200/379)
English to Hindi | [...]     | [...] (173/381)        | 431 (75/356)

Monolingual Experiments (Setup)
- Experiments for English and Hindi
- Processing steps:
  1. Normalize SMS and FAQ documents
  2. Correct SMS queries
  3. Retrieve answers
  4. Detect OOD queries (or not), e.g. “NONE” queries
  5. Produce final result
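The five processing steps can be sketched end to end as below. This is a toy illustration under stated assumptions: the `CORRECTIONS` lexicon is invented, term overlap stands in for Lucene's BM25 ranking, and a fixed overlap threshold stands in for the trained OOD classifier. None of these names or values come from the original system.

```python
# Hypothetical correction lexicon mapping SMS spellings to standard forms.
CORRECTIONS = {"u": "you", "r": "are", "hw": "how", "2": "to", "acct": "account"}

def normalize(text):
    """Step 1: lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t]

def correct(tokens, lexicon=CORRECTIONS):
    """Step 2: replace known SMS spellings; leave unknown tokens unchanged."""
    return [lexicon.get(t, t) for t in tokens]

def retrieve(query_tokens, faqs):
    """Step 3: rank FAQ ids by the share of query terms they contain."""
    q = set(query_tokens)
    ranked = []
    for faq_id, text in faqs.items():
        overlap = len(q & set(normalize(text))) / len(q) if q else 0.0
        ranked.append((overlap, faq_id))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked

def answer(sms, faqs, detect_ood=True, threshold=0.5):
    """Steps 4 and 5: optionally flag OOD queries, then produce the result."""
    ranked = retrieve(correct(normalize(sms)), faqs)
    if detect_ood and (not ranked or ranked[0][0] < threshold):
        return ["NONE"]  # query judged out-of-domain
    return [faq_id for score, faq_id in ranked if score > 0]
```

Switching `detect_ood` off corresponds to the runs without OOD detection, where every query receives whatever ranked list retrieval produces.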

Crosslingual Experiments (Setup)
- Experiments for English to Hindi
- Additional translation step to translate Hindi FAQ documents into English
- Translation is based on a newly trained statistical machine translation (SMT) system
- Problems:
  - sparse training data → combination of different training resources
  - out-of-vocabulary (OOV) words → OOV reduction

Crosslingual Experiments (SMT System)
Training an SMT system:
- Data preparation: tokenization/normalization scripts
- Data alignment: Giza++ for word-level alignment
- Phrase extraction: Moses MT toolkit
- Training a language model: SRILM for trigram LM with Kneser-Ney smoothing
- Tuning: minimum error rate training (MERT)

Crosslingual Experiments (Training Data)
- Agro (agricultural domain): 246 sentences
- Crowdsourced HI-EN data: 50k sentences
- EILMT (tourism domain): 6700 sentences
- ICON: 7000 sentences
- TIDES: 50k sentences
- FIRE ad-hoc queries: 200 titles, 200 descriptions
- Interlanguage Wikipedia links: 27k entries
- OPUS/KDE: 97k entries
- UWdict: 128k entries

Translation Results (Hindi to English)
Data               | Training / Test / Development | BLEU
TIDES              | 49,504 / 697 / [...]          | [...]
Crowdsourced EN-HI | 41,396 / 8000 / [...]         | [...]
ICON               | 7000 / 500 / [...]            | [...]

OOV Reduction
- Problem: 15.4% untranslated words in translation output
- Idea: modify untranslated words to obtain a translation
- OOV reduction is based on two resources:
  - UWdict
  - Manually created transliteration lexicon (TRL): 639 entries

OOV Reduction
Word modifications:
- Character normalization, e.g.
  - replace Chandrabindu with Bindu
  - delete Virama character
  - replace long with short vowels
- Stemming
  - Lucene Hindi stemmer
- Transliteration
  - ITRANS transliteration rules
  - rules for cleaning up ITRANS results
- Decompounding
  - word is split at every position into candidate constituents
  - word is decompounded if both constituents have a translation
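Character normalization and decompounding can be sketched as below. This is a hedged illustration: the vowel mapping covers only the long/short I and U pairs (the slides do not list the exact vowel rules), stemming and ITRANS transliteration are left out, and `lexicon` is a hypothetical dictionary standing in for UWdict or the TRL lexicon.

```python
# Devanagari character normalization table (code points given as escapes).
CHAR_MAP = str.maketrans({
    "\u0901": "\u0902",  # Chandrabindu -> Bindu (Anusvara)
    "\u094D": None,      # delete Virama
    "\u0908": "\u0907",  # independent long I -> short I
    "\u090A": "\u0909",  # independent long U -> short U
    "\u0940": "\u093F",  # vowel sign long I -> short I
    "\u0942": "\u0941",  # vowel sign long U -> short U
})

def normalize_chars(word):
    """Apply the character-level normalization rules to one word."""
    return word.translate(CHAR_MAP)

def decompound(word, lexicon):
    """Try every split position; accept the first split whose two
    constituents both have a translation, else return None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return lexicon[left], lexicon[right]
    return None

def translate_oov(word, lexicon):
    """Look up the original form, then the normalized form, then
    compound constituents; return a list of translations or None."""
    normalized = normalize_chars(word)
    for form in (word, normalized):
        if form in lexicon:
            return [lexicon[form]]
    parts = decompound(normalized, lexicon)
    return list(parts) if parts else None
```

Trying cheaper lookups first mirrors the ordering of lookup forms in the results table (original term, then normalized term, then compound constituents), so a word is only split when no whole-word translation exists.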

OOV Reduction Results (Hindi to English)
Lookup form             | Lookup data | Count    | % Reduction
original term           | UWdict      | [...]    | [...]
original term           | TRL         | 83       | 0.3
normalized term         | UWdict      | [...]    | [...]
normalized term         | TRL         | 24       | 0.1
stemmed term            | UWdict      | 1,[...]  | [...]
stemmed term            | TRL         | 14       | 0.0
stemmed normalized term | UWdict      | [...]    | [...]
stemmed normalized term | TRL         | 0        | 0.0
compound constituents   | UWdict      | [...]    | [...]
transliteration         | N/A         | 24,[...] | [...]

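FAQ Retrieval Results
Run   | Language | OOD detection | OOV reduction | ID correct | OOD correct | MRR
1     | EN       | N             | -             | 661/726    | 19/[...]    | [...]
[...] | EN       | Y             | -             | 595/726    | 981/[...]   | [...]
[...] | HI       | N             | -             | 77/379     | 13/[...]    | [...]
[...] | HI       | Y             | -             | 26/379     | 375/[...]   | [...]
[...] | EN2HI    | N             | N             | 29/75      | 41/[...]    | [...]
[...] | EN2HI    | N             | Y             | 22/75      | 60/[...]    | [...]
[...] | EN2HI    | Y             | Y             | 4/75       | 989/[...]   | [...]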
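For reference, mean reciprocal rank (MRR) averages, over all queries, the reciprocal of the rank at which the first correct answer appears. The sketch below shows the standard definition; treating "NONE" as the correct answer for OOD queries is an assumption about the evaluation setup, and the function name is hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: query id -> ranked list of answer ids.
    relevant: query id -> set of correct answer ids
    (assumed to be {"NONE"} for out-of-domain queries)."""
    total = 0.0
    for query_id, ranking in rankings.items():
        gold = relevant.get(query_id, set())
        for rank, answer_id in enumerate(ranking, start=1):
            if answer_id in gold:
                total += 1.0 / rank  # only the first correct answer counts
                break
    return total / len(rankings) if rankings else 0.0
```

A query with no correct answer in its ranking contributes zero, which is one way OOD detection can lower the number of correct ID queries while still raising MRR overall.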

Conclusions
- Monolingual experiments:
  - Good performance for English and Hindi
  - OOD detection improves MRR (but reduces number of correct ID queries)
- Crosslingual experiments:
  - Lower performance
  - OOD detection reduces MRR
  - OOV reduction reduces MRR

Future work
- Further analysis of our results needed:
  - Normalization issues for MT training data?
  - Unbalanced OOD training data for Hindi and English?
  - Is there Hindi textese (e.g. abbreviations)?
  - Does the training data match the test data (manually vs. automatically created)?
- Improve transliteration approach
- Comparison to other submissions

10q 4