Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words Gina-Anne Levow University of Chicago July 7, 2003.

Presentation transcript:

Roadmap
Goals of expansion
  – Expansion points in CL-SDR
Pre- and post-translation document expansion experiments
  – Task, query & document processing
  – Expansion methodology
Results
Discussion & Conclusions

Why Expansion?
Recover terms that could have appeared
  – Compensate for difference in term choice
      Author concepts vs. searcher information need
  – Compensate for noisy processing
      ASR transcription errors
        – Misrecognitions, deletions, missegmentations
      Translation errors
        – Gaps, missegmentations
        – Context disambiguates

Expansion Opportunities
Query:
  – (Ballesteros & Croft ’96; McNamee & Mayfield 2002)
  – Before, after translation; both
  – Different enhancements to precision/recall
  – Pre-translation key – something to translate
      European languages
Document:
  – Before, after translation; both
  – Developed for monolingual SDR (Singhal 1999)
  – CLIR (+SDR) (Levow & Oard 2000)
      Post-translation promising

Experimental Configuration: Basic Task
Variant of Topic Detection and Tracking (TDT)
  – English queries to Mandarin documents
      Query-by-example
        – English newswire or broadcast news stories
      Mandarin audio broadcast news documents
        – Automatically transcribed by Dragon ASR system
  – Modifications:
      Retrospective retrieval
      Evaluation metric: Mean Average Precision
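
The slide names Mean Average Precision as the evaluation metric but does not say which variant was computed. As a rough illustration, the sketch below uses the common uninterpolated definition; the function names and the toy document IDs are assumptions made for this example, not part of the original system.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the per-query average precision over all queries."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical rankings for two queries.
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["a", "b"], {"b"}),
    ]
    print(mean_average_precision(runs))
```

Averaging per query, rather than pooling all documents, gives each topic equal weight regardless of how many relevant stories it has.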

Experimental Configuration: Query and Document Processing
Query:
  – Select top 180 positively correlated terms in 4 exemplars
      Based on χ² test
      996 prior documents assumed not relevant
Document:
  – Dictionary-based word-for-word translation
      Segmentation: NMSU ch_seg
      Translation resource:
        – Merged bilingual term list: CETA & LDC term list
      Translation ranking:
        – Target language unigram frequency: single words, multi-word
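
The query-side term selection described above can be sketched roughly as follows. The slide gives only the outline (top 180 positively correlated terms, a χ² test, exemplars against prior documents assumed not relevant), so the 2x2 contingency table over document-level term presence, the function names, and the toy data are all assumptions for illustration.

```python
from typing import List, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 table.

    a: exemplars containing the term      b: exemplars without it
    c: background docs containing it      d: background docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(
    exemplars: List[Set[str]],
    background: List[Set[str]],
    k: int = 180,
) -> List[str]:
    """Return up to k terms positively correlated with the exemplars, by chi-squared."""
    scored = []
    vocab = set().union(*exemplars)
    for term in vocab:
        a = sum(term in doc for doc in exemplars)
        b = len(exemplars) - a
        c = sum(term in doc for doc in background)
        d = len(background) - c
        if a * d > b * c:  # keep only terms over-represented in the exemplars
            scored.append((chi_square(a, b, c, d), term))
    # Highest statistic first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


if __name__ == "__main__":
    # Hypothetical exemplar and background stories, already tokenized.
    exemplars = [{"iraq", "talks", "the"}, {"iraq", "the"}]
    background = [{"the", "sports"}, {"the", "weather"}, {"the"}]
    print(select_query_terms(exemplars, background))
```

In the configuration on the slide, `exemplars` would hold the 4 example stories and `background` the 996 prior documents; a term such as "the", which appears everywhere, shows no positive correlation and is dropped.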

Experimental Configuration: Document Expansion

Document Expansion: Details
Side collections:
  – Mandarin: TDT-2 Xinhua, Zaobao newswire
  – English: TDT-2 New York Times, AP news
Expansion term selection
  – Top 5 documents
  – Sort candidate terms by idf
  – Exclude terms in only one document
  – Add one term instance per document
  – Add until document doubled in length
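
The expansion term selection steps above can be sketched as below. Several details are assumptions, since the slide does not spell them out: the tf-idf overlap score stands in for whatever retrieval engine was actually used, "one term instance per document" is read as one added instance for each top-ranked document containing the term, and terms already in the document are allowed as candidates. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(collection: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency for every term in the side collection."""
    n = len(collection)
    df = Counter(term for doc in collection for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def top_documents(doc: List[str], collection: List[List[str]],
                  idf: Dict[str, float], k: int = 5) -> List[List[str]]:
    """Use the document itself as a query; rank side documents by tf-idf overlap."""
    query = Counter(doc)

    def score(candidate: List[str]) -> float:
        tf = Counter(candidate)
        return sum(query[t] * tf[t] * idf.get(t, 0.0) ** 2 for t in query if t in tf)

    ranked = sorted(range(len(collection)), key=lambda i: (-score(collection[i]), i))
    return [collection[i] for i in ranked[:k]]


def expand_document(doc: List[str], collection: List[List[str]], k: int = 5) -> List[str]:
    """Append high-idf terms from the top-k side documents until the length doubles."""
    idf = idf_table(collection)
    top = top_documents(doc, collection, idf, k)
    doc_count = Counter(term for d in top for term in set(d))
    # Exclude terms found in only one of the top documents.
    candidates = [t for t, c in doc_count.items() if c > 1]
    candidates.sort(key=lambda t: (-idf[t], t))  # highest idf first
    budget = len(doc)  # number of tokens that may still be added
    expanded = list(doc)
    for term in candidates:
        # One instance per top document that contains the term.
        for _ in range(doc_count[term]):
            if budget == 0:
                return expanded
            expanded.append(term)
            budget -= 1
    return expanded


if __name__ == "__main__":
    # Hypothetical tokenized side collection and document.
    side = [
        ["iraq", "aziz", "talks"],
        ["iraq", "aziz", "baghdad"],
        ["football", "match"],
        ["weather", "rain"],
    ]
    print(expand_document(["iraq", "talks"], side, k=2))
```

The same routine would serve both conditions: pre-translation expansion would pass the Mandarin transcript and the Mandarin side collection, and post-translation expansion the translated document and the English side collection.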

Results
Post-translation significantly outperforms pre-translation expansion
[Chart: expansion conditions None, Pre, Post, Pre+Post]

Discussion: Post-translation Effectiveness
Post-translation document expansion significantly improves retrieval effectiveness
  – Little improvement from pre-translation expansion
      Either alone or in conjunction
Expansion introduces key enriching terms
  – Named entities, alternate forms
      E.g. Tariq Aziz, Saddam, Yeltsin, etc.
  – Available in English (post-translation) collection

Discussion: Pre-translation Limitations
Expansion terms do not exist
  – Segmentation & transcription rely on term lists
      Named entities frequently absent
      Cannot extract terms from Mandarin newswire
Expansion terms cannot be translated
  – Key terms (e.g. named entities) absent from bilingual term lists
      All examples on previous slide absent

Discussion: Contrasts
Contradict prior query expansion results
  – Re: primacy of pre-translation expansion
Explanation:
  – Prior languages – mostly European
      Common writing system, white-space delimited
      Pre-translation expansion produces
        – translatable terms + (possibly) untranslatable cognates
        – Cognates still match, even without translation
  – Current experiment: English-Mandarin
      Untranslatable cognates useless
        – Different orthography
      Terms not identified - missegmentation

Conclusion
Document expansion improves effectiveness
  – For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports different terms
Post-translation expansion most effective
  – Translated terms provide context for retrieval
      Correct translations/transcriptions coherent; others noise
  – Enriching terms often absent from term lists
      Segmentation, transcription, translation all rely on lists
  – Expansion in indexing language bypasses barriers
      Crucial in languages with segmentation issues and different forms