Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words Gina-Anne Levow University of Chicago July 7, 2003.


1 Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words Gina-Anne Levow University of Chicago July 7, 2003

2 Roadmap
Goals of expansion
  –Expansion points in CL-SDR
Pre- and Post-translation document expansion experiments
  –Task, query & document processing
  –Expansion methodology
Results
Discussion & Conclusions

3 Why Expansion?
Recover terms that could have appeared
  –Compensate for difference in term choice
    Author concepts vs searcher information need
  –Compensate for noisy processing
    ASR transcription errors
      –Misrecognitions, deletions, missegmentations
    Translation errors
      –Gaps, missegmentations
      –Context disambiguates

4 Expansion Opportunities
Query:
  –(Ballesteros & Croft ’96; McNamee & Mayfield 2002)
  –Before, after translation; both
  –Different enhancements to precision/recall
  –Pre-translation key – something to translate
    European languages
Document:
  –Before, after translation; both
  –Developed for monolingual SDR (Singhal 1999)
  –CLIR (+SDR) (Levow & Oard 2000)
    Post-translation promising

5 Experimental Configuration: Basic Task
Variant of Topic Detection and Tracking (TDT)
  –English queries to Mandarin documents
    Query-by-example
      –English newswire or broadcast news stories
    Mandarin audio broadcast news documents
      –Automatically transcribed by Dragon ASR system
  –Modifications:
    Retrospective retrieval
    Evaluation metric: Mean Average Precision
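The evaluation metric named on this slide, Mean Average Precision, can be sketched as follows. This is a minimal illustration of the standard uninterpolated form; the slides do not say which variant or tool was used, and the function names and input layout are assumptions made for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one query.

    ranked_ids: document ids in the order the system returned them.
    relevant_ids: ids judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Each query-by-example topic would contribute one `(ranked_ids, relevant_ids)` pair, and the mean over topics gives a single score per system configuration.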

6 Experimental Configuration: Query and Document Processing
Query:
  –Select top 180 positively correlated terms in 4 exemplars
    Based on χ² test
    996 prior documents assumed not relevant
Document:
  –Dictionary-based word-for-word translation
    Segmentation: NMSU ch_seg
    Translation resource:
      –Merged bilingual term list: CETA & LDC term list
    Translation ranking:
      –Target language unigram frequency: single words, multi-word
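The two processing steps on this slide can be sketched roughly as below. This is an illustration under stated assumptions, not the system's actual code: the χ² statistic is computed on a 2x2 table of document presence, untranslatable tokens are simply dropped, and multi-word translations are scored by their rarest word. The slide only says ranking uses target-language unigram frequency, so that last rule is a guess. All function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """2x2 chi-square statistic.

    a: exemplars containing the term, b: exemplars without it,
    c: background documents containing it, d: background documents without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(exemplars, background, top_k=180):
    """Pick the top_k terms positively correlated with the exemplars.

    exemplars, background: lists of token lists.
    """
    ex_sets = [set(doc) for doc in exemplars]
    bg_sets = [set(doc) for doc in background]
    vocab = set().union(*ex_sets)
    scored = []
    for term in vocab:
        a = sum(term in s for s in ex_sets)
        b = len(ex_sets) - a
        c = sum(term in s for s in bg_sets)
        d = len(bg_sets) - c
        if a * d > b * c:  # keep only positively correlated terms
            scored.append((chi_square(a, b, c, d), term))
    # Highest statistic first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]


def translate_word_for_word(tokens, term_list, unigram_freq):
    """Replace each segmented token with its top-ranked translation.

    term_list: dict mapping a source token to a list of candidate translations.
    unigram_freq: dict of target-language word frequencies.
    """
    def score(candidate):
        words = candidate.split()
        if not words:
            return -1
        # Assumption: a multi-word candidate is as frequent as its rarest word.
        return min(unigram_freq.get(w, 0) for w in words)

    output = []
    for token in tokens:
        candidates = term_list.get(token)
        if not candidates:
            continue  # assumption: tokens missing from the term list are dropped
        output.extend(max(candidates, key=score).split())
    return output
```

In the configuration described, `exemplars` would hold the 4 example stories and `background` the 996 prior documents assumed not relevant. A token absent from the term list produces no output at all, which is the translation gap the later slides discuss.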

7 Experimental Configuration: Document Expansion

8 Document Expansion: Details
Side collections:
  –Mandarin: TDT-2 Xinhua, Zaobao newswire
  –English: TDT-2 New York Times, AP news
Expansion term selection
  –Top 5 documents
  –Sort candidate terms by idf
  –Exclude terms in only one document
  –Add one term instance per document
  –Add until document doubled in length
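The term selection steps listed above might look like the following sketch. Several details are assumptions, flagged in the comments: the retrieval step is a simple tf-idf overlap score standing in for whatever retrieval engine was actually used, "only one document" is read as only one of the top-ranked documents, and "one term instance per document" is read as one added copy for each top document containing the term. The function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(collection):
    """Inverse document frequency for every term in the side collection."""
    n = len(collection)
    df = Counter(term for doc in collection for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def expand_document(doc, collection, top_docs=5):
    """Return doc plus expansion terms drawn from a side collection.

    doc: list of tokens (the transcribed or translated document).
    collection: list of token lists (the side collection).
    """
    idf = idf_table(collection)
    doc_tf = Counter(doc)

    def score(other):
        # Assumption: tf-idf overlap as a stand-in ranking function.
        other_terms = set(other)
        return sum(tf * idf.get(t, 0.0) for t, tf in doc_tf.items() if t in other_terms)

    # Use the document itself as the query and keep the top-ranked documents.
    ranked = sorted(collection, key=score, reverse=True)[:top_docs]

    # Number of top documents that contain each candidate term.
    support = Counter(term for d in ranked for term in set(d))

    # Exclude terms found in only one top document, then sort by idf.
    candidates = [t for t, k in support.items() if k >= 2]
    candidates.sort(key=lambda t: (-idf[t], t))

    expanded = list(doc)
    limit = 2 * len(doc)  # stop once the document has doubled in length
    for term in candidates:
        for _ in range(support[term]):  # one instance per supporting document
            if len(expanded) >= limit:
                return expanded
            expanded.append(term)
    return expanded
```

For pre-translation expansion, `doc` would be the segmented Mandarin transcript and `collection` the Mandarin newswire; for post-translation expansion, the translated document and the English newswire.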

9 Results
Post-translation significantly outperforms pre-translation expansion

Expansion:               None   Pre    Post   Pre+Post
Mean Average Precision:  0.39   0.46   0.59   0.61
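To put the four scores in proportion, the short calculation below derives each condition's relative gain over the unexpanded baseline. It uses only the values on the slide and assumes they pair with the labels in the order shown. The relative gains come to roughly 18% for Pre, 51% for Post, and 56% for Pre+Post. These are arithmetic on the reported means and say nothing about statistical significance.

```python
# Mean Average Precision per expansion condition, as reported on the slide.
map_scores = {"None": 0.39, "Pre": 0.46, "Post": 0.59, "Pre+Post": 0.61}

baseline = map_scores["None"]

# Relative improvement of each condition over no expansion.
relative_gain = {
    name: (score - baseline) / baseline for name, score in map_scores.items()
}

for name, gain in relative_gain.items():
    print(f"{name:9s} MAP={map_scores[name]:.2f} gain={gain:+.0%}")
```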

10 Discussion: Post-translation Effectiveness
Post-translation document expansion significantly improves retrieval effectiveness
  –Little improvement from pre-translation expansion
    Either alone or in conjunction
Expansion introduces key enriching terms
  –Named entities, alternate forms
    E.g. Tariq Aziz, Saddam, Yeltsin, etc.
  –Available in English (post-translation) collection

11 Discussion: Pre-translation Limitations
Expansion terms do not exist
  –Segmentation & transcription rely on term lists
    Named entities frequently absent
    Cannot extract terms from Mandarin newswire
Expansion terms cannot be translated
  –Key terms (e.g. named entities) absent from bilingual term lists
    All examples on previous slide absent

12 Discussion: Contrasts
Contradict prior query expansion results
  –Re: Primacy of pre-translation expansion
Explanation:
  –Prior languages – mostly European
    Common writing system, white-space delimited
    Pre-translation expansion produces
      –Translatable terms + (possibly) untranslatable cognates
      –Cognates still match, even without translation
  –Current experiment: English-Mandarin
    Untranslatable cognates useless
      –Different orthography
    Terms not identified - missegmentation
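The contrast drawn on this slide can be illustrated with a toy example. The sketch assumes that a term with no term-list entry passes through translation unchanged; the term lists, documents, and function names are invented for illustration and are not taken from the experiments.

```python
def translate_keep_untranslated(terms, term_list):
    """Translate via the term list; pass unknown terms through unchanged."""
    output = []
    for term in terms:
        output.extend(term_list.get(term, [term]))
    return output


def matching_terms(query_terms, doc_terms):
    """Terms shared by query and document after translation."""
    return set(query_terms) & set(doc_terms)


# Shared writing system (toy English-to-French query translation):
# "yeltsin" has no term-list entry but still matches the French document.
en_fr = {"election": ["élection"]}
french_doc = ["yeltsin", "élection", "moscou"]
french_query = translate_keep_untranslated(["yeltsin", "election"], en_fr)
european_matches = matching_terms(french_query, french_doc)

# Different orthography (toy Mandarin-to-English document translation):
# the untranslated name stays in Chinese characters and cannot match.
zh_en = {"选举": ["election"]}
translated_doc = translate_keep_untranslated(["叶利钦", "选举"], zh_en)
mandarin_matches = matching_terms(["yeltsin", "election"], translated_doc)
```

In the first case both terms match; in the second only `election` does, because the untranslated name remains in a different script. That lost match is what expansion in the indexing language recovers.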

13 Conclusion
Document expansion improves effectiveness
  –For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports different terms
Post-translation expansion most effective
  –Translated terms provide context for retrieval
    Correct translations/transcriptions coherent; others noise
  –Enriching terms often absent from term lists
    Segmentation, transcription, translation all rely on lists
  –Expansion in indexing language bypasses barriers
    Crucial in languages with segmentation issues and different forms

