Bilingual Lexical Acquisition From Comparable Corpora Andrea Mulloni.

1 Bilingual Lexical Acquisition From Comparable Corpora Andrea Mulloni

2 The Scenario
Lexicons are the bread and butter of many NLP areas
So far, techniques to automatically acquire bilingual lexical data have mainly used parallel corpora
Availability of parallel corpora is limited
Pure statistical approaches applied to comparable corpora have failed to produce consistent results
Results must hold for a wide range of text types
Need for a robust and extensible method

The Motivation
Investigate further exploitations of comparable corpora for lexical acquisition purposes by:
  Employing a hybrid approach (statistical + rule-based)
  Drawing upon recent developments in the area of monolingual lexical acquisition (WSD, named entity recognition, term extraction)
New methodology for bilingual lexicon acquisition from comparable corpora (BLACC)

3 The Present – Methodology (1)
[Diagram: processing pipeline]
Preprocessing for each language (L1, L2): TOKENIZATION, POS TAGGING, TERM EXTRACTION, LEMMATIZATION
RULE-BASED METHODS: COGNATE MATCHING, ????
STATISTICAL METHODS: COOCCURRENCE SIMILARITY, CONTEXT HETEROGENEITY, OTHER
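The slide names cognate matching as a rule-based method but does not say how it is computed. As a minimal sketch, one common way to approximate it is a string-similarity score between a source term and each target term. Everything below is an assumption for illustration: the function names `cognate_score` and `cognate_candidates`, the use of Python's `difflib.SequenceMatcher`, and the example terms are not from the presentation.

```python
from difflib import SequenceMatcher


def cognate_score(source_term: str, target_term: str) -> float:
    """Return a similarity in [0, 1] based on shared character runs.

    Hypothetical stand-in for the slide's COGNATE MATCHING box.
    """
    return SequenceMatcher(None, source_term.lower(), target_term.lower()).ratio()


def cognate_candidates(source_term, target_terms, top_n=5, threshold=0.0):
    """Rank target-language terms by orthographic similarity to source_term.

    top_n defaults to 5 to mirror the "top 5 translation candidates"
    mentioned in the slides; threshold is an assumed optional cut-off.
    """
    scored = [(term, cognate_score(source_term, term)) for term in target_terms]
    scored = [(term, score) for term, score in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Illustrative Italian -> English example (made-up vocabulary).
    for term, score in cognate_candidates("informazione", ["information", "house", "inform"]):
        print(f"{term}\t{score:.2f}")
```

A purely orthographic score like this only helps for related language pairs, which is consistent with the concern about distant languages raised later in the deck.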

4 The Present – Methodology (2)
[Diagram: candidate reranking]
STATISTICAL METHODS: TOP 5 TRANSLATION CANDIDATES
RULE-BASED METHODS: TOP 5 TRANSLATION CANDIDATES
COMPARISON = RERANKING (weights stats/rule-based to be defined): TOP 5 TRANSLATION CANDIDATES
LEXICON L1 – L2
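The reranking step is only outlined on the slide, and the weights are explicitly left to be defined. One simple way to picture it, purely as an assumption, is a linear interpolation of the two score lists. The function name `rerank`, the `stat_weight` parameter, the default of 0.5, and the sample scores are all hypothetical.

```python
def rerank(stat_scores, rule_scores, stat_weight=0.5, top_n=5):
    """Merge two candidate lists into one ranked list.

    stat_scores / rule_scores: dicts mapping candidate translation -> score,
    both assumed to be on a comparable [0, 1] scale.
    stat_weight: assumed interpolation weight; the slide leaves it undefined.
    A candidate missing from one list contributes 0.0 for that method.
    """
    rule_weight = 1.0 - stat_weight
    candidates = set(stat_scores) | set(rule_scores)
    combined = {
        cand: stat_weight * stat_scores.get(cand, 0.0)
        + rule_weight * rule_scores.get(cand, 0.0)
        for cand in candidates
    }
    # Sort by descending combined score, breaking ties alphabetically.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    statistical = {"house": 0.9, "home": 0.6}   # made-up scores
    rule_based = {"home": 0.8, "hose": 0.7}     # made-up scores
    for candidate, score in rerank(statistical, rule_based):
        print(f"{candidate}\t{score:.2f}")
```

Under this sketch a candidate proposed by both methods tends to outrank one proposed by only one of them, which is one plausible reading of the comparison box in the diagram.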

5 The Past – Previous Work
Lexical Acquisition From Parallel Corpora
  Statistical
  Co-Occurrence Frequencies + Length or Positional Statistics: Dagan et al. (1993), Kupiec (1993), Smadja & McKeown (1994), Kumano & Hirakawa (1994), Wu & Xia (1994)
  LA for Machine Translation: Sato & Nagao (1990), Brown et al. (1993), Melamed (1997)
  Concordancing: Gale & Church (1991), Catizone et al. (1993)
  Tools for Translators: Melamed (1996)
Lexical Acquisition From Comparable Corpora
  Statistical
  Use of Multilingual Thesauri: Dejean et al. (2002)
  Co-occurrence Assumption: Fung & Church (1994), Rapp (1995), Rapp (1997), Fung & McKeown (1997), Fung & Yee (1998)
  Positional Difference Vector: Fung & McKeown (1994)
  Context Heterogeneity: Fung (1995)
  Rule-Based
  Cognates: Bourigault (1992), Ananiadou (1994), Jacquemin & Royaute (1994), Dagan & Church (1995), Oueslati et al. (1996), Koehn & Knight (2002)
  Context and Semantic Information: Lauriston (1996), Dubuc & Lauriston (1997)

6 The Future – Way Ahead
Consider modeling procedures on the basis of a parallel corpus
Implement the possibility of exploiting already available tagged corpora
Analyse the possible application of clustering procedures to reduce polysemy
Investigate the etymological issue (closest common root) to bridge the gap between distant languages

