
1 Flexible and Efficient Toolbox for Information Retrieval
MIRACLE group
José Miguel Goñi-Menoyo (UPM)
José Carlos González-Cristóbal (UPM-Daedalus)
Julio Villena-Román (UC3M-Daedalus)

2 Our approach
- New Year’s resolution: work with all languages in CLEF
  - adhoc, image, web, geo, iclef, qa…
- Wish list:
  - Language-dependent stuff
  - Language-independent stuff
  - Versatile combination
  - Fast
  - Simple for non-computer scientists
- Not to reinvent the wheel again every year!
- Approach: toolbox for information retrieval

3 Agenda
- Toolbox
- 2005 Experiments
- 2005 Results
- 2006 Homework

4 Toolbox Basics
- Toolbox made of small one-function tools
- Processing as a pipeline (borrowed from Unix):
  - Each tool combination leads to a different run approach
- Shallow I/O interfaces:
  - tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…),
  - with different design approaches, and
  - from different sources (own development, downloading, …)
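The pipeline idea can be sketched in a few lines of Python. This is only an illustration of composing one-function tools, not the MIRACLE code: the tool names (`tokenize`, `lowercase`, `remove_stopwords`), the `pipeline` helper, and the tiny stop-word list are all assumptions made for the example.

```python
from typing import Callable, List

# A "tool" here is any function from a token list to a token list (assumption).
Tool = Callable[[List[str]], List[str]]


def tokenize(lines: List[str]) -> List[str]:
    """Split each input line on whitespace (a deliberately crude tokenizer)."""
    return [token for line in lines for token in line.split()]


def lowercase(tokens: List[str]) -> List[str]:
    """Transforming tool: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Filtering tool with a hypothetical three-word stop list."""
    stopwords = {"the", "of", "and"}
    return [token for token in tokens if token not in stopwords]


def pipeline(*tools: Tool) -> Tool:
    """Chain tools left to right, like `a | b | c` in a Unix shell."""
    def run(data: List[str]) -> List[str]:
        for tool in tools:
            data = tool(data)
        return data
    return run


# Two different tool combinations give two different "run approaches".
with_filter = pipeline(tokenize, lowercase, remove_stopwords)
without_filter = pipeline(tokenize, lowercase)

print(with_filter(["The history of the Monk"]))
print(without_filter(["The history of the Monk"]))
```

Because every tool shares the same shallow interface, swapping or reordering tools only changes the argument list of `pipeline`.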

5 MIRACLE Tools
- Tokenizer:
  - pattern matching
  - isolates punctuation
  - splits sentences, paragraphs, passages
  - identifies some entities
    - compounds, numbers, initials, abbreviations, dates
  - extracts indexing terms
  - own development (written in Perl) or “outsourced”
- Proper noun extraction
  - Naive algorithm: uppercase words unless stop-word, stop-clef or verb/adverb
- Stemming: generally “outsourced”
- Transforming tools: lowercase, accents and diacritical characters are normalized, transliteration
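A minimal sketch of two of these tools follows, assuming Python rather than the Perl mentioned on the slide. `normalize` lowercases a token and strips diacritics via Unicode decomposition, and `extract_proper_nouns` applies the naive rule from the slide. The function names and the contents of the exclusion list are hypothetical; the real stop-word, stop-clef and verb/adverb lists are not given in the presentation.

```python
import unicodedata
from typing import Iterable, List, Set


def normalize(token: str) -> str:
    """Lowercase a token and drop combining marks (accents, diacritics)."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_proper_nouns(tokens: Iterable[str], excluded: Set[str]) -> List[str]:
    """Naive rule: keep capitalized tokens unless their lowercase form is in
    `excluded` (stop-words, stop-clefs, verbs/adverbs merged into one set)."""
    return [
        token for token in tokens
        if token[:1].isupper() and token.lower() not in excluded
    ]


# Hypothetical exclusion list, for illustration only.
excluded_words = {"find", "documents", "relevant", "the"}
topic = "Find documents about elections in Bulgaria and Hungary".split()

print(extract_proper_nouns(topic, excluded_words))
print(normalize("Neuchâtel"))
```

The rule is cheap but over-generates on sentence-initial words, which is why the exclusion lists matter.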

6 More MIRACLE Tools
- Filtering tools:
  - stop-words and stop-clefs
  - phrase pattern filter (for topics)
- Automatic translation issues: “outsourced” to available on-line resources or desktop applications
  - Bultra (En → Bu), Webtrance (En → Bu), AutTrans (Es → Fr, Es → Pt), MoBiCAT (En → Hu)
  - Systran, BabelFish Altavista, Babylon, FreeTranslation, Google Language Tools, InterTrans, WordLingo, Reverso
- Semantic expansion
  - EuroWordNet
  - own resources for Spanish
- The philosopher's stone: indexing and retrieval system

7 Indexing and Retrieval System
- Implements boolean, vector space and probabilistic BM25 retrieval models
  - Only BM25 was used in CLEF 2005
  - Only the OR operator was used for terms
- Native support for UTF-8 (and other) encodings
  - No transliteration scheme is needed
  - Good results for Bulgarian
- More efficient than previous engines
  - Indexing time improved by several orders of magnitude
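For readers unfamiliar with BM25 under OR semantics, the sketch below scores every document that contains at least one query term. It uses a common textbook form of BM25; the slide gives neither the exact formula variant nor the parameter values, so `k1 = 1.2`, `b = 0.75`, the smoothed IDF, and the function name `bm25_rank` are assumptions, not the MIRACLE engine's settings.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_rank(
    query: List[str],
    docs: Dict[str, List[str]],
    k1: float = 1.2,   # assumed default, not from the slides
    b: float = 0.75,   # assumed default, not from the slides
) -> List[Tuple[str, float]]:
    """Rank documents matching ANY query term (OR operator) by BM25 score."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(query):
            tf = term_freq[term]
            if tf == 0:
                continue  # OR semantics: a missing term contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


collection = {
    "d1": ["monk", "money"],
    "d2": ["coat"],
    "d3": ["monk", "monk", "calm"],
}
for doc_id, score in bm25_rank(["monk", "coat"], collection):
    print(doc_id, round(score, 3))
```

With an AND operator, only documents containing every query term would be kept; OR keeps all partial matches and lets the score decide the order.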

8 Trie-based index
Example terms: calm, cast, coating, coat, money, monk, month

9 1st coarse implementation: linked arrays
Example terms: calm, cast, coating, coat, money, monk, month
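One plausible reading of "linked arrays" is a trie in which every node is a fixed-size array with one cell per alphabet letter, each cell holding the index of a child array. The sketch below follows that reading; the class name `ArrayTrie`, the 26-letter lowercase alphabet, and the `-1` empty marker are assumptions, since the slide shows only a diagram.

```python
from typing import List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for the example
EMPTY = -1                               # marker for an unused cell


class ArrayTrie:
    """Trie where each node is a full array of child links."""

    def __init__(self) -> None:
        self.nodes: List[List[int]] = [[EMPTY] * len(ALPHABET)]  # node 0 = root
        self.terminal: List[bool] = [False]

    def insert(self, word: str) -> None:
        node = 0
        for ch in word:
            cell = ALPHABET.index(ch)
            if self.nodes[node][cell] == EMPTY:
                # Allocate a new array and link it from the current cell.
                self.nodes.append([EMPTY] * len(ALPHABET))
                self.terminal.append(False)
                self.nodes[node][cell] = len(self.nodes) - 1
            node = self.nodes[node][cell]
        self.terminal[node] = True

    def contains(self, word: str) -> bool:
        node = 0
        for ch in word:
            cell = ALPHABET.find(ch)
            if cell < 0 or self.nodes[node][cell] == EMPTY:
                return False
            node = self.nodes[node][cell]
        return self.terminal[node]

    def empty_cells(self) -> int:
        """Count allocated cells that hold no link."""
        return sum(row.count(EMPTY) for row in self.nodes)


trie = ArrayTrie()
for term in ["calm", "cast", "coating", "coat", "money", "monk", "month"]:
    trie.insert(term)

print(trie.contains("coat"), trie.contains("coa"))
print(len(trie.nodes), "arrays,", trie.empty_cells(), "empty cells")
```

Lookups cost one array access per character, but almost every allocated cell stays empty, which is the waste a more compact layout tries to remove.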

10 Efficient tries: avoiding empty cells
Example terms: abacus, abet, ace, baby, be, beach, bee
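The slide does not say which compaction technique the engine uses, so the following is only one simple way to avoid empty cells: each node stores just the letters it actually uses, kept sorted so that a child can be found by binary search. The class name `CompactTrie` and its layout are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import List


class CompactTrie:
    """Trie whose nodes keep only the cells that are in use."""

    def __init__(self) -> None:
        self.labels: List[List[str]] = [[]]    # sorted letters per node
        self.children: List[List[int]] = [[]]  # child index per letter
        self.terminal: List[bool] = [False]

    def insert(self, word: str) -> None:
        node = 0
        for ch in word:
            labels = self.labels[node]
            pos = bisect_left(labels, ch)
            if pos == len(labels) or labels[pos] != ch:
                # Create the child and splice it in at its sorted position.
                self.labels.append([])
                self.children.append([])
                self.terminal.append(False)
                labels.insert(pos, ch)
                self.children[node].insert(pos, len(self.labels) - 1)
            node = self.children[node][pos]
        self.terminal[node] = True

    def contains(self, word: str) -> bool:
        node = 0
        for ch in word:
            labels = self.labels[node]
            pos = bisect_left(labels, ch)
            if pos == len(labels) or labels[pos] != ch:
                return False
            node = self.children[node][pos]
        return self.terminal[node]

    def cells(self) -> int:
        """Total number of stored cells; none of them is empty."""
        return sum(len(labels) for labels in self.labels)


compact = CompactTrie()
for term in ["abacus", "abet", "ace", "baby", "be", "beach", "bee"]:
    compact.insert(term)

print(compact.contains("be"), compact.contains("bea"))
print(compact.cells(), "cells in use")
```

The trade-off is a binary search per character instead of a direct array access, in exchange for storing one cell per edge rather than one per alphabet letter.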

11 Basic Experiments
- S: Standard sequence (tokenization, filtering, stemming, transformation)
- N: No stemming
- R: Use of narrative field in topics
- T: Ignore narrative field
- r1: Pseudo-relevance feedback (with 1st retrieved document)
- P: Proper noun extraction (in topics)
- Run identifiers: SR, ST, r1SR, NR, NT, NP

12 Paragraph indexing
- H: Paragraph indexing
  - docpars (document paragraphs) are indexed instead of docs
  - term → doc1#1, doc69#5 …
  - combination of docpars relevance:

    $$\mathrm{rel}_N = \mathrm{rel}_{mN} + \frac{\alpha}{n} \sum_{j \neq m} \mathrm{rel}_{jN}$$

    - n = number of paragraphs retrieved for doc N
    - rel_jN = relevance of paragraph j of doc N
    - m = paragraph with maximum relevance
    - α = 0.75 (experimental)
- Run identifiers: HR, HT
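The combination formula can be checked with a short sketch. It assumes that retrieved paragraphs are keyed as `doc#paragraph` (following the `doc1#1` notation on the slide) and that a plain dictionary holds their scores; the function name `combine_paragraph_scores` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def combine_paragraph_scores(
    paragraph_scores: Dict[str, float], alpha: float = 0.75
) -> Dict[str, float]:
    """Fold paragraph relevances into one relevance per document:
    best paragraph + alpha / n * (sum of the other retrieved paragraphs)."""
    by_doc: Dict[str, List[float]] = defaultdict(list)
    for docpar_id, score in paragraph_scores.items():
        doc_id, _, _ = docpar_id.rpartition("#")  # "doc1#3" -> "doc1"
        by_doc[doc_id].append(score)

    doc_scores: Dict[str, float] = {}
    for doc_id, scores in by_doc.items():
        best = max(scores)          # rel_mN
        others = sum(scores) - best  # sum over j != m of rel_jN
        doc_scores[doc_id] = best + alpha / len(scores) * others
    return doc_scores


retrieved = {"doc1#1": 4.0, "doc1#3": 2.0, "doc69#5": 3.0}
print(combine_paragraph_scores(retrieved))
```

For `doc1`, n = 2, so the result is 4.0 + 0.75 / 2 × 2.0 = 4.75; `doc69` has a single retrieved paragraph and keeps its score of 3.0.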

13 Combined experiments
- “Democratic system”: documents with a good score in many experiments are likely to be relevant
- a: Average:
  - Merging of several experiments, adding relevance
- x: WDX - asymmetric combination of two experiments:
  - First (more relevant) D documents from run A, non-weighted
  - Rest of documents from run A, with W weight
  - All documents from run B, with X weight
  - Relevance re-sorting
  - Mostly used for combining base runs with proper noun runs
- Run identifiers: aHRSR, aHTST, xNP01HR1, xNP01r1SR1
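Both combinations can be sketched as follows. The slide does not state how a document that appears in both runs is scored, so this example assumes the weighted relevances are simply added; the function names `average_runs` and `wdx` and the dictionary-based run format are also assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Run = Dict[str, float]  # document id -> relevance (assumed run format)


def _ranked(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Relevance re-sorting: highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def average_runs(runs: Iterable[Run]) -> List[Tuple[str, float]]:
    """'a' combination: add up each document's relevance across runs.
    Dividing by the number of runs would not change the ranking."""
    combined: Dict[str, float] = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            combined[doc_id] += score
    return _ranked(combined)


def wdx(run_a: Run, run_b: Run, w: float, d: int, x: float) -> List[Tuple[str, float]]:
    """'x' combination: top-d docs of run A unweighted, the rest of run A
    weighted by w, every doc of run B weighted by x, then re-sorted."""
    combined: Dict[str, float] = defaultdict(float)
    for position, (doc_id, score) in enumerate(_ranked(run_a)):
        combined[doc_id] += score if position < d else w * score
    for doc_id, score in run_b.items():
        combined[doc_id] += x * score  # assumption: overlapping docs are summed
    return _ranked(combined)


base_run = {"a": 10.0, "b": 8.0, "c": 6.0}
proper_noun_run = {"c": 4.0, "d": 2.0}

print(average_runs([base_run, proper_noun_run]))
print(wdx(base_run, proper_noun_run, w=0.5, d=1, x=1.0))
```

The asymmetry lets the trusted head of run A pass through untouched while run B mainly reorders the tail.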

14 Multilingual merging
- Standard approaches for merging:
  - No normalization and relevance re-sorting
  - Standard normalization and relevance re-sorting
  - Min-max normalization and relevance re-sorting
- MIRACLE approach for merging:
  - The number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N = 1, 10, 50, 125, 250, 1000). Then one of the standard approaches is used
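A rough sketch of the proportional selection follows, together with min-max normalization as one of the standard approaches. The slide does not define "standard normalization" or how quotas are rounded, so the rounding with `round`, the `total` parameter, and the function names `min_max` and `miracle_merge` are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Run = Dict[str, float]  # document id -> relevance (assumed run format)


def min_max(run: Run) -> Run:
    """Rescale relevances to [0, 1]; a constant run maps to 1.0."""
    low, high = min(run.values()), max(run.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in run}
    return {doc_id: (score - low) / (high - low) for doc_id, score in run.items()}


def miracle_merge(
    runs: Dict[str, Run],
    n: int,
    total: int,
    normalize: Optional[Callable[[Run], Run]] = None,
) -> List[Tuple[str, float]]:
    """Give each language a quota proportional to the mean relevance of its
    first n docs, then merge the selected docs with a standard approach."""
    ranked = {
        lang: sorted(run.items(), key=lambda item: item[1], reverse=True)
        for lang, run in runs.items() if run
    }
    top_mean = {
        lang: sum(score for _, score in docs[:n]) / len(docs[:n])
        for lang, docs in ranked.items()
    }
    mass = sum(top_mean.values())
    if mass == 0:
        return []

    merged: List[Tuple[str, float]] = []
    for lang, docs in ranked.items():
        quota = round(total * top_mean[lang] / mass)  # rounding is an assumption
        selected = dict(docs[:quota])
        if normalize is not None and selected:
            selected = normalize(selected)
        merged.extend(selected.items())
    return sorted(merged, key=lambda item: item[1], reverse=True)


per_language = {
    "en": {"e1": 9.0, "e2": 7.0, "e3": 1.0},
    "fr": {"f1": 4.0, "f2": 4.0, "f3": 3.0},
}
print(miracle_merge(per_language, n=2, total=4))
print(miracle_merge(per_language, n=2, total=4, normalize=min_max))
```

Passing `normalize=None` corresponds to the "no normalization" approach; any other per-run rescaling function can be plugged in the same way.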

15 Results
We performed… countless experiments! (just for the adhoc task)

16 Monolingual Bulgarian
Stemmer (UTF-8): Neuchâtel
Rank: 4th

17 Bilingual English → Bulgarian (83% monolingual)
En → Bu: Bultra, Webtrance
Rank: 1st

18 Monolingual Hungarian
Stemmer: Neuchâtel
Rank: 3rd

19 Bilingual English → Hungarian (87% monolingual)
En → Hu: MoBiCAT
Rank: 1st

20 Monolingual French
Stemmer: Snowball
Rank: >5th

21 Bilingual English → French (79% monolingual)
En → Fr: Systran
Rank: 5th

22 Bilingual Spanish → French (81% monolingual)
Es → Fr: ATrans, Systran
(Rank: 5th)

23 Monolingual Portuguese
Stemmer: Snowball
Rank: >5th (4th)

24 Bilingual English → Portuguese (55% monolingual)
En → Pt: Systran
Rank: 3rd

25 Bilingual Spanish → Portuguese (88% monolingual)
Es → Pt: ATrans
(Rank: 2nd)

26 Multilingual-8 (En, Es, Fr)
Rank: 2nd [Fr, En], 3rd [Es]

27 Conclusions and homework
- Toolbox = “imagination is the limit”
- Focus on interesting linguistic things instead of boring text manipulation
- Reusability (half of the work is done for next year!)
- Keys for good results:
  - Fast IR engine is essential
  - Native character encoding support
  - Topic narrative
  - Good translation engines make the difference
- Homework:
  - further development on system modules, fine tuning
  - Spanish, French, Portuguese…

