
1 Flexible and Efficient Toolbox for Information Retrieval
MIRACLE group
José Miguel Goñi-Menoyo (UPM), José Carlos González-Cristóbal (UPM-Daedalus), Julio Villena-Román (UC3M-Daedalus)

2 Our approach
- New Year's resolution: work with all languages in CLEF (adhoc, image, web, geo, iclef, qa…)
- Wish list:
  - language-dependent components
  - language-independent components
  - versatile combination
  - fast
  - simple for non computer scientists
- Not to reinvent the wheel again every year!
- Approach: a toolbox for information retrieval

3 Agenda
- Toolbox
- 2005 experiments
- 2005 results
- 2006 homework

4 Toolbox basics
- Toolbox made of small, one-function tools
- Processing as a pipeline (borrowed from Unix): each tool combination leads to a different run approach
- Shallow I/O interfaces:
  - tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…),
  - with different design approaches, and
  - from different sources (own development, downloads, …)
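The pipeline idea can be sketched as follows. This is an illustrative composition of toy stages, not the actual MIRACLE tools (which are separate programs connected through their I/O interfaces); the stage functions and word lists are hypothetical stand-ins.

```python
from functools import reduce

def tokenize(text):
    # Minimal stand-in for the real tokenizer (which also isolates
    # punctuation, splits passages, and detects entities).
    return text.lower().split()

def filter_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    # Stand-in for the stop-word filtering tool.
    return [t for t in tokens if t not in stopwords]

def stem(tokens):
    # Toy stemmer that strips a trailing "s"; a real run would plug in
    # Snowball or another "outsourced" stemmer at this stage.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def pipeline(*stages):
    """Compose stages left to right, Unix-pipe style."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Each different composition of stages is a different run approach.
run = pipeline(tokenize, filter_stopwords, stem)
print(run("The coats of the monks"))  # ['coat', 'monk']
```

Swapping, dropping, or reordering stages yields a new experiment without touching any individual tool.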

5 MIRACLE tools
- Tokenizer:
  - pattern matching
  - isolates punctuation
  - splits sentences, paragraphs, passages
  - identifies some entities: compounds, numbers, initials, abbreviations, dates
  - extracts indexing terms
  - own development (written in Perl) or "outsourced"
- Proper noun extraction:
  - naive algorithm: uppercase words, unless stop-word, stop-clef, or verb/adverb
- Stemming: generally "outsourced"
- Transforming tools: lowercasing, normalization of accents and diacritical characters, transliteration
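The naive proper-noun heuristic can be sketched as below: keep capitalized words unless they are stop-words, "stop-clefs" (CLEF topic boilerplate such as "Find" or "Documents"), or known verbs/adverbs. The word lists here are illustrative, not the ones MIRACLE actually used.

```python
# Hypothetical word lists; the real runs used full stop-word/stop-clef
# lists and a verb/adverb lexicon per language.
STOPWORDS = {"The", "A", "In"}
STOPCLEFS = {"Find", "Documents", "Relevant"}
VERBS_ADVERBS = {"Describe", "Report"}

def proper_nouns(tokens):
    # Uppercase-initial words survive unless they appear in any reject list.
    reject = STOPWORDS | STOPCLEFS | VERBS_ADVERBS
    return [t for t in tokens if t[:1].isupper() and t not in reject]

print(proper_nouns("Find Documents about Miguel de Cervantes".split()))
# ['Miguel', 'Cervantes']
```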

6 More MIRACLE tools
- Filtering tools:
  - stop-words and stop-clefs
  - phrase pattern filter (for topics)
- Automatic translation: "outsourced" to available on-line resources or desktop applications:
  - Bultra (En→Bu), Webtrance (En→Bu), AutTrans (Es→Fr, Es→Pt), MoBiCAT (En→Hu)
  - Systran, BabelFish Altavista, Babylon, FreeTranslation, Google Language Tools, InterTrans, WordLingo, Reverso
- Semantic expansion:
  - EuroWordNet
  - own resources for Spanish
- The philosopher's stone: the indexing and retrieval system

7 Indexing and retrieval system
- Implements Boolean, vector-space, and probabilistic BM25 retrieval models
  - Only BM25 was used in CLEF 2005
  - Only the OR operator was used for terms
- Native support for UTF-8 (and other) encodings
  - No transliteration scheme is needed
  - Good results for Bulgarian
- More efficient than the previous engines: several orders of magnitude faster indexing
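For reference, a textbook Okapi BM25 scorer with OR semantics (a document scores on any query term it contains) looks like this. The parameter values are the common defaults, not necessarily those used by the MIRACLE engine, and the toy corpus is made up.

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Score one document (a token list) against a query over corpus `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)  # document frequency
        if df == 0:
            continue  # OR semantics: missing terms simply contribute nothing
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        # Term-frequency saturation with document-length normalization.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["monk", "money"], ["coat", "coating", "monk"], ["calm", "cast"]]
ranked = sorted(docs, key=lambda d: bm25_score(["monk", "coat"], d, docs),
                reverse=True)
```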

8 Trie-based index
(Figure: trie over the example vocabulary calm, cast, coating, coat, money, monk, month)

9 First-cut implementation: linked arrays
(Figure: same example vocabulary calm, cast, coating, coat, money, monk, month)

10 Efficient tries: avoiding empty cells
(Figure: example vocabulary abacus, abet, ace, baby, be, beach, bee)
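A minimal trie sketch illustrating the index structure on these slides: each node maps a character to a child, and a terminal flag marks word ends. The real engine stores postings at the terminals and packs the child arrays to avoid empty cells; this sketch uses plain dicts instead of packed arrays.

```python
class Trie:
    def __init__(self):
        self.children = {}   # char -> child node
        self.terminal = False  # True if a word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.terminal

trie = Trie()
for w in ["calm", "cast", "coating", "coat", "money", "monk", "month"]:
    trie.insert(w)
print(trie.contains("coat"), trie.contains("coa"))  # True False
```

Note that shared prefixes ("coat"/"coating", "monk"/"month") are stored once, which is what makes the structure compact and lookups linear in the key length.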

11 Basic experiments
- S: standard sequence (tokenization, filtering, stemming, transformation)
- N: no stemming
- R: use of the narrative field in topics
- T: ignore the narrative field
- r1: pseudo-relevance feedback (with the 1st retrieved document)
- P: proper noun extraction (in topics)
- Runs: SR, ST, r1SR, NR, NT, NP

12 Paragraph indexing
- H: paragraph indexing
  - docpars (document paragraphs) are indexed instead of docs: term → doc1#1, doc69#5, …
  - combination of docpar relevances:
    rel_N = rel_mN + (α / n) · Σ_{j≠m} rel_jN
    where n = number of paragraphs retrieved for doc N, rel_jN = relevance of paragraph j of doc N, m = the paragraph with maximum relevance, and α = 0.75 (set experimentally)
- Runs: HR, HT
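The paragraph-relevance combination above is simple to implement: the document score is its best paragraph's score plus α/n times the sum of the remaining paragraph scores.

```python
def doc_relevance(paragraph_scores, alpha=0.75):
    """Combine the retrieved paragraph scores of one document:
    rel_N = rel_mN + (alpha / n) * sum of the non-maximum scores."""
    n = len(paragraph_scores)
    best = max(paragraph_scores)
    rest = sum(paragraph_scores) - best
    return best + alpha / n * rest

print(doc_relevance([0.9, 0.4, 0.2]))  # 0.9 + 0.75/3 * 0.6 ≈ 1.05
```

A document whose relevance is concentrated in one strong paragraph is thus rewarded over one with the same total spread thinly.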

13 Combined experiments
- "Democratic system": documents with a good score in many experiments are likely to be relevant
- a: average
  - merging of several experiments, adding relevances
- x: WDX, an asymmetric combination of two experiments:
  - the first (most relevant) D documents from run A, non-weighted
  - the rest of the documents from run A, with weight W
  - all documents from run B, with weight X
  - relevance re-sorting
  - mostly used for combining base runs with proper-noun runs
- Runs: aHRSR, aHTST, xNP01HR1, xNP01r1SR1
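One plausible reading of the WDX combination is sketched below. The slide does not fully specify how overlapping documents combine, so this sketch assumes weighted scores add; the function name and parameter defaults are illustrative.

```python
def wdx_combine(run_a, run_b, d=100, w=0.1, x=0.1):
    """Asymmetric combination of two runs (dicts doc -> relevance):
    top-D documents of run A keep their full score, the rest of run A is
    weighted by W, all of run B is weighted by X, then re-sorted."""
    combined = {}
    ranked_a = sorted(run_a, key=run_a.get, reverse=True)
    for rank, doc in enumerate(ranked_a):
        combined[doc] = run_a[doc] if rank < d else w * run_a[doc]
    for doc, score in run_b.items():
        # Assumption: scores accumulate when a document appears in both runs.
        combined[doc] = combined.get(doc, 0.0) + x * score
    # Relevance re-sorting
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

This matches the stated use: a proper-noun run (run B) nudges the base run's ranking without overwhelming its confident top-D documents.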

14 Multilingual merging
- Standard approaches for merging:
  - no normalization, relevance re-sorting
  - standard normalization, relevance re-sorting
  - min-max normalization, relevance re-sorting
- MIRACLE approach for merging:
  - the number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N = 1, 10, 50, 125, 250, 1000); then one of the standard approaches is applied
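As an example of the standard approaches, min-max normalized merging rescales each per-language result list to [0, 1] before pooling and re-sorting; this is a generic sketch, not MIRACLE's exact implementation.

```python
def min_max_merge(runs):
    """Merge per-language runs, each a list of (doc, score) pairs,
    after min-max normalizing every list to [0, 1]."""
    pooled = []
    for lang, results in runs.items():
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for flat lists
        pooled += [(doc, (s - lo) / span) for doc, s in results]
    # Relevance re-sorting across all languages.
    return sorted(pooled, key=lambda kv: kv[1], reverse=True)

runs = {"fr": [("fr1", 8.0), ("fr2", 2.0)], "en": [("en1", 0.9), ("en2", 0.1)]}
merged = min_max_merge(runs)
```

Without normalization the French scores (on a 0-8 scale here) would dominate the English ones purely as an artifact of scale, which is exactly what the normalization step corrects.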

15 Results
We performed… countless experiments! (just for the adhoc task)

16 Monolingual Bulgarian
Stemmer (UTF-8): Neuchâtel
Rank: 4th

17 Bilingual English→Bulgarian (83% of monolingual)
En→Bu: Bultra, Webtrance
Rank: 1st

18 Monolingual Hungarian
Stemmer: Neuchâtel
Rank: 3rd

19 Bilingual English→Hungarian (87% of monolingual)
En→Hu: MoBiCAT
Rank: 1st

20 Monolingual French
Stemmer: Snowball
Rank: >5th

21 Bilingual English→French (79% of monolingual)
En→Fr: Systran
Rank: 5th

22 Bilingual Spanish→French (81% of monolingual)
Es→Fr: ATrans, Systran
(Rank: 5th)

23 Monolingual Portuguese
Stemmer: Snowball
Rank: >5th (4th)

24 Bilingual English→Portuguese (55% of monolingual)
En→Pt: Systran
Rank: 3rd

25 Bilingual Spanish→Portuguese (88% of monolingual)
Es→Pt: ATrans
(Rank: 2nd)

26 Multilingual-8 (En, Es, Fr)
Rank: 2nd [Fr, En], 3rd [Es]

27 Conclusions and homework
- Toolbox = "imagination is the limit"
- Focus on interesting linguistic things instead of boring text manipulation
- Reusability (half of the work is already done for next year!)
- Keys for good results:
  - a fast IR engine is essential
  - native character-encoding support
  - topic narrative
  - good translation engines make the difference
- Homework:
  - further development of system modules, fine tuning
  - Spanish, French, Portuguese…