La indexación con técnicas lingüísticas en el modelo clásico de Recuperación de Información Julio Gonzalo, Anselmo Peñas y Felisa Verdejo Grupo de Procesamiento.

Presentation transcript:

La indexación con técnicas lingüísticas en el modelo clásico de Recuperación de Información (Indexing with linguistic techniques in the classic Information Retrieval model)
Julio Gonzalo, Anselmo Peñas and Felisa Verdejo
Grupo de Procesamiento de Lenguaje Natural, Dpto. Lenguajes y Sistemas Informáticos, UNED
Jornadas de Tratamiento y Recuperación de la Información, JOTRI 2002

2 Content
– Goal
– Morpho-syntactic ambiguity in IR
– Phrase indexing
– Conceptual indexing
– Conclusions

3 Goal
Indexing with automatic linguistic techniques within the classic IR model.
Classic IR model (diagram labels): Information need, Query formulation, Search engine, Docs., Document ranking, Refinement.
Techniques considered:
– POS tagging
– Phrase indexing
– WSD & conceptual indexing
Question: bad strategies, or too much error in automatic processing?
Test bed: IR-Semcor, a hand-annotated test collection, with these annotation levels:
– Lemmas and phrases
– Senses
– Synsets

4 Morpho-syntactic ambiguity in IR
Texts:
– ...particle crosses the wall...
– ...canadian red cross...
– ...boat to cross mississippi river...
Plain query "cross":
– ...particl cross the wall... (match)
– ...canadian red cross... (match)
– ...boat to cross mississippi river... (match)
POS-tagged query "cross_N":
– ...particl_N cross_V the_D wall_N...
– ...canadian_ADJ red_ADJ cross_N... (match)
– ...boat_N to_TO cross_V mississippi_N river_N...
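
The contrast on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' system: the document ids (d1 to d3) and the function names are made up here, and the (stem, tag) pairs are copied from the slide's tagged texts.

```python
# Toy collection: each document is a list of (stem, POS tag) pairs,
# copied from the slide. The ids d1..d3 are hypothetical.
DOCS = {
    "d1": [("particl", "N"), ("cross", "V"), ("the", "D"), ("wall", "N")],
    "d2": [("canadian", "ADJ"), ("red", "ADJ"), ("cross", "N")],
    "d3": [("boat", "N"), ("to", "TO"), ("cross", "V"),
           ("mississippi", "N"), ("river", "N")],
}


def plain_matches(stem, docs):
    """Ids of documents that contain the stem, ignoring the POS tag."""
    return sorted(d for d, toks in docs.items()
                  if any(s == stem for s, _tag in toks))


def tagged_matches(stem, tag, docs):
    """Ids of documents that contain the stem with exactly this POS tag."""
    return sorted(d for d, toks in docs.items() if (stem, tag) in toks)


if __name__ == "__main__":
    print(plain_matches("cross", DOCS))
    print(tagged_matches("cross", "N", DOCS))
```

The plain query retrieves all three texts, while the tagged query keeps only the one where "cross" is a noun.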


6 Morpho-syntactic ambiguity in IR
– Documents that match are ranked much higher (there are fewer competing documents).
– Manual POS tagging misses relevant matches:
  Query: ...talented baseball player... (talent_ADJ)
  Doc: ...top talents of the time... (talent_N)
  Result: missing match
– Automatic POS tagging makes more mistakes, but they do not always correlate with a drop in retrieval:
  Query: summer_N shoes_N design_V (design_V)
  Doc: Italian_ADJ designed_V sandals_N (design_V)
  Result: match

7 Phrase indexing
Texts:
– ...a guide for the fisher who...
– ...information on cat care...
– ...arboreal carnivorous called fisher cat...
Plain indexing, query "fisher":
– ...a guide for the fisher who... (match)
– ...arboreal carnivorous called fisher cat... (match)
– ...information on cat care...
Phrase indexing, query "fisher":
– ...a guide for the fisher who... (match)
– ...arboreal carnivorous called fisher_cat...
– ...information on cat care...
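
A minimal sketch of the same effect, assuming a two-word phrase lexicon that contains only "fisher cat". The lexicon, the text ids (t1 to t3) and the function names are illustrative assumptions, not part of the talk.

```python
# Hypothetical phrase lexicon: word pairs that are indexed as one term.
PHRASES = {("fisher", "cat"): "fisher_cat"}

TEXTS = {
    "t1": "a guide for the fisher who",
    "t2": "information on cat care",
    "t3": "arboreal carnivorous called fisher cat",
}


def tokenize(text, phrases=None):
    """Split on whitespace; if a lexicon is given, merge known word pairs."""
    words = text.lower().split()
    if not phrases:
        return words
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in phrases:
            out.append(phrases[pair])  # the pair becomes a single index term
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


def search(term, texts, phrases=None):
    """Ids of texts whose token list contains the term."""
    return sorted(t for t, s in texts.items() if term in tokenize(s, phrases))


if __name__ == "__main__":
    print(search("fisher", TEXTS))           # plain indexing
    print(search("fisher", TEXTS, PHRASES))  # phrase indexing
```

With the lexicon enabled, the query "fisher" no longer reaches the text about the fisher cat, which is still reachable through the term "fisher_cat".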


9 Phrase indexing
– Phrase indexing sometimes harms retrieval:
  Query: Candidate in governor’s_race
  Doc: Opened his race for governor
  Result: missing match
– Phrase meaning is highly compositional.
– Semantic distinction is needed.

10 Conceptual Indexing
– This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999), depending on the WSD error rate.
– Query: spring (mapped to a concept by WSD)
– Texts:
  ...spring... ...muelle...
  ...spring... ...fountain... ...fuente...
  ...spring... ...springtime... ...primavera...
– Conceptual index: n, n, n
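
The idea of indexing by concept rather than by word can be sketched as follows. The concept ids (c_coil, c_fountain, c_season) and the document ids are placeholders invented for this example, and the documents are assumed to be already disambiguated. A real system would use synset offsets or ILI records.

```python
from collections import defaultdict

# Each document is a list of (word, concept id) pairs, i.e. the output of a
# WSD step that is assumed here to be error-free. All ids are hypothetical.
DOCS = {
    "en1": [("spring", "c_season")],
    "es1": [("primavera", "c_season")],
    "en2": [("spring", "c_coil")],
    "es2": [("muelle", "c_coil")],
    "en3": [("fountain", "c_fountain")],
    "es3": [("fuente", "c_fountain")],
}


def build_concept_index(docs):
    """Map every concept id to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for _word, concept in terms:
            index[concept].add(doc_id)
    return index


def search(concept, index):
    """Documents indexed under the concept chosen for the query word."""
    return sorted(index.get(concept, ()))


if __name__ == "__main__":
    index = build_concept_index(DOCS)
    # The query "spring", disambiguated as the season:
    print(search("c_season", index))
```

Because matching happens on concept ids, the query retrieves the Spanish document as well, and the documents about coils or fountains are left out even though they contain the word "spring".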

11 Word Sense Disambiguation
– (Sanderson 1994) introduced fixed error rates in pseudo-word disambiguation:
  banana -> banana/education/toy/gun/forest -> WSD -> toy
  and concluded (over the Reuters collection) that WSD must be above 90% accuracy.
– Reproduce Sanderson’s experiment (over IR-Semcor).
– Compare precision in retrieval over synsets with WSD errors:
  n {spring, springtime} <- spring -> WSD -> n (error) {spring, hook}
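
The pseudo-word setup can be simulated roughly as below. This is a sketch of the general idea only: the slide does not spell out how Sanderson injected errors, so the uniform choice of a wrong component, the seed, and all function names are assumptions.

```python
import random


def make_pseudo_word(words):
    """Conflate several words into one artificial ambiguous term."""
    return "/".join(words)


def noisy_wsd(original, pseudo_word, error_rate, rng):
    """Return the original word, or (with probability error_rate) a wrong
    component of the pseudo-word chosen uniformly at random."""
    if rng.random() < error_rate:
        wrong = [w for w in pseudo_word.split("/") if w != original]
        return rng.choice(wrong)
    return original


def observed_accuracy(tokens, pseudo_word, error_rate, seed=0):
    """Fraction of tokens that the simulated disambiguator gets right."""
    rng = random.Random(seed)
    hits = sum(noisy_wsd(t, pseudo_word, error_rate, rng) == t for t in tokens)
    return hits / len(tokens)


if __name__ == "__main__":
    pw = make_pseudo_word(["banana", "education", "toy", "gun", "forest"])
    tokens = ["banana", "toy", "gun", "banana", "forest"] * 200
    for rate in (0.0, 0.1, 0.4):
        print(rate, observed_accuracy(tokens, pw, rate))
```

Retrieval would then be run over the noisily disambiguated tokens at each error rate, which is the comparison the slide describes for synsets.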

12 Pseudo-words with no errors in WSD (figure; surviving label: text)

13 Synset indexing with no errors in WSD

14 Conceptual Indexing
– Explicit disambiguation strategies applied to indexing do not produce a significant improvement in IR:
  POS tagging
  Phrase indexing
  Word Sense Disambiguation
– Conceptual indexing based on synsets, however:
  needs automatic WSD accuracy near the state of the art (60%)
  permits Cross-Language Information Retrieval
– Qualitative evaluation (Item Search engine):
  some unsolved challenges (mainly WSD)
  users perceive a slower and less transparent system

15 Conclusions
– Think of users: even an improvement of 10% wouldn’t change users’ perception.
– Don’t subordinate NLP to the classic IR model.
– Find new paradigms in Information Access, at a higher level, closer to users:
  consider users’ tasks
  consider users’ interaction


17 IR-Semcor test collection
– 254 hand-annotated documents in English
– 82 hand-annotated queries in English with ~6.8 relevant documents each
Example:
  The Fulton County Grand Jury investigates possible irregularities in Atlanta’s primary election
Lemmas and phrase annotation:
  The Fulton_County_Grand_Jury investigate possible irregularity in atlanta primary_election
Sense annotation:
  Fulton_County_Grand_Jury investigate2 possible2 irregularity1 atlanta1 primary_election1
Synset annotation (actually synset offsets or ILI-records):
  Fulton_County_Grand_Jury
  v { investigate, carry_out_an_investigation_of }
  a { possible, potential }
  n { irregularity, abnormality }
  n { Atlanta, capital_of_Georgia }
  n { primary_election, primary }
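
To make the three annotation levels concrete, the sketch below derives index terms for the example sentence at each level. The data structure and the function name are assumptions for illustration. Only the content words shown in the sense annotation are included, and synsets are written as lemma lists rather than as real offsets.

```python
# (lemma or phrase, sense number, synset lemmas); None where the slide
# shows no sense or synset for the item.
ANNOTATED = [
    ("Fulton_County_Grand_Jury", None, None),
    ("investigate", 2, ("investigate", "carry_out_an_investigation_of")),
    ("possible", 2, ("possible", "potential")),
    ("irregularity", 1, ("irregularity", "abnormality")),
    ("atlanta", 1, ("Atlanta", "capital_of_Georgia")),
    ("primary_election", 1, ("primary_election", "primary")),
]


def index_terms(tokens, level):
    """Index terms for one annotation level: 'lemma', 'sense' or 'synset'.
    Items without sense information fall back to the lemma itself."""
    if level not in {"lemma", "sense", "synset"}:
        raise ValueError(f"unknown level: {level}")
    terms = []
    for lemma, sense, synset in tokens:
        if level == "lemma" or sense is None:
            terms.append(lemma)
        elif level == "sense":
            terms.append(f"{lemma}{sense}")
        else:
            terms.append("{" + ",".join(synset) + "}")
    return terms


if __name__ == "__main__":
    for level in ("lemma", "sense", "synset"):
        print(level, index_terms(ANNOTATED, level))
```

At the synset level, two documents that use "irregularity" and "abnormality" in the same sense would share an index term, which is what the conceptual index relies on.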

18 IR-Semcor test collection
– Semcor 1.5: Doc 1, Doc 2, ..., Doc ~100
– Semcor 1.6: Doc 1, Doc 2, ..., Doc 83
– IR-Semcor: Doc 1, Doc 2, ..., Doc 171, ..., Doc 254; Query 1, Query 2, ..., Query 82
– Hand-annotated summaries only for chunked docs.
– Assume the summary of a text is relevant to all fragments of the original Semcor document.
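
The relevance assumption on this slide amounts to a simple join between fragments and summaries. A minimal sketch, with made-up fragment, query and source ids:

```python
from collections import defaultdict


def build_qrels(fragment_source, query_source):
    """Relevance judgments under the slide's assumption: a summary (query)
    is relevant to every fragment cut from the same original document."""
    by_source = defaultdict(set)
    for fragment, source in fragment_source.items():
        by_source[source].add(fragment)
    return {query: sorted(by_source[source])
            for query, source in query_source.items()}


if __name__ == "__main__":
    # Hypothetical ids: fragments f1..f3 come from originals A and B.
    fragments = {"f1": "A", "f2": "A", "f3": "B"}
    queries = {"q1": "A", "q2": "B"}
    print(build_qrels(fragments, queries))
```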

19
– Textual representation: query is translated into the target language
– Conceptual representation: query and documents are compared at a conceptual level
– Selection of query language
– Selection of WSD strategy
– Selection of newspaper determines the target language
– Retrieved documents

20 Approaches
– Natural Language Processing
– Disambiguation
– Conceptual indexing
– Terminology
– Controlled vocabularies indexing & browsing
– String Processing
– Free text indexing
– Information Retrieval
– Phrase indexing & browsing (Phind)
– Keyphrase navigation (Phrasier)
– Automatic Terminology Extraction
– Terminology Retrieval & Term browsing (WTB)


23 Semantic distinction of compounds
II. Experiments in Lexical Ambiguity and Indexing
Types of lexical compounds (automatic classification through WordNet):
– Endocentric: one component is a hyperonym. Example: purchasing department (is_a department)
– Appositional: all components are hyperonyms. Example: aspirin powder (is_a powder, is_a aspirin)
– Exocentric: no component is a hyperonym. Example: fisher cat
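
The three-way classification can be sketched with a toy hyperonym (hypernym) table standing in for WordNet. The table entries are illustrative assumptions, not WordNet data. Treating "some but not all components" as endocentric for compounds of more than two words is also an assumption, since the slide only covers two-word examples.

```python
# Toy stand-in for WordNet is_a links: compound -> set of its hypernyms.
# These entries are hypothetical and only cover the slide's examples.
IS_A = {
    "purchasing_department": {"department"},
    "aspirin_powder": {"aspirin", "powder"},
    "fisher_cat": {"carnivore"},
}


def classify(compound, is_a=IS_A):
    """Classify a compound by how many of its components are its hypernyms."""
    components = compound.split("_")
    hypernyms = is_a.get(compound, set())
    n = sum(c in hypernyms for c in components)
    if n == 0:
        return "exocentric"
    if n == len(components):
        return "appositional"
    return "endocentric"


if __name__ == "__main__":
    for c in ("purchasing_department", "aspirin_powder", "fisher_cat"):
        print(c, classify(c))
```

An exocentric result such as "fisher_cat" is the case where indexing the compound as a single term is safest, since neither component describes what the compound denotes.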
