Indexing with linguistic techniques in the classic Information Retrieval model
Julio Gonzalo, Anselmo Peñas and Felisa Verdejo
Grupo de Procesamiento de Lenguaje Natural, Dpto. Lenguajes y Sistemas Informáticos, UNED
Jornadas de Tratamiento y Recuperación de la Información, JOTRI 2002

2 Content
– Goal
– Morpho-syntactic ambiguity in IR
– Phrase indexing
– Conceptual indexing
– Conclusions

3 Goal
Indexing with automatic linguistic techniques within the classic IR model
Diagram labels: Information need, Query formulation, Search engine, Docs., Document ranking, Refinement
– POS tagging
– Phrase indexing
– WSD & conceptual indexing
Bad strategies, or too much error in automatic processing?
IR-Semcor, hand-annotated test collection
– Lemmas and phrases
– Senses
– Synsets

4 Morpho-syntactic ambiguity in IR
Texts
– ...particle crosses the wall...
– ...canadian red cross...
– ...boat to cross mississippi river...
Plain indexing, query: cross
– ...particl cross the wall...
– ...canadian red cross...
– ...boat to cross mississippi river...
POS-tagged indexing, query: cross_N
– ...particl_N cross_V the_D wall_N...
– ...canadian_ADJ red_ADJ cross_N...
– ...boat_N to_TO cross_V mississippi_N river_N...
The plain query matches all three texts; the tagged query matches only the text that contains cross_N.
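The contrast between plain and POS-tagged matching on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' system: the document ids (`d1` to `d3`) and the function names are hypothetical, while the stems and tags are copied from the slide.

```python
# Toy collection: each document is a list of (stem, POS tag) pairs taken from the slide.
# Document ids d1..d3 are hypothetical labels added for this sketch.
DOCS = {
    "d1": [("particl", "N"), ("cross", "V"), ("the", "D"), ("wall", "N")],
    "d2": [("canadian", "ADJ"), ("red", "ADJ"), ("cross", "N")],
    "d3": [("boat", "N"), ("to", "TO"), ("cross", "V"),
           ("mississippi", "N"), ("river", "N")],
}


def plain_matches(stem):
    """Ids of documents that contain the stem, ignoring part of speech."""
    return sorted(d for d, toks in DOCS.items() if any(s == stem for s, _ in toks))


def tagged_matches(stem, pos):
    """Ids of documents that contain the stem with exactly this POS tag."""
    return sorted(d for d, toks in DOCS.items() if (stem, pos) in toks)
```

Here `plain_matches("cross")` returns all three documents, while `tagged_matches("cross", "N")` returns only `d2`, the "canadian red cross" text.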

6 Morpho-syntactic ambiguity in IR
– Matched documents are ranked much higher (there are fewer competing documents)
– Manual POS tagging misses relevant matches
  Query: ...talented baseball player... (talent_ADJ)
  Doc: ...top talents of the time... (talent_N)
  Missing match
– Automatic tagging makes more mistakes, but they are not always correlated with a decrease in retrieval
  Query: summer_N shoes_N design_V (design_V)
  Doc: Italian_ADJ designed_V sandals_N (design_V)
  Match

7 Phrase indexing
Texts
– ...a guide for the fisher who...
– ...information on cat care...
– ...arboreal carnivorous called fisher cat...
Plain indexing, query: fisher
– ...a guide for the fisher who...
– ...arboreal carnivorous called fisher cat...
– ...information on cat care...
Phrase indexing, query: fisher
– ...a guide for the fisher who...
– ...arboreal carnivorous called fisher_cat...
– ...information on cat care...
With plain indexing the query matches both texts that contain "fisher"; with phrase indexing it no longer matches the text indexed as fisher_cat.
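A minimal Python sketch of the phrase-indexing step, assuming a small hand-written phrase lexicon (`PHRASES`) and a greedy two-word lookup. Both are assumptions made for illustration; the slides do not say how phrases were detected.

```python
# Hypothetical phrase lexicon: word pairs that should be indexed as a single term.
PHRASES = {("fisher", "cat"): "fisher_cat"}


def phrase_tokens(text, phrases=PHRASES):
    """Tokenize on whitespace and merge known two-word phrases into one index term."""
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in phrases:
            out.append(phrases[pair])  # emit the phrase as one term
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out
```

After this step the term `fisher` is absent from the "fisher cat" text, so a query for `fisher` matches only the fishing guide.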

9 Phrase indexing
Phrase indexing sometimes harms retrieval
  Query: Candidate in governor’s_race
  Doc: Opened his race for governor
  Missing match
– Phrase meaning is highly compositional
– Needs semantic distinction

10 Conceptual Indexing
This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999), depending on the WSD error rate
Query: spring
Texts
– ...spring... ...muelle...
– ...spring... ...fountain... ...fuente...
– ...spring... ...springtime... ...primavera...
Conceptual index (after WSD): n03114639, n05727069, n09151839
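The idea of a conceptual index can be sketched as an inverted index keyed by concept rather than by word. In the sketch below the concept labels (`COIL`, `WATER_SOURCE`, `SEASON`) and document ids are hypothetical stand-ins; the slide's synset offsets are deliberately not assigned to senses, since the transcript does not preserve that mapping. Documents are assumed to be already disambiguated.

```python
# Each document is represented by the concepts its words were disambiguated to.
# Labels and ids are hypothetical; a real index would use synset offsets.
DOCS = {
    "en_coil": ["COIL"],           # English text using "spring" as a coil
    "es_coil": ["COIL"],           # Spanish text containing "muelle"
    "en_water": ["WATER_SOURCE"],  # English text containing "fountain"
    "es_water": ["WATER_SOURCE"],  # Spanish text containing "fuente"
    "en_season": ["SEASON"],       # English text containing "springtime"
    "es_season": ["SEASON"],       # Spanish text containing "primavera"
}


def build_index(docs):
    """Inverted index from concept label to the set of document ids."""
    index = {}
    for doc_id, concepts in docs.items():
        for concept in concepts:
            index.setdefault(concept, set()).add(doc_id)
    return index


def search(index, concept):
    """Documents indexed under the concept the query was disambiguated to."""
    return sorted(index.get(concept, set()))
```

Once the query "spring" is disambiguated to one concept, documents in either language that share that concept are retrieved, and documents about the other senses are not.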

11 Word Sense Disambiguation
(Sanderson 1994) introduced fixed error rates in pseudo-word disambiguation
  banana → banana/education/toy/gun/forest → WSD → toy
to conclude (over the Reuters collection)
– WSD must be above 90% accuracy
Reproduce Sanderson’s experiment (over IR-Semcor)
Compare precision in retrieval over synsets with WSD errors
  n07062238 {spring, springtime} → spring → WSD → n04985670 {spring, hook} (error)
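The pseudo-word setup with a controlled error rate can be sketched as follows. This is a simplified illustration, not Sanderson's or the authors' code: the function names are hypothetical, and the error model (pick a wrong member uniformly at random) is an assumption.

```python
import random


def make_pseudo_word(words):
    """Join several unrelated words into one artificially ambiguous term."""
    return "/".join(words)


def noisy_disambiguate(true_word, pseudo_word, accuracy, rng):
    """Return the true word with probability `accuracy`, otherwise a wrong member.

    Assumption: errors are spread uniformly over the other members.
    """
    wrong = [w for w in pseudo_word.split("/") if w != true_word]
    if not wrong or rng.random() < accuracy:
        return true_word
    return rng.choice(wrong)


# Usage: simulate a disambiguator at a fixed accuracy level.
rng = random.Random(0)
pseudo = make_pseudo_word(["banana", "education", "toy", "gun", "forest"])
sample = [noisy_disambiguate("banana", pseudo, 0.9, rng) for _ in range(10)]
```

Replacing each original word with its pseudo-word and then indexing the output of `noisy_disambiguate` at several accuracy levels allows retrieval precision to be compared against the disambiguation error rate.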

12 Pseudo-words with no errors in WSD

13 Synset indexing with no errors in WSD

14 Conceptual Indexing
Although explicit disambiguation strategies applied to indexing
– POS tagging
– Phrase indexing
– Word Sense Disambiguation
don’t produce a significant improvement in IR
Conceptual indexing based on synsets
– Needs automatic WSD accuracy close to the state of the art (60%)
– Permits Cross-Language Information Retrieval
Qualitative evaluation (Item Search engine)
– Some unsolved challenges (mainly WSD)
– Users perceive a slower and less transparent system

15 Conclusions
Think of users
– Even an improvement of 10% wouldn’t change users’ perception
– Don’t subordinate NLP to the classic IR model
– Find new paradigms in Information Access
– At a higher level, closer to users
Consider users’ tasks
Consider users’ interaction

17 IR-Semcor test collection
– 254 hand-annotated documents in English
– 82 hand-annotated queries in English with ~6.8 relevant documents each
Example
  The Fulton County Grand Jury investigates possible irregularities in Atlanta’s primary election
Lemmas and phrase annotation
  The Fulton_County_Grand_Jury investigate possible irregularity in atlanta primary_election
Sense annotation
  Fulton_County_Grand_Jury investigate2 possible2 irregularity1 atlanta1 primary_election1
Synset annotation (actually synset offsets or ILI-records)
  Fulton_County_Grand_Jury v00441414 a00036893 n00412042 n5608324 n00103176
  v00441414 { investigate, carry_out_an_investigation_of }
  a00036893 { possible, potential }
  n00412042 { irregularity, abnormality }
  n5608324 { Atlanta, capital_of_Georgia }
  n00103176 { primary_election, primary }

18 IR-Semcor test collection
Diagram labels: Semcor 1.5 (Doc 1 to Doc ~100), Semcor 1.6 (Doc 1 to Doc 83), IR-Semcor (Doc 1 to Doc 254, with Doc 171 marked), Query 1 to Query 82
– Hand-annotated summaries only for chunked docs
– Assume the summary of a text is relevant to all fragments of the original Semcor document

19
– Textual representation: query is translated into the target language
– Conceptual representation: query and documents are compared at a conceptual level
– Selection of query language
– Selection of WSD strategy
– Selection of newspaper determines the target language
– Retrieved documents

20 Approaches
– Natural Language Processing
– Disambiguation
– Conceptual indexing
– Terminology
– Controlled vocabularies indexing & browsing
– String Processing
– Free text indexing
– Information Retrieval
– Phrase indexing & browsing (Phind)
– Keyphrase navigation (Phrasier)
– Automatic Terminology Extraction
– Terminology Retrieval & Term browsing (WTB)

23 Semantic distinction of compounds
II. Experiments in Lexical Ambiguity and Indexing
Types of lexical compounds (automatic classification through WordNet)
– Endocentric: one component is a hyperonym (example: purchasing department)
– Appositional: all components are hyperonyms (example: aspirin powder is_a powder, is_a aspirin)
– Exocentric: no components are hyperonyms (example: fisher cat)
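The three-way classification on this slide can be sketched as a hypernym check over the compound's components. The `IS_A` dictionary below is a hypothetical stand-in for WordNet hypernym lookups, with entries invented for the slide's three examples; the function name is also an assumption.

```python
# Hypothetical hypernym sets for each compound, standing in for WordNet lookups.
IS_A = {
    "purchasing department": {"department"},
    "aspirin powder": {"aspirin", "powder"},
    "fisher cat": {"carnivore"},
}


def classify_compound(compound, is_a=IS_A):
    """Classify a compound by how many of its components are its own hypernyms."""
    components = compound.split()
    hypernyms = is_a.get(compound, set())
    hits = [c for c in components if c in hypernyms]
    if len(hits) == len(components):
        return "appositional"  # every component is a hypernym
    if hits:
        return "endocentric"   # at least one, but not all, components are hypernyms
    return "exocentric"        # no component is a hypernym
```

Under these toy entries, "purchasing department" is endocentric, "aspirin powder" is appositional, and "fisher cat" is exocentric, matching the examples on the slide.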
