1
Irion Technologies (c)
MEANING WP8 Validation
2
Validation in MEANING
- Industrial partners: (Reuters), EFE, Irion
- Integration of MEANING results in end-user applications:
  - Cross-lingual retrieval
  - Text classification system
- Application to real user cases:
  - Reuters news collection
  - EFE Fototeca database: news pictures with Spanish & English captions
- Evaluations:
  - Text classification benchmark (Reuters)
  - Information retrieval benchmark (Reuters & EFE)
  - Task-based evaluation by end-users (EFE)
3
Validation in MEANING: Baseline
Corporate Semantic Network + Wordnet Domains
Text classification on Reuters news:
- Without Wordnets: R 67.8%, P 70.4%, C 83.2%
- Wordnets: R 75.6%, P 65.9%, C 99.5%
- Wordnets + WSD: R 79.2%, P 71.5%, C 100%
Information retrieval with paraphrased English queries on Reuters news:
- Without Wordnets: R 29%
- Wordnets: R 25%
- Wordnets + WSD: R 32%
Details in MEANING Deliverable 8.1
4
Validation in MEANING: Phase 3
- Integration of MEANING (MCR)
- EFE Fototeca database
- Evaluation:
  - Information retrieval benchmark (EFE)
  - Task-based evaluation by end-users (EFE)
- MEANING Deliverables 8.2, 8.3, 8.4
5
MEANING-full effects in Information retrieval
WP8 Validation
6
Overview
- TwentyOne search system
- The EFE data and indexes built with MEANING
- Evaluation
- Conclusions
7
TwentyOne search system
Conceptual phrasal search
8
Value of linguistic phrases
A traditional string-based page retrieval system cannot differentiate linguistic contexts:
- “animal party” & “party animal”
- “Java Internet servers” & “Internet servers on Java”
- “Good service but bad volley” & “Attend a service in the cathedral”
User queries and linguistic phrases express complex concepts that should be matched as a whole.
9
TwentyOne: two-stage retrieval system
- A vector space model is used to retrieve all relevant pages from a large collection;
- Within the relevant pages we compare the concepts expressed in the query with the concepts expressed in the linguistic phrases;
- We list the pages with the best matching phrases;
- We use the vector space score when the phrase scores are equal.
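The two stages can be sketched roughly as follows. This is a minimal illustration and not the TwentyOne implementation: the page layout (`pages` as a dict of phrase lists), the plain term-frequency cosine, and the use of words in place of concepts are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_phrase_score(query_terms, phrases):
    """Highest fraction of query terms found together in one phrase."""
    wanted = set(query_terms)
    if not wanted:
        return 0.0
    return max((len(wanted & set(p.split())) / len(wanted) for p in phrases), default=0.0)


def two_stage_search(query, pages):
    """pages maps a page id to its list of phrases (hypothetical layout)."""
    terms = query.lower().split()
    query_vec = Counter(terms)
    ranked = []
    for page_id, phrases in pages.items():
        phrases = [p.lower() for p in phrases]
        page_vec = Counter(w for p in phrases for w in p.split())
        vector_score = cosine(query_vec, page_vec)
        if vector_score == 0.0:
            continue  # stage 1: drop pages without any overlap
        phrase_score = best_phrase_score(terms, phrases)  # stage 2
        ranked.append((page_id, phrase_score, vector_score))
    # Best phrase match first; the vector score only breaks ties.
    ranked.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return ranked


PAGES = {
    "a": ["police cell", "the station"],
    "b": ["cell phone", "police car"],
    "c": ["violet flowers"],
}
RANKED = two_stage_search("police cell", PAGES)
```

Pages "a" and "b" contain the same words and get the same vector score, but only "a" has both query words inside one phrase, so it is listed first.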
10
Conceptual phrase matching
[Diagram: the word forms (Word form1 .. Word formN) of a query and of a document phrase are each mapped to concepts (Concept1..N, ConceptN, ConceptM) and to a domain (Domain = politics, Domain = economy). Example: "human right activist-leader" and "mensenrechtenactivistenleider" (human rights activist leader), Domain = politics.]
- all concepts, same wording -> 100%
- 1 out of 3 concepts, same wording -> 33%
Phrase-score:
- number of matching concepts
- matching conceptual relation: party animal; animal party
- matching domains
- fuzzy word match: potatos, potatoes; Afganistan & afghanistan; café, cafe, Café, CaFé, CAFÉ, café-noir
- flexion and derivation: depart, departure, departures, departing, departings
- multiwords and compounds: mensenrechtenactivistenleider, human rights
- original word, synonym or translation: café, pub, bar, coffee shop, tea room; United States of America, US, USA, VS, Amerika; Pays-Bas, Holland, the Netherlands
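The percentages on this slide (all concepts -> 100%, 1 out of 3 -> 33%) can be illustrated with a small sketch. It covers only the "number of matching concepts" factor; the `LEXICON` mapping is a hypothetical toy stand-in for the semantic network, and wording, conceptual relation and domain are ignored here.

```python
# Hypothetical toy lexicon: normalised word form -> concept id.
LEXICON = {
    "café": "CAFE", "cafe": "CAFE", "pub": "CAFE", "bar": "CAFE",
    "us": "USA", "usa": "USA", "amerika": "USA",
}


def concepts(words):
    """Map word forms to concept ids; unknown words stand for themselves."""
    return {LEXICON.get(w.lower(), w.lower()) for w in words}


def phrase_score(query_words, phrase_words):
    """Percentage of query concepts that also occur in the phrase."""
    query_concepts = concepts(query_words)
    if not query_concepts:
        return 0
    matched = query_concepts & concepts(phrase_words)
    return round(100 * len(matched) / len(query_concepts))


FULL_MATCH = phrase_score(["USA", "café", "owner"], ["Amerika", "pub", "owner"])
PARTIAL_MATCH = phrase_score(["USA", "café", "owner"], ["Amerika", "hotel", "manager"])
```

Because matching happens on concept ids, a synonym or translation ("pub" for "café", "Amerika" for "USA") counts as a match in the same way as the original word.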
11
Cross-lingual retrieval
[Diagram: queries and XML pages pass through the NLP modules Lid, Syn (tokenization, tagging, parsing), Nam (named entity recognition) and Con (concept recognition); the multilingual semantic network is used for expansion into the ES, EN, CA, BA and IT indexes of pages and phrases.]
12
Domain-based WSD (IRST-Trento, Magnini 2002)
[Diagram: sets of concepts from WordNet/Semnet (synsets, glosses, examples), grouped by domain, are exported as domain word lists (e.g. Sport words) and used, together with more contexts + domain from the IST project MEANING, to train the TwentyOne Classify text classifier. The classifier assigns domains to the phrases of an unseen document, e.g. Microworld: Sport, with the phrase "financial scandal Juventus" as Nanoworld: Finance and the phrase "Players boycott the match" as Nanoworld: Sport. The assigned domain is used for concept selection.]
13
Effectiveness of domain disambiguation
2nd-level domains (163 -> 57); NPs classified in a window of 10 NPs; threshold was set to 60.

                      Nanoworlds                        Microworlds
                      Spanish          English          Spanish          English
disambiguated words   238,671          26,279           44,652           3,097
total concepts        1,691,079        314,394          220,574          18,541
excluded              879,317 (52%)    205,221 (65%)    105,620 (48%)    10,603 (57%)
selected              811,762 (48%)    109,173 (35%)    114,954 (52%)    7,938 (43%)
polysemy              7.1              12.0             4.9              6.0
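The figures on this slide are internally consistent, which the short check below reproduces. Two things are assumptions rather than statements from the slide: that the four value columns are Nanoworlds and Microworlds, each for Spanish and English, and that "polysemy" is the ratio of total concepts to disambiguated words.

```python
# Counts copied from the slide; the column labels are an assumed reading.
COLUMNS = {
    ("Nanoworlds", "Spanish"): {"words": 238_671, "total": 1_691_079, "excluded": 879_317},
    ("Nanoworlds", "English"): {"words": 26_279, "total": 314_394, "excluded": 205_221},
    ("Microworlds", "Spanish"): {"words": 44_652, "total": 220_574, "excluded": 105_620},
    ("Microworlds", "English"): {"words": 3_097, "total": 18_541, "excluded": 10_603},
}


def summarise(col):
    """Derive the selected count, the percentages and the polysemy ratio."""
    selected = col["total"] - col["excluded"]
    return {
        "selected": selected,
        "excluded_pct": round(100 * col["excluded"] / col["total"]),
        "selected_pct": round(100 * selected / col["total"]),
        "polysemy": round(col["total"] / col["words"], 1),
    }


SUMMARY = {name: summarise(col) for name, col in COLUMNS.items()}
```

Under this reading, domain disambiguation discards roughly half to two thirds of the candidate concepts per word.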
14
EFE data and indexes
Fototeca database for finding news pictures from captions
15
EFE DATA
- 29,511 XML files (26,546 Spanish, 2,965 English), 29,943 images;
- Content: caption and descriptions (mostly capitalized!);
- Meta information, other fields.
16
Indexes
- NO: no usage of wordnets
- FULL: wordnets used for full expansion
- MEANING: wordnets used for expansion after disambiguation with MEANING data
17
NO Index
- Spanish source string -> Spanish normalisation -> Spanish index -> no translation -> English, Basque, Catalan and Italian index
- English source string -> English normalisation -> English index -> no translation -> Spanish, Basque, Catalan and Italian index
18
FULL Index
- Spanish source string -> Spanish-WN -> all meanings -> synonym expansion -> normalisation -> Spanish index -> translation -> normalisation -> English, Basque, Catalan and Italian index
- English source string -> English-WN -> all meanings -> synonym expansion -> normalisation -> English index -> translations -> normalisation -> Spanish, Basque, Catalan and Italian index
19
MEANING Index
- Spanish source string -> Spanish-WN -> WSD -> selection of meanings -> synonym expansion -> normalisation -> Spanish index -> translation -> normalisation -> English, Basque, Catalan and Italian index
- English source string -> English-WN -> WSD -> selection of meanings -> synonym expansion -> normalisation -> English index -> translations -> normalisation -> Spanish, Basque, Catalan and Italian index
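The difference between the NO, FULL and MEANING indexes can be sketched for a single language. Everything named here is hypothetical: `TOY_WORDNET` is a made-up three-sense entry, lower-casing stands in for normalisation (and is applied before expansion for simplicity), the translation step is left out, and falling back to all senses when WSD made no selection is an assumption.

```python
# Hypothetical toy wordnet: word -> {sense label: synonyms}.
TOY_WORDNET = {
    "cell": {
        "prison": ["jail"],
        "biology": ["neuron"],
        "device": ["phone", "battery"],
    },
}


def index_terms(words, mode, selected_senses=None):
    """Return the terms to index for one source string.

    mode is "NO", "FULL" or "MEANING"; selected_senses maps a word to the
    sense labels that WSD kept (only used in MEANING mode).
    """
    if mode not in {"NO", "FULL", "MEANING"}:
        raise ValueError(f"unknown index mode: {mode}")
    selected_senses = selected_senses or {}
    terms = []
    for word in words:
        word = word.lower()  # stand-in for normalisation
        terms.append(word)
        if mode == "NO":
            continue  # no wordnet expansion at all
        senses = TOY_WORDNET.get(word, {})
        if mode == "MEANING":
            keep = selected_senses.get(word, set(senses))
            senses = {s: syns for s, syns in senses.items() if s in keep}
        for synonyms in senses.values():
            terms.extend(synonyms)
    return terms


NO_TERMS = index_terms(["Police", "cell"], "NO")
FULL_TERMS = index_terms(["Police", "cell"], "FULL")
MEANING_TERMS = index_terms(["Police", "cell"], "MEANING", {"cell": {"prison"}})
```

FULL adds the synonyms of every sense, so "neuron", "phone" and "battery" end up in the index next to "jail"; MEANING keeps only the synonyms of the selected sense.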
20
Evaluation
MEANING-full effects in information retrieval
21
Evaluation set-up
- Sets of paraphrased queries with translations to all languages;
- Automatic measurement of recall, where we accept a top-10 result;
- Number of results is limited to a maximum of 25, searched with Boolean AND;
- Applied to all 3 indexes:
  - no wordnets
  - wordnets & no disambiguation
  - wordnets & disambiguation
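The measurement described above can be approximated as follows: a query counts as a hit when a correct document appears among the first 10 of at most 25 returned results. The function name, the data layout and the toy result lists are assumptions for illustration only.

```python
def hit_rate(results, relevant, k=10, max_results=25):
    """Count queries with a correct document in the top k results.

    results maps a query id to its ranked document ids; relevant maps a
    query id to the set of correct document ids (hypothetical layout).
    Returns (number of hits, hits divided by number of queries).
    """
    hits = 0
    for query_id, ranked in results.items():
        top = ranked[:max_results][:k]
        if any(doc in relevant[query_id] for doc in top):
            hits += 1
    return hits, (hits / len(results) if results else 0.0)


# Toy data, not taken from the EFE evaluation.
RESULTS = {"q1": ["d3", "d1"], "q2": ["d9"], "q3": []}
RELEVANT = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d5"}}

TOP10 = hit_rate(RESULTS, RELEVANT)
TOP1 = hit_rate(RESULTS, RELEVANT, k=1)
```

Calling the same function with `k=1` gives a first-position score, which is stricter than the top-10 score.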
22
Disambiguating effect of phrase matching
- Context is highly determining;
- Wordnet expansions:
  - Maximize recall -> generate all possible synonyms for all possible meanings, e.g. police cell -> jail;
  - Maximize noise -> generate all possible synonyms for unintended/irrelevant meanings, e.g. cell -> neuron, phone, battery;
- Chances are low that user queries contain phrases where unintended meanings are combined with similar context words: police cell division; police phone; police neuron; police battery.
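A toy version of this argument, assuming captions are expanded with the synonyms of all senses and queries are matched with Boolean AND. The `SYNONYMS` table and the example captions are hypothetical.

```python
# Hypothetical expansion table with all senses of "cell" mixed together.
SYNONYMS = {"cell": {"jail", "neuron", "phone", "battery"}}


def expand(words):
    """Return the words plus every synonym of every sense."""
    expanded = set()
    for word in words:
        expanded.add(word)
        expanded |= SYNONYMS.get(word, set())
    return expanded


def and_match(query, caption):
    """Boolean AND: every query word must occur in the expanded caption."""
    return set(query.lower().split()) <= expand(caption.lower().split())


INTENDED = and_match("police jail", "police cell")       # wanted match via synonym
BLOCKED = and_match("police jail", "cell division")      # noise lacks the context word
SINGLE_WORD = and_match("jail", "cell division")         # single word: noise gets through
```

The unintended expansion of "cell division" only surfaces for the single-word query; once the query carries a context word such as "police", the noisy caption no longer satisfies the AND.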
23
Queries
<TESTIN>
  <DBS_ID>EFE_1</DBS_ID>
  <DOC_ID>11</DOC_ID>
  <PAG_TITLE></PAG_TITLE>
  <PAG_ID>231</PAG_ID>
  <NPS>
    <NP ID="16">Un grupo de cargueros transporta una imagen adornada con flores violetas</NP>
  </NPS>
  <SOURCE_LNG>es</SOURCE_LNG>
  <BOOLEAN>AND</BOOLEAN>
  <QUERY_ES>flores violetas</QUERY_ES>
  <QUERY_EN>violet flowers</QUERY_EN>
  <QUERY_CA>flor violeta</QUERY_CA>
  <QUERY_BA>lore bioleta</QUERY_BA>
  <QUERY_IT>fiori di viola</QUERY_IT>
  <QUERY_SY>flores moradas</QUERY_SY>
</TESTIN>
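A test item in this format could be read with Python's standard `xml.etree.ElementTree`, as in the sketch below. The function name and the returned dictionary layout are assumptions; the XML is the example from the slide. QUERY_SY appears to hold the Spanish paraphrase (synonym) query, judging from its content.

```python
import xml.etree.ElementTree as ET

TESTIN_XML = """<TESTIN>
  <DBS_ID>EFE_1</DBS_ID>
  <DOC_ID>11</DOC_ID>
  <PAG_TITLE></PAG_TITLE>
  <PAG_ID>231</PAG_ID>
  <NPS>
    <NP ID="16">Un grupo de cargueros transporta una imagen adornada con flores violetas</NP>
  </NPS>
  <SOURCE_LNG>es</SOURCE_LNG>
  <BOOLEAN>AND</BOOLEAN>
  <QUERY_ES>flores violetas</QUERY_ES>
  <QUERY_EN>violet flowers</QUERY_EN>
  <QUERY_CA>flor violeta</QUERY_CA>
  <QUERY_BA>lore bioleta</QUERY_BA>
  <QUERY_IT>fiori di viola</QUERY_IT>
  <QUERY_SY>flores moradas</QUERY_SY>
</TESTIN>"""


def read_test_item(xml_text):
    """Extract the target document, the source phrase and the queries."""
    root = ET.fromstring(xml_text)
    # Collect every QUERY_xx child, keyed by its suffix (ES, EN, ...).
    queries = {
        el.tag[len("QUERY_"):]: el.text
        for el in root
        if el.tag.startswith("QUERY_")
    }
    return {
        "doc_id": root.findtext("DOC_ID"),
        "page_id": root.findtext("PAG_ID"),
        "source_language": root.findtext("SOURCE_LNG"),
        "boolean": root.findtext("BOOLEAN"),
        "noun_phrases": [el.text for el in root.iter("NP")],
        "queries": queries,
    }


ITEM = read_test_item(TESTIN_XML)
```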
24
Queries
Table columns: Multi word queries, Single word queries, Unique strings, Postings
Spanish original: 58, 105, 77, 92
Spanish paraphrase: 94, 69, 96
English: 57, 74
Catalan: 60
Basque: 104, 65
Italian: 56
- Words with ambiguity or synonyms
- Other meanings and synonyms also occur in the documents
- Words relate to the pictures
- There can be multiple correct results for each query
25
Results for multi word queries
Table columns: Spanish original (105 Q), Spanish paraphrase (94 Q), English, Catalan, Basque (104 Q), Italian
NO: 99 (0.94), 14 (0.15), 2 (0.02), 31 (0.3), 1 (0.01), 3 (0.03)
p1: 60 0.57 9 0.1 21 0.2
FULL: 96 (0.91), 71 (0.76), 39 (0.37), 70 (0.67), 50 (0.48), 55 (0.52)
  38 0.4 16 44 0.42 27 0.26 19 0.18
MEANING: 97 0.92 61 0.65 68 46 0.44 32 0.41 48 0.46 20 0.19
- FULL has better overall scores, MEANING has better scores for the 1st position
- Wordnet indexes (FULL & MEANING) outperform NO for paraphrased & cross-lingual queries
- High recall for original queries, due to conceptual phrase search
- FULL and MEANING are very close to NO: no negative effects from expansion
- MEANING removed correct cases but has better precision -> FULL introduces more noise in the ranking
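The decimal scores in this table look like the number of queries answered correctly divided by the number of queries. The check below covers only the three columns whose query count is stated on the slide (105, 94 and 104), and it assumes that the NO and FULL rows list their values in the same order as the column headers.

```python
# Query counts stated on the slide.
QUERY_COUNTS = {"Spanish original": 105, "Spanish paraphrase": 94, "Basque": 104}

# Hit counts read from the NO and FULL rows (assumed column order).
HITS = {
    "NO": {"Spanish original": 99, "Spanish paraphrase": 14, "Basque": 1},
    "FULL": {"Spanish original": 96, "Spanish paraphrase": 71, "Basque": 50},
}


def score(hits, query_count):
    """Fraction of queries with a correct result, rounded to two decimals."""
    return round(hits / query_count, 2)


SCORES = {
    index: {column: score(n, QUERY_COUNTS[column]) for column, n in row.items()}
    for index, row in HITS.items()
}
```

Under these assumptions the computed values agree with the scores printed next to the counts, for example 71 of 94 paraphrased queries for FULL against 14 of 94 for NO.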
26
Conclusions
- Integrated MEANING in TwentyOne Search and for a real end-user application: EFE picture database
- Showed that wordnets are useful for mono- and cross-lingual retrieval
- No significant improvement for WSD (top-10 scores are lower, top-1 scores are better)
- WSD can be improved in many ways
- Query-phrase matching is very effective, so we can afford to maximize recall with wordnets
- The type of queries (short & ambiguous) and type of retrieval (small captions & page-phrase matching) are important for experimental results
27
Thank you for your attention!