
1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web
Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G.M. Petrakis, Evangelos Milios
ACM WIDM 2005 (slides dated 10/22/2015)

2 Semantic Similarity
- Semantic similarity relates to computing the conceptual similarity between terms that are not lexicographically similar, e.g. “car” and “automobile”
- Map two terms to an ontology and compute their relationship in that ontology

3 Objectives
- We investigate several semantic similarity methods and evaluate their performance: http://www.ece.tuc.gr/similarity
- We propose the Semantic Similarity Retrieval Model (SSRM) for computing similarity between documents containing semantically similar but not necessarily lexicographically similar terms: http://www.ece.tuc.gr/intellisearch

4 Ontologies
- Tools of information representation on a subject
- Hierarchical categorization of terms from general to most specific terms, e.g. object → artifact → construction → stadium
- Domain ontologies represent knowledge of a domain, e.g. the MeSH medical ontology
- General ontologies represent common-sense knowledge about the world, e.g. WordNet

5 WordNet
- A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms
- More than 100,000 terms
- An ontology of natural language terms
- Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)
- Synsets represent terms or concepts, e.g. stadium, bowl, arena, sports stadium – (a large structure for open-air sports or entertainments)

6 WordNet Hierarchies
- The synsets are also organized into senses
- Senses: different meanings of the same term
- The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.
  - Hyponym/Hypernym (Is-A relationships)
  - Meronym/Holonym (Part-Of relationships)
- Nine noun and several verb Is-A hierarchies

7 A Fragment of the WordNet Is-A Hierarchy [figure]

8 Semantic Similarity Methods
- Map terms to an ontology and compute their relationship in that ontology
- Four main categories of methods:
  - Edge counting: path length between terms
  - Information content: a function of the terms' probability of occurrence in a corpus
  - Feature based: similarity between the terms' properties (e.g., definitions) or based on their relationships to other similar terms
  - Hybrid: combine the above ideas
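
The edge-counting idea can be sketched on a toy Is-A hierarchy. The fragment below is hypothetical, not real WordNet data: it reuses the chain object → artifact → construction → stadium and assumes that "conveyance" and "ceramic" share the parent "instrumentality". The function names are illustrative only.

```python
# Toy Is-A hierarchy as a child -> parent map.
# Hypothetical fragment for illustration, not loaded from WordNet.
PARENT = {
    "artifact": "object",
    "construction": "artifact",
    "stadium": "construction",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
}


def ancestors(term):
    """Return the chain from term up to the root, term first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_length(a, b):
    """Number of Is-A edges on the shortest path between a and b."""
    up_b = {t: i for i, t in enumerate(ancestors(b))}
    for i, t in enumerate(ancestors(a)):
        if t in up_b:
            # t is the most specific common subsumer of a and b
            return i + up_b[t]
    raise ValueError("terms share no common ancestor")


print(path_length("conveyance", "ceramic"))
print(path_length("stadium", "conveyance"))
```

A shorter path means more similar terms; the individual edge-counting methods differ in how they turn this length (and the depth of the terms) into a similarity score.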

9 Example
- The edge counting distance between “conveyance” and “ceramic” is 2
- An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
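
A minimal sketch of the information content idea, in the style of Resnik's measure: the similarity of two terms is the information content, -log p(c), of their most specific common subsumer c. The hierarchy and the corpus counts below are made up for illustration, and all names are hypothetical.

```python
import math

# Hypothetical Is-A fragment (child -> parent) and made-up corpus counts.
PARENT = {
    "artifact": "object",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
}
COUNTS = {
    "object": 10,
    "artifact": 20,
    "instrumentality": 30,
    "conveyance": 25,
    "ceramic": 15,
}


def ancestors(term):
    """Return the chain from term up to the root, term first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def concept_probability(term):
    """Share of corpus occurrences that fall on term or any of its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(c for t, c in COUNTS.items() if term in ancestors(t))
    return subsumed / total


def information_content(term):
    return -math.log(concept_probability(term))


def common_subsumer(a, b):
    """Most specific concept that is an ancestor of both a and b."""
    up_b = set(ancestors(b))
    for t in ancestors(a):
        if t in up_b:
            return t
    raise ValueError("terms share no common ancestor")


def resnik_similarity(a, b):
    return information_content(common_subsumer(a, b))


print(common_subsumer("conveyance", "ceramic"))
print(resnik_similarity("conveyance", "ceramic"))
```

With these counts the root has probability 1 and therefore information content 0, so terms whose only common subsumer is the root come out as dissimilar.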

10 Semantic Similarity on WordNet
- The most popular methods are evaluated
- All methods are applied on a set of 38 term pairs
- Their similarity values are correlated with scores obtained by humans
- The higher the correlation of a method, the better the method is
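
The slides do not say which correlation coefficient is used. Assuming the Pearson coefficient, the evaluation step could look like the sketch below; the score lists are invented placeholders, not the 38 term pairs of the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Invented placeholder scores: one entry per term pair.
human_scores = [3.9, 3.1, 1.2, 0.4]
method_scores = [0.95, 0.70, 0.35, 0.10]
print(pearson(human_scores, method_scores))
```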

11 Evaluation

Method           Type            Correlation
Rada 1989        Edge Counting   0.59
Wu 1994          Edge Counting   0.74
Li 2003          Edge Counting   0.82
Leacock 1998     Edge Counting   0.82
Richardson 1994  Edge Counting   0.63
Resnik 1999      Info. Content   0.79
Lin 1993         Info. Content   0.82
Lord 2003        Info. Content   0.79
Jiang 1998       Info. Content   0.83
Tversky 1977     Feature Based   0.73
Rodriguez 2003   Hybrid          0.71

12 Observations
- Edge counting and information content methods work by exploiting structure information
- Good methods take the position of the terms into account
- Higher similarity for terms which are close together but lower in the hierarchy, e.g. [Li et al. 2003]
- Information content is measured on WordNet rather than on a corpus [Seco2002]
- Similarity is computed only for nouns and verbs
- No taxonomic structure for other parts of speech
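
The position effect can be illustrated with the formula usually quoted for the Li et al. method, sim = e^(-αl) · tanh(βh), where l is the path length between the two terms and h is the depth of their common subsumer. The formula and the parameter values α = 0.2, β = 0.6 are not given on the slides and should be read as assumptions.

```python
import math


def li_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.6):
    """Similarity that decays with path length and grows with subsumer depth.

    alpha and beta are assumed parameter values, not taken from the slides.
    """
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


# Same path length, but the second pair sits deeper in the hierarchy.
print(li_similarity(path_length=2, subsumer_depth=1))
print(li_similarity(path_length=2, subsumer_depth=5))
```

For a fixed path length, the score rises as the common subsumer moves down the hierarchy, which is the behaviour described in the third bullet.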

13 http://www.ece.tuc.gr/similarity

14 Semantic Similarity Retrieval Model (SSRM)
- Classic retrieval models retrieve documents with the same query terms
- SSRM will retrieve documents which also contain semantically similar terms
- Queries and documents are initially assigned tf x idf weights
- q = (q_1, q_2, …, q_N), d = (d_1, d_2, …, d_N)
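
A small sketch of the classic baseline: tf x idf weights compared by cosine similarity. The weighting variant (raw term frequency times log(N/df)) and the three tiny documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into a list of {term: tf x idf weight} dicts."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["stadium", "arena"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # only the shared term "engine" contributes
print(cosine(vecs[0], vecs[2]))  # no shared terms
```

"car" and "automobile" contribute nothing to the first score because the terms differ lexically, which is the gap SSRM is meant to close.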

15 SSRM
I. Query term re-weighting: similar terms reinforce each other
II. Query term expansion with synonyms and similar terms
III. Document similarity
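
The slides list these steps without formulas. The sketch below shows one plausible form of the re-weighting and document similarity steps: each query term gains weight from other query terms above a similarity threshold, and document similarity is a sum of weight products scaled by term similarity and normalized by the unscaled sum. Both formulas, the threshold value, and the toy similarity table are assumptions.

```python
# Hypothetical term-to-term similarities; a real system would compute
# these with one of the WordNet-based methods.
TOY_SIM = {
    frozenset(("car", "automobile")): 1.0,
    frozenset(("car", "vehicle")): 0.6,
}


def toy_sim(a, b):
    """Similarity lookup: identical terms score 1, unknown pairs score 0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def reweight_query(query, sim, threshold=0.8):
    """Step I (assumed form): similar query terms reinforce each other."""
    return {
        t: w + sum(
            w2 * sim(t, t2)
            for t2, w2 in query.items()
            if t2 != t and sim(t, t2) >= threshold
        )
        for t, w in query.items()
    }


def ssrm_similarity(query, doc, sim):
    """Step III (assumed form): similarity-weighted, normalized sum over term pairs."""
    num = sum(qw * dw * sim(qt, dt)
              for qt, qw in query.items() for dt, dw in doc.items())
    den = sum(qw * dw for qw in query.values() for dw in doc.values())
    return num / den if den else 0.0


print(reweight_query({"car": 1.0, "automobile": 0.5}, toy_sim))
print(ssrm_similarity({"car": 1.0}, {"automobile": 1.0}, toy_sim))
print(ssrm_similarity({"car": 1.0}, {"stadium": 1.0}, toy_sim))
```

Unlike a cosine over matching terms, this score is non-zero for a query and a document that share no term, as long as their terms are semantically similar.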

16 Query Term Expansion [figure]

17 Observations
- Specification of T?
- Large T may lead to topic drift
- Word sense disambiguation for expanding with the correct sense
- Expansion with co-occurring terms? SVD, local/global analysis
- Semantic similarity between terms of different parts of speech?
- Work with compound terms (phrases)

18 Evaluation of SSRM
- SSRM is evaluated through intellisearch, a system for information retrieval on the WWW
- 1.5 million Web pages with images
- Images are described by surrounding text
- The problem of image retrieval is transformed into a problem of text retrieval

19 http://www.ece.tuc.gr/intellisearch

20 Methods
- Vector Space Model (VSM)
- SSRM
- Each method is represented by a precision/recall plot
- Each point is the average precision/recall over 20 queries
- The 20 queries come from the list of the most frequent Google image queries
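
One way to obtain a point of such a plot, sketched under the assumption that precision and recall are measured at a fixed cutoff k of the ranked answer list and then averaged over queries. The document identifiers and relevance judgments are invented.

```python
def precision_recall_at(ranked, relevant, k):
    """Precision and recall over the top-k entries of a ranked answer list."""
    retrieved = ranked[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(runs, k):
    """Average (precision, recall) at cutoff k over (ranked, relevant) pairs."""
    pairs = [precision_recall_at(ranked, relevant, k) for ranked, relevant in runs]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n)


# Invented results for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c", "e"}),
    (["x", "y", "z"], {"x", "y"}),
]
print(precision_recall_at(*runs[0], k=2))
print(average_precision_recall(runs, k=2))
```

Repeating the computation for increasing k yields the sequence of points that make up one curve per method.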

21 Experimental Results [figure]

22 MeSH and MedLine
- MeSH: ontology for medical and biological terms by the N.L.M.; 22,000 terms
- MedLine: the premier bibliographic medical database of the N.L.M.; 13 million references

23 Evaluation on MedLine [figure]

24 Conclusions
- Semantic similarity methods approximate the human notion of similarity, reaching a correlation of up to 0.83 with human scores
- SSRM exploits this information to improve retrieval performance
- SSRM can work with any semantic similarity method and any ontology

25 Future Work
- Experimentation with more data sets (TREC) and ontologies
- Extend SSRM to work with
  - Compound terms
  - More parts of speech (e.g., adverbs)
  - Co-occurring terms
  - More term relationships in WordNet
  - More elaborate methods for specification of thresholds

26 Try our system on the Web
- Semantic Similarity System: http://www.ece.tuc.gr/similarity
- SSRM: http://www.ece.tuc.gr/intellisearch

