Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web Yizhe Ge

Overview
- Background
- WordNet and Semantic Similarity Methods
- Semantic Similarity Retrieval Model (SSRM)
- Web Retrieval System
- Experimental Results
- Conclusions

The talk begins with a brief introduction to the background.

Background
- Lack of common terms != not related
- SSRM vs. classical models
- Semantic similarity vs. lexicographic term matching

Information retrieval is widely used in domains ranging from database systems to web search engines. The main idea is to locate documents that contain the terms that users specify in queries. Classical information retrieval models (e.g., Vector Space, Probabilistic, Boolean) [14] are based on lexicographic term matching. However, the lack of common terms in two documents does not necessarily mean that the documents are unrelated: two terms can be semantically similar (e.g., synonyms or terms with similar meanings) although they are lexicographically different. Classical retrieval methods therefore fail to retrieve documents that contain only semantically similar terms. This is exactly the problem the paper addresses.

WordNet and Semantic Similarity Methods
- An online lexical reference system
- Hierarchies and relationships:
  - Hyponym/hypernym (Is-A)
  - Meronym/holonym (Part-Of)

WordNet was developed at Princeton University. It is an online lexical reference system that attempts to model the lexical knowledge of native English speakers. It contains around 100,000 terms (nouns, verbs, adjectives, and adverbs), organized into hierarchies.
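
A quick way to see the two relation types named above is NLTK's WordNet interface (using NLTK here is my assumption; the slides do not name a toolkit):

    # Requires: pip install nltk; then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    car = wn.synset('car.n.01')
    print(car.hypernyms())       # Is-A parents, e.g. motor_vehicle.n.01
    print(car.part_meronyms())   # Part-Of constituents, e.g. car_door.n.01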

WordNet and Semantic Similarity Methods
Semantic similarity methods (4 main families):
- Edge counting methods: length of path
- Information content methods: probability of occurrence
- Feature based methods: properties
- Hybrid methods

Edge counting methods measure the similarity between two terms (concepts) as a function of the length of the path linking the terms in the hierarchy. Information content methods measure the difference in information content of the two terms as a function of their probability of occurrence in a corpus.
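
As a small illustration of the first two families, NLTK exposes both a path-length measure and a Resnik-style information-content measure; the choice of measures and of the Brown corpus below is mine, not the paper's:

    # Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
    from nltk.corpus import wordnet as wn, wordnet_ic

    cat, dog = wn.synset('cat.n.01'), wn.synset('dog.n.01')

    # Edge counting: score decays with the path length between the concepts.
    print(cat.path_similarity(dog))

    # Information content: based on concept probabilities from the Brown corpus.
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    print(cat.res_similarity(dog, brown_ic))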

Semantic Similarity Retrieval Model (SSRM)
- Queries and documents → term vectors
- Each term in a vector is represented by its weight
- TF-IDF is used for computing the weights

Queries and documents are syntactically analyzed and reduced into term (noun) vectors. A term is usually defined as a stemmed non-stop-word. Very infrequent or very frequent terms are eliminated. Each term in a vector is represented by its weight, computed with the TF-IDF model.
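
A minimal sketch of this vectorization step using scikit-learn's TfidfVectorizer (my choice of library; the paper's pipeline also stems terms and keeps only nouns, which is omitted here):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the train left the railway station",
            "a car is an automobile"]

    # Stop words are removed; max_df drops very frequent terms.
    vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9)
    doc_vectors = vectorizer.fit_transform(docs)  # one TF-IDF weight vector per document
    print(vectorizer.get_feature_names_out())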

Semantic Similarity Retrieval Model (SSRM)
- Vector Space Model (VSM)
- Cosine similarity

Traditionally, in vector space models, the similarity between a query q and a document d is computed as the cosine of the angle between their term vectors, i.e., their normalized inner product:

    Sim(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \sqrt{\sum_i d_i^2}}

where q_i and d_i are the term weights in the two vector representations. Given a query, all documents are ranked according to their similarity to the query. Once again, the lack of common terms does not necessarily mean that a query and a document are unrelated: semantically similar concepts may be expressed by different words in the documents and the queries, and direct comparison by word-based VSM is not effective. For example, VSM will not recognize synonyms or semantically similar terms (e.g., "car", "automobile").
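
In code, the cosine similarity above is a one-liner over the weight vectors; a plain-numpy sketch with made-up weights:

    import numpy as np

    def cosine_sim(q, d):
        # Normalized inner product of query and document weight vectors.
        return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

    q = np.array([0.5, 0.0, 0.8])  # query term weights q_i
    d = np.array([0.4, 0.3, 0.0])  # document term weights d_i
    print(cosine_sim(q, d))        # ranking score for this (query, document) pair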

Semantic Similarity Retrieval Model (SSRM)
SSRM works in three steps:
1. Term re-weighting
2. Term expansion
3. Document similarity

Semantic Similarity Retrieval Model (SSRM)
Term re-weighting: give higher weights to semantically similar query terms.

The weight q_i of each query term i is adjusted based on its relationships with other semantically similar terms j within the same vector:

    q_i' = q_i + \sum_{j \ne i,\ sim(i,j) \ge t} q_j \cdot sim(i, j)

where t is a user-defined threshold. This formula assigns higher weights to semantically similar terms within the query (e.g., "railway", "train", "metro"). The weights of non-similar terms remain unchanged (e.g., "train", "house").
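
A sketch of the re-weighting step; `sim` stands in for any of the WordNet similarity measures, and the default threshold value here is a placeholder:

    def reweight(terms, weights, sim, t=0.8):
        """Adjust each query weight by its semantically similar neighbours."""
        adjusted = []
        for i, qi in zip(terms, weights):
            qi_new = qi
            for j, qj in zip(terms, weights):
                if i != j and sim(i, j) >= t:
                    qi_new += qj * sim(i, j)  # q_i' = q_i + sum q_j * sim(i, j)
            adjusted.append(qi_new)
        return adjusted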

Semantic Similarity Retrieval Model (SSRM)
Term expansion: synonyms, hyponyms, and hypernyms.

First, the query is augmented by synonym terms (the most common sense is taken). Then, the query is augmented by hyponyms and hypernyms that are semantically similar to terms already in the query. Fig. 3 illustrates this process: each query term is represented by its WordNet tree hierarchy; the neighborhood of the term is examined, and all terms with similarity greater than a threshold T (T = 0.9 in this work) are also included in the query vector. The expansion may include terms more than one level higher or lower than the original term. Then, each expanded query term i is assigned a weight:

    q_i = \sum_{j:\ sim(i,j) \ge T} \frac{1}{n}\, q_j \cdot sim(i, j)

where n is the number of hyponyms of each expanded term j. It is possible for a term to introduce terms that already exist in the query; it is also possible for the same term to be introduced by more than one other term.
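
A sketch of the expansion step with NLTK's WordNet (again my choice of toolkit; the similarity-threshold filtering and the 1/n weighting from the formula above are omitted for brevity):

    from nltk.corpus import wordnet as wn

    def expand(term):
        senses = wn.synsets(term, pos=wn.NOUN)
        if not senses:
            return set()
        sense = senses[0]                    # most common sense first
        expanded = set(sense.lemma_names())  # synonyms
        for neighbour in sense.hyponyms() + sense.hypernyms():
            expanded.update(neighbour.lemma_names())
        return expanded

    print(expand('train'))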

Semantic Similarity Retrieval Model (SSRM)
Document similarity:

    Sim(q, d) = \frac{\sum_i \sum_j q_i \, sim(i, j) \, d_j}{\sum_i \sum_j q_i \, d_j}

where i and j are terms in the query and the document, respectively. In this calculation, query terms are expanded and re-weighted according to the previous steps, while document term weights d_j are plain tf·idf weights.
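
The double sum translates directly into code; this sketch takes expanded, re-weighted query terms and tf·idf-weighted document terms, with `sim` again a stand-in similarity function:

    def ssrm_similarity(q_terms, q_weights, d_terms, d_weights, sim):
        num, den = 0.0, 0.0
        for i, qi in zip(q_terms, q_weights):
            for j, dj in zip(d_terms, d_weights):
                num += qi * sim(i, j) * dj  # cross-term semantic contributions
                den += qi * dj              # normalization over all term pairs
        return num / den if den else 0.0

The nested loop also makes the quadratic cost discussed on the next slide explicit.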

Semantic Similarity Retrieval Model (SSRM)
- Pro: relaxes the classical independence assumption
- Con: quadratic time complexity, O(n^2)

SSRM relaxes the requirement of classical retrieval models that conceptually similar terms be mutually independent. On the other hand, it has quadratic time complexity, as opposed to the linear time complexity of VSM: expanding and re-weighting is fast for queries (which are short in most cases, specifying only a few terms) but not for documents with many terms.

Web Retrieval System
Two inverted files associate terms with their intra- and inter-document frequencies and allow for fast computation of term vectors. The document descriptions are syntactically analyzed and reduced into vectors of stemmed terms (nouns), which are matched against the queries.
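
A toy version of one such inverted file, mapping each term to its per-document (intra-document) frequencies; the document frequency needed for idf falls out as the posting-list length (data and structure illustrative only):

    from collections import defaultdict

    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    docs = {1: "train railway station", 2: "car automobile road"}
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1

    print(index['train'])       # {1: 1}
    print(len(index['train']))  # document frequency, used for idf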

Experimental Results
- SSRM vs. VSM
- 20 most frequent short image queries from Google
- 5 human referees
- Precision and recall

For the evaluation, 20 queries were selected from the list of the most frequent Google image queries. These are rather short queries, containing between 1 and 4 terms. The evaluation is based on relevance judgments by 5 human referees; each referee evaluated a subset of 4 queries for both methods. Each method is represented by a precision-recall curve: each query retrieves the best 50 answers, and each point on a curve is the average precision and recall over the 20 queries. Fig. 5.2 indicates that SSRM is far more effective than VSM, achieving up to 30% better precision and up to 20% better recall. A closer look into the results reveals that the effectiveness of SSRM is mostly due to the contribution of non-identical but semantically similar terms. VSM (like most classical retrieval models relying on lexicographic term matching) ignores this information.
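
For reference, this is how each point on such a curve could be computed; a sketch of precision and recall at a cutoff k, not the authors' evaluation code:

    def precision_recall_at_k(retrieved, relevant, k):
        """retrieved: ranked doc ids; relevant: set of doc ids judged relevant."""
        hits = len(set(retrieved[:k]) & relevant)
        return hits / k, hits / len(relevant)

    # e.g. averaged over the 20 queries for k = 1..50 to trace the curve
    print(precision_recall_at_k([3, 1, 7, 9], {1, 9, 12}, k=4))  # (0.5, 0.666...)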

Conclusions
- SSRM can work in conjunction with any taxonomic ontology
- SSRM performs better than VSM
- Future work: extend SSRM to work with compound terms

SSRM can work in conjunction with any taxonomic ontology (e.g., application-specific ontologies). The evaluation demonstrated very promising performance improvements over the Vector Space Model (VSM), the state-of-the-art document retrieval method.