Semantic Similarity in a Taxonomy By Ankit Ramteke (113050067) Bibek Behera (113050043) Karan Chawla (113050053)


● This presentation is based on the paper "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language" by Philip Resnik (July 1999).

Outline
1) Motivation
2) Abstract
3) Similarity Measure
   a) Concept and Word Similarity
4) Implementation
   a) Other Measures
5) Results
   a) Critique
   b) Solution

Motivation
● Evaluating semantic relatedness using network representations is a difficult problem.
● One approach: the edge-counting method, which measures distance by the number of taxonomy links between two concepts.

Motivation Contd.
● Drawback of the edge-counting approach: it relies on the notion that links in the taxonomy represent uniform distances.

Example:
● rabbit ears
  Sense 1: rabbit ears -- (an indoor TV antenna; consists of two extendible rods that form a V)
    => television antenna, tv-antenna -- (an omnidirectional antenna tuned to the broadcast frequencies assigned to television)
● phytoplankton
  Sense 1: phytoplankton -- (photosynthetic or plant constituent of plankton; mainly unicellular algae)
    => plant, flora, plant life -- (a living organism lacking the power of locomotion)
    => organism, being -- (a living thing that has (or can develop) the ability to act or function independently)
    => living thing, animate thing -- (a living (or once living) entity)
● A single link near the leaves (rabbit ears → television antenna) spans far less semantic distance than links near the top of the taxonomy (plant → organism → living thing).

Abstract ● This paper presents a semantic similarity measure in an IS-A taxonomy. ● It presents algorithms that take advantage of taxonomic similarity in resolving syntactic and semantic ambiguity.

Some Notations
● C : set of concepts in an IS-A taxonomy
● p(c) : probability of encountering an instance of concept c
● if c1 IS-A c2, then p(c1) <= p(c2)
● Information content of a concept c : IC(c) = - log p(c)

Similarity Measure
● Similarity between concepts c1 and c2:
  sim(c1, c2) = max over c in S(c1, c2) of [- log p(c)]
  where S(c1, c2) is the set of concepts that subsume both c1 and c2.
● Similarity between words w1 and w2:
  wsim(w1, w2) = max over c1, c2 of [sim(c1, c2)]
  where c1 ranges over s(w1), the senses of w1, and c2 ranges over s(w2), the senses of w2.
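The two definitions above can be sketched over a toy taxonomy. All concept names, sense inventories, and probabilities below are illustrative assumptions, not values from the paper:

```python
import math

# Hypothetical IS-A taxonomy, stored as child -> parent links.
PARENT = {
    "medical_doctor": "health_professional",
    "nurse": "health_professional",
    "health_professional": "person",
    "person": "entity",
}
# Assumed concept probabilities p(c); monotone non-decreasing along IS-A links.
P = {"medical_doctor": 0.01, "nurse": 0.02,
     "health_professional": 0.05, "person": 0.4, "entity": 1.0}

def ancestors(c):
    """Concept c together with every concept that subsumes it."""
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out

def sim(c1, c2):
    """sim(c1, c2) = max over common subsumers c of -log p(c)."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(-math.log(P[c]) for c in common)

def wsim(w1, w2, senses):
    """wsim(w1, w2) = max of sim over all sense pairs of the two words."""
    return max(sim(c1, c2) for c1 in senses[w1] for c2 in senses[w2])
```

For example, `sim("medical_doctor", "nurse")` is the information content of their most informative common subsumer, health_professional.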

Concept and Word Similarity

Implementation
● The concept probability is estimated as p(c) = freq(c) / N
  where freq(c) = Σ over n in words(c) of count(n), words(c) is the set of nouns subsumed by concept c, and N is the total number of nouns observed.
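A minimal sketch of this frequency estimate: every occurrence of a noun counts toward each concept that subsumes one of that noun's senses. The toy taxonomy and sense inventory are assumptions for illustration:

```python
# Toy child -> parent taxonomy and noun -> senses lexicon (both assumed).
PARENT = {"medical_doctor": "person", "nurse": "person", "person": "entity"}
SENSES = {"doctor": ["medical_doctor"], "nurse": ["nurse"]}

def ancestors(c):
    """Concept c and every concept subsuming it."""
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out

def concept_probabilities(noun_counts):
    """p(c) = freq(c) / N, where each noun occurrence is credited to
    every concept subsuming one of the noun's senses."""
    freq = {}
    n_total = sum(noun_counts.values())
    for noun, count in noun_counts.items():
        for sense in SENSES[noun]:
            for c in ancestors(sense):
                freq[c] = freq.get(c, 0) + count
    return {c: f / n_total for c, f in freq.items()}
```

Note that the root of the taxonomy always receives every count, so its probability is 1 and its information content is 0.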

Other Measures
● Edge-counting method:
  wsim_edge(w1, w2) = (2 * MAX) - [min over c1, c2 of len(c1, c2)]
  where MAX is the maximum depth of the taxonomy and len(c1, c2) is the shortest path length between senses c1 and c2.
● Probability method:
  sim_p(c)(c1, c2) = max over c in S(c1, c2) of [1 - p(c)]
  wsim_p(c)(w1, w2) = max over c1, c2 of [sim_p(c)(c1, c2)]
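The edge-counting baseline can be sketched over the same kind of toy taxonomy (names and the MAX depth are illustrative assumptions):

```python
# Toy child -> parent taxonomy (assumed).
PARENT = {"medical_doctor": "health_professional",
          "nurse": "health_professional",
          "health_professional": "person", "person": "entity"}
MAX_DEPTH = 3  # assumed maximum depth of the toy taxonomy

def path_to_root(c):
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def path_len(c1, c2):
    """Edges on the shortest path between c1 and c2 through a common subsumer."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    common = set(p1) & set(p2)
    return min(p1.index(c) + p2.index(c) for c in common)

def wsim_edge(senses1, senses2):
    """wsim_edge = 2*MAX - minimum path length over all sense pairs."""
    return 2 * MAX_DEPTH - min(path_len(c1, c2)
                               for c1 in senses1 for c2 in senses2)
```

Subtracting the path length from 2*MAX turns the distance into a similarity, so closer sense pairs score higher.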

Results
● r : correlation between each measure's similarity ratings and the mean human ratings reported by Miller and Charles.

Critique
● The measure sometimes produces spuriously high similarity scores for words on the basis of inappropriate word senses.

Solution
● Take a weighted similarity measure:
  wsim_α(w1, w2) = Σ_i α(c_i) [- log p(c_i)]
  where α(c_i) is the weight for each sense.
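A minimal sketch of the weighted variant. The α weights and p(c) values below are illustrative assumptions; the idea is simply that down-weighting inappropriate senses keeps a single spurious subsumer from dominating the score:

```python
import math

def wsim_weighted(subsumers, alpha, p):
    """wsim_alpha(w1, w2) = sum_i alpha(c_i) * (-log p(c_i)).
    alpha maps each subsuming concept c_i to its sense weight;
    p maps each concept to its probability."""
    return sum(alpha[c] * -math.log(p[c]) for c in subsumers)
```

For example, with weight 0.9 on an appropriate subsumer and 0.1 on the root, the root contributes nothing (its information content is 0) and the score is dominated by the appropriate sense.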

Conclusion
● Presented semantic similarity measures in an IS-A taxonomy.
● The information content measure proved useful and gives good results, close to human judgments.

Using Taxonomic Similarity in Resolving Syntactic Ambiguity

MOTIVATION
● "Every way ambiguous" syntactic constructions: prepositional phrases, nominal compounds
● I saw a boy with a telescope
● A bank and a warehouse guard
● This section investigates the role of semantic similarity in disambiguating nominal compounds

Motivation
● Noun phrase coordinations of the form [n1 and n2 n3] have two possible bracketings:
● a ([bank and warehouse] guard)
● a (policeman) and a (park guard)
● It is important to distinguish between the two.

Why semantic similarity?
● Similarity of form and similarity of meaning are important cues for conjoinability
● Similarity of form can be captured by agreement in number (singular vs. plural):
  several business and university groups vs. several businesses and university groups
● Boolean variable: do the candidate heads satisfy number agreement?
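The number-agreement cue can be approximated with a very crude plural test; a real system would use a morphological analyzer, so the heuristic below is only a sketch:

```python
def is_plural(noun):
    """Crude plural heuristic: trailing 's' but not 'ss' (business, class)."""
    return noun.endswith("s") and not noun.endswith("ss")

def number_agreement(n1, n3):
    """Boolean cue: do the candidate conjoined heads agree in number?"""
    return is_plural(n1) == is_plural(n3)
```

On the slide's example, "businesses" and "groups" agree (both plural), while "business" and "groups" do not, which favors conjoining the heads in the first case.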


Some more examples
● Noun-noun modification: mail and securities fraud; corn and peanut butter
● "Mail fraud" is a common nominal phrase. "Corn butter"??
● Now consider: corn and peanut crops

Selectional Association
● For example, A(wool, clothing) would have a higher value than, say, A(wool, person).
● A(w1, w2) of two words is defined as the maximum of A(w1, c) taken over all classes c to which w2 belongs.
● For example, A(wool, glove) would most likely be equal to A(wool, clothing), as compared to, say, A(wool, sports equipment).
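A sketch of the word-level definition above. The A(w1, c) table and the word-to-classes map are illustrative assumptions; the slide only states the max-over-classes rule:

```python
# Hypothetical class-level association scores A(w1, c).
ASSOC = {("wool", "clothing"): 2.1,
         ("wool", "person"): 0.1,
         ("wool", "sports_equipment"): 0.3}
# Hypothetical mapping from a word to the classes it belongs to.
CLASSES = {"glove": ["clothing", "sports_equipment"], "guard": ["person"]}

def selectional_association(w1, w2):
    """A(w1, w2) = max over classes c of w2 of A(w1, c)."""
    return max(ASSOC.get((w1, c), 0.0) for c in CLASSES[w2])
```

So A(wool, glove) picks up the strong clothing association rather than the weak sports-equipment one, matching the slide's example.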

Using Selectional Association
● Resolve coordination ambiguities as follows:
● Prefer analyses that exhibit strong affinities, e.g. bank as a modifier of guard
● Dis-prefer very weak relationships, e.g. corn as a modifier of butter
● "Strong" and "weak" are parameters of the algorithm

Experiments
● Two sets of 100 noun phrases of the form [NP n1 and n2 n3], from the Wall Street Journal (WSJ) corpus as found in the Penn Treebank
● Disambiguated manually; one set used for training and the other for testing
● Preprocessing: all names were replaced with "someone"; month abbreviations were expanded; nouns were reduced to their root forms

Calculation of the similarity measures
● Selectional association: co-occurrence frequencies taken from a sample of approximately 15,000 noun-noun compounds extracted from the WSJ corpus

Decision Rules

Combination of methods
(a) A simple form of "backing off": number agreement, then selectional association, then semantic similarity
(b) Taking a vote among the three strategies and choosing the majority
(c) Classifying using the results of a linear regression
(d) Constructing a decision tree classifier
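Strategy (a) can be sketched as a chain of cues tried in order of reliability. The `None`-means-abstain convention and the default bracketing are assumptions for illustration:

```python
def backed_off_decision(number_cue, assoc_cue, similarity_cue):
    """'Backing off': use the most reliable cue that gives a decision.
    Each cue is a callable returning 'left' ((n1 and n2) n3),
    'right' (n1 and (n2 n3)), or None when it abstains."""
    for cue in (number_cue, assoc_cue, similarity_cue):
        decision = cue()
        if decision is not None:
            return decision
    return "left"  # assumed default when every cue abstains
```

For example, if number agreement abstains but selectional association answers, its answer is used and semantic similarity is never consulted.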

Results

Comments
● Semantic similarity does contain information about the correct answer:
● It agrees with the majority vote a substantial portion of the time (74%)
● It selects correct answers more often than one would expect by default for the cases it receives through backing off

Using taxonomic similarity in resolving semantic ambiguity

Using taxonomic similarity in word sense selection
● Selecting the sense of a noun when it appears in the context of other nouns similar in meaning.
● Query expansion using related words is a well-studied technique in information retrieval.
● Clusters of similar words can play a role in smoothing stochastic language models for speech recognition.
● Clusterings of related words can be used to characterize subgroupings of retrieved documents in large-scale Web searches.

Associating word senses with noun groupings
● Example cluster: attorney, counsel, trial, court, judge
● For some tasks, the relevant relationships are not among words but among word senses.
● Searching with this cluster as-is would also retrieve documents involving advice (one sense of counsel) and royalty (one sense of court).
● What we want is a semantically sticky cluster, tied to the intended senses.

Resnik's algorithm
● Resnik introduced an algorithm in 1998 that finds the most relevant group of word senses based on the relationships between pairs of words in a group.
● The algorithm is as follows:
● Let the set of words be W = {w1, ..., wn}
● Each word wi has a sense set Si


● For each word pair (wi, wj), we find the most informative common ancestor c_{i,j} and the word similarity value v_{i,j}.
● We then traverse all the senses of each word wi. The idea is that a sense is more relevant if it is a descendant of such an ancestor of a pair of words: the ancestor gives support to that sense by incrementing the sense's support value by the ancestor's information content.
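A minimal sketch of this support step, using a toy taxonomy modeled on the doctor/nurse example from the slides. All concept names and probabilities are illustrative assumptions:

```python
import math
from itertools import combinations

# Toy child -> parent taxonomy, probabilities, and sense inventory (assumed).
PARENT = {"medical_doctor": "health_professional",
          "nurse_professional": "health_professional",
          "phd": "person", "nanny": "person",
          "health_professional": "person", "person": "entity"}
P = {"medical_doctor": 0.01, "nurse_professional": 0.02, "phd": 0.01,
     "nanny": 0.02, "health_professional": 0.05, "person": 0.4, "entity": 1.0}
SENSES = {"doctor": ["medical_doctor", "phd"],
          "nurse": ["nurse_professional", "nanny"]}

def ancestors(c):
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out

def support(words):
    """For each word pair, find the most informative common subsumer over
    all sense pairs, then credit its information content to every sense
    of the pair's words that the subsumer dominates."""
    supp = {w: {s: 0.0 for s in SENSES[w]} for w in words}
    for w1, w2 in combinations(words, 2):
        best_ic, best_c = -1.0, None
        for c1 in SENSES[w1]:
            for c2 in SENSES[w2]:
                for c in set(ancestors(c1)) & set(ancestors(c2)):
                    ic = -math.log(P[c])
                    if ic > best_ic:
                        best_ic, best_c = ic, c
        for w in (w1, w2):
            for s in SENSES[w]:
                if best_c in ancestors(s):
                    supp[w][s] += best_ic
    return supp
```

For {doctor, nurse} the best subsumer is health_professional, so the "medical doctor" sense of doctor and the "health professional" sense of nurse receive support, while "PhD" and "nanny" do not.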

[Figure: sense trees for doctor (senses: medical doctor, PhD) and nurse (senses: health professional, nanny), with the medical-doctor and health-professional senses sharing the ancestor health professional.]

Support and normalization matrix

[Figure: sense trees for doctor (senses: medical doctor, PhD) and actor (sense: performer), all under the ancestor person.]

● Taking doctor and actor

● Taking nurse and actor

Membership function

Find the most relevant group
● W' is the set of words that an ideal human annotator would cluster together.
● Using the above algorithm we can cluster doctor (medical doctor), nurse (health professional), and actor (performer) into one cluster.

● On average, Resnik's algorithm achieved approximately 89% of the performance of human annotators performing the same task.

Linking to WordNet using a Bilingual Dictionary
● Multilingual resources are difficult to obtain.
● Associate Chinese vocabulary items with nodes in WordNet, using definitions in the CETA Chinese-English dictionary.
● For example, consider the following dictionary entries:
● a) 1. brother-in-law (husband's elder brother) 2. father 3. uncle (father's elder brother) 4. uncle (form of address to an older man)
● b) actress, player of female roles

● While associating Chinese vocabulary items like these with the WordNet taxonomy, we must avoid inappropriate senses.
● For example, entry (a) should not be associated with father in the sense of a Church father or priest.
● English glosses in the CETA dictionary are generally too short to take advantage of word overlap; the definitions in an entry are usually very similar in meaning, with a few exceptions such as:
● c) 1. case, i.e., upper case or lower case 2. dial of a watch
● Inspection of the dictionary confirms that when multiple definitions are present they tend more toward polysemy than homonymy.

● Resnik conducted experiments to assess the extent to which a word sense disambiguation algorithm can identify relevant noun senses in WordNet for Chinese words in the CETA dictionary, using the English definitions as the source of similar nouns to disambiguate.
● Some simple heuristics were used:
● Find head nouns, i.e., the nouns heading the definitions — for example: a) uncle, brother-in-law, father; b) actress, player
● WordNet's noun database was used to automatically find compound nominals — for example, "record player" yields the compound record_player as the noun rather than player.

Experiment
● Two independent judges were recruited to help annotate the test set: one a native Chinese speaker, the other a Chinese language expert for the US government.
● The judges independently annotated the 100 test items.
● For each item, a judge was given: the Chinese word, its English definition, and a list of all WordNet sense descriptions for any head noun in the associated noun group.

Procedure
● For each item the judges were asked if the word was known. If the response was negative, the judge moved to the next word.
● For known words, the instructions were as follows: for each WordNet definition there are six boxes — confidence ratings 1 through 5 and an "is-a" box.
● For each definition: if the judge thinks the Chinese word can have that meaning, the number corresponding to his confidence in that choice is selected, where 1 is lowest confidence and 5 is highest confidence.

● If the Chinese word cannot have that meaning, but can have a more specific meaning, "is-a" is selected.
● For example, if the Chinese word means "truck" and the WordNet definition is "automotive vehicle: self-propelled wheeled vehicle", this option would be selected: it makes sense to say that this Chinese word describes a concept that is a kind of automotive vehicle. Then 1, 2, 3, 4, or 5 is picked as the confidence in this decision.
● If neither of the above cases applies for this WordNet definition, nothing is checked for it.

● For example, an item whose definition reads "urgent message or urgent dispatch" lists the following WordNet sense descriptions, as generated by the head nouns message and dispatch; the judge ticks the appropriate ratings:
● message, content, subject matter, substance: what a communication that is about something is about
● dispatch, expedition, expeditiousness, fastness: subconcept of celerity, quickness, rapidity
● dispatch, despatch, communique: an official report usually sent in haste

● message: a communication usually brief that is written or spoken or signaled; "he sent a three-word message"
● dispatch, despatch, shipment: the act of sending off something
● dispatch, despatch: the murder or execution of someone

● The is-a option was not used in the analysis of results, but it allowed the judges to give a decision on words that don't have an equivalent translation in English.
● For example, for the word meaning "the spring festival, lunar new year, Chinese new year", a judge would classify it as is-a:
● festival: a day or period of time set aside for feasting and celebration — since Chinese New Year does not appear as a WordNet concept.

● After the test set has been annotated, evaluation is done according to two conditions:
● Selection – the annotator's judgments identify which WordNet senses are to be considered correct.
● Filtering – the annotator's judgments identify which WordNet senses are to be considered incorrect.

● An algorithm tested against this set must include the senses that were labelled correct by the human annotator, and is penalized for including senses labelled incorrect.
● For example, if "dispatch, despatch: the murder or execution of someone" is included by the algorithm, the algorithm should be penalized.

● For the selection paradigm, the goal is to identify WordNet senses to include.
● Precision is the proportion of the senses the method selected that are correct; recall is the proportion of the correct senses that the method selected.

● For the filtering paradigm, the goal is to identify WordNet senses to exclude.
● Precision and recall are defined analogously over the excluded senses:

● In the filtering paradigm, precision begins with the set of senses that the method filtered out and computes the proportion that were correctly filtered out. ● Recall in filtering begins with the set of senses that should have been excluded i.e. the incorrect ones and computes the proportion of these that the method actually managed to exclude.
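The two evaluation conditions described above reduce to ordinary set-based precision and recall; only the sets being compared differ. A minimal sketch:

```python
def precision_recall(proposed, gold):
    """precision = |proposed & gold| / |proposed|;
    recall = |proposed & gold| / |gold|."""
    hit = len(proposed & gold)
    return hit / len(proposed), hit / len(gold)

# Selection: proposed = senses the method kept,    gold = correct senses.
# Filtering: proposed = senses the method removed, gold = incorrect senses.
```

So an algorithm that removes only incorrect senses has filtering precision 1.0, and one that removes all incorrect senses has filtering recall 1.0.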

Results
● Evaluation using Judge 1 as the reference standard, considering items selected with confidence 3 and above.
● Agreement and disagreement with Judge 1

Conclusion ● The results clearly show that the algorithm is better than the baseline, but also indicate that it is overgenerating senses, which hurts selection precision. ● Below baseline performance in filtering recall suggests the algorithm is choosing to allow in senses that it should be filtering out. ● This pattern of results suggests that the best use of this algorithm at its present level of performance would be as a filter for a lexical acquisition process with a human in the loop.

References
1) Resnik, P. (1999). Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, Vol. 11.
2) Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence.