1
Entity Disambiguation By Angela Maduko Directed by Amit Sheth
2
Entity Disambiguation Problem
Emerges mainly while merging information from different sources.
Two major levels:
1. Schema/Ontology level: determining the similarity of attributes/concepts/classes from the different schemas/ontologies to be merged
2. Instance level: determining which instances of concepts/classes (or tuples in relational databases) refer to the same entity
3
Current approaches for both levels
Feature-based Similarity Approach (FSA)
Set-Theory Similarity Approach (STA)
Information-Theory Similarity Approach (ITA)
Clustering Approach (CA)
Hybrid Approach (HA)
Relationship-based Similarity Approach (RSA)
Hybrid Similarity Approach (HSA)
4
ITA
In [1], Lin presents a measure of the similarity between two concepts based on both their commonalities and their differences.
Intuition 1: The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.
Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.
Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.
5
ITA
Consider the concept Fruit, where A is an Apple and B is an Orange. What is the commonality of A and B?
common(A, B) = Fruit(A) and Fruit(B)
The commonality between A and B is measured by I(common(A, B)), the amount of information contained in common(A, B), where the information content of a proposition S is I(S) = -log P(S).
6
ITA
The difference between A and B is measured by I(description(A, B)) - I(common(A, B)), where description(A, B) is a proposition describing what A and B are.
Can be applied at both levels 1 and 2.
Intuitively, sim(A, B) = 1 when A and B are exactly alike, and 0 when they share no commonalities.
Proposes sim(A, B) = I(common(A, B)) / I(description(A, B)) = log P(common(A, B)) / log P(description(A, B))
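Below is a minimal sketch (not from the slides) of how this measure could be computed once the probabilities of the relevant propositions are known; the probability values for the Apple/Orange example are illustrative assumptions, not figures from [1].

```python
import math

def information_content(p: float) -> float:
    """I(S) = -log P(S): the information content of a proposition with probability p."""
    return -math.log(p)

def lin_similarity(p_common: float, p_description: float) -> float:
    """Lin's measure: sim(A, B) = I(common(A, B)) / I(description(A, B))."""
    return information_content(p_common) / information_content(p_description)

# Illustrative (assumed) probabilities for the Apple/Orange example:
# common(A, B) = "A and B are fruit"; description(A, B) = "A is an apple and B is an orange".
p_common = 0.2        # assumed P(a randomly chosen object is a fruit)
p_description = 0.01  # assumed P(the full description holds)
print(lin_similarity(p_common, p_description))  # 1 when A and B are identical, 0 with no commonality
```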
7
ITA
In [2], Resnik measures the similarity between two concepts in an is-a taxonomy based on the information content of their most specific common super-concept.
Define P(c) as the probability of encountering an instance of concept c in the taxonomy.
For any two concepts c_1 and c_2, define S(c_1, c_2) as the set of concepts that subsume both c_1 and c_2.
Proposes sim(c_1, c_2) = max over c in S(c_1, c_2) of [-log P(c)]
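As a rough illustration (assumptions: a toy taxonomy, hypothetical instance counts, and P(c) estimated by counting an instance under the concept itself and all of its super-concepts), Resnik's measure can be sketched as:

```python
import math

# Hypothetical taxonomy: child -> parent (None marks the root).
parents = {"Y": None, "X": "Y", "Z": "Y", "A": "X", "B": "X", "E": "Z", "F": "Z"}
# Hypothetical instance counts observed directly at each concept.
counts = {"A": 50, "B": 50, "E": 100, "F": 100, "X": 0, "Z": 0, "Y": 4}
total = sum(counts.values())

def ancestors(c):
    """Concepts subsuming c, including c itself."""
    out = set()
    while c is not None:
        out.add(c)
        c = parents[c]
    return out

def prob(c):
    """P(c): probability of encountering an instance of c or of any of its sub-concepts."""
    return sum(n for concept, n in counts.items() if c in ancestors(concept)) / total

def resnik_sim(c1, c2):
    """sim(c1, c2) = max over shared subsumers c of the information content -log P(c)."""
    shared = ancestors(c1) & ancestors(c2)
    return max(-math.log(prob(c)) for c in shared)

print(resnik_sim("A", "B"))  # governed by the most informative shared super-concept, here X
print(resnik_sim("A", "E"))  # only the root Y is shared, so the similarity is 0
```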
8
ITA (example)
[Small example taxonomy, not reproduced here: concepts X, Y, Z with sub-concepts A, B, C, D, E, F.]
100 instances of concept X, 4 instances of concept Y, 200 instances of concept Z, 2000 instances of all concepts.
Compare sim(A, B), sim(C, D), sim(A, D) and sim(A, E).
The measure gives sim(C, D) > sim(A, B). Should this be so?
9
ITA
Define s(w) as the set of concepts that are word senses of word w. Proposes a measure for word similarity as follows:
sim(w_1, w_2) = max over c_1 in s(w_1) and c_2 in s(w_2) of sim(c_1, c_2)
Can be applied at level 1 only.
Example: Doctor (medical doctor and PhD senses), Nurse (medical nurse and nanny senses); sim(Doctor, Nurse) is taken over the best-matching pair of senses.
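A short sketch of the word-level extension, assuming a hypothetical sense inventory and a concept-level measure such as the resnik_sim sketch above; the sense sets for "doctor" and "nurse" are stand-ins for the slide's example.

```python
# Hypothetical sense inventory s(w): word -> set of concepts that are senses of w.
senses = {
    "doctor": {"medical_doctor", "phd_holder"},
    "nurse": {"medical_nurse", "nanny"},
}

def word_similarity(w1, w2, concept_sim):
    """sim(w1, w2) = max over c1 in s(w1), c2 in s(w2) of sim(c1, c2)."""
    return max(concept_sim(c1, c2) for c1 in senses[w1] for c2 in senses[w2])

# Usage (assuming a concept-level measure defined over a suitable taxonomy):
# print(word_similarity("doctor", "nurse", resnik_sim))
```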
10
STA
[3] introduces a set-theoretical notion of a matching function F based on the following assumptions for classes a, b, c with description sets A, B, C respectively:
Matching: s(a, b) = F(A ∩ B, A - B, B - A)
Monotonicity: s(a, b) ≥ s(a, c) whenever A ∩ B ⊇ A ∩ C, A - B ⊆ A - C, and B - A ⊆ C - A
11
STA
Proposes two models.
Contrast model: similarity is defined as an increasing function of common features and a decreasing function of distinctive features (features that apply to one object but not the other):
S(a, b) = θ·f(A ∩ B) - α·f(A - B) - β·f(B - A), with θ, α, β ≥ 0
The function f measures the salience of a set of features; f depends on intensity and context factors:
Intensity: physical salience (e.g. physical features)
Context: the salience of features varies with context
12
STA
Ratio model:
S(a, b) = f(A ∩ B) / [f(A ∩ B) + α·f(A - B) + β·f(B - A)], with α, β ≥ 0
Can be applied at both levels 1 and 2.
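A minimal sketch of both models over plain feature sets; as a simplifying assumption, the salience function f is taken to be a simple feature count rather than a context-dependent measure, and the feature sets are made up for illustration.

```python
def contrast_model(A, B, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Contrast model: S(a, b) = theta*f(A & B) - alpha*f(A - B) - beta*f(B - A)."""
    return theta * f(A & B) - alpha * f(A - B) - beta * f(B - A)

def ratio_model(A, B, alpha=0.5, beta=0.5, f=len):
    """Ratio model: S(a, b) = f(A & B) / (f(A & B) + alpha*f(A - B) + beta*f(B - A))."""
    common = f(A & B)
    return common / (common + alpha * f(A - B) + beta * f(B - A))

# Made-up feature sets for two objects.
apple = {"fruit", "round", "red", "grows_on_trees"}
orange = {"fruit", "round", "orange_coloured", "citrus", "grows_on_trees"}
print(contrast_model(apple, orange), ratio_model(apple, orange))
```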
13
STA
[4] determines the similarity between two entities by the distance between them.
Defines P_ij as the probability that entities a_i and b_j have the same value for their k-th attribute, for each attribute k.
Assigns costs to mismatching errors (i.e. not matching where the entities should match, and matching where they should not) and then formulates a cost function based on P_ij to be maximized.
Shows that the expected distance between a_i and b_j is d_ij = 1 - P_ij, and substitutes this into the cost function.
Calculates the distance between two entities as a linear combination (weighted average) of the distances between their common attributes.
14
STA
Relationships amongst common attributes, such as key attributes and functional dependencies, are exploited.
Obtains attribute weights from the user.
Can be applied at level 2 only.
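A sketch, under assumptions, of the distance computation described on the previous two slides: the per-attribute match probabilities P_ij are assumed to be already available (in [4] they come from the matching model), each attribute contributes a distance 1 - P, and the entity-level distance is a weighted average using user-supplied attribute weights.

```python
def entity_distance(match_probs, weights):
    """Weighted average of per-attribute distances d = 1 - P, where match_probs[k]
    is the probability that the two entities agree on attribute k and weights[k]
    is its user-supplied weight."""
    total_weight = sum(weights[k] for k in match_probs)
    return sum(weights[k] * (1.0 - match_probs[k]) for k in match_probs) / total_weight

# Hypothetical attribute-level match probabilities for two records of the same person.
match_probs = {"name": 0.9, "address": 0.6, "ssn": 0.99}
weights = {"name": 1.0, "address": 0.5, "ssn": 2.0}  # e.g. key attributes weighted higher
print(entity_distance(match_probs, weights))
```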
15
CA
In [5], the authors present a model-based k-means clustering algorithm for name disambiguation in citations:
Randomly assigns citations to N clusters
Estimates the prior probability of each cluster
Computes the probability that a cluster produces a given citation c
Assigns c to the cluster with the highest probability of producing it
Applied at level 2 only.
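A rough sketch of the iterative step just described, with a naive-Bayes-style unigram model of citations standing in for the paper's actual citation model (an assumption on my part); cluster priors and word statistics are re-estimated from the current assignment on each pass.

```python
import math
import random
from collections import Counter

def cluster_citations(citations, n_clusters, iterations=20, smoothing=1.0):
    """Model-based k-means sketch: citations are lists of tokens; each cluster has a
    prior and a unigram word model; each citation is re-assigned to the cluster most
    likely to have produced it."""
    assignment = [random.randrange(n_clusters) for _ in citations]
    vocab = {w for cit in citations for w in cit}
    for _ in range(iterations):
        # Re-estimate cluster priors and per-cluster word counts from the assignment.
        priors = [assignment.count(c) / len(citations) for c in range(n_clusters)]
        word_counts = [Counter() for _ in range(n_clusters)]
        for cit, c in zip(citations, assignment):
            word_counts[c].update(cit)
        # Assign each citation to the cluster with the highest (log) probability.
        for i, cit in enumerate(citations):
            def score(c):
                total = sum(word_counts[c].values()) + smoothing * len(vocab)
                return math.log(priors[c] + 1e-12) + sum(
                    math.log((word_counts[c][w] + smoothing) / total) for w in cit)
            assignment[i] = max(range(n_clusters), key=score)
    return assignment

print(cluster_citations([["j", "smith", "data"], ["j", "smith", "databases"],
                         ["j", "smith", "genomics"]], n_clusters=2))
```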
16
CA
My comments on this paper: the drawback of this approach lies in the estimation of the model parameters, since the data necessary for this may not be readily available. Even if a user supplies estimates of these parameters, the estimates may not be unbiased and consistent.
17
HA
[7] combines clustering and information-content approaches for entity disambiguation (the Scalable Information Bottleneck (LIMBO) method).
Attempts to cluster entities in such a way that the clusters are informative about the entities within them.
Model: a set T of n entities (relational tuples), defined on m attributes A_1, A_2, …, A_m. The domain of attribute A_i is the set V_i = {v_i,1, v_i,2, …, v_i,d_i}.
Let T and V be two discrete random variables that take values from the set of entities and the set of attribute values, respectively.
Initially, assigns each entity to its own cluster, i.e. #clusters = #entities. Let C_q denote this initial clustering; then the mutual information of C_q and T equals the mutual information of V and T: I(C_q, T) = I(V, T).
18
HA
Assumes the number of distinct entities k is known.
Seeks a clustering C_k of V such that I(C_k, T) remains as large as possible, i.e. the information loss I(V, T) - I(C_k, T) is minimal.
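A small sketch of the information-loss criterion, assuming the joint distribution of attribute values V and entities T is given as normalized co-occurrence counts and that a candidate clustering maps each value to a cluster id; the actual LIMBO algorithm builds the clustering itself (agglomeratively, with compact summaries), which this sketch does not attempt.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X, Y) for a joint distribution given as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def merge_values(joint, clustering):
    """Replace each attribute value v by its cluster clustering[v] in the joint distribution."""
    merged = defaultdict(float)
    for (v, t), p in joint.items():
        merged[(clustering[v], t)] += p
    return dict(merged)

def information_loss(joint, clustering):
    """I(V, T) - I(C_k, T): how much information about the entities the clustering gives up."""
    return mutual_information(joint) - mutual_information(merge_values(joint, clustering))

# Tiny hypothetical example: values v1, v2, v3 co-occurring with entities t1, t2.
joint = {("v1", "t1"): 0.4, ("v2", "t1"): 0.1, ("v2", "t2"): 0.1, ("v3", "t2"): 0.4}
print(information_loss(joint, {"v1": "c1", "v2": "c1", "v3": "c2"}))
```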
19
HSA
In [8], Kashyap and Sheth introduce the concept of semantic proximity (semPro) between entities to capture their similarity. In addition to context, it employs relationships and features of the entities in determining their similarity.
semPro(O_1, O_2) = <Context, Abstraction, (D_1, D_2), (S_1, S_2)>, where:
Context: the context in which objects O_1 and O_2 are being compared
Abstraction: the abstraction/mapping relating the domains of the objects
(D_1, D_2): the domain definitions of the objects
(S_1, S_2): the states of the objects
20
HSA
Abstractions:
Total 1-1 value mapping
Partial many-one mapping
Generalization/specialization
Aggregation
Functional dependencies
ANY
NONE
21
HSA
Semantic taxonomy: defines 5 degrees of similarity between objects:
Semantic Equivalence
Semantic Relationship
Semantic Relevance
Semantic Resemblance
Semantic Incompatibility
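One possible (assumed, not the paper's own notation) way to carry the semPro 4-tuple and the five degrees of similarity around in code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Degree(Enum):
    """The five degrees of similarity, from strongest to weakest."""
    SEMANTIC_EQUIVALENCE = 5
    SEMANTIC_RELATIONSHIP = 4
    SEMANTIC_RELEVANCE = 3
    SEMANTIC_RESEMBLANCE = 2
    SEMANTIC_INCOMPATIBILITY = 1

@dataclass
class SemPro:
    """semPro(O1, O2) = <Context, Abstraction, (D1, D2), (S1, S2)>."""
    context: Optional[str]             # context in which O1 and O2 are compared (None = no context)
    abstraction: Optional[str]         # e.g. "total 1-1", "partial many-one", "generalization", ...
    domains: Tuple[str, str]           # (D1, D2): domain definitions of the objects
    states: Optional[Tuple[str, str]]  # (S1, S2): states of the objects, if considered
```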
22
HSA
Semantic Equivalence: the strongest measure of semantic proximity. Two objects are said to be semantically equivalent when they represent the same real-world entity.
Domain semantic equivalence: the abstraction in semPro(O_1, O_2) is a total 1-1 value mapping M between the domains D_1 and D_2.
State semantic equivalence: M is a total 1-1 value mapping between (D_1, S_1) and (D_2, S_2), i.e. the mapping also relates the states of the objects.
23
HSA
Semantic Relationship: weaker than semantic equivalence. The abstraction in semPro(O_1, O_2) is a partial many-one value mapping, a generalization, or an aggregation.
The requirement of a 1-1 mapping is relaxed such that, given an instance of O_1, we can identify an instance of O_2, but not vice versa.
24
HSA
Semantic Relevance: two objects are semantically relevant if there exists some mapping (abstraction) between their domains in some context.
25
HSA
Semantic Resemblance: the weakest measure of semantic proximity. No mapping exists between the domains of the objects in any context, but the objects have the same roles in some contexts whose definition contexts are coherent.
26
HSA
Semantic Incompatibility: asserts semantic dissimilarity, i.e. that there is no context and no abstraction in which the domains of the two objects are related.
27
HSA
[6] encompasses both the feature-based and the relationship-based approaches.
Represents entity classes using 3 components and assesses the similarity of two classes using 3 similarity measures, one for each component:
Synonym set (to address polysemy and synonymy)
Set of distinguishing features or differentiae (functions, parts and attributes)
Set of semantic inter-relations amongst entity classes (mainly hyponymy and meronymy)
28
HSA
Applied at level 1 only.
Defines the semantic neighbourhood of a class a with radius r as N(a, r) = {c_i | d(a, c_i) ≤ r}, where d(a, c) is the length of the shortest path connecting the two classes in the ontology.
29
HSA
Proposes S(a, b) = ω_w·S_w(a, b) + ω_u·S_u(a, b) + ω_n·S_n(a, b)
S_w, S_u and S_n are the respective similarities between the synonym sets, features and semantic neighbourhoods of classes a and b
ω_w, ω_u and ω_n ≥ 0 are the respective weights of the similarity of each component
Each component similarity is based on a normalization of Tversky's [3] model: S(a, b) = |A ∩ B| / (|A ∩ B| + α(a, b)·|A - B| + (1 - α(a, b))·|B - A|), with 0 ≤ α ≤ 1
30
HSA
Assumes similarity is asymmetric, with more similarity from a class to its super-class than vice versa.
The function α(a, b) captures this asymmetry; it is defined in terms of depth(a), which returns the length of the shortest path from class a to an imaginary root connecting the roots of the two ontologies.
Applies word matching over the synonym sets of a and b for S_w.
31
HSA
For S_u, applies matching over corresponding differentiae (functions S_f, parts S_p and attributes S_a, with corresponding weights ω_f, ω_p, ω_a ≥ 0) such that S_u(a, b) = ω_f·S_f(a, b) + ω_p·S_p(a, b) + ω_a·S_a(a, b), where ω_f + ω_p + ω_a = 1.
For S_n, compares the entity classes in the semantic neighbourhoods based on the synonym sets or differentiae of those classes.
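A sketch, under simplifying assumptions, of the weighted combination and of the normalized Tversky component used for each of S_w, S_u and S_n: the asymmetry parameter alpha is passed in rather than derived from class depths, and S_u is treated as a single feature set instead of the paper's weighted mix of functions, parts and attributes. The example classes and weights are made up.

```python
def tversky_component(A, B, alpha):
    """Normalized Tversky similarity over two description sets, with 0 <= alpha <= 1."""
    common = len(A & B)
    denom = common + alpha * len(A - B) + (1 - alpha) * len(B - A)
    return common / denom if denom else 0.0

def class_similarity(a, b, alpha=0.5, w_w=0.4, w_u=0.4, w_n=0.2):
    """S(a, b) = w_w*S_w + w_u*S_u + w_n*S_n over synonym sets, features and neighbourhoods.
    a and b are dicts with set-valued keys 'synonyms', 'features', 'neighbourhood'."""
    s_w = tversky_component(a["synonyms"], b["synonyms"], alpha)
    s_u = tversky_component(a["features"], b["features"], alpha)
    s_n = tversky_component(a["neighbourhood"], b["neighbourhood"], alpha)
    return w_w * s_w + w_u * s_u + w_n * s_n

hospital = {"synonyms": {"hospital", "infirmary"},
            "features": {"treats_patients", "has_beds", "building"},
            "neighbourhood": {"building", "clinic"}}
clinic = {"synonyms": {"clinic"},
          "features": {"treats_patients", "building"},
          "neighbourhood": {"building", "hospital"}}
print(class_similarity(hospital, clinic))
```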
32
HSA
In [9], Cho et al. propose a model derived from the edge-based approach that also employs the information content of the node-based approach, based on these observations:
There exists a correlation between similarity and the number of shared parent concepts in a hierarchy
The link type (hyponymy, meronymy, etc.) determines the kind of semantic relationship
33
HSA
The conceptual similarity between a node and its adjacent child node may not be equal everywhere: as depth increases in the hierarchy, the conceptual similarity between a node and its adjacent child node decreases.
The population of nodes is not uniform over the entire ontological structure (a link in a dense part of the hierarchy spans less distance than one in a less dense part).
34
HSA
Proposes S(c_i, c_j) = D(L_i^j) Σ_{0≤k≤n} [ W(t_k) d(c_k^{k+1}) f(d) ] (max[H(c)]), where:
f(d) is a function that returns a depth factor (topological location in the hierarchy)
d(c_k^{k+1}) is a density function
D(L_i^j) is a function that returns a distance factor between c_i and c_j (the shortest path from one node to the other)
W(t_k) is a weight function that assigns a weight to each link type (W(t_k) = 1 for an is-a link)
H(c) is the information content of the super-concepts of c_i and c_j
For level 1 only.
35
References
1. Dekang Lin. An Information-Theoretic Definition of Similarity. Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296-304, 1998.
2. Philip Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI, 1995.
3. Amos Tversky. Features of Similarity. Psychological Review 84(4), 1977, pp. 327-352.
4. Debabrata Dey. A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases. IEEE Transactions on Knowledge and Data Engineering, 14(3), May/June 2002.
5. Hui Han, Hongyuan Zha and C. Lee Giles. A Model-based K-means Algorithm for Name Disambiguation. Proceedings of the Second International Semantic Web Conference (ISWC 2003) Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
6. M. Andrea Rodriguez and Max J. Egenhofer. Determining Semantic Similarity Among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering, 15(2): 442-456, 2003.
7. Periklis Andritsos, Renee J. Miller and Panayiotis Tsaparas. Information-Theoretic Tools for Mining Database Structure from Large Data Sets. SIGMOD Conference 2004: 731-742.
8. Vipul Kashyap and Amit Sheth. Semantic and Schematic Similarities between Database Objects: A Context-Based Approach. VLDB Journal 5(4), 1996: 276-304.
9. Miyoung Cho, Junho Choi and Pankoo Kim. An Efficient Computational Method for Measuring Similarity between Two Conceptual Entities. WAIM 2003: 381-388.