Relational Clustering for Entity Resolution Queries Indrajit Bhattacharya, Louis Licamele and Lise Getoor University of Maryland, College Park.


1 Relational Clustering for Entity Resolution Queries Indrajit Bhattacharya, Louis Licamele and Lise Getoor University of Maryland, College Park

2 The Entity Resolution Problem
- Discover the domain entities
- Map each reference to an entity
Entities: Abdulla Ansari, WeiWei Wang, Chih Chen, Wenyi Wang, Liyuan Li, Chien-Te Chen
References:
- P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
- P2: “A better mouse immunity model”, W.Wang, A.Ansari
- P3: “Measuring protein-bound fluxetine”, L.Li, C.Chen, W.Wang
- P4: “Autoimmunity in biliary cirrhosis”, W.W.Wang, A.Ansari

3 Query-time ER: Motivation
- Most publicly available databases do not have resolved entities
  - PubMed, CiteSeer have many unresolved authors
- Millions of queries every day require resolved entities directly or indirectly
  - “I am looking for all papers by Stuart Russell”
- How do we address this problem?
  1. Leave the burden on the user to do the resolution
  2. Ask owners to ‘clean’ their databases
  3. Develop techniques for query-time resolution

4 Entity Resolution Queries
- Disambiguation Query
  - Among all papers with ‘W Wang’ as author, find those written by WeiWei Wang
- Resolution Query
  - Do disambiguation
  - Also retrieve papers by WeiWei Wang with a different author name, e.g. ‘W W Wang’
Example papers shown on the slide, first group:
- P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
- P2: “A better mouse immunity model”, W.Wang, A.Ansari
- P4: “Autoimmunity in biliary cirrhosis”, W.W.Wang, A.Ansari
Example papers shown on the slide, second group:
- P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
- P2: “A better mouse immunity model”, W.Wang, A.Ansari
- P3: “Measuring protein-bound fluxetine”, L.Li, C.Chen, W.Wang

5 Query-time ER using Relations
1. Simple approach for resolving queries
  - Use attributes
  - Quick but not accurate
2. Use best techniques available
  - Collective resolution using relationships
  - How can we localize collective resolution?
- Two-phase collective resolution for a query
  - Extract a minimal set of relevant records
  - Collective resolution on the extracted records

6 Cut-based Evaluation of Relational Clustering
- Vertices embedded in attribute space
- Additional (hyper)edges represent relationships
[Figure: two alternative clusterings of the same vertices into clusters C1, C2, C3, C4]
- First clustering
  - Good separation of attributes
  - Many cluster-cluster relationships: C1-C3, C1-C4, C2-C4
- Second clustering
  - Worse in terms of attributes
  - Fewer cluster-cluster relationships: C1-C3, C2-C4

7 A Cut-based Objective Function
[Equation not preserved in this transcript; only its annotations remain]
- weight for attributes
- weight for relations
- similarity of attributes
- 1 iff a relational edge exists between c_i and c_j
- compatibility of c_i and c_j
- Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function
- Common cluster neighborhood: Jaccard works better than intersection
- Similarity of attributes: Jaro, Levenshtein; TF-IDF
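The greedy merging idea on this slide can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' algorithm: the objective function itself is not preserved in the transcript, so the sketch merges the most similar cluster pair above a threshold instead of computing a reduction in the objective; `difflib.SequenceMatcher` stands in for Jaro or Levenshtein similarity; and the names `w_attr`, `w_rel` and `threshold`, along with their default values, are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def attr_sim(names_a, names_b):
    """Best pairwise string similarity between the names of two clusters."""
    return max(SequenceMatcher(None, a, b).ratio() for a in names_a for b in names_b)

def jaccard(s, t):
    """Jaccard coefficient of two sets (0.0 when both are empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def greedy_relational_clustering(papers, w_attr=0.5, w_rel=0.5, threshold=0.45):
    """Repeatedly merge the most similar cluster pair until none exceeds threshold.

    papers maps a paper id to its list of author names; each paper is treated
    as a hyper-edge and each (paper, name) pair as one reference.
    """
    refs = [(p, n) for p, names in papers.items() for n in names]
    clusters = {i: {ref} for i, ref in enumerate(refs)}  # start from singletons

    def neighborhood(cid, label):
        # Labels of clusters that share a hyper-edge with this cluster.
        return {label[(p, n)] for (p, _) in clusters[cid] for n in papers[p]} - {cid}

    while True:
        label = {ref: cid for cid, members in clusters.items() for ref in members}
        best, best_pair = threshold, None
        for a, b in combinations(clusters, 2):
            names_a = {n for _, n in clusters[a]}
            names_b = {n for _, n in clusters[b]}
            sim = (w_attr * attr_sim(names_a, names_b)
                   + w_rel * jaccard(neighborhood(a, label), neighborhood(b, label)))
            if sim > best:
                best, best_pair = sim, (a, b)
        if best_pair is None:
            return list(clusters.values())
        a, b = best_pair
        clusters[a] |= clusters.pop(b)  # merge b into a

# Hypothetical two-paper example with the same two author names on each paper.
TWO_PAPERS = {"P1": ["A Ansari", "W Wang"], "P2": ["A Ansari", "W Wang"]}
result = greedy_relational_clustering(TWO_PAPERS)
```

In this toy run the two 'A Ansari' references merge first on attribute similarity alone; the shared 'Ansari' cluster then raises the Jaccard neighborhood similarity of the two 'W Wang' references, which is the kind of relational evidence the slide describes.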

8 Extracting Relevant Records
- Start with the query name or record
- Alternate between
  1. Name expansion: for any relevant record, include other records with that name
  2. Hyper-edge expansion: for any relevant record, include other related records
- Terminate at some depth k
[Figure: expansion from the query ‘W Wang’ through Level 0, Level 1 and Level 2, alternating name expansion and hyper-edge expansion. References shown: P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang; P1: A Ansari, P2: A Ansari, P4: A Ansari, P1: C Chen, P3: C Chen, P3: L Li; P: A Ansari, P: C Chen, P: L Li]
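The alternating expansion described on this slide might look like the sketch below. It is an illustration only: it assumes name expansion matches names exactly (the slide's figure appears to treat 'W W Wang' as matching 'W Wang', which this sketch does not do), and the function names `index_by_name` and `extract_relevant` are hypothetical.

```python
from collections import defaultdict

# Toy data from the slides: each paper is a hyper-edge over author references.
PAPERS = {
    "P1": ["W Wang", "C Chen", "A Ansari"],
    "P2": ["W Wang", "A Ansari"],
    "P3": ["L Li", "C Chen", "W Wang"],
    "P4": ["W W Wang", "A Ansari"],
}

def index_by_name(papers):
    """Map each name to the set of (paper, name) references carrying it."""
    by_name = defaultdict(set)
    for paper, names in papers.items():
        for name in names:
            by_name[name].add((paper, name))
    return by_name

def extract_relevant(papers, query_name, depth):
    """Alternate hyper-edge and name expansion for `depth` levels past level 0."""
    by_name = index_by_name(papers)
    relevant = set(by_name.get(query_name, ()))  # level 0: name expansion of the query
    frontier = set(relevant)
    for level in range(1, depth + 1):
        if level % 2 == 1:
            # Hyper-edge expansion: references co-occurring in the same paper.
            found = {(p, n) for (p, _) in frontier for n in papers[p]}
        else:
            # Name expansion: other references with the same (exact) name.
            found = {ref for (_, n) in frontier for ref in by_name[n]}
        frontier = found - relevant
        relevant |= frontier
    return relevant

level0 = extract_relevant(PAPERS, "W Wang", 0)
level3 = extract_relevant(PAPERS, "W Wang", 3)
```

With exact matching, 'W W Wang' on P4 is only reached at depth 3, by way of the shared co-author name 'A Ansari'.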

9 Adaptive Expansion for a Query
- Too many records with unconstrained expansion
  - Adaptively select records based on ‘ambiguity’
  - ‘Chen’ is more ambiguous than ‘Ansari’
- Adaptive name expansion
  - Expand the more ambiguous records
  - They need extra evidence
- Adaptive hyper-edge expansion
  - Add fewer ambiguous records
  - They lead to imprecision
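One simple way to picture the two adaptive rules is a threshold on an ambiguity score. The slide does not say how records are selected, so the hard threshold, the function names and the scores below are all assumptions made for illustration.

```python
def adaptive_name_expansion(names, ambiguity, threshold):
    """Keep the more ambiguous names for name expansion; they need extra evidence."""
    return [n for n in names if ambiguity.get(n, 0.0) >= threshold]

def adaptive_hyperedge_expansion(neighbor_names, ambiguity, threshold):
    """Keep the less ambiguous neighbors; ambiguous ones lead to imprecision."""
    return [n for n in neighbor_names if ambiguity.get(n, 0.0) < threshold]

# Hypothetical ambiguity scores in [0, 1]; not taken from the presentation.
AMBIGUITY = {"C Chen": 0.8, "L Li": 0.5, "A Ansari": 0.1}
names = ["C Chen", "L Li", "A Ansari"]

to_expand = adaptive_name_expansion(names, AMBIGUITY, threshold=0.5)
to_add = adaptive_hyperedge_expansion(names, AMBIGUITY, threshold=0.5)
```

A probabilistic selection, where the chance of expanding a record grows with its ambiguity, would be an equally plausible reading of the slide.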

10 Unsupervised Estimation of Ambiguity
- Probability of multiple entities sharing an attribute value
- Estimate the ambiguity of one single-valued attribute (A1=a) using another (A2)
  - Count the number of different values of A2 observed for records having A1=a
  - e.g. number of different first initials for last name ‘Smith’
- Estimate improves with more independent attributes
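The counting step can be sketched as below. The records are hypothetical and the function returns raw counts; how the presentation turns such counts into a probability is not stated, so no normalization is attempted.

```python
from collections import defaultdict

def estimate_ambiguity(records, a1, a2):
    """Map each observed value of a1 to the number of distinct a2 values seen with it."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[a1]].add(rec[a2])
    return {value: len(others) for value, others in seen.items()}

# Hypothetical author references (last name, first initial); not from the datasets.
RECORDS = [
    {"last": "Chen", "initial": "C"},
    {"last": "Chen", "initial": "M"},
    {"last": "Chen", "initial": "Y"},
    {"last": "Ansari", "initial": "A"},
    {"last": "Ansari", "initial": "A"},
]

ambiguity = estimate_ambiguity(RECORDS, "last", "initial")
```

In this made-up sample 'Chen' appears with three different initials and 'Ansari' with one, so 'Chen' would be treated as the more ambiguous last name.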

11 Evaluation Datasets
- arXiv High Energy Physics
  - 29,555 publications, 58,515 refs to 9,200 authors
  - Queries: all ambiguous names (75 in total)
  - True authors per name: 2 to 11 (avg. is 2.4)
- Elsevier BioBase
  - 156,156 publications, 831,991 author refs
  - Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
  - Queries: 100 most frequent names
  - True authors per name: 1 to 100 (avg. is 32)

12 Growth Rate of Relevant Records and Query Processing Time
- Number of relevant references grows rapidly with expansion depth
- RC-ER is fast but not good enough for query-time resolution

13 Query-time ER Results
- Unconstrained expansion
  - Collective resolution is more accurate
  - Accuracy improves beyond depth 1
- Adaptive expansion
  - Minimal loss in accuracy
  - Dramatic reduction in query processing time
Legend:
- A: pair-wise attribute similarity
- A+N: also neighbors’ attributes
- *: transitive closure
- AX-2: adaptive expansion at depths 2 and beyond
- AX-1: adaptive expansion even at depth 1

14 Conclusions
- Query-centric entity resolution
- Cut-based evaluation of relational clustering
- Adaptive selection of relevant references for a query
- Resolution at query-time with minimal loss in accuracy
Future Directions
- Spectral algorithm for relational clustering
- Stronger coupling between extraction and resolution
- Localized resolution for incoming records

15 References
- "Query-Time Entity Resolution", Indrajit Bhattacharya, Louis Licamele and Lise Getoor, ACM SIGKDD, 2006
- "A Latent Dirichlet Model for Unsupervised Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIAM Data Mining, 2006
- "Entity Resolution in Graphs", Indrajit Bhattacharya and Lise Getoor, chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, editors, Wiley, 2006 (to appear)
- "Relational Clustering for Multi-type Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIGKDD Workshop on Multi Relational Data Mining (MRDM), 2005

