Self-tuning in Graph-Based Reference Disambiguation


1 Self-tuning in Graph-Based Reference Disambiguation
Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine

2 Overview
Intro to Data Cleaning
  Entity resolution
RelDC Framework
  Past work
Adapting to data
  The new part
  Reduction to an optimization problem
  Linear programming
Experiments
11 January 2019 DASFAA 2007, Bangkok, Thailand

3 Data Cleaning
Analysis on bad data leads to wrong conclusions

4 Example of the problem: CiteSeer top-K
CiteSeer: the top-k most cited authors
Suspicious entries: “A. Gupta” and “L. Zhang”
Let's go to the DBLP website, which stores bibliographic entries of many CS authors, and check these two people.
They are in the top 20 because many different authors share each of these names.
[Figures: CiteSeer top-k list; DBLP entries]

5 Two Most Common Entity-Resolution Challenges
Fuzzy lookup (reference disambiguation)
  Match references to objects
  The list of all objects is given
Fuzzy grouping
  Group together object representations that correspond to the same object

6 Standard Approach to Entity Resolution


8 RelDC Framework

9 RelDC Framework
Past work: SDM’05, TODS’06
Domain-independent framework
  Views the dataset as an entity-relationship graph
  Analyzes paths in this graph
Solid theoretical foundation
  Optimization problem
Scales to large datasets
Robust under uncertainty
High disambiguation quality
No self-tuning
  This paper addresses this limitation

10 Entity-Relationship Graph
Choice node
  For uncertain references
  Encodes the options/possibilities yr1, …, yrN
Among the options yr1, …, yrN
  Pick the most strongly connected one (CAP principle)
  Analyze the paths in G that exist between xr and yrj, for all j
  Use a model to measure connection strength
“Connection strength” model
  c(u,v), for nodes u and v in G: how strongly u and v are connected in G
  RandomWalk-based
  Fixed, based on intuition
  This paper, instead, learns such a model from data.
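The selection step on this slide can be sketched in a few lines. This is only an illustration under assumptions: the function name `resolve_reference`, the node labels, and the toy strength values are all hypothetical, and the connection-strength model c(u,v) is passed in as a plain callable rather than computed from a graph.

```python
from typing import Callable, Hashable, Iterable


def resolve_reference(
    context: Hashable,
    options: Iterable[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option most strongly connected to the context node xr."""
    return max(options, key=lambda option: strength(context, option))


# Hypothetical toy values standing in for c(xr, yrj); not taken from the slides.
toy_strength = {
    ("paper_x", "author_A1"): 0.8,
    ("paper_x", "author_A2"): 0.3,
}

best = resolve_reference(
    "paper_x",
    ["author_A1", "author_A2"],
    lambda u, v: toy_strength[(u, v)],
)
print(best)
```

Passing the model in as a callable keeps the selection rule independent of how c(u,v) is defined, which is the part the rest of the talk replaces with a learned model.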


12 Adaptive Solution
Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN}
If paths p1 and p2 are of the same type, they are treated as identical.
The connection between nodes u and v can then be represented by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v.
If each path type Ti is associated with a weight wi, the connection strength is the weighted sum:
c(u,v) = c1·w1 + c2·w2 + … + cN·wN
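As a rough illustration of the count-vector idea, the sketch below assumes the connection strength is the sum of each path-type count times its weight. The type labels and weight values are made up for the example, and `connection_strength` is a hypothetical helper name.

```python
from collections import Counter
from typing import Mapping, Sequence


def connection_strength(path_types: Sequence[str], weights: Mapping[str, float]) -> float:
    """Weighted sum over path types: count of each type times its weight."""
    counts = Counter(path_types)  # the path-type count vector Tuv
    # Types without a known weight contribute nothing (an assumption of this sketch).
    return sum(count * weights.get(ptype, 0.0) for ptype, count in counts.items())


# Hypothetical data: three paths between u and v, two of type T1 and one of type T2.
found_path_types = ["T1", "T1", "T2"]
toy_weights = {"T1": 0.5, "T2": 0.25, "T3": 1.0}

strength = connection_strength(found_path_types, toy_weights)
print(strength)  # 2 * 0.5 + 1 * 0.25 = 1.25
```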

13 Problems to Answer
How will we classify the paths?
How will we associate each path type with a weight?

14 Classifying Paths
Path Type Model (PTM):
  Views each path as a sequence of edges <e1, e2, e3, …, en>
  Each edge ei has a type Ei associated with it
  Thus, each path p can be associated with a string <E1, E2, E3, …, En>
  Different strings correspond to different path types
  Associate a weight with each string
Different models are also possible
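The PTM mapping from a path to its type string might look like the following sketch. The edge representation, the edge-type names, and the helper `path_type` are assumptions chosen for illustration; the slides do not specify a data structure.

```python
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]  # (source node, target node); a hypothetical representation


def path_type(path: Sequence[Edge], edge_types: Dict[Edge, str]) -> Tuple[str, ...]:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_types[edge] for edge in path)


# Hypothetical edge types for a tiny publications-style graph.
toy_edge_types = {
    ("paper1", "author1"): "writes",
    ("author1", "org1"): "affiliated",
    ("paper2", "author2"): "writes",
    ("author2", "org1"): "affiliated",
}

p1 = [("paper1", "author1"), ("author1", "org1")]
p2 = [("paper2", "author2"), ("author2", "org1")]

# Both paths have the type string ("writes", "affiliated"), so PTM treats them as identical.
print(path_type(p1, toy_edge_types) == path_type(p2, toy_edge_types))
```

Using the tuple of edge types as the path type makes it directly usable as a dictionary key when weights are later attached to each type.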

15 Learning Path Weights: Optimization Problem
The CAP principle states that the right option will be better connected
Linear programming
  Learn the path-type weights wi

16 Final Solution
The value of c(xr,yrj) - c(xr,yrl) should be maximized for all r and all l ≠ j
Then the final solution: [equation not preserved in this transcript]
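One simple reading of this objective can be sketched as follows. It assumes, beyond what the slide states, that connection strength is linear in the weights, that yrj is the correct option, that every weight is bounded by 0 ≤ wi ≤ 1, and that the objective is the plain sum of the differences c(xr,yrj) - c(xr,yrl). Under those assumptions the linear program separates per weight, so no solver is needed. This is not necessarily the authors' exact program, and `learn_weights` and the training data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

CountVector = Sequence[int]
# One training item per reference r: (counts for the correct option, counts for each wrong option).
TrainingItem = Tuple[CountVector, Sequence[CountVector]]


def learn_weights(training: Sequence[TrainingItem]) -> List[Optional[float]]:
    """Maximize the summed strength differences subject to 0 <= w_i <= 1.

    Returns 1.0 or 0.0 per path type, or None when the objective does not
    depend on that weight (any value in [0, 1] is then optimal).
    Assumes `training` is non-empty and all count vectors share one length.
    """
    n = len(training[0][0])
    coeff = [0] * n  # objective coefficient of each weight w_i
    for right, wrongs in training:
        for wrong in wrongs:
            for i in range(n):
                coeff[i] += right[i] - wrong[i]
    # A linear objective over a box is maximized coordinate by coordinate.
    return [1.0 if c > 0 else 0.0 if c < 0 else None for c in coeff]


# Hypothetical training data: one reference with one wrong option, four path types.
toy_training = [([2, 1, 0, 0], [[1, 1, 1, 2]])]
learned = learn_weights(toy_training)
print(learned)
```

With additional constraints or slack variables the problem no longer separates this way, and a general LP solver would be the natural tool.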

17 Example: Graph
P1 = e1-e3-e
P2 = e1-e1-e3
P3 = e1-e2-e2-e
P4 = e1-e2-e3-e2-e3

18 Example: Solution
w1 = 1
w3 = w4 = 0
w2 can be anything between 0 and 1.


20 Experimental Setup
Parameters
  When looking for L-short simple paths, L = 5
  L is the path-length limit
RealMov:
  movies (12K)
  people (22K): actors, directors, producers
  studios (1K): producing, distributing
  ground truth is known
SynPub datasets:
  many datasets of five different types
  emulation of RealPub
  publications (5K)
  authors (1K)
  organizations (25K)
  departments (125K)
  ground truth is known
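To make the path-length limit concrete, here is a minimal depth-first sketch that enumerates simple paths with at most L edges. It assumes an adjacency-list graph and reads "L-short" as "length not exceeding L"; the function name and the toy graph are hypothetical, and a real implementation on graphs of the sizes listed above would need pruning that this sketch omits.

```python
from typing import Dict, Iterator, List

Graph = Dict[str, List[str]]  # adjacency list; a hypothetical representation


def l_short_simple_paths(graph: Graph, source: str, target: str, limit: int) -> Iterator[List[str]]:
    """Yield every simple path from source to target that has at most `limit` edges."""

    def extend(path: List[str]) -> Iterator[List[str]]:
        node = path[-1]
        if node == target:
            yield list(path)
            return
        if len(path) - 1 == limit:  # the path already uses `limit` edges
            return
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # keep the path simple (no repeated nodes)
                path.append(neighbor)
                yield from extend(path)
                path.pop()

    yield from extend([source])


# Hypothetical undirected toy graph.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}

paths_within_2 = sorted(l_short_simple_paths(toy_graph, "a", "d", 2))
paths_within_3 = list(l_short_simple_paths(toy_graph, "a", "d", 3))
print(paths_within_2)
print(len(paths_within_3))
```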

21 Experimental Results on Movies
Parameters:
  Fraction: the fraction of uncertain references in the dataset
  Each reference has 2 choices

22 Experimental Results on Movies - II
Number of options based on a PMF distribution

23 Experimental Results on SynPub
RandomWalk, PTM, and the Hybrid Model have the same accuracy
Is RandomWalk the optimum model for the Publications domain?
Hybrid Model: [formula not preserved in this transcript]

24 Effect of Random Relationships in the Publications Domain

25 Summary
Main contribution
  An adaptive solution for connection strength
  The model learns the weights of different path types
Ongoing work
  Using different models to learn the importance of paths in the connection strength
  Using standard machine learning techniques, such as decision trees, for learning
  Different ways to classify paths

26 Contact Information
RelDC project (RESCUE)
Rabia Nuray-Turan (contact author)
Dmitri V. Kalashnikov
Sharad Mehrotra

27 Thank you!

