
Progressive Approach to Relational Entity Resolution. Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra.


1 Progressive Approach to Relational Entity Resolution. Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

2 Progressive ER

3 Relational Dataset

Paper
Id | Title                                   | Authors  | Venue
p1 | Transaction Support in Read Optimized … | {a1, a2} | u1
p2 | Read Optimized File System Designs: …   | {a1}     | u2
p3 | Transaction Support in Read Optimized … | {a3, a4} | u3
p4 | Berkeley DB: A Retrospective..          | {a3}     | u4

Author
Id | Name                | Papers
a1 | Marge Seltzer       | {p1, p2}
a2 | Michael Stonebraker | {p1}
a3 | Margo I. Seltzer    | {p3, p4}
a4 | M. Stonebraker      | {p3}

Venue
Id | Name                 | Papers
u1 | Very Large Data Bases | {p1}
u2 | ICDE Conference       | {p2}
u3 | VLDB                  | {p3}
u4 | IEEE Data Eng. Bull   | {p4}

4 Graph Representation [Figure: nodes (u1, u3), (p1, p3), (a1, a3), (u2, u4), (a2, a4), (p2, p4); labels: Resolve, duplicate]
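The graph on slide 4 can be sketched in code. This is a minimal illustration, assuming (the transcript does not say so) that two nodes are connected when the records in one pair reference the records in the other pair. The names `related` and `build_edges` and the dictionary layout are hypothetical; the data comes from the slide 3 tables.

```python
from itertools import combinations

# Relationships from the slide 3 tables: paper -> authors, paper -> venue.
paper_authors = {"p1": {"a1", "a2"}, "p2": {"a1"}, "p3": {"a3", "a4"}, "p4": {"a3"}}
paper_venue = {"p1": "u1", "p2": "u2", "p3": "u3", "p4": "u4"}

# Nodes from slide 4: each node is a pair of records that may be duplicates.
nodes = [("p1", "p3"), ("p2", "p4"), ("a1", "a3"),
         ("a2", "a4"), ("u1", "u3"), ("u2", "u4")]


def related(entity):
    """Return the records a paper references (hypothetical helper)."""
    if entity in paper_authors:
        return paper_authors[entity] | {paper_venue[entity]}
    return set()


def build_edges(nodes):
    """Connect a paper pair to each author or venue pair it references on both sides."""
    edges = set()
    for n1, n2 in combinations(nodes, 2):
        # Try both orientations: n1 references n2, or n2 references n1.
        for (x, y), (r, s) in ((n1, n2), (n2, n1)):
            if r in related(x) and s in related(y):
                edges.add(frozenset((n1, n2)))
    return edges


edges = build_edges(nodes)
```

Under this assumption the node (p1, p3) is linked to (a1, a3), (a2, a4), and (u1, u3), so resolving the paper pair can inform the resolution of its author and venue pairs.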

5 Problem Definition
- Given a relational dataset D and a cost budget BG.
- Our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.

6 ER Graph [Figure: blocks R1, S1, R2, T2, T1, S2]

7 ER Graph [Figure: blocks R1, S1, R2, T2, T1, S2 with nodes V1 to V12]

8 Partially Constructed Graph [Figure: blocks R1, S1, R2, T2, T1, S2 with nodes V1 to V12]

9 Resolution Windows
The budget BG is divided into Window 1, Window 2, …, Window n. Each window has two phases:
1. Plan Generation.
2. Plan Execution (Lazy Resolution Strategy).
Resolution Plan:
- Set of blocks to be instantiated.
- Set of nodes to be resolved.

10 Resolution Windows
The budget BG is divided into Window 1, Window 2, …, Window n. Each window has two phases:
1. Plan Generation.
2. Plan Execution.
Resolution Plan:
- Set of blocks to be instantiated.
- Set of nodes to be resolved.
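The window structure above can be sketched as a simple loop. This is an illustration only: the fixed per-window limit `window_cost`, the function name `run_progressive_er`, and the two callables are assumptions, since the transcript does not say how windows are sized or how the two phases are implemented.

```python
def run_progressive_er(budget, window_cost, generate_plan, execute_plan):
    """Spend `budget` cost units in consecutive resolution windows.

    generate_plan(limit) returns a plan that fits within `limit` cost units.
    execute_plan(plan) runs the plan and returns the cost actually spent.
    Both callables are placeholders for the two phases named on the slide.
    """
    spent = 0
    windows = 0
    while spent < budget:
        limit = min(window_cost, budget - spent)  # never exceed the budget
        plan = generate_plan(limit)               # phase 1: plan generation
        if not plan:
            break                                 # nothing left to resolve
        cost = execute_plan(plan)                 # phase 2: plan execution
        if cost <= 0:
            break                                 # guard against a stalled loop
        spent += cost
        windows += 1
    return spent, windows


# Toy usage: ten pending nodes, each costing one unit to resolve.
pending = list(range(10))


def generate_plan(limit):
    return pending[:limit]


def execute_plan(plan):
    del pending[:len(plan)]
    return len(plan)


spent, windows = run_progressive_er(7, 3, generate_plan, execute_plan)
```

With a budget of 7 and a per-window limit of 3, the toy run uses three windows and leaves three nodes unresolved when the budget runs out.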

11 Plan Execution Phase: Lazy Resolution Strategy [Figure: blocks R1, S1, R2, T2, T1, S2 with nodes V1 to V12]

12 Plan Cost and Benefit

13 Node Benefit [Figure: nodes V1 to V6; labels: Direct Benefit, Indirect Benefit]

14 Plan Generation Phase
1. Benefit-vs-Cost Analysis:
- Each node and block has an updated cost and benefit.
2. Generate a plan such that:
- h is maximized.
NP-hard
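Because the slide marks plan generation as NP-hard, an approximate strategy is needed in practice. The sketch below shows one generic heuristic, greedy selection by benefit-per-cost ratio, assuming the objective is total benefit under a cost limit. It is not the authors' algorithm, whose details the transcript does not preserve; the function `greedy_plan` and all benefit and cost values are made up for illustration.

```python
def greedy_plan(candidates, budget):
    """Pick candidates in decreasing benefit/cost order until the budget is used.

    candidates: list of (name, benefit, cost) tuples with cost > 0.
    Returns the chosen names, their total benefit, and their total cost.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    plan, total_benefit, total_cost = [], 0.0, 0.0
    for name, benefit, cost in ranked:
        if total_cost + cost <= budget:  # skip anything that does not fit
            plan.append(name)
            total_benefit += benefit
            total_cost += cost
    return plan, total_benefit, total_cost


# Hypothetical nodes with (benefit, cost) estimates.
candidates = [("v1", 6.0, 2.0), ("v2", 5.0, 5.0), ("v3", 4.0, 1.0), ("v4", 1.0, 2.0)]
plan, benefit, cost = greedy_plan(candidates, budget=5.0)
```

Here the heuristic ranks v3 and v1 first, skips v2 because it no longer fits, and fills the remaining budget with v4. A greedy ranking like this is fast but gives no optimality guarantee in general.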

15 Plan Generation Algorithm: Step#1, Step#2 [Figure: instantiated unresolved nodes v1, v2, v4, v6, v7, v10, v13, v15, v16, v21; uninstantiated blocks R1, R2, R4, R5, R6, R8, R9; selected nodes v1, v2, v6, v10, v16]

16 Plan Generation Algorithm: Step#3 [Figure: nodes v1, v2, v6, v10, v16, v30, v32, v34, v36, v38, v40, v42, v45, v47, v48; blocks R1, R8, R6, R2, …; selected nodes v1, v2, v10, v30; decision rule: "If > else return and"]

17 Experimental Evaluation
Algorithms:
1. DepGraph.
- X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
2. Static.
- S. E. Whang et al. Joint entity resolution. ICDE.
Quality Metric:
[Figure: entity sets R, T, S with blocks R1, R4, R5, …; T6, T1, T3, …; S2, S6, S5, …]

18 Real Dataset - CiteSeerX
- Papers (P): (Title, Abstract, Keywords, Authors, Venue); |P| = 30,000
- Authors (A): (Name, , Affiliation, Address, Paper); |A| = 83,152
- Venues (U): (Name, Year, Pages, Papers); |U| = 30,000

19 Time vs. Recall
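For readers unfamiliar with the recall axis, the sketch below computes pairwise recall, the fraction of true duplicate pairs found so far. The transcript does not preserve the exact quality metric, so this standard definition is an assumption; the function `pairwise_recall` and the ground-truth pairs are illustrative only.

```python
def pairwise_recall(found_pairs, true_pairs):
    """Fraction of true duplicate pairs that have been identified so far."""
    # Use frozensets so that (a, b) and (b, a) count as the same pair.
    truth = {frozenset(p) for p in true_pairs}
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    found = {frozenset(p) for p in found_pairs}
    return len(found & truth) / len(truth)


# Illustrative ground truth and partial result (assumed, not from the slides).
true_pairs = [("p1", "p3"), ("a1", "a3"), ("a2", "a4"), ("u1", "u3")]
found_pairs = [("p3", "p1"), ("a1", "a3"), ("p2", "p4")]
recall = pairwise_recall(found_pairs, true_pairs)
```

In this toy case two of the four true pairs have been found, so recall is 0.5. A progressive approach aims to push this value up as early as possible in the run.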

20 Conclusion
- Progressive Approach to Relational ER.
- Cost and benefit model for generating a resolution plan.
- Lazy resolution strategy to resolve nodes with the least amount of cost.
- Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.

21 Questions

