Progressive Approach to Relational Entity Resolution. Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra.

1 Progressive Approach to Relational Entity Resolution. Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

2 Data Processing Flow
Data, Analysis, Decision
Quality of Data, Quality of Analysis, Quality of Decision
Data Quality Challenges:
• Erroneous Values.
• Missing Values.
• Duplication.
• …
Data Cleaning Accounts For: 80%

3 Entity Resolution (ER)
[Figure: Real World Objects and Digital World Entities. Example: Michael Jordan (Basketball Player) and Michael Jordan (Professor @ UCB).]

4 Entity Resolution (ER)
Id | Product Name | Price
p1 | IPad Two 16GB WiFi | $490
p2 | IPad 2nd Generatation 16GB WiFi | $469
p3 | Apple Phone 4 32 GB | $545
p4 | Apple iPod Shuffle 2GB | $49
p5 | IPhone 4th Generation 32GB | $520
[Figure: records p1, p2, p4, p3, p5.]

5 Blocking
Dataset:
Id | Product Name | Price
p1 | IPad Two 16GB WiFi | $490
p2 | IPad 2nd Generatation 16GB WiFi | $469
p3 | Apple Phone 4 32 GB | $545
p4 | Apple iPod Shuffle 2GB | $49
p5 | IPhone 4th Generation 32GB | $520
Blocking function: BF = 1st char of product name
[Figure: blocking functions BF 1, BF 2, …, BF; records p1, p2, p5, p3, p4, … placed into Blocks.]
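To make the blocking step concrete, here is a minimal Python sketch that applies the slide's blocking function (first character of the product name) to the five example records. The helper names `first_char_key` and `build_blocks` are hypothetical, and upper-casing the key is an assumption not stated on the slide.

```python
from collections import defaultdict

# Records from the example table: (Id, Product Name, Price in dollars).
RECORDS = [
    ("p1", "IPad Two 16GB WiFi", 490),
    ("p2", "IPad 2nd Generatation 16GB WiFi", 469),
    ("p3", "Apple Phone 4 32 GB", 545),
    ("p4", "Apple iPod Shuffle 2GB", 49),
    ("p5", "IPhone 4th Generation 32GB", 520),
]


def first_char_key(name: str) -> str:
    """Blocking function from the slide: 1st char of the product name.

    Upper-casing is an assumption made here so that case differences
    do not split a block.
    """
    return name[:1].upper()


def build_blocks(records, key_fn):
    """Group record ids by their blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec_id, name, _price in records:
        blocks[key_fn(name)].append(rec_id)
    return dict(blocks)


blocks = build_blocks(RECORDS, first_char_key)
for key, members in blocks.items():
    print(key, members)
```

Only records that share a block are later compared, so p1, p2 and p5 end up in one block while p3 and p4 share another.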

6 Similarity Computation
Id | Product Name | Price
p3 | Apple Phone 4 32 GB | $545
p4 | Apple iPod Shuffle 2GB | $49
Similarity Functions:
Resolve Function: Resolve ( ) = duplicate, distinct, or uncertain
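The slide's similarity and resolve formulas did not survive in the transcript, so the following Python sketch only illustrates the general shape: a few similarity functions feed a resolve function that returns duplicate, distinct, or uncertain. The token Jaccard measure, the price measure, the equal weights, and the thresholds `low` and `high` are all assumptions chosen for illustration, not the authors' definitions.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def price_similarity(p: float, q: float) -> float:
    """1.0 for equal prices, falling toward 0.0 as they diverge."""
    top = max(p, q)
    return 1.0 if top == 0 else 1.0 - abs(p - q) / top


def resolve(rec_a: dict, rec_b: dict, low: float = 0.3, high: float = 0.8) -> str:
    """Hypothetical resolve function: equal-weight average plus two thresholds."""
    score = 0.5 * jaccard(rec_a["name"], rec_b["name"]) \
        + 0.5 * price_similarity(rec_a["price"], rec_b["price"])
    if score >= high:
        return "duplicate"
    if score <= low:
        return "distinct"
    return "uncertain"


p3 = {"name": "Apple Phone 4 32 GB", "price": 545}
p4 = {"name": "Apple iPod Shuffle 2GB", "price": 49}
print(resolve(p3, p4))
```

With these assumed settings the pair (p3, p4) shares only the token "apple" and has very different prices, so it falls below the lower threshold.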

7 Progressive ER

8 Real-time Analysis of Big Data
• Event Monitoring
• Situational Awareness
• Real-time Alerts
• Semantic Search
• Anti-terrorism Applications
Data Cleaning

9 How Progressive ER Helps
[Figure: Progressive Data Cleaning, Progressive Analysis, Continually Refined Results.]

10 Relational Dataset
Paper
Id | Title | Authors | Venue
p1 | Transaction Support in Read Optimized … | {a1, a2} | u1
p2 | Read Optimized File System Designs: … | {a1} | u2
p3 | Transaction Support in Read Optimized … | {a3, a4} | u3
p4 | Berkeley DB: A Retrospective.. | {a3} | u4
Author
Id | Name | Papers
a1 | Marge Seltzer | {p1, p2}
a2 | Michael Stonebraker | {p1}
a3 | Margo I. Seltzer | {p3, p4}
a4 | M. Stonebraker | {p3}
Venue
Id | Name | Papers
u1 | Very Large Data Bases | {p1}
u2 | ICDE Conference | {p2}
u3 | VLDB | {p3}
u4 | IEEE Data Eng. Bull | {p4}

11 Graph Representation
[Figure: nodes (p1, p3) and (u1, u3); labels: Resolve, duplicate, duplicate.]

12 Problem Definition
• Given a relational dataset D, and a cost budget BG,
• Our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.

13 ER Graph
[Figure: blocks R1, S1, R2, T2, T1, S2.]

14 ER Graph
[Figure: blocks R1, S1, R2, T2, T1, S2 with nodes v1 to v12.]

15 Partially Constructed Graph
[Figure: blocks R1, S1, T1, R2, T2, S2 with nodes v1 to v12.]

16 Overview
[Figure: budget BG spanning Window 1, Window 2, …, Window n.]
1. Plan Generation.
2. Plan Execution ( ).
Resolution Plan ( ):
• Set of blocks ( ) to be instantiated.
• Set of nodes ( ) to be resolved.
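The window structure on this slide can be sketched as a simple loop: each window generates a resolution plan (blocks to instantiate, nodes to resolve) and then executes it, until the budget BG is used up. In the Python sketch below, `run_progressive`, the fixed `window_cost`, and the two callbacks are hypothetical names and assumptions; the transcript does not say how window sizes are chosen.

```python
from typing import Callable, List, Tuple

# A resolution plan: (blocks to instantiate, nodes to resolve).
Plan = Tuple[List[str], List[str]]


def run_progressive(
    budget: float,
    window_cost: float,
    generate_plan: Callable[[float], Plan],
    execute_plan: Callable[[Plan, float], float],
) -> Tuple[List[Plan], float]:
    """Alternate plan generation and plan execution until BG is spent.

    execute_plan returns the cost it actually used, which is assumed
    to be at most the allowance it was given.
    """
    spent = 0.0
    executed: List[Plan] = []
    while spent < budget:
        allowance = min(window_cost, budget - spent)
        plan = generate_plan(allowance)        # phase 1: plan generation
        used = execute_plan(plan, allowance)   # phase 2: plan execution
        if used <= 0:                          # nothing left to resolve
            break
        spent += used
        executed.append(plan)
    return executed, spent


# Stub callbacks for demonstration only.
def demo_generate(allowance: float) -> Plan:
    return (["R1"], ["v1", "v2"])


def demo_execute(plan: Plan, allowance: float) -> float:
    return allowance  # pretend the whole allowance was consumed


plans, spent = run_progressive(10.0, 4.0, demo_generate, demo_execute)
print(len(plans), spent)
```

The last window receives only the remaining budget, so the total cost never exceeds BG.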

17 Plan Execution Phase
[Figure: blocks R1, S1, T1, S2, R2, T2 with nodes v1 to v12.]

18 Plan Cost and Benefit

19 Node Benefit
• Direct Benefit
• Indirect Benefit
[Figure: nodes v1 to v6; label: State.]

20 Probability Estimation
Noisy-OR Model (Cause, …, Effect)
Effect:
• Node v_i being duplicate.
Causes:
• Influencing duplicate nodes of v_i.
• Block to which v_i belongs: fraction of duplicate pairs in the block.
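For reference, the standard noisy-OR combination says the effect occurs unless every active cause independently fails to trigger it: P(effect) = 1 - Π(1 - p_i). The Python sketch below shows that formula only; how the authors derive each cause's probability is not in the transcript, so the two input values are hypothetical.

```python
from math import prod


def noisy_or(cause_probs):
    """Standard noisy-OR: P(effect) = 1 - prod(1 - p_i) over active causes."""
    return 1.0 - prod(1.0 - p for p in cause_probs)


# Hypothetical inputs for a node v_i:
#   0.5 -> contribution of one influencing node already found duplicate
#   0.2 -> fraction of duplicate pairs in the block that v_i belongs to
p_duplicate = noisy_or([0.5, 0.2])
print(round(p_duplicate, 3))
```

With no active causes the product is empty and the estimate is 0, and each additional cause can only raise the probability.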

21 Example
[Figure: blocks R1, S1, T1, R2, T2, S2 with nodes v1 to v12; labels: duplicate, distinct; highlighted: v1, v2, S1.]

22 Node Impact
[Figure: nodes v1 to v7; labels: Dependent Nodes, Nearest Nodes (K=2).]
1. Belief Update: NP-hard.
2. The Nearest Nodes are not always instantiated.
Why?

23 Impact Model
[Figure: Case #1, Case #2, Case #3, each over nodes v1 to v7.]

24 Plan Generation Phase
1. Benefit-vs-Cost Analysis:
• Each node and block has an updated cost and benefit.
2. Generate a plan such that:
• [constraint formula lost in extraction]
• [objective formula lost in extraction] is maximized.
NP-hard (Oregon-Trail Knapsack).
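Because the slide states that plan generation is NP-hard, some heuristic is needed in practice. The Python sketch below is a generic greedy benefit-per-cost knapsack heuristic, offered only to illustrate the benefit-vs-cost trade-off; it is not the authors' plan generation algorithm. The candidate names, costs, and benefits are hypothetical.

```python
def greedy_plan(candidates, budget):
    """Pick items by descending benefit/cost ratio while they fit the budget.

    candidates: list of (name, cost, benefit) tuples with cost > 0.
    Returns (selected names, total cost spent).
    """
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    plan, spent = [], 0.0
    for name, cost, _benefit in ranked:
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan, spent


# Hypothetical nodes with made-up costs and benefits.
candidates = [
    ("v1", 2.0, 6.0),    # ratio 3.0
    ("v2", 3.0, 3.0),    # ratio 1.0
    ("v6", 1.0, 2.0),    # ratio 2.0
    ("v10", 4.0, 10.0),  # ratio 2.5
]
plan, spent = greedy_plan(candidates, budget=6.0)
print(plan, spent)
```

A greedy ratio rule is fast but can miss the optimum; it also ignores the dependency between a node and the block that must be instantiated first, which the slide's knapsack variant presumably has to handle.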

25 Plan Generation Algorithm
[Figure: Step#1 and Step#2. Instantiated Unresolved Nodes: v1, v2, v4, v6, v7, v10, v13, v15, v16, v21, followed by v1, v2, v6, v10, v16. Uninstantiated Blocks: R1, R2, R4, R5, R6, R8, R9.]

26 Plan Generation Algorithm
[Figure: Step#3, with a test of the form "If … > … else return … and …" whose operands were lost in extraction. Blocks: R1, R8, R6, R2, …. Nodes: v1, v2, v6, v10, v16 and v1, v2, v10, v30, v32, v34, v36, v38, v40, v42, v45, v47, v48.]

27 Lazy Resolution with Workflow
How to resolve v1?
Resolve … duplicate or distinct
[Figure: Workflow of v1.]
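One way to picture lazy resolution is a workflow that applies similarity functions one at a time and stops as soon as the outcome can no longer change. The Python sketch below assumes a weighted-sum decision rule with a single threshold; the transcript does not give the authors' actual rule, and `lazy_resolve`, the weights, and the costs are hypothetical.

```python
def lazy_resolve(pair, workflow, threshold=0.5):
    """Apply similarity functions in order and stop once the result is fixed.

    workflow: ordered list of (weight, cost, sim_fn); weights are assumed
    to sum to 1 and each sim_fn returns a value in [0, 1].
    Returns (decision, total cost of the functions actually applied).
    """
    partial = 0.0
    remaining = sum(weight for weight, _cost, _fn in workflow)
    cost = 0.0
    for weight, fn_cost, sim_fn in workflow:
        partial += weight * sim_fn(pair)
        remaining -= weight
        cost += fn_cost
        if partial >= threshold:              # already a duplicate
            return "duplicate", cost
        if partial + remaining < threshold:   # can no longer reach threshold
            return "distinct", cost
    return ("duplicate" if partial >= threshold else "distinct"), cost


calls = []


def exact_match(pair):
    """Cheap function: 1.0 only when both strings are identical."""
    return 1.0 if pair[0] == pair[1] else 0.0


def expensive_sim(pair):
    """Stand-in for a costly function; records that it was invoked."""
    calls.append(pair)
    return 0.5


workflow = [(0.6, 1.0, exact_match), (0.4, 10.0, expensive_sim)]
print(lazy_resolve(("VLDB", "VLDB"), workflow))
print(lazy_resolve(("VLDB", "ICDE Conference"), workflow))
```

In both calls the cheap function settles the outcome, so the expensive one is never invoked and only its cost of 1.0 is paid.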

28 Contribution of Functions

29 Workflow Generation (node v_i)

30 [Figure: node v_i.]

31 Resolution Cost of v_i
• Resolution Cost when v_i is duplicate.
• Resolution Cost when v_i is distinct.

32 Experimental Evaluation
CiteSeerX Dataset
1. Papers (P) = (Title, Abstract, Keywords, Authors, Venue).
2. Authors (A) = (Name, Email, Affiliation, Address, Paper).
3. Venues (U) = (Name, Year, Pages, Papers).
Entity Set | Number of Entities | Blocking Functions | Similarity Functions | Resolve Function
P | 30,000 | 2 | 3 | Naïve Bayes
A | 83,152 | 1 | 4 | Naïve Bayes
U | 30,000 | 1 | 3 | Naïve Bayes

33 CiteSeerX - Blocking
• Papers (P)
  • First three characters of title.
  • Last three characters of title.
• Authors (A)
  • First one character of first name appended with the first two characters of last name.
• Venues (U)
  • First two characters of name appended with the first two digits of year.
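The blocking functions listed above translate directly into small key functions. The Python sketch below follows the slide's wording; the function names, the lower-casing and whitespace trimming, and the example year 1994 are assumptions added for illustration.

```python
def paper_keys(title: str) -> dict:
    """Papers: one key per blocking function (title prefix and title suffix)."""
    t = title.strip().lower()
    return {"prefix": t[:3], "suffix": t[-3:]}


def author_key(first_name: str, last_name: str) -> str:
    """Authors: first character of first name + first two of last name."""
    return (first_name[:1] + last_name[:2]).lower()


def venue_key(name: str, year: int) -> str:
    """Venues: first two characters of name + first two digits of year."""
    return name[:2].lower() + str(year)[:2]


print(paper_keys("Berkeley DB: A Retrospective"))
print(author_key("Margo", "Seltzer"), author_key("Marge", "Seltzer"))
print(venue_key("VLDB", 1994))  # the year is a made-up example value
```

Note that the two author spellings from the relational example, "Marge Seltzer" and "Margo I. Seltzer", would receive the same key under this scheme and therefore land in the same block.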

34 Experimental Evaluation
Algorithms:
1. DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
2. Static: S. E. Whang et al. Joint entity resolution. ICDE.
3. Full: No lazy resolution strategy.
4. Random: Lazy resolution strategy but with random order.
[Figure: entity sets R, T, S with blocks R1, R4, R5, …; T6, T1, T3, …; S2, S6, S5, ….]

35 Time vs. Recall

36 Lazy Resolution with Workflow
Measure | Our Approach | Random | Full
Execution Time (sec) | 300.33 | 396.55 | 542.43
Plan Generation | 4.76% | 3.81% | 2.58%
Plan Execution | 95.11% | 96.17% | 97.40%
Plan Execution comprises:
• Reading Blocks.
• Creating Nodes.
• Resolving Nodes.
Measure | Our Approach | Random | Full
Execution Time (sec) | 300.33 | 396.55 | 542.43
Plan Generation | 4.76% | 3.81% | 2.58%
Reading Blocks | 4.70% | 3.75% | 2.90%
Graph Creation | 8.40% | 6.25% | 4.72%
Node Resolution | 82.01% | 86.17% | 89.78%

37 Lazy Resolution with Workflow #2
Number of Sim Functions
Set | P | A | U
Set_1 | 3 | 4 | 3
Set_2 | 2 | 2 | 2
Set_3 | 1 | 1 | 1

38 Correlation Among Sim Functions

39 Synthetic Dataset
Parameter | Description | Value
n | Number of entity-sets | 4
s | Number of entities per entity-set | 20,000
b | Number of blocks per entity-set | 100
d | Fraction of duplicate pairs in each entity-set | 0.2
z | Zipfian distribution exponent | 0.15
l | Probability of generating an influence | 0.3

40 Duplicate Distribution Z = 0.00 Z = 0.15 Z = 0.30

41 Number Of Influences l = 0.0 l = 0.3 l = 0.6

42 Conclusion
• Progressive Approach to Relational ER.
• Cost and benefit model for generating a resolution plan.
• Lazy resolution strategy to resolve nodes with the least amount of cost.
• Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.

43 Questions

