Published by Gilbert Roberts. Modified over 9 years ago.
1
Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra
2
Data Processing Flow
Data. Analysis. Decision.
Quality of Data. Quality of Analysis. Quality of Decision.
Data Quality Challenges: Erroneous Values. Missing Values. Duplication. ……
Data Cleaning Accounts For: 80%
3
Entity Resolution ( ER )
Digital World Entities. Real World Objects.
Michael Jordan: Basketball Player
Michael Jordan: Professor @ UCB
4
Entity Resolution ( ER )
Id  Product Name                     Price
p1  IPad Two 16GB WiFi               $490
p2  IPad 2nd Generatation 16GB WiFi  $469
p3  Apple Phone 4 32 GB              $545
p4  Apple iPod Shuffle 2GB           $49
p5  IPhone 4th Generation 32GB       $520
Records to resolve: p1, p2, p3, p4, p5
5
Blocking
Dataset:
Id  Product Name                     Price
p1  IPad Two 16GB WiFi               $490
p2  IPad 2nd Generatation 16GB WiFi  $469
p3  Apple Phone 4 32 GB              $545
p4  Apple iPod Shuffle 2GB           $49
p5  IPhone 4th Generation 32GB       $520
Blocking functions: BF 1, BF 2, …
BF = 1st char of product name
Blocks: {p1, p2, p5}, {p3, p4}, …
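The blocking step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (first_char_bf, build_blocks, candidate_pairs) are hypothetical, and uppercasing the key is an assumption made so that "IPad" and "iPad" would share a block.

```python
from collections import defaultdict
from itertools import combinations

# Records from the slide: (id, product name, price in dollars).
products = [
    ("p1", "IPad Two 16GB WiFi", 490),
    ("p2", "IPad 2nd Generatation 16GB WiFi", 469),
    ("p3", "Apple Phone 4 32 GB", 545),
    ("p4", "Apple iPod Shuffle 2GB", 49),
    ("p5", "IPhone 4th Generation 32GB", 520),
]

def first_char_bf(name):
    """Blocking function from the slide: 1st char of the product name.
    Uppercasing is an assumption, not stated on the slide."""
    return name[0].upper()

def build_blocks(records, bf):
    """Group record ids by the key that the blocking function returns."""
    blocks = defaultdict(list)
    for rec_id, name, _price in records:
        blocks[bf(name)].append(rec_id)
    return dict(blocks)

def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

blocks = build_blocks(products, first_char_bf)
pairs = candidate_pairs(blocks)
all_pairs = list(combinations([rec[0] for rec in products], 2))
print(blocks)
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

With this blocking function the five records fall into two blocks, so only pairs inside {p1, p2, p5} and {p3, p4} are ever compared.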
6
Similarity Computation
Id  Product Name            Price
p3  Apple Phone 4 32 GB     $545
p4  Apple iPod Shuffle 2GB  $49
Similarity Functions:
Resolve Function: Resolve ( ) = duplicate, distinct, or uncertain
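A hedged sketch of how similarity functions and a three-way resolve function might fit together for the pair (p3, p4). The slide does not preserve the actual formulas, so everything here is an assumption for illustration: the token Jaccard and relative price similarities, the equal weights, and the 0.3 / 0.7 thresholds are all hypothetical choices.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity of two strings (assumed similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def price_sim(x, y):
    """1 minus the relative price difference (assumed similarity function)."""
    highest = max(x, y)
    return 1.0 if highest == 0 else 1.0 - abs(x - y) / highest

def resolve(rec_a, rec_b, low=0.3, high=0.7):
    """Return 'duplicate', 'distinct', or 'uncertain'.
    The equal weighting and both thresholds are hypothetical."""
    score = 0.5 * jaccard(rec_a["name"], rec_b["name"]) \
        + 0.5 * price_sim(rec_a["price"], rec_b["price"])
    if score >= high:
        return "duplicate"
    if score <= low:
        return "distinct"
    return "uncertain"

p3 = {"name": "Apple Phone 4 32 GB", "price": 545}
p4 = {"name": "Apple iPod Shuffle 2GB", "price": 49}
print(resolve(p3, p4))
```

Under these assumed functions the two records share only the token "apple" and differ widely in price, so the pair falls below the lower threshold.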
7
Progressive ER
8
Real-time Analysis of Big Data
Event Monitoring. Situational Awareness. Real-time Alerts. Semantic Search. Anti-terrorism Applications.
Data Cleaning
9
How Progressive ER Helps
Progressive Data Cleaning
Progressive Analysis
Continually Refined Results
10
Relational Dataset

Venue
Id  Name                 Papers
u1  Very Large Data Bases  {p1}
u2  ICDE Conference        {p2}
u3  VLDB                   {p3}
u4  IEEE Data Eng. Bull    {p4}

Paper
Id  Title                                     Authors     Venue
p1  Transaction Support in Read Optimized …   {a1, a2}    u1
p2  Read Optimized File System Designs: …     {a1}        u2
p3  Transaction Support in Read Optimized …   {a3, a4}    u3
p4  Berkeley DB: A Retrospective..            {a3}        u4

Author
Id  Name                 Papers
a1  Marge Seltzer        {p1, p2}
a2  Michael Stonebraker  {p1}
a3  Margo I. Seltzer     {p3, p4}
a4  M. Stonebraker       {p3}
11
Graph Representation
Node (p1, p3): Resolve, duplicate
Node (u1, u3): duplicate
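One way to picture the graph on this slide: each node is a pair of entities from the same entity-set, and a paper pair such as (p1, p3) is linked to the venue pair (u1, u3) that its papers point to. The sketch below derives such related nodes from the Paper table of the relational dataset. Including author pairs as related nodes is an assumption made for illustration, and the function name related_nodes is hypothetical.

```python
# Paper table from the relational dataset slide.
papers = {
    "p1": {"authors": ["a1", "a2"], "venue": "u1"},
    "p2": {"authors": ["a1"], "venue": "u2"},
    "p3": {"authors": ["a3", "a4"], "venue": "u3"},
    "p4": {"authors": ["a3"], "venue": "u4"},
}

def related_nodes(paper_a, paper_b, papers):
    """Pair nodes that the paper pair (paper_a, paper_b) is connected to.
    Pairs of identical ids are skipped because they need no resolution."""
    a, b = papers[paper_a], papers[paper_b]
    nodes = set()
    if a["venue"] != b["venue"]:
        nodes.add((a["venue"], b["venue"]))
    for x in a["authors"]:
        for y in b["authors"]:
            if x != y:
                nodes.add((x, y))
    return nodes

print(sorted(related_nodes("p1", "p3", papers)))
```

For (p1, p3) this yields the venue node (u1, u3) from the slide plus four author pairs, which is where a duplicate decision on one node could influence its neighbours.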
12
Problem Definition
Given a relational dataset D and a cost budget BG, our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.
13
ER Graph
Blocks: R1, S1, R2, T2, T1, S2
14
ER Graph
Blocks: R1, S1, R2, T2, T1, S2
Nodes: v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12
15
Partially Constructed Graph
Blocks: R1, S1, T1, R2, T2, S2
Nodes: v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12
16
Overview
BG: Window 1, Window 2, …, Window n
1. Plan Generation.
2. Plan Execution ( ).
Resolution Plan ( ):
Set of blocks ( ) to be instantiated.
Set of nodes ( ) to be resolved.
17
Plan Execution Phase
Blocks: R1, S1, T1, S2, R2, T2
Nodes: v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12
18
Plan Cost and Benefit
19
Node Benefit
Direct Benefit. Indirect Benefit.
Nodes: v1, v2, v3, v4, v5, v6
State
20
Probability Estimation
Noisy-OR Model: Cause … Effect
Effect: Node vi being duplicate.
Causes:
Influencing duplicate nodes of vi.
Block to which vi belongs. Fraction of duplicate pairs in the block.
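The standard Noisy-OR combination treats every cause as an independent chance of producing the effect, so the effect is absent only if every cause fails. The sketch below applies that textbook formula to the two kinds of causes named on the slide. How the authors parametrize each cause is not shown in the transcript, so the function names, the use of the block's duplicate fraction as one cause, and the example strengths are assumptions.

```python
from math import prod

def noisy_or(cause_probs):
    """Standard Noisy-OR: P(effect) = 1 - product of (1 - p_c) over active causes."""
    return 1.0 - prod(1.0 - p for p in cause_probs)

def estimate_duplicate_probability(block_dup_fraction, influence_strengths):
    """Probability that a node is duplicate.
    block_dup_fraction: fraction of duplicate pairs in the node's block.
    influence_strengths: one probability per influencing duplicate node
    (hypothetical parametrization)."""
    return noisy_or([block_dup_fraction, *influence_strengths])

# A node in a block with 20% duplicate pairs and one influencing
# duplicate neighbour of assumed strength 0.5.
print(estimate_duplicate_probability(0.2, [0.5]))
```

Each additional influencing duplicate node can only raise the estimate, which matches the intuition that more evidence of related duplicates makes a node more likely to be a duplicate.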
21
Example
Blocks: R1, S1, T1, R2, T2, S2
Nodes: v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12
Labels shown: duplicate, distinct (v1, v2, S1)
22
Node Impact
Nodes: v1, v2, v3, v4, v5, v6, v7
Dependent Nodes. Nearest Nodes (K=2).
Why?
1. Belief update is NP-hard.
2. The nearest nodes are not always instantiated.
23
Impact Model
Case # 1, Case # 2, Case # 3 (each over nodes v1, v2, v3, v4, v5, v6, v7)
24
Plan Generation Phase
1. Benefit-vs-Cost Analysis: Each node and block has an updated cost and benefit.
2. Generate a plan such that: h. is maximized.
NP-hard: Oregon-Trail Knapsack
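Because the underlying problem is NP-hard, a practical planner needs a heuristic. The sketch below shows a plain greedy benefit-per-cost selection under a budget. It is a simplification for illustration and not the authors' algorithm: it treats items as independent (an ordinary 0/1 knapsack heuristic), and the item names, benefits, and costs are made-up values.

```python
def greedy_plan(items, budget):
    """Pick items in decreasing benefit/cost order while the budget allows.
    items: list of (name, benefit, cost) with cost > 0.
    Returns (selected names, total cost spent)."""
    ranked = sorted(items, key=lambda item: item[1] / item[2], reverse=True)
    plan, spent = [], 0.0
    for name, _benefit, cost in ranked:
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan, spent

# Hypothetical benefits and costs for four candidate nodes.
candidates = [
    ("v1", 0.9, 1.0),
    ("v2", 0.5, 2.0),
    ("v6", 0.8, 4.0),
    ("v10", 0.3, 1.0),
]
plan, spent = greedy_plan(candidates, budget=4.0)
print(plan, spent)
```

The greedy order favours cheap, high-benefit nodes first; v6 has a high benefit but is skipped here because its cost no longer fits in the remaining budget.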
25
Plan Generation Algorithm
Step#1, Step#2
Instantiated Unresolved Nodes: v1, v2, v4, v6, v7, v10, v13, v15, v16, v21
Nodes: v1, v2, v6, v10, v16
Uninstantiated Blocks: R1, R2, R4, R5, R6, R8, R9
26
Plan Generation Algorithm
Step#3
If > else return and
Blocks: R1, R8, R6, R2, …
Nodes: v1, v2, v6, v10, v16
Nodes: v1, v2, v10, v30, v32, v34, v36, v38, v40, v42, v45, v47, v48
27
Lazy Resolution with Workflow
How to resolve v1?
Resolve v1 … duplicate or distinct
Workflow of v1
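One way to read "lazy resolution" is that similarity functions are applied one at a time, in the order given by the node's workflow, and the process stops as soon as the remaining functions can no longer change the outcome. The sketch below implements that idea with a weighted score and a single threshold. The scoring rule, weights, costs, and threshold are assumptions for illustration; the transcript does not show how the authors' resolve function decides.

```python
def lazy_resolve(pair, workflow, threshold):
    """Apply similarity functions in workflow order with early termination.
    workflow: ordered list of (sim_fn, weight, cost); each sim_fn returns a
    value in [0, 1] and the weights sum to 1 (assumed scoring rule).
    Returns (decision, cost actually spent)."""
    remaining = sum(weight for _fn, weight, _cost in workflow)
    score, spent = 0.0, 0.0
    for sim_fn, weight, cost in workflow:
        score += weight * sim_fn(*pair)
        spent += cost
        remaining -= weight
        if score >= threshold:
            # Remaining functions can only add to the score.
            return "duplicate", spent
        if score + remaining < threshold:
            # Even perfect remaining similarities cannot reach the threshold.
            return "distinct", spent
    return ("duplicate" if score >= threshold else "distinct"), spent

def same_last_token(a, b):
    """Cheap check: do the two names end with the same token?"""
    return 1.0 if a.split()[-1].lower() == b.split()[-1].lower() else 0.0

def same_first_letter(a, b):
    """Cheap check: do the two names start with the same letter?"""
    return 1.0 if a[:1].lower() == b[:1].lower() else 0.0

workflow = [(same_last_token, 0.6, 1.0), (same_first_letter, 0.4, 5.0)]
print(lazy_resolve(("Marge Seltzer", "Margo I. Seltzer"), workflow, threshold=0.5))
print(lazy_resolve(("Marge Seltzer", "M. Stonebraker"), workflow, threshold=0.5))
```

In both calls the decision is reached after the first, cheaper function, so the second function's cost is never paid.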
28
Contribution of Functions
29
Workflow Generation
vi
30
vi
31
Resolution Cost
vi
Resolution Cost when vi is duplicate.
Resolution Cost when vi is distinct.
32
Experimental Evaluation
CiteSeerX Dataset
1. Papers (P) = ( Title, Abstract, Keywords, Authors, Venue ).
2. Authors (A) = ( Name, Email, Affiliation, Address, Paper ).
3. Venues (U) = ( Name, Year, Pages, Papers ).

   Number of Entities  Blocking Functions  Similarity Functions  Resolve Function
P  30,000              2                   3                     Naïve Bayes
A  83,152              1                   4                     Naïve Bayes
U  30,000              1                   3                     Naïve Bayes
33
CiteSeerX - Blocking
Papers (P):
First three characters of title.
Last three characters of title.
Authors (A):
First one character of first name appended with the first two characters of last name.
Venues (U):
First two characters of name appended with the first two digits of year.
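The four blocking functions described on this slide translate directly into key-extraction helpers. The sketch below follows the slide's wording; trimming whitespace and lowercasing the keys are assumptions added for robustness, and the function names are hypothetical.

```python
def paper_keys(title):
    """Papers use two blocking functions: first three and last three
    characters of the title. Lowercasing and trimming are assumptions."""
    t = title.strip().lower()
    return [t[:3], t[-3:]]

def author_key(first_name, last_name):
    """First character of the first name appended with the first two
    characters of the last name."""
    return (first_name.strip()[:1] + last_name.strip()[:2]).lower()

def venue_key(name, year):
    """First two characters of the name appended with the first two
    digits of the year."""
    return name.strip()[:2].lower() + str(year)[:2]

print(paper_keys("Berkeley DB: A Retrospective"))
print(author_key("Margo", "Seltzer"), author_key("Marge", "Seltzer"))
print(venue_key("VLDB", 1995))
```

Because papers have two keys, a paper lands in two blocks, one per blocking function, which gives a second chance to co-locate duplicates whose titles differ at one end.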
34
Experimental Evaluation
Algorithms:
1. DepGraph. X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
2. Static. S. E. Whang et al. Joint entity resolution. ICDE.
3. Full: No lazy resolution strategy.
4. Random: Lazy resolution strategy but with random order.
R: R1, R4, R5, …
T: T6, T1, T3, …
S: S2, S6, S5, …
35
Time vs. Recall
36
Lazy Resolution with Workflow

                      Our Approach  Random   Full
Execution Time (sec)  300.33        396.55   542.43
Plan Generation       4.76%         3.81%    2.58%
Plan Execution        95.11%        96.17%   97.40%

Plan Execution: Reading Blocks. Creating Nodes. Resolving Nodes.

                      Our Approach  Random   Full
Execution Time (sec)  300.33        396.55   542.43
Plan Generation       4.76%         3.81%    2.58%
Reading Blocks        4.70%         3.75%    2.90%
Graph Creation        8.40%         6.25%    4.72%
Node Resolution       82.01%        86.17%   89.78%
37
Lazy Resolution with Workflow #2
Number of Sim Functions
       P  A  U
Set_1  3  4  3
Set_2  2  2  2
Set_3  1  1  1
38
Correlation Among Sim Functions
39
Synthetic Dataset
Parameter  Description                                      Value
n          Number of entity-sets                            4
s          Number of entities per entity-set                20,000
b          Number of blocks per entity-set                  100
d          Fraction of duplicate pairs in each entity-set   0.2
z          Zipfian distribution exponent                    0.15
l          Probability of generating an influence           0.3
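To give a feel for the Zipfian exponent z, the sketch below computes normalized Zipfian weights over b ranked blocks, where the weight of rank r is proportional to 1 / r^z. How the authors' generator actually uses z is not stated in the transcript; applying the weights to blocks, and the function name zipf_weights, are assumptions.

```python
def zipf_weights(num_blocks, z):
    """Normalized Zipfian weights: weight of rank r is proportional to 1 / r**z.
    z = 0 gives a uniform distribution; larger z is more skewed."""
    raw = [1.0 / (rank ** z) for rank in range(1, num_blocks + 1)]
    total = sum(raw)
    return [w / total for w in raw]

# b = 100 blocks with the default exponent z = 0.15 from the table.
weights = zipf_weights(100, 0.15)
print(round(weights[0] / weights[-1], 3))  # ratio of largest to smallest weight
```

With z = 0.15 the skew is mild, which is why the table's default sits between the uniform case z = 0.00 and the more skewed z = 0.30.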
40
Duplicate Distribution Z = 0.00 Z = 0.15 Z = 0.30
41
Number Of Influences l = 0.0 l = 0.3 l = 0.6
42
Conclusion
Progressive Approach to Relational ER.
Cost and benefit model for generating a resolution plan.
Lazy resolution strategy to resolve nodes with the least amount of cost.
Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.
43
Questions