Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

Data Processing Flow: Data, Analysis, Decision. Quality of Data, Quality of Analysis, Quality of Decision.
Data Quality Challenges:
- Erroneous Values.
- Missing Values.
- Duplication.
- ……
Data Cleaning Accounts For: 80%

Entity Resolution (ER): Digital World Entities and Real World Objects
Michael Jordan: Basketball Player
Michael Jordan: Professor @ UCB

Entity Resolution (ER)
Id  Product Name                   Price
p1  IPad Two 16GB WiFi             $490
p2  IPad 2nd Generation 16GB WiFi  $469
p3  Apple Phone 4 32 GB            $545
p4  Apple iPod Shuffle 2GB         $49
p5  IPhone 4th Generation 32GB     $520
[Figure: records p1, p2, p3, p4, p5 grouped into clusters]

Blocking
Dataset:
Id  Product Name                   Price
p1  IPad Two 16GB WiFi             $490
p2  IPad 2nd Generation 16GB WiFi  $469
p3  Apple Phone 4 32 GB            $545
p4  Apple iPod Shuffle 2GB         $49
p5  IPhone 4th Generation 32GB     $520
Blocking functions: BF 1, BF 2, …
BF = 1st char of product name
Blocks: {p1, p2, p5}, {p3, p4}, …
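A minimal Python sketch of the blocking step on this slide, assuming the blocking function is literally "first character of the product name" (upper-cased here for convenience). The names `first_char_key`, `build_blocks`, and `candidate_pairs` are illustrative and do not come from the talk.

```python
from collections import defaultdict

# Product records from the slide: (id, product name, price in dollars).
PRODUCTS = [
    ("p1", "IPad Two 16GB WiFi", 490),
    ("p2", "IPad 2nd Generation 16GB WiFi", 469),
    ("p3", "Apple Phone 4 32 GB", 545),
    ("p4", "Apple iPod Shuffle 2GB", 49),
    ("p5", "IPhone 4th Generation 32GB", 520),
]


def first_char_key(name):
    """Blocking function BF: first character of the product name."""
    return name[0].upper()


def build_blocks(records, key_fn):
    """Group record ids by blocking key."""
    blocks = defaultdict(list)
    for rec_id, name, _price in records:
        blocks[key_fn(name)].append(rec_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = []
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.append((ids[i], ids[j]))
    return pairs


BLOCKS = build_blocks(PRODUCTS, first_char_key)
PAIRS = candidate_pairs(BLOCKS)
print(BLOCKS)
print(PAIRS)
```

With five records there are ten possible pairs; blocking on the first character leaves four candidate pairs to compare.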

Similarity Computation
Id  Product Name            Price
p3  Apple Phone 4 32 GB     $545
p4  Apple iPod Shuffle 2GB  $49
Similarity Functions:
Resolve Function: Resolve ( ) = duplicate, distinct, or uncertain
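The slide's actual similarity and resolve functions did not survive extraction, so the following is only a sketch under stated assumptions: token-level Jaccard similarity on the name, a relative-difference similarity on the price, an equal-weight average, and two hypothetical thresholds (`low`, `high`) that map the score to the three outcomes.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def price_similarity(x, y):
    """1 minus the relative price difference (assumed choice)."""
    hi = max(x, y)
    return 1.0 if hi == 0 else 1.0 - abs(x - y) / hi


def resolve(rec_a, rec_b, low=0.3, high=0.8):
    """Combine the similarities and map the score to one of three outcomes.

    The equal weights and both thresholds are hypothetical values.
    """
    score = 0.5 * jaccard(rec_a["name"], rec_b["name"]) \
        + 0.5 * price_similarity(rec_a["price"], rec_b["price"])
    if score >= high:
        return "duplicate"
    if score <= low:
        return "distinct"
    return "uncertain"


p3 = {"name": "Apple Phone 4 32 GB", "price": 545}
p4 = {"name": "Apple iPod Shuffle 2GB", "price": 49}
print(resolve(p3, p4))
```

For p3 and p4 the names share only the token "apple" and the prices are far apart, so this sketch labels the pair distinct.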

Progressive ER

Real-time Analysis of Big Data
- Event Monitoring
- Situational Awareness
- Real-time Alerts
- Semantic Search
- Anti-terrorism Applications
Data Cleaning

How Progressive ER Helps
Progressive Data Cleaning, Progressive Analysis, Continually Refined Results

Relational Dataset
Paper
Id  Title                                    Authors   Venue
p1  Transaction Support in Read Optimized …  {a1, a2}  u1
p2  Read Optimized File System Designs: …    {a1}      u2
p3  Transaction Support in Read Optimized …  {a3, a4}  u3
p4  Berkeley DB: A Retrospective..           {a3}      u4
Author
Id  Name                 Papers
a1  Marge Seltzer        {p1, p2}
a2  Michael Stonebraker  {p1}
a3  Margo I. Seltzer     {p3, p4}
a4  M. Stonebraker       {p3}
Venue
Id  Name                   Papers
u1  Very Large Data Bases  {p1}
u2  ICDE Conference        {p2}
u3  VLDB                   {p3}
u4  IEEE Data Eng. Bull    {p4}

Graph Representation
[Figure: node (u1, u3) and node (p1, p3), each labeled duplicate by Resolve]
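One way to picture the graph representation, sketched in Python under assumptions: each node is a pair of entities from the same entity-set, and an edge links a paper pair to the venue pair formed by the papers' venues. The slide does not state edge direction, so the edges here are undirected, and no blocking is applied, so every paper pair becomes a node. The helper names are illustrative.

```python
from itertools import combinations

# Paper -> venue relationship from the relational dataset slide.
PAPER_VENUE = {"p1": "u1", "p2": "u2", "p3": "u3", "p4": "u4"}


def pair(a, b):
    """Canonical (sorted) representation of an unordered entity pair."""
    return tuple(sorted((a, b)))


def build_er_graph(paper_venue):
    """Return an adjacency map: pair node -> set of neighbouring pair nodes."""
    graph = {}
    for pa, pb in combinations(sorted(paper_venue), 2):
        ua, ub = paper_venue[pa], paper_venue[pb]
        if ua == ub:
            continue  # same venue entity: there is no venue pair to resolve
        paper_node, venue_node = pair(pa, pb), pair(ua, ub)
        graph.setdefault(paper_node, set()).add(venue_node)
        graph.setdefault(venue_node, set()).add(paper_node)
    return graph


GRAPH = build_er_graph(PAPER_VENUE)
print(GRAPH[("p1", "p3")])
```

In this sketch the node (p1, p3) is adjacent to (u1, u3), which is the pair of nodes shown on the slide: resolving one as duplicate is evidence about the other.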

Problem Definition
- Given a relational dataset D, and a cost budget BG,
- Our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.

ER Graph
[Figure: blocks R1, R2, S1, S2, T1, T2]

ER Graph
[Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12]

Partially Constructed Graph
[Figure: blocks R1, S1, T1 and R2, S2, T2 with nodes v1 to v12]

Overview
Window 1, Window 2, …, Window n (total budget BG)
1. Plan Generation.
2. Plan Execution ( ).
Resolution Plan ( )
- Set of blocks ( ) to be instantiated.
- Set of nodes ( ) to be resolved.
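The window structure above can be sketched as a simple driver loop. This is an illustration under assumptions, not the authors' implementation: the budget is split into fixed-size windows, and `generate_plan` and `execute_plan` are hypothetical callbacks standing in for the two phases. The toy callbacks below only pick items that fit the window.

```python
def run_progressive(budget, window_size, generate_plan, execute_plan):
    """Spend up to `budget` cost units, one window at a time.

    generate_plan(window_budget) -> plan (an empty plan stops the loop)
    execute_plan(plan)           -> (cost actually spent, result of the window)
    """
    spent, history = 0, []
    while spent < budget:
        window_budget = min(window_size, budget - spent)
        plan = generate_plan(window_budget)
        if not plan:
            break
        cost, result = execute_plan(plan)
        if cost <= 0:
            break  # guard against a plan that makes no progress
        spent += cost
        history.append(result)
    return spent, history


# Toy stand-ins: pending work items as (node name, resolution cost).
pending = [("v1", 3), ("v2", 2), ("v3", 4), ("v4", 1)]


def toy_generate(window_budget):
    plan, total = [], 0
    for name, cost in pending:
        if total + cost <= window_budget:
            plan.append((name, cost))
            total += cost
    return plan


def toy_execute(plan):
    for item in plan:
        pending.remove(item)
    return sum(cost for _, cost in plan), [name for name, _ in plan]


SPENT, HISTORY = run_progressive(8, 5, toy_generate, toy_execute)
print(SPENT, HISTORY)
```

Results become available after every window, which is what makes the approach progressive: a caller can stop early and still use what has been resolved so far.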

Plan Execution Phase
[Figure: blocks R1, S1, T1 and R2, S2, T2 with nodes v1 to v12]

Plan Cost and Benefit

Node Benefit
[Figure: nodes v1 to v6 and their State, showing Direct Benefit and Indirect Benefit]

Probability Estimation: Noisy-OR Model (Cause … Effect)
Effect:
- Node vi being duplicate.
Causes:
- Influencing duplicate nodes of vi.
- Block to which vi belongs. Fraction of duplicate pairs in the block.
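For reference, the standard noisy-OR combination says the effect fails to occur only if every active cause independently fails to trigger it, so P(effect) = 1 - prod(1 - p_c) over the causes c. The sketch below applies that textbook formula; how the talk derives each per-cause probability is not recoverable from the slide, so the numbers in the example are made up.

```python
def noisy_or(cause_probs):
    """Probability of the effect given independent causes (noisy-OR).

    cause_probs: for each active cause, the probability that this cause
    alone would make the node a duplicate.
    """
    prob_no_cause_fires = 1.0
    for p in cause_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        prob_no_cause_fires *= 1.0 - p
    return 1.0 - prob_no_cause_fires


# Hypothetical inputs: a block whose fraction of duplicate pairs is 0.2,
# plus two influencing duplicate nodes with strengths 0.5 and 0.3.
P_DUPLICATE = noisy_or([0.2, 0.5, 0.3])
print(round(P_DUPLICATE, 2))
```

Each additional duplicate neighbour can only raise the estimate, which matches the intuition that resolved duplicates make related pairs more likely to be duplicates.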

Example
[Figure: blocks R1, S1, T1 and R2, S2, T2 with nodes v1 to v12; legend: duplicate, distinct; highlighted: v1, v2, S1]

Node Impact
[Figure: nodes v1 to v7; Dependent Nodes; Nearest Nodes (K=2)]
Why?
1. Belief Update: NP-hard.
2. The Nearest Nodes are not always instantiated.
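The slide does not define "Nearest Nodes (K=2)". Assuming it means the nodes reachable within K hops in the ER graph, a hop-bounded breadth-first search collects them. That reading, the function name, and the toy chain graph are all assumptions made for illustration.

```python
from collections import deque


def nodes_within_k_hops(graph, start, k):
    """Return {node: hop distance} for nodes at most k hops from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    del dist[start]
    return dist


# Toy chain v1 - v2 - v3 - v4, stored as an adjacency map.
CHAIN = {
    "v1": ["v2"],
    "v2": ["v1", "v3"],
    "v3": ["v2", "v4"],
    "v4": ["v3"],
}
NEAREST = nodes_within_k_hops(CHAIN, "v1", 2)
print(NEAREST)
```

Bounding the search keeps the impact estimate cheap, in contrast to a full belief update over all dependent nodes.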

Impact Model
[Figure: Case # 1, Case # 2, Case # 3, each over nodes v1 to v7]

Plan Generation Phase
1. Benefit-vs-Cost Analysis:
- Each node and block has an updated cost and benefit.
2. Generate a plan such that:
- h.
- is maximized.
NP-hard: Oregon-Trail Knapsack
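Since the exact problem is NP-hard, some heuristic is needed. The sketch below is a plain greedy benefit-per-cost heuristic, named as such; it is not claimed to be the algorithm from the talk. It assumes each node belongs to one block, that choosing a node from an uninstantiated block also pays that block's instantiation cost, and that all costs are positive. All names and numbers are illustrative.

```python
def greedy_plan(nodes, block_cost, budget):
    """Greedy knapsack-style plan.

    nodes: list of (name, block, resolution cost, benefit), costs > 0
    block_cost: block -> cost of instantiating that block
    Returns (chosen node names, instantiated blocks, total cost spent).
    """
    chosen_nodes, chosen_blocks, spent = [], set(), 0.0
    remaining = list(nodes)

    def total_cost(node):
        # A node in a block that is not yet chosen also pays for the block.
        _name, block, cost, _benefit = node
        return cost + (0.0 if block in chosen_blocks else block_cost[block])

    while remaining:
        affordable = [n for n in remaining if spent + total_cost(n) <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda n: n[3] / total_cost(n))
        spent += total_cost(best)
        chosen_blocks.add(best[1])
        chosen_nodes.append(best[0])
        remaining.remove(best)
    return chosen_nodes, chosen_blocks, spent


NODES = [("v1", "R1", 2, 6), ("v2", "R1", 2, 3), ("v3", "R2", 1, 4)]
BLOCK_COST = {"R1": 2, "R2": 5}
PLAN = greedy_plan(NODES, BLOCK_COST, budget=7)
print(PLAN)
```

Once R1 is paid for, v2 becomes cheap and is picked next, while v3 is skipped because instantiating R2 would exceed the budget. That coupling between node choices and block costs is what separates this setting from an ordinary knapsack.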

Plan Generation Algorithm
[Figure: Step#1 and Step#2 over Instantiated Unresolved Nodes (v1, v2, v4, v6, v7, v10, v13, v15, v16, v21; then v1, v2, v6, v10, v16) and Uninstantiated Blocks (R1, R2, R4, R5, R6, R8, R9)]

Plan Generation Algorithm
Step#3: If > else return and
[Figure: blocks R1, R8, R6, R2, …; nodes v1, v2, v6, v10, v16; then v1, v2, v10, v30, v32, v34, v36, v38, v40, v42, v45, v47, v48]

Lazy Resolution with Workflow
How to resolve v1?
Resolve … duplicate or distinct
Workflow of v1
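The idea of lazy resolution, as far as the slide shows, is to apply similarity functions one at a time and stop once the outcome is decided. The sketch below assumes a weighted-sum decision with a single threshold, which is a stand-in for illustration and not the resolve function from the talk. Each similarity is assumed to lie in [0, 1], so the remaining weights bound how much the score can still grow.

```python
def lazy_resolve(pair, workflow, threshold):
    """Apply similarity functions in workflow order and stop early.

    workflow: ordered list of (weight, sim_fn), with sim_fn(pair) in [0, 1]
    Returns (decision, number of similarity functions actually evaluated).
    """
    score = 0.0
    remaining = sum(weight for weight, _ in workflow)
    used = 0
    for weight, sim_fn in workflow:
        score += weight * sim_fn(pair)
        remaining -= weight
        used += 1
        if score >= threshold:
            return "duplicate", used  # cannot drop below the threshold
        if score + remaining < threshold:
            return "distinct", used   # cannot reach the threshold any more
    return ("duplicate" if score >= threshold else "distinct"), used


# Hypothetical workflow: precomputed similarities are read from the pair.
WORKFLOW = [
    (0.5, lambda p: p["title"]),
    (0.3, lambda p: p["authors"]),
    (0.2, lambda p: p["venue"]),
]
LIKELY_DUP = {"title": 1.0, "authors": 1.0, "venue": 0.0}
LIKELY_DISTINCT = {"title": 0.0, "authors": 1.0, "venue": 1.0}
print(lazy_resolve(LIKELY_DUP, WORKFLOW, 0.6))
print(lazy_resolve(LIKELY_DISTINCT, WORKFLOW, 0.6))
```

The order of the workflow matters: putting the most decisive function first lets many pairs be settled after one or two evaluations.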

Contribution of Functions

Workflow Generation (vi)

Resolution Cost of vi
- Resolution Cost when vi is duplicate.
- Resolution Cost when vi is distinct.
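The slide lists two conditional costs but the formula combining them was lost. A natural reading, stated here as an assumption, is a probability-weighted expectation using the estimated probability that vi is a duplicate. The function name and the numbers are illustrative.

```python
def expected_resolution_cost(p_duplicate, cost_if_duplicate, cost_if_distinct):
    """Expected cost of resolving a node, weighting both outcomes.

    Assumes the node is either duplicate or distinct, with probability
    p_duplicate of being a duplicate.
    """
    if not 0.0 <= p_duplicate <= 1.0:
        raise ValueError("p_duplicate must lie in [0, 1]")
    return (p_duplicate * cost_if_duplicate
            + (1.0 - p_duplicate) * cost_if_distinct)


# Hypothetical numbers: a likely-distinct node that is cheap to rule out.
COST = expected_resolution_cost(0.25, 8.0, 4.0)
print(COST)
```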

Experimental Evaluation: CiteSeerX Dataset
1. Papers (P) = ( Title, Abstract, Keywords, Authors, Venue ).
2. Authors (A) = ( Name, Email, Affiliation, Address, Paper ).
3. Venues (U) = ( Name, Year, Pages, Papers ).
   Number of Entities  Blocking Functions  Similarity Functions  Resolve Function
P  30,000              2                   3                     Naïve Bayes
A  83,152              1                   4                     Naïve Bayes
U  30,000              1                   3                     Naïve Bayes

CiteSeerX - Blocking
- Papers (P)
  - First three characters of title.
  - Last three characters of title.
- Authors (A)
  - First one character of first name appended with the first two characters of last name.
- Venues (U)
  - First two characters of name appended with the first two digits of year.
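The four blocking functions listed above can be sketched as small key functions. The lower-casing, whitespace stripping, and treatment of the last whitespace-separated token as the last name are assumptions added here; the slide only states which characters are taken.

```python
def paper_keys(title):
    """Two keys per paper: first three and last three characters of the title."""
    t = title.strip().lower()
    return [t[:3], t[-3:]]


def author_key(full_name):
    """First character of the first name plus first two of the last name."""
    parts = full_name.strip().lower().split()
    if not parts:
        return ""
    return parts[0][:1] + parts[-1][:2]


def venue_key(name, year):
    """First two characters of the name plus first two digits of the year."""
    return name.strip().lower()[:2] + str(year)[:2]


print(paper_keys("Berkeley DB: A Retrospective"))
print(author_key("Margo I. Seltzer"), author_key("Marge Seltzer"))
print(venue_key("VLDB", 2014))
```

Both spellings of Seltzer map to the same author key, so the two author records end up in the same block and become a candidate pair.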

Experimental Evaluation
Algorithms:
1. DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
2. Static: S. E. Whang et al. Joint entity resolution. ICDE.
3. Full: No lazy resolution strategy.
4. Random: Lazy resolution strategy but with random order.
[Figure: entity-sets R, S, T with blocks R1, R4, R5, …; S2, S6, S5, …; T6, T1, T3, …]

Time vs. Recall

Lazy Resolution with Workflow
                      Our Approach  Random  Full
Execution Time (sec)  300.33        396.55  542.43
Plan Generation       4.76%         3.81%   2.58%
Plan Execution        95.11%        96.17%  97.40%
  Reading Blocks      4.70%         3.75%   2.90%
  Graph Creation      8.40%         6.25%   4.72%
  Node Resolution     82.01%        86.17%  89.78%
Plan Execution consists of:
- Reading Blocks.
- Creating Nodes.
- Resolving Nodes.

Lazy Resolution with Workflow #2
Number of Sim Functions
       P  A  U
Set_1  3  4  3
Set_2  2  2  2
Set_3  1  1  1

Correlation Among Sim Functions

Synthetic Dataset
Parameter  Description                                       Value
n          Number of entity-sets                             4
s          Number of entities per entity-set                 20,000
b          Number of blocks per entity-set                   100
d          Fraction of duplicate pairs in each entity-set    0.2
z          Zipfian distribution exponent                     0.15
l          Probability of generating an influence            0.3
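To make the exponent z concrete: under a Zipfian distribution the block of rank r receives a share proportional to 1 / r^z. The sketch below assumes, without confirmation from the slides, that this is how duplicates are spread over the b = 100 blocks of an entity-set; `zipf_weights` is an illustrative name.

```python
def zipf_weights(num_blocks, exponent):
    """Normalized Zipfian shares for blocks ranked 1..num_blocks."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_blocks + 1)]
    total = sum(raw)
    return [w / total for w in raw]


UNIFORM = zipf_weights(100, 0.0)   # z = 0.00: every block gets the same share
SKEWED = zipf_weights(100, 0.15)   # z = 0.15: the table's default value
print(round(UNIFORM[0], 4), round(SKEWED[0], 4), round(SKEWED[-1], 4))
```

With z = 0 every block gets an equal share; larger z concentrates duplicates in the top-ranked blocks.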

Duplicate Distribution
[Figure panels: Z = 0.00, Z = 0.15, Z = 0.30]

Number Of Influences
[Figure panels: l = 0.0, l = 0.3, l = 0.6]

Conclusion
- Progressive Approach to Relational ER.
- Cost and benefit model for generating a resolution plan.
- Lazy resolution strategy to resolve nodes with the least amount of cost.
- Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.

Questions
