Record Linkage in a Distributed Environment


1 Record Linkage in a Distributed Environment
Huang Yipeng, Wing group meeting, 11 March 2011

2 Record Linkage
Determining if pairs of personal records refer to the same entity
E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>
Introduction

3 The Distributed Environment
[Diagram: example records Amanda, Beverley, Katherine; O(nC2) pairwise comparisons]
Why? Dealing with large data; limitations of blocking
Advantages: parallel computation, data source flexibility, complementary to blocking methods
Introduction

4 The Distributed Environment
MapReduce: distributed environment for large data sets
Hadoop: open source implementation
A convenient model for scaling Record Linkage; protects users from system-level concerns
Introduction

5 Research Problem
Disconnect between the generic parallel framework and the specific Record Linkage problem
The goal: tailor Hadoop for Record Linkage tasks
Introduction

6 Outline Introduction Related Work Methodology Evaluation Conclusion

7 Related Work
Record Linkage literature: blocking techniques
Parallel Record Linkage literature: P-Febrl (P Christen 2003), P-Swoosh (H Kawai 2006), Parallel Linkage (H Kim 2007)
Hadoop literature: evaluation metrics, pairwise comparisons (T Elsayed 2008)
Related Work

8 Outline Introduction Related Work Methodology Evaluation Conclusion

9 MapReduce Workflow
[Diagram: MapReduce workflow with the Partitioner stage]
Methodology

10 Implementation
Map. Purpose: parallelism, data manipulation, blocking. Reads lines of input and outputs <key, value> pairs.
Reduce. Purpose: parallelism, Record Linkage operations. Records with the same <key> arrive at the same Reduce(), which outputs the linkage results.
Methodology
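A minimal sketch of such a Map/Reduce pair in Hadoop (the class names, the choice of the given name as blocking key, the field position, and the similarity test are illustrative assumptions, not the presentation's actual code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical blocking mapper: emits <blocking key, full record> pairs.
public class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        // Assumption: the given name (field 1) serves as the blocking key.
        String blockKey = fields[1].trim().toLowerCase();
        context.write(new Text(blockKey), line);
    }
}

// Hypothetical linkage reducer: all records sharing a key arrive at one
// Reduce() call, where the pairwise comparisons are performed.
class LinkageReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        List<String> block = new ArrayList<>();
        for (Text r : records) block.add(r.toString());   // copy: Hadoop reuses the Text object
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                if (similar(block.get(i), block.get(j))) { // placeholder similarity test
                    context.write(key, new Text(block.get(i) + " <=> " + block.get(j)));
                }
            }
        }
    }

    private boolean similar(String a, String b) {
        return a.equalsIgnoreCase(b);   // stand-in for a real field-wise comparison
    }
}
```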

11 Hash Partitioner
Default implementation: Hash(Key) mod N
Good for uniform data but not for skewed distributions
Name      Count  Comparisons
joshua    50     1225
emiily    48     1128
jack      35     595
thomas    33     528
lachlan   32     496
benjamin  31     465
Methodology
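For reference, Hadoop's default hash partitioning boils down to roughly the following rule (a sketch of Hash(Key) mod N, not the project's code):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default Hash(Key) mod N rule: every record whose key hashes
// into the same bucket lands on the same reducer, regardless of how many
// comparisons that block implies.
public class DefaultStylePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```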

12 Record Linkage Partitioner
Preprocessing
Partitioning: balances the number of comparisons assigned to each node in an online fashion, to attain a more consistent running time across nodes
Merging (TODO: external sorting ???)
Methodology

13 Record Linkage Partitioner
Goal: have all nodes finish the reduce phase at the same time
Attain a better runtime while retaining the same level of accuracy
Methodology
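One way to realize this goal, sketched here as an assumption rather than the presentation's actual implementation, is a custom Partitioner that consults a precomputed block-to-node assignment (produced by the preprocessing step) instead of hashing:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical workload-aware partitioner: blocks are routed according to a
// precomputed assignment that balances comparison counts, falling back to
// hashing for blocks the preprocessing step never saw.
public class ComparisonAwarePartitioner extends Partitioner<Text, Text> {
    // In practice this table would be loaded from the preprocessing output
    // (e.g. via the DistributedCache); hard-coded here for illustration.
    private static final Map<String, Integer> ASSIGNMENT = new HashMap<>();
    static {
        ASSIGNMENT.put("joshua", 0);   // 1225 comparisons -> its own node
        ASSIGNMENT.put("emiily", 1);   // 1128 comparisons
        ASSIGNMENT.put("jack", 2);     // smaller blocks share the third node
        ASSIGNMENT.put("thomas", 2);
    }

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        Integer node = ASSIGNMENT.get(key.toString());
        if (node != null) {
            return node % numReduceTasks;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```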

14 Domain principles
Counting pairwise comparisons gives a more accurate picture of the true computational workload
The distribution of names tends to follow a power-law distribution in many countries (D Zanette 2001), (S Miyazima 2000):
United States & Berlin: D. H. Zanette and S. C. Manrubia, Physica A 295, 1 (2001)
Taiwan: W. J. Reed and B. D. Hughes, Physica A 319, 579 (2003)
Japan: S. Miyazima, Y. Lee, T. Nagamine, and H. Miyajima, Physica A 278, 282 (2000)
England & Wales (first names): Douglas A. Galbi, Long-Term Trends in Personal Given Name Frequencies in the UK, FCC, 2002
Korea, China: exponential rather than Zipf
Methodology
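To make the first principle concrete: a block of n records needs nC2 = n(n-1)/2 within-block comparisons, so block size alone understates the skew. A small sketch of that arithmetic, using the block sizes from the skewed data set shown in slides 11 and 22:

```java
// Within-block workload grows quadratically: n records need n*(n-1)/2 comparisons.
public class ComparisonCount {
    static long comparisons(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        int[] blockSizes = {50, 48, 35, 33, 32, 31};   // e.g. joshua, emiily, jack, ...
        for (int n : blockSizes) {
            System.out.println(n + " records -> " + comparisons(n) + " comparisons");
        }
        // 50 -> 1225, 48 -> 1128, 35 -> 595, 33 -> 528, 32 -> 496, 31 -> 465
    }
}
```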

15 Record Linkage Workflow
Round 1: range partition based on comparison workload
Round 2: merge lost comparisons from Round 1
Round 3: remove cross duplicates
Methodology

16 Round 1
[Diagram: Input, Distribution, Map Phase]
1. Calculate the average comparison workload over N nodes
2. Check if a record will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among the nodes
Methodology
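A rough, self-contained interpretation of these steps (the splitting rule, the example block sizes, and the omitted lost-comparison recursion are assumptions, not the presentation's code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Round 1 balancing idea: split any block whose comparison
// load exceeds the per-node average, then assign the pieces.
public class RoundOneBalancer {

    static long comparisons(long n) { return n * (n - 1) / 2; }

    /** Returns {blockIndex, pieceSize} entries after splitting oversized blocks. */
    public static List<long[]> balance(long[] blockSizes, int numNodes) {
        List<long[]> pieces = new ArrayList<>();
        long totalWork = 0;
        for (long n : blockSizes) totalWork += comparisons(n);
        long avgPerNode = totalWork / numNodes;    // step 1: average comparison workload

        for (int i = 0; i < blockSizes.length; i++) {
            long n = blockSizes[i];
            if (comparisons(n) <= avgPerNode) {    // step 2: does this block exceed the average?
                pieces.add(new long[]{i, n});
                continue;
            }
            // Divide by the minimum number of parts needed to drop below the average.
            int parts = 2;
            while (comparisons((n + parts - 1) / parts) > avgPerNode) parts++;
            long pieceSize = (n + parts - 1) / parts;
            for (long assigned = 0; assigned < n; assigned += pieceSize) {
                pieces.add(new long[]{i, Math.min(pieceSize, n - assigned)});
            }
            // Steps 3-4 in the slides would now subtract the "lost" cross-piece
            // comparisons from the average and recurse; omitted in this sketch.
        }
        return pieces;
    }

    public static void main(String[] args) {
        long[] sizes = {50, 48, 35, 33, 32, 31};
        for (long[] p : balance(sizes, 4)) {
            System.out.println("block " + p[0] + ": piece of " + p[1] + " records");
        }
    }
}
```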

17 Round 2
[Diagram: list X split into A and B; within-list comparisons in R1, cross-list (lost) comparisons in R2]
Methodology

18 Round 2
Only acts on the lost comparisons from Round 1
Because the input is indistinct, a 3rd round of deduplication may be needed
[Diagram: partitions A, B (Job 1) and A, B, C (Jobs 1-3)]
Methodology
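A plausible way to remove such cross duplicates in a final pass, sketched under the assumption that each match is reported as a pair of record identifiers, is to normalise every pair into a canonical key and keep one copy per key:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical Round 3 deduplication: matches reported by different jobs may
// name the same pair of records, so normalise each pair into a canonical key
// (smaller id first) and keep only the first occurrence.
public class CrossDuplicateRemover {
    static String canonicalPair(String idA, String idB) {
        return idA.compareTo(idB) <= 0 ? idA + "|" + idB : idB + "|" + idA;
    }

    public static void main(String[] args) {
        String[][] matches = {              // e.g. output of Jobs 1-3 (illustrative ids)
            {"rec-12", "rec-87"},
            {"rec-87", "rec-12"},           // same pair reported by another job
            {"rec-05", "rec-44"}
        };
        Set<String> seen = new LinkedHashSet<>();
        for (String[] m : matches) {
            if (seen.add(canonicalPair(m[0], m[1]))) {
                System.out.println(m[0] + " matches " + m[1]);
            }
        }
    }
}
```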

19 Outline Introduction Related Work Methodology Evaluation Conclusion

20 Performance Metrics
Performance evaluated in absolute runtime, speedup & scaleup on a shared cluster
"It's what users care about": representative of real operations
Evaluation

21 Input Records
10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record, duplicates follow a Poisson distribution
Example record: <rec org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , , 38, , , 9>
Methodology

22 Data sets
Synthetic data produced with the Febrl data generator
Artificially skewed distribution:
Name      Count  Comparisons
joshua    50     1225
emiily    48     1128
jack      35     595
thomas    33     528
lachlan   32     496
benjamin  31     465
Methodology

23 Utilization Evaluation

24 Utilization Evaluation

25 Utilization
[Chart: A, B, C]
Evaluation

26 Utilization
[Chart: A, B]
Evaluation

27 Round 2
[Chart: Node Utilization % across jobs J1-J6 for A, B, C]
Evaluation

28 Results so far…
(2 nodes)
                                 Default Workflow   RL Workflow
5000 records, 2433 duplicates    71.5 secs          75 secs
7000 records, 4814 duplicates    >10 mins           196.8 secs
Evaluation

29 Results so far…
RL Workflow runtime: similar to the hash-based runtime on small datasets, better as the size of the dataset grows
Evaluation

30 Conclusion
Parallelism is a step in the right direction for record linkage, complementary to existing approaches
Hadoop can be tailored for Record Linkage tasks
The "Record Linkage" Partitioner / Workflow is just one example of possible improvements
Conclusion

