Record Linkage in a Distributed Environment


1 Record Linkage in a Distributed Environment
Huang Yipeng, Wing group meeting, 11 March 2011

2 Record Linkage
Determining if pairs of personal records refer to the same entity
E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>
Introduction

3 The Distributed Environment
[Diagram: example records Amanda, Beverley, Katherine; O(nC2) pairwise comparisons]
Why? Dealing with large data; limitations of blocking
Advantages: parallel computation, data source flexibility, complementary to blocking methods
Introduction

4 The Distributed Environment
MapReduce: distributed environment for large data sets
Hadoop: open source implementation
A convenient model for scaling Record Linkage; protects users from system-level concerns
Introduction

5 Research Problem
Disconnect between the generic parallel framework and the specific Record Linkage problem
The goal: tailor Hadoop for Record Linkage tasks
Introduction

6 Outline Introduction Related Work Methodology Evaluation Conclusion

7 Related Work
Record Linkage literature: blocking techniques
Parallel Record Linkage literature: P-Febrl (P Christen 2003), P-Swoosh (H Kawai 2006), Parallel Linkage (H Kim 2007)
Hadoop literature: evaluation metrics, pairwise comparisons (T Elsayed 2008)
Related Work

8 Outline Introduction Related Work Methodology Evaluation Conclusion

9 MapReduce Workflow
[Diagram: MapReduce workflow with the Partitioner stage]
Methodology

10 Implementation
Map. Purpose: parallelism, data manipulation, blocking. Reads lines of input and outputs <key, value> pairs.
Reduce. Purpose: parallelism, Record Linkage operations. Records with the same <key> arrive at the same Reduce(), which outputs the linkage results.
Methodology
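A minimal sketch of such a Map/Reduce pair in Hadoop (the class names, the choice of the given name as blocking key, the field position, and the similarity test are illustrative assumptions, not the presentation's actual code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical blocking mapper: emits <blocking key, full record> pairs.
public class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        // Assumption: the given name (field 1) serves as the blocking key.
        String blockKey = fields[1].trim().toLowerCase();
        context.write(new Text(blockKey), line);
    }
}

// Hypothetical linkage reducer: all records sharing a key arrive at one
// Reduce() call, where the pairwise comparisons are performed.
class LinkageReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        List<String> block = new ArrayList<>();
        for (Text r : records) block.add(r.toString());   // copy: Hadoop reuses the Text object
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                if (similar(block.get(i), block.get(j))) { // placeholder similarity test
                    context.write(key, new Text(block.get(i) + " <=> " + block.get(j)));
                }
            }
        }
    }

    private boolean similar(String a, String b) {
        return a.equalsIgnoreCase(b);   // stand-in for a real field-wise comparison
    }
}
```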

11 Hash Partitioner
Default implementation: Hash(Key) mod N
Good for uniform data but not for skewed distributions
Name      Count  Comparisons
joshua    50     1225
emiily    48     1128
jack      35     595
thomas    33     528
lachlan   32     496
benjamin  31     465
Methodology
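For reference, Hadoop's default hash partitioning boils down to roughly the following rule (a sketch of Hash(Key) mod N, not the project's code):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default Hash(Key) mod N rule: every record whose key hashes
// into the same bucket lands on the same reducer, regardless of how many
// comparisons that block implies.
public class DefaultStylePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```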

12 Record Linkage Partitioner
Preprocessing
Partitioning: balances the number of comparisons assigned to each node in an online fashion, to attain a more consistent running time across nodes
Merging (TODO: external sorting ???)
Methodology

13 Record Linkage Partitioner
Goal: have all nodes finish the reduce phase at the same time
Attain a better runtime while retaining the same level of accuracy
Methodology
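One way to realize this goal, sketched here as an assumption rather than the presentation's actual implementation, is a custom Partitioner that consults a precomputed block-to-node assignment (produced by the preprocessing step) instead of hashing:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical workload-aware partitioner: blocks are routed according to a
// precomputed assignment that balances comparison counts, falling back to
// hashing for blocks the preprocessing step never saw.
public class ComparisonAwarePartitioner extends Partitioner<Text, Text> {
    // In practice this table would be loaded from the preprocessing output
    // (e.g. via the DistributedCache); hard-coded here for illustration.
    private static final Map<String, Integer> ASSIGNMENT = new HashMap<>();
    static {
        ASSIGNMENT.put("joshua", 0);   // 1225 comparisons -> its own node
        ASSIGNMENT.put("emiily", 1);   // 1128 comparisons
        ASSIGNMENT.put("jack", 2);     // smaller blocks share the third node
        ASSIGNMENT.put("thomas", 2);
    }

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        Integer node = ASSIGNMENT.get(key.toString());
        if (node != null) {
            return node % numReduceTasks;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```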

14 Domain principles
Counting pairwise comparisons gives a more accurate picture of the true computational workload
The distribution of names tends to follow a power-law distribution in many countries (D Zanette 2001), (S Miyazima 2000):
United States & Berlin: D. H. Zanette and S. C. Manrubia, Physica A 295, 1 (2001)
Taiwan: W. J. Reed and B. D. Hughes, Physica A 319, 579 (2003)
Japan: S. Miyazima, Y. Lee, T. Nagamine, and H. Miyajima, Physica A 278, 282 (2000)
England & Wales (first names): Douglas A. Galbi, Long-Term Trends in Personal Given Name Frequencies in the UK, FCC, 2002
Korea, China: exponential rather than Zipf
Methodology
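To make the first principle concrete: a block of n records needs nC2 = n(n-1)/2 within-block comparisons, so block size alone understates the skew. A small sketch of that arithmetic, using the block sizes from the skewed data set shown in slides 11 and 22:

```java
// Within-block workload grows quadratically: n records need n*(n-1)/2 comparisons.
public class ComparisonCount {
    static long comparisons(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        int[] blockSizes = {50, 48, 35, 33, 32, 31};   // e.g. joshua, emiily, jack, ...
        for (int n : blockSizes) {
            System.out.println(n + " records -> " + comparisons(n) + " comparisons");
        }
        // 50 -> 1225, 48 -> 1128, 35 -> 595, 33 -> 528, 32 -> 496, 31 -> 465
    }
}
```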

15 Record Linkage Workflow
Round 1: range partition based on comparison workload
Round 2: merge lost comparisons from Round 1
Round 3: remove cross duplicates
Methodology

16 Round 1
[Diagram: Input, Distribution, Map Phase]
1. Calculate the average comparison workload over N nodes
2. Check if a record will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among the nodes
Methodology
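A rough, self-contained interpretation of these steps (the splitting rule, the example block sizes, and the omitted lost-comparison recursion are assumptions, not the presentation's code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Round 1 balancing idea: split any block whose comparison
// load exceeds the per-node average, then assign the pieces.
public class RoundOneBalancer {

    static long comparisons(long n) { return n * (n - 1) / 2; }

    /** Returns {blockIndex, pieceSize} entries after splitting oversized blocks. */
    public static List<long[]> balance(long[] blockSizes, int numNodes) {
        List<long[]> pieces = new ArrayList<>();
        long totalWork = 0;
        for (long n : blockSizes) totalWork += comparisons(n);
        long avgPerNode = totalWork / numNodes;    // step 1: average comparison workload

        for (int i = 0; i < blockSizes.length; i++) {
            long n = blockSizes[i];
            if (comparisons(n) <= avgPerNode) {    // step 2: does this block exceed the average?
                pieces.add(new long[]{i, n});
                continue;
            }
            // Divide by the minimum number of parts needed to drop below the average.
            int parts = 2;
            while (comparisons((n + parts - 1) / parts) > avgPerNode) parts++;
            long pieceSize = (n + parts - 1) / parts;
            for (long assigned = 0; assigned < n; assigned += pieceSize) {
                pieces.add(new long[]{i, Math.min(pieceSize, n - assigned)});
            }
            // Steps 3-4 in the slides would now subtract the "lost" cross-piece
            // comparisons from the average and recurse; omitted in this sketch.
        }
        return pieces;
    }

    public static void main(String[] args) {
        long[] sizes = {50, 48, 35, 33, 32, 31};
        for (long[] p : balance(sizes, 4)) {
            System.out.println("block " + p[0] + ": piece of " + p[1] + " records");
        }
    }
}
```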

17 Round 2
[Diagram: list X split into A and B; within-list comparisons in R1, cross-list (lost) comparisons in R2]
Methodology

18 Round 2
Only acts on the lost comparisons from Round 1
Because the input is indistinct, a 3rd round of deduplication may be needed
[Diagram: partitions A, B (Job 1) and A, B, C (Jobs 1-3)]
Methodology
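A plausible way to remove such cross duplicates in a final pass, sketched under the assumption that each match is reported as a pair of record identifiers, is to normalise every pair into a canonical key and keep one copy per key:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical Round 3 deduplication: matches reported by different jobs may
// name the same pair of records, so normalise each pair into a canonical key
// (smaller id first) and keep only the first occurrence.
public class CrossDuplicateRemover {
    static String canonicalPair(String idA, String idB) {
        return idA.compareTo(idB) <= 0 ? idA + "|" + idB : idB + "|" + idA;
    }

    public static void main(String[] args) {
        String[][] matches = {              // e.g. output of Jobs 1-3 (illustrative ids)
            {"rec-12", "rec-87"},
            {"rec-87", "rec-12"},           // same pair reported by another job
            {"rec-05", "rec-44"}
        };
        Set<String> seen = new LinkedHashSet<>();
        for (String[] m : matches) {
            if (seen.add(canonicalPair(m[0], m[1]))) {
                System.out.println(m[0] + " matches " + m[1]);
            }
        }
    }
}
```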

19 Outline Introduction Related Work Methodology Evaluation Conclusion

20 Performance Metrics
Performance evaluated in absolute runtime, speedup & scaleup on a shared cluster
"It's what users care about": representative of real operations
Evaluation

21 Input Records
10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record, duplicates follow a Poisson distribution
Example record: <rec org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , , 38, , , 9>
Methodology

22 Data sets
Synthetic data produced with the Febrl data generator
Artificially skewed distribution:
Name      Count  Comparisons
joshua    50     1225
emiily    48     1128
jack      35     595
thomas    33     528
lachlan   32     496
benjamin  31     465
Methodology

23 Utilization Evaluation

24 Utilization Evaluation

25 Utilization
[Chart: A, B, C]
Evaluation

26 Utilization
[Chart: A, B]
Evaluation

27 Round 2
[Chart: Node Utilization % across jobs J1-J6 for A, B, C]
Evaluation

28 Results so far…
(2 nodes)
                                 Default Workflow   RL Workflow
5000 records, 2433 duplicates    71.5 secs          75 secs
7000 records, 4814 duplicates    >10 mins           196.8 secs
Evaluation

29 Results so far…
RL Workflow runtime: similar to the hash-based runtime on small datasets, better as the size of the dataset grows
Evaluation

30 Conclusion
Parallelism is a step in the right direction for record linkage, complementary to existing approaches
Hadoop can be tailored for Record Linkage tasks
The "Record Linkage" Partitioner / Workflow is just one example of possible improvements
Conclusion

