Published by Esteban Wonnacott. Modified over 3 years ago.

Don't Match Twice: Redundancy-free Similarity Computation with MapReduce
Lars Kolb, Andreas Thor, Erhard Rahm
Database Group Leipzig, University of Leipzig
New York, DanaC 2013

2 / 9 Pairwise Similarity Computation (PSC)
Example applications: document clustering, set-similarity joins in databases, entity resolution.
Characteristics: O(n²) comparisons, complex similarity functions.
Optimizations: clustering, parallelization.
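The quadratic cost named on this slide can be made concrete with a minimal sketch of naive PSC. The token-based Jaccard function, the sample records, and the threshold are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-based Jaccard similarity (an illustrative choice of measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def naive_psc(records, threshold):
    """Compare every unordered pair once: n * (n - 1) / 2 comparisons."""
    matches = []
    comparisons = 0
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        if jaccard(a, b) >= threshold:
            matches.append((i, j))
    return matches, comparisons

# Hypothetical product offers, made up for this sketch.
records = [
    "sony mp3 player",
    "sony mp3 player 4gb",
    "samsung mobile phone",
    "samsung phone",
]
matches, comparisons = naive_psc(records, threshold=0.5)
print(matches, comparisons)
```

With four records the loop performs six comparisons; doubling the input roughly quadruples that number, which is why clustering and parallelization are listed as optimizations.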

3 / 9 Clustering-based PSC
Partition by product type: "mp3", "mobile phone"
Partition by manufacturer: "Sony", "Samsung"
Appropriate signature creation is crucial for data quality:
  Efficiency vs. quality
  Noisy, missing, inconsistent attribute values
Multiple signatures:
  Improve pairs completeness
  Redundant evaluation of the same objects
  Duplicates in result
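How multiple signatures lead to redundant evaluation can be sketched as follows. The five offers and the two signature functions (manufacturer and product type) are hypothetical; only the partition labels come from the slide.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical offers: id -> (manufacturer, product type).
offers = {
    "o1": ("Sony", "mp3"),
    "o2": ("Sony", "mp3"),
    "o3": ("Sony", "mobile phone"),
    "o4": ("Samsung", "mobile phone"),
    "o5": ("Samsung", "mp3"),
}

def candidate_pairs(offers):
    """Count how often each pair is generated across all clusters."""
    clusters = defaultdict(list)
    for oid, (manufacturer, ptype) in offers.items():
        # Each offer gets two signatures, so it lands in two clusters.
        clusters[("manufacturer", manufacturer)].append(oid)
        clusters[("type", ptype)].append(oid)
    counts = Counter()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return counts

counts = candidate_pairs(offers)
total = sum(counts.values())                      # comparisons if run per cluster
redundant = {p for p, c in counts.items() if c > 1}
print(total, len(counts), redundant)
```

Here the pair (o1, o2) shares both manufacturer and product type, so a per-cluster evaluation would compare it twice and report it twice.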

4 / 9 Avoidance of Redundant Pairs

5 / 9 MapReduce-based Avoidance of Redundant Pairs

Input:
  Object  Signature
  A       {1, 3}
  B       {1, 3}
  C       {3}
  D       {1, 4}
  E       {1, 4}
  F       {2, 4}
  G       {2, 4}

Map: Signatures (partitioning by hash(Key) modulo r)
  Map output, part 1:
  Key  Value
  1    A, {1,3}
  3    A, {1,3}
  1    B, {1,3}
  3    B, {1,3}
  3    C, {3}
  1    D, {1,4}
  4    D, {1,4}

  Map output, part 2:
  Key  Value
  1    E, {1,4}
  4    E, {1,4}
  2    F, {2,4}
  4    F, {2,4}
  2    G, {2,4}
  4    G, {2,4}

Reduce: Pair Comparisons
  Reduce input, part 1:
  Key  Value
  1    A, {1,3}
  1    B, {1,3}
  1    D, {1,4}
  1    E, {1,4}
  3    A, {1,3}
  3    B, {1,3}
  3    C, {3}
  Pairs for key 1: A-B, A-D, A-E, B-D, B-E, D-E
  Pairs for key 3: A-B, A-C, B-C

  Reduce input, part 2:
  Key  Value
  2    F, {2,4}
  2    G, {2,4}
  4    D, {1,4}
  4    E, {1,4}
  4    F, {2,4}
  4    G, {2,4}
  Pairs for key 2: F-G
  Pairs for key 4: D-E, D-F, D-G, E-F, E-G, F-G

3 costly operations for each pair: set intersection + min + key comparison
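The check named on this slide (set intersection, min, key comparison) can be simulated in plain Python on the slide's example. This is a single-process sketch, not Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Signatures from the slide's example (object -> set of cluster keys).
SIGNATURES = {
    "A": {1, 3}, "B": {1, 3}, "C": {3},
    "D": {1, 4}, "E": {1, 4}, "F": {2, 4}, "G": {2, 4},
}

def map_phase(signatures):
    """Emit one (key, (object, full signature set)) record per signature."""
    for obj, sigs in signatures.items():
        for key in sorted(sigs):
            yield key, (obj, frozenset(sigs))

def shuffle(records):
    """Group map output by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Compare a pair only in the cluster with the smallest shared key."""
    for (a, sig_a), (b, sig_b) in combinations(values, 2):
        # Both objects contain `key`, so the intersection is never empty.
        if min(sig_a & sig_b) == key:
            yield a, b

groups = shuffle(map_phase(SIGNATURES))
# Pairs a per-cluster evaluation without the check would compare.
naive_count = sum(len(v) * (len(v) - 1) // 2 for v in groups.values())
pairs = [p for key in sorted(groups) for p in reduce_phase(key, groups[key])]
print(naive_count, len(pairs))
```

On this input the check skips A-B under key 3 and both D-E and F-G under key 4, because each of those pairs already shares a smaller key.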

6 / 9 MapReduce-based Avoidance of Redundant Pairs (2)

Input:
  Object  Signature
  A       {1, 3}
  B       {1, 3}
  C       {3}
  D       {1, 4}
  E       {1, 4}
  F       {2, 4}
  G       {2, 4}

Map: Signatures (partitioning by hash(Key) modulo r)
  Map output, part 1:
  Key  Value
  1    A, {}      (annotate A with all of its signatures < 1)
  3    A, {1}     (annotate A with all of its signatures < 3)
  1    B, {}
  3    B, {1}
  3    C, {}
  1    D, {}
  4    D, {1}

  Map output, part 2:
  Key  Value
  1    E, {}
  4    E, {1}
  2    F, {}
  4    F, {2}
  2    G, {}
  4    G, {2}

Reduce: Pair Comparisons
  Reduce input, part 1:
  Key  Value
  1    A, {}
  1    B, {}
  1    D, {}
  1    E, {}
  3    A, {1}
  3    B, {1}
  3    C, {}
  Pairs for key 1: A-B, A-D, A-E, B-D, B-E, D-E
  Pairs for key 3: A-B, A-C, B-C

  Reduce input, part 2:
  Key  Value
  2    F, {}
  2    G, {}
  4    D, {1}
  4    E, {1}
  4    F, {2}
  4    G, {2}
  Pairs for key 2: F-G
  Pairs for key 4: D-E, D-F, D-G, E-F, E-G, F-G

Optimizations:
  Reduction of intermediate data
  Set intersection + min + key comparison replaced by an overlap check of sorted lists
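The optimized variant can be sketched the same way: each map record carries only the signatures smaller than its key, and the reducer skips a pair when the two sorted lists overlap. The two-pointer `overlaps` helper is one plausible reading of "overlap check of sorted list"; the slide does not show the actual implementation, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Signatures from the slide's example (object -> set of cluster keys).
SIGNATURES = {
    "A": {1, 3}, "B": {1, 3}, "C": {3},
    "D": {1, 4}, "E": {1, 4}, "F": {2, 4}, "G": {2, 4},
}

def map_phase(signatures):
    """Emit (key, (object, sorted signatures smaller than key))."""
    for obj, sigs in signatures.items():
        ordered = sorted(sigs)
        for i, key in enumerate(ordered):
            yield key, (obj, ordered[:i])

def overlaps(xs, ys):
    """Merge-style scan: True if two sorted lists share an element."""
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            return True
        if xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return False

def reduce_phase(values):
    """Skip a pair if it already shares a smaller key."""
    for (a, smaller_a), (b, smaller_b) in combinations(values, 2):
        if not overlaps(smaller_a, smaller_b):
            yield a, b

groups = defaultdict(list)
for key, value in map_phase(SIGNATURES):
    groups[key].append(value)

pairs = [p for key in sorted(groups) for p in reduce_phase(groups[key])]
emitted_for_a = [(k, v) for k, v in map_phase({"A": {1, 3}})]
print(emitted_for_a, len(pairs))
```

Compared with shipping the full signature set, the records for the smallest key carry an empty list, which is where the reduction of intermediate data comes from.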

7 / 9 Experimental Evaluation
Setup:
  Dedoop prototype for MR-based entity resolution (VLDB 2012)
  114,000 (noisy) electronic product offers
  Hadoop 0.20.2@EC2 (20 worker VMs of type c1.medium)
Observations:
  Multiple signatures crucial for data quality
  Substantial degree of redundant pairs
  Run-time savings proportional to the cluster overlap

8 / 9 Experimental Evaluation (2)
Setup:
  Subset of n=100,000 offers, same environment
  Systematic variation of the degree of redundancy
  Fix number of offers and clusters but increase cluster
Naïve:
  Execution time grows proportional to the number of comparisons
Redundancy-free PSC:
  Completes much faster (with same recall)
  4x faster for s=10,000

9 / 9 Conclusions

Thank you for your attention
