

Don't Match Twice: Redundancy-free Similarity Computation with MapReduce
Lars Kolb, Andreas Thor, Erhard Rahm
Database Group Leipzig, University of Leipzig
New York, DanaC 2013

2 / 9 Pairwise Similarity Computation (PSC)
Example applications
  - Document clustering
  - Set-similarity joins in databases
  - Entity resolution
Characteristics
  - O(n²)
  - Complex similarity functions
Optimizations
  - Clustering
  - Parallelization
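To make the O(n²) characteristic concrete, here is a minimal Python sketch of naive pairwise similarity computation. The sample offers and the token-based Jaccard function are illustrative assumptions, not taken from the slides; any (possibly much more complex) similarity function could be plugged in.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity of two strings (an illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def pairwise_similarity(objects, sim):
    """Compare every unordered pair once: n * (n - 1) / 2 calls to sim."""
    return {(a, b): sim(a, b) for a, b in combinations(objects, 2)}


# Hypothetical product offers, used only for illustration.
offers = ["Sony mp3 player", "Sony MP3 Player 4GB", "Samsung mobile phone"]
scores = pairwise_similarity(offers, jaccard)
```

For three offers this evaluates three pairs; for n offers it evaluates n(n-1)/2, which is why clustering and parallelization are listed as optimizations.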

3 / 9 Clustering-based PSC
[Figure: product offers partitioned by product type ("mp3", "mobile phone") and by manufacturer ("Sony", "Samsung")]
Appropriate signature creation is crucial for data quality
  - Efficiency vs. quality
  - Noisy, missing, inconsistent attribute values
Multiple signatures
  - Improve pairs completeness
  - Redundant evaluation of the same objects
  - Duplicates in result
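The effect of multiple signatures can be sketched in a few lines of Python. The four offers below are hypothetical; each one is placed in one cluster per signature (product type and manufacturer), and candidate pairs are generated within each cluster.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical offers: (id, product type, manufacturer).
offers = [
    ("o1", "mp3", "Sony"),
    ("o2", "mp3", "Sony"),
    ("o3", "mp3", "Samsung"),
    ("o4", "mobile phone", "Samsung"),
]

# Each offer gets two signatures, so it lands in two clusters.
clusters = defaultdict(list)
for oid, ptype, maker in offers:
    clusters[("type", ptype)].append(oid)
    clusters[("maker", maker)].append(oid)

# Candidate pairs are generated independently inside every cluster.
candidate_pairs = [
    pair for members in clusters.values() for pair in combinations(members, 2)
]
distinct_pairs = set(candidate_pairs)
redundant = len(candidate_pairs) - len(distinct_pairs)
```

Partitioning by manufacturer adds the pair (o3, o4), which partitioning by product type alone would miss, but it also generates (o1, o2) a second time. That repeated pair is the redundancy the rest of the talk sets out to avoid.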

4 / 9 Avoidance of Redundant Pairs

5 / 9 MapReduce-based Avoidance of Redundant Pairs

Input
  Object  Signatures
  A       {1, 3}
  B       {1, 3}
  C       {3}
  D       {1, 4}
  E       {1, 4}
  F       {2, 4}
  G       {2, 4}

Map: Signatures
  Map task 1 (objects A to D)
    Key  Value
    1    A, {1,3}
    3    A, {1,3}
    1    B, {1,3}
    3    B, {1,3}
    3    C, {3}
    1    D, {1,4}
    4    D, {1,4}
  Map task 2 (objects E to G)
    Key  Value
    1    E, {1,4}
    4    E, {1,4}
    2    F, {2,4}
    4    F, {2,4}
    2    G, {2,4}
    4    G, {2,4}

Partitioning by hash(Key) modulo r

Reduce: Pair Comparisons
  Reduce task 1
    Key  Value
    1    A, {1,3}
    1    B, {1,3}
    1    D, {1,4}
    1    E, {1,4}
    3    A, {1,3}
    3    B, {1,3}
    3    C, {3}
    Pairs for key 1: A-B, A-D, A-E, B-D, B-E, D-E
    Pairs for key 3: A-B, A-C, B-C
  Reduce task 2
    Key  Value
    2    F, {2,4}
    2    G, {2,4}
    4    D, {1,4}
    4    E, {1,4}
    4    F, {2,4}
    4    G, {2,4}
    Pairs for key 2: F-G
    Pairs for key 4: D-E, D-F, D-G, E-F, E-G, F-G

3 costly operations for each pair: set intersection + min + key comparison
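The following Python sketch simulates this slide in a single process, using the objects and signatures from the slide. It assumes that a pair is compared only in the reduce group whose key equals the pair's smallest common signature, which is how the three listed operations (set intersection, min, key comparison) fit together. The function names are hypothetical, and the hash partitioning across r reduce tasks is left out.

```python
from collections import defaultdict
from itertools import combinations

# Objects and signatures as shown on the slide.
SIGNATURES = {
    "A": {1, 3}, "B": {1, 3}, "C": {3},
    "D": {1, 4}, "E": {1, 4},
    "F": {2, 4}, "G": {2, 4},
}


def map_phase(signatures):
    """Emit one (key, (object, full signature set)) record per signature."""
    for obj, sigs in signatures.items():
        for key in sorted(sigs):
            yield key, (obj, sigs)


def reduce_phase(records):
    """Group records by key and compare each pair at most once overall."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    compared, skipped = [], []
    for key in sorted(groups):
        for (a, sigs_a), (b, sigs_b) in combinations(groups[key], 2):
            # Set intersection + min + key comparison: the intersection is
            # never empty because both objects carry the current key.
            if min(sigs_a & sigs_b) == key:
                compared.append((a, b))
            else:
                skipped.append((a, b))
    return compared, skipped


compared, skipped = reduce_phase(map_phase(SIGNATURES))
```

With the slide's data, the pairs skipped as redundant are A-B in group 3 and D-E and F-G in group 4, since each of them already shares a smaller signature (1, 1 and 2 respectively).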

6 / 9 MapReduce-based Avoidance of Redundant Pairs (2)

Input
  Object  Signatures
  A       {1, 3}
  B       {1, 3}
  C       {3}
  D       {1, 4}
  E       {1, 4}
  F       {2, 4}
  G       {2, 4}

Map: Signatures
  Map task 1 (objects A to D)
    Key  Value
    1    A, {}     (annotate A with all of its signatures < 1)
    3    A, {1}    (annotate A with all of its signatures < 3)
    1    B, {}
    3    B, {1}
    3    C, {}
    1    D, {}
    4    D, {1}
  Map task 2 (objects E to G)
    Key  Value
    1    E, {}
    4    E, {1}
    2    F, {}
    4    F, {2}
    2    G, {}
    4    G, {2}

Partitioning by hash(Key) modulo r

Reduce: Pair Comparisons
  Reduce task 1
    Key  Value
    1    A, {}
    1    B, {}
    1    D, {}
    1    E, {}
    3    A, {1}
    3    B, {1}
    3    C, {}
    Pairs for key 1: A-B, A-D, A-E, B-D, B-E, D-E
    Pairs for key 3: A-B, A-C, B-C
  Reduce task 2
    Key  Value
    2    F, {}
    2    G, {}
    4    D, {1}
    4    E, {1}
    4    F, {2}
    4    G, {2}
    Pairs for key 2: F-G
    Pairs for key 4: D-E, D-F, D-G, E-F, E-G, F-G

Optimizations
  - Reduction of intermediate data
  - Overlap check of sorted lists instead of set intersection + min + key comparison
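A minimal Python sketch of the optimized variant, again simulated in one process with the slide's data. It assumes that each map record carries only the object's signatures smaller than the emitted key, kept as a sorted list, and that a pair is redundant in a reduce group exactly when the two lists share an element. The function names are hypothetical and hash partitioning is omitted.

```python
from collections import defaultdict
from itertools import combinations

# Objects and signatures as shown on the slide.
SIGNATURES = {
    "A": {1, 3}, "B": {1, 3}, "C": {3},
    "D": {1, 4}, "E": {1, 4},
    "F": {2, 4}, "G": {2, 4},
}


def map_phase(signatures):
    """Emit (key, (object, sorted signatures smaller than key))."""
    for obj, sigs in signatures.items():
        for key in sorted(sigs):
            yield key, (obj, sorted(s for s in sigs if s < key))


def sorted_lists_overlap(xs, ys):
    """Merge-style scan: True if two ascending lists share an element."""
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            return True
        if xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return False


def reduce_phase(records):
    """Compare a pair only if it shares no signature smaller than the key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    compared = []
    for key in sorted(groups):
        for (a, smaller_a), (b, smaller_b) in combinations(groups[key], 2):
            if not sorted_lists_overlap(smaller_a, smaller_b):
                compared.append((a, b))
    return compared


records = list(map_phase(SIGNATURES))
compared = reduce_phase(records)
```

The annotations are shorter than the full signature sets (for example, A is sent as `[]` under key 1 and `[1]` under key 3), which is where the reduction of intermediate data comes from, and the overlap scan can stop at the first shared element.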

7 / 9 Experimental Evaluation
  - Dedoop prototype for MR-based entity resolution (VLDB 2012)
  - 114,000 (noisy) electronic product offers
  - Hadoop 0.20.2 @ EC2 (20 worker VMs of type c1.medium)

  - Multiple signatures crucial for data quality
  - Substantial degree of redundant pairs
  - Run-time savings proportional to the cluster overlap

8 / 9 Experimental Evaluation (2)
  - Subset of n=100,000 offers, same environment
  - Systematic variation of the degree of redundancy
  - Fix number of offers and clusters but increase cluster
Naïve
  - Execution time grows in proportion to the number of comparisons
Redundancy-free PSC
  - Completes much faster (with same recall)
  - 4x faster for s=10,000

9 / 9 Conclusions

Thank you for your attention
