Near-duplicates detection: Comparison of the two algorithms seen in class. Romain Colle.


1 Near-duplicates detection: Comparison of the two algorithms seen in class. Romain Colle

2 Description of algorithms
● 1st pass through the data: both algorithms compute a signature for each document and perform LSH on these signatures.
● 2nd pass through the data: verification of the candidate duplicate pairs found (Jaccard similarity).
● Algorithm SH uses shingles + MinHashing to compute the signatures.
● Algorithm SK uses sketches of projections on random hyperplanes to compute the signatures.
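The two signature schemes and the Jaccard check can be sketched roughly as follows. This is only an illustration, not the implementation used in the experiments: the function names, the word-level shingles of size 4, the 64-value signatures, the 16 LSH bands, and the use of ±1 hyperplane components derived from a hash are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def _hash64(s, seed):
    """Deterministic 64-bit hash of a string, varied by an integer seed (helper assumed for this sketch)."""
    h = hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def shingles(text, k=4):
    """Set of k-word shingles of a document (k=4 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, used in the 2nd pass to verify candidates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """SH-style signature: the minimum hash of the shingle set under each seeded hash."""
    return [min(_hash64(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def hyperplane_sketch(text, num_bits=64):
    """SK-style signature: one bit per random hyperplane, the sign of the projection.

    Assumption: each hyperplane has a pseudo-random +1/-1 component per token,
    and the document vector is its token-count vector.
    """
    counts = Counter(text.lower().split())
    bits = []
    for plane in range(num_bits):
        dot = sum(c * (1 if _hash64(tok, plane) & 1 else -1)
                  for tok, c in counts.items())
        bits.append(1 if dot >= 0 else 0)
    return bits


def lsh_band_keys(signature, bands=16):
    """Split a signature into bands; documents sharing any band key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(bands)]
```

In this sketch, two documents that share at least one band key would be reported as a candidate pair in the first pass, and `jaccard(shingles(d1), shingles(d2))` would then be compared against a chosen threshold in the second pass.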

3 Experimentation method
● Run both algorithms on the data set (WebBase), and compute precision.
● Remove the duplicate pairs found from the data set.
● Generate and insert a large number of (near-)duplicate documents (~10% of the data set).
● Run both algorithms on the new data set, and compute precision and recall.
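As a rough illustration of the evaluation step, precision and recall over document pairs could be computed as below. The function name and the representation of pairs as two-element tuples of document ids are assumptions for this sketch. Recall needs a known set of true pairs, which is presumably why it is only reported once known (near-)duplicates have been inserted.

```python
def precision_recall(found_pairs, true_pairs):
    """Precision and recall of reported duplicate pairs against known true pairs.

    Pairs are treated as unordered, so (a, b) and (b, a) count as the same pair.
    """
    found = {frozenset(p) for p in found_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(found & truth)
    # By convention here, an empty denominator yields 1.0 (nothing to get wrong).
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical example: 3 reported pairs, 4 true pairs, 2 in common.
found = [("a", "b"), ("c", "d"), ("e", "f")]
truth = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
p, r = precision_recall(found, truth)
```

With these made-up pairs, two of the three reported pairs are correct and two of the four true pairs are recovered.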

4 Results (original data set)

5 Results (modified dataset)

6 Conclusion
● Algorithm SK rocks!
● However, it is computationally more expensive.
● Tradeoff between speed and recall/precision (given that algorithm SH performs quite well).

