Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Presenter: Tsai Tzung.


1 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. Presenter: Tsai Tzung Ruei. Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. SIGIR 2008. 國立雲林科技大學 National Yunlin University of Science and Technology

2 Outline: Motivation, Objective, Methodology, Experiments, Conclusion, Comments

3 Motivation: Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.

4 Objective: Exact duplicates are easy to avoid during the collection of Web archives, but near duplicates frequently slip into the corpus. The objective is to detect these near duplicates.

5 Methodology: SPOT SIGNATURE EXTRACTION and MATCHING. [Figure: overview diagram with the labels Web, Database, and document]

6 Methodology: SPOT SIGNATURE EXTRACTION. A = {a_j(d_j, c_j)}, where each antecedent a_j has a spot distance d_j and a chain length c_j. Example: the antecedents a(1,2), an(1,2), the(1,2) and is(1,2) applied to “At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.” Result: S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
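The extraction step on this slide can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function name `spot_signatures`, the tokenizer, and the `STOPWORDS` list are assumptions made for illustration. It assumes that each antecedent starts a chain of the next c non-stopword tokens, taken at every d-th position, and that antecedents themselves count as stopwords when building chains.

```python
import re

# Antecedents from the slide's example: word -> (spot distance d, chain length c).
ANTECEDENTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

# Hypothetical stopword list (an assumption, not given on the slide).
STOPWORDS = {"at", "to", "for", "from", "on", "that", "into", "and", "of"}

TEXT = (
    "At a rally to kick off a weeklong campaign for the South Carolina "
    "primary, Obama tried to set the record straight from an attack "
    "circulating widely on the Internet that is designed to play into "
    "prejudices against Muslims and fears of terrorism."
)


def spot_signatures(text, antecedents=ANTECEDENTS, stopwords=STOPWORDS):
    """Return the list of spot signatures found in text, in document order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Antecedents are skipped inside chains, just like ordinary stopwords.
    skip = set(stopwords) | set(antecedents)
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        d, c = antecedents[token]
        # Non-stopword tokens that follow this antecedent occurrence.
        content = [t for t in tokens[i + 1:] if t not in skip]
        # Take every d-th such token, keeping at most c of them.
        chain = content[d - 1::d][:c]
        if len(chain) == c:  # drop incomplete chains at the end of the text
            signatures.append(":".join([token] + chain))
    return signatures


for signature in spot_signatures(TEXT):
    print(signature)
```

With the assumed stopword list, the seven signatures printed match the result set S on the slide, in the same order.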

7 Methodology: SPOT SIGNATURE MATCHING. Jaccard Similarity for Sets; Generalization for Multi-Sets
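The formulas on this slide did not survive in the transcript. As a hedged sketch, the standard definitions are |A ∩ B| / |A ∪ B| for sets and, for multi-sets, the sum of per-signature minimum frequencies divided by the sum of per-signature maximum frequencies. The function names and the sample signature lists below are illustrative assumptions, as is the convention that two empty inputs have similarity 1.0.

```python
from collections import Counter


def jaccard_set(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty inputs
    return len(a & b) / len(a | b)


def jaccard_multiset(a, b):
    """Jaccard similarity generalized to multi-sets (bags) of signatures."""
    a, b = Counter(a), Counter(b)
    # Counter '&' keeps the minimum count per key, '|' the maximum count.
    union_size = sum((a | b).values())
    if union_size == 0:
        return 1.0  # assumed convention for two empty inputs
    return sum((a & b).values()) / union_size


# Hypothetical signature lists, used only to exercise the functions.
x = ["a:rally:kick", "the:south:carolina", "the:south:carolina"]
y = ["a:rally:kick", "the:south:carolina", "is:designed:play"]

print(jaccard_set(x, y))       # 2 shared out of 3 distinct signatures
print(jaccard_multiset(x, y))  # (1 + 1 + 0) / (1 + 2 + 1)
```

The multi-set form penalizes differing frequencies of the same signature, which the plain set form ignores.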

8 Methodology: SPOT SIGNATURE MATCHING. [Figure: diagram with the labels SPOT SIGNATURE partition, Inverted Index Pruning, and Jaccard Similarity for Sets]

9 Methodology: Optimal Partitioning

10 Methodology: Inverted Index Pruning. Example: d1 = {s1:5, s2:4, s3:4}, with |d1| = 13; d2 = {s1:8, s2:4}, |d2| = 12; d3 = {s1:4, s2:5, s3:5}, |d3| = 14; τ = 0.8; δ1 = 0; δ2 = |d1| − |d3| = −1. [Figure: diagram with the labels SPOT SIGNATURE partition, Inverted Index Pruning, and Jaccard Similarity for Sets]
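The three documents in this example can be run through a simplified matching sketch. It is not the δ1/δ2 pruning procedure from the slide, whose details are not recoverable from the transcript. Instead it assumes a plain inverted index from signatures to documents, a length-ratio filter based on the fact that multi-set Jaccard similarity can never exceed min(|A|, |B|) / max(|A|, |B|), and a threshold test of the form similarity ≥ τ. All function names are hypothetical.

```python
from collections import defaultdict

# Documents from the slide: signature -> frequency.
DOCS = {
    "d1": {"s1": 5, "s2": 4, "s3": 4},
    "d2": {"s1": 8, "s2": 4},
    "d3": {"s1": 4, "s2": 5, "s3": 5},
}
TAU = 0.8


def length(doc):
    """Multi-set size |d|: the sum of all signature frequencies."""
    return sum(doc.values())


def build_index(docs):
    """Inverted index: signature -> list of (document name, frequency)."""
    index = defaultdict(list)
    for name, signatures in docs.items():
        for signature, frequency in signatures.items():
            index[signature].append((name, frequency))
    return index


def length_bound(a, b):
    """Upper bound on multi-set Jaccard similarity from lengths alone."""
    return min(length(a), length(b)) / max(length(a), length(b))


def multiset_jaccard(a, b):
    """Sum of minimum frequencies over sum of maximum frequencies."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator


def near_duplicates(query, docs, tau):
    """Return (name, similarity) pairs for documents similar to query."""
    index = build_index(docs)
    # Only documents sharing at least one signature can have similarity > 0.
    candidates = {name for s in docs[query] for name, _ in index[s]}
    candidates.discard(query)
    matches = []
    for name in sorted(candidates):
        if length_bound(docs[query], docs[name]) < tau:
            continue  # cannot reach tau, skip the exact comparison
        similarity = multiset_jaccard(docs[query], docs[name])
        if similarity >= tau:
            matches.append((name, similarity))
    return matches


print(near_duplicates("d1", DOCS, TAU))
```

For d1, both d2 and d3 pass the length filter (12/13 and 13/14), but only d3 reaches the threshold: its similarity is 12/15 = 0.8, while d2 scores 9/16.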

11 Experiments: Gold Set of Near Duplicate News Articles (SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing). TREC WT10g (SpotSigs vs. Hashing).

12 Experiments: Gold Set of Near Duplicate News Articles. SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing

13 Experiments: TREC WT10g. SpotSigs vs. Hashing

14 Conclusion. MAJOR CONTRIBUTION: SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches. FUTURE WORK: Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach toward other metrics such as Cosine.

15 Comments. Advantage: The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient. Drawback: ….. Application: information retrieval

