Near-duplicates detection: Comparison of the two algorithms seen in class. Romain Colle.

Similar presentations
String Similarity Measures and Joins with Synonyms
Song Intersection by Approximate Nearest Neighbours Michael Casey, Goldsmiths Malcolm Slaney, Yahoo! Inc.
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Clustering and Load Balancing Optimization for Redundant Content Removal Shanzhong Zhu (Ask.com) Alexandra Potapova, Maha Alabduljalil (Univ. of California.
High Dimensional Search Min-Hashing Locality Sensitive Hashing
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Evaluating Search Engine
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data Changjun Wu, Ananth Kalyanaraman School of Electrical.
CSE 373: Data Structures and Algorithms
Dg.o conference 2006 Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.
Ph.D. Defence, University of Alberta. Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Detecting Near Duplicates for Web Crawling Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma Presented by Chintan Udeshi 6/28/ Udeshi-CS572.
Near Duplicate Detection
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir.
Finding Similar Items.
J Cheng et al,. CVPR14 Hyunchul Yang( 양현철 )
Important Task in Patents Retrieval Recall is an Important Factor Given Query Patent -> the Task is to Search all Related Patents Patents have Complex.
Near-Duplicate Detection by Instance-level Constrained Clustering Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.
CS 349: WebBase 1 What the WebBase can and can’t do?
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes Bilal Hawashin, Farshad Fotouhi Traian Marius Truta Department.
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
AMOST Experimental Comparison of Code-Based and Model-Based Test Prioritization Bogdan Korel Computer Science Department Illinois Institute of Technology.
Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.
Near Duplicate Image Detection: min-Hash and tf-idf weighting
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
Case Study.  Client needed to build a tool to crawl through their data set and identify duplicates  The algorithm should identify exact as well as near.
Click to edit Present’s Name Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2 HmSearch: An Efficient Hamming Distance Query Processing.
A Statistical Approach to Speed Up Ranking/Re-Ranking Hong-Ming Chen Advisor: Professor Shih-Fu Chang.
Brief (non-technical) history Full-text index search engines Altavista, Excite, Infoseek, Inktomi, ca Taxonomies populated with web page Yahoo.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Presenter: Tsai Tzung.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
May 30, 2016, Department of Computer Sciences, UT Austin. Using Bloom Filters to Refine Web Search Results Navendu Jain Mike Dahlin University of Texas at.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
Fundamentals of Algorithms MCS - 2 Lecture # 15. Bubble Sort.
PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.
Scott Wen-tau Yih (Microsoft Research) Joint work with Hannaneh Hajishirzi (University of Illinois) Aleksander Kolcz (Microsoft Bing)
QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Applying Syntactic Similarity Algorithms.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
Motivation: Sorting is among the fundamental problems of computer science. Sorting of different datasets is present in most applications, ranging from.
A Critique and Improvement of an Evaluation Metric for Text Segmentation A Paper by Lev Pevzner (Harvard University) Marti A. Hearst (UC, Berkeley) Presented.
Over-Trained Network Node Removal and Neurotransmitter-Inspired Artificial Neural Networks By: Kyle Wray.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Advantages of Query Biased Summaries in Information Retrieval by A. Tombros and M. Sanderson Presenters: Omer Erdil Albayrak Bilge Koroglu.
Learning Extraction Patterns for Subjective Expressions 2007/10/09 DataMining Lab 안민영.
Dijkstra-Scholten and Shavit-Francez termination algorithms
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
The Value of USAP in Software Architecture Design Presentation by: David Grizzanti.
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, February 28, 2013 Session 6: Similar Item Detection This work is licensed.
EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad.
Big Data Infrastructure
Optimizing Parallel Algorithms for All Pairs Similarity Search
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
Information Management course
Text Based Similarity Metrics and Delta for Semantic Web Graphs
Building Fractal Landscapes
Shell Sort and Merge Sort
Decision Maths Unit 7 Sorting Algorithms 3. Shell Sort.
Saliency Optimization from Robust Background Detection
Sorting Sorting is a fundamental problem in computer science.
Presentation transcript:

Near-duplicates detection Comparison of the two algorithms seen in class Romain Colle

Description of algorithms
● 1st pass through the data: both algorithms compute a signature for each document and perform LSH (locality-sensitive hashing) on these signatures.
● 2nd pass through the data: the candidate duplicate pairs found are verified by computing their Jaccard similarity.
● Algorithm SH uses shingles + MinHashing to compute the signatures.
● Algorithm SK uses sketches of projections onto random hyperplanes to compute the signatures.
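The two-pass scheme described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the presentation: the word 4-gram shingles, the 64-entry signatures, the 16-band LSH, the 0.8 Jaccard threshold, the +/-1 hyperplane coordinates, and every function name (`shingles`, `minhash_signature`, `hyperplane_sketch`, `lsh_candidates`, `near_duplicates`) are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=4):
    """Return the set of word k-grams (shingles) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stable_hash(item, seed):
    """Deterministic 64-bit hash of a string, parameterised by an integer seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, size=64):
    """SH-style signature: minimum hash of the shingle set under each seed."""
    return [min(stable_hash(s, seed) for s in shingle_set) for seed in range(size)]


def hyperplane_sketch(shingle_set, size=64):
    """SK-style signature: one sign bit per pseudo-random +/-1 hyperplane."""
    bits = []
    for plane in range(size):
        # Each shingle is a dimension; its hyperplane coordinate is +1 or -1.
        dot = sum(1 if stable_hash(s, plane) & 1 else -1 for s in shingle_set)
        bits.append(1 if dot >= 0 else 0)
    return bits


def lsh_candidates(signatures, bands=16):
    """Pairs of document ids whose signatures agree on at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def near_duplicates(docs, signer, threshold=0.8):
    """Pass 1: signatures + LSH. Pass 2: verify candidates with Jaccard."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    signatures = {doc_id: signer(s) for doc_id, s in sets.items()}
    return {(x, y) for x, y in lsh_candidates(signatures)
            if jaccard(sets[x], sets[y]) >= threshold}
```

Passing `minhash_signature` or `hyperplane_sketch` as `signer` switches between an SH-like and an SK-like first pass while the Jaccard verification stays the same. Because the second pass recomputes the exact similarity, LSH false positives are filtered out; pairs that never collide in any band are simply missed, which is where the two signature schemes can differ in recall.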

Experimentation method
● Run both algorithms on the data set (WebBase) and compute precision.
● Remove the duplicate pairs found from the data set.
● Generate and insert a large number of (near-)duplicate documents (~10% of the data set).
● Run both algorithms on the new data set and compute precision and recall.
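For reference, precision and recall over sets of duplicate pairs can be computed as in the sketch below. The function name `precision_recall` and the convention of returning 1.0 for an empty set are assumptions made for this example. Recall requires knowing the true duplicate pairs, which is presumably why it is only reported on the modified data set, where the inserted (near-)duplicates are known.

```python
def precision_recall(found_pairs, true_pairs):
    """Precision and recall of reported duplicate pairs against known ones.

    Pairs are treated as unordered, so (1, 2) and (2, 1) are the same pair.
    """
    found = {tuple(sorted(p)) for p in found_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    true_positives = len(found & truth)
    # Assumed convention: an empty set yields a score of 1.0.
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall
```

With three reported pairs of which two are correct, and four true pairs in total, this gives a precision of 2/3 and a recall of 1/2.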

Results (original data set)

Results (modified data set)

Conclusion
● Algorithm SK rocks!
● However, it is computationally more expensive.
● There is a tradeoff between speed and recall/precision (given that algorithm SH performs quite well).