CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.

Slides:



Advertisements
Similar presentations
Song Intersection by Approximate Nearest Neighbours Michael Casey, Goldsmiths Malcolm Slaney, Yahoo! Inc.
Advertisements

Indexing DNA Sequences Using q-Grams
Similarity and Distance Sketching, Locality Sensitive Hashing
Clustering Basic Concepts and Algorithms
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
PARTITIONAL CLUSTERING
Similarity and Distance Sketching, Locality Sensitive Hashing
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
Lecture 11 oct 6 Goals: hashing hash functions chaining closed hashing application of hashing.
Hashing as a Dictionary Implementation
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Searching on Multi-Dimensional Data
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
The Statistics of Fingerprints A Matching Algorithm to be used in an Investigation into the Reliability of the Use of Fingerprints for Identification Bob.
Bloom Filters Kira Radinsky Slides based on material from:
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Near Duplicate Detection
Quick Review of material covered Apr 8 B+-Tree Overview and some definitions –balanced tree –multi-level –reorganizes itself on insertion and deletion.
Techniques and Data Structures for Efficient Multimedia Similarity Search.
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Finding Similar Items.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
Fast multiresolution image querying CS474/674 – Prof. Bebis.
CpSc 3220 File and Database Processing Hashing. Exercise – Build a B + - Tree Construct an order-4 B + -tree for the following set of key values: (2,
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Database Applications – Microsoft Access Lesson 2 Modifying a Table and Creating a Form 45 slides in presentation Accessibility check 9/14.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Hashing.
Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter.
Data Structures and Algorithm Analysis Hashing Lecturer: Jing Liu Homepage:
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Applications of LSH (Locality-Sensitive Hashing) Entity Resolution Fingerprints Similar News Articles.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Hashing as a Dictionary Implementation Chapter 19.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
CI VERIFICATION METHODOLOGY & PRELIMINARY RESULTS
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Clustering Hongfei Yan School of EECS, Peking University 7/8/2009 Refer to Aaron Kimball’s slides.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Machine Learning in Practice Lecture 9 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
1 i206: Lecture 17: Exam 2 Prep ; Intro to Regular Expressions Marti Hearst Spring 2012.
Near Duplicate Detection
Clustering Algorithms
Hashing Alexandra Stefan.
Web Data Integration Using Approximate String Join
Evaluation of Relational Operations
Hash Table.
Lecture 11: Nearest Neighbor Search
Minwise Hashing and Efficient Search
Topological Signatures For Fast Mobility Analysis
Biologically Inspired Computing: Operators for Evolutionary Algorithms
LSH-based Motion Estimation
Lecture-Hashing.
Presentation transcript:

CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at:

3 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Distance Measure  A distance measure d(x,y) must have the following properties: 1. d(x,y) ≥ 0 2. d(x,y) = 0 iff x = y 3. d(x,y) = d(y,x) 4. d(x,y) ≤ d(x,z) + d(z,y)

4 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Euclidean Distance

5 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University L r - Norm

6 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Jaccard Distance

7 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Cosine Distance θ x y

8 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Cosine Distance: Example θ x = [0.1, 0.2, -0.1] y = [2.0, 1.0, 1.0]

9 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Edit Distance  What happens if you search for “Blkent” in Google?  “Showing results for Bilkent.”  Edit distance between x and y: Smallest number of insertions, deletions, or mutations needed to go from x to y.  What is the edit distance between “BILKENT” and “BLANKET”? B I L K E N TB I L K E N T B L A N K E TB L A N K E T dist(BILKENT, BLANKET) = 4  Efficient dynamic-programming algorithms exist to compute edit distance (CS473)

10 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Distance Metrics Summary  Important to choose the right distance metric for your application  Set representation?  Vector space?  Strings?  Distance metric chosen also affects complexity of algorithms  Sometimes more efficient to optimize L 1 norm than L 2 norm.  Computing edit distance for long sequences may be expensive  Many other distance metrics exist.

13 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Entity Resolution  Many records exist for the same person with slight variations  Name: “Robert W. Carson” vs. “Bob Carson Jr.”  Date of birth: “Jan 15, 1957” vs. “1957” vs none  Address: Old vs. new, incomplete, typo, etc.  Phone number: Cell vs. home vs. work, with or without country code, area code  Objective: Match the same people in different databases

14 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Locality Sensitive Hashing (LSH)  Simple implementation of LSH:  Hash each field separately  If two people hash to the same bucket for any field, add them as a candidate pair y.name x.name y.address x.address y.phone x.phone

15 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Candidate Pair Evaluation  Define a scoring metric and evaluate candidate pairs  Example:  Assign a score of 100 for each field. Perfect match gets 100, no match gets 0.  Which distance metric for names? Edit distance, but with quadratic penalty  How to evaluate phone numbers? Only exact matches allowed, but need to take care of missing area codes.  Pick a score threshold empirically and accept the ones above that Depends on the application and importance of false positives vs. negatives Typically need cross validation

17 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Fingerprint Matching  Many-to-many matching: Find out all pairs with the same fingerprints  Example: You want to find out if the same person appeared in multiple crime scenes  One-to-many matching: Find out whose fingerprint is on the gun  Too expensive to compare even one fingerprint with the whole database  Need to use LSH even for one-to-many problem  Preprocessing:  Different sizes, different orientations, different lighting, etc.  Need some normalization in preprocessing (not our focus here)

18 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Fingerprint Features  Minutia: Major features of a fingerprint Ridge ending Bifurcation Short ridge … Image Source: Wikimedia Commons

19 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Fingerprint Grid Representation  Overlay a grid and identify points with minutia X X X X X X X X

20 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Special Hash Function Choose 3 grid points If a fingerprint has minutia in all 3 points, add it to the bucket Otherwise, ignore the fingerprint.

21 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Locality Sensitive Hashing  Define 1024 hash functions  i.e. Each hash function is defined as 3 grid points  Add fingerprints to the buckets hash functions  If multiple fingerprints are in the same bucket, add them as a candidate pair.

22 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Example  Assume:  Probability of finding a minutia at a random grid point = 20%  If two fingerprints belong to the same finger: Probability of finding a minutia at the same grid point = 80%  For two different fingerprints:  Probability that they have minutia at point (x, y)? 0.2 * 0.2 = 0.04  Probability that they hash to the same bucket for a given hash function? =  For two fingerprints from the same finger:  Probability that they have minutia at point (x, y)? 0.2 * 0.8 = 0.16  Probability that they hash to the same bucket for a given hash function? =

23 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Example (cont’d)  For two different fingerprints and 1024 hash functions:  Probability that they hash to the same bucket at least once? 1 – ( ) 1024 =  For two fingerprints from the same finger and 1024 hash functions:  Probability that they hash to the same bucket at least once? 1 – ( ) 1024 =  False positive rate? 6.3%  False negative rate? 1.5%

24 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Example (cont’d)  How to reduce the false positive rate?  Try: Increase the number grid points from 3 to 6  For two different fingerprints and 1024 hash functions:  Probability that they hash to the same bucket at least once? 1 – ( ) 1024 =  For two fingerprints from the same finger and 1024 hash functions:  Probability that they hash to the same bucket at least once? 1 – ( ) 1024 =  False negative rate increased to 98.3%!

25 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Example (cont’d)  Second try: Add another AND function to the original setting 1. Define 2048 hash functions Each hash function is based on 3 grid points as before 2. Define two groups each with 1024 hash functions 3. For each group, apply LSH as before Find fingerprints that share a bucket for at least one hash function 4. If two fingerprints share at least one bucket in both groups, add them as a candidate pair

26 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Example (cont’d)  Reminder:  Probability that two fingerprints hash to the same bucket at least once for 1024 hash functions: If two different fingerprints: 1 – ( ) 1024 = If from the same finger: 1 – ( ) 1024 =  With the AND function at the end:  Probability that two fingerprints are chosen as candidate pair: If two different fingerprints: x = If from the same finger: x = 0.97  Reduced false positives to 0.4%, but increased false negatives to 3%  What if we add another OR function at the end?

28 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Similar News Articles  Typically, news articles come from an agency and distributed to multiple newspapers  A newspaper can modify the article a little, shorten it, add its own name, add advertisement, etc.  How to identify the same news articles?  Shingling + Min-Hashing + LSH  Potential problem: What if ~40% of the page is advertisement? How to distinguish the real article?  Special shingling

29 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Shingling for News Articles  Observation: Articles use stop words (the, a, and, for, …) much for frequently than ads.  Shingle definition: Two words followed by a stop word.  Example:  Advertisement: “Buy XYZ” No shingles  Article: “A spokesperson for the XYZ Corporation revealed today that studies have shown it is good for people to buy XYZ products.” Shingles: “A spokesperson for”, “for the XYZ”, “the XYZ Corporation”, “that studies have”, “have shown it”, “it is good”, “is good for”, “for people to”, “to buy XYZ”.  The content from the real article represented much more in the shingles.

30 CS 425 – Lecture 4Mustafa Ozdal, Bilkent University Identifying Similar News Articles  High level methodology: 1. Special shingling for news articles 2. Min-hashing (as before) 3. Locality sensitive hashing (as before)