Latent Semantic Indexing (mapping onto a smaller space of latent concepts). Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading 18.

Speeding up cosine computation
What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now: O(nm). Then: O(km + kn), where k << n, m.
Two methods: "Latent Semantic Indexing" and Random Projection.

A sketch
LSI is data-dependent: create a k-dim subspace by eliminating redundant axes, pulling together "related" axes (hopefully, car and automobile).
Random projection is data-independent: choose a k-dim subspace that guarantees good stretching properties, with high probability, between pairs of points.
What about polysemy?

Notions from linear algebra
Matrix A, vector v. Matrix transpose (A^t). Matrix product. Rank. Eigenvalue λ and eigenvector v: Av = λv.

Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition. Create a new (smaller) vector space. Queries are handled (faster) in this new space.

Singular-Value Decomposition
Recall the m × n matrix A of terms × docs. A has rank r ≤ min(m, n).
Define the term–term correlation matrix T = A A^t. T is a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
Define the doc–doc correlation matrix D = A^t A. D is a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.

A's decomposition
There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-products), and it turns out that A = P Σ R^t, where Σ is an r × r diagonal matrix carrying the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order. Shapes: A is m × n, P is m × r, Σ is r × r, R^t is r × n.

Dimensionality reduction
For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial]. Denote by Σ_k this new version of Σ, having rank k. Typically k is about 100, while r (A's rank) is > 10,000.
This yields A_k = P Σ_k R^t. Because of the zeroed rows/columns of Σ_k, only the first k columns of P and the first k rows of R^t are useful, so A_k is effectively the product of an m × k matrix and a k × n matrix.
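A minimal sketch of this truncation (assuming NumPy and a tiny made-up term–doc matrix; a real A is huge and sparse):

```python
import numpy as np

# Hypothetical 5-term x 4-doc matrix; entries = term counts.
A = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])

# Full SVD: A = P @ diag(sigma) @ Rt, singular values in decreasing order.
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k biggest singular values, zero out the rest
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]  # rank-k approximation of A
```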

Guarantee
A_k is a pretty good approximation to A: relative distances are (approximately) preserved. Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
min_{B: rank(B)=k} ||A − B||_F² = ||A − A_k||_F² = σ_{k+1}² + σ_{k+2}² + … + σ_r²
where the Frobenius norm is ||A||_F² = σ_1² + σ_2² + … + σ_r².
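The Frobenius guarantee can be checked directly on the toy NumPy example above (a sketch; same made-up matrix):

```python
import numpy as np

A = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.], [1., 1., 0., 0.],
              [0., 0., 1., 1.], [1., 0., 0., 1.]])
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]

# ||A - A_k||_F^2 equals the tail of squared singular values (up to float error).
print(np.linalg.norm(A - A_k, 'fro') ** 2, np.sum(sigma[k:] ** 2))
```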

Reduction
X_k = Σ_k R^t is the doc-matrix, k × n: the docs reduced to k dims.
Take the doc-correlation matrix: D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t), since P has orthonormal columns. Approximating Σ with Σ_k thus gives A^t A ≈ X_k^t X_k (both n × n matrices).
We use X_k to define how to project A and q: in X_k = Σ_k R^t, substitute R^t = Σ⁻¹ P^t A to get X_k = Σ_k Σ⁻¹ P^t A = P_k^t A, since Σ_k Σ⁻¹ P^t = P_k^t, a k × m matrix.
This means that to reduce a doc/query vector it is enough to multiply it by P_k^t. The cost of sim(q, d), over all docs d, is O(kn + km) instead of O(mn).
(R and P are formed by orthonormal eigenvectors of the matrices D and T.)
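A minimal sketch of this projection (same toy matrix as above; the query vector is made up):

```python
import numpy as np

A = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.], [1., 1., 0., 0.],
              [0., 0., 1., 1.], [1., 0., 0., 1.]])  # toy term-doc matrix
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2

P_k_t = P[:, :k].T                   # k x m projection onto latent concepts
docs_k = P_k_t @ A                   # every doc reduced to a k-dim vector
q = np.array([1., 0., 1., 0., 0.])   # a query in term space (m dims)
q_k = P_k_t @ q                      # the reduced query

# Cosine similarity in the latent space: O(k) per doc instead of O(m).
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
```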

Which are the concepts?
The c-th concept = the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c], whose size is m = #terms. P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
Projected query: q' = P_k^t q, where q'[c] = strength of concept c in q.

Random Projections. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Slides only!

An interesting math result
Lemma (Johnson–Lindenstrauss, '82). Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → ℝ^k, with k = O(ε⁻² log n), such that for every pair of points u, v in P:
(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²
Such an f is called a JL-embedding. Setting v = 0, we also get a bound on f(u)'s stretching!

What about the cosine-distance?
Applying the stretching bounds to f(u), f(v), and f(u) − f(v), and substituting them into cos θ = (u · v) / (||u|| ||v||), shows that the cosine distance is also preserved up to an error depending on ε.

How to compute a JL-embedding?
Set R = (r_{i,j}) to be a random m × k matrix whose components are independent random variables with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g., standard Gaussians, or ±1 each with probability 1/2); then map u to its k-dim image through R, scaled by 1/√k.
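A minimal sketch of the Gaussian variant (assuming NumPy; sizes are illustrative):

```python
import numpy as np

def jl_embed(points, k, seed=0):
    """Map the rows of `points` (n x m) into k dims via a random Gaussian
    matrix (entries: mean 0, variance 1), scaled by 1/sqrt(k) so that
    squared Euclidean distances are approximately preserved (JL lemma)."""
    rng = np.random.default_rng(seed)
    m = points.shape[1]
    R = rng.standard_normal((m, k))  # random m x k matrix
    return points @ R / np.sqrt(k)

# Toy check: a pairwise distance before vs. after projecting 10,000 -> 500 dims.
pts = np.random.default_rng(1).standard_normal((5, 10000))
emb = jl_embed(pts, k=500)
print(np.linalg.norm(pts[0] - pts[1]), np.linalg.norm(emb[0] - emb[1]))
```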

Finally...
Random projections hide large constants: k ≈ (1/ε)² log n, so it may be large… but the embedding is simple and fast to compute.
LSI is intuitive and may scale to any k; it is optimal under various metrics, but costly to compute.

Document duplication (exact or approximate). Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Slides only!

Duplicate documents
The web is full of duplicated content: few exact duplicates, but many near duplicates, e.g., two copies of a page whose only difference is the Last-Modified date. (Sec. 19.6)

Natural Approaches
Fingerprinting only works for exact matches: checksums offer no worst-case collision-probability guarantees; MD5 gives cryptographically-secure string hashes, but is slow.
Edit-distance is a metric for approximate string-matching, but it is expensive even for one pair of strings, and impossible for web documents.
Random sampling: sample substrings (phrases, sentences, etc.), hoping that similar documents → similar samples. But even samples of the same document will differ.

Exact-Duplicate Detection: obvious techniques
Checksum: no worst-case collision-probability guarantees. MD5: cryptographically-secure string hashes, but relatively slow.
Karp-Rabin's scheme: a rolling hash (split the doc into many pieces), an algebraic technique based on arithmetic on primes, efficient and with other nice properties…

Karp-Rabin Fingerprints
Consider the m-bit string A = 1 a_1 a_2 … a_m (the leading 1 keeps strings with leading zeros distinguishable). Basic values: choose a prime p in the universe U, such that 2p fits in few memory words (hence U ≈ 2^64).
Fingerprint: f(A) = A mod p.
Nice property: if B = a_2 … a_m a_{m+1} (the window shifted by one bit), then f(B) = [2 (A − 2^m − a_1 2^{m−1}) + 2^m + a_{m+1}] mod p, so f(B) is computable in O(1) time from f(A).
Prob[false hit] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≈ log(A + B) / #primes(U) ≈ (m log U) / U.
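A minimal sketch of the rolling update (plain Python; the bit-string, window size, and prime are made up for illustration):

```python
def karp_rabin_fingerprints(bits, m, p):
    """Fingerprint f = (window value) mod p for every m-bit window of `bits`,
    each computed in O(1) from the previous one (rolling hash)."""
    val = 0
    for b in bits[:m]:                     # value of the first window
        val = (val * 2 + b) % p
    fps = [val]
    top = pow(2, m - 1, p)                 # weight of the leading bit, mod p
    for i in range(m, len(bits)):
        # drop the outgoing bit, shift, add the incoming bit
        val = ((val - bits[i - m] * top) * 2 + bits[i]) % p
        fps.append(val)
    return fps

# All 4-bit windows of a small bit-string, fingerprinted mod the prime 101.
print(karp_rabin_fingerprints([1, 0, 1, 1, 0, 1, 0], m=4, p=101))
```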

Near-Duplicate Detection
Problem: given a large collection of documents, identify the near-duplicate ones.
Web search engines face a proliferation of near-duplicates: legitimate (mirrors, local copies, updates, …), malicious (spam, spider-traps, dynamic URLs, …), and mistaken (spider errors). Some 30% of web pages were near-duplicates [1997].

Desiderata
Storage: only small sketches of each document. Computation: the fastest possible. Stream processing: once the sketch is computed, the source is unavailable. Error guarantees: at this problem scale even small biases have a large impact, so we need formal guarantees; heuristics will not do.

Basic Idea [Broder 1997]
Shingling: dissect the document into q-grams (shingles), represent documents by their shingle-sets, and reduce the problem to set intersection [Jaccard]. Two documents are near-duplicates if their (large) shingle-sets intersect enough.

Similarity of Documents
Let S_A and S_B be the shingle-sets of Doc A and Doc B. The Jaccard measure of their similarity is sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|. Claim: A and B are near-duplicates if sim(S_A, S_B) is high.

Basic Idea [Broder 1997], continued
Shingling reduces near-duplicate detection to set intersection over shingle-sets. To cope with "set intersection" at scale we use: fingerprints of shingles (for space/time efficiency), and min-hashing to estimate intersection sizes (for further time and space efficiency).

Multiset of Fingerprints
Pipeline: Doc → (shingling) → multiset of shingles → (fingerprinting) → set of 64-bit fingerprints.
Fingerprints: use Karp-Rabin fingerprints over the q-gram shingles (of 8q bits each). The fingerprint space is [0, …, U − 1]; in practice use 64-bit fingerprints, i.e., U = 2^64, so Prob[collision] ≈ 8q / 2^64 << 1. This reduces the space for storing the multisets, and the time to intersect them, but…
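A minimal sketch of this pipeline (plain Python; word-level 4-grams and a truncated cryptographic hash stand in for character q-grams and Karp-Rabin fingerprints):

```python
import hashlib

def shingle_fingerprints(text, q=4):
    """Dissect a doc into q-gram shingles (here over words) and map each
    shingle to a 64-bit fingerprint (truncated BLAKE2 as a stand-in)."""
    words = text.lower().split()
    shingles = {' '.join(words[i:i + q]) for i in range(len(words) - q + 1)}
    return {int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), 'big')
            for s in shingles}

doc_a = shingle_fingerprints("the cat sat on the mat and then the cat slept")
doc_b = shingle_fingerprints("the cat sat on the mat and then the dog slept")
```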

Speeding-up: sketch of a document
Intersecting shingle-sets is too costly. Instead, create a "sketch vector" (of size ~200) for each document's shingle-set: documents that share ≥ t (say 80%) of the corresponding vector elements are near-duplicates. (Sec. 19.6)

Sketching by Min-Hashing
Consider S_A, S_B ⊆ P. Pick a random permutation π of P (such as x ↦ (ax + b) mod |P|). Define α = π⁻¹(min{π(S_A)}) and β = π⁻¹(min{π(S_B)}): the minimal elements of S_A and S_B under the permutation π.
Lemma: Prob[α = β] = sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|.

Strengthening it…
Similarity sketch sk(A) = the k minimal elements under π(S_A). (Should k be fixed, or a fixed ratio of |S_A|, |S_B|?) We might also take k permutations and keep the min of each.
Similarity sketches sk(A): a succinct representation of the fingerprint set S_A that allows efficient estimation of sim(S_A, S_B). The basic idea is to min-hash the fingerprints. Note: we can reduce the variance by using a larger k.

Computing Sketch[i] for Doc1
Start with the 64-bit fingerprints f(shingles), permute them on the number line with π_i, and pick the min value. (Sec. 19.6)

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Are the two min values equal? Test this for 200 random permutations π_1, π_2, …, π_200. (Sec. 19.6)

However…
Doc1.Sketch[i] = Doc2.Sketch[i] iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both, i.e., it lies in the intersection. Claim: this happens with probability size_of_intersection / size_of_union. (Sec. 19.6)
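A minimal sketch of min-hashing over the fingerprint sets built above (plain Python; the hash family, prime, and sketch size are illustrative choices):

```python
import random

def minhash_sketch(fps, n_perm=200, seed=42):
    """For each of n_perm random maps pi(x) = (a*x + b) mod P, keep the min
    over the fingerprint set fps. P is a Mersenne prime > 2**64, so each
    map acts as a random permutation of the fingerprint universe."""
    P = 2 ** 89 - 1
    rnd = random.Random(seed)
    perms = [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(n_perm)]
    return [min((a * x + b) % P for x in fps) for (a, b) in perms]

def estimate_similarity(sk_a, sk_b):
    """Fraction of agreeing positions, which estimates the Jaccard
    similarity |intersection| / |union| by the claim above."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)
```

For the toy doc_a and doc_b above, estimate_similarity(minhash_sketch(doc_a), minhash_sketch(doc_b)) concentrates around their true Jaccard similarity as n_perm grows.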

Sum up…
Brute-force: compare sk(A) vs. sk(B) for all pairs of documents A and B.
Locality-sensitive hashing (LSH): compute sk(A) for each document A, then apply LSH to all sketches. Briefly: take h elements of sk(A) as an ID (this may induce false positives); create t such IDs (to reduce the false negatives). If one ID matches another one (w.r.t. the same h-selection), then the corresponding docs are probably near-duplicates, hence compare them.
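A minimal sketch of this banding scheme (plain Python; h and t are illustrative, with t·h equal to the sketch length):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sketches, h=4, t=50):
    """Split each sketch (length t*h) into t bands of h elements; each band
    is an ID. Docs sharing any band ID become candidate near-duplicates."""
    buckets = defaultdict(list)
    for doc_id, sk in sketches.items():
        for band in range(t):
            key = (band, tuple(sk[band * h:(band + 1) * h]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs  # verify each candidate pair by comparing full sketches

# Usage: lsh_candidate_pairs({'A': minhash_sketch(doc_a), 'B': minhash_sketch(doc_b)})
```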

Search Engines: "semantic" searches?

Vector Space model (the classical approach)
"Diego Maradona won against Mexico" → dictionary of terms {against, Diego, Maradona, Mexico, won} → a term vector. Similarity(v, w) ≈ cos(θ), the angle between the two vectors.
Mainly term-based: polysemy and synonymy issues.

A new approach: massive graphs of entities and relations (May 2012)

A typical issue: polysemy
"the paparazzi photographed the star" vs. "the astronomer photographed the star"

Another issue: synonymy
"He is using Microsoft's browser" vs. "He is a fan of Internet Explorer"

Annotating "Diego Maradona won against Mexico" (TAGME)
PARSING detects mentions and their candidate entities: won → {Korean won, win–loss record, only won, …}; Maradona → {Diego A. Maradona, Diego Maradona jr., Maradona Stadium, Maradona Film, …}; Mexico → {Mexico nation, Mexico state, Mexico football team, Mexico baseball team, …}. PRUNING (2 simple features) discards mentions not worth annotating ("No Annotation"). DISAMBIGUATION by a voting scheme selects one entity per mention, each with a ρ score.

Why is it more powerful?
"obama asks iran for RQ-170 sentinel drone back" and "us president issues Ahmadinejad ultimatum" get annotated with the entities Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel, President of the United States, Mahmoud Ahmadinejad, and Ultimatum.

Text as a sub-graph of topics
The two texts above map to the entity nodes Mahmoud Ahmadinejad, Ultimatum, RQ-170 drone, President of the United States, Barack Obama, and Iran.

Text as a sub-graph of topics
Any relatedness measure over the graph, e.g. [Milne & Witten, 2008], can then be applied. Graph analysis makes it possible to find similarities between texts and entities even when they do not match syntactically, i.e., at the concept level.

Search Results Clustering
TOPICS for an ambiguous query like "jaguar": Jaguar Cars, Panthera Onca, Mac OS X, Atari Jaguar, Jacksonville Jags, Fender Jaguar, …

Papers at ACM WSDM 2012, ECIR 2012, and IEEE Software 2012. Releasing open-source… please design your killer app!