CS791 - Technologies of Google Spring
A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
By Mehran Sahami, Timothy D. Heilman
Presented by Prashanth Kumar Muthoju, Aditya Varakantam

Overview
Introduction
Previous work
Proposed Similarity function
Testing & Results
Theoretical Analysis
Application: Query Suggestion System
Evaluation of Query Suggestion System
Conclusion
Future work
References
Discussion

Introduction
Text Snippet - a short set of words
–e.g., a query submitted to a search engine
Text Similarity - compare two text snippets
Example:
–PDA
–Personal digital assistant
–Pennsylvania Dental Association

Introduction
Traditional Method - cosine function
Q1: “Hello how are you”
Q2: “Hello how are you now”
Q3: “Hello there”
Comparing Q1 and Q2 over the terms [Hello, how, are, you, now]:
v1 = [1,1,1,1,0]
v2 = [1,1,1,1,1]
cosine value = 0.89
Comparing Q1 and Q3 over the terms [Hello, how, are, you, there]:
v1 = [1,1,1,1,0]
v3 = [1,0,0,0,1]
cosine value = 0.35
i.e., Q1 is more similar to Q2 than to Q3

Introduction
Problem with Cosine function
Q1: United Nations Secretary-General
Q2: Kofi Annan (Ban Ki-Moon ?)
When compared, the cosine value would be 0, because the two snippets share no terms. But semantically both of them refer to the same person.
Another Example:
Q1: AI
Q2: Artificial Intelligence
Here also, cosine value = 0
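
The cosine values quoted in the examples above can be reproduced with a few lines of Python. This is a minimal sketch, assuming binary term vectors (1 if the term occurs, 0 otherwise), lower-casing, and whitespace tokenization; the function name `cosine` is chosen here for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two snippets using binary term vectors."""
    terms_a, terms_b = set(a.lower().split()), set(b.lower().split())
    vocab = sorted(terms_a | terms_b)            # shared vocabulary of both snippets
    v1 = [1 if t in terms_a else 0 for t in vocab]
    v2 = [1 if t in terms_b else 0 for t in vocab]
    dot = sum(x * y for x, y in zip(v1, v2))
    # For 0/1 vectors the squared length is just the number of ones.
    norm = math.sqrt(sum(v1)) * math.sqrt(sum(v2))
    return dot / norm if norm else 0.0

print(round(cosine("Hello how are you", "Hello how are you now"), 2))  # 0.89
print(round(cosine("Hello how are you", "Hello there"), 2))            # 0.35
print(cosine("AI", "Artificial Intelligence"))                          # 0.0
```

The last call shows the problem directly: with no shared terms the dot product is zero, whatever the snippets mean.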

Introduction
Traditional similarity functions do not work well with short text snippets

Introduction
Possible Solution
Concentrate on capturing the semantic context of the text snippets rather than just comparing the terms in them.
Convert shorter snippets into bigger ones
–Expand them

Introduction
Possible Solution
[Diagram: Q1 and Q2 → Search Engine → Documents Retrieved for Q1 / Documents Retrieved for Q2 → Q1 Context vector / Q2 Context vector → Traditional Similarity Function]

Previous Work
The proposed approach is based on the traditional ‘Query Expansion’ technique.
Different motivations:
–Traditionally, the motivation was to improve the recall of the retrieval
–Our goal is to represent a short text snippet in a richer way in order to compare it with another snippet
Alternative methods previously used:
–Based on the ordering of documents retrieved in response to the two queries
–Based on the normalized set overlap of the top m documents retrieved for each of the two queries
–Based on kernel methods (e.g. Support Vector Machines) for text classification
–Using a Semantic Proximity Matrix

Proposed Similarity Function
Notations:
x : A short text snippet
y : A short text snippet
QE(x) : Query expansion of x
R(x) : Retrieved document set of x
d_i : i-th retrieved document
v_i : TFIDF term vector of d_i
C(x) : Centroid of the L2-normalized vectors v_i
TFIDF (Term Frequency Inverse Document Frequency):
tf : Number of times a term occurs in a given document
idf : log(total number of documents / number of documents containing the term)
tfidf = tf × idf
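
The tf and idf definitions above translate almost directly into code. The following is a minimal sketch, assuming whitespace tokenization, lower-casing, and the natural logarithm; the function name `tfidf_vectors` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Toy collection (hypothetical documents).
docs = [
    "artificial intelligence research",
    "artificial intelligence and machine learning",
    "dental association news",
]
for vec in tfidf_vectors(docs):
    print(vec)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the vector.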

Proposed Similarity Function (cont'd)
Procedure:
1. Issue query x to a search engine S
2. Compute the TFIDF vector v_i for each retrieved document d_i in R(x)
3. Truncate each vector v_i to include its m highest-weight terms
4. Find C(x): $C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}$, where n = |R(x)|
5. Find QE(x) = L2 normalization of C(x): $QE(x) = \frac{C(x)}{\|C(x)\|_2}$
6. Similarity score: $K(x, y) = QE(x) \cdot QE(y)$
Formulae taken from Paper
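
Steps 3 to 6 of the procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the search engine and TFIDF computation (steps 1 and 2) are skipped, the retrieved documents are given directly as hypothetical {term: weight} dictionaries, and all function names are invented here.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length (empty if it has zero length)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def truncate(vec, m):
    """Step 3: keep only the m highest-weight terms."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)

def query_expansion(doc_vectors, m):
    """Steps 4-5: centroid of the L2-normalized vectors, then normalize it."""
    n = len(doc_vectors)
    centroid = {}
    for v in doc_vectors:
        for term, w in l2_normalize(truncate(v, m)).items():
            centroid[term] = centroid.get(term, 0.0) + w / n
    return l2_normalize(centroid)

def kernel(qe_x, qe_y):
    """Step 6: K(x, y) = QE(x) . QE(y), a sparse dot product."""
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())

# Hypothetical TFIDF vectors for documents retrieved for two snippets.
docs_x = [{"artificial": 2.0, "intelligence": 1.5, "ai": 1.0},
          {"ai": 2.0, "learning": 1.0}]
docs_y = [{"artificial": 1.0, "intelligence": 1.0},
          {"machine": 1.0, "learning": 2.0, "intelligence": 0.5}]

qe_x = query_expansion(docs_x, m=2)
qe_y = query_expansion(docs_y, m=2)
print(kernel(qe_x, qe_y))
```

Because both expansions are unit vectors with non-negative weights, the score falls between 0 and 1, and a snippet compared with itself scores 1.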

Testing
The authors examined their approach alongside the cosine method and the set overlap method on three genres of text snippets:
1. Acronyms
2. Individuals & their positions
3. Multi-faceted terms

Results
[Table taken from Paper]

Results (cont'd)
[Table taken from Paper]

Results (cont'd)
[Table taken from Paper]

Theoretical Analysis
Definition:
ε-indistinguishable: Two documents d_1 and d_2 are ε-indistinguishable to a search engine S with respect to a query q if S finds both d_1 and d_2 to be equally relevant to q within the tolerance ε of its ranking function
Notation:
T_S(q) : Set of all (ranked) documents retrieved for query q. All of these are ε-indistinguishable.
R(q) : Set of the n top-ranked documents out of the T_S(q) documents.
D : Size of the document collection.
Assumption:
T_S(q_1) = T_S(q_2), i.e., the two queries share the same set of maximally relevant documents.

Theoretical Analysis (cont'd)
Expected normalized set overlap for q_1 & q_2 = n/|T|
As we increase n (to a maximum of |T|), the expected set overlap approaches 1.
But as |T| → ∞ (i.e., as the number of retrievable documents increases), the expected overlap approaches 0.
This is a flaw in this method.
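
The n/|T| figure can be checked with a small simulation. This is a minimal sketch, assuming that for each query the search engine returns n documents chosen uniformly at random from the same ε-indistinguishable set T (since it cannot tell them apart); the function names and the parameter values are illustrative only.

```python
import random

def expected_overlap(n, t_size):
    """Closed form from the slide: n / |T|."""
    return n / t_size

def simulated_overlap(n, t_size, trials=2000, seed=0):
    """Average normalized overlap |R(q1) & R(q2)| / n over random draws."""
    rng = random.Random(seed)
    docs = range(t_size)  # document ids 0 .. |T|-1
    total = 0.0
    for _ in range(trials):
        r1 = set(rng.sample(docs, n))  # top-n for q1
        r2 = set(rng.sample(docs, n))  # top-n for q2
        total += len(r1 & r2) / n
    return total / trials

for t_size in (20, 100, 1000):
    print(t_size, expected_overlap(10, t_size), simulated_overlap(10, t_size))
```

With n fixed at 10, the overlap shrinks as |T| grows, even though both queries have exactly the same relevant set.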

Theoretical Analysis (cont'd)
We can assume that the document vectors v generated from documents in T follow some arbitrary distribution with mean μ and standard deviation σ (the angular deviation from μ).
For this distribution, $K(q_1, q_2) \geq \cos(5.16\,\sigma / \sqrt{n})$: with about 99% confidence each centroid of n sampled vectors lies within 2.58 σ/√n of μ (2.58 is the two-sided 99% critical value of the normal distribution), so the angle between the two centroids is at most 5.16 σ/√n.
K doesn't depend on |T| now.
As n increases, the bound on K also increases.
K depends on σ too.
–But all documents in T are drawn from the same distribution, so σ will not change much as |T| → ∞.
i.e., this function is robust. It can be used with a very large collection of documents such as the web.
Formula taken from Paper
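
To get a feel for how the bound cos(5.16 σ/√n) moves with n, it can simply be evaluated for a few values. This is a minimal sketch, assuming σ is an angle in radians; the value σ = 0.5 and the function name `kernel_bound` are arbitrary choices for illustration.

```python
import math

def kernel_bound(sigma, n):
    """Evaluate cos(5.16 * sigma / sqrt(n)); sigma is assumed to be in radians."""
    return math.cos(5.16 * sigma / math.sqrt(n))

sigma = 0.5  # hypothetical angular deviation
for n in (10, 50, 200):
    print(n, round(kernel_bound(sigma, n), 3))
```

The bound moves toward 1 as n grows, and |T| does not appear anywhere in the expression.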

Theoretical Analysis (cont'd)
Even if we remove the initial assumption that q_1 and q_2 have the same set of maximally relevant documents,
$K(q_1, q_2) \geq \cos\left(\frac{2.58\,(\sigma_1 + \sigma_2)}{\sqrt{n}} + \theta_{\mu_1, \mu_2}\right)$
where θ_{μ_1, μ_2} is the angle between the means μ_1 and μ_2 of the two distributions.
Even in this case K doesn't depend on |T|.
Formula taken from Paper

Sample Application: Query Suggestion System
Suggests potentially related queries to the users of search engines.
Gives more options for information finding

Sample Application: Query Suggestion System (cont'd)
Implementation:
Let the query suggestion system start with an initial repository Q of previously issued user queries.
If a new query u is issued, compute the kernel function K(u, q_i) for all q_i in Q and suggest the queries q_i with the top kernel scores using a simple algorithm.
Example: Q = {computer, laptop, touchpad}, u = ThinkPad; compute K(u, q_j) for each q_j in Q.
Assume all these have K < 0.5

Sample Application: Query Suggestion System (cont'd)
Implementation:
[Figure taken from Paper: the suggestion algorithm. MAX = Maximum number of suggestions]
Worked example (assume MAX = 2):
–Q = {computer, laptop, touchpad} with scores K(u, q_j); the suggestion list Z starts empty
–Sort by K scores to obtain the ordered list L
–For j = 1..k, compare q_j with each z in Z ∪ u: |q_j| = 1, |q_j ∩ z| = 0, |z| = 1, so q_j is added to Z
–Z = {laptop}, then Z = {laptop, touchpad}
–Stop when size(Z) = MAX
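
The walk-through above can be sketched in code. The figure itself is not reproduced in these slides, so the acceptance test used here is an assumption reconstructed from the quantities shown in the trace: a candidate q_j is kept only if |q_j| − |q_j ∩ z| > 0.5 |z| for every z in Z ∪ {u}, with |·| counting distinct terms. The kernel scores are made-up numbers and the function name `suggest` is hypothetical.

```python
def suggest(u, scored_queries, max_suggestions):
    """Pick up to max_suggestions queries, best kernel score first.

    scored_queries: list of (query, K(u, query)) pairs.
    """
    def terms(q):
        return set(q.lower().split())

    # Ordered list L: candidates sorted by kernel score, highest first.
    ranked = sorted(scored_queries, key=lambda pair: pair[1], reverse=True)
    suggestions = []  # the list Z
    for q, _score in ranked:
        if len(suggestions) >= max_suggestions:
            break  # size(Z) = MAX
        q_terms = terms(q)
        # Assumed filter: q must differ enough from u and from every
        # suggestion already chosen.
        if all(len(q_terms) - len(q_terms & terms(z)) > 0.5 * len(terms(z))
               for z in [u] + suggestions):
            suggestions.append(q)
    return suggestions

# Hypothetical kernel scores for the slide's example.
scores = [("computer", 0.70), ("laptop", 0.90), ("touchpad", 0.80)]
print(suggest("ThinkPad", scores, max_suggestions=2))  # ['laptop', 'touchpad']
```

With these scores the run matches the trace: laptop is added first, then touchpad, and the loop stops once Z holds MAX = 2 entries. The overlap filter keeps near-duplicates of the user's query, or of an earlier suggestion, out of the list.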

Evaluation of Query Suggestion System
A total of 118 user queries were given as input
A total of 379 queries were suggested
Average = 3.2 suggestions per query
9 human evaluators were selected. Each was asked to issue queries from the Google Zeitgeist in a different month
5-point scale used for evaluation:
1. Suggestion is totally off topic
2. Suggestion is not as good as original query
3. Suggestion is basically same as original query
4. Suggestion is potentially better than original query
5. Suggestion is fantastic

Evaluation of Query Suggestion System (cont'd)
Results:
[Table taken from Paper]

Evaluation of Query Suggestion System (cont'd)
[Graphs taken from Paper: Average ratings at various kernel thresholds; Average ratings vs. Average number of Suggestions per Query]

Conclusions
The authors have proposed a new similarity kernel function, which was shown both theoretically and empirically to work better on short text snippets than the traditional similarity functions.
The query suggestion system built on this new kernel function gave good results in user evaluation.

Future work
Improvements in query expansion
–Improving match score
Incorporating this new kernel function into various other machine learning methods
Building useful applications such as a question answering system

References
Wordhoard
Code project
L2 normalization
Normal distribution
Latent semantic indexing
TFIDF
Google Suggest
Google Zeitgeist

Discussion
Do the kernel scores change if we use different search engines?

Discussion
How can we make sure that the suggested queries are the correct ones?
Search engines give popular results, not necessarily correct results