
1 CS791 - Technologies of Google, Spring 2007. A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, by Mehran Sahami and Timothy D. Heilman. Presented by Prashanth Kumar Muthoju and Aditya Varakantam.

2 Overview: Introduction; Previous Work; Proposed Similarity Function; Testing and Results; Theoretical Analysis; Application: Query Suggestion System; Evaluation of the Query Suggestion System; Conclusion; Future Work; References; Discussion.


8 Introduction. Text snippet: a short set of words, e.g., a query submitted to a search engine. Text similarity: comparing two text snippets. Example of ambiguity: "PDA" may mean "personal digital assistant" or "Pennsylvania Dental Association".

9 Introduction: the traditional method is the cosine function. Q1: "Hello how are you"; Q2: "Hello how are you now"; Q3: "Hello there". Over the vocabulary [hello, how, are, you, now], v1 = [1,1,1,1,0] and v2 = [1,1,1,1,1]; comparing v1 and v2 gives a cosine value of 0.89. Over the vocabulary [hello, how, are, you, there], v1 = [1,1,1,1,0] and v3 = [1,0,0,0,1]; comparing v1 and v3 gives a cosine value of 0.35. Hence Q1 and Q2 are more similar.
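
The cosine values above can be reproduced with a short script (a minimal sketch; the vectors are the binary term-count vectors from the slide):

```python
import math

def cosine(u, v):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Vocabulary: [hello, how, are, you, now]
v1 = [1, 1, 1, 1, 0]   # "Hello how are you"
v2 = [1, 1, 1, 1, 1]   # "Hello how are you now"
# Vocabulary: [hello, how, are, you, there]
v3 = [1, 0, 0, 0, 1]   # "Hello there"

print(round(cosine(v1, v2), 2))  # 0.89
print(round(cosine(v1, v3), 2))  # 0.35
```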

10 Introduction: a problem with the cosine function. Q1: "United Nations Secretary-General"; Q2: "Kofi Annan" (now Ban Ki-moon). The two queries share no terms, so the cosine value is 0, yet semantically both refer to the same person. Another example: Q1: "AI", Q2: "Artificial Intelligence". Here too the cosine value is 0.

11 Introduction: traditional similarity functions do not work well on short text snippets.

12 Introduction: a possible solution. Concentrate on capturing the semantic context of the text snippets rather than just comparing the terms in them: expand the short snippets into richer representations.

15 Introduction: the possible solution, illustrated. Issue Q1 and Q2 to a search engine; collect the documents retrieved for Q1 and the documents retrieved for Q2; build a context vector for Q1 and a context vector for Q2 from those documents; finally, compare the two context vectors with a traditional similarity function.

28 Previous Work. The proposed approach is based on the traditional query-expansion technique, but with a different motivation: traditionally, the goal was to improve the recall of retrieval; here the goal is to represent a short text snippet in a richer way so it can be compared with another snippet. Alternative methods used previously: methods based on the ordering of documents retrieved in response to the two queries; methods based on the normalized set overlap of the top m documents retrieved for each query; kernel methods (e.g., Support Vector Machines) for text classification; and methods using a semantic proximity matrix.

29 Proposed Similarity Function. Notation: x, y: short text snippets; QE(x): query expansion of x; R(x): the set of documents retrieved for x; d_i: the i-th retrieved document; v_i: the TFIDF term vector of d_i. TFIDF (term frequency, inverse document frequency): tf = number of times a term occurs in a given document; idf = log(total number of documents / number of documents containing the term); tfidf = tf x idf. C(x): the centroid of the L2-normalized vectors v_i.
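
The tf x idf weighting defined above can be sketched as follows (a toy three-document corpus made up for illustration; the natural log is assumed, since the slide does not fix a base):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a tf*idf vector (as a sparse dict) for each tokenized document,
    using idf = log(N / document frequency) as defined above."""
    n_docs = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [["apple", "fruit", "apple"],
        ["fruit", "banana"],
        ["car", "engine"]]
vecs = tfidf_vectors(docs)
# "apple" occurs twice in doc 0 and appears in 1 of 3 documents,
# so its weight is 2 * log(3):
print(round(vecs[0]["apple"], 3))  # 2.197
```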

30 Proposed Similarity Function (cont'd). Procedure: 1. Issue query x to a search engine S. 2. Compute the TFIDF vector v_i for each retrieved document d_i in R(x). 3. Truncate each vector v_i to its m highest-weight terms. 4. Compute the centroid C(x) of the L2-normalized vectors v_i. 5. Let QE(x) be the L2 normalization of C(x). 6. Similarity score: K(x, y) = QE(x) . QE(y). (Formulae taken from the paper.)
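
The six-step procedure can be sketched as follows (a minimal sketch that starts from the retrieved documents' TFIDF vectors as sparse dicts; the toy vectors here stand in for real search-engine results):

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def query_expansion(tfidf_vecs, m=50):
    """QE(x): truncate each document vector to its m highest-weight terms,
    L2-normalize each, average them into a centroid, and L2-normalize again."""
    centroid = Counter()
    for vec in tfidf_vecs:
        top_m = dict(sorted(vec.items(), key=lambda kv: -kv[1])[:m])
        for t, w in l2_normalize(top_m).items():
            centroid[t] += w / len(tfidf_vecs)
    return l2_normalize(centroid)

def kernel(qe_x, qe_y):
    """K(x, y) = QE(x) . QE(y): dot product of two sparse unit vectors."""
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())

# Toy stand-ins for the TFIDF vectors of the retrieved documents:
docs_x = [{"pda": 2.0, "handheld": 1.0}, {"pda": 1.0, "palm": 1.5}]
docs_y = [{"handheld": 1.0, "pda": 1.5}]
score = kernel(query_expansion(docs_x), query_expansion(docs_y))
print(0.0 <= score <= 1.0)  # True: dot product of unit vectors with non-negative weights
```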

31 Testing. The authors evaluated their approach against the cosine method and the set-overlap method on three genres of text snippets: 1. acronyms; 2. individuals and their positions; 3. multi-faceted terms.

32 Results (slides 32-35): tables taken from the paper.

36 Theoretical Analysis. Definition (ε-indistinguishable): two documents d_1 and d_2 are ε-indistinguishable to a search engine S with respect to a query q if S finds both d_1 and d_2 to be equally relevant to q within the tolerance ε of its ranking function. Notation: T_S(q): the set of all (ranked) documents retrieved for query q; all of these are ε-indistinguishable. R(q): the set of the n top-ranked documents from T_S(q). D: the size of the document collection. Assumption: T_S(q_1) = T_S(q_2), i.e., the two queries share the same set of maximally relevant documents.

37 Theoretical Analysis (cont'd). The expected normalized set overlap for q_1 and q_2 is n/|T|. As n increases (up to a maximum of |T|), the expected set overlap approaches 1. But as |T| → ∞ (i.e., as the number of retrievable documents grows), the expected overlap approaches 0. This is a flaw in the set-overlap method.

38 Theoretical Analysis (cont'd). Assume the document vectors v generated from documents in T are distributed according to some arbitrary distribution with mean μ and standard deviation σ (the angular deviation from μ). For this distribution, K(q_1, q_2) ≥ cos(5.16 σ / √n). (Formula taken from the paper.) K no longer depends on |T|, and as n increases the bound on K increases. K depends on σ too, but since the documents are identically distributed, σ will not change much as |T| → ∞. In other words, the function is robust: it can be used with very large document collections such as the web.

39 Theoretical Analysis (cont'd). Even if we drop the initial assumption that q_1 and q_2 share the same set of maximally relevant documents, K(q_1, q_2) ≥ cos(2.58 (σ_1 + σ_2) / √n + θ_{μ1,μ2}), where θ_{μ1,μ2} is the angle between the two means. (Formula taken from the paper.) Even in this case, K does not depend on |T|.

40 Sample Application: Query Suggestion System. Suggests potentially related queries to the users of search engines, giving them more options for finding information.

41 Sample Application: Query Suggestion System (cont'd). Implementation: the system starts with an initial repository Q of previously issued user queries (e.g., "computer", "laptop", "touchpad", ...). When a new query u (e.g., "ThinkPad") is issued, compute the kernel K(u, q_j) for every q_j in Q (e.g., 0.58, 0.88, 0.72, ...; assume all remaining scores are below 0.5), and suggest the queries q_j with the top kernel scores using a simple algorithm.

45 Sample Application: Query Suggestion System (cont'd). The selection algorithm (figure taken from the paper), with MAX = the maximum number of suggestions (assume MAX = 2 in this example): sort the candidate queries by kernel score into a list L; walk down L with index j = 1, 2, ..., k, adding q_j to the suggestion set Z whenever it does not overlap the suggestions already in Z (in the walkthrough, |q_j| = 1, |q_j ∩ z| = 0, |z| = 1, so "laptop" and then "touchpad" are added); stop as soon as size(Z) = MAX.
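
The selection loop above can be sketched as follows (a sketch under the assumption that the overlap test checks for shared terms between a candidate and the already-chosen suggestions):

```python
def suggest(candidates, scores, max_suggestions=2):
    """Pick up to max_suggestions queries, best kernel score first,
    skipping any candidate that shares a term with one already chosen
    (the |q_j ∩ z| = 0 test from the walkthrough)."""
    ranked = sorted(candidates, key=lambda q: -scores[q])
    chosen = []
    for q in ranked:
        terms = set(q.split())
        if all(terms.isdisjoint(set(z.split())) for z in chosen):
            chosen.append(q)
        if len(chosen) == max_suggestions:   # size(Z) = MAX: stop
            break
    return chosen

scores = {"computer": 0.58, "laptop": 0.88, "touchpad": 0.72}
print(suggest(["computer", "laptop", "touchpad"], scores))  # ['laptop', 'touchpad']
```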

57 Evaluation of Query Suggestion System. A total of 118 user queries were given as input, and a total of 379 queries were suggested: an average of 3.2 suggestions per query. Nine human evaluators were selected; each was asked to issue queries from the Google Zeitgeist for a different month of 2003. A 5-point scale was used for evaluation: 1. the suggestion is totally off topic; 2. the suggestion is not as good as the original query; 3. the suggestion is basically the same as the original query; 4. the suggestion is potentially better than the original query; 5. the suggestion is fantastic.

58 Evaluation of Query Suggestion System (cont'd). Results: table taken from the paper.

59 Evaluation of Query Suggestion System (cont'd). Average ratings at various kernel thresholds, and average ratings vs. average number of suggestions per query: graphs taken from the paper.

60 Conclusions. The authors proposed a new similarity kernel function and showed, both theoretically and empirically, that it outperforms traditional similarity functions on short text snippets. A query suggestion system built on the new kernel gave good results in user evaluation.

61 Future Work. Improvements in query expansion, such as improving the match score; incorporating the new kernel function into other machine learning methods; building further useful applications, such as a question answering system.

62 References. Wordhoard; Code Project; L2 normalization; normal distribution; latent semantic indexing; TFIDF; Google Suggest; Google Zeitgeist.

63 Discussion. Do the kernel scores change if we use different search engines?

64 Discussion. How can we make sure that the suggested queries are the correct ones? Search engines give popular results, not necessarily correct results.

65 CS791 - Technologies of Google Spring 2007 65 END

