CS791 - Technologies of Google, Spring 2007: A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, by Mehran Sahami and Timothy D. Heilman

Presentation transcript:

CS791 - Technologies of Google, Spring 2007
A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
By Mehran Sahami and Timothy D. Heilman
Presented by Prashanth Kumar Muthoju and Aditya Varakantam

Overview
– Introduction
– Previous work
– Proposed similarity function
– Testing & results
– Theoretical analysis
– Application: query suggestion system
– Evaluation of query suggestion system
– Conclusion
– Future work
– References
– Discussion

Introduction
Text snippet: a set of words
– A query submitted to a search engine
Text similarity: compare two text snippets
Example:
– PDA
– Personal digital assistant
– Pennsylvania Dental Association

Introduction: Traditional Method (cosine function)
Q1: “Hello how are you”
Q2: “Hello how are you now”
Q3: “Hello there”
Comparing Q1 and Q2 over the terms [Hello, how, are, you, now]:
v1 = [1,1,1,1,0]
v2 = [1,1,1,1,1]
cosine value = 0.89
Comparing Q1 and Q3 over the terms [Hello, how, are, you, there]:
v1 = [1,1,1,1,0]
v3 = [1,0,0,0,1]
cosine value = 0.35
i.e., Q1 and Q2 are more similar than Q1 and Q3.
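The two cosine values on this slide can be reproduced with a minimal sketch. The helper names `bag_of_words` and `cosine` are illustrative choices, not taken from the slides, and the vectors are binary (term present or absent), as in the example.

```python
import math

def bag_of_words(text, vocabulary):
    """Binary term vector: 1 if the vocabulary term occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

q1 = "Hello how are you"
q2 = "Hello how are you now"
q3 = "Hello there"

vocab_12 = ["hello", "how", "are", "you", "now"]
vocab_13 = ["hello", "how", "are", "you", "there"]

sim_12 = cosine(bag_of_words(q1, vocab_12), bag_of_words(q2, vocab_12))
sim_13 = cosine(bag_of_words(q1, vocab_13), bag_of_words(q3, vocab_13))

print(round(sim_12, 2), round(sim_13, 2))  # prints: 0.89 0.35
```

Because the score depends only on shared terms, two snippets with no words in common always score 0, however closely related their meanings are.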

Introduction: Problem with the cosine function
Q1: United Nations Secretary-General
Q2: Kofi Annan (Ban Ki-Moon ?)
The two snippets share no terms, so their cosine value is 0, yet semantically both refer to the same person.
Another example:
Q1: AI
Q2: Artificial Intelligence
Here, too, the cosine value is 0.

Introduction
Traditional similarity functions do not work well with short text snippets.

Introduction: Possible Solution
– Concentrate on capturing the semantic context of the text snippets rather than just comparing the terms in them.
– Convert shorter snippets into bigger ones, i.e. expand them.
Diagram: Q1 and Q2 are each submitted to a search engine. The documents retrieved for Q1 and the documents retrieved for Q2 are turned into a Q1 context vector and a Q2 context vector, and the two context vectors are compared with a traditional similarity function.

Previous Work
The proposed approach is based on the traditional ‘Query Expansion’ technique.
Different motivations:
– Traditionally, the motivation was to improve the recall of the retrieval.
– Here, the goal is to represent a short text snippet in a richer way in order to compare it with another snippet.
Alternative methods previously used:
– Based on the ordering of documents retrieved in response to the two queries
– Based on the normalized set overlap of the top ‘m’ documents retrieved for each of the two queries
– Based on kernel methods (e.g. Support Vector Machines) for text classification
– Using a Semantic Proximity Matrix

Proposed Similarity Function
Notation:
– x : a short text snippet
– y : a short text snippet
– QE(x) : query expansion of x
– R(x) : retrieved document set of x
– d_i : i-th retrieved document
– v_i : TFIDF term vector of d_i
– C(x) : centroid of the L2-normalized vectors v_i
TFIDF (Term Frequency, Inverse Document Frequency):
– tf : number of times a term occurs in a given document
– idf : log(total number of documents / number of documents containing the term)
– tfidf = tf × idf
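The tf and idf definitions above can be sketched as follows. The function name `tfidf_vectors`, the whitespace tokenizer, and the three toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: tf * idf} dict per document, with idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # df[t] = number of documents that contain term t at least once
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

# Toy documents, invented for this example.
docs = [
    "personal digital assistant",
    "digital camera review",
    "dental association news",
]
vecs = tfidf_vectors(docs)
print(vecs[0])
```

In this toy collection "digital" occurs in two of the three documents, so it receives a lower weight than "personal", which occurs in only one.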

Proposed Similarity Function (cont.)
Procedure:
1. Issue query x to a search engine S.
2. Compute the TFIDF vector v_i for each retrieved document d_i in R(x).
3. Truncate each vector v_i to include its m highest-weight terms.
4. Find the centroid of the L2-normalized vectors: C(x) = (1/n) Σ_{i=1..n} v_i / ||v_i||_2, where n is the number of retrieved documents.
5. Find QE(x), the L2 normalization of C(x): QE(x) = C(x) / ||C(x)||_2
6. Similarity score: K(x, y) = QE(x) · QE(y)
Formulae taken from the paper.
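The six steps can be sketched end to end on a toy collection. Everything concrete here is an assumption for illustration: `CORPUS` is an invented six-document collection, `search` is a hypothetical stand-in for the search engine S (it returns documents containing every query term), and idf is computed over that same toy collection.

```python
import math
from collections import Counter

# Invented toy collection standing in for the web.
CORPUS = [
    "ai is short for artificial intelligence research",
    "artificial intelligence ai systems learn from data",
    "machine learning is a branch of artificial intelligence",
    "ai research covers machine learning and planning",
    "dental association members meet in pennsylvania",
    "the pennsylvania dental association publishes a journal",
]

def search(query):
    """Hypothetical search engine S: documents containing every query term."""
    terms = query.lower().split()
    return [doc for doc in CORPUS if all(t in doc.split() for t in terms)]

def idf_table(corpus):
    """idf = log(total documents / documents containing the term)."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.split()))
    return {t: math.log(len(corpus) / n) for t, n in df.items()}

IDF = idf_table(CORPUS)

def l2_normalize(vec):
    """Scale a sparse vector to unit length; an all-zero vector becomes empty."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def query_expansion(query, m=5):
    retrieved = search(query)                               # step 1
    centroid = Counter()
    for doc in retrieved:
        tf = Counter(doc.split())
        vec = {t: c * IDF[t] for t, c in tf.items()}        # step 2
        top = dict(sorted(vec.items(), key=lambda kv: kv[1],
                          reverse=True)[:m])                # step 3
        for t, w in l2_normalize(top).items():
            centroid[t] += w / len(retrieved)               # step 4
    return l2_normalize(centroid)                           # step 5

def kernel(x, y, m=5):
    qx, qy = query_expansion(x, m), query_expansion(y, m)
    return sum(w * qy.get(t, 0.0) for t, w in qx.items())   # step 6

k_related = kernel("ai", "artificial intelligence")
k_unrelated = kernel("ai", "dental association")
print(k_related > k_unrelated)
```

On this toy data "ai" and "artificial intelligence" share no terms, yet they retrieve overlapping documents, so their kernel score is positive, while "ai" and "dental association" retrieve documents with no terms in common.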

Testing
The authors compared their approach with the cosine method and the set overlap method on three genres of text snippets:
1. Acronyms
2. Individuals & their positions
3. Multi-faceted terms

Results
Tables taken from the paper.

Theoretical Analysis
Definition (ε-indistinguishable): Two documents d_1 and d_2 are ε-indistinguishable to a search engine S with respect to a query q if S finds both d_1 and d_2 to be equally relevant to q within the tolerance ε of its ranking function.
Notation:
– T_S(q) : set of all (ranked) documents retrieved for query q. All of these are ε-indistinguishable.
– R(q) : set of the n top-ranked documents out of the T_S(q) documents.
– D : size of the document collection.
Assumption: T_S(q_1) = T_S(q_2), i.e., the two queries share the same set of maximally relevant documents.

Theoretical Analysis (cont.)
Expected normalized set overlap for q_1 and q_2 = n/|T|
– As we increase n (to a maximum of |T|), the expected set overlap approaches 1.
– But as |T| → ∞ (i.e., the number of retrieved documents increases), the expected overlap approaches 0.
This is a flaw in the set overlap method.
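The n/|T| figure can be checked numerically under one simple modelling assumption: each query's top-n set is an independent, uniformly random n-subset of the same |T| indistinguishable documents. The function name `expected_normalized_overlap` is illustrative.

```python
from math import comb

def expected_normalized_overlap(n, t):
    """E[|A ∩ B|] / n for two independent uniform n-subsets A, B of a t-element set.

    The size of the intersection is hypergeometric, so the expectation is an
    exact finite sum over the possible overlap sizes k.
    """
    total = comb(t, n)
    expected = sum(k * comb(n, k) * comb(t - n, n - k) for k in range(n + 1)) / total
    return expected / n

for t in (100, 1000, 10000):
    print(t, expected_normalized_overlap(10, t))
```

With n fixed at 10, the value falls as |T| grows, matching the observation that set overlap degrades on very large collections.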

Theoretical Analysis (cont.)
We can assume that the document vectors v generated from documents in T are distributed according to some arbitrary distribution with mean μ and standard deviation σ (the angular deviation from μ). For this distribution,
K(q_1, q_2) ≥ cos(5.16 σ / √n)
– The bound on K no longer depends on |T|.
– As n increases, the bound on K also increases.
– The bound depends on σ too, but since all documents in T follow the same distribution, σ will not change much as |T| → ∞.
i.e., this function is robust. It can be used with a very large collection of documents such as the web.
Formula taken from the paper.
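A short numerical sketch of the quantity cos(5.16 σ / √n) from this slide. The value σ = 0.5 is an arbitrary illustration, σ is assumed to be measured in radians, and the function name `kernel_bound` is not from the slides.

```python
import math

def kernel_bound(sigma, n):
    """The slide's bound cos(5.16 * sigma / sqrt(n)); sigma is assumed to be in radians."""
    return math.cos(5.16 * sigma / math.sqrt(n))

sigma = 0.5  # arbitrary illustrative angular standard deviation
for n in (10, 50, 200):
    print(n, round(kernel_bound(sigma, n), 3))
```

The function takes only σ and n as arguments, so the size of the collection |T| never enters the computation, and the value moves toward 1 as n grows.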

Theoretical Analysis (cont.)
Even if we remove the initial assumption that q_1 and q_2 have the same set of maximally relevant documents,
K(q_1, q_2) ≥ cos(2.58 (σ_1 + σ_2) / √n + θ_{μ_1, μ_2})
where θ_{μ_1, μ_2} is the angle between the two means μ_1 and μ_2. Even in this case, the bound does not depend on |T|.
Formula taken from the paper.

Sample Application: Query Suggestion System
– Suggests potentially related queries to the users of search engines.
– Gives more options for information finding.

Sample Application: Query Suggestion System (cont.)
Implementation:
– Let the system start with an initial repository Q of previously issued user queries (in the example: computer, laptop, touchpad).
– When a new query u is issued (in the example: ThinkPad), compute the kernel function K(u, q_i) for all q_i in Q and suggest the queries q_i with the top kernel scores, using a simple algorithm.
– In the example, assume all of these have K < 0.5.

Sample Application: Query Suggestion System (cont.)
Implementation (figure taken from the paper), with MAX = maximum number of suggestions and MAX = 2 assumed in the example:
1. Compute K(u, q_j) for every candidate q_j in Q (computer, laptop, touchpad).
2. Sort the candidates by K score to form the list L.
3. Starting from an empty suggestion set Z, walk through L with index j, up to k candidates. Each q_j is checked against an entry z using the term counts |q_j|, |q_j ∩ z| and |z|. In the example |q_j| = 1, |q_j ∩ z| = 0 and |z| = 1, so the candidate is added to Z.
4. Z grows to {laptop} and then to {laptop, touchpad}. The loop stops when size(Z) = MAX.
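The walkthrough can be sketched as a small filter loop. Two things here are assumptions rather than content of the slides: the kernel scores are made-up numbers chosen only to fix an ordering, and the novelty test (a candidate is kept when |q_j| − |q_j ∩ z| ≥ 0.5·|z| for the user query and every suggestion already chosen) is one plausible reading of the quantities |q_j|, |q_j ∩ z| and |z| shown in the figure.

```python
def terms(query):
    """Set of lowercase terms in a query."""
    return set(query.lower().split())

def is_novel(candidate, others):
    """Assumed filter: the candidate must differ enough from each query in `others`."""
    q = terms(candidate)
    return all(len(q) - len(q & terms(z)) >= 0.5 * len(terms(z)) for z in others)

def suggest(user_query, scores, max_suggestions=2):
    """Pick up to max_suggestions queries from `scores` (query -> kernel score)."""
    ranked = sorted(scores, key=scores.get, reverse=True)   # the sorted list L
    suggestions = []                                        # the set Z, in insertion order
    for candidate in ranked:                                # j walks over L
        if len(suggestions) >= max_suggestions:             # size(Z) = MAX
            break
        if is_novel(candidate, suggestions + [user_query]):
            suggestions.append(candidate)
    return suggestions

# Made-up kernel scores for illustration only (all below 0.5, as on the slide).
example_scores = {"computer": 0.30, "laptop": 0.45, "touchpad": 0.40}
print(suggest("ThinkPad", example_scores, max_suggestions=2))
# prints: ['laptop', 'touchpad']
```

The filter keeps the suggestion list from filling up with near-duplicates of the original query or of each other, and the MAX cap bounds how many suggestions are shown.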

Evaluation of Query Suggestion System
– 9 human evaluators were selected. Each was asked to issue queries from a different month of the Google Zeitgeist.
– A total of 118 user queries were given as input.
– A total of 379 queries were suggested.
– Average = 3.2 suggestions per query.
A 5-point scale was used for evaluation:
1. Suggestion is totally off topic
2. Suggestion is not as good as original query
3. Suggestion is basically same as original query
4. Suggestion is potentially better than original query
5. Suggestion is fantastic

Evaluation of Query Suggestion System (cont.)
Results: table taken from the paper.
Graphs taken from the paper: average ratings at various kernel thresholds, and average ratings vs. average number of suggestions per query.

Conclusions
– The authors have proposed a new similarity kernel function and argued, both theoretically and empirically, that it works better than traditional similarity functions on short text snippets.
– The query suggestion system built on this kernel function gave good results in the user evaluation.

Future work
– Improvements in query expansion
– Improving the match score
– Incorporating this new kernel function into various other machine learning methods
– Building useful applications such as a question answering system

References
– Wordhoard
– Code project
– L2 normalization
– Normal distribution
– Latent semantic indexing
– TFIDF
– Google Suggest
– Google Zeitgeist

Discussion
– Do the kernel scores change if we use different search engines?
– How can we make sure that the suggested queries are the correct ones? Search engines give popular results, not necessarily correct results.
