
Query Operations J. H. Wang Mar. 26, 2008

The Retrieval Process
[Architecture diagram: the user interface passes the user need through text operations and query operations; indexing, searching, and ranking work against the index (an inverted file) and the text database via the DB manager module; ranked and retrieved documents flow back to the user, and user feedback flows back into query operations.]

Query Modification
Improving the initial query formulation:
– Relevance feedback: approaches based on feedback information from users
– Local analysis: approaches based on information derived from the set of documents initially retrieved (called the local set of documents)
– Global analysis: approaches based on global information derived from the whole document collection

Relevance Feedback
The relevance feedback process:
– shields the user from the details of the query reformulation process
– breaks the whole searching task into a sequence of small steps which are easier to grasp
– provides a controlled process designed to emphasize some terms and de-emphasize others
Two basic techniques:
– Query expansion: addition of new terms taken from relevant documents
– Term reweighting: modification of term weights based on the user's relevance judgements

Vector Space Model
Definitions:
– w_{i,j}: the weight of the i-th term in the vector for document d_j
– w_{i,k}: the weight of the i-th term in the vector for query q_k
– t: the number of unique terms in the data set
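
Ranking in the vector model reduces to the cosine of the angle between these vectors. A minimal sketch in Python (the function name and the toy vectors are illustrative, not from the slides):

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between a document vector (the w_{i,j}) and a
    query vector (the w_{i,k}), both of length t (0.0 where a term is absent)."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy collection with t = 3 unique terms
d1 = [1.0, 2.0, 0.0]   # document vector
q = [1.0, 0.0, 0.0]    # query vector
score = cosine_similarity(d1, q)
```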

Query Expansion and Term Reweighting for the Vector Model
Ideal situation:
– C_R: the set of relevant documents among all documents in the collection
Rocchio (1965, 1971):
– R: the set of relevant documents, as identified by the user among the retrieved documents
– S: the set of non-relevant documents among the retrieved documents

Rocchio's Algorithm
  q_m = α·q + (β/|R|) Σ_{d_j ∈ R} d_j − (γ/|S|) Σ_{d_j ∈ S} d_j
Ide_Regular (1971): drops the normalization by |R| and |S|
  q_m = α·q + β Σ_{d_j ∈ R} d_j − γ Σ_{d_j ∈ S} d_j
Ide_Dec_Hi: subtracts only the highest-ranked non-relevant document
  q_m = α·q + β Σ_{d_j ∈ R} d_j − γ·(highest-ranked non-relevant d_j)
Parameters:
– α = β = γ = 1
– β > γ (positive feedback is usually more valuable than negative feedback)
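
The Rocchio update translates directly to code. In this sketch the default α, β, γ values (1.0, 0.75, 0.15) and the clipping of negative weights to zero are common practical choices, not values given in the slides:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio reformulation:
    q_m = alpha*q + (beta/|R|)*sum(R) - (gamma/|S|)*sum(S)

    query:       original query vector (length t)
    relevant:    vectors of documents the user judged relevant (R)
    nonrelevant: vectors of documents the user judged non-relevant (S)
    """
    new_q = [alpha * w for w in query]
    for d in relevant:
        for i, w in enumerate(d):
            new_q[i] += beta * w / len(relevant)
    for d in nonrelevant:
        for i, w in enumerate(d):
            new_q[i] -= gamma * w / len(nonrelevant)
    # Negative weights are usually clipped to zero
    return [max(w, 0.0) for w in new_q]

# One relevant and one non-relevant document over t = 3 terms
q_new = rocchio([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]])
```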

Probabilistic Model
Definitions:
– p_i: the probability of observing term t_i in the set of relevant documents
– q_i: the probability of observing term t_i in the set of non-relevant documents
Initial search assumptions:
– p_i is constant for all terms t_i (typically 0.5)
– q_i can be approximated by the distribution of t_i in the whole collection

Term Reweighting for the Probabilistic Model
Robertson and Sparck Jones (1976), with relevance feedback from the user:
– N: the number of documents in the collection
– R: the number of relevant documents for query q
– n_i: the number of documents having term t_i
– r_i: the number of relevant documents having term t_i

Contingency table (document relevance vs. document indexing):

                 Relevant    Non-relevant       Total
  t_i present    r_i         n_i − r_i          n_i
  t_i absent     R − r_i     N − n_i − R + r_i  N − n_i
  Total          R           N − R              N

Term Reweighting for the Probabilistic Model (cont.)
Initial search assumptions:
– p_i is constant for all terms t_i (typically 0.5)
– q_i can be approximated by the distribution of t_i in the whole collection: q_i = n_i / N
With relevance feedback from users, p_i and q_i can be approximated by
  p_i = r_i / R        q_i = (n_i − r_i) / (N − R)
hence the term weight is updated by
  w_i = log [ p_i (1 − q_i) / (q_i (1 − p_i)) ]

Term Reweighting for the Probabilistic Model (cont.)
However, the last formula poses problems for certain small values of R and r_i (e.g., R = 1, r_i = 0)
A standard fix is to add an adjustment factor of 0.5 to the estimates:
  p_i = (r_i + 0.5) / (R + 1)        q_i = (n_i − r_i + 0.5) / (N − R + 1)
Instead of 0.5, alternative adjustment factors (e.g., n_i / N) have also been proposed
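
A sketch of the adjusted estimates in Python; the 0.5 smoothing is the classic Robertson-Sparck Jones adjustment, and the function name is illustrative:

```python
import math

def rsj_weight(N, R, n_i, r_i):
    """Robertson / Sparck Jones term weight with 0.5 adjustment factors,
    which keep the estimates defined even when R or r_i is very small.

    N   : documents in the collection
    R   : relevant documents for the query
    n_i : documents containing term t_i
    r_i : relevant documents containing term t_i
    """
    p_i = (r_i + 0.5) / (R + 1.0)            # estimate of P(term | relevant)
    q_i = (n_i - r_i + 0.5) / (N - R + 1.0)  # estimate of P(term | non-relevant)
    return math.log((p_i * (1.0 - q_i)) / (q_i * (1.0 - p_i)))

# A term that occurs in 8 of 10 relevant documents but only 20 of 100 overall
w = rsj_weight(N=100, R=10, n_i=20, r_i=8)
```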

Term Reweighting for the Probabilistic Model (cont.)
Characteristics:
– Advantage: the term reweighting is optimal under the assumptions of term independence and binary document indexing (w_{i,q} ∈ {0,1} and w_{i,j} ∈ {0,1})
– Disadvantages:
  – no query expansion is used
  – weights of terms in the previous query formulations are disregarded
  – document term weights are not taken into account during the feedback loop

Evaluation of Relevance Feedback
The standard evaluation method (i.e., recall-precision) is not suitable, because the relevant documents used to reweight the query terms are moved to higher ranks
The residual collection method:
– evaluate on the set of all documents minus the set of feedback documents provided by the user
– because highly ranked documents are removed from the collection, the recall-precision figures for the modified query tend to be lower than the figures for the original query
– as a basic rule of thumb, any experimentation involving relevance feedback strategies should evaluate recall-precision figures relative to the residual collection
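
The residual-collection rule is easy to operationalize. A hedged sketch (the document IDs and the choice of precision-at-k as the metric are illustrative):

```python
def residual_precision_at_k(ranking, relevant, feedback_docs, k):
    """Precision at rank k over the residual collection: documents already
    shown to the user during feedback are removed from both the ranking
    and the relevant set before measuring."""
    residual_ranking = [d for d in ranking if d not in feedback_docs]
    residual_relevant = relevant - feedback_docs
    top = residual_ranking[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in residual_relevant) / len(top)

# 'd1' was judged during feedback, so it is excluded from evaluation
p_at_2 = residual_precision_at_k(['d1', 'd2', 'd3', 'd4'],
                                 {'d1', 'd3'}, {'d1'}, k=2)
```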

Automatic Strategies
In relevance feedback, the user separates the documents into two classes: relevant vs. non-relevant
– An underlying notion of clustering supports the feedback strategy
– Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents
– This can also be done automatically

Automatic Strategies
Two types of strategies:
– Global: all documents in the collection are used to determine a global thesaurus-like structure which defines term relationships
– Local: the documents retrieved for a given query are examined at query time to determine terms for query expansion
  – local clustering (Attar and Fraenkel, 1977)
  – local context analysis (Xu and Croft, 1996)

Automatic Local Analysis
Definitions:
– local document set D_l: the set of documents retrieved by a query
– local vocabulary V_l: the set of all distinct words in D_l
– stemmed vocabulary S_l: the set of all distinct stems derived from V_l
Local feedback strategies are based on expanding the query with terms correlated to the query terms
– such terms are those present in local clusters built from the local document set
Building local clusters:
– association clusters
– metric clusters
– scalar clusters

Association Clusters
Idea: co-occurrence of stems (or terms) inside documents
– f_{u,j}: the frequency of a stem k_u in a document d_j
– correlation between stems k_u and k_v:
    c_{u,v} = Σ_{d_j ∈ D_l} f_{u,j} · f_{v,j}
– normalized form:
    s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} − c_{u,v})
– local association cluster for a stem k_u: the stems k_v with the largest values of c_{u,v}
– given a query q, find clusters for the |q| query terms
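
The co-occurrence correlation and its normalized form can be sketched directly; the toy frequency matrix below is illustrative:

```python
def association_matrix(freq):
    """freq[u][j] = frequency f_{u,j} of stem u in document j.
    Returns c with c[u][v] = sum_j f_{u,j} * f_{v,j}."""
    n = len(freq)
    return [[sum(fu * fv for fu, fv in zip(freq[u], freq[v]))
             for v in range(n)] for u in range(n)]

def normalized_association(c, u, v):
    """s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v}), in [0, 1]."""
    return c[u][v] / (c[u][u] + c[v][v] - c[u][v])

# Three stems over three local documents
freq = [[2, 1, 0],   # stem 0
        [1, 1, 0],   # stem 1: co-occurs with stem 0
        [0, 0, 3]]   # stem 2: never co-occurs with stems 0 or 1
c = association_matrix(freq)
```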

Metric Clusters
Idea: also consider the distance between two terms in the same document when computing their correlation
Definitions:
– V(k_u): the set of keywords which have the same stem form as k_u
– r(k_i, k_j): the number of words between the occurrences of k_i and k_j
– correlation:
    c_{u,v} = Σ_{k_i ∈ V(k_u)} Σ_{k_j ∈ V(k_v)} 1 / r(k_i, k_j)
– normalized form:
    s_{u,v} = c_{u,v} / (|V(k_u)| · |V(k_v)|)
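
A sketch of the inverse-distance correlation for a single document; positions are word offsets, both function names are illustrative, and the normalization here uses occurrence counts as a stand-in for |V(k_u)| and |V(k_v)|:

```python
def metric_correlation(positions_u, positions_v):
    """Sum of inverse distances 1/r(k_i, k_j) over all occurrence pairs
    of two stems in one document: nearby occurrences contribute more."""
    return sum(1.0 / abs(i - j)
               for i in positions_u for j in positions_v if i != j)

def normalized_metric_correlation(positions_u, positions_v):
    """Normalized form: divide by the product of the occurrence counts
    (approximating |V(k_u)| * |V(k_v)|)."""
    return (metric_correlation(positions_u, positions_v)
            / (len(positions_u) * len(positions_v)))

# Stem u occurs at word offsets 1 and 10, stem v at offset 2
c_uv = metric_correlation([1, 10], [2])   # 1/1 + 1/8
```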

Scalar Clusters
Idea: two stems with similar neighborhoods have some synonymy relationship
Definitions:
– c_{u,v} = c(k_u, k_v), the correlation between stems k_u and k_v
– vector of correlation values for stem k_u: s_u = (c_{u,1}, c_{u,2}, …, c_{u,n})
– scalar association matrix:
    s_{u,v} = (s_u · s_v) / (|s_u| · |s_v|)
– scalar cluster for k_u: the stems k_v with the largest values of s_{u,v}
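
The scalar association is just a cosine between rows of the correlation matrix. A minimal sketch (the matrix values are made up for illustration):

```python
import math

def scalar_association(c, u, v):
    """Cosine between the correlation rows s_u and s_v of stems u and v:
    stems whose neighborhoods of correlated terms look alike score high,
    suggesting a synonymy relationship."""
    dot = sum(a * b for a, b in zip(c[u], c[v]))
    nu = math.sqrt(sum(a * a for a in c[u]))
    nv = math.sqrt(sum(b * b for b in c[v]))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

# Correlation matrix from a hypothetical association-cluster step
c = [[5.0, 3.0, 0.0],
     [3.0, 2.0, 0.0],
     [0.0, 0.0, 9.0]]
s_01 = scalar_association(c, 0, 1)  # similar neighborhoods
s_02 = scalar_association(c, 0, 2)  # disjoint neighborhoods
```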

Automatic Global Analysis
Builds a thesaurus-like structure from the whole collection
Short history:
– Until the beginning of the 1990s, global analysis was considered to be a technique which failed to yield consistent improvements in retrieval performance with general collections
– This perception has changed with the appearance of modern procedures for global analysis

Query Expansion Based on a Similarity Thesaurus
Idea, by Qiu and Frei (1993):
– the similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence
– terms for expansion are selected based on their similarity to the whole query rather than on their similarities to individual query terms
Definitions:
– N: total number of documents in the collection
– t: total number of terms in the collection
– tf_{i,j}: occurrence frequency of term k_i in document d_j
– t_j: the number of distinct index terms in document d_j
– itf_j: the inverse term frequency for document d_j, itf_j = log(t / t_j)

Similarity Thesaurus
Each term k_u is associated with a vector over the documents:
    k_u = (w_{u,1}, w_{u,2}, …, w_{u,N})
where w_{u,j} is a weight associated with the index-document pair [k_u, d_j]
The relationship between two terms k_u and k_v is
    c_{u,v} = k_u · k_v = Σ_{d_j} w_{u,j} · w_{v,j}
Note that this is a variation of the correlation measure used for computing scalar association matrices
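
A simplified sketch of building the term vectors and comparing them. The full Qiu-Frei weighting also folds in a maximum-frequency normalization, which is omitted here; this version uses a plain tf × itf weight:

```python
import math

def term_vectors(tf, itf):
    """tf[u][j]: frequency of term u in document j; itf[j]: inverse term
    frequency of document j.  Each term becomes a unit vector over the
    document dimension (simplified w_{u,j} = tf * itf weighting)."""
    vecs = []
    for row in tf:
        w = [f * it for f, it in zip(row, itf)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        vecs.append([x / norm for x in w])
    return vecs

def term_similarity(vecs, u, v):
    """c_{u,v} = k_u . k_v: dot product of the normalized term vectors."""
    return sum(a * b for a, b in zip(vecs[u], vecs[v]))

tf = [[2.0, 0.0],    # term 0 appears only in document 0
      [1.0, 0.0],    # term 1 appears only in document 0
      [0.0, 3.0]]    # term 2 appears only in document 1
vecs = term_vectors(tf, itf=[1.0, 1.0])
```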

Term Weighting vs. Term Concept Space
[Figure: two views of the frequency tf_{i,j}; in the usual term-weighting view a document d_j is a vector over terms k_i, while in the term concept space a term k_i is a vector over documents d_j]

Query Expansion Procedure with a Similarity Thesaurus
1. Represent the query in the concept space using the representation of the index terms
2. Compute the similarity sim(q, k_v) between each term k_v and the whole query q
3. Expand the query with the top r ranked terms according to sim(q, k_v)
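
The three steps above can be sketched end-to-end; the terms and similarity values in `term_sim` are made up for illustration:

```python
def expand_query(query_weights, term_sim, r):
    """Score every candidate term against the whole query,
    sim(q, k_v) = sum_u w_u * sim(k_u, k_v), and return the top r
    terms not already present in the query."""
    scores = {}
    for v in term_sim:
        if v in query_weights:
            continue
        scores[v] = sum(w * term_sim[u].get(v, 0.0)
                        for u, w in query_weights.items())
    return sorted(scores, key=scores.get, reverse=True)[:r]

# Hypothetical similarity-thesaurus entries
term_sim = {
    'car':  {'car': 1.0, 'auto': 0.8, 'bank': 0.1},
    'auto': {'car': 0.8, 'auto': 1.0, 'bank': 0.1},
    'bank': {'car': 0.1, 'auto': 0.1, 'bank': 1.0},
}
expanded = expand_query({'car': 1.0}, term_sim, r=1)
```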

Example of a Similarity Thesaurus
The distance of a given term k_v to the query centroid Q_C might be quite distinct from the distances of k_v to the individual query terms
[Figure: terms k_a, k_b, k_i, k_j, k_v plotted around the centroid Q_C of the query Q_C = {k_a, k_b}]

Query Expansion Based on a Similarity Thesaurus (cont.)
– A document d_j is represented in the term-concept space by
    d_j = Σ_{k_v ∈ d_j} w_{v,j} · k_v
– If the original query q is expanded to include all t index terms, then the similarity sim(q, d_j) between the document d_j and the query q can be computed as
    sim(q, d_j) ∝ Σ_{k_v ∈ d_j} Σ_{k_u ∈ q} w_{v,j} · w_{u,q} · (k_u · k_v)
which is similar to the generalized vector space model

Query Expansion Based on a Statistical Thesaurus
Idea, by Crouch and Yang (1992):
– use the complete-link algorithm to produce small and tight clusters
– use the term discrimination value to select terms for entry into a particular thesaurus class
Term discrimination value:
– a measure of the change in space separation which occurs when a given term is assigned to the document collection

Term Discrimination Value
Terms:
– good discriminators (positive discrimination values): used directly as index terms
– indifferent discriminators (near-zero discrimination values): grouped into thesaurus classes
– poor discriminators (negative discrimination values): transformed into term phrases
Document frequency df_k (n = number of documents):
– df_k > n/10: high-frequency term (poor discriminator)
– df_k < n/100: low-frequency term (indifferent discriminator)
– n/100 ≤ df_k ≤ n/10: good discriminator
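
The document-frequency rule of thumb maps directly to code (the function name and string labels are illustrative):

```python
def classify_by_df(df_k, n):
    """Classify a term by its document frequency df_k in a collection of
    n documents, following the n/10 and n/100 rule of thumb."""
    if df_k > n / 10:
        return 'high'    # high-frequency term: poor discriminator
    if df_k < n / 100:
        return 'low'     # low-frequency term: indifferent discriminator
    return 'good'        # mid-frequency term: good discriminator

label = classify_by_df(50, n=1000)
```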

Statistical Thesaurus
Term discrimination value theory:
– the terms which make up a thesaurus class must be indifferent discriminators
The proposed approach:
– cluster the document collection into small, tight clusters
– define a thesaurus class as the intersection of all the low-frequency terms in a cluster
– index documents by the thesaurus classes
– assign weights to the thesaurus classes themselves

Discussion
Query expansion is a useful but little-explored technique
Trends and research issues:
– the combination of local analysis, global analysis, visual displays, and interactive interfaces is a current and important research problem