
Motivation
• Methods of local analysis extract information from the local set of retrieved documents in order to expand the query
• An alternative is to expand the query using information from the whole set of documents in the collection
• Until the beginning of the 1990s, these techniques failed to yield consistent improvements in retrieval performance
• Now, with modern variants, sometimes based on a thesaurus, this perception has changed

Automatic Global Analysis
There are two modern variants, both based on a thesaurus-like structure built using all documents in the collection:
• Query Expansion based on a Similarity Thesaurus
• Query Expansion based on a Statistical Thesaurus

Similarity Thesaurus
• The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence
• These relationships are not derived directly from the co-occurrence of terms inside documents
• They are obtained by considering that the terms are concepts in a concept space
• In this concept space, each term is indexed by the documents in which it appears
• Terms assume the original role of documents, while documents are interpreted as indexing elements

Similarity Thesaurus
The following definitions establish the proper framework:
• t: number of terms in the collection
• N: number of documents in the collection
• fi,j: frequency of occurrence of the term ki in the document dj
• tj: vocabulary size of the document dj (number of distinct terms in dj)
• itfj: inverse term frequency for the document dj

Similarity Thesaurus
The inverse term frequency for the document dj is given by
itfj = log(t / tj)
To each term ki we associate a vector ki in the document space:
ki = (wi,1, wi,2, ..., wi,N)

Similarity Thesaurus
where wi,j is a weight associated with the index-document pair [ki,dj]. These weights are computed as
wi,j = ( (0.5 + 0.5 × fi,j / max(fi)) × itfj ) / sqrt( Σl (0.5 + 0.5 × fi,l / max(fi))² × itfl² )
where max(fi) denotes the maximum of fi,j over all documents dj, and the sum runs over all N documents.

Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor cu,v given by
cu,v = ku · kv = Σj wu,j × wv,j
The global similarity thesaurus is built by computing the correlation factor cu,v for each pair of index terms [ku,kv] in the collection.
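As a concrete illustration, here is a minimal Python sketch of this construction under the formulas above; the function name and the token-list representation of documents are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def build_similarity_thesaurus(docs):
    """Build the term-term correlation factors c(u,v) for a small
    in-memory collection given as a list of token lists."""
    N = len(docs)
    freqs = [Counter(d) for d in docs]                 # fi,j
    terms = sorted({t for d in docs for t in d})
    t = len(terms)                                     # t: number of terms
    itf = [math.log(t / len(f)) for f in freqs]        # itfj = log(t / tj)
    # max(fi): maximum frequency of the term ki over all documents
    maxf = {ki: max(f[ki] for f in freqs if ki in f) for ki in terms}

    # wi,j: damped tf factor scaled by itfj, normalized over all documents
    w = {}
    for ki in terms:
        raw = [(0.5 + 0.5 * f[ki] / maxf[ki]) * itf[j] if ki in f else 0.0
               for j, f in enumerate(freqs)]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        w[ki] = [x / norm for x in raw]

    # cu,v = ku . kv = sum over documents of wu,j * wv,j
    c = {(u, v): sum(a * b for a, b in zip(w[u], w[v]))
         for u in terms for v in terms}
    return terms, c
```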

Similarity Thesaurus  This computation is expensive  Global similarity thesaurus has to be computed only once and can be updated incrementally

Query Expansion based on a Similarity Thesaurus
Query expansion is done in three steps:
1. Represent the query in the concept space used for the representation of the index terms
2. Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q
3. Expand the query with the top r ranked terms according to sim(q,kv)

Query Expansion - step one
To the query q we associate a vector q in the term-concept space, given by
q = Σ(ki in q) wi,q × ki
where wi,q is a weight associated with the index-query pair [ki,q]

Query Expansion - step two
Compute a similarity sim(q,kv) between each term kv and the user query q:
sim(q,kv) = q · kv = Σ(ku in q) wu,q × cu,v
where cu,v is the correlation factor defined above

Query Expansion - step three
• Add the top r ranked terms according to sim(q,kv) to the original query q to form the expanded query q'
• Each expansion term kv in the expanded query q' is assigned a weight wv,q' given by
wv,q' = sim(q,kv) / Σ(ku in q) wu,q
• The expanded query q' is then used to retrieve new documents for the user
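A minimal sketch of the three steps, continuing the assumptions of the previous snippet; since the slides leave the query weights wi,q unspecified, raw query term counts are used for them here.

```python
from collections import Counter

def expand_query(query_terms, terms, c, r=3):
    """Expand a query using the correlation factors from
    build_similarity_thesaurus; returns {new term: wv,q'}."""
    # Step one: the query as a vector of term weights wi,q
    # (assumed here to be raw counts)
    q = Counter(query_terms)
    # Step two: sim(q,kv) = sum over query terms ku of wu,q * cu,v
    sims = {v: sum(q[u] * c[(u, v)] for u in q) for v in terms}
    # Step three: keep the top r ranked terms not already in the query,
    # each weighted by wv,q' = sim(q,kv) / sum of wu,q
    new_terms = sorted((v for v in sims if v not in q),
                       key=sims.get, reverse=True)[:r]
    denom = sum(q.values())
    return {v: sims[v] / denom for v in new_terms}
```

For the sample collection on the next slides, building the thesaurus and then calling expand_query(["A", "E", "E"], terms, c) mirrors the flow of the example; the exact weights depend on the choice of wi,q.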

Query Expansion Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
From these documents the correlation factors c(A,A), c(A,C), c(A,D), c(D,E), c(B,E), and c(E,E) are computed as above

Query Expansion Sample
Query: q = A E E
The similarities sim(q,A), sim(q,C), sim(q,D), sim(q,B), and sim(q,E) are computed as in step two
New query: q' = A C D E E
w(A,q') = 6.88
w(C,q') = 6.75
w(D,q') = 6.75
w(E,q') = 6.64

Query Expansion Based on a Global Statistical Thesaurus
• The global thesaurus is composed of classes which group correlated terms in the context of the whole collection
• Such correlated terms can then be used to expand the original user query
• These terms must be low frequency terms
• However, it is difficult to cluster low frequency terms directly
• To circumvent this problem, we cluster documents into classes instead, and use the low frequency terms in these documents to define our thesaurus classes
• The clustering algorithm must produce small and tight clusters

Complete link algorithm
This is a document clustering algorithm that produces small and tight clusters:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu,Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
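A short Python sketch of this algorithm, assuming documents are given as token lists and sim is any pairwise document similarity; the names and the stop criterion (a target number of clusters) are illustrative choices.

```python
def complete_link(docs, sim, stop_at=1):
    """Complete-link clustering: inter-cluster similarity is the MINIMUM
    similarity over all cross-cluster document pairs."""
    clusters = [[d] for d in docs]            # step 1: one document each
    history = []                              # the merge hierarchy
    while len(clusters) > stop_at:            # step 5: stop criterion
        best, pair = None, None
        for i in range(len(clusters)):        # steps 2-3: most similar pair
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best:
                    best, pair = s, (i, j)
        i, j = pair
        history.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]    # step 4: merge Cu and Cv
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, history                  # step 6: the hierarchy
```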

Selecting the terms that compose each class
Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows. Obtain from the user three parameters:
• TC: threshold class
• NDC: number of documents in a class
• MIDF: minimum inverse document frequency

Selecting the terms that compose each class  Use the parameter TC as threshold value for determining the document clusters that will be used to generate thesaurus classes  This threshold has to be surpassed by sim(Cu,Cv) if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class

Selecting the terms that compose each class  Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered.  A low value of NDC might restrict the selection to the smaller cluster Cu+v

Selecting the terms that compose each class  Consider the set of documents in each document cluster as pre-selected above.  Only the lower frequency documents are used as sources of terms for the thesaurus classes  The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class

Query Expansion based on a Statistical Thesaurus  Use the thesaurus class to query expansion.  Compute an average term weight wtc for each thesaurus class C

Query Expansion based on a Statistical Thesaurus
wtC can be used to compute a thesaurus class weight wC as
wC = (wtC / |C|) × 0.5
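A small sketch of these two computations, following the formulas above and assuming the term-class weights wi,C have been precomputed.

```python
def class_weights(class_terms, w):
    """Average term weight wtC and class weight wC for a thesaurus
    class C (w maps each term ki to its weight wi,C)."""
    size = len(class_terms)
    wtc = sum(w[t] for t in class_terms) / size   # average term weight
    wc = (wtc / size) * 0.5                       # penalizes large classes
    return wtc, wc
```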

Query Expansion Sample
TC = 0.90, NDC = 2.00, MIDF = 0.2
sim(1,3) = 0.99
sim(1,2) = 0.40
sim(2,3) = 0.29
sim(4,1) = 0.00
sim(4,2) = 0.00
sim(4,3) = 0.00
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
idf(A) = 0.0
idf(B) = 0.3
idf(C) = 0.12
idf(D) = 0.12
idf(E) = 0.60
Query: q = A E E
Expanded query: q' = A B E E

Query Expansion based on a Statistical Thesaurus
Problems with this approach:
• Initialization of the parameters TC, NDC and MIDF
• TC depends on the collection
• Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC
• A high value of TC might yield classes with too few terms

Conclusions
• A thesaurus is an efficient method for expanding queries
• The computation is expensive, but it is executed only once
• Query expansion based on a similarity thesaurus may use high frequency terms to expand the query
• Query expansion based on a statistical thesaurus needs well defined parameters