Query Operations: Automatic Global Analysis

Motivation Methods of local analysis extract information from the local set of documents retrieved in order to expand the query. An alternative is to expand the query using information from the whole set of documents in the collection. Until the beginning of the 1990s these techniques failed to yield consistent improvements in retrieval performance. Now, with modern variants, sometimes based on a thesaurus, this perception has changed.

Automatic Global Analysis There are two modern variants based on a thesaurus-like structure built using all documents in the collection: –Query Expansion based on a Similarity Thesaurus –Query Expansion based on a Statistical Thesaurus

Similarity Thesaurus The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. –These relationships are not derived directly from co-occurrence of terms inside documents. –They are obtained by considering that the terms are concepts in a concept space. –In this concept space, each term is indexed by the documents in which it appears. Terms assume the original role of documents, while documents are interpreted as indexing elements.

Similarity Thesaurus vs. Vector Model The frequency factor: –In the vector model: f(i,j) = freq(term k_i in doc d_j) / freq(most frequent term in d_j) –In the similarity thesaurus: f(i,j) = freq(term k_i in doc d_j) / freq(k_i in the doc where k_i appears most often) The inverse frequency factor: –In the vector model: idf(i) = log(# of docs in collection / # of docs with term k_i) –In the similarity thesaurus: itf(j) = log(# of terms in collection / # of terms in doc d_j)
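The contrast between the two normalization schemes can be sketched in Python; the toy term frequencies below are hypothetical and exist only to make the difference concrete.

```python
from math import log

# Hypothetical toy collection: document -> {term: raw frequency}.
docs = {
    "d1": {"query": 3, "expansion": 1, "retrieval": 2},
    "d2": {"query": 1, "thesaurus": 4},
}

def f_vector_model(term, doc):
    """Vector model: tf normalized by the most frequent term in the same document."""
    return docs[doc].get(term, 0) / max(docs[doc].values())

def f_sim_thesaurus(term, doc):
    """Similarity thesaurus: tf normalized by the term's peak frequency over all docs."""
    peak = max(d.get(term, 0) for d in docs.values())
    return docs[doc].get(term, 0) / peak

def itf(doc):
    """Inverse term frequency: log(# terms in collection / # terms in doc)."""
    vocab = {t for d in docs.values() for t in d}
    return log(len(vocab) / len(docs[doc]))
```

Note that `f_sim_thesaurus` normalizes per term (across documents) while `f_vector_model` normalizes per document (across terms), which is exactly the role reversal the slide describes.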

Similarity Thesaurus Definitions: –t: number of terms in the collection –N: number of documents in the collection –f_i,j: frequency of occurrence of the term k_i in the document d_j –t_j: number of distinct terms (vocabulary) of document d_j –itf_j: inverse term frequency for document d_j Inverse term frequency for document d_j: itf_j = log(t / t_j) Each term k_i is represented by a vector k_i = (w_i,1, w_i,2, ..., w_i,N), where w_i,j is a weight associated with the term-document pair [k_i, d_j]: w_i,j = (0.5 + 0.5 · f_i,j / maxf(k_i)) · itf_j / sqrt( Σ_l (0.5 + 0.5 · f_i,l / maxf(k_i))² · itf_l² ), where maxf(k_i) is the maximum frequency of k_i over all documents and the sum runs over the documents in which k_i occurs.
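A minimal sketch of this term-vector weighting, assuming the (0.5 + 0.5·tf) normalization reconstructed above; the frequencies are made up for illustration.

```python
from math import log, sqrt

# Hypothetical raw frequencies: document -> {term: count}.
freqs = {
    "d1": {"query": 2, "expansion": 1},
    "d2": {"query": 1, "thesaurus": 3},
}
t_total = len({t for d in freqs.values() for t in d})  # t: terms in the collection

def itf(doc):
    # Inverse term frequency of a document: log(t / t_j).
    return log(t_total / len(freqs[doc]))

def term_vector(term):
    # Unit-length vector indexing the term by the documents it occurs in.
    maxf = max(d.get(term, 0) for d in freqs.values())
    raw = {doc: (0.5 + 0.5 * d[term] / maxf) * itf(doc)
           for doc, d in freqs.items() if term in d}
    norm = sqrt(sum(w * w for w in raw.values()))
    return {doc: w / norm for doc, w in raw.items()}
```

The normalization makes every term vector unit length, so the correlation factors computed from them behave like cosine similarities.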

Similarity Thesaurus The relationship between two terms k_u and k_v is computed as a correlation factor c_u,v given by the scalar product of their vectors: c_u,v = k_u · k_v = Σ_j w_u,j · w_v,j The global similarity thesaurus is built through the computation of the correlation factor c_u,v for each pair of indexing terms [k_u, k_v] in the collection. The computation is expensive, but it only has to be done once and can be updated incrementally.
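The correlation computation can be sketched directly from unit-length term vectors; the numbers below are illustrative, not derived from a real collection.

```python
# Hypothetical document-indexed, unit-length term vectors.
term_vecs = {
    "query":     {"d1": 0.8, "d2": 0.6},
    "expansion": {"d1": 1.0},
    "thesaurus": {"d2": 1.0},
}

def correlation(u, v):
    """c_{u,v}: scalar product of the two terms' document vectors."""
    common = term_vecs[u].keys() & term_vecs[v].keys()
    return sum(term_vecs[u][d] * term_vecs[v][d] for d in common)

# Build the global thesaurus once; entries can be updated incrementally later.
thesaurus = {(u, v): correlation(u, v)
             for u in term_vecs for v in term_vecs}
```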

Query Expansion based on a Similarity Thesaurus Query expansion is done in three steps as follows: 1 Represent the query in the concept space used for representation of the index terms 2 Based on the global similarity thesaurus, compute a similarity sim(q,k_v) between each term k_v correlated to the query terms and the whole query q. 3 Expand the query with the top r ranked terms according to sim(q,k_v)
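The three steps above can be sketched as follows; the correlation factors, the query weights, and the cutoff r are all hypothetical.

```python
# Hypothetical correlation factors c_{u,v} from a precomputed similarity thesaurus.
c = {
    ("query", "search"): 0.9,
    ("query", "retrieval"): 0.7,
    ("query", "banana"): 0.1,
}

def expand(query_weights, r=2):
    """Score each candidate term against the whole query, keep the top r."""
    scores = {}
    for (u, v), cuv in c.items():
        if u in query_weights:                       # u is a query term
            scores[v] = scores.get(v, 0.0) + query_weights[u] * cuv
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:r]                                # top r expansion terms
```

For example, expanding the single-term query {"query": 1.0} with r=2 keeps the two most correlated candidates and drops the weakly correlated one.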

Statistical Thesaurus The global thesaurus is composed of classes that group terms which are correlated in the context of the whole collection. –Such correlated terms can then be used to expand the original user query –These terms must be low frequency terms –However, it is difficult to cluster low frequency terms –To circumvent this problem, we cluster documents into classes instead and use the low frequency terms in these documents to define our thesaurus classes. –This algorithm must produce small and tight clusters.

Complete Link Algorithm Document clustering algorithm: 1. Place each document in a distinct cluster. 2. Compute the similarity between all pairs of clusters. 3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity. 4. Merge the clusters C_u and C_v. 5. Verify a stop criterion. If this criterion is not met, go back to step 2. 6. Return a hierarchy of clusters. Similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents. –Use of the minimum ensures small, focussed clusters
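The six steps translate almost directly into code. This sketch assumes a similarity threshold as the stop criterion and uses toy one-dimensional "documents"; a full implementation would also record the merge history to return the hierarchy rather than the flat partition shown here.

```python
def complete_link(docs, sim, threshold):
    """Agglomerative clustering; cluster similarity = min pairwise doc similarity."""
    clusters = [[d] for d in docs]              # step 1: one cluster per document
    while len(clusters) > 1:
        best = None
        # steps 2-3: find the most similar pair under the complete-link rule
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < threshold:                       # step 5: stop criterion
            break
        clusters[i] = clusters[i] + clusters[j] # step 4: merge C_u and C_v
        del clusters[j]
    return clusters                             # step 6 (flat partition)
```

Taking the minimum pairwise similarity means a merge only happens when every cross-pair is similar, which is what keeps the resulting clusters small and tight.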

Generating the Thesaurus Given the document cluster hierarchy for the whole collection: –Which clusters become classes? –Which terms represent classes? Answers are based on three parameters specified by the operator according to the characteristics of the collection: –TC: threshold class similarity –NDC: number of documents in a class –MIDF: minimum inverse document frequency

Selecting Thesaurus Classes TC is the minimum similarity between two subclusters for the parent to be considered a class. –A high value makes classes smaller and more focussed. NDC is an upper limit on the size of clusters. –A low value of NDC restricts the selection to smaller, more focussed clusters

Picking Terms for Each Class Consider the set of documents in each class selected above Only the lower frequency terms are used for the thesaurus classes The parameter MIDF defines the minimum inverse document frequency for any term which is selected to participate in a thesaurus class

Query Expansion with Statistical Thesaurus For each thesaurus class C: –Compute an average term weight wt_C: wt_C = ( Σ_{k_i ∈ C} w_i,C ) / |C| –Compute the thesaurus class weight w_C: w_C = wt_C / (0.5 · |C|) where |C| is the number of terms in the thesaurus class and w_i,C is a precomputed weight associated with the term-class pair [k_i, C]
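Numerically, and assuming the class-weight normalization reconstructed above, the two weights for a hypothetical three-term class work out as:

```python
# Hypothetical precomputed term-class weights w_{i,C} for one thesaurus class C.
w_iC = {"bank": 0.8, "credit": 0.6, "loan": 0.4}

size = len(w_iC)                    # |C|, number of terms in the class
wt_C = sum(w_iC.values()) / size    # average term weight
w_C = wt_C / (0.5 * size)           # class weight (assumed normalization)
```

Dividing by 0.5·|C| makes larger classes contribute proportionally less, so a broad, loosely related class cannot dominate the expanded query.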

Initializing TC, NDC, and MIDF TC depends on the collection –Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC. –A high value of TC might yield classes with too few terms –A low value of TC yields too few classes NDC is easier to set once TC is set MIDF can be difficult to set

Conclusions An automatically generated thesaurus is an effective method to expand queries. Thesaurus generation is expensive, but it is executed only once. Query expansion based on a similarity thesaurus uses term frequencies to expand the query. Query expansion based on a statistical thesaurus needs well-defined parameters.