Special Topics in Computer Science The Art of Information Retrieval Chapter 5: Query Operations Alexander Gelbukh www.Gelbukh.com.

Special Topics in Computer Science The Art of Information Retrieval Chapter 5: Query Operations Alexander Gelbukh

2 Previous chapter: Conclusions
Query languages (width-wise):
o words, phrases, proximity, fuzzy Boolean, natural language
Query languages (depth-wise):
o pattern matching
If queries return sets, they can be combined using the Boolean model
Combining with structure:
o hierarchical structure
Standardized low-level languages: protocols
o reusable

3 Previous chapter: Trends and research topics
Models: to better understand the user's needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: a library shown on the screen. The user acts: takes books, opens catalogs, etc.
o Better Boolean queries: "I need books by Cervantes AND Lope de Vega"?!

4 Query operations
Users have difficulty formulating queries
The program improves the query
o interactive mode: using the user's feedback
o using info from the retrieved set
o using linguistic information or information from the collection
Query expansion
o add new terms
Term re-weighting
o modify weights

5 1st method: User relevance feedback
User examines the top 10 (20) docs and marks the relevant ones
The system uses this to construct a new query
o moved toward the relevant docs
o away from the irrelevant ones
Good: simplicity
Note: throughout the chapter, the correct spelling is Rocchio

6 User relevance feedback: Vector Space Model
Best vector to distinguish good from bad docs: average of the relevant docs minus average of the irrelevant ones:
q_opt = (1 / |Cr|) Σ_{dj ∈ Cr} dj − (1 / (N − |Cr|)) Σ_{dj ∉ Cr} dj
where Cr is the set of relevant docs in the collection and N the number of docs

7 User relevance feedback: Vector Space Model
The standard Rocchio formula and its Ide variants give equally good results:
q_m = α q + (β / |Dr|) Σ_{dj ∈ Dr} dj − (γ / |Dn|) Σ_{dj ∈ Dn} dj
Original query gives important info: α > 0
Relevant docs give more info than irrelevant ones: β > γ
γ = 0: positive feedback
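A minimal, illustrative Python sketch of the Rocchio reformulation described above (not code from the slides or the book); it assumes the query and the documents are numpy vectors in a shared term space, and the alpha/beta/gamma defaults are just typical values.

import numpy as np

def rocchio(query, relevant_docs, irrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q + beta*centroid(Dr) - gamma*centroid(Dn)
    q_m = alpha * query
    if len(relevant_docs) > 0:
        q_m = q_m + beta * np.mean(relevant_docs, axis=0)
    if len(irrelevant_docs) > 0:      # gamma = 0 gives pure positive feedback
        q_m = q_m - gamma * np.mean(irrelevant_docs, axis=0)
    return np.maximum(q_m, 0.0)       # negative term weights are commonly clipped to zero

Each feedback cycle then simply replaces the query vector by rocchio(q, marked_relevant, marked_irrelevant) before re-ranking.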

8 User relevance feedback: Probabilistic Model
User feedback re-estimates the term probabilities: P(ki | R) ≈ |Dr,i| / |Dr|, P(ki | ¬R) ≈ (ni − |Dr,i|) / (N − |Dr|)
Smoothing is usually applied
Bad:
o no document weights
o previous history is lost
o no new terms, only the weights are changed
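A rough sketch of the probabilistic re-weighting after feedback, with the common +0.5 / +1 smoothing; the function name and argument conventions are assumptions, not the book's notation. Here N is the collection size, n_i the number of docs containing the term, Dr the number of judged-relevant docs, and Dr_i the number of those containing the term.

import math

def prob_feedback_weight(N, n_i, Dr, Dr_i):
    p = (Dr_i + 0.5) / (Dr + 1.0)             # estimate of P(term | relevant)
    q = (n_i - Dr_i + 0.5) / (N - Dr + 1.0)   # estimate of P(term | non-relevant)
    return math.log(p / (1.0 - p)) + math.log((1.0 - q) / q)

Only the weights of the original query terms change; no new terms are introduced, which is exactly the limitation noted above.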

9 ... a variant of the Probabilistic Model
Similarity is multiplied by TF (term frequency)
o not exactly, but this is the idea
o initially, IDF is also taken into account
o details in the book
Still no query expansion, only re-weighting of the original terms

10 Evaluation of Relevance Feedback
Simplistic:
o evaluate precision and recall after the feedback cycle
o not realistic, since it includes the user's own feedback
Better:
o only consider unseen data
o use the rest of the collection
o figures are not as good
o useful to compare different methods, not to compare precision/recall before and after feedback
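A small illustrative helper for the "rest of the collection" (residual-collection) style of evaluation; the set-based precision/recall here is a simplification and the names are hypothetical.

def residual_eval(retrieved, seen, relevant):
    # score only docs the user did not already judge during feedback
    residual = [d for d in retrieved if d not in seen]
    retrieved_rel = [d for d in residual if d in relevant]
    unseen_rel = [d for d in relevant if d not in seen]
    precision = len(retrieved_rel) / len(residual) if residual else 0.0
    recall = len(retrieved_rel) / len(unseen_rel) if unseen_rel else 0.0
    return precision, recall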

11 2nd method: Automatic local analysis
Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships
o based on clustering techniques
Global vs. local strategy:
o global: the whole collection is used
o local: only the retrieved set is used. Similar to feedback, but automatic.
Local analysis seems to give better results (better adaptation to the specific query) but is time-consuming
o good for local collections, not for the Web
Build clusters of words; add to each keyword its neighbors

12 Clustering (words)
Association clusters
o terms that co-occur in the docs
o the clusters are the n terms that most frequently co-occur with the query terms (normalized vs. non-normalized)
Metric clusters (better)
o weight each co-occurrence by the proximity of the terms in the text
o terms that occur in the same sentence are more related
Scalar clusters
o terms co-occurring with the same other terms are related
o relatedness of two words = scalar product of the centroids of their association clusters
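A toy sketch of (unnormalized) association clusters built from the locally retrieved docs; tokenization, data shapes and the cluster size n are assumptions for illustration only.

from collections import Counter

def association_cluster(docs, query_term, n=5):
    # docs: list of token lists from the retrieved set
    # returns the n terms that most strongly co-occur with query_term
    co = Counter()
    for tokens in docs:
        freq = Counter(tokens)
        if query_term in freq:
            for term, f in freq.items():
                if term != query_term:
                    co[term] += freq[query_term] * f   # c_uv = sum over docs of f_u * f_v
    return [term for term, _ in co.most_common(n)]

A metric cluster would replace the per-document product by a sum of 1/distance over the positions where the two terms co-occur, and a scalar cluster would compare two terms by the similarity of their association vectors.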

13 ... variant (local clustering)
Metric-like reasoning:
Break the retrieved docs into passages (say, 300 words)
Use the passages as docs; use TF-IDF
Choose the words related (by TF-IDF) to the whole query
Better: words occurring near each other are more related
Tune for each collection
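A rough, illustrative sketch of this passage-based variant: split the retrieved docs into fixed-size passages and rank candidate terms by a TF-IDF-like score of co-occurrence with the whole query. The passage size, the scoring formula and all names are assumptions, not the book's exact definition.

import math
from collections import Counter

def passage_expansion_terms(retrieved_docs, query_terms, passage_len=300, n=5):
    passages = []
    for tokens in retrieved_docs:
        passages += [tokens[i:i + passage_len] for i in range(0, len(tokens), passage_len)]
    N = len(passages)
    df = Counter(t for p in passages for t in set(p))    # passage frequency of each term
    scores = Counter()
    for p in passages:
        tf = Counter(p)
        q_mass = sum(tf.get(q, 0) for q in query_terms)
        if q_mass:
            for t, f in tf.items():
                if t not in query_terms:
                    scores[t] += f * q_mass * math.log(N / df[t])   # damp frequent terms
    return [t for t, _ in scores.most_common(n)]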

14 3rd method: Automatic global analysis
Uses all docs in the collection
Builds a thesaurus
The terms related to the whole query are added (query expansion)

15 Similarity thesaurus
Relatedness = terms occur in the same docs
Term × doc frequency matrix
Inverse term frequency: weights are divided by the size of the doc
Relatedness of two terms = correlation between their rows of the matrix
Query: a weighted centroid of its terms (weighted sum)
Relatedness between a term and this centroid = cosine
The best terms are added to the query, with weights
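A simplified sketch of similarity-thesaurus expansion: each term is a vector over docs, the query is a weighted centroid of its terms, and candidate terms are ranked by cosine with that centroid. The matrix layout, the assumption that the itf scaling is already applied, and all names are illustrative.

import numpy as np

def expand_query(term_doc, vocab, query_weights, k=3):
    # term_doc: |V| x |D| matrix of (itf-scaled) term frequencies; query_weights: {term: weight}
    T = term_doc / (np.linalg.norm(term_doc, axis=1, keepdims=True) + 1e-12)
    idx = {t: i for i, t in enumerate(vocab)}
    centroid = sum(w * T[idx[t]] for t, w in query_weights.items())
    centroid /= (np.linalg.norm(centroid) + 1e-12)
    sims = T @ centroid                                   # cosine of every term with the query centroid
    ranked = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in ranked if vocab[i] not in query_weights][:k]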

16 (Global) Statistical thesaurus...
Terms added must be discriminative: low frequency
o but low-frequency terms are difficult to cluster (not enough info)
Solution: first cluster the docs; the combined frequencies inside a cluster are higher
Clustering docs, e.g.:
o each doc starts as its own cluster
o merge the two most similar clusters (their docs are similar)
o repeat until a stopping condition is reached (see page 136 of the book)
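A toy sketch of the document-clustering step; complete-link agglomeration is used here as one common choice, and the cosine similarity and stopping threshold are illustrative, not prescribed by the slides.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def agglomerate(doc_vectors, threshold=0.5):
    clusters = [[i] for i in range(len(doc_vectors))]
    def link(a, b):   # complete link: similarity of the least similar pair of docs
        return min(cosine(doc_vectors[i], doc_vectors[j]) for i in a for j in b)
    while len(clusters) > 1:
        pairs = [(link(a, b), ia, ib) for ia, a in enumerate(clusters)
                 for ib, b in enumerate(clusters) if ia < ib]
        best, ia, ib = max(pairs)
        if best < threshold:
            break
        clusters[ia] += clusters[ib]
        del clusters[ib]
    return clusters

Once the docs are clustered, the low-frequency terms that fall inside the same cluster can themselves be grouped, which is what the next slide builds on.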

17 ... statistical thesaurus
Convert the cluster hierarchy into a set of clusters
o use a threshold similarity level to cut the hierarchy
o don't take too-large clusters
Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class
o use an ITF threshold
o these give clusters of words
Calculate the weight of each class of terms; add these terms, with this weight, to the query terms

18 Research topics
Interactive interfaces
o graphical, 2D or 3D
Refining global analysis techniques
Application of linguistic methods: stemming, ontologies
Local analysis for the Web (currently too expensive)
Combine the three techniques (feedback, local, global)

19 Conclusions
Relevance feedback
o simple, understandable
o needs user attention
o term re-weighting
Local analysis for query expansion
o co-occurrences in the retrieved docs
o usually gives better results than global analysis
o computationally expensive
Global analysis
o results are not as good, since what is good for the whole collection is not necessarily good for a specific query
o linguistic methods, dictionaries, ontologies, stemming, ...

20 Exam
Questions and exercises: do what you consider appropriate
Discussion on Oct 23, or maybe Nov 6 (??)
The class of Oct 30 is moved to Oct 23

21 Thank you!
Till October 23
October 23: discussion of the midterm exam (the class of October 30 is moved to October 23)