Boolean Model Hongning Wang

Abstraction of search engine architecture
[Figure: the indexed corpus pipeline (Crawler -> Doc Analyzer -> Doc Representation -> Indexer -> Index) feeds the ranking procedure (user query -> Query Rep -> Ranker -> results -> User), with Evaluation and Feedback closing the loop]

Search with Boolean query
Boolean query
– E.g., “obama” AND “healthcare” NOT “news”
Procedures
– Look up each query term in the dictionary
– Retrieve its posting list
– Combine the posting lists
AND: intersect the posting lists
OR: union the posting lists
NOT: diff the posting lists
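A minimal sketch of these three operations, assuming posting lists are stored as sorted lists of document IDs (the data layout and function names are illustrative, not from the slides):

```python
def intersect(p1, p2):
    """AND: walk two sorted posting lists with two pointers."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: merge two sorted posting lists, dropping duplicates."""
    return sorted(set(p1) | set(p2))

def diff(p1, p2):
    """NOT: keep docs in p1 that are absent from p2."""
    exclude = set(p2)
    return [d for d in p1 if d not in exclude]

# "obama" AND "healthcare" NOT "news", on toy posting lists
obama, healthcare, news = [1, 3, 5, 8], [2, 3, 5, 9], [5]
print(diff(intersect(obama, healthcare), news))  # [3]
```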

Search with Boolean query
Example: AND operation
[Figure: scanning the postings of Term1 and Term2 in parallel]
Trick for speed-up: when performing a multi-way join, start from the lowest-frequency term and move toward the highest-frequency one, so the intermediate results stay as small as possible
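A sketch of that ordering trick under the same sorted-posting-list assumption (a toy illustration, not the slides' implementation):

```python
def intersect_many(posting_lists):
    """Multi-way AND over sorted posting lists: process lists from
    shortest to longest, so the running candidate set shrinks early."""
    lists = sorted(posting_lists, key=len)  # lowest-frequency term first
    result = lists[0]
    for plist in lists[1:]:
        members = set(plist)
        result = [d for d in result if d in members]
        if not result:  # early exit on an empty intersection
            return []
    return result

# three terms with very different frequencies
print(intersect_many([[2, 3, 5, 7, 9, 11], [3, 5], [1, 3, 5, 9]]))  # [3, 5]
```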

Deficiency of Boolean model
The query is unlikely to be precise
– “Over-constrained” query (terms too specific): no relevant documents can be found
– “Under-constrained” query (terms too general): over-delivery
– It is hard to find the right position between these two extremes (hard for users to specify constraints)
Even if the query is accurate
– Not all users are willing to write such queries
– Not all relevant documents are equally important
No one would go through all the matched results
Relevance is a matter of degree!

Document Selection vs. Ranking
[Figure: document selection applies a binary classifier f(d,q) ∈ {0,1}, so the retrieved set Rel’(q) can only approximate the true relevant set Rel(q); document ranking assigns each document a score rel(d,q), e.g., 0.98 for the top document, and lets the user decide where to stop reading]

Ranking is often preferred
Relevance is a matter of degree
– Easier for users to find appropriate queries
A user can stop browsing anywhere, so the boundary is controlled by the user
– Users who prefer coverage would view more items
– Users who prefer precision would view only a few
Theoretical justification: Probability Ranking Principle

Retrieval procedure in modern IR
Boolean model provides all the ranking candidates
– Locate documents satisfying the Boolean condition
E.g., “obama healthcare” -> “obama” OR “healthcare”
Rank candidates by relevance
– Important: the notion of relevance
Efficiency consideration
– Top-k retrieval (Google)

Notion of relevance
Relevance can be modeled as:
– Similarity between Rep(q) and Rep(d), with different representations and similarity measures
Vector space model (Salton et al., 75)
Prob. distr. model (Wong & Yao, 89)
…
– Probability of relevance, P(r=1|q,d), r ∈ {0,1}
Generative models
Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
Regression model (Fox, 83)
– Probabilistic inference, P(d→q) or P(q→d), with different inference systems
Prob. concept space model (Wong & Yao, 95)
Inference network model (Turtle & Croft, 91)
Today’s lecture: the similarity-based family

Recap: search result display
General considerations in search result display
Challenges and opportunities in mobile device search result display

Recap: ranking is often preferred
Relevance is a matter of degree
– Easier for users to find appropriate queries
A user can stop browsing anywhere, so the boundary is controlled by the user
– Users who prefer coverage would view more items
– Users who prefer precision would view only a few
Theoretical justification: Probability Ranking Principle

Recap: retrieval procedure in modern IR
Boolean model provides all the ranking candidates
– Locate documents satisfying the Boolean condition
E.g., “obama healthcare” -> “obama” OR “healthcare”
Rank candidates by relevance
– Important: the notion of relevance
Efficiency consideration
– Top-k retrieval (Google)

Intuitive understanding of relevance
Fill in magic numbers to describe the relation between documents and words
E.g., 0/1 for Boolean models, probabilities for probabilistic models
[Table: rows Doc1, Doc2, Query; columns information, retrieval, retrieved, is, helpful, for, you, everyone; each cell holds the term’s weight]

Some notations
Vocabulary V = {w_1, w_2, …, w_N} of a language
Query q = t_1, …, t_m, where t_i ∈ V
Document d_i = t_i1, …, t_in, where t_ij ∈ V
Collection C = {d_1, …, d_k}
Rel(q,d): relevance of doc d to query q
Rep(d): representation of document d
Rep(q): representation of query q

Notion of relevance
Relevance can be modeled as:
– Similarity between Rep(q) and Rep(d), with different representations and similarity measures
Vector space model (Salton et al., 75)
Prob. distr. model (Wong & Yao, 89)
…
– Probability of relevance, P(r=1|q,d), r ∈ {0,1}
Generative models
Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
Regression model (Fox, 83)
– Probabilistic inference, P(d→q) or P(q→d), with different inference systems
Prob. concept space model (Wong & Yao, 95)
Inference network model (Turtle & Croft, 91)
Today’s lecture: the similarity-based family

Vector Space Model Hongning Wang

Relevance = Similarity
Assume query and documents are represented in the same form, so that rel(q,d) can be measured by similarity(Rep(q), Rep(d))

Vector space model
Represent both doc and query by concept vectors
– Each concept defines one dimension
– K concepts define a high-dimensional space
– Each element of the vector corresponds to a concept weight
E.g., d = (x_1, …, x_k), where x_i is the “importance” of concept i
Measure relevance
– Distance between the query vector and document vector in this concept space
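A toy sketch of this representation, assuming (as the following slides do) that the “concepts” are simply terms, so each text becomes a vector of raw term counts over a fixed vocabulary (names here are illustrative):

```python
from collections import Counter

def to_vector(text, vocabulary):
    """Map a text to a term-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["obama", "healthcare", "news", "reform"]
doc = "obama pushes healthcare reform healthcare news"
query = "obama healthcare"
print(to_vector(doc, vocab))    # [1, 2, 1, 1]
print(to_vector(query, vocab))  # [1, 1, 0, 0]
```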

VS Model: an illustration
Which document is closer to the query?
[Figure: documents D1–D5 and the query plotted as vectors in a 3-D concept space with axes Sports, Education, Finance]

What the VS model doesn’t say
How to define/select the “basic concepts”
– Concepts are assumed to be orthogonal
How to assign weights
– Weight in query indicates importance of the concept
– Weight in doc indicates how well the concept characterizes the doc
How to define the similarity/distance measure

What is a good “basic concept”?
Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
– No ambiguity
Weights can be assigned automatically and accurately
Existing solutions
– Terms or N-grams, i.e., bag-of-words
– Topics, i.e., topic models
We will come back to this later

How to assign weights?
Important! Why?
– Query side: not all terms are equally important
– Doc side: some terms carry more information about the content
How? Two basic heuristics
– TF (Term Frequency) = within-doc frequency
– IDF (Inverse Document Frequency)

TF weighting
Idea: a term is more important if it occurs more frequently in a document
Simplest form: TF(t,d) = c(t,d), the raw count of term t in document d

TF normalization
Query: iphone 6s
– D1: iPhone 6s receives pre-orders on September 12.
– D2: iPhone 6 has three color options.
– D3: iPhone 6 has three color options. iPhone 6 has three color options. iPhone 6 has three color options.
With raw TF, D3 outscores D1 simply by repeating D2’s sentence three times, although it adds no content

TF normalization
Two views of document length
– A doc is long because it is verbose
– A doc is long because it has more content
Raw TF is inaccurate
– Document length variation
– “Repeated occurrences” are less informative than the “first occurrence”
– Relevance does not increase proportionally with the number of term occurrences
Generally penalize long docs, but avoid over-penalizing
– Pivoted length normalization (one standard formula is sketched below)
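One standard instantiation (the pivoted normalization of Singhal et al., shown here as background rather than reproduced from the slides) combines a sub-linear TF with a length penalty pivoted at the average document length:

```latex
\mathrm{TF}_{\text{pivoted}}(t,d) \;=\;
\frac{1 + \ln\bigl(1 + \ln c(t,d)\bigr)}
     {1 - s + s \cdot \frac{|d|}{\mathrm{avdl}}},
\qquad c(t,d) > 0
```

where c(t,d) is the raw count of t in d, |d| is the document length, avdl is the average document length in the collection, and s ∈ [0,1] is the pivot slope (often around 0.2).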

TF normalization
[Figures: two normalized-TF curves plotted as functions of raw TF]

Document frequency
Idea: a term is more discriminative if it occurs in fewer documents

IDF weighting
IDF(t) = 1 + log(N / df(t))
– N: total number of docs in the collection
– Non-linear scaling: the log dampens the effect of document frequency

Why document frequency?
Word      | ttf    | df
try       | 10,422 | 8,760
insurance | 10,440 | 3,997
Table 1. Example of total term frequency vs. document frequency in the Reuters-RCV1 collection.
The two terms have nearly identical total counts, but “insurance” is concentrated in far fewer documents, so it is the more discriminative term
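Plugging these df values into the IDF formula above gives a worked comparison; the collection size N = 806,791 is the Reuters-RCV1 document count reported in the IIR book, and the natural log is assumed, so treat the exact constants as assumptions:

```latex
\mathrm{IDF}(\text{try}) = 1 + \ln\frac{806{,}791}{8{,}760} \approx 5.52,
\qquad
\mathrm{IDF}(\text{insurance}) = 1 + \ln\frac{806{,}791}{3{,}997} \approx 6.31
```

Despite nearly identical total counts, “insurance” receives the higher weight.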

TF-IDF weighting
Combine the two heuristics: w(t,d) = TF(t,d) × IDF(t)
“Salton was perhaps the leading computer scientist working in the field of information retrieval during his time.” – Wikipedia
Gerard Salton Award – highest achievement award in IR
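A compact sketch of TF-IDF weighting over a toy corpus (one of the many variants; the IDF form follows the 1 + log(N/df) formula above, and the corpus and names are otherwise illustrative):

```python
import math
from collections import Counter

docs = [
    "obama pushes healthcare reform",
    "healthcare stocks rise",
    "obama speech on healthcare news",
]

N = len(docs)
tokenized = [doc.split() for doc in docs]

# document frequency: number of docs containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """TF-IDF vector for one document: raw TF times 1 + ln(N/df)."""
    tf = Counter(tokens)
    return {t: c * (1 + math.log(N / df[t])) for t, c in tf.items()}

for tokens in tokenized:
    print(tf_idf(tokens))
```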

How to define a good similarity measure?
Euclidean distance?
[Figure: documents D1–D5 and the query as points in TF-IDF space with axes Sports, Education, Finance]

How to define a good similarity measure?
Euclidean distance: dist(q,d) = sqrt(Σ_t (w(t,q) − w(t,d))²)
– Problem: it penalizes longer documents, even those that simply use the relevant terms more often; hence the move from distance to angle on the next slide

From distance to angle
Angle: how much the vectors overlap
– Cosine similarity – projection of one vector onto another
[Figure: documents D1 and D2 and the query in TF-IDF space (axes Sports, Finance); the choice of Euclidean distance and the choice of angle pick different nearest documents]

Cosine similarity
cosine(q,d) = (V_q · V_d) / (|V_q| |V_d|), i.e., the dot product of the unit vectors of the TF-IDF vectors of q and d
[Figure: D1, D2, and the query in TF-IDF space (axes Sports, Finance), compared by angle]
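A direct implementation of this measure, assuming the query and documents are already sparse TF-IDF vectors like those produced by the sketch above (the weights are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

q = {"obama": 1.4, "healthcare": 1.0}
d1 = {"obama": 1.4, "healthcare": 1.0, "reform": 2.1}
d2 = {"healthcare": 1.0, "stocks": 2.1}
print(cosine(q, d1), cosine(q, d2))  # d1 scores higher
```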

Fast computation of cosine in retrieval
For ranking purposes the query norm is the same for every document and can be dropped, so the score reduces to a sum of per-term contributions over the query terms that match the document

Fast computation of cosine in retrieval
Maintain a score accumulator for each doc while scanning the postings
Query = “info security”, S(d,q) = g(t_1) + … + g(t_n) [sum of TF of matched terms]
Postings:
– info: (d1, 3), (d2, 4), (d3, 1), (d4, 5)
– security: (d2, 3), (d4, 1), (d5, 3)
Scanning “info” and then “security”, the accumulators end at d1 = 3, d2 = 4 + 3 = 7, d3 = 1, d4 = 5 + 1 = 6, d5 = 3
Can be easily applied to TF-IDF weighting!
Keep only the most promising accumulators for top-k retrieval
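A term-at-a-time sketch of this accumulator pattern, reproducing the toy example above (here the per-term contribution g is plain TF, as on the slide; TF-IDF weights drop in unchanged):

```python
from collections import defaultdict
import heapq

postings = {
    "info":     [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}

def score(query_terms, k=3):
    """Term-at-a-time scoring: one accumulator per candidate doc."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc, tf in postings.get(term, []):
            acc[doc] += tf  # g(t) = TF here; use a TF-IDF weight in practice
    # keep only the top-k accumulators
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])

print(score(["info", "security"]))  # [('d2', 7.0), ('d4', 6.0), ('d1', 3.0)]
```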

Advantages of VS Model
Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well studied and extensively evaluated
The SMART system
– Developed at Cornell
– Still widely used
Warning: many variants of TF-IDF!

Disadvantages of VS Model
Assumes term independence
Assumes query and documents should be represented and treated in the same way
Lack of “predictive adequacy”
– Arbitrary term weighting
– Arbitrary similarity measure
Lots of parameter tuning!

What you should know
Document ranking vs. selection
Basic idea of the vector space model
Two important heuristics in the VS model
– TF
– IDF
Similarity measures for the VS model
– Euclidean distance vs. cosine similarity

Today’s reading
Chapter 1: Boolean retrieval
– 1.3 Processing Boolean queries
– 1.4 The extended Boolean model versus ranked retrieval
Chapter 6: Scoring, term weighting and the vector space model
– 6.2 Term frequency and weighting
– 6.3 The vector space model for scoring
– 6.4 Variant tf-idf functions