Retrieval Models: Boolean and Vector Space

11-741: Information Retrieval. Jamie Callan, Carnegie Mellon University

Lecture Outline Introduction to retrieval models Exact-match retrieval
Unranked Boolean retrieval model Ranked Boolean retrieval model Best-match retrieval Vector space retrieval model Term weights Boolean query operators (p-norm) © 2003, Jamie Callan

What is a Retrieval Model?
A model is an abstract representation of a process or object Used to study properties, draw conclusions, make predictions The quality of the conclusions depends upon how closely the model represents reality A retrieval model describes the human and computational processes involved in ad-hoc retrieval Example: A model of human information seeking behavior Example: A model of how documents are ranked computationally Components: Users, information needs, queries, documents, relevance assessments, …. Retrieval models define relevance, explicitly or implicitly

Basic IR Processes
[Diagram: Information Need → Representation → Query; Document → Representation → Indexed Objects; Query and Indexed Objects → Comparison → Retrieved Objects → Evaluation/Feedback] A retrieval model tells us how to instantiate the different parts of this very abstract diagram, i.e., how to represent documents, how to represent information needs, and how to decide whether a document satisfies an information need. When we talk about different retrieval models in future lectures, they will still follow this same abstract diagram, but they will instantiate the different elements differently.

Overview of Major Retrieval Models
Boolean Vector space Basic vector space Extended Boolean model Probabilistic models Basic probabilistic model Bayesian inference networks Language models Citation analysis models Hubs & authorities (Kleinberg, IBM Clever) Page rank (Google)

Types of Retrieval Models: Exact Match vs. Best Match Retrieval
Exact Match Query specifies precise retrieval criteria Every document either matches or fails to match query Result is a set of documents Usually in no particular order Often in reverse-chronological order Best Match Query describes retrieval criteria for desired documents Every document matches a query to some degree Result is a ranked list of documents, “best” first

Overview of Major Retrieval Models
Boolean Exact Match Vector space Best Match Basic vector space Extended Boolean model Latent Semantic Indexing (LSI) Probabilistic models Best Match Basic probabilistic model Bayesian inference networks Language models Citation analysis models Hubs & authorities (Kleinberg, IBM Clever) Best Match Page rank (Google) Exact Match

Exact Match vs. Best Match Retrieval
Best-match models are usually more accurate / effective Good documents appear at the top of the rankings Good documents often don’t exactly match the query Query was too strict “Multimedia AND NOT (Apple OR IBM OR Microsoft)” Document didn’t match user expectations “multi database retrieval” vs “multi db retrieval” Exact match still prevalent in some markets Large installed base that doesn’t want to migrate to new software Efficiency: Exact match is very fast, good for high volume tasks Sufficient for some tasks Web “advanced search” For the Multimedia query, point out that many of the relevant documents for this query *did* mention Apple or IBM or Microsoft, because they would say something like “Real Player has a larger market share than Microsoft’s Windows Media Player” or “WinAmp has features not yet found in Apple’s QuickTime”. The user didn’t want a document to *focus* on Apple or IBM or Microsoft, but a brief mention was actually fine, so this is a case where the query was not a good representation of the underlying (in someone’s head) information need. For the multi database retrieval query, the user expressed the information need in one way, “multi database retrieval”, but a relevant document said “multi db retrieval”, so an exact match model would fail to find that document.

Retrieval Model #1: Unranked Boolean
Unranked Boolean is the most common Exact Match model Model Retrieve documents iff they satisfy a Boolean expression Query specifies precise relevance criteria Documents returned in no particular order Operators: Logical operators: AND, OR, AND NOT Unconstrained NOT is expensive, so often not included Distance operators: Near/1, Sentence/1, Paragraph/1, … String matching operators: Wildcard Field operators: Date, Author, Title Note: Unranked Boolean model not the same as Boolean queries Unconstrained NOT is expensive because (NOT Callan) matches almost every document in the database, and if the database is large, that means an intermediate data structure containing a very large number of documents (even if the final set is small). It’s more common to use an ANDNOT operator, as in “(distributed AND retrieval) ANDNOT callan”. The first part of the operator forms a (presumably small) set of documents, and then the second part removes some of them. Much more efficient. Make sure they understand the last point. A Boolean retrieval model always uses Boolean queries. However, Boolean queries can also be used with other retrieval models, e.g., probabilistic.
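The set-based evaluation described above can be sketched in a few lines. This is an illustrative toy, not any production system's implementation: the inverted index, document IDs, and postings are invented for the example.

```python
# Unranked Boolean retrieval over a toy inverted index.
# Each query operator is a set operation on postings (sets of doc IDs).

index = {
    "distributed": {1, 3},
    "retrieval":   {1, 2, 3},
    "callan":      {3},
}

def AND(a, b):      # both terms must appear
    return a & b

def OR(a, b):       # either term may appear
    return a | b

def ANDNOT(a, b):   # restrict an existing (presumably small) set; much
    return a - b    # cheaper than a bare NOT, which would enumerate
                    # nearly the whole collection

# (distributed AND retrieval) ANDNOT callan
result = ANDNOT(AND(index["distributed"], index["retrieval"]), index["callan"])
print(sorted(result))  # -> [1]; the result is a set, in no particular order
```

Note how ANDNOT only ever touches the documents already selected by the left operand, which is exactly why it is preferred over an unconstrained NOT.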

Unranked Boolean: Example
A typical Boolean query (distributed NEAR/1 information NEAR/1 retrieval) OR ((collection OR database OR resource) NEAR/1 selection) OR (multi NEAR/1 database NEAR/1 search) OR ((FIELD author callan) AND CORI) OR ((FIELD author hawking) AND NOT stephen) Studies show that people are not good at creating Boolean queries People overestimate the quality of the queries they create Queries are too strict: Few relevant documents found Queries are too loose: Too many documents found (few relevant) The wildcard operator search* matches “search”, “searching”, “searched”, etc. This was common before stemming algorithms were introduced. Still used in some applications, e.g., to match against names with variable spellings (e.g., Chinese names expressed in English, Arab names expressed in English) and some genome retrieval tasks. People overestimate query quality because they only see what was retrieved. If the set of retrieved documents is mostly good (high Precision), they are happy. It is very difficult to tell if a query has low Recall, because the user doesn’t see how many good documents were not retrieved.

Unranked Boolean: WESTLAW
Large commercial system Serves legal and professional markets Legal: Court cases, statutes, regulations, … News: Newspapers, magazines, journals, … Financial: Stock quotes, SEC materials, financial analyses Total collection size: 5-7 Terabytes 700,000 users In operation since 1974 Best-match and free text queries added in 1992

Unranked Boolean: WESTLAW
Boolean operators Proximity operators Phrases: “Carnegie Mellon” Word proximity: Language /3 Technologies Same sentence (/s) or paragraph (/p): Microsoft /s “anti trust” Restrictions: Date (After 1996 & Before 1999) Query expansion: Wildcard and truncation: Al*n Lev! Automatic expansion of plurals and possessives Document structure (fields): Title (gGlOSS) Citations: Cites (Callan) & Date (After 1996) The wildcard example matches Alon Levy, and other strings. We haven’t covered query expansion yet, so mention that it is a process of automatically inserting additional words into a query to help bridge the vocabulary mismatch between queries and documents. In this case it is used for plurals (computers, computer) and possessives (John, John’s).

Unranked Boolean: WESTLAW
Queries are typically developed incrementally Implicit relevance feedback V1: machine AND learning V2: (machine AND learning) OR (neural AND networks) OR (decision AND tree) V3: (machine AND learning) OR (neural AND networks) OR (decision AND tree) AND C4.5 OR Ripper OR EG OR EM Queries are complex Proximity operators used often NOT is rare Queries are long (9-10 words, on average)

Unranked Boolean: WESTLAW
Two stopword lists: 1 for indexing, 50 for queries Legal thesaurus: To help people expand queries Document ordering: Determined by user (often date) Query term highlighting: Help people see how document matched Performance: Response time controlled by varying amount of parallelism Average response time set to 7 seconds Even for collections larger than 100 GB We haven’t covered stopword lists yet, so you can talk about that, or you can just say it will be covered in a later lecture. Explain why response time is rarely *less* than 7 seconds. Essentially West wants people to think the system is working hard (remember, these are paying customers). If the system responds instantly, then the user thinks it was an “easy” search and is less happy about paying for it. If the system seems to work very hard, then the user is happier to pay for it. Also, users are happier if the system is always about the same speed, i.e., very predictable; if the system is fast some days and slow others, users don’t know what to expect, and they get grumpy on “slow” days. So, if the load is low, the system searches (quickly), but waits before displaying an answer.

Unranked Boolean: Summary
Advantages: Very efficient Predictable, easy to explain Structured queries Works well when searchers know exactly what is wanted Disadvantages: Most people find it difficult to create good Boolean queries Difficulty increases with size of collection Precision and recall usually have strong inverse correlation Predictability of results causes people to overestimate recall Documents that are “close” are not retrieved

Term Weights: A Brief Introduction
The words of a text are not equally indicative of its meaning “Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth's magnetic field, but we have a lot to learn about monarchs' sense of direction.” Important: Butterflies, monarchs, scientists, direction, compass Unimportant: Most, think, kind, sky, determine, cues, learn Term weights reflect the (estimated) importance of each term There are many variations on how they are calculated Historically it was tf weights The standard approach for many IR systems is tf.idf weights tf and tf.idf were discussed briefly in the last lecture.
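A minimal sketch of one common tf.idf variant, tf × log(N / df). The vector space model does not mandate this exact formula, and the toy documents below are invented for illustration.

```python
import math

# tf.idf sketch: a term's weight in a document is its frequency there (tf)
# scaled by how rare it is across the collection (idf = log(N / df)).

docs = [
    "butterflies use the sun as a compass".split(),
    "scientists study the magnetic field".split(),
    "butterflies may use the magnetic field".split(),
]
N = len(docs)

def df(term):
    # document frequency: number of documents containing the term
    return sum(term in d for d in docs)

def tf_idf(term, doc):
    tf = doc.count(term)
    return tf * math.log(N / df(term)) if tf else 0.0

# "the" occurs in every document, so idf = log(3/3) = 0: weight 0.0
print(tf_idf("the", docs[0]))          # -> 0.0
print(tf_idf("butterflies", docs[0]))  # 1 * log(3/2), about 0.405
```

This is why common function words get near-zero weights even when they are frequent: their idf factor vanishes.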

Retrieval Model #2: Ranked Boolean
Ranked Boolean is another common Exact Match retrieval model Model Retrieve documents iff they satisfy a Boolean expression Query specifies precise relevance criteria Documents returned ranked by frequency of query terms Operators: Logical operators: AND, OR, AND NOT Unconstrained NOT is expensive, so often not included Distance operators: Proximity String matching operators: Wildcard Field operators: Date, Author, Title

Exact Match Retrieval: Ranked Boolean
How document scores are calculated: Term weight: How often term i occurs in document j (tf_i,j), or a tf_i,j × idf_i weight AND weight: Minimum of argument weights OR weight: Maximum of argument weights, or sum of argument weights [Figure: an example query tree combining term weights with AND and OR operators]
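The scoring rule above (AND takes the minimum of its arguments, OR the maximum) can be sketched directly. The term weights here are raw term frequencies in one invented document; tf.idf weights could be substituted without changing the operators.

```python
# Ranked Boolean scoring: term scores are term weights (here raw tf);
# AND returns the minimum argument weight, OR the maximum.

tf = {"apple": 3, "pie": 1, "cake": 0}   # term frequencies in one toy document

def AND(*scores):
    return min(scores)

def OR(*scores):
    return max(scores)

# Score the document for the query: apple AND (pie OR cake)
score = AND(tf["apple"], OR(tf["pie"], tf["cake"]))
print(score)  # -> min(3, max(1, 0)) = 1
```

Scoring every matching document this way and sorting by score gives the ranked result list, ordered by how redundantly each document satisfies the query.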

Exact Match Retrieval: Ranked Boolean
Advantages: All of the advantages of the unranked Boolean model Very efficient, predictable, easy to explain, structured queries, works well when searchers know exactly what is wanted Result set is ordered by how redundantly a document satisfies a query Usually enables a person to find relevant documents more quickly Other term weighting methods can be used, too Example: tf, tf.idf, …

Exact Match Retrieval: Ranked Boolean
Disadvantages: It’s still an Exact-Match model Good Boolean queries hard to create Difficulty increases with size of collection Precision and recall usually have strong inverse correlation Predictability of results causes people to overestimate recall The returned documents match expectations…. …so it is easy to forget that many relevant documents are missed Documents that are “close” are not retrieved Difficulty increases with size of collection because it becomes harder to create a “small” result set that contains most of the relevant documents. Also because it is harder for a user to “know” a large collection, i.e., how concepts are expressed, vocabulary patterns, etc, so it is more likely that a user will express a concept in ways that don’t exactly match the documents (“multi db” vs “multi database”).

Are Boolean Retrieval Models Still Relevant?
Many people prefer Boolean Professional searchers (e.g., librarians, paralegals) Some Web surfers (e.g., “Advanced Search” feature) About 80% of WESTLAW & Dialog searches are Boolean What do they like? Control, predictability, understandability Boolean and free-text queries find different documents Solution: Retrieval models that support free-text and Boolean queries Recall that almost any retrieval model can be Exact Match Extended Boolean (vector space) retrieval model Bayesian inference networks

Retrieval Model #3: Vector Space Model
Any text object can be represented by a term vector Examples: Documents, queries, sentences, … Similarity is determined by distance in a vector space Example: The cosine of the angle between the vectors The SMART system: Developed at Cornell University, still widely used

How Different Retrieval Models View Ad-hoc Retrieval
Boolean Query: A set of FOL (first-order logic) conditions a document must satisfy Retrieval: Deductive inference Vector space Query: A short document Retrieval: Finding similar objects

Vector Space Retrieval Model: Introduction
How are documents represented in the binary vector model? [Table: one row per document (Doc1, Doc2, Doc3, …), one column per term (Term1 … Termn); each cell is 0 or 1.] A document is represented as a vector of binary values One dimension per term in the corpus vocabulary An unstructured query can also be represented as a vector Linear algebra is used to determine which vectors are similar
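The binary representation above can be sketched concretely. The tiny collection and query are invented; a real system would build these vectors from an inverted index rather than materializing them.

```python
# Binary vector representation: one dimension per vocabulary term,
# with a 1 if the term occurs in the text and a 0 otherwise.

docs = {
    "Doc1": "boolean retrieval model",
    "Doc2": "vector space model",
}
query = "boolean model"

# Vocabulary defines the dimensions of the vector space
vocab = sorted({t for text in docs.values() for t in text.split()})

def to_binary_vector(text):
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocab]

q = to_binary_vector(query)
for name, text in docs.items():
    d = to_binary_vector(text)
    overlap = sum(qi * di for qi, di in zip(q, d))  # inner product
    print(name, overlap)  # Doc1 shares 2 query terms, Doc2 shares 1
```

With binary vectors the inner product simply counts shared terms; the later slides replace the 0/1 entries with real-valued term weights.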

Vector Space Representation: Linear Algebra Review
Formally, a vector space is defined by a set of linearly independent basis vectors. Basis vectors: correspond to the dimensions or directions in the vector space; determine what can be described in the vector space; and must be orthogonal, or linearly independent, i.e., a value along one dimension implies nothing about a value along another. [Figures: basis vectors for 2 dimensions and for 3 dimensions]

Vector Space Representation
What should be the basis vectors for information retrieval? “Basic” concepts: Difficult to determine (Philosophy? Cognitive science?) Orthogonal (by definition) A relatively static vector space (“there are no new ideas”) Terms (Words, Word stems): Easy to determine Not really orthogonal (orthogonal enough?) A constantly growing vector space: new vocabulary creates new dimensions

Vector Space Representation: Vector Coefficients
The coefficients (vector elements, term weights) represent term presence, importance, or “representativeness” The vector space model does not specify how to set term weights Some common elements: Document term weight: Importance of the term in this document Collection term weight: Importance of the term in this collection Length normalization: Compensate for documents of different lengths Naming convention for term weight functions: DCL.DCL First triple is document term weights, second triple is query term weights n=none (no weighting on that factor) Example: lnc.ltc Make sure that they really understand that the vector space model says nothing about how to set term weights. This is important. Otherwise they will be confused by the next few slides, which show that the vector space folks try all sorts of strange methods to set term weights.

Vector Space Representation: Term Weights
There have been many vector-space term weight functions lnc.ltc: A very popular and commonly-used choice “l”: log (tf) + 1 “n”: No weight / normalization (i.e., 1.0) “t”: log (N / df) “c”: cosine normalization For example, the lnc.ltc score of document d for query q is: score(d, q) = Σ_t [ (log tf_t,d + 1) / ||d|| ] × [ (log tf_t,q + 1) × log (N / df_t) / ||q|| ] where ||d|| and ||q|| are the Euclidean lengths (cosine normalization) of the weighted document and query vectors.
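A sketch of lnc.ltc scoring as defined above: document vectors use (log tf + 1) with cosine normalization and no collection weight ("lnc"); query vectors use (log tf + 1) × log(N / df) with cosine normalization ("ltc"). The collection statistics (N, df) and the toy term frequencies are invented for the example.

```python
import math

def l_weight(tf):
    # "l": sublinear tf scaling, log(tf) + 1
    return math.log(tf) + 1 if tf > 0 else 0.0

def cosine_normalize(vec):
    # "c": divide every element by the vector's Euclidean length
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

N = 1000                                   # collection size (invented)
df = {"boolean": 50, "retrieval": 200}     # document frequencies (invented)

doc_tf = {"boolean": 4, "retrieval": 1}
qry_tf = {"boolean": 1, "retrieval": 1}

doc = cosine_normalize({t: l_weight(tf) for t, tf in doc_tf.items()})
qry = cosine_normalize({t: l_weight(tf) * math.log(N / df[t])   # "lt"
                        for t, tf in qry_tf.items()})

# Inner product of the two normalized vectors = cosine similarity
score = sum(doc[t] * qry.get(t, 0.0) for t in doc)
print(round(score, 3))  # about 0.995 for these toy numbers
```

Note the asymmetry: idf is applied only on the query side, which is cheaper at indexing time and gives the same relative ranking as applying it on both sides.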

Term Weights: A Brief Introduction
Inverse document frequency (idf): Terms that occur in many documents in the collection are less useful for discriminating among documents Document frequency (df): number of documents containing the term idf often calculated as: idf_t = log (N / df_t), where N is the number of documents in the collection Idf is −log p, where p = df_t / N: the self-information (surprisal) of observing the term

Vector Space Representation: Term Weights
Cosine normalization favors short documents “Pivoting” helps, but shifts the bias towards long documents Empirically, lnu and Lnu document weighting are more effective: “u”: pivoted unique normalization “L”: compensate for higher tfs in long documents (Singhal, 1996; Singhal, 1998)

Vector Space Representation: Term Weights
dnb.dtn: A more recent attempt to handle varied document lengths “d”: 1 + ln (1 + ln (tf)), (0 if tf = 0) “t”: log ((N+1)/df) “b”: 1 / ((1 − s) + s × doclen / avg_doclen), a pivoted length normalization with slope parameter s What next? (Singhal, et al, 1999)
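The "d" weight above is fully specified, so its behavior is easy to demonstrate: the double logarithm grows very slowly, damping the advantage of terms repeated many times. The tf values below are arbitrary illustrations.

```python
import math

# "d" document term weight from dnb.dtn: 1 + ln(1 + ln(tf)) for tf > 0,
# else 0. Doubly-logarithmic, so term frequency has diminishing returns.

def d_weight(tf):
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

for tf in (1, 10, 100):
    print(tf, round(d_weight(tf), 3))
```

Going from tf = 1 to tf = 10 adds more weight than going from tf = 10 to tf = 100, which is the intended protection against very long or repetitive documents.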

Vector Space Retrieval Model
How are documents represented using a tf.idf representation? [Table: one row per document (Doc1, Doc2, Doc3, …), one column per term (T1 … Tn); each cell is a real-valued tf.idf weight, e.g., 0.53 or 0.00.] A document is represented as a vector of tf.idf values One dimension per term in the corpus vocabulary An unstructured query can also be represented as a vector Linear algebra is used to determine which vectors are similar

Vector Space Similarity
Similarity is inversely related to the angle between the vectors. Doc2 is the most similar to the Query. Rank the documents by their similarity. [Figure: the Query, Doc1, and Doc2 plotted as vectors in term space; the angle between the Query and Doc2 is smallest]

Vector Space Similarity: Common Measures
Sim(X,Y), for binary and weighted term vectors: Inner product: X · Y (for binary vectors, the number of shared nonzero dimensions) Dice coefficient: 2 (X · Y) / (|X|² + |Y|²), a length-normalized inner product Cosine coefficient: (X · Y) / (|X| |Y|), like Dice, but with a lower penalty when the vectors have different numbers of features (as with a query and a document) Jaccard (or Tanimoto) coefficient: (X · Y) / (|X|² + |Y|² − X · Y), like Dice, but penalizes low-overlap cases Inner product, Dice, and Jaccard are not used anymore; they are mentioned here for historical significance.
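Of the measures above, cosine is the one still in common use. A minimal sketch over sparse weighted vectors, with invented toy weights:

```python
import math

# Cosine similarity over sparse weighted term vectors (dicts of
# term -> weight), then ranking documents by similarity to a query.

def cosine(x, y):
    dot = sum(x[t] * y.get(t, 0.0) for t in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

query = {"boolean": 1.0, "retrieval": 1.0}
docs = {
    "Doc1": {"boolean": 2.0},
    "Doc2": {"boolean": 1.0, "retrieval": 1.5},
}

# Best-match retrieval: rank the documents, "best" first
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranking)  # -> ['Doc2', 'Doc1']: Doc2 shares both query terms
```

Because cosine divides by both vector lengths, a short query can still be meaningfully compared to a much longer document, which is why it suits query-document matching better than Dice or Jaccard.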

Vector Space Similarity: Example

Vector Space Similarity: The Extended Boolean (p-norm) Model
Some similarity measures mimic Boolean AND, OR, NOT p indicates the degree of “strictness”: 1 is the least strict; ∞ is the most strict, i.e., the Boolean case; 2 tends to be a good choice.
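A sketch of the p-norm operators in the extended Boolean model of Salton, Fox, and Wu, for equally weighted query terms with weights in [0, 1]. At p = 1 the operators reduce to a simple average (vector-space-like behavior); as p grows, OR approaches the maximum and AND the minimum, recovering strict (ranked) Boolean behavior. The example weights are invented.

```python
# p-norm extended Boolean operators over term weights in [0, 1].

def p_or(weights, p):
    # ((w1^p + ... + wn^p) / n)^(1/p)
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def p_and(weights, p):
    # 1 - (((1-w1)^p + ... + (1-wn)^p) / n)^(1/p)
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

a, b = 0.8, 0.2
print(p_or([a, b], 1))    # -> 0.5: simple average, least strict
print(p_or([a, b], 100))  # approaches max(a, b) = 0.8, the Boolean OR
print(p_and([a, b], 100)) # approaches min(a, b) = 0.2, the Boolean AND
```

Intermediate values such as p = 2 interpolate between the two extremes, which is why they tend to work well in practice.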

Vector Space Similarity: The Extended Boolean (p-norm) Model
Characteristics: Full Boolean queries (White AND House) OR (Bill AND Clinton AND (NOT Hillary)) Ranked retrieval Handles tf · idf weighting Effective Boolean operators are not used often with the vector space model; it is not clear why.

Vector Space Retrieval Model: Summary
Standard vector space each dimension corresponds to a term in the vocabulary vector elements are real-valued, reflecting term importance any vector (document, query, …) can be compared to any other cosine correlation is the similarity metric used most often Ranked Boolean vector elements are binary Extended Boolean multiple nested similarity measures

Vector Space Retrieval Model: Disadvantages
Assumed independence relationships among terms Lack of justification for some vector operations e.g., choice of similarity function e.g., choice of term weights Barely a retrieval model Doesn’t explicitly model relevance, a person’s information need, language models, etc. Assumes a query and a document can be treated the same Lack of a cognitive (or other) justification