Basic Information Retrieval


1 Basic Information Retrieval
CS246: Basic Information Retrieval

2 Today’s Topic
- Basic Information Retrieval (IR): bag-of-words assumption
- Boolean model: inverted index
- Vector-space model: document-term matrix, TF-IDF vectors, cosine similarity
- Phrase queries
- Spell correction

3 Information-Retrieval System
- Information source: existing text documents
- Keyword-based / natural-language queries
- The system returns the best-matching documents for a given query
Challenge
- Both queries and data are “fuzzy”: unstructured text and natural-language queries
- What documents are good matches for a query? Computers do not “understand” the documents or the queries
- Developing a computational “model” of matching is essential to implement this approach

4 Bag of Words: Major Simplification
- Consider each document as a “bag of words” (“bag” vs. “set”: ignore word ordering, but keep word counts)
- Consider queries as bags of words as well
- A great oversimplification, but it works adequately in many cases
  - “John loves only Jane” vs. “Only John loves Jane”
  - The limitation still shows up in current search engines
- Still, how do we match documents and queries? (See the sketch below.)
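A minimal illustration of the bag-of-words representation, using Python’s collections.Counter with a deliberately naive whitespace tokenizer:

    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on whitespace -- word order is lost, counts are kept
        return Counter(text.lower().split())

    # The two example sentences map to the same bag:
    print(bag_of_words("John loves only Jane"))
    print(bag_of_words("Only John loves Jane") == bag_of_words("John loves only Jane"))  # True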

5 Boolean Model
- Return all documents that contain the words in the query
- Simplest model for information retrieval
- No notion of “ranking”: a document is either a match or a non-match
- Q: How do we find and return the matching documents? Basic algorithm? Useful data structure?

6 Inverted Index
Allows quick lookup of the document ids containing a particular word. The lexicon/dictionary (DIC) maps each word to its postings list (PL):
- Stanford → 3, 8, 10, 13, 16, 20
- UCLA → 1, 2, 3, 9, 16, 18
- MIT → 4, 5, 8, 10, 13, 19, 20, 22
Q: How can we use this to answer the query “UCLA Physics”? (See the intersection sketch below.)
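A conjunctive query like “UCLA Physics” is answered by intersecting the postings lists of the two words. A minimal merge-style sketch, assuming each postings list is sorted by docid:

    def intersect(pl1, pl2):
        # Walk both sorted postings lists in lockstep; O(len(pl1) + len(pl2))
        result, i, j = [], 0, 0
        while i < len(pl1) and j < len(pl2):
            if pl1[i] == pl2[j]:
                result.append(pl1[i])
                i += 1
                j += 1
            elif pl1[i] < pl2[j]:
                i += 1
            else:
                j += 1
        return result

    # e.g., documents containing both query words:
    # intersect(PL("UCLA"), PL("Physics"))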

8 Size of Inverted Index (1)
100M docs, 10KB/doc, unique words/doc, 10B/word, 4B/docid
Q: Document collection size?
Q: Inverted index size?
Heaps’ law: vocabulary size = k · n^b, with 30 < k < 100 and 0.4 < b < 1 (n = total number of word occurrences); k = 50 and b = 0.5 are a good rule of thumb
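A back-of-the-envelope check from the numbers above (the per-document distinct-word count is left unspecified, so postings size is expressed in terms of it): collection size = 100M × 10KB = 1TB. At 10B/word, each document holds about 1,000 word occurrences, so n ≈ 100M × 1,000 = 10^11; with k = 50 and b = 0.5, the vocabulary is ≈ 50 × (10^11)^0.5 ≈ 1.6 × 10^7 terms, i.e., a dictionary of roughly 160MB at 10B/word. The postings lists take (distinct words per doc) × 100M entries × 4B/docid.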

9 Size of Inverted Index (2)
Q: Between the dictionary and the postings lists, which one is larger?
Q: Lengths of the postings lists?
Zipf’s law: a term’s collection frequency ∝ 1 / (its frequency rank)
Q: How do we construct an inverted index?

10 Inverted Index Construction
C: set of all documents (corpus)
DIC: dictionary of the inverted index
PL(w): postings list of word w

The construction loop (see the runnable version below):
    For each document d in C:
        Extract all distinct words in content(d) into W
        For each w in W:
            If w is not in DIC, add w to DIC
            Append id(d) to PL(w)

Q: What if the index is larger than main memory?
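The same loop as runnable Python, a minimal in-memory sketch; content(d) and doc_id(d) are hypothetical accessors standing in for whatever the corpus provides:

    DIC = {}  # word -> postings list PL(w), a list of docids

    def build_index(C):
        for d in C:
            W = set(content(d).lower().split())  # distinct words of d
            for w in W:
                # create PL(w) on first sight of w, then append the docid
                DIC.setdefault(w, []).append(doc_id(d))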

11 Inverted-Index Construction
For a large text corpus:
- Block sort-based construction
- Partition and merge (see the sketch below)
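A minimal sketch of the partition-and-merge idea (illustrative only: the corpus is split into blocks that fit in memory, a partial index is built per block, and the partial indexes are then merged word by word):

    def build_block_index(block):
        # block: list of (docid, text) pairs small enough to index in memory
        idx = {}
        for docid, text in block:
            for w in set(text.lower().split()):
                idx.setdefault(w, []).append(docid)
        return idx  # in practice, sorted by word and written to disk

    def merge_block_indexes(partials):
        # Postings lists stay sorted if blocks are processed in docid order
        merged = {}
        for idx in partials:
            for w, pl in idx.items():
                merged.setdefault(w, []).extend(pl)
        return merged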

12 Evaluation: Precision and Recall
Q: Are all matching documents what users want?
Basic idea: a model is good if it returns a document if and only if the document is “relevant”.
R: set of “relevant” documents
D: set of documents returned by the model
Precision = |D ∩ R| / |D| (what fraction of the returned documents are relevant)
Recall = |D ∩ R| / |R| (what fraction of the relevant documents are returned)
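The two measures as code, a minimal sketch over Python sets of docids:

    def precision(D, R):
        return len(D & R) / len(D)  # fraction of returned docs that are relevant

    def recall(D, R):
        return len(D & R) / len(R)  # fraction of relevant docs that are returned

    # e.g., precision({1, 2, 3}, {2, 3, 4}) == 2/3 and recall({1, 2, 3}, {2, 3, 4}) == 2/3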

13 Vector-Space Model
Main problem of the Boolean model: too many matching documents when the corpus is large. Any way to “rank” documents?
- Matrix interpretation of the Boolean model: a document-term matrix with a Boolean 0-or-1 value in each entry
- Basic idea: assign a real-valued weight to each matrix entry depending on the importance of the term (“the” vs. “UCLA”)
Q: How should we assign the weights?

14 TF-IDF Vector
A term t is important for document d if:
- t appears many times in d, or
- t is a “rare” term
TF (term frequency): number of occurrences of t in d
DF (document frequency): number of documents containing t
TF-IDF weight: w(t, d) = TF(t, d) × log(N / DF(t)), where N is the total number of documents
Q: How can we use it to compute query-document relevance?

15 Cosine Similarity
Represent both the query Q and the document D as TF-IDF vectors, and take the inner product of the two normalized vectors to compute their similarity:
    cos(Q, D) = (Q · D) / (|Q| × |D|)
Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents.
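A minimal end-to-end sketch of the computation, assuming raw counts for TF and log(N/DF) for IDF as defined on the previous slide:

    import math
    from collections import Counter

    def tfidf_vector(bag, df, N):
        # bag: Counter of term -> count; df: term -> # documents containing the term
        return {t: tf * math.log(N / df[t]) for t, tf in bag.items()}

    def cosine(q, d):
        # Inner product of the two vectors, normalized by both lengths
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        nq = math.sqrt(sum(w * w for w in q.values()))
        nd = math.sqrt(sum(w * w for w in d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    # e.g.: q = tfidf_vector(Counter("ucla university".split()), df, N)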

16 Cosine Similarity: Example
idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = idf(admission) = 1
- Q = (UCLA, university), D = (car, racing)
- Q = (UCLA, university), D = (UCLA, admission)
- Q = (UCLA, university), D = (university, admission)
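Working these out (assuming TF = 1 for every listed term, so Q = (10, 1) on the UCLA and university axes, with |Q| = √101 ≈ 10.05):
- D = (car, racing): no shared terms, so Q · D = 0 and cos = 0
- D = (UCLA, admission): Q · D = 10 × 10 = 100 and |D| = √101, so cos = 100/101 ≈ 0.99
- D = (university, admission): Q · D = 1 and |D| = √2, so cos = 1/(√101 × √2) ≈ 0.07
The rare, high-IDF term UCLA dominates the ranking, as intended.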

17 Finding High Cosine-Similarity Documents
Q: Under the vector-space model, does precision/recall make sense?
Q: How do we find the documents with the highest cosine similarity in the corpus?
Q: Any way to avoid a complete scan of the corpus?

18 Inverted Index for TF-IDF
Q · di = 0 if di has no query words, so consider only the documents that contain query words.
Inverted index: word → documents
- The lexicon stores the IDF of each word: Stanford → 1/3530, UCLA → 1/9860, MIT → 1/937
- Each postings-list entry stores a (docid, TF) pair: e.g., (D1, 2), (D14, 30), (D376, 8)
(TF may be normalized by document size)
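A term-at-a-time scoring sketch over such an index (LEXICON and POSTINGS are hypothetical structures: word → IDF and word → list of (docid, tf) pairs, respectively):

    def score(query_words, LEXICON, POSTINGS):
        scores = {}  # docid -> accumulated Q . d
        for w in set(query_words):
            if w not in POSTINGS:
                continue  # only documents containing query words are ever touched
            idf = LEXICON[w]
            for docid, tf in POSTINGS[w]:
                # query weight (query tf = 1, so just idf) times document weight (tf * idf)
                scores[docid] = scores.get(docid, 0.0) + idf * (tf * idf)
        return sorted(scores.items(), key=lambda kv: -kv[1])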

19 Phrase Queries
“Harvard University Boston” exactly as a phrase
Q: How can we support this query?
Two approaches:
- Biword index
- Positional index (see the sketch below)
Q: Pros and cons of each approach?
Rule of thumb: a positional index is 2x-4x larger than a docid-only index
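A minimal positional-index sketch for a two-word phrase (POS is a hypothetical map word → {docid → sorted list of positions}; the phrase matches where the second word appears exactly one position after the first):

    def phrase_match(w1, w2, POS):
        # Documents in which w1 is immediately followed by w2
        result = []
        common = POS.get(w1, {}).keys() & POS.get(w2, {}).keys()
        for docid in common:
            positions2 = set(POS[w2][docid])
            if any(p + 1 in positions2 for p in POS[w1][docid]):
                result.append(docid)
        return result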

20 Spell Correction
Q: What is the user’s intention for the query “Britnie Spears”? How can we find the correct spelling?
Given a user-typed word w, find its correct spelling c.
Probabilistic approach: find the c with the highest probability P(c|w). Q: How do we estimate it? (A minimal sketch follows.)
Bayes’ rule: P(c|w) = P(w|c)P(c)/P(w); since P(w) is the same for every candidate c, it suffices to maximize P(w|c)P(c)
Q: What are these probabilities and how can we estimate them?
Rule of thumb: 75% of misspellings are within edit distance 1, and most of the rest are within edit distance 2
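A minimal noisy-channel sketch in the spirit of this slide (assumptions: WORD_FREQ is a hypothetical term → count table standing in for the prior P(c), and the error model P(w|c) is collapsed to “prefer candidates within edit distance 1,” per the rule of thumb above):

    def edits1(w):
        # All strings at edit distance 1 from w: deletes, transposes, replaces, inserts
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts = [a + c + b for a, b in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(w, WORD_FREQ):
        # Prefer the word itself if known, then known words at edit distance 1
        candidates = ({w} & WORD_FREQ.keys()) or (edits1(w) & WORD_FREQ.keys()) or {w}
        return max(candidates, key=lambda c: WORD_FREQ.get(c, 0))  # ~ argmax of P(c)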

21 Summary
- Boolean model: inverted index
- Vector-space model: TF-IDF weights, cosine similarity; inverted index for the TF-IDF model
- Phrase queries
- Spell correction

