1 CS246 Basic Information Retrieval

2 Today's Topic
- Basic Information Retrieval (IR)
- Bag-of-words assumption
- Boolean model
- Inverted index
- Vector-space model
- Document-term matrix
- TF-IDF vectors and cosine similarity
- Phrase queries
- Spell correction

3 Information-Retrieval System
- Information source: existing text documents
- Keyword-based / natural-language queries
- The system returns the best-matching documents for a given query
- Challenge: both queries and data are "fuzzy"
  - Unstructured text and natural-language queries
  - What documents are good matches for a query?
  - Computers do not "understand" the documents or the queries
  - Developing a computable "model" is essential to implement this approach

4 Bag of Words: A Major Simplification
- Consider each document as a "bag of words"
  - "bag" vs. "set": ignore word ordering, but keep word counts
- Consider queries as bags of words as well
- A great oversimplification, but it works adequately in many cases
  - "John loves only Jane" vs. "Only John loves Jane"
  - The limitation still shows up in current search engines
- Still, how do we match documents and queries? (See the sketch below.)
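A minimal illustration of the bag-of-words view, using Python's collections.Counter as the "bag": the slide's two example sentences mean different things but map to the same bag.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and keep word counts (a 'bag')."""
    return Counter(text.lower().split())

d1 = bag_of_words("John loves only Jane")
d2 = bag_of_words("Only John loves Jane")
print(d1)        # Counter({'john': 1, 'loves': 1, 'only': 1, 'jane': 1})
print(d1 == d2)  # True: word order is lost under the bag-of-words assumption
```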

5 Boolean Model
- Return all documents that contain the words in the query
- The simplest model for information retrieval
- No notion of "ranking": a document is either a match or a non-match
- Q: How do we find and return the matching documents?
  - Basic algorithm?
  - Useful data structure?

6 Inverted Index
- Allows quick lookup of the ids of the documents that contain a particular word
- Q: How can we use it to answer the query "UCLA Physics"?

Figure: a lexicon/dictionary (DIC) maps each word to its postings list (PL) of document ids:
  Stanford -> PL(Stanford): 3, 8, 10, 13, 16, 20
  UCLA     -> PL(UCLA): 1, 2, 3, 9, 16, 18
  MIT      -> PL(MIT): 4, 5, 8, 10, 13, 19, 20, 22

7 Inverted Index
- Allows quick lookup of the ids of the documents that contain a particular word

Figure: the same dictionary (Stanford, UCLA, MIT, ...) and postings lists as on the previous slide.
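A sketch of how the "UCLA Physics" query can be answered with this structure: intersect the postings lists of the query words. The lists below mirror the figure; PL(physics) is made up for illustration.

```python
# Boolean AND retrieval over an inverted index.
index = {
    "stanford": [3, 8, 10, 13, 16, 20],
    "ucla":     [1, 2, 3, 9, 16, 18],
    "mit":      [4, 5, 8, 10, 13, 19, 20, 22],
    "physics":  [2, 9, 11, 18],          # assumed for illustration
}

def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists: O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def boolean_and(query, index):
    # Process the shortest lists first: a standard optimization.
    lists = sorted((index.get(w.lower(), []) for w in query.split()), key=len)
    result = lists[0]
    for pl in lists[1:]:
        result = intersect(result, pl)
    return result

print(boolean_and("UCLA Physics", index))   # [2, 9, 18]
```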

8 Size of Inverted Index (1)
- 100M docs, 10KB/doc, 1,000 unique words/doc, 10B/word, 4B/docid
- Q: Document collection size?
- Q: Inverted-index size?
- Heaps' law: vocabulary size = k * n^b, where n is the number of tokens in the collection, with 30 < k < 100 and 0.4 < b < 1
  - k = 50 and b = 0.5 are a good rule of thumb
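A back-of-the-envelope answer to the two questions, under the slide's numbers and the k = 50, b = 0.5 rule of thumb (10KB/doc at 10B/word is about 1,000 tokens per document):

$$
\begin{aligned}
\text{collection size} &= 10^8 \text{ docs} \times 10\,\text{KB/doc} = 1\,\text{TB}\\
\text{postings} &= 10^8 \text{ docs} \times 10^3 \text{ docids/doc} \times 4\,\text{B} = 400\,\text{GB}\\
n &= 10^8 \times 10^3 = 10^{11} \text{ tokens}\\
\text{vocabulary} &\approx 50 \times (10^{11})^{0.5} \approx 1.6 \times 10^7 \text{ terms}\\
\text{dictionary} &\approx 1.6 \times 10^7 \times 10\,\text{B} \approx 160\,\text{MB}
\end{aligned}
$$

So the inverted index is roughly 40% of the collection size, and it is dominated by the postings lists.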

9 Size of Inverted Index (2)
- Q: Between the dictionary and the postings lists, which one is larger?
- Q: How long are the postings lists?
  - Zipf's law: collection term frequency ∝ 1 / frequency rank
- Q: How do we construct an inverted index?
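Written as a formula: if $\mathrm{cf}_i$ is the collection frequency of the $i$-th most frequent term, Zipf's law says

$$ \mathrm{cf}_i \propto \frac{1}{i} $$

so the 2nd most frequent term occurs about half as often as the 1st, and the 100th about 1/100 as often: a handful of terms have very long postings lists, while most postings lists are short.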

10 Inverted-Index Construction

C: set of all documents (corpus)
DIC: dictionary of the inverted index
PL(w): postings list of word w

For each document d ∈ C:
    Extract all words in content(d) into W
    For each w ∈ W:
        If w ∉ DIC, add w to DIC
        Append id(d) to PL(w)

Q: What if the index is larger than main memory?
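A direct, in-memory Python rendering of this pseudocode (a sketch: it assumes the whole index fits in RAM, which is exactly the limitation the question raises):

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: dict mapping doc id -> document text.
    Returns a dict mapping word -> sorted postings list of doc ids."""
    index = defaultdict(list)                # DIC and PL(w) in one structure
    for doc_id in sorted(corpus):            # visit documents in id order
        words = set(corpus[doc_id].lower().split())   # unique words in content(d)
        for w in words:
            index[w].append(doc_id)          # postings stay sorted by doc id
    return dict(index)

corpus = {1: "UCLA physics", 2: "MIT physics", 3: "UCLA Stanford"}
print(build_index(corpus))
# e.g. {'ucla': [1, 3], 'physics': [1, 2], 'mit': [2], 'stanford': [3]}
# (key order may vary)
```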

11 Inverted-Index Construction
- For a large text corpus: blocked sort-based construction
  - Partition and merge (sketched below)
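A sketch of the partition-and-merge idea: build a sorted run of (word, docid) pairs per block, then merge the runs. A real implementation writes each run to disk and streams the merge; here everything stays in memory for brevity.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def build_block(block):
    """Build sorted (word, doc_id) pairs for one block of documents."""
    return sorted((w, d) for d, text in block.items()
                  for w in set(text.lower().split()))
    # in a real system this sorted run would be written to disk

def merge_blocks(runs):
    """Merge sorted runs into one index: word -> sorted postings list."""
    merged = heapq.merge(*runs)                  # streaming k-way merge
    return {w: sorted(d for _, d in grp)
            for w, grp in groupby(merged, key=itemgetter(0))}

block1 = {1: "UCLA physics", 2: "MIT physics"}
block2 = {3: "UCLA Stanford"}
print(merge_blocks([build_block(block1), build_block(block2)]))
# {'mit': [2], 'physics': [1, 2], 'stanford': [3], 'ucla': [1, 3]}
```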

12 Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if the document is "relevant"
- R: the set of "relevant" documents
- D: the set of documents returned by the model
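In terms of these two sets, the standard definitions are:

$$ \text{precision} = \frac{|R \cap D|}{|D|}, \qquad \text{recall} = \frac{|R \cap D|}{|R|} $$

Precision measures how much of what the model returned is relevant; recall measures how much of what is relevant the model returned.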

13 Vector-Space Model
- Main problem of the Boolean model: too many matching documents when the corpus is large
  - Any way to "rank" documents?
- Matrix interpretation of the Boolean model
  - Document-term matrix, with a Boolean 0-or-1 value in each entry
- Basic idea: assign real-valued weights to the matrix entries depending on the importance of the term
  - "the" vs. "UCLA" (see the toy matrix below)
- Q: How should we assign the weights?
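A toy Boolean document-term matrix for three hypothetical documents; every entry is 0 or 1, as in the Boolean model:

```python
# Rows = documents, columns = terms; 1 means "the document contains the term".
terms = ["the", "ucla", "physics", "car"]
matrix = [
    [1, 1, 1, 0],   # D1
    [1, 0, 1, 0],   # D2
    [1, 0, 0, 1],   # D3
]
# "the" appears in every document, so its column carries no ranking signal,
# while "ucla" singles out D1: the motivation for real-valued weights.
```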

14 TF-IDF Vector
- A term t is important for document d
  - if t appears many times in d, or
  - if t is a "rare" term
- TF (term frequency): the number of occurrences of t in d
- DF (document frequency): the number of documents containing t
- IDF (inverse document frequency): log(N/DF), where N is the total number of documents
- TF-IDF weight: TF × IDF = TF × log(N/DF)
- Q: How can we use it to compute query-document relevance?
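A minimal sketch of computing TF-IDF vectors for a small corpus, using raw TF counts and log(N/DF) (real systems often smooth or normalize these):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))        # document frequency
    idf = {t: math.log(N / df_t) for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["ucla", "physics", "physics"], ["mit", "physics"], ["ucla", "car"]]
for vec in tf_idf_vectors(docs):
    print(vec)
# 'physics' and 'ucla' (df=2) get weight tf*log(3/2); 'car' and 'mit' (df=1) get log(3)
```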

15 Cosine Similarity
- Represent both the query and the document as TF-IDF vectors
- Take the inner product of the two normalized vectors to compute their similarity
- Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents
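Written out, with $q_t$ and $d_t$ the TF-IDF weights of term $t$ in the query and the document:

$$ \cos(Q, D) = \frac{Q \cdot D}{|Q|\,|D|} = \frac{\sum_t q_t d_t}{\sqrt{\sum_t q_t^2}\,\sqrt{\sum_t d_t^2}} $$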

16 Cosine Similarity: Example
- idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1
- Q = (UCLA, university), D = (car, racing)
- Q = (UCLA, university), D = (UCLA, good)
- Q = (UCLA, university), D = (university, good)
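Working the three cases out, assuming TF = 1 for every listed term so each vector simply holds the idf weights (Q = (10, 1) over UCLA and university):

$$
\begin{aligned}
|Q| &= \sqrt{10^2 + 1^2} = \sqrt{101} \approx 10.05\\
\cos\bigl(Q, (\text{car, racing})\bigr) &= 0 \quad\text{(no terms in common)}\\
\cos\bigl(Q, (\text{UCLA, good})\bigr) &= \frac{10 \cdot 10}{\sqrt{101}\,\sqrt{10^2 + 0.1^2}} \approx \frac{100}{10.05 \times 10.0} \approx 0.995\\
\cos\bigl(Q, (\text{university, good})\bigr) &= \frac{1 \cdot 1}{\sqrt{101}\,\sqrt{1^2 + 0.1^2}} \approx \frac{1}{10.05 \times 1.005} \approx 0.099
\end{aligned}
$$

The document sharing the rare term UCLA ranks far above the one sharing only the common term university, which is exactly what the idf weighting is for.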

17 Finding High-Cosine-Similarity Documents
- Q: Under the vector-space model, do precision/recall make sense?
- Q: How do we find the documents with the highest cosine similarity in the corpus?
- Q: Any way to avoid a complete scan of the corpus?

18 Inverted Index for TF-IDF
- Q · d_i = 0 if d_i contains no query words, so consider only the documents that contain query words
- Inverted index: word → documents

Figure: the lexicon stores each word's IDF (e.g., Stanford: 1/3530, UCLA: 1/9860, MIT: 1/937), and each postings entry stores a (docid, TF) pair, e.g., (D1, 2), (D14, 30), (D376, 8). (TF may be normalized by document size.)
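A sketch of scoring only the documents that contain at least one query term, accumulating scores term by term. The index contents and norms below are made up, and |Q| is dropped since it does not affect the ranking:

```python
import heapq
from collections import defaultdict

# Lexicon stores each word's idf; postings store (doc_id, tf) pairs, as in the figure.
lexicon  = {"ucla": 0.8, "physics": 0.5}
postings = {"ucla": [(1, 2), (3, 1)], "physics": [(1, 1), (2, 4)]}
doc_norm = {1: 1.9, 2: 2.0, 3: 0.8}      # precomputed |D| for each document

def top_k(query_terms, k=2):
    """Score only documents that appear in some query term's postings list."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, tf in postings.get(t, []):
            # (query-side weight 1 * idf) x (document-side weight tf * idf)
            scores[doc_id] += lexicon[t] * (tf * lexicon[t])
    # divide by |D| to penalize long documents; |Q| is the same for all docs
    return heapq.nlargest(k, ((s / doc_norm[d], d) for d, s in scores.items()))

print(top_k(["ucla", "physics"]))   # [(0.805..., 1), (0.8, 3)]
```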

19 Phrase Queries
- "Harvard University Boston" exactly as a phrase
- Q: How can we support this query?
- Two approaches (a positional-index sketch follows below)
  - Biword index
  - Positional index
- Q: Pros and cons of each approach?
- Rule of thumb: a positional index is 2x to 4x larger than a docid-only index
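A sketch using a positional index, where each word maps to {doc_id: positions} and positions are word offsets within the document (the data is made up). A phrase matches in a document if its words appear at consecutive positions:

```python
# Positional index: word -> {doc_id: sorted positions}.
pos_index = {
    "harvard":    {1: [0, 7], 2: [3]},
    "university": {1: [1], 2: [0]},
    "boston":     {1: [2, 9], 2: [5]},
}

def phrase_match(words):
    """Return doc ids where the words occur at consecutive positions."""
    docs = set.intersection(*(set(pos_index[w]) for w in words))
    hits = []
    for d in docs:
        # the phrase starts at p if word i appears at position p + i for every i
        if any(all(p + i in pos_index[w][d] for i, w in enumerate(words))
               for p in pos_index[words[0]][d]):
            hits.append(d)
    return hits

print(phrase_match(["harvard", "university", "boston"]))   # [1]
```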

20 Spell Correction
- Q: What is the user's intention for the query "Britnie Spears"? How can we find the correct spelling?
- Given a user-typed word w, find its correct spelling c
- Probabilistic approach: find the c with the highest probability P(c|w)
  - Q: How can we estimate it?
  - Bayes' rule: P(c|w) = P(w|c)P(c)/P(w)
  - Q: What are these probabilities, and how can we estimate them?
- Rule of thumb: 75% of misspellings are within edit distance 1; 98% are within edit distance 2
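A sketch in the spirit of Norvig's classic corrector: P(c) is estimated from term counts in a corpus, P(w|c) is crudely approximated by preferring smaller edit distance (per the 75%/98% rule of thumb), and P(w) is the same for all candidates so it can be ignored. WORD_COUNTS is a tiny stand-in for real corpus statistics.

```python
import string
from collections import Counter

# Stand-in corpus statistics; a real system would count words in a large corpus.
WORD_COUNTS = Counter({"britney": 900, "spears": 800, "brine": 50})
LETTERS = string.ascii_lowercase

def edits1(w):
    """All strings within edit distance 1 of w (delete, swap, replace, insert)."""
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    return ({L + R[1:] for L, R in splits if R} |
            {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1} |
            {L + c + R[1:] for L, R in splits if R for c in LETTERS} |
            {L + c + R for L, R in splits for c in LETTERS})

def correct(w):
    """Among known candidates at the smallest edit distance, pick the most frequent."""
    for candidates in ([w], edits1(w),
                       {e2 for e1 in edits1(w) for e2 in edits1(e1)}):
        known = [c for c in candidates if c in WORD_COUNTS]
        if known:
            return max(known, key=WORD_COUNTS.__getitem__)
    return w

print(correct("britnie"))   # 'britney' (found at edit distance 2)
```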

21 Summary
- Boolean model
- Vector-space model: TF-IDF weights, cosine similarity
- Inverted index
  - For the Boolean model and for the TF-IDF model
- Phrase queries
- Spell correction

