
1 Searching the Web: Basic Information Retrieval

2 Who I Am
- Associate Professor at UCLA Computer Science
- Ph.D. from Stanford in Computer Science
- B.S. from SNU in Physics
- Got involved in early Web-search-engine projects
  - Particularly in the Web-crawling part
- Research on search engines and the social Web

3 Brief Overview of the Course
- Basic principles and theories behind Web-search engines
- Not much discussion of implementation or tools, but happy to discuss them if there are any questions
- Topics
  - Basic IR models, data structures, and algorithms
  - Topic-based models
    - Latent Semantic Indexing
    - Latent Dirichlet Allocation
  - Link-based ranking
  - Search-engine architecture
  - Issues of scale, Web crawling

4 Who Are You?
- Background
- Expectation
- Career goal

5 Today's Topic
- Basic Information Retrieval (IR)
- Three approaches to computer-based information management
- Bag-of-words assumption
- Boolean model
  - String-matching algorithm
  - Inverted index
- Vector-space model
  - Document-term matrix
  - TF-IDF vector and cosine similarity
- Phrase queries
- Spell correction

6 Computer-Based Information Management
- Basic problem
  - How can we use computers to help humans store, organize, and retrieve information?
  - What approaches have been taken, and which have been successful?

7 Three Major Approaches
- Database approach
- Expert-system approach
- Information-retrieval approach

8 Database Approach
- Information is stored in a highly structured way
  - Data is stored in relational tables as tuples
- Simple data model and query language
  - Relational model and the SQL query language
- Clear interpretation of data and queries
  - No ambition to be "intelligent" like humans
- Mainly focuses on highly efficient systems
  - "Performance, performance, performance"
- It has been hugely successful
  - All major businesses use an RDB system
  - >$20B market
- What are the pros and cons?

9 Expert-System Approach
- Information is stored as a set of logical predicates
  - Bird(x), Cat(x), Fly(x), …
- Given a query, the system infers the answer through logical inference
  - Bird(Ostrich)
  - Fly(Ostrich)?
- Popular approach in the 80s, but has not been successful for general information retrieval
- What are the pros and cons?

10 Information-Retrieval Approach
- Uses existing text documents as the information source
  - No special structuring or database construction required
- Text-based query language
  - Keyword-based query or natural-language query
- The system returns the best-matching documents given the query
- Had limited appeal until the Web became popular
- What are the pros and cons?

11 Main Challenge of the IR Approach
- Relational model
  - Interpretation of query and data is straightforward
  - Student(name, birthdate, major, GPA)
  - SELECT * FROM Student WHERE GPA > 3.0
- Information retrieval
  - Both queries and data are "fuzzy"
  - Unstructured text and "natural language" queries
  - What documents are good matches for a query?
- Computers do not "understand" the documents or the queries
- Developing a "model" that a computer can execute is essential to implementing this approach

12 Bag of Words: Major Simplification
- Consider each document as a "bag of words"
  - "bag" vs "set": ignore word ordering, but keep word counts
- Consider queries as bags of words as well
- Great oversimplification, but works adequately in many cases
  - "John loves only Jane" vs "Only John loves Jane"
  - The limitation still shows up in current search engines
- Still, how do we match documents and queries?
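As a quick illustration of the bag-of-words simplification (my own sketch, not from the deck), the two sentences on the slide produce exactly the same bag:

from collections import Counter

bag1 = Counter("john loves only jane".split())
bag2 = Counter("only john loves jane".split())
print(bag1 == bag2)                           # True: word order is lost, so the two sentences look identical
print(Counter("jane loves jane".split()))     # counts are kept: Counter({'jane': 2, 'loves': 1})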

13 Boolean Model
- Return all documents that contain the words in the query
- Simplest model for information retrieval
  - No notion of "ranking": a document is either a match or a non-match
- Q: How do we find and return matching documents?
  - Basic algorithm?
  - Useful data structure?

14 String-Matching Algorithm
- Given the string "abcde", find which documents contain it
- Q: What is the computational complexity of naïve matching of a string of length m over a document of length n?
- Q: Is there a more efficient way?

15 String Matching Example (1)
  m: 0123456789
  D: ABCABABABC  (doc)
  W: ABABC       (word)
  i: 01234

16 String Matching Example (2)
- Two cursors: m=2, i=1
  - m: beginning of the matching part in D
  - i: the location of the matching character in W

17 String Matching Example (2)
  m: 0123456789
  D: ABCABABABC  (doc)
  W: ABABC       (word)
  i: 01234
- Mismatch at m=0, i=2
- Q: What can we do? Start again at m=1, i=0?

18 String Matching Example (3)
  m: 0123456789
  D: ABCABABABC  (doc)
  W: ABABC       (word)
  i: 01234
- Mismatch at m=3, i=4
- Q: What can we do? Start at m=7, i=0?

19 Algorithm KMP
- If no substring of W is self-repeated, we can slide W "completely" past the matched portion
  - m <- m + i
  - i <- 0
- If a suffix of the matched part is equal to a prefix of W, we have to slide back a little bit
  - m <- m + i - x   // x is how much to slide back
  - i <- x
- The exact value of x depends on the length of the prefix of W matching the suffix of the matched part
- T[0…m]: "slide-back" table recording the x values

20 Algorithm KMP
W: string to look for
D: document
T: "slide-back" table in case of mismatch

while (m + i) < |D| do:
    if W[i] = D[m + i]:
        let i = i + 1
        if i = |W|, return m
    otherwise:
        let m = m + i - T[i]
        if i > 0, let i = T[i]
return no-match

21 Algorithm KMP: T[i] Table
- W: ABCDABD (word)
  i: 0123456
- m <- m + i - T[i]
- T[0] = -1, T[1] = 0
- Q: What should T[i] be for i = 2…6?
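Below is a minimal, runnable Python sketch of slides 19-21 (my own code, not from the deck). The names W, D, T, m, and i follow the slides; the table is built with the standard prefix-function computation.

def build_slide_back_table(W):
    # T[i] = length of the longest proper prefix of W[:i] that is also a suffix of W[:i];
    # T[0] = -1, following the slides' convention.
    pi = [0] * len(W)              # classic prefix function over W
    k = 0
    for q in range(1, len(W)):
        while k > 0 and W[k] != W[q]:
            k = pi[k - 1]
        if W[k] == W[q]:
            k += 1
        pi[q] = k
    return [-1] + pi[:-1]

def kmp_search(D, W):
    # Direct rendering of slide 20's pseudocode: return the first match position, or -1.
    T = build_slide_back_table(W)
    m = i = 0
    while m + i < len(D):
        if W[i] == D[m + i]:
            i += 1
            if i == len(W):
                return m
        else:
            m = m + i - T[i]       # slide W forward past the mismatched portion
            i = T[i] if i > 0 else 0
    return -1                      # no match

print(build_slide_back_table("ABCDABD"))   # [-1, 0, 0, 0, 0, 1, 2]  (slide 21's word)
print(kmp_search("ABCABABABC", "ABABC"))   # 5  (the D and W from slides 15-18)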

22 Data Structure for Quick Document Matching
- Boolean model
  - Find all documents that contain the keywords in Q
- Q: What data structure will be useful to do this fast?

23 Inverted Index
- Allows quick lookup of the document ids containing a particular word
- Q: How can we use this to answer "UCLA Physics"?
- [Diagram: a lexicon/dictionary (DIC) holding words such as Stanford, UCLA, MIT, …;
  each word points to its postings list PL(word) of sorted document ids,
  e.g. 3 8 10 13 16 20 / 1 2 3 9 16 18 / 4 5 8 10 13 19 20 22]
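A small sketch (my own code; both postings lists below are hypothetical) of how the index answers a conjunctive query such as "UCLA Physics": intersect the postings lists of the query words.

index = {
    "ucla":    [1, 2, 3, 9, 16, 18],     # sorted docids (illustrative)
    "physics": [2, 5, 9, 14, 18, 22],    # sorted docids (illustrative)
}

def boolean_and(index, terms):
    # Boolean model: return only the documents that contain every query term.
    result = None
    for term in terms:
        postings = set(index.get(term, []))
        result = postings if result is None else result & postings
    return sorted(result) if result else []

print(boolean_and(index, ["ucla", "physics"]))   # [2, 9, 18]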

24 Inverted Index
- Allows quick lookup of the document ids containing a particular word
- [Diagram: the same lexicon/dictionary and postings lists as on the previous slide]

25 Size of Inverted Index (1)
- 100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid
- Q: Document collection size?
- Q: Inverted-index size?
- Heaps' law: vocabulary size = k·n^b, where n is the number of tokens, with 30 < k < 100 and 0.4 < b < 1
  - k = 50 and b = 0.5 are a good rule of thumb
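One way to work the numbers on slide 25 (my own back-of-envelope arithmetic, assuming each unique word per document contributes one 4-byte docid posting, and applying Heaps' law to the total token count):

docs            = 100_000_000        # 100M documents
doc_size        = 10_000             # 10 KB per document
unique_per_doc  = 1_000              # unique words per document
bytes_per_word  = 10                 # dictionary entry size per word
bytes_per_docid = 4

collection = docs * doc_size                           # ~1 TB of raw text
postings   = docs * unique_per_doc * bytes_per_docid   # ~400 GB of postings

# Heaps' law with k = 50, b = 0.5 over the total number of word occurrences
tokens     = docs * (doc_size // bytes_per_word)       # ~1e11 tokens
vocabulary = int(50 * tokens ** 0.5)                   # ~16M distinct words
dictionary = vocabulary * bytes_per_word               # ~160 MB

print(f"collection ~{collection/1e12:.1f} TB, postings ~{postings/1e9:.0f} GB, "
      f"dictionary ~{dictionary/1e6:.0f} MB")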

26 Size of Inverted Index (2)
- Q: Between the dictionary and the postings lists, which one is larger?
- Q: Lengths of the postings lists?
  - Zipf's law: collection term frequency ∝ 1 / frequency rank
- Q: How do we construct an inverted index?

27 Inverted Index Construction
C: set of all documents (corpus)
DIC: dictionary of the inverted index
PL(w): postings list of word w

1: For each document d ∈ C:
2:     Extract all words in content(d) into W
3:     For each w ∈ W:
4:         If w ∉ DIC, then add w to DIC
5:         Append id(d) to PL(w)

Q: What if the index is larger than main memory?
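A direct Python rendering of the construction loop above (a sketch of my own; the tiny corpus is made up):

from collections import defaultdict

def build_inverted_index(corpus):
    # corpus: dict mapping doc id -> document text. Returns word -> list of doc ids.
    index = defaultdict(list)                # DIC and PL(w) together: word -> postings list
    for doc_id, text in corpus.items():      # 1: for each document d in C
        words = set(text.lower().split())    # 2: extract all words in content(d)
        for w in sorted(words):              # 3: for each w in W
            index[w].append(doc_id)          # 4-5: add w to DIC if needed, append id(d) to PL(w)
    return index

corpus = {1: "UCLA physics department", 2: "Stanford computer science", 3: "UCLA computer science"}
print(build_inverted_index(corpus)["ucla"])  # [1, 3]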

28 Inverted-Index Construction
- For a large text corpus
  - Block-sort-based construction
  - Partition and merge

29 Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if it is "relevant"
  - R: set of "relevant" documents
  - D: set of documents returned by the model
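The slide names the sets R and D but stops short of the formulas; the standard definitions are precision = |R ∩ D| / |D| and recall = |R ∩ D| / |R|. A tiny sketch with made-up sets:

def precision_recall(relevant, returned):
    # precision: fraction of returned documents that are relevant
    # recall:    fraction of relevant documents that were returned
    hit = len(relevant & returned)
    return hit / len(returned), hit / len(relevant)

R = {1, 2, 3, 9, 16}           # hypothetical relevant documents
D = {2, 3, 4, 16, 18, 22}      # hypothetical documents returned by the model
print(precision_recall(R, D))  # (0.5, 0.6)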

30 Vector-Space Model
- Main problem with the Boolean model
  - Too many matching documents when the corpus is large
  - Any way to "rank" documents?
- Matrix interpretation of the Boolean model
  - Document-term matrix
  - Boolean 0 or 1 value for each entry
- Basic idea
  - Assign a real-valued weight to each matrix entry depending on the importance of the term
  - "the" vs "UCLA"
- Q: How should we assign the weights?

31 TF-IDF Vector
- A term t is important for document d
  - if t appears many times in d, or
  - if t is a "rare" term
- TF: term frequency
  - number of occurrences of t in d
- DF: document frequency
  - number of documents containing t
- IDF: inverse document frequency = log(N/DF), where N is the total number of documents
- TF-IDF weighting
  - TF × IDF = TF × log(N/DF)
- Q: How can we use this to compute query-document relevance?
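A small sketch of the weighting (my own code; the document-frequency table and corpus size are made up):

import math
from collections import Counter

def tf_idf_vector(doc_words, doc_freq, n_docs):
    # doc_words: list of tokens in one document; doc_freq: term -> number of docs containing it
    tf = Counter(doc_words)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf if t in doc_freq}

doc_freq = {"ucla": 1_000, "university": 100_000, "the": 9_000_000}   # hypothetical counts
print(tf_idf_vector("the ucla university the the".split(), doc_freq, n_docs=10_000_000))
# the rare term "ucla" gets by far the largest weight despite appearing only once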

32 Cosine Similarity
- Represent both the query and the document as TF-IDF vectors
- Take the inner product of the two normalized vectors to compute their similarity:
  sim(Q, D) = (Q · D) / (|Q| |D|)
- Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents

33 Cosine Similarity: Example
- idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1
- Q = (UCLA, university), D = (car, racing)
- Q = (UCLA, university), D = (UCLA, good)
- Q = (UCLA, university), D = (university, good)
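Working the example numerically (my own sketch; it assumes TF = 1 for each listed term, so every vector component is just the term's idf):

import math

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def cosine(q_terms, d_terms):
    q = {t: idf[t] for t in q_terms}
    d = {t: idf[t] for t in d_terms}
    dot = sum(q[t] * d[t] for t in q if t in d)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm

Q = ("UCLA", "university")
for D in [("car", "racing"), ("UCLA", "good"), ("university", "good")]:
    print(D, round(cosine(Q, D), 3))
# ('car', 'racing') 0.0   ('UCLA', 'good') 0.995   ('university', 'good') 0.099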

34 Finding High Cosine-Similarity Documents
- Q: Under the vector-space model, do precision/recall make sense?
- Q: How do we find the documents with the highest cosine similarity in the corpus?
- Q: Any way to avoid a complete scan of the corpus?

35 Inverted Index for TF-IDF
- Q · d_i = 0 if d_i has no query words
  - Consider only the documents with query words
- Inverted index: word → document
- [Diagram: the lexicon stores each word with its IDF (Stanford: 1/3530, UCLA: 1/9860, MIT: 1/937, …);
  each posting stores a docid and the word's TF in that document (D1: 2, D14: 30, D376: 8);
  TF may be normalized by document size]
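A sketch of the idea (my own code with a hypothetical index): accumulate TF·IDF contributions only for documents that appear in the query terms' postings lists. Normalization by |D| is omitted for brevity, so this computes the unnormalized dot product rather than the full cosine.

from collections import defaultdict

# Hypothetical index: term -> (idf, {docid: tf})
index = {
    "ucla":    (9.2, {"D1": 2, "D14": 30, "D376": 8}),
    "physics": (6.5, {"D14": 4, "D200": 7}),
}

def score(query_terms):
    scores = defaultdict(float)
    for term in query_terms:
        if term not in index:
            continue
        idf, postings = index[term]
        for doc_id, tf in postings.items():    # only documents containing the term are touched
            scores[doc_id] += tf * idf         # accumulate this term's contribution
    return sorted(scores.items(), key=lambda x: -x[1])

print(score(["ucla", "physics"]))              # D14 ranks first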

36 Phrase Queries
- "Harvard University Boston" exactly as a phrase
- Q: How can we support this query?
- Two approaches
  - Biword index
  - Positional index
- Q: Pros and cons of each approach?
- Rule of thumb: 2x-4x size increase for a positional index compared to a docid-only index
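A minimal sketch of phrase matching with a positional index (my own illustration with a made-up index); the biword-index alternative would instead index adjacent pairs such as "harvard university" directly.

positional_index = {                        # term -> {docid: [positions within the document]}
    "harvard":    {1: [0, 17], 2: [3]},
    "university": {1: [1, 9],  2: [10]},
    "boston":     {1: [2],     2: [11]},
}

def phrase_match(index, terms):
    # Return doc ids where the terms occur at consecutive positions.
    docs = set.intersection(*(set(index[t]) for t in terms))
    result = []
    for d in docs:
        first = index[terms[0]][d]
        # the phrase starts at p if term k appears at position p + k for every k
        if any(all(p + k in index[t][d] for k, t in enumerate(terms)) for p in first):
            result.append(d)
    return sorted(result)

print(phrase_match(positional_index, ["harvard", "university", "boston"]))   # [1]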

37 Spell Correction
- Q: What may the user have truly intended by the query "Britnie Spears"? How can we find the correct spelling?
- Given a user-typed word w, find its correct spelling c
- Probabilistic approach: find the c with the highest probability P(c|w)
  - Q: How do we estimate it?
  - Bayes' rule: P(c|w) = P(w|c) P(c) / P(w)
  - Q: What are these probabilities, and how can we estimate them?
- Rule of thumb: about 3/4 of misspellings are within edit distance 1; 98% are within edit distance 2
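A sketch in the spirit of the noisy-channel idea on this slide (my own code, loosely following Peter Norvig's well-known spell corrector). WORD_FREQ is a hypothetical unigram count table standing in for P(c); P(w|c) is crudely approximated by preferring candidates at smaller edit distance, with all candidates at the same distance treated as equally likely.

WORD_FREQ = {"spears": 120_000, "speaks": 90_000, "britney": 80_000, "brittany": 30_000}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(w):
    # All strings within edit distance 1 of w (deletes, transposes, replaces, inserts).
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(w):
    # Pick the known candidate with the highest count, preferring smaller edit distance.
    for candidates in ([w], edits1(w), {e2 for e1 in edits1(w) for e2 in edits1(e1)}):
        known = [c for c in candidates if c in WORD_FREQ]
        if known:
            return max(known, key=WORD_FREQ.get)
    return w

print(correct("britnie"))   # 'britney' (found at edit distance 2)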

38 Summary
- Boolean model
- Vector-space model
  - TF-IDF weight, cosine similarity
- String-matching algorithm
  - Algorithm KMP
- Inverted index
  - Boolean model
  - TF-IDF model
- Phrase queries
- Spell correction

