1 The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

2 How do we best convert documents to their index terms? How do we make acquired documents searchable?

3 • The simplest approach is find, which requires no text transformation
• Useful in user applications, but not in search (why?)
• An optional transformation handled during the find operation: case sensitivity

4 • English documents are statistically predictable (see the sketch below):
• The two most frequently occurring words, "the" and "of", account for about 10% of all word occurrences
• The six most frequently occurring words account for about 20% of word occurrences
• The fifty most frequently occurring words account for about 50% of word occurrences
• Of all the unique words in a (large) document collection, approximately 50% occur only once
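
A minimal Python sketch of how one might check these proportions on a local corpus; the file name corpus.txt and the crude regex tokenizer are illustrative assumptions, not part of the slides:

# Sketch: what share of word occurrences do the top-N words account for?
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

total = len(words)
counts = Counter(words)
freqs = sorted(counts.values(), reverse=True)

for top_n in (2, 6, 50):
    share = sum(freqs[:top_n]) / total
    print(f"top {top_n:>2} words cover {share:.1%} of all occurrences")

singletons = sum(1 for c in freqs if c == 1)
print(f"{singletons / len(counts):.1%} of unique words occur only once")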

5 • Zipf's law:
• Rank words in order of decreasing frequency
• The rank r of a word times its frequency f is approximately equal to a constant k: r × f ≈ k
• In other words, the frequency of the rth most common word is inversely proportional to r
(George Kingsley Zipf, 1902-1950)

6 • The probability of occurrence P_r of a word is the word's frequency divided by the total number of words in the document
• Revise Zipf's law as: r × P_r = c, where for English c ≈ 0.1
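
A companion sketch, under the same assumptions as above (a plain-text corpus.txt and a crude tokenizer), that ranks words by frequency and prints r × f and r × P_r so the roughly constant products are visible:

# Sketch: check Zipf's law, r * f ≈ k and r * Pr ≈ c, on the top-ranked words.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

total = len(words)
for r, (word, freq) in enumerate(Counter(words).most_common(10), start=1):
    print(f"r={r:>2}  {word:<12}  r*f={r * freq:>9}  r*Pr={r * freq / total:.4f}")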

7 • Verify Zipf's law using the AP89 dataset:
• A collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

  Total documents:               84,678
  Total word occurrences:        39,749,179
  Vocabulary size:               198,763
  Words occurring > 1000 times:  4,169
  Words occurring once:          70,064

8 • Top 50 words of AP89 [table omitted]

9 • As the corpus grows, so does vocabulary size
• Fewer new words appear once the corpus is already large
• The relationship between corpus size n and vocabulary size v was defined empirically by Heaps (1978) and is called Heaps' law: v = k × n^β
• The constants k and β vary; typically 10 ≤ k ≤ 100 and β ≈ 0.5
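
A sketch of how one might trace vocabulary growth and estimate the two constants; corpus.txt and the tokenizer are again assumptions, and the log-log line fit is just one simple way to recover k and β:

# Sketch: sample (corpus size n, vocabulary size v) while scanning a corpus,
# then fit Heaps' law v = k * n^beta with a line fit in log-log space.
import re
import numpy as np

vocab, points = set(), []
with open("corpus.txt", encoding="utf-8") as f:
    for n, word in enumerate(re.findall(r"[a-z']+", f.read().lower()), start=1):
        vocab.add(word)
        if n % 10_000 == 0:               # sample the growth curve every 10k words
            points.append((n, len(vocab)))

log_n = np.log([n for n, v in points])
log_v = np.log([v for n, v in points])
beta, log_k = np.polyfit(log_n, log_v, 1)  # log v = beta * log n + log k
print(f"k ≈ {np.exp(log_k):.1f}, beta ≈ {beta:.2f}")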

10 Note the values of k and β [figure omitted]

11 Web pages crawled from .gov in early 2004

12 • Word occurrence statistics can be used to estimate the result set size of a user query
• Aside from stop words, how many pages contain all of the query terms?
▪ To figure this out, first assume that words occur independently of one another
▪ Also assume that the search engine knows N, the number of documents it indexes

13 • Given three query terms a, b, and c
• The probability of a document containing all three is the product of the individual probabilities for each query term: P(a ∩ b ∩ c) = P(a) × P(b) × P(c)
• P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring

14 • We assume the search engine knows the number of documents that each word occurs in
• Call these n_a, n_b, and n_c
▪ Note that the book uses f_a, f_b, and f_c
• Estimate the individual query term probabilities as: P(a) = n_a / N, P(b) = n_b / N, P(c) = n_c / N

15 • Given P(a), P(b), and P(c), we estimate the result set size as:
n_abc = N × (n_a / N) × (n_b / N) × (n_c / N)
n_abc = (n_a × n_b × n_c) / N²
• This estimate sounds reasonable, but it falls short because of our query term independence assumption (see the sketch below)
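
A minimal sketch of this independence-based estimate; N matches GOV2's document count from the next slide, while the three document frequencies are hypothetical values chosen only for illustration:

# Sketch: independence-based estimate n_abc = N * (n_a/N) * (n_b/N) * (n_c/N).
def estimate_intersection(N, n_a, n_b, n_c):
    # Expected number of documents containing all three terms,
    # if the terms occurred independently of one another.
    return round(N * (n_a / N) * (n_b / N) * (n_c / N))

# Hypothetical document frequencies in a 25,205,179-document index:
print(estimate_intersection(25_205_179, 1_000_000, 500_000, 120_000))  # -> 94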

16 • Using the GOV2 dataset, with N = 25,205,179
• Results are poor because of the query term independence assumption
• Word co-occurrence data could be used instead...

17 • Extrapolate based on the size of the current result set:
• The current result set is the subset of documents that have been ranked thus far
• Let C be the number of documents found thus far that contain all the query words
• Let s be the proportion of the total documents ranked (use the least frequently occurring term)
• Estimate the result set size via n_abc = C / s

18 • Given the example query: tropical fish aquarium
• The least frequently occurring term is aquarium, which occurs in 26,480 documents
• After ranking 3,000 documents, 258 documents contain all three query terms
• Thus, n_abc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277
• After processing 20% of the documents, the estimate is 1,778
▪ This overshoots the actual value of 1,529
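
A sketch reproducing the arithmetic above, using only the slide's own numbers:

# Sketch: extrapolation estimate n_abc = C / s.
def extrapolate(C, ranked, df_least_frequent):
    s = ranked / df_least_frequent    # proportion of candidate documents ranked so far
    return round(C / s)

# 258 matches after ranking 3,000 of the 26,480 documents containing "aquarium":
print(extrapolate(258, 3_000, 26_480))  # -> 2277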

19 • Read and study Chapter 4
• Do Exercises 4.1, 4.2, and 4.3
• Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4

