The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval (David Goldschmidt, Ph.D.)

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval. David Goldschmidt, Ph.D. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010.

How do we best convert documents to their index terms? How do we make acquired documents searchable?

- The simplest approach is find, which requires no text transformation
- Useful in user applications, but not in search (why?)
- An optional transformation handled during the find operation: case sensitivity
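To make the case-sensitivity point concrete, here is a minimal sketch of a find-style search in Python; the document list and function name are made up for illustration, and the only optional transformation is case folding.

```python
# Minimal "find" search: raw substring matching with optional case folding.
documents = [
    "Tropical fish are popular aquarium pets.",
    "The Fish and Wildlife Service protects habitats.",
]

def find(query, docs, case_sensitive=False):
    """Return the documents containing the query as a literal substring."""
    if not case_sensitive:
        query = query.lower()
        return [d for d in docs if query in d.lower()]
    return [d for d in docs if query in d]

print(find("fish", documents))                       # matches both documents
print(find("fish", documents, case_sensitive=True))  # matches only the first
```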

English documents are predictable:
- The top two most frequently occurring words are "the" and "of" (about 10% of word occurrences)
- The top six most frequently occurring words account for about 20% of word occurrences
- The top fifty most frequently occurring words account for about 50% of word occurrences
- Given all the unique words in a (large) document, approximately 50% occur only once
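These proportions are easy to check empirically on any plain-text corpus by counting word frequencies. The sketch below is in Python; the simple regular-expression tokenizer and the corpus file name are assumptions, not something specified on the slides.

```python
# Sketch: empirical word-occurrence statistics for a plain-text corpus.
import re
from collections import Counter

def word_stats(path):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z0-9]+", f.read().lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = counts.most_common()        # [(word, freq), ...] by decreasing frequency

    def coverage(k):                     # share of all occurrences in the top k words
        return sum(f for _, f in ranked[:k]) / total

    singletons = sum(1 for f in counts.values() if f == 1)
    print(f"top 2 words cover  {coverage(2):.1%} of occurrences")
    print(f"top 6 words cover  {coverage(6):.1%} of occurrences")
    print(f"top 50 words cover {coverage(50):.1%} of occurrences")
    print(f"{singletons / len(counts):.1%} of unique words occur exactly once")

word_stats("ap89.txt")   # hypothetical corpus file
```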

Zipf's law (George Kingsley Zipf):
- Rank words in order of decreasing frequency
- The rank (r) of a word times its frequency (f) is approximately equal to a constant (k):  r × f ≈ k
- In other words, the frequency of the r-th most common word is inversely proportional to r

- The probability of occurrence (P_r) of a word is the word frequency divided by the total number of words in the document
- Revise Zipf's law as:  r × P_r ≈ c, where for English c ≈ 0.1
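A rough check of Zipf's law on a corpus can be sketched as follows (Python; the tokenizer and the corpus file name are assumptions). The point is simply that r × P_r stays near a constant c, roughly 0.1 for English, across a wide range of ranks.

```python
# Sketch: check that r * P_r stays roughly constant (Zipf's law).
import re
from collections import Counter

def zipf_check(path, ranks=(1, 5, 10, 50, 100, 1000)):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z0-9]+", f.read().lower())
    total = len(tokens)
    ranked = Counter(tokens).most_common()   # decreasing frequency order
    for r in ranks:
        word, freq = ranked[r - 1]
        p_r = freq / total                   # probability of occurrence of the rank-r word
        print(f"r={r:5d}  word={word:12s}  r*P_r={r * p_r:.3f}")

zipf_check("ap89.txt")   # hypothetical corpus file; expect values near 0.1
```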

Verify Zipf's law using the AP89 dataset, a collection of Associated Press (AP) news stories from 1989:

  Total documents:               84,678
  Total word occurrences:        39,749,179
  Vocabulary size:               198,763
  Words occurring > 1000 times:  4,169
  Words occurring once:          70,064

The top 50 words of AP89, ranked by frequency

- As the corpus grows, so does the vocabulary size
- Fewer new words appear when the corpus is already large
- The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps' law:  v = k × n^β
- The constants k and β vary
  ▪ Typically 10 ≤ k ≤ 100 and β ≈ 0.5
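Since log v = log k + β log n, the constants can be fitted with a least-squares line in log space while scanning the corpus. The sketch below is a Python illustration under the same assumptions as before (simple tokenizer, hypothetical file name).

```python
# Sketch: measure vocabulary growth and fit Heaps' law v = k * n^beta.
import math
import re

def heaps_fit(path, sample_every=10000):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z0-9]+", f.read().lower())
    seen, points = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % sample_every == 0:
            points.append((n, len(seen)))    # (corpus size, vocabulary size)

    # Least-squares line through (log n, log v): log v = log k + beta * log n
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    k = math.exp(y_bar - beta * x_bar)
    return k, beta

k, beta = heaps_fit("ap89.txt")              # hypothetical corpus file
print(f"Heaps' law fit: v = {k:.1f} * n^{beta:.2f}")
```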

(Chart: vocabulary growth versus corpus size, with the fitted Heaps' law curve; note the values of k and β)

Web pages crawled from the .gov domain in early 2004 (the GOV2 collection)

- Word occurrence statistics can be used to estimate the result set size of a user query
- Aside from stop words, how many pages contain all of the query terms?
  ▪ To figure this out, first assume that words occur independently of one another
  ▪ Also assume that the search engine knows N, the number of documents it indexes

- Given three query terms a, b, and c
- The probability of a document containing all three is the product of the individual probabilities for each query term:
  P(a ∩ b ∩ c) = P(a) × P(b) × P(c)
- P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring

- We assume the search engine knows the number of documents that each word occurs in
- Call these n_a, n_b, and n_c
  ▪ Note that the book uses f_a, f_b, and f_c
- Estimate the individual query term probabilities:
  P(a) = n_a / N    P(b) = n_b / N    P(c) = n_c / N

- Given P(a), P(b), and P(c), we estimate the result set size as:
  n_abc = N × (n_a / N) × (n_b / N) × (n_c / N)
        = (n_a × n_b × n_c) / N²
- This estimation sounds good, but is lacking due to our query term independence assumption
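Putting the last three slides together, a minimal sketch of the independence-based estimate looks like the following (Python; the collection size is the GOV2 value quoted on the next slide, and the per-term document frequencies are made-up round numbers).

```python
# Sketch: estimate result-set size assuming query terms occur independently.
# N        : total number of documents in the index
# doc_freqs: number of documents each query term occurs in (n_a, n_b, ...)
def estimate_result_size(N, doc_freqs):
    estimate = N
    for n_term in doc_freqs:
        estimate *= n_term / N      # multiply by P(term) = n_term / N
    return round(estimate)

# Hypothetical document frequencies for a three-term query
print(estimate_result_size(25205179, [1000000, 500000, 100000]))   # about 79
```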

- Using the GOV2 dataset with N = 25,205,179
- Poor results, because of the query term independence assumption
- Could use word co-occurrence data...

- Extrapolate based on the size of the current result set:
  ▪ The current result set is the subset of documents that have been ranked thus far
- Let C be the number of documents found thus far containing all the query words
- Let s be the proportion of the total documents ranked (use the least frequently occurring term)
- Estimate the result set size via  n_abc = C / s
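A sketch of this extrapolation in Python follows; the variable names are mine, and the usage numbers are made up (100 matching documents found after ranking 1,000 of the 10,000 documents in the shortest inverted list).

```python
# Sketch: extrapolate result-set size from the documents ranked so far.
# C            : documents found so far that contain all of the query terms
# ranked_so_far: documents processed so far from the shortest inverted list
# list_length  : total length of that list (the least frequent term)
def extrapolate_result_size(C, ranked_so_far, list_length):
    s = ranked_so_far / list_length     # proportion of documents ranked
    return round(C / s)

print(extrapolate_result_size(100, 1000, 10000))   # s = 0.1, estimate = 1000
```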

- Given the example query: tropical fish aquarium
- The least frequently occurring term is aquarium (which occurs in 26,480 documents)
- After ranking 3,000 documents, 258 documents contain all three query terms
- Thus, n_abc = C / s = 258 / (3,000 ÷ 26,480) = 2,277
- After processing 20% of the documents, the estimate is 1,778
  ▪ which overshoots the actual value of 1,529
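As a quick arithmetic check of the numbers above (a throwaway Python snippet using the values from the slide):

```python
# Verify the slide's estimate: C = 258 matches after ranking 3,000 of 26,480 documents
C, ranked, list_len = 258, 3000, 26480
s = ranked / list_len        # about 0.113
print(round(C / s))          # 2277, matching the slide's estimate of 2,277
```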

- Read and study Chapter 4
- Do Exercises 4.1, 4.2, and 4.3
- Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4