Information Retrieval and Web Search: Text Properties
(Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan at U. Massachusetts, Amherst.)
Instructor: Rada Mihalcea
Class web page:

Slide 1: Last Time
Text preprocessing:
- Required for processing the text collection prior to indexing
- Required for processing the query prior to retrieval
Main steps:
- Tokenization
- Stopword removal
- Lemmatization / stemming
  - Porter stemmer
  - New stemming algorithm?
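To make the refresher concrete, here is a minimal Python sketch of such a pipeline (not the exact code used in the course); it assumes NLTK is installed for the Porter stemmer and uses a small hard-coded stopword list purely for illustration.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

# Toy stopword list for illustration; a real system would use a fuller list.
STOPWORDS = {"the", "of", "a", "an", "and", "or", "to", "in", "is", "are"}

def preprocess(text):
    """Tokenize, remove stopwords, and stem a raw text string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())        # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]     # stopword removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]               # stemming

print(preprocess("The stemmers are stemming the stemmed words."))
# e.g. ['stemmer', 'stem', 'stem', 'word']
```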

Slide 2: Today's Topics
Text properties:
- Distribution of words in language
- Why are they important?
Zipf's Law
Heaps' Law

Slide 3: Statistical Properties of Text
How is the frequency of different words distributed? How fast does vocabulary size grow with the size of a corpus? Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.

Slide 4: Word Frequency
A few words are very common:
- The 2 most frequent words (e.g., "the", "of") can account for about 10% of word occurrences.
Most words are very rare:
- Half the words in a corpus appear only once; these are called hapax legomena (Greek for "read only once").
This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
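These properties are easy to check empirically. The sketch below is a rough illustration: `tokens` is assumed to be the lowercased word tokens of a real corpus, and the function reports the share of occurrences covered by the two most frequent words and the fraction of distinct words that are hapax legomena.

```python
from collections import Counter

def frequency_profile(tokens):
    """Report the share of the two most frequent words and of hapax legomena."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top2 = sum(freq for _, freq in counts.most_common(2))
    hapax = sum(1 for freq in counts.values() if freq == 1)
    print(f"top-2 words cover {top2 / total:.1%} of occurrences")
    print(f"{hapax / len(counts):.1%} of distinct words occur exactly once")

# Toy usage; on a real English corpus the first number is typically near 10%.
frequency_profile("the cat sat on the mat because the mat was warm".split())
```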

Slide 5 Sample Word Frequency Data (from B. Croft, UMass)

Slide 6: Zipf's Law
Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
Zipf (1949) "discovered" that frequency and rank are roughly inversely proportional: f · r ≈ constant.
If the probability of the word of rank r is p_r and N is the total number of word occurrences, then
p_r = f / N ≈ A / r, where A is a corpus-independent constant (A ≈ 0.1 for English text).
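A quick way to probe this on a corpus is to look at the product r · p_r for a few ranks; under Zipf's law it should stay roughly constant. The following sketch assumes `tokens` is a list of word tokens.

```python
from collections import Counter

def zipf_constants(tokens, ranks=(1, 2, 5, 10, 50, 100)):
    """Print r * p_r for a few ranks; under Zipf's law these are all roughly A."""
    counts = Counter(tokens)
    total = sum(counts.values())
    freqs = [f for _, f in counts.most_common()]   # frequencies sorted by rank
    for r in ranks:
        if r <= len(freqs):
            p_r = freqs[r - 1] / total
            print(f"rank {r:4d}: r * p_r = {r * p_r:.3f}")

# usage: zipf_constants(tokens)  -- on English text the values hover around 0.1
```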

Slide 7: Zipf and Term Weighting
Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for indexing.

Slide 8: Predicting Occurrence Frequencies
By Zipf's law, a word appearing n times has rank r_n = AN/n.
Several words may occur n times; assume the rank r_n applies to the last of these.
Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.
So, the number of words appearing exactly n times is:
I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN / (n(n+1))

Slide 9: Predicting Word Frequencies (cont'd)
Assume the highest-ranking term occurs once and therefore has rank D = AN/1 = AN (so D is approximately the number of distinct words).
The fraction of words with frequency n is then:
I_n / D = 1 / (n(n+1))
The fraction of words appearing only once is therefore 1/(1·2) = 1/2.
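The prediction is easy to compare against a real corpus: the expected fraction of distinct words occurring exactly n times is 1/(n(n+1)). A sketch of such a comparison, again assuming `tokens` is a list of word tokens:

```python
from collections import Counter

def frequency_of_frequencies(tokens, max_n=5):
    """Compare the Zipf prediction 1/(n(n+1)) with observed fractions."""
    counts = Counter(tokens)
    vocab = len(counts)
    occ_of_occ = Counter(counts.values())   # n -> number of words occurring n times
    for n in range(1, max_n + 1):
        predicted = 1 / (n * (n + 1))
        observed = occ_of_occ.get(n, 0) / vocab
        print(f"n={n}: predicted {predicted:.3f}, observed {observed:.3f}")

# e.g. the prediction says about half of all distinct words should be hapax legomena
```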

Slide 10 Occurrence Frequency Data (from B. Croft, UMass)

Slide 11: Does Real Data Fit Zipf's Law?
A law of the form y = k·x^c is called a power law.
Zipf's law is a power law with c = -1.
On a log-log plot, power laws give a straight line with slope c.
Zipf's law is quite accurate except for very high and very low ranks.
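Concretely, one can estimate the exponent c by fitting a straight line to log-frequency versus log-rank. A sketch with NumPy (`tokens` again assumed to be a token list):

```python
import numpy as np
from collections import Counter

def zipf_slope(tokens):
    """Estimate the power-law exponent c from a log-log fit of frequency vs. rank."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope   # close to -1 if the data follow Zipf's law

# Note: the fit is usually better if the very highest and lowest ranks are excluded.
```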

Slide 12: Fit to Zipf for the Brown Corpus (k = 100,000)

Slide 13: Mandelbrot (1954) Correction
The following more general form gives a somewhat better fit:
f = P / (r + ρ)^B, for constants P, B, and ρ.
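If the ranked frequencies of a corpus are available, the Mandelbrot parameters can be fit numerically. The sketch below uses SciPy's curve_fit as one possible approach; the starting values in p0 are arbitrary guesses, and in practice the fit is often done in log space instead.

```python
import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, B, rho):
    """Zipf-Mandelbrot form: frequency as a function of rank."""
    return P / (r + rho) ** B

def fit_mandelbrot(freqs, p0=(1e5, 1.0, 10.0)):
    """Fit P, B, rho to frequencies sorted by rank (rank 1 first)."""
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    params, _ = curve_fit(mandelbrot, ranks, np.asarray(freqs, dtype=float), p0=p0)
    return params   # (P, B, rho)
```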

Slide 14: Mandelbrot Fit: P = (not shown), B = 1.15, ρ = 100

Slide 15: Explanations for Zipf's Law
Zipf's explanation was his "principle of least effort": a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
There was a debate between Mandelbrot and H. Simon over the explanation.
Li (1992) shows that just randomly typing letters (including a space character) will generate "words" with a Zipfian distribution.

Slide 16: Zipf's Law Impact on IR
Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.
Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g., correlation analysis for query expansion) is difficult, since they are extremely rare.

Slide 17: Exercise
Assuming Zipf's Law with a corpus-independent constant A = 0.1, what is the fewest number of most-common words that together account for more than 25% of word occurrences (i.e., the minimum value of m such that at least 25% of word occurrences are among the m most common words)?
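A worked sketch of this exercise: with p_r = A/r and A = 0.1, the fraction of occurrences covered by the m most common words is 0.1 times the m-th harmonic number, so we can simply accumulate terms until the total passes 25%.

```python
def min_words_covering(target=0.25, A=0.1):
    """Smallest m such that sum_{r=1..m} A/r >= target, under an idealized Zipf model."""
    covered, m = 0.0, 0
    while covered < target:
        m += 1
        covered += A / m
    return m, covered

m, covered = min_words_covering()
print(m, round(covered, 3))   # prints 7 and about 0.259 under these assumptions
```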

Slide 18: Vocabulary Growth
How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?
This determines how the size of the inverted index will scale with the size of the corpus.
The vocabulary is not really upper-bounded, due to proper names, typos, etc.

Slide 19: Heaps' Law
If V is the size of the vocabulary and n is the length of the corpus in words, then:
V = K · n^β
Typical constants:
- K between 10 and 100
- β between 0.4 and 0.6 (approximately square-root growth)
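As a small illustration, here is Heaps' law as a function, with constants picked from the typical ranges above (K = 30 and β = 0.5 are arbitrary choices; for a real collection they must be estimated from data).

```python
def heaps_vocabulary(n, K=30.0, beta=0.5):
    """Predicted vocabulary size V = K * n^beta for a corpus of n word occurrences."""
    return K * n ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} tokens -> ~{heaps_vocabulary(n):,.0f} distinct words")
```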

Slide 20 Heaps’ Law Data

Slide 21: Explanation for Heaps' Law
Heaps' law can be derived from Zipf's law by assuming that documents are generated by randomly sampling words from a Zipfian distribution.
Heaps' law also holds for distributions of other data: our own experiments on the types of questions asked by users show similar behavior.
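That derivation can be illustrated with a quick simulation: draw tokens independently from a Zipfian (1/rank) distribution over a fixed lexicon and track how many distinct words have been seen. The lexicon size, corpus size, and checkpoints below are arbitrary choices for the sketch.

```python
import numpy as np

def simulate_heaps(corpus_size=200_000, lexicon_size=50_000, seed=0):
    """Vocabulary growth when tokens are drawn i.i.d. from a Zipfian (1/rank) distribution."""
    rng = np.random.default_rng(seed)
    probs = 1.0 / np.arange(1, lexicon_size + 1)
    probs /= probs.sum()
    sample = rng.choice(lexicon_size, size=corpus_size, p=probs)
    seen = set()
    for n, word in enumerate(sample, start=1):
        seen.add(word)
        if n in (1_000, 10_000, 100_000, 200_000):
            print(f"n={n:>7,}: vocabulary = {len(seen):,}")

simulate_heaps()
# The vocabulary grows roughly like a power of n, as Heaps' law predicts.
```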

Slide 22: Exercise
We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora:
- For 100,000 words, there are 50,000 unique words.
- For 500,000 words, there are 150,000 unique words.
Estimate the vocabulary size for the 1,000,000-word corpus.
How about for a corpus of 1,000,000,000 words?
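A worked sketch of one way to do this exercise: take logs of Heaps' law so that it becomes linear, solve for β and K from the two given points, and then extrapolate to the larger corpora.

```python
import math

def fit_heaps(n1, v1, n2, v2):
    """Solve V = K * n^beta from two (corpus size, vocabulary size) observations."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    K = v1 / n1 ** beta
    return K, beta

K, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
for n in (1_000_000, 1_000_000_000):
    print(f"{n:,} words -> about {K * n ** beta:,.0f} distinct words")
# With these numbers beta is about 0.68, giving roughly 240,000 distinct words
# for the 1,000,000-word corpus and roughly 27 million for the 1,000,000,000-word corpus.
```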