Information Retrieval Lecture 2: The term vocabulary and postings lists.

Plan for this lecture
- Elaborate basic indexing
- Preprocessing to form the term vocabulary: documents, tokenization, and what terms we put in the index
- Postings: faster merges with skip lists, positional postings and phrase queries

Recall the basic indexing pipeline
- Documents to be indexed: "Friends, Romans, countrymen."
- Tokenizer produces the token stream: Friends | Romans | Countrymen
- Linguistic modules produce modified tokens: friend | roman | countryman
- Indexer builds the inverted index, with postings lists for friend, roman, countryman
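As a rough illustration of this pipeline (my own sketch, not the lecture's reference implementation), the following Python code tokenizes, applies a stand-in linguistic module that only lowercases, and builds an inverted index mapping each term to a sorted postings list of document IDs. All function names are assumptions made for illustration.

    import re
    from collections import defaultdict

    def tokenize(text):
        # split on runs of non-letters; a real tokenizer is far more careful
        return [t for t in re.split(r"[^A-Za-z]+", text) if t]

    def normalize(token):
        # stand-in "linguistic module": lowercasing only (no stemming here)
        return token.lower()

    def build_index(docs):
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for token in tokenize(text):
                index[normalize(token)].add(doc_id)
        # postings lists are kept sorted by document ID
        return {term: sorted(ids) for term, ids in index.items()}

    print(build_index(["Friends, Romans, countrymen.", "So let it be with Caesar."]))
    # e.g. {'friends': [0], 'romans': [0], 'countrymen': [0], 'so': [1], ...}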

Parsing a document
- What format is it in? (PDF, Word, Excel, HTML, ...)
- What language is it in?
- What character set is in use?
Each of these is a classification problem, which we will study later in the course. In practice, these tasks are often done heuristically.
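One heuristic way to make such guesses in practice is to use off-the-shelf detectors. The sketch below is only an illustration, not part of the lecture; it assumes the third-party chardet and langdetect packages are installed and uses a hypothetical file name.

    import chardet                    # third-party: pip install chardet
    from langdetect import detect     # third-party: pip install langdetect

    raw = open("some_document.txt", "rb").read()   # hypothetical input file
    guess = chardet.detect(raw)                    # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
    print("charset:", guess["encoding"], "language:", detect(text))  # language as an ISO 639-1 code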

Complications: format and language
- Documents being indexed can come from many different languages, so a single index may have to contain terms from several languages.
- Sometimes a document or its components mix languages and formats, e.g., a French document with a German PDF attachment.

Tokens and Terms

Tokenization
- Input: "Friends, Romans and Countrymen"
- Output: tokens Friends | Romans | Countrymen
- Each such token is now a candidate for an index entry, after the further processing described below.
- But what are valid tokens to emit?

Why tokenization is difficult, even in English
- Example: "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."
- Tokenize this sentence: what do you do with the periods, and with the apostrophes in O'Neill, boys', Chile's and aren't?
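The sketch below (my own illustration, not the lecture's) shows how three different tokenization choices split this sentence; the regular expressions are simplistic and only meant to make the ambiguity concrete.

    import re

    sentence = ("Mr. O'Neill thinks that the boys' stories "
                "about Chile's capital aren't amusing.")

    # Whitespace splitting keeps punctuation attached to the words:
    print(sentence.split())              # ["Mr.", "O'Neill", ..., "amusing."]

    # Keeping only runs of word characters breaks up O'Neill and aren't:
    print(re.findall(r"\w+", sentence))  # ['Mr', 'O', 'Neill', ..., 'aren', 't', 'amusing']

    # Allowing one internal apostrophe keeps O'Neill and aren't together:
    print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?\.?", sentence))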

One word or two? (or several?)
- fault-finder
- co-education
- state-of-the-art
- data base
- San Francisco
- cheap San Francisco-Los Angeles fares

Numbers
- Examples: 3/12/91; Mar. 12, 1991; 55 B.C.; B-52; "My PGP key is 324a3df234cb23e"; (800) 234-2333
- Such tokens often have embedded spaces and are often not indexed as text.
- But they are often very useful: think about looking up error codes on the web.
- (One answer is using n-grams: Lecture 3)
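As a tiny illustration of the n-gram idea mentioned above (my own sketch; details are deferred to Lecture 3), character 3-grams of a token such as an error code can be indexed so that partial matches remain findable. The sketch only shows the n-gram generation step.

    def char_ngrams(token, n=3):
        # pad with boundary markers so prefixes and suffixes are represented too
        padded = f"${token}$"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("324a3df234cb23e"))   # ['$32', '324', '24a', '4a3', ...]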

Tokenization: language issues
- Chinese and Japanese have no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。 ("Sharapova now lives in Florida, in the southeastern United States.")
- Segmentation into words is not always guaranteed to be unique.

Ambiguous segmentation in Chinese
- The two characters 和尚 can be treated as one word meaning 'monk', or as a sequence of two words meaning 'and' and 'still'.

Normalization
- We need to "normalize" terms in the indexed text, as well as query terms, into the same form. Example: we want to match U.S.A. and USA.
- Most commonly, we implicitly define equivalence classes of terms (e.g., by deleting the periods in a term).
- Alternatively, do asymmetric expansion:
  - enter: window    search: window, windows
  - enter: windows   search: Windows, windows
  - enter: Windows   search: Windows (no expansion)
- Asymmetric expansion is more powerful, but less efficient.
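A minimal sketch of the two strategies above; the normalization rule (drop periods, then lowercase) and the small expansion table are simplifications chosen here for illustration.

    def normalize(term):
        # equivalence classing: U.S.A., USA and usa all map to the same index term
        return term.replace(".", "").lower()

    # Asymmetric expansion is applied to the query term only; the index is untouched.
    EXPANSIONS = {
        "window":  ["window", "windows"],
        "windows": ["Windows", "windows"],
        "Windows": ["Windows"],            # no expansion
    }

    def expand_query_term(term):
        return EXPANSIONS.get(term, [term])

    print(normalize("U.S.A."), normalize("USA"))   # usa usa
    print(expand_query_term("window"))             # ['window', 'windows']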

Case folding
- Reduce all letters to lower case.
- Possible exception: keep upper case that appears in mid-sentence? (Fed vs. fed)
- Often it is best to lowercase everything, since users will type lowercase regardless of 'correct' capitalization.

Lemmatization
- Reduce inflectional/variant forms to the base form, e.g.:
  - am, are, is → be
  - car, cars, car's, cars' → car
  - "the boy's cars are different colors" → "the boy car be different color"
- Lemmatization implies doing "proper" reduction to the dictionary headword form.
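One common off-the-shelf lemmatizer is NLTK's WordNetLemmatizer. The hedged sketch below is only one way to do this, not the lecture's prescribed tool; it assumes nltk is installed and its WordNet data downloaded (some NLTK versions also require the omw-1.4 corpus).

    import nltk
    nltk.download("wordnet", quiet=True)    # one-time download of the WordNet data

    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    # the lemmatizer needs the part of speech; with the default (noun), "are" would stay "are"
    print(wnl.lemmatize("are", pos="v"))     # be
    print(wnl.lemmatize("cars", pos="n"))    # car
    print(wnl.lemmatize("colors", pos="n"))  # color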

Stemming
- Definition: a crude heuristic process that chops off the ends of words, in the hope of achieving what "principled" lemmatization attempts to do with a lot of linguistic knowledge.
- Reduce terms to their "roots" before indexing. "Stemming" suggests crude affix chopping and is language dependent.
- E.g., automate(s), automatic, automation are all reduced to automat.
- E.g., "for example compressed and compression are both accepted as equivalent to compress" stems to "for exampl compress and compress ar both accept as equival to compress".

Porter algorithm
- The most common algorithm for stemming English; results suggest it is at least as good as other stemming options.
- Phases are applied sequentially; each phase consists of a set of commands.
- Sample command: delete final "ement" if what remains is longer than one character (replacement → replac, but cement → cement).
- Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Porter stemmer: a few rules
- SSES → SS   (caresses → caress)
- IES → I     (ponies → poni)
- SS → SS     (caress → caress)
- S → ∅       (cats → cat)
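A small sketch (my own, not the official Porter implementation) of just these four rules, applying the convention from the previous slide: among the rules whose suffix matches, pick the one with the longest suffix.

    RULES = [
        ("sses", "ss"),   # caresses -> caress
        ("ies", "i"),     # ponies   -> poni
        ("ss", "ss"),     # caress   -> caress (so the bare "s" rule is not applied)
        ("s", ""),        # cats     -> cat
    ]

    def step_1a(word):
        # select the matching rule with the longest suffix
        matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
        if not matches:
            return word
        suf, rep = max(matches, key=lambda m: len(m[0]))
        return word[: len(word) - len(suf)] + rep

    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", step_1a(w))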

Other stemmers
- Other stemmers exist, e.g., the Lovins stemmer: single-pass, longest-suffix removal (about 250 rules).
- Full morphological analysis gives at most modest benefits for retrieval.
- Do stemming and other normalizations help? For English: very mixed results. They help recall for some queries but harm precision on others. E.g., the Porter stemmer equivalence class "oper" contains all of operate, operating, operates, operation, operative, operatives, operational.
- Definitely useful for Spanish, German, Finnish, ...

Thesauri
- Handle synonyms and homonyms with hand-constructed equivalence classes, e.g., car = automobile, color = colour.
- Rewrite to form equivalence classes and index such equivalences: when a document contains automobile, index it under car as well (usually also vice versa).
- Or expand the query: when the query contains automobile, look under car as well. (Both options are sketched below.)
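A minimal sketch of the two options above, index-time rewriting versus query-time expansion, using a tiny hand-built thesaurus; the dictionary and helper names are my own illustration.

    THESAURUS = {"automobile": "car", "colour": "color"}   # synonym -> class representative

    def index_terms(token):
        # index-time option: index the document under the class representative as well
        return {token, THESAURUS.get(token, token)}

    def expand_query(token):
        # query-time option: also look under every synonym in the same class
        rep = THESAURUS.get(token, token)
        synonyms = {t for t, r in THESAURUS.items() if r == rep}
        return {token, rep} | synonyms

    print(index_terms("automobile"))   # {'automobile', 'car'}
    print(expand_query("car"))         # {'car', 'automobile'}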

Stop words
- Stop words = extremely common words that appear to be of little value in helping select documents matching a user need; they have little semantic content.
- Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with.
- Stop word elimination used to be standard in older IR systems, but the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space needed to include stop words in a system is very small.
  - Good query optimization techniques mean you pay little at query time for including stop words.
  - You need them for phrase queries ("King of Denmark"), various song titles, etc. ("Let it be", "To be or not to be"), and "relational" queries ("flights to London").
- Most web search engines index stop words.
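A short sketch (my own illustration) of why dropping stop words hurts the queries listed above: filtering with exactly the example list leaves almost nothing of "to be or not to be" and removes the relational "to" from "flights to London".

    STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
                  "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
                  "to", "was", "were", "will", "with"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words("to be or not to be".split()))   # ['or', 'not'] (most of the phrase is gone)
    print(remove_stop_words("flights to London".split()))    # ['flights', 'London'] (loses the relational 'to')
    print(remove_stop_words("King of Denmark".split()))      # ['King', 'Denmark']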