The term vocabulary and postings lists

Slides:



Advertisements
Similar presentations
ITCS 6265 Information Retrieval and Web Mining Lecture 2: The term vocabulary and postings lists.
Advertisements

Introduction to Information Retrieval Introduction to Information Retrieval Document ingestion.
1 Web Search and Text Mining Lecture 2 Adapted from Manning, Raghaven and Schuetze.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 2 The term vocabulary and postings lists.
Information Retrieval and Text Mining Lecture 2. Recap of the previous lecture Basic inverted indexes: Structure: Dictionary and Postings Key step in.
Introduction to Information Retrieval Introduction to Information Retrieval Christopher Manning and Prabhakar Raghavan Lecture 2: The term vocabulary and.
Term Vocabulary and Postings Lists 1 Lecture 3: The term vocabulary and postings lists Web Search and Mining.
CES 514 Data Mining Feb 17, 2010 Lecture 2: The term vocabulary and postings lists.
CSE 538 MRS BOOK – CHAPTER II The term vocabulary and postings lists
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 2: The term vocabulary and postings.
COMP4201 Information Retrieval and Search Engines Lecture 2: The term vocabulary and postings lists.
CS276: Information Retrieval and Web Search
Text Pre-processing and Faster Query Processing David Kauchak cs160 Fall 2009 adapted from:
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Inverted Index Construction Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend romancountryman Indexer Inverted.
Preparing the Text for Indexing 1. Collecting the Documents Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists Minqi Zhou 1.
CS276A Information Retrieval
Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2.
CS276 Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists.
Information Retrieval Document Parsing. Basic indexing pipeline Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
PrasadL4DictonaryAndQP1 Dictionary and Postings; Query Processing Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 2 8/25/2011.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 2: The term vocabulary and postings lists Related to Chapter 2:
LIS618 lecture 2 the Boolean model Thomas Krichel
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
Introduction to Information Retrieval Introduction to Information Retrieval Adapted from Christopher Manning and Prabhakar Raghavan The term vocabulary,
Information Retrieval Lecture 2: The term vocabulary and postings lists.
Text Pre-processing and Faster Query Processing David Kauchak cs458 Fall 2012 adapted from:
Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Dan Jurafsky Text Normalization Every NLP task needs to do text normalization: 1.Segmenting/tokenizing words in running text 2.Normalizing word formats.
Information Retrieval Lecture 2: Dictionary and Postings.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Document Parsing Paolo Ferragina Dipartimento di Informatica Università di Pisa.
CS276 Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Basic Text Processing Word tokenization.
CS315 Introduction to Information Retrieval Boolean Search 1.
Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists Minqi Zhou 1.
CSE 538 MRS BOOK – CHAPTER II The term vocabulary and postings lists
Information Retrieval
Lecture 2: The term vocabulary and postings lists
Ch 2 Term Vocabulary & Postings List
CS122B: Projects in Databases and Web Applications Winter 2017
7CCSMWAL Algorithmic Issues in the WWW
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Query processing: phrase queries and positional indexes
Lecture 2: The term vocabulary and postings lists
Modified from Stanford CS276 slides
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Text Processing.
Document ingestion.
CS276: Information Retrieval and Web Search
CS276: Information Retrieval and Web Search
Recap of the previous lecture
信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
Lectures 4: Skip Pointers, Phrase Queries, Positional Indexing
PageRank GROUP 4.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture 2: The term vocabulary and postings lists
Lecture 2: The term vocabulary and postings lists
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Query processing: phrase queries and positional indexes
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

The term vocabulary and postings lists Lecture 2: The term vocabulary and postings lists Related to Chapter 2: http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Recap of the previous lecture Ch. 1 Recap of the previous lecture Basic inverted indexes: Structure: Dictionary and Postings Boolean query processing Intersection by linear time “merging” Simple optimizations

Plan for this lecture Preprocessing to form the term vocabulary Tokenization Normalization Postings Faster merges: skip lists Positional postings and phrase queries Elaborate: با دقت شرح دادن

Recall the basic indexing pipeline Documents to be indexed. Friends, Romans, countrymen. Tokenizer Token stream. Friends Romans Countrymen Linguistic modules Modified tokens. friend roman countryman Indexer Inverted index. friend roman countryman 2 4 13 16 1

The first step: Parsing a document Sec. 2.1 The first step: Parsing a document What format is it in? pdf/word/excel/html What language is it in? What character set is in use? Parsing: تجزیه کردن و تشخیص اجزاء آن

Complications: Format/language Sec. 2.1 Complications: Format/language Documents being indexed can include docs from many different languages A single index may have to contain terms of several languages. Sometimes a document or its components can contain multiple languages/formats French email with a German pdf attachment. Complications: پیچیدگی

Tokenization

Tokenization Given a character sequence, tokenization is the task of chopping it up into pieces, called tokens. Perhaps at the same time throwing away certain characters, such as punctuation.

Tokenization Input: “university of Qom, computer department” Sec. 2.2.1 Tokenization Input: “university of Qom, computer department” Output: Tokens university of Qom computer Department Each such token is now a candidate for an index entry, after further processing: Normalization. But what are valid tokens to emit? Emit: بیرون دادن

Issues in tokenization Sec. 2.2.1 Issues in tokenization Iran’s capital  Iran? Irans? Iran’s? Hyphen Hewlett-Packard  Hewlett and Packard as two tokens? the hold-him-back-and-drag-him-away maneuver co-author lowercase, lower-case, lower case ? Space San Francisco: How do you decide it is one token? Hewlett-Packard: شرکت کاليفرنيايى که محدوده گسترده اى را از تجهيزات الکترونيکى شامل کامپيوتر و دستگاههاى جانبى را توليد مى کند اسم دو نفر است که این شرکت را ساخته اند

Issues in tokenization Sec. 2.2.1 Issues in tokenization Numbers Older IR systems may not index numbers But often very useful: looking up error codes/stack traces on the web 3/12/91 Mar. 12, 1991 12/3/91 55 B.C. B-52 My PGP key is 324a3df234cb23e (800) 234-2333

Language issues in tokenization Sec. 2.2.1 Language issues in tokenization German noun compounds are not segmented Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’ German retrieval systems benefit greatly from a compound splitter module Can give a 15% performance boost for German Boost: بالا بردن

Language issues in tokenization Sec. 2.2.1 Language issues in tokenization Chinese and Japanese have no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。 Further complicated in Japanese, with multiple alphabets intermingled Intermingle: با هم آمیختن フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) Katakana Hiragana Kanji Romaji

Sec. 2.2.2 Stop words With a stop list, you exclude from the dictionary words like the, a, and, to, be Intuition: They have little semantic content. The commonest words. Using a stop list significantly reduces the number of postings that a system has to store, because there are a lot of them. Two ways for construction: By expert By machine

Stop words But you need them for: Phrase queries: “President of Iran” Various song titles, etc.: “Let it be”, “To be or not to be” “Relational” queries: “flights to London” The general trend in IR systems: from large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list. Good compression techniques (lecture 5) means the space for including stop words in a system is very small Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)

normalization

Normalization to terms Sec. 2.2.3 Normalization to terms Example: We want to match I.R. and IR Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens Result is terms: A term is a (normalized) word type, which is an entry in our IR system dictionary Canonical: استاندارد (ریاضی)

Normalization Example: Case folding Sec. 2.2.3 Normalization Example: Case folding Reduce all letters to lower case Exception: upper case in mid-sentence? Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization. Folding: تا کردن

Normalization to terms One way is using Equivalence Classes: car = automobile color = colour Searches for one term will retrieve documents that contain each of these members. We most commonly implicitly define equivalence classes of terms rather than being fully calculated in advance (hand constructed), e.g., deleting periods to form a term I.R., IR deleting hyphens to form a term anti-discriminatory, antidiscriminatory

Normalization: other languages Sec. 2.2.3 Normalization: other languages Accents: e.g., French résumé vs. resume. Umlauts: e.g., German: Tuebingen vs. Tübingen Normalization of things like date forms 7月30日 vs. 7/30 Tokenization and normalization may depend on the language and so is intertwined with language detection Crucial: Need to “normalize” indexed text as well as query terms into the same form Accents: تکیه صدا Umlauts: ادغام Intertwined: درهم پیچیده شده

Normalization to terms Sec. 2.2.3 Normalization to terms What is the disadvantage of equivalence classing? An example: Enter: window Search: window, windows Enter: windows Search: Windows, windows, window Enter: Windows Search: Windows An alternative to equivalence classing is to do asymmetric expansion It is hand constructed Potentially more powerful, but less efficient Why not the reverse?

Stemming and lemmatization Documents are going to use different forms of a word, organize, organizes, and organizing am, are, and is There are families of derivationally related words with similar meanings, democracy, democratic, and democratization. Reduce tokens to their “roots” before indexing.

Stemming “Stemming” suggest crude affix chopping Example: Sec. 2.2.4 Stemming “Stemming” suggest crude affix chopping language dependent Example: Porter’s algorithm http://www.tartarus.org/~martin/PorterStemmer Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm Stem: ریشه و اصل Crude: زمخت Affix: پیوست

Typical rules in Porter Sec. 2.2.4 Typical rules in Porter sses  ss presses  press ies  i bodies  bodi ss  ss press  press s  cats  cat

Sec. 2.2.4 Lemmatization Reduce inflectional/variant forms to base form (lemma) properly with the use of a vocabulary and morphological analysis of words Lemmatizer: a tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word. Inflection: خمیدگی

Language-specificity Sec. 2.2.4 Language-specificity Many of the above features embody transformations that are Language-specific and often, application-specific There are “plug-in” addenda to the indexing process Both open source and commercial plug-ins are available for handling these Addenda: attachment

Helpfulness of normalization Do stemming and other normalizations help? Definitely useful for Spanish, German, Finnish, … 30% performance gains for Finnish! What about English? Not so considerable help! Helps a lot for some queries, hurts performance a lot for others.

Helpfulness of normalization Example: operative (dentistry) ⇒ oper operational (research) ⇒ oper operating (systems) ⇒ oper For a case like this, moving to using a lemmatizer would not completely fix the problem Operate: عمل کردن ، به کار انداختن ، بهره بردارى کردن ، به فعاليت واداشتن ، گرداندن ، اداره کردن ، راه انداختن ، داير بودن ، عمل جراحى کردن

Faster postings merges: Skip lists

Sec. 2.3 Recall basic merge Walk through the two postings simultaneously, in time linear in the total number of postings entries 2 4 8 41 48 64 128 Brutus 2 8 1 2 3 8 11 17 21 31 Caesar If the list lengths are m and n, the merge takes O(m+n) operations. Can we do better? Yes!

Augment postings with skip pointers Sec. 2.3 Augment postings with skip pointers 41 128 128 2 4 8 41 48 64 11 31 1 2 3 8 11 17 21 31 At indexing time. The resulted list is skip list.

Query processing with skip pointers Sec. 2.3 Query processing with skip pointers 41 128 2 4 8 41 48 64 128 11 31 But the skip successor of 11 is 31, so we can skip ahead past the intervening postings. 1 2 3 8 11 17 21 31 Suppose we’ve stepped through the lists until we process 8 on each list. We then have 41 and 11. 11 is smaller.

Where do we place skips? Tradeoff: Sec. 2.3 Where do we place skips? Tradeoff: More skips  More likely to skip. But lots of comparisons to skip pointers. Fewer skips  Few successful skips. But few pointer comparison.

Sec. 2.3 Placing skips Simple heuristic: for postings of length L, use L evenly-spaced skip pointers. This ignores the distribution of query terms. Easy if the index is relatively static; harder if L keeps changing because of updates. The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging! D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.

Phrase queries and positional indexes

Sec. 2.4 Phrase queries Want to be able to answer queries such as “stanford university” – as a phrase Thus the sentence “I went to university at Stanford” is not a match. Most recent search engines support a double quotes syntax

Phrase queries PHRASE QUERIES has proven to be very easily understood and successfully used by users. As many as 10% of web queries are phrase queries. For this, it no longer suffices to store only <term : docs> entries Solutions?

A first attempt: Biword indexes Sec. 2.4.1 A first attempt: Biword indexes Index every consecutive pair of terms in the text as a phrase For example the text “Qom computer department” would generate the biwords Qom computer computer department Two-word phrase query-processing is now immediate.

Longer phrase queries The query “modern information retrieval course” can be broken into the Boolean query on biwords: modern information AND information retrieval AND retrieval course Work fairly well in practice. But there can and will be occasional errors.

Issues for biword indexes Sec. 2.4.1 Issues for biword indexes Errors, as noted before. Index blowup due to bigger dictionary Infeasible for more than biwords, big even for them. Biword indexes are not the standard solution but can be part of a compound strategy. Blowup: انفجار

Solution 2: Positional indexes Sec. 2.4.2 Solution 2: Positional indexes In the postings, store for each term the position(s) in which tokens of it appear: <term, number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>

Positional index example Sec. 2.4.2 Positional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? For phrase queries, we need to deal with more than just equality.

Proximity queries LIMIT /3 STATUTE /3 FEDERAL /2 TORT Sec. 2.4.2 Proximity queries LIMIT /3 STATUTE /3 FEDERAL /2 TORT /k means “within k words of (on either side)”. Clearly, positional indexes can be used for such queries; biword indexes cannot. STATUTE : قانون Tort: شبه جرم

Sec. 2.4.2 Positional index size Need an entry for each occurrence, not just once per document Consider a term with frequency 0.1% 100 1 100,000 1000 Positional postings Postings Document size

Sec. 2.4.2 Rules of thumb A positional index is 2–4 as large as a non-positional index Positional index size 35–50% of volume of original text Caveat: all of this holds for “English-like” languages Caveat: آگهی

Combination schemes These two approaches can be profitably combined Sec. 2.4.3 Combination schemes These two approaches can be profitably combined For particular phrases (“Hossein Rezazadeh”) it is inefficient to keep on merging positional postings lists Profitably: به روشی ارزنده

Combination schemes Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme A typical web query mixture was executed in ¼ of the time of using just a positional index It required 26% more space than having a positional index alone H.E. Williams, J. Zobel, and D. Bahle. 2004. “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.

Exercise Write a pseudo-code for biwork phrase queries using positional index. Do exercises 2.5 and 2.6 of your book.