Information Retrieval Techniques MS(CS) Lecture 5 AIR UNIVERSITY MULTAN CAMPUS.

Quick Review
- Inverted Index Construction (Exercise)
- Query Processing Using Inverted Index
- Faster Posting Merges: Skip Pointers
- Phrase Queries
  – Bi-word Index
  – Extended Bi-word Index
  – Positional Index

Proximity queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
- Again, here, /k means "within k words of".
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  – This is a little tricky to do correctly and efficiently
  – See Figure 2.12 of IIR
  – There's likely to be a problem on it!

Proximity search
- We just saw how to use a positional index for phrase searches.
- We can also use it for proximity search.
- For example: employment /4 place
- Find all documents that contain EMPLOYMENT and PLACE within 4 words of each other.
- "Employment agencies that place healthcare workers are seeing growth" is a hit.
- "Employment agencies that have learned to adapt now place healthcare workers" is not a hit.

Proximity search?
- Use the positional index
- Simplest algorithm: look at the cross-product of positions of (i) EMPLOYMENT in the document and (ii) PLACE in the document
- Very inefficient for frequent words, especially stop words
- Note that we want to return the actual matching positions, not just a list of documents.
- This is important for dynamic summaries etc.
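
The cross-product approach can be tightened into the linear positional merge that the exercise above asks for. Below is a Python sketch (function and variable names are mine; a simplified take on the idea behind IIR Figure 2.12) that intersects the position lists of two terms within one document, keeping a sliding window of nearby positions:

```python
def positional_intersect(pos1, pos2, k):
    """Proximity intersect within ONE document (sketch).

    pos1, pos2: sorted position lists of the two terms.
    Returns the matching position pairs (p1, p2) with |p1 - p2| <= k.
    """
    answer = []
    window = []          # positions from pos2 within k of the current p1
    j = 0
    for p1 in pos1:
        # extend the window with pos2 positions up to p1 + k
        while j < len(pos2) and pos2[j] <= p1 + k:
            window.append(pos2[j])
            j += 1
        # drop positions that have fallen too far behind p1
        window = [p2 for p2 in window if p2 >= p1 - k]
        for p2 in window:
            answer.append((p1, p2))
    return answer

print(positional_intersect([3, 10], [1, 5, 20], 2))   # [(3, 1), (3, 5)]
```

Because both lists are sorted, each position enters and leaves the window once, so the merge is linear rather than quadratic in the list lengths.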

Positional index size
- You can compress position values/offsets.
- Nevertheless, a positional index expands postings storage substantially.
- A positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Positional index size
- Need an entry for each occurrence, not just once per document
- Index size depends on average document size
  – Average web page has <1000 terms
  – SEC filings, books, even some epic poems … easily 100,000 terms
- Consider a term with frequency 0.1%. Expected entries:

  Document size | Postings | Positional postings
  1,000         | 1        | 1
  100,000       | 1        | 100
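
A quick back-of-the-envelope check of the table above (illustrative Python; the assumption that occurrences are spread across documents is mine):

```python
def expected_entries(doc_size, term_freq=0.001):
    """Expected index entries for one document, for a term occurring
    with the given collection frequency (0.1% by default)."""
    occurrences = doc_size * term_freq          # expected positional postings
    # a non-positional index stores at most one posting per document
    nonpositional = 1 if occurrences >= 1 else occurrences
    return nonpositional, occurrences

print(expected_entries(1_000))     # (1, 1.0)    one posting, ~1 position
print(expected_entries(100_000))   # (1, 100.0)  one posting, ~100 positions
```

This is why positional index size tracks total term occurrences while a docID-only index tracks the number of documents.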

Rules of thumb
- A positional index is 2–4 times as large as a non-positional index
- Positional index size is 35–50% of the volume of the original text
- Caveat: all of this holds for "English-like" languages

Combination schemes
- These two approaches can be profitably combined
  – For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
  – Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  – A typical web query mixture was executed in ¼ of the time of using just a positional index
  – It required 26% more space than having a positional index alone

Inverted Index Construction
- Positional index size
- Dictionary size
- Hardware issues
- Large collection requirements analysis

Inverted index

Dictionaries
- The dictionary is the data structure for storing the term vocabulary.
- Term vocabulary: the data
- Dictionary: the data structure for storing the term vocabulary

Dictionary as array of fixed-width entries
- For each term, we need to store a couple of items:
  – document frequency
  – pointer to postings list
- Assume for the time being that we can store this information in a fixed-length entry.
- Assume that we store these entries in an array.

Dictionary as array of fixed-width entries
- Space needed per entry: term 20 bytes, document frequency 4 bytes, pointer to postings list 4 bytes
- How do we look up a query term q_i in this array at query time? That is: which data structure do we use to locate the entry (row) in the array where q_i is stored?

Data structures for looking up terms
- Two main classes of data structures: hashes and trees
- Some IR systems use hashes, some use trees.
- Criteria for when to use hashes vs. trees:
  – Is there a fixed number of terms or will it keep growing?
  – What are the relative frequencies with which various keys will be accessed?
  – How many terms are we likely to have?

Hashes
- Each vocabulary term is hashed into an integer.
- Try to avoid collisions
- At query time: hash the query term, resolve collisions, locate the entry in the fixed-width array
- Pros: Lookup in a hash is faster than lookup in a tree.
  – Lookup time is constant.
- Cons:
  – no way to find minor variants (resume vs. résumé)
  – no prefix search (all terms starting with automat)
  – need to rehash everything periodically if the vocabulary keeps growing
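
As a minimal illustration, Python's built-in dict can stand in for the hashed dictionary described above, mapping each term to its document frequency and postings list (names are mine):

```python
# A minimal hashed dictionary: term -> (document frequency, postings list).
# Python's dict plays the role of the hash table described above.
dictionary = {}

def add_posting(term, doc_id):
    """Record one occurrence of term in doc_id (docIDs arrive in order)."""
    df, postings = dictionary.get(term, (0, []))
    if not postings or postings[-1] != doc_id:   # first occurrence in this doc
        postings = postings + [doc_id]
        df += 1
    dictionary[term] = (df, postings)

add_posting("brutus", 1)
add_posting("brutus", 2)
add_posting("brutus", 2)     # duplicate occurrence, same document
print(dictionary["brutus"])  # (2, [1, 2])
```

Lookup by term is constant time, but, exactly as the cons above say, a dict gives no way to enumerate all terms with a given prefix.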

Trees
- Trees solve the prefix problem (find all terms starting with automat).
- Simplest tree: binary tree
- Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
  – O(log M) only holds for balanced trees.
- Rebalancing binary trees is expensive.
  – B-trees mitigate the rebalancing problem.
- B-tree definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate positive integers, e.g., [2, 4].

Binary tree

B-tree

Index construction (Ch. 4)
- How do we construct an index?
- What strategies can we use with limited main memory?

Hardware basics (Sec. 4.1)
- Many design decisions in information retrieval are based on the characteristics of hardware
- We begin by reviewing hardware basics

Hardware basics (Sec. 4.1)
- Access to data in memory is much faster than access to data on disk.
- Disk seeks: No data is transferred from disk while the disk head is being positioned.
- Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.

Hardware basics (Sec. 4.1)
- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
- Available disk space is several (2–3) orders of magnitude larger.
- Fault tolerance is very expensive: It's much cheaper to use many regular machines than one fault-tolerant machine.

Recall IIR 1 index construction (Sec. 4.2)
- Documents are parsed to extract words and these are saved with the document ID.
- Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
- Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Key step (Sec. 4.2)
- After all documents have been parsed, the inverted file is sorted by terms.
- We focus on this sort step. We have 100M items to sort.

Scaling index construction (Sec. 4.2)
- In-memory index construction does not scale
  – Can't stuff the entire collection into memory, sort, then write back
- How can we construct an index for very large collections?
- Taking into account the hardware constraints we just learned about: memory, disk, speed, etc.

Sort-based index construction (Sec. 4.2)
- As we build the index, we parse docs one at a time.
  – While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex)
- The final postings for any term are incomplete until the end.
- At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
  – T = 100,000,000 in the case of RCV1
  – So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire.
- Thus: We need to store intermediate results on disk.

Sort using disk as "memory"? (Sec. 4.2)
- Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
- No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
- We need an external sorting algorithm.

RCV1 collection
- Shakespeare's collected works are not large enough for demonstrating many of the points in this course.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
- English newswire articles sent over the wire in 1995 and 1996 (one year).

"External" sorting algorithm (using few disk seeks)
- We must sort T = 100,000,000 non-positional postings.
- Each posting has size 12 bytes (4+4+4: termID, docID, document frequency).
- Define a block to consist of 10,000,000 such postings
  – We can easily fit that many postings into memory.
  – We will have 10 such blocks for RCV1.
- Basic idea of the algorithm:
  – For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk
  – Then merge the blocks into one long sorted order.
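
A toy sketch of the blocked sort-based idea, assuming postings are (termID, docID) pairs and using Python's heapq for the multi-way merge (function and parameter names are illustrative; a real system would stream each run from disk during the merge rather than load it back whole):

```python
import heapq
import itertools
import pickle
import tempfile

def bsbi_sort(postings_stream, block_size=10_000_000):
    """Blocked sort-based sort: sort fixed-size blocks in memory,
    spill each sorted run to disk, then k-way merge the runs."""
    run_files = []
    it = iter(postings_stream)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            break
        block.sort()                       # (i)+(ii) accumulate and sort in memory
        f = tempfile.TemporaryFile()
        pickle.dump(block, f)              # (iii) write the sorted run to disk
        f.seek(0)
        run_files.append(f)
    runs = [pickle.load(f) for f in run_files]   # toy: reload runs whole
    return list(heapq.merge(*runs))              # merge into one sorted order

tiny = [(3, 1), (1, 2), (2, 1), (1, 1)]
print(bsbi_sort(tiny, block_size=2))   # [(1, 1), (1, 2), (2, 1), (3, 1)]
```

heapq.merge keeps only one element per run in memory at a time, which is exactly why the merge phase needs few disk seeks when the runs are read in large sequential chunks.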

Merging two blocks

Blocked Sort-Based Indexing
- Key decision: What is the size of one block?

Problem with the sort-based algorithm
- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings...
- ... but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

Single-pass in-memory indexing
- Abbreviation: SPIMI
- Key idea 1: Generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.
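
A compact SPIMI sketch under the same assumptions (names are mine; real SPIMI writes each block's index to disk before the final merge):

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1000):
    """SPIMI sketch: per-block dictionaries with postings lists,
    no global term->termID mapping, postings accumulated unsorted."""
    blocks = []
    index = defaultdict(list)
    n = 0
    for term, doc_id in token_stream:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)       # accumulate postings as they occur
            n += 1
        if n >= max_postings:                # block full: sort terms, emit block
            blocks.append(dict(sorted(index.items())))
            index, n = defaultdict(list), 0
    if index:
        blocks.append(dict(sorted(index.items())))
    # merge the per-block indexes into one big index
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            merged[term].extend(postings)
    return dict(merged)

print(spimi_invert([("a", 1), ("b", 1), ("a", 2)], max_postings=2))
```

Note that only the terms of a finished block are sorted (needed for an efficient merge); the postings themselves are never sorted, which is key idea 2.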

Using wildcards in queries

Wildcard queries
- mon*: find all docs containing any term beginning with mon
  – Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo
- *mon: find all docs containing any term ending with mon
  – Maintain an additional tree for terms written backwards
  – Then retrieve all terms t in the range nom ≤ t < non
- Result: a set of terms that are matches for the wildcard query
- Then retrieve documents that contain any of these terms
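
The B-tree range trick for mon* can be mimicked over a sorted term list with binary search (a sketch; names are my choosing):

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """All terms t with prefix <= t < successor(prefix),
    mimicking the B-tree range lookup mon <= t < moo."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # successor of the prefix: bump its last character, e.g. "mon" -> "moo"
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

terms = ["money", "monitor", "month", "moon", "moot"]
print(prefix_range(terms, "mon"))   # ['money', 'monitor', 'month']
```

The *mon case is the same lookup run over a second list holding every term reversed.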

How to handle * in the middle of a term
- Example: m*nchen
- We could look up m* and *nchen in the B-tree and intersect the two term sets.
  – Expensive
- Alternative: permuterm index
- Basic idea: Rotate every wildcard query, so that the * occurs at the end.
- Store each of these rotations in the dictionary, say, in a B-tree

Permuterm index
- For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree, where $ is a special symbol

Permuterm → term mapping

Permuterm index
- For HELLO, we've stored: hello$, ello$h, llo$he, lo$hel, and o$hell
- Queries:
  – For X, look up X$
  – For X*, look up X*$
  – For *X, look up X$*
  – For *X*, look up X*
  – For X*Y, look up Y$X*
- Example: For hel*o, look up o$hel*
- A permuterm index would better be called a permuterm tree. But permuterm index is the more common name.
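
The rotations and the query rotation can be sketched as follows (illustrative helper names; the query rotation assumes a single * and follows the general X*Y → Y$X* rule):

```python
def permuterm_keys(term):
    """All rotations of term + '$': the permuterm entries for this term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    """Rotate a single-* wildcard query so the * ends up at the end:
    X*Y -> Y$X*, after which the lookup is an ordinary prefix search."""
    head, tail = query.split("*", 1)
    return tail + "$" + head + "*"

print(permuterm_keys("hello")[:3])   # ['hello$', 'ello$h', 'llo$he']
print(rotate_query("hel*o"))         # o$hel*
```

The prefix o$hel then matches the stored rotation o$hell, which maps back to the vocabulary term HELLO.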

Processing a lookup in the permuterm index
- Rotate the query wildcard to the right
- Use B-tree lookup as before
- Problem: the permuterm index more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)

k-gram indexes
- More space-efficient than the permuterm index
- Enumerate all character k-grams (sequences of k characters) occurring in a term
- 2-grams are called bigrams.
- Example: from "April is the cruelest month" we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$
- $ is a special word boundary symbol, as before.
- Maintain an inverted index from bigrams to the terms that contain the bigram

Postings list in a 3-gram inverted index

k-gram (bigram, trigram, ...) indexes
- Note that we now have two different types of inverted indexes:
  – The term-document inverted index for finding documents based on a query consisting of terms
  – The k-gram index for finding terms based on a query consisting of k-grams

Processing wildcarded terms in a bigram index
- Query mon* can now be run as: $m AND mo AND on
- Gets us all terms with the prefix mon...
- ... but also many "false positives" like MOON.
- We must postfilter these terms against the query.
- Surviving terms are then looked up in the term-document inverted index.
- k-gram index vs. permuterm index:
  – The k-gram index is more space efficient.
  – The permuterm index doesn't require postfiltering.
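
Putting the bigram index and the postfilter together (a sketch; names are mine):

```python
import re
from collections import defaultdict

def kgrams(term, k=2):
    """Character k-grams of a term, with $ as the boundary symbol."""
    t = "$" + term + "$"
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def wildcard_lookup(index, vocabulary, query, k=2):
    """mon* -> $m AND mo AND on, then postfilter against the query."""
    t = "$" + query + "$"
    grams = {t[i:i + k] for i in range(len(t) - k + 1) if "*" not in t[i:i + k]}
    candidates = set(vocabulary)
    for g in grams:
        candidates &= index[g]                       # AND of the gram postings
    # postfilter: kill false positives like MOON for mon*
    pattern = re.compile("^" + query.replace("*", ".*") + "$")
    return {term for term in candidates if pattern.match(term)}

vocab = ["moon", "month", "monitor", "demon"]
idx = build_kgram_index(vocab)
print(wildcard_lookup(idx, vocab, "mon*"))   # {'month', 'monitor'}
```

MOON survives the gram intersection ($m, mo, on all occur in $moon$) but is removed by the postfilter, exactly as described above.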

Exercise
- Google has very limited support for wildcard queries.
- For example, this query doesn't work very well on Google: [gen* universit*]
  – Intention: you are looking for the University of Geneva, but don't know which accents to use for the French words for university and Geneva.
- According to Google search basics: "Note that the * operator works only on whole words, not parts of words."
- But this is not entirely true. Try [pythag*] and [m*nchen].
- Exercise: Why doesn't Google fully support wildcard queries?

Processing wildcard queries in the term-document index
- Problem 1: we must potentially execute a large number of Boolean queries.
  – Most straightforward semantics: conjunction of disjunctions
  – For [gen* universit*]: geneva university OR geneva université OR genève university OR genève université OR general universities OR...
  – Very expensive
- Problem 2: Users hate to type.
  – If abbreviated queries like [pyth* theo*] for [pythagoras' theorem] are allowed, users will use them a lot.
  – This would significantly increase the cost of answering queries.
  – Somewhat alleviated by Google Suggest

Spelling correction
- Two principal uses:
  – Correcting documents being indexed
  – Correcting user queries
- Two different methods for spelling correction:
  – Isolated word spelling correction
    - Check each word on its own for misspelling
    - Will not catch typos resulting in correctly spelled words, e.g., "an asteroid that fell form the sky"
  – Context-sensitive spelling correction
    - Look at surrounding words
    - Can correct the form/from error above

Correcting documents
- We're not interested in interactive spelling correction of documents (e.g., MS Word) in this class.
- In IR, we use document correction primarily for OCR'ed documents. (OCR = optical character recognition)
- The general philosophy in IR is: don't change the documents.

Correcting queries
- First: isolated word spelling correction
- Premise 1: There is a list of "correct words" from which the correct spellings come.
- Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
- Simple spelling correction algorithm: return the "correct" word that has the smallest distance to the misspelled word.
- Example: informaton → information
- For the list of correct words, we can use the vocabulary of all words that occur in our collection.
- Why is this problematic?
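
The simple algorithm above can be sketched with a standard dynamic-programming edit distance (illustrative names; production systems use weighted distances and prune the candidate set rather than scanning the whole lexicon):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete ca
                           cur[j - 1] + 1,           # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Return the lexicon word with the smallest distance to the input."""
    return min(lexicon, key=lambda w: edit_distance(word, w))

print(edit_distance("informaton", "information"))  # 1
print(correct("informaton", ["information", "informant", "matron"]))
```

With ties, min returns the first candidate; a real corrector would break ties by term frequency, which is one reason the choice of word list matters.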

Alternatives to using the term vocabulary
- A standard dictionary (Webster's, OED, etc.)
- An industry-specific dictionary (for specialized IR systems)
- The term vocabulary of the collection, appropriately weighted

Hash function for strings (example from CENG 213 Data Structures)
- KeySize = 3; TableSize = 10,007
- Each character code key[i] is weighted by a power of 37 and the sum is reduced modulo the table size.
- For "ali" (ASCII codes: 'a' = 97, 'l' = 108, 'i' = 105):
  hash("ali") = (105·37² + 108·37 + 97·1) % 10,007 = 7,740
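
A sketch of that hash in Python (the weighting order is my reading of the garbled example; base 37 and table size 10,007 are from the slide):

```python
def hash_string(key, table_size=10_007, base=37):
    """Polynomial string hash: weight the code of key[i] by base**i,
    then reduce modulo the table size."""
    return sum(ord(ch) * base ** i for i, ch in enumerate(key)) % table_size

print(hash_string("ali"))   # 7740
```

37 is a common base here because it is prime and keeps the intermediate sums well spread before the modulo step.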

QUESTIONS?