Presentation on theme: "Manning, Raghavan, Schutze"— Presentation transcript:

1 IR Indexing
Thanks to: B. Arms (SIMS); Baldi, Frasconi, Smyth; Manning, Raghavan, Schütze

2 What we have covered
What is IR; evaluation
Tokenization, terms, and properties of text
Web crawling; robots.txt
Vector model of documents; bag of words
Queries; measures of similarity
This presentation: indexing and inverted files

3 Last time
Vector models of documents and queries; bag of words model
Similarity measures; text similarity typically used
Similarity is a measure of relevance (and ranking)
The query is treated as a small document and matched against the documents
All documents are stored and indexed before any query is matched.

4 Summary: What’s the point of using vector spaces?
A well-formed algebraic space for retrieval
Key: a user’s query can be viewed as a (very) short document. The query becomes a vector in the same space as the docs, so we can measure each doc’s proximity to it.
A natural measure of scores/ranking – no longer Boolean.
Queries are expressed as bags of words.
Other similarity measures exist (see the literature for a survey).

5 A Typical Web Search Engine
[Diagram: the crawler fetches pages from the Web; the indexer builds the index; the query engine matches user queries against the index through an interface. Example systems: YouSeer, Nutch, LucidWorks, Solr/Lucene]

6 A Typical Web Search Engine
[Same diagram, with the indexing path (crawler → indexer → index) highlighted]

7 Why indexing? For efficient searching of documents with unstructured text (not databases), using terms as queries
Databases can still be searched with an index (e.g., Sphinx)
Online sequential text search (grep): fits a small collection with volatile text
Data structures (indexes): fit a large, semi-stable document collection and give efficient search

8 Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Why is that not the answer? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the word Romans near countrymen) not feasible Ranked retrieval (best documents to return) Grep is line-oriented; IR is document oriented.

9 Term-document incidence matrices
Sec. 1.1 Term-document incidence matrices 1 if play contains word, 0 otherwise Brutus AND Caesar BUT NOT Calpurnia

10 Incidence vectors So we have a 0/1 vector for each term.
Sec. 1.1 Incidence vectors So we have a 0/1 vector for each term. To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) → bitwise AND. 110100 AND 110111 AND 101111 = 100100
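A minimal sketch of this Boolean query, using Python integers as 6-bit incidence vectors (the Brutus/Caesar/Calpurnia values are the textbook example; one bit per play):

```python
# Incidence vectors over 6 plays (leftmost bit = first play).
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, then bitwise AND.
mask = (1 << 6) - 1                         # keep only 6 bits after complementing
result = brutus & caesar & (~calpurnia & mask)

print(format(result, "06b"))                # 100100 -> plays 1 and 4 match
```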

11 Answers to query Antony and Cleopatra, Act III, Scene ii
Sec. 1.1 Answers to query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.

12 Sec. 1.1 Bigger collections Consider N = 1 million documents, each with about 1000 words. Avg 6 bytes/word including spaces/punctuation → 6 GB of data in the documents. Say there are M = 500K distinct terms among these.

13 Sec. 1.1 Can’t build the matrix 500K x 1M matrix has half-a-trillion 0’s and 1’s. But it has no more than one billion 1’s. matrix is extremely sparse. What’s a better representation? We only record the 1 positions.

14 Sec. 1.2 Inverted index For each term t, we must store a list of all documents that contain t. Identify each doc by a docID, a document serial number. Can we use fixed-size arrays for this? Brutus → 1 2 4 11 31 45 173 174; Caesar → 1 2 4 5 6 16 57 132; Calpurnia → 2 31 54 101. What happens if the word Caesar is added to document 14?

15 Inverted index We need variable-size postings lists
Sec. 1.2 We need variable-size postings lists. On disk, a continuous run of postings is normal and best. In memory, can use linked lists or variable-length arrays – some tradeoffs in size/ease of insertion. Linked lists are generally preferred to arrays: dynamic space allocation and easy insertion of terms, at the cost of space overhead for pointers. Dictionary → postings: Brutus → 1 2 4 11 31 45 173 174; Caesar → 1 2 4 5 6 16 57 132; Calpurnia → 2 31 54 101. Postings are sorted by docID (more later on why).
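A minimal in-memory sketch of variable-size postings lists, using a Python dict of sorted lists (the term names and docIDs are the slide's example; `add_posting` is an illustrative helper, not a named algorithm from the slides):

```python
import bisect

# Inverted index: term -> sorted, variable-length list of docIDs.
index = {
    "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings list, keeping it sorted by docID."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

# "Caesar" added to document 14: no fixed-size array to outgrow.
add_posting(index, "caesar", 14)
print(index["caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```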

16 Inverted index construction
Sec. 1.2 Inverted index construction Documents to be indexed: Friends, Romans, countrymen. → Tokenizer → token stream: Friends Romans Countrymen → Linguistic modules → modified tokens: friend roman countryman → Indexer → inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16

17 Initial stages of text processing
Tokenization: cut the character sequence into word tokens; deal with “John’s”, “a state-of-the-art solution”
Normalization: map text and query terms to the same form (you want U.S.A. and USA to match)
Stemming: we may wish different forms of a root to match (authorize, authorization)
Stop words: we may omit very common words, or not (the, a, to, of)
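These initial stages can be sketched as a small pipeline. This is illustrative only: the stemmer below is a crude suffix-stripper standing in for a real one (e.g., Porter's), and the stop list is a toy:

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}          # toy stop list

def tokenize(text):
    """Cut the character sequence into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def normalize(token):
    """Map variant forms to one form, e.g. "john's" -> "john"."""
    return token.replace("'s", "")

def stem(token):
    """Crude suffix stripping so different forms of a root match."""
    for suffix in ("ization", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(normalize(t)) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The authorization to authorize"))   # ['author', 'author']
```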

18 Dictionary data structures for inverted indexes
Sec. 3.1 Dictionary data structures for inverted indexes The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

19 A naïve dictionary An array of struct: char[20] int Postings *
Sec. 3.1 A naïve dictionary An array of struct: char[20] (20 bytes), int (4/8 bytes), Postings * (4/8 bytes). How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?

20 Dictionary data structures
Sec. 3.1 Dictionary data structures Two main choices: Hashtables Trees Some IR systems use hashtables, some trees

21 Hashtables Each vocabulary term is hashed to an integer Pros: Cons:
Sec. 3.1 Each vocabulary term is hashed to an integer (We assume you’ve seen hashtables before) Pros: Lookup is faster than for a tree: O(1) Cons: No easy way to find minor variants: judgment/judgement No prefix search [tolerant retrieval] If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

22 Tree: binary tree Sec. 3.1
[Figure: binary search tree over the dictionary. Root splits a-m / n-z; next level a-hu, hy-m, n-sh, si-z; leaves include aardvark, huygens, sickle, zygot]

23 Sec. 3.1 Tree: B-tree [Figure: B-tree with child ranges a-hu, hy-m, n-z] Definition: every internal node has a number of children in the interval [a,b], where a, b are appropriate natural numbers, e.g., [2,4].

24 Trees Solves the prefix problem (terms starting with hyp)
Sec. 3.1 Trees Simplest: binary tree More usual: B-trees Trees require a standard ordering of characters and hence strings … but we typically have one Pros: Solves the prefix problem (terms starting with hyp) Cons: Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive But B-trees mitigate the rebalancing problem

25 Unstructured vs structured data
What’s available? Web Organizations You Unstructured usually means “text” Structured usually means “databases” Semistructured somewhere in between

26 Unstructured (text) vs. structured (database) data in companies in 1996

27 Unstructured (text) vs. structured (database) data in companies in 2006

28 IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in “tables” Employee Manager Salary Smith Jones 50000 Chang Smith 60000 Ivy Smith 50000 Typically allows numerical range and exact match (for text) queries, e.g., Salary < AND Manager = Smith.

29 Unstructured data Typically refers to free text Allows
Keyword queries including operators More sophisticated “concept” queries e.g., find all web pages dealing with drug abuse Classic model for searching text documents

30 Semi-structured data In fact almost no data is “unstructured”
E.g., this slide has distinctly identified zones such as the Title and Bullets Facilitates “semi-structured” search such as Title contains data AND Bullets contain search … to say nothing of linguistic structure

31 More sophisticated semi-structured search
Title is about Object Oriented Programming AND Author something like stro*rup where * is the wild-card operator Issues: how do you process “about”? how do you rank results? The focus of XML search.

32 Clustering and classification
Given a set of docs, group them into clusters based on their contents. Given a set of topics, plus a new doc D, decide which topic(s) D belongs to. Not discussed in this course

33 The web and its challenges
Unusual and diverse documents Unusual and diverse users, queries, information needs Beyond terms, exploit ideas from social networks link analysis, clickstreams ... How do search engines work? And how can we make them better?

34 More sophisticated information retrieval not covered here
Cross-language information retrieval Question answering Summarization Text mining

35 Search Subsystem
[Diagram: the query is parsed into query tokens; an optional stop list* removes stoplist tokens; optional stemming* produces stemmed terms; Boolean operations* against the index database yield the retrieved document set; optional ranking* orders the relevant document set into the ranked document set. *Indicates optional operation.]

36 Lexical Analysis: Tokens
What is a token? Free text indexing A token is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word. Usually, tokens are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc.

37 Decisions in Building Inverted Files: What is the Term/Token?
• Underlying character set, e.g., printable ASCII, Unicode, UTF8. • Is there a controlled vocabulary? If so, what words are included? Stemming? • List of stopwords. • Rules to decide the beginning and end of words, e.g., spaces or punctuation. • Character sequences not to be indexed, e.g., sequences of numbers.

38

39 Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g., Regular expressions If inverted files are stored in lexicographic order, comp* can be processed efficiently *comp cannot be processed efficiently Boolean terms If A and B are search terms A or B can be processed by comparing two moderate sized lists (not A) or (not B) requires two very large lists

40 Major Indexing Methods
Inverted index: effective for very large collections of documents; associates lexical items with their occurrences in the collection. Positional and non-positional indexes; block-sort construction. Suffix trees and arrays: faster for phrase searches, but harder to build and maintain. Signature files: word-oriented index structures based on hashing (usually not used for large texts).

41 Sparse Vectors Terms (tokens) as vectors
The vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4. However, most documents and queries do not contain most words, so vectors are sparse (i.e., most entries are 0). Need efficient methods for storing and computing with sparse vectors.

42 Sparse Vectors as Lists
Store vectors as linked lists of non-zero-weight tokens paired with a weight. Space proportional to the number of unique terms (n) in the document. Requires a linear search of the list to find (or change) the weight of a specific term, so computing the vector for a document takes quadratic, O(n^2), time in the worst case.

43 Sparse Vectors as Trees
Index the tokens in a document in a balanced binary tree or trie, with weights stored with the tokens at the leaves. [Figure: balanced binary tree keyed on tokens (split points “film”, “memory”, “variable”), with leaf weights bit 2, film 1, memory 1, variable 2]

44 Sparse Vectors as Trees (cont.)
Space overhead for tree structure: ~2n nodes. O(log n) time to find or update weight of a specific term. O(n log n) time to construct vector. Need software package to support such data structures.

45 Sparse Vectors as Hash Tables
Hashing: well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array Store tokens in hash table, with token string as key and weight as value. Storage overhead for hash table ~1.5n Table must fit in main memory. Constant time to find or update weight of a specific token (ignoring collisions). O(n) time to construct vector (ignoring collisions).

46 Implementation Based on Inverted Files
In practice, document vectors are not stored directly; an inverted organization provides much better efficiency. The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree). Critical issue is logarithmic or constant-time access to token information.

47 Index File Types - Lucene
Field File – Field infos: .fnm
Stored Fields – field index: .fdx; field data: .fdt
Term Dictionary file – Term infos: .tis; Term info index: .tii
Frequency file: .frq
Position file: .prx

48 Representation of Inverted Files
Field File – Field infos: field names are stored in the field info file.
Stored Fields – Field index: contains, for each document, a pointer to its field data. Field data: contains the stored fields of each document.
Term Dictionary file (stores document frequency) – Term infos: this file is sorted by term; terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text. Term info index: contains entries and locations in the term infos file; designed to be read entirely into memory and used to provide random access to the term infos file.
Frequency file (posting file): contains the lists of documents which contain each term, along with the frequency of the term in that document. Ordered by term (the term is implicit, from the term infos file); term frequency entries are ordered by increasing document number.
Position file (inverted index file): contains the lists of positions at which each term occurs within documents; can be used to highlight search results, etc. Position entries are ordered by increasing document number.

49 Organization of Inverted Files
[Table: how the three files fit together. The term dictionary stores prefix-compressed terms (prefix length, suffix, field number, doc freq), e.g., cat, (ca)thy, dog, (do)om. The posting file stores, per term (implicit), delta-encoded (DocDelta, Freq) pairs. The position file stores, per term and document, the positions of each occurrence, which reference the document files.]

50 Efficiency Criteria
Storage: inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: it must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document.
Retrieval performance: retrieval must be fast enough to satisfy users and not use excessive resources.

51 Document File The documents file stores the documents that are being indexed. The documents may be: • primary documents, e.g., electronic journal articles • surrogates, e.g., catalog records or abstracts

52 Postings File Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be long, it is important to match postings efficiently. Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.
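The sequential matching of two sorted inverted lists described above can be sketched as a standard linear merge (illustrative code, not the deck's own):

```python
def intersect(p1, p2):
    """Merge two postings lists sorted by docID: the AND of two terms.
    Each list is scanned sequentially exactly once: O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))   # [2, 31]
```

Because both lists are in the same (docID) order, the merge never backtracks, which is exactly why postings paged in from disk can be processed sequentially.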

53 Document File The storage of the document file may be:
Central (monolithic) - all documents stored together on a single server (e.g., library catalog) Distributed database - all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog) Highly distributed - documents are stored on independently managed servers (e.g., Web) Each requires: a document ID, which is a unique identifier that can be used by the inverted file system to refer to the document, and a location counter, which can be used to specify location within a document.

54 Documents File for Web Search System
For web search systems: • A document is a web page. • The documents file is the web. • The document ID is the URL of the document. Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache. (In addition to the usual index file and postings file the indexing system stores contextual information, which will be discussed in a later lecture.)

55 Term Frequency (Postings File)
The postings file stores the elements of a sparse matrix, the term assignment matrix. It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file. Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. Each list consists of one or many individual postings.

56 Postings File in Lucene: A Linked List for Each Term
abacus: 3, 6, 5 — “abacus” appears once in document 1 (3 is odd: doc delta ⌊3/2⌋ = 1, and the low bit means freq = 1) and 5 times in document 4 (6 is even: doc delta 6/2 = 3, so doc 1+3 = 4; the following 5 is the frequency).
abc: 3, 5, 2, 3 — “abc” appears once in document 1 (⌊3/2⌋ = 1), once in document 3 (1 + ⌊5/2⌋ = 3), and 3 times in document 4 (3 + ⌊2/2⌋ = 4; the trailing 3 is the frequency).
A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.
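A decoder for the encoding implied by the slide's two examples (low bit flags freq = 1; doc numbers are delta-encoded). This is a sketch of the scheme only; real Lucene additionally packs these integers with variable-byte encoding:

```python
def decode_postings(encoded):
    """Decode a postings list where each entry's low bit flags freq == 1
    and doc numbers are stored as deltas (value >> 1) from the previous doc."""
    postings, doc, i = [], 0, 0
    while i < len(encoded):
        value = encoded[i]; i += 1
        doc += value >> 1            # delta-decode the document number
        if value & 1:                # low bit set: frequency is 1
            freq = 1
        else:                        # otherwise the next value is the frequency
            freq = encoded[i]; i += 1
        postings.append((doc, freq))
    return postings

print(decode_postings([3, 6, 5]))      # [(1, 1), (4, 5)]      "abacus"
print(decode_postings([3, 5, 2, 3]))   # [(1, 1), (3, 1), (4, 3)]  "abc"
```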

57 Length of Postings File
For a common term there may be very large numbers of postings. Example: 1,000,000,000 documents, 1,000,000 distinct words, average length 1,000 words per document → 10^12 postings. By Zipf's law, the 10th-ranked word occurs approximately (10^12/10)/10 = 10^10 times.
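The arithmetic behind that estimate, spelled out (the "top word ≈ total/10" step is the slide's rough Zipf approximation, not an exact law):

```python
# Total postings: 10^9 documents x 10^3 words per document.
total_postings = 1_000_000_000 * 1_000      # 10**12

# Rough Zipf estimate: the top-ranked word accounts for ~1/10 of all postings,
# and the word of rank r occurs roughly 1/r as often as the top word.
top_word = total_postings // 10
rank_10 = top_word // 10                    # 10th-ranked word

print(rank_10 == 10**10)                    # True: ~10^10 postings for one term
```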

58 Inverted Index Primary data structure for text indexes
Basically two elements: (vocabulary, occurrences). Main idea: invert documents into a big index. Basic steps: (1) make a “dictionary” of all the terms/tokens in the collection; (2) for each term/token, list all the docs it occurs in, possibly with the location in the document (Lucene stores the positions); (3) compress to reduce redundancy in the data structure, which also reduces the I/O and storage required.

59 Inverted Indexes We have seen “Vector files”. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

60 How Are Inverted Files Created
Documents are parsed one document at a time to extract tokens/terms. These are saved with the Document ID (DID). <term, DID> Doc 1 Doc 2 Now is the time for all good men to come to the aid of their country It was a dark and stormy night in the country manor. The time was past midnight

61 How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically and in document order.

62 How Inverted Files are Created
Multiple term/token entries for a single document are merged. Within-document term frequency information is compiled. Result <term,DID,tf> <the,1,2>

63 How Inverted Files are Created
Then the inverted file can be split into: a dictionary file (the unique terms; fit in memory if possible) and a postings file (which documents each term/token is in and how often, sometimes where in the document; stored on disk). Worst case O(n), where n is the number of tokens.
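The parse → sort → merge steps of slides 60–63 can be sketched end to end on the two example documents (the tokenization here is a naive split/strip, for illustration only):

```python
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# Step 1: parse one document at a time into <term, DID> pairs.
pairs = [(tok.strip(".").lower(), did)
         for did, text in docs.items() for tok in text.split()]

# Step 2: sort alphabetically, then in document order.
pairs.sort()

# Step 3: merge duplicate <term, DID> entries, compiling term frequency: <term, DID, tf>.
postings = [(term, did, tf) for (term, did), tf in sorted(Counter(pairs).items())]

# Step 4: split into a dictionary file (unique terms) and the postings.
dictionary = sorted({term for term, _, _ in postings})

print([p for p in postings if p[0] == "the"])   # [('the', 1, 2), ('the', 2, 2)]
```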

64 Dictionary and Posting Files
[Figure: an example dictionary file alongside its postings file]

65 Inverted indexes Permit fast search for individual terms
For each term/token, you get a list consisting of: document ID (DID) frequency of term in doc (optional, implied in Lucene) position of term in doc (optional, Lucene) <term,DID,tf,position> <term,(DIDi,tf,positionij),…> Lucene: <positionij,…> (term and DID are implied from other files) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2
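The slide's Boolean example (country AND manor → d2) as set intersection over the postings lists (a toy in-memory index, illustrative only):

```python
# Inverted index: term -> list of docIDs (from the two example documents).
index = {
    "country": [1, 2],
    "manor":   [2],
    "time":    [1, 2],
}

def boolean_and(index, *terms):
    """AND query: intersect the docID lists of all query terms."""
    result = None
    for term in terms:
        docs = set(index.get(term, []))
        result = docs if result is None else result & docs
    return sorted(result or [])

print(boolean_and(index, "country", "manor"))   # [2]
```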

66 How Inverted Files are Used
Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfied the query. Dictionary Postings

67 Inverted index Associates a posting list with each term
POSTING LIST example: a → (d1, 1); the → (d1, 2) (d2, 2). Replace term frequency (tf) with tf-idf (Lucene stores only tf; idf is added at query time). Compress the index and add hash links; hashing speeds up access. Match the query to the index and rank.

68 Position in inverted file posting
POSTING LIST example now (1, 1) time (1,4) (2,13) Doc 1 Doc 2 Now is the time for all good men to come to the aid of their country It was a dark and stormy night in the country manor. The time was past midnight

69 Change weight Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.

70 Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) About 7 terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM Long, precise queries; proximity operators; incrementally developed; not like web search

71 Time Complexity of Indexing
Complexity of creating vector and indexing a document of n terms is O(n). So building an index of m such documents is O(m n). Computing vector lengths is also O(m n), which is also the complexity of reading in the corpus.

72 Retrieval with an Inverted Index
Terms that are not in both the query and the document do not affect cosine similarity: the product of their term weights is zero and does not contribute to the dot product. Usually the query is fairly short, and therefore its vector is extremely sparse. Use the inverted index to find the limited set of documents that contain at least one of the query words. Per-term lookup is O(1) with hashing (O(log M) with a tree), where M is the size of the dictionary.

73 Inverted Query Retrieval Efficiency
Assume that, on average, a query word appears in B documents. Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all M documents, O(|V| M), because |Q| << |V| and B << M (|V| is the size of the vocabulary). [Diagram: query Q = q1 q2 … q|Q|, each term qi pointing to its postings Di1 … DiB]

74 Processing the Query Incrementally compute cosine similarity of each indexed document as query words are processed one by one. To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where DocumentReference is the key and the partial accumulated score is the value.
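A sketch of this accumulation, with a Python dict standing in for the hashtable keyed by document reference (the tiny index and weights are hypothetical, just to make the code runnable):

```python
from collections import defaultdict
import math

# Hypothetical inverted index: term -> list of (doc_id, term weight).
index = {
    "dark": [(2, 0.7)],
    "time": [(1, 0.4), (2, 0.3)],
}
doc_lengths = {1: 1.0, 2: 1.2}        # precomputed document vector lengths

def retrieve(query_terms, query_weights):
    """Process query words one by one, accumulating partial scores per document."""
    scores = defaultdict(float)       # doc reference -> partial accumulated score
    for term, q_w in zip(query_terms, query_weights):
        for doc_id, d_w in index.get(term, []):
            scores[doc_id] += q_w * d_w
    # Normalize the dot products into cosine similarities and rank.
    q_len = math.sqrt(sum(w * w for w in query_weights))
    return sorted(((s / (doc_lengths[d] * q_len), d) for d, s in scores.items()),
                  reverse=True)

print(retrieve(["time", "dark"], [1.0, 1.0]))   # doc 2 ranks first
```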

75 Index Files On disk If an index is held on disk, search time is dominated by the number of disk accesses. In memory Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.

76 Index File Structures: Linear Index
Advantages Can be searched quickly, e.g., by binary search, O(log M) Good for sequential processing, e.g., comp* Convenient for batch updating Economical use of storage Disadvantages Index must be rebuilt if an extra term is added

77 Documents File for Web Search System
For Web search systems: • A Document is a Web page. • The Documents File is the Web. • The Document ID is the URL of the document. Indexes are built using a Web crawler, which retrieves each page on the Web (or a subset). After indexing each page is discarded, unless stored in a cache. (In addition to the usual index file and postings file the indexing system stores special information)

78 Index File Structures: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog. [Figure: the resulting binary search tree. Root elk; children bee and hog; bee's children ant and cat (with dog under cat); hog's child fox (with gnu under fox)]

79 Binary Tree Advantages Can be searched quickly
Convenient for batch updating Easy to add an extra term Economical use of storage Disadvantages Less good for lexicographic processing, e.g., comp* Tree tends to become unbalanced If the index is held on disk, important to optimize the number of disk accesses

80 Binary Tree Calculation of maximum depth of tree.
Illustrates the importance of balanced trees. Worst case: depth = n, i.e., O(n). Ideal case: depth = log(n + 1)/log 2, i.e., O(log n).

81 Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when index held in memory Knuth vol 1, 2.3.1, page 325.

82 Right Threaded Binary Tree
From: Robert F. Rossa

83 B-trees B-tree of order m: A balanced, multiway search tree:
• Each node stores many keys • Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. • If ki is the ith key in a given internal node -> all keys in the (i-1)th child are smaller than ki -> all keys in the ith child are bigger than ki • All leaves are at the same depth

84 B-trees B-tree example (order 2)
[Figure: an order-2 B-tree with root keys 50 and 65 and child nodes holding the remaining keys (10, 19, 35, 36, 47, 55, 59, 66, 68, 70, 72, 73, 90, 98).] Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.

85 B+-tree • A B-tree is used as an index
• Data is stored in the leaves of the tree, known as buckets. Example: B+-tree of order 2, bucket size 4. [Figure: internal nodes 50 65 and 10 25, 55 59; the leaves are buckets of document records (D9 …, D54, D66 …, D81 …).]

86 Time Complexity of Indexing
Complexity of creating the vector and indexing a document of n tokens is O(n). So indexing m such documents is O(m n). Computing token IDFs for a vocabulary V is O(|V|). Computing vector lengths is also O(m n). Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

87 Index on disk vs. memory Most retrieval systems keep the dictionary in memory and the postings on disk Web search engines frequently keep both in memory massive memory requirement feasible for large web service installations less so for commercial usage where query loads are lighter

88 Indexing in the real world
Typically, don’t have all documents sitting on a local filesystem. Documents need to be crawled; they could be dispersed over a WAN with varying connectivity, so distributed spiders/indexers must be scheduled. Content could also be (secure content) in databases or content management applications.

89 Content residing in applications
Mail systems/groupware, content management contain the most “valuable” documents http often not the most efficient way of fetching these documents - native API fetching Specialized, repository-specific connectors These connectors also facilitate document viewing when a search result is selected for viewing

90 Secure documents Each document is accessible to a subset of users
Usually implemented through some form of Access Control Lists (ACLs) Search users are authenticated Query should retrieve a document only if user can access it So if there are docs matching your search but you’re not privy to them, “Sorry no results found” E.g., as a lowly employee in the company, I get “No results” for the query “salary roster” Solr has this with post filtering

91 Users in groups, docs from groups
Index the ACLs and filter results by them. Often, user membership in an ACL group is verified at query time – a slowdown. [Figure: documents × users access matrix; an entry is 0 if the user can’t read the doc, 1 otherwise.]

92 Compound documents What if a doc consisted of components
Each component has its own ACL. Your search should get a doc only if your query meets one of its components that you have access to. More generally: doc assembled from computations on components e.g., in Lotus databases or in content management systems How do you index such docs? No good answers …

93 “Rich” documents (How) Do we index images?
Researchers have devised Query Based on Image Content (QBIC) systems: “show me a picture similar to this orange circle”, then use vector space retrieval. In practice, image search is usually based on metadata such as the file name, e.g., monalisa.jpg. New approaches exploit social tagging, e.g., flickr.com.

94 Passage/sentence retrieval
Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence – say, in a very long document. Can index passages/sentences as mini-documents – but what should the index units be? This is the subject of XML search.

95 Sec. 2.4 Phrase queries We want to be able to answer queries such as “stanford university” – as a phrase Thus the sentence “I went to university at Stanford” is not a match. The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works Many more queries are implicit phrase queries For this, it no longer suffices to store only <term : docs> entries

96 A first attempt: Biword indexes
Sec A first attempt: Biword indexes Index every consecutive pair of terms in the text as a phrase For example the text “Friends, Romans, Countrymen” would generate the biwords friends romans romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate.
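Generating the biword dictionary terms from a token stream is a one-liner over consecutive pairs (illustrative code):

```python
def biwords(tokens):
    """Index every consecutive pair of terms as a single phrase term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']
```

A two-word phrase query is then a single dictionary lookup on the biword term.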

97 Can have false positives!
Sec Longer phrase queries Longer phrases can be processed by breaking them down stanford university palo alto can be broken into the Boolean query on biwords: stanford university AND university palo AND palo alto Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. Can have false positives!

98 Issues for biword indexes
Sec Issues for biword indexes False positives, as noted before Index blowup due to bigger dictionary Infeasible for more than biwords, big even for them Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

99 Solution 2: Positional indexes
Sec Solution 2: Positional indexes In the postings, store, for each term the position(s) in which tokens of it appear: <term, number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>

100 Positional index example
Sec Positional index example <be: ; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? For phrase queries, we use a merge algorithm recursively at the document level But we now need to deal with more than just equality

101 Processing a phrase query
Sec Processing a phrase query Extract inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with “to be or not to be”. to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ... be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ... Same general method for proximity searches
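A sketch of the positional check for one document, using the slide's `to`/`be` postings (a real system would merge the sorted position lists linearly rather than use the membership tests shown here):

```python
def phrase_match(pos_index, doc_id, terms):
    """True if `terms` occur at consecutive positions in doc_id.
    pos_index maps term -> {doc_id: sorted list of positions}."""
    first = pos_index.get(terms[0], {}).get(doc_id, [])
    return any(all(p + k in pos_index.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(terms[1:], start=1))
               for p in first)

pos_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
# In doc 4, "to" at 16 is followed by "be" at 17 (also 429/430, 433/434).
print(phrase_match(pos_index, 4, ["to", "be"]))   # True
print(phrase_match(pos_index, 7, ["to", "be"]))   # False
```

Relaxing `p + k` to a window `|pos_be - pos_to| <= k` turns the same scheme into a proximity search.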

102 Proximity queries LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Sec Proximity queries LIMIT! /3 STATUTE /3 FEDERAL /2 TORT Again, here, /k means “within k words of”. Clearly, positional indexes can be used for such queries; biword indexes cannot. Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k? This is a little tricky to do correctly and efficiently See Figure 2.12 of IIR

103 Sec Positional index size A positional index expands postings storage substantially Even though indices can be compressed Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

104 Positional index size Sec Need an entry for each occurrence, not just once per document. Index size depends on average document size: the average web page has <1000 terms, while SEC filings, books, even some epic poems easily reach 100,000 terms. Consider a term with frequency 0.1%: in a 1,000-term document it yields 1 posting and 1 positional posting; in a 100,000-term document, still 1 posting but 100 positional postings.

105 Sec Rules of thumb A positional index is 2–4 times as large as a non-positional index. Positional index size is 35–50% of the volume of the original text. Caveat: all of this holds for “English-like” languages.

106 Combination schemes These two approaches can be profitably combined
Sec Combination schemes These two approaches can be profitably combined For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists Even more so for phrases like “The Who” Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme A typical web query mixture was executed in ¼ of the time of using just a positional index It required 26% more space than having a positional index alone

107 Distributed indexing For web-scale indexing (don’t try this at home!):
Sec. 4.4 Distributed indexing For web-scale indexing (don’t try this at home!): must use a distributed computing cluster Individual machines are fault-prone Can unpredictably slow down or fail How do we exploit such a pool of machines?

108 Web search engine data centers
Sec. 4.4 Web search engine data centers Web search data centers (Google, Bing, Baidu) mainly contain commodity machines. Data centers are distributed around the world. Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007)

109 Sec. 4.4 Massive data centers If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system? Answer: 0.999^1000 ≈ 37% (so some node is down about 63% of the time). Exercise: Calculate the number of servers failing per minute for an installation of 1 million servers.
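Both numbers can be checked directly. Note that the exercise does not fix a per-server failure rate, so the once-every-3-years figure below is only an assumption for illustration:

```java
public class ClusterUptime {
    public static void main(String[] args) {
        // P(all 1000 nodes up at once), with independent 99.9% per-node uptime
        double allUp = Math.pow(0.999, 1000);                 // ~0.37, i.e. e^-1
        System.out.printf("system uptime: %.0f%%%n", allUp * 100);

        // Exercise sketch for 1,000,000 servers. ASSUMPTION (not from the
        // slide): each server fails on average once every 3 years.
        double minutes = 3 * 365.0 * 24 * 60;
        System.out.printf("failures/minute: %.2f%n", 1_000_000 / minutes); // ~0.63
    }
}
```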

110

111 Inside Google Data Center

112 Sec. 4.4 Distributed indexing Maintain a master machine directing the indexing job – considered “safe”. Break up indexing into sets of (parallel) tasks. Master machine assigns each task to an idle machine from a pool.

113 Parallel tasks We will use two sets of parallel tasks
Sec. 4.4 Parallel tasks We will use two sets of parallel tasks Parsers Inverters Break the input document collection into splits Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)

114 Parsers Master assigns a split to an idle parser machine
Sec. 4.4 Parsers Master assigns a split to an idle parser machine Parser reads a document at a time and emits (term, doc) pairs Parser writes pairs into j partitions Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j = 3. Now to complete the index inversion

115 Sec. 4.4 Inverters An inverter collects all (term,doc) pairs (= postings) for one term-partition. Sorts and writes to postings lists

116 Data flow
Sec. 4.4 Data flow (diagram): the master assigns splits to parsers and term partitions to inverters. Map phase: each parser reads its split and writes (term, doc) pairs into segment files, partitioned by term range (a-f, g-p, q-z). Reduce phase: each inverter reads the segment files for its partition from all parsers and produces the postings.

117 Sec. 4.4 MapReduce The index construction algorithm we just described is an instance of MapReduce. MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … … without having to write code for the distribution part. They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.

118 MapReduce Index construction was just one phase.
Sec. 4.4 Index construction was just one phase. Another phase: transforming a term-partitioned index into a document-partitioned index. Term-partitioned: one machine handles a subrange of terms Document-partitioned: one machine handles a subrange of documents As we’ll discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.

119 Schema for index construction in MapReduce
Sec. 4.4 Schema for index construction in MapReduce Schema of map and reduce functions map: input → list(k, v) reduce: (k,list(v)) → output Instantiation of the schema for index construction map: collection → list(termID, docID) reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)

120 Example for index construction
Map: d1 : C came, C c’ed. d2 : C died. → <C,d1>, <came,d1>, <C,d1>, <c’ed, d1>, <C, d2>, <died,d2> Reduce: (<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <c’ed,(d1)>) → (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c’ed,(d1:1)>)
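The example above can be run end to end in a single JVM. This sketch collapses the parsers and inverters into two functions with the map/reduce signatures from the schema slide; a real MapReduce framework distributes them across machines, and the class and method names here are only illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexMapReduce {
    // map: one document -> list of (term, docID) pairs, one per occurrence
    static List<String[]> map(String docID, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : text.split("\\s+")) pairs.add(new String[]{term, docID});
        return pairs;
    }

    // reduce: group the pairs by term into postings lists with in-doc counts
    static Map<String, Map<String, Integer>> reduce(List<String[]> pairs) {
        Map<String, Map<String, Integer>> postings = new TreeMap<>();
        for (String[] p : pairs)
            postings.computeIfAbsent(p[0], t -> new TreeMap<>())
                    .merge(p[1], 1, Integer::sum);
        return postings;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("d1", "C came C ced"));   // d1: C came, C c'ed.
        pairs.addAll(map("d2", "C died"));         // d2: C died.
        System.out.println(reduce(pairs));
        // {C={d1=2, d2=1}, came={d1=1}, ced={d1=1}, died={d2=1}}
    }
}
```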

121 Sec. 4.5 Dynamic indexing Up to now, we have assumed that collections are static. They rarely are: Documents come in over time and need to be inserted. Documents are deleted and modified. This means that the dictionary and postings lists have to be modified: Postings updates for terms already in dictionary New terms added to dictionary

122 Simplest approach Maintain “big” main index
Sec. 4.5 Simplest approach Maintain “big” main index New docs go into “small” auxiliary index Search across both, merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation bit-vector Periodically, re-index into one main index
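A toy version of this main-plus-auxiliary scheme, with the class and method names invented here and a Set of docIDs standing in for the invalidation bit-vector:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MainAuxIndex {
    private final Map<String, Set<Integer>> main = new HashMap<>(); // big index
    private final Map<String, Set<Integer>> aux  = new HashMap<>(); // new docs
    private final Set<Integer> deleted = new HashSet<>();  // invalidation marks

    public void addToMain(int docID, String text) { add(main, docID, text); }
    public void addDoc(int docID, String text)    { add(aux, docID, text); }
    public void delete(int docID)                 { deleted.add(docID); }

    private static void add(Map<String, Set<Integer>> idx, int docID, String text) {
        for (String t : text.toLowerCase().split("\\s+"))
            idx.computeIfAbsent(t, k -> new TreeSet<>()).add(docID);
    }

    // search across both indexes, merge, filter out invalidated docs
    public Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(main.getOrDefault(term, Set.of()));
        hits.addAll(aux.getOrDefault(term, Set.of()));
        hits.removeAll(deleted);
        return hits;
    }

    // periodic re-index: fold aux into main, drop postings of deleted docs
    public void merge() {
        aux.forEach((t, d) -> main.computeIfAbsent(t, k -> new TreeSet<>()).addAll(d));
        aux.clear();
        main.values().forEach(d -> d.removeAll(deleted));
        deleted.clear();
    }

    public static void main(String[] args) {
        MainAuxIndex idx = new MainAuxIndex();
        idx.addToMain(1, "penn state");
        idx.addDoc(2, "state football");   // lands in the auxiliary index
        idx.delete(1);
        System.out.println(idx.search("state"));   // [2]
    }
}
```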

123 Issues with main and auxiliary indexes
Sec. 4.5 Issues with main and auxiliary indexes Problem of frequent merges – you touch stuff a lot Poor performance during merge Actually: Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list. Merge is the same as a simple append. But then we would need a lot of files – inefficient for the OS. Assumption for the rest of the lecture: The index is one big file. In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)

124 Indexing Subsystem
Pipeline (flattened diagram): documents → assign document IDs (document numbers and *field numbers) → break into tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → Index database. *Indicates optional operation.

125 Search Subsystem
Pipeline (flattened diagram): query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → Boolean operations* against the Index database → retrieved document set → ranking* → ranked document set → relevant document set. *Indicates optional operation.

126 Lucene’s index (conceptual)
Diagram: an index holds Documents; each Document holds Fields; each Field is a Name/Value pair.

127 Create a Lucene index (step 1)
Create Lucene document and add fields

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public void createDoc(String title, String body) {
    Document doc = new Document();
    doc.add(Field.Text("Title", title));
    doc.add(Field.Text("Body", body));
}

128 Create a Lucene index (step 2)
Create an Analyzer. Options:
WhitespaceAnalyzer: divides text at whitespace
SimpleAnalyzer: divides text at non-letters, converts to lower case
StopAnalyzer: removes stop words
StandardAnalyzer: good for most European languages

129 Create a Lucene index (step 3)
Create an index writer, add the Lucene document into the index

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void writeDoc(Document doc, String idxPath) {
    try {
        IndexWriter writer = new IndexWriter(idxPath, new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
    } catch (IOException exp) {
        System.out.println("I/O Error!");
    }
}

130 Lucene Index – Behind the Scene
Inverted Index (Inverted File). Example documents: Doc 1: “Penn State Football … football”; Doc 2: “Football players … State”. Posting Table:
id  word      doc    offset
1   football  Doc 1  3, 67
              Doc 2  …
2   penn      …      …
3   players   …      4
4   state     …      13

131 Posting table Posting table is a fast look-up mechanism
Key: word. Value: posting id, satellite data (#df, offset, …). Lucene implements the posting table with Java’s hash table (java.util.Hashtable). Hash function: hc2 = hc1 * 31 + nextChar. Posting table usage: Indexing: insertion (new terms), update (existing terms). Searching: lookup, and constructing the document vector.
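The hash recurrence above can be checked against the JDK: java.lang.String.hashCode() is specified to use exactly h = 31*h + nextChar, starting from 0.

```java
public class StringHash {
    // h_{i+1} = h_i * 31 + nextChar, starting from 0: the same recurrence the
    // slide gives, and the one java.lang.String.hashCode() is specified to use
    public static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = h * 31 + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("football") == "football".hashCode());  // true
    }
}
```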

132 Search Lucene’s index (step 1)
Construct a query (automatic)

import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void formQuery(String ques) {
    Query query = QueryParser.parse(ques, "body", new StandardAnalyzer());
}

133 Search Lucene’s index (step 1)
Types of query: Boolean: [IST441 Giles] [IST441 OR Giles] [java AND NOT SUN] wildcard: [nu?ch] [nutc*] phrase: [“JAVA TOMCAT”] proximity: [“lucene nutch” ~10] fuzzy: [roam~] matches roams and foam date range

134 Search Lucene’s index (step 2)
Search the index

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public void searchIdx(String idxPath) throws IOException {
    Directory fsDir = FSDirectory.getDirectory(idxPath, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    Hits hits = is.search(query);  // query built in step 1
}

135 Search Lucene’s index (step 3)
Display the results for (int i=0;i<hits.length();i++) { Document doc=hits.doc(i); //show your results }

136 Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) About 7 terabytes of data; 700,000 users Majority of users still use Boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM Long, precise queries; proximity operators; incrementally developed; not like web search

137 Lucene workflow Document super_name: Spider-Man name: Peter Parker
category: superhero powers: agility, spider-sense
Diagram: the Document (with its Fields) goes through addDocument() to the IndexWriter, which builds the Lucene Index; a Query such as powers:agility goes through search() on an IndexSearcher, which returns Hits (matching docs).
Steps: Get the Lucene jar file. Write indexing code to get data and create Document objects to index with the IndexWriter. Write code to create query objects for search. Write code to use/display results.
Notes: Only a single IndexWriter may be open on an index. An IndexWriter is thread-safe, so multiple threads can add documents at the same time. Multiple IndexSearchers may be opened on an index; IndexSearchers are also thread-safe and can handle multiple searches concurrently. An IndexSearcher instance has a static view of the index: it sees no updates after it has been opened. An index may be concurrently added to and searched, but new additions won’t show up until the IndexWriter is closed and a new IndexSearcher is opened.

138 What we covered Indexing modes Inverted files Examples from Lucene
Storage and access advantages Posting files Examples from Lucene

