Manning, Raghavan, Schutze

Manning, Raghavan, Schutze
IR Indexing Thanks to B. Arms SIMS Baldi, Frasconi, Smyth Manning, Raghavan, Schutze

What we have covered What is IR Evaluation
Tokenization, terms and properties of text Web crawling; robots.txt Vector model of documents Bag of words Queries Measures of similarity This presentation Indexing Inverted files

Last time Vector models of documents and queries Similarity measures
Bag of words model Similarity measures Text similarity typically used Similarity is a measure of relevance (and ranking) Query is a small document Match query to document All stored and indexed before a query is matched.

Summary: What’s the point of using vector spaces?
A well-formed algebraic space for retrieval Key: A user’s query can be viewed as a (very) short document. Query becomes a vector in the same space as the docs. Can measure each doc’s proximity to it. Natural measure of scores/ranking – no longer Boolean. Queries are expressed as bags of words Other similarity measures: see for a survey

A Typical Web Search Engine
YouSeer Nutch LucidWorks Solr/Lucene Index Query Engine Interface Indexer Users Crawler Web A Typical Web Search Engine

A Typical Web Search Engine
Indexing Index Query Engine Interface Indexer Users Crawler Web A Typical Web Search Engine

Why indexing? For efficient searching for documents with unstructured text (not databases) using terms as queries Databases can still be searched with an index Sphinx Online sequential text search (grep) Small collection Text volatile Data structures - indexes Large, semi-stable document collection Efficient search

Unstructured vs structured data
What’s available? Web Organizations You Unstructured usually means “text” Structured usually means “databases” Semistructured somewhere in between

Unstructured (text) vs. structured (database) data companies in 1996

Unstructured (text) vs. structured (database) data companies in 2006

IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in “tables” Employee Manager Salary Smith Jones 50000 Chang Smith 60000 Ivy Smith 50000 Typically allows numerical range and exact match (for text) queries, e.g., Salary < AND Manager = Smith.

Unstructured data Typically refers to free text Allows
Keyword queries including operators More sophisticated “concept” queries e.g., find all web pages dealing with drug abuse Classic model for searching text documents

Semi-structured data In fact almost no data is “unstructured”
E.g., this slide has distinctly identified zones such as the Title and Bullets Facilitates “semi-structured” search such as Title contains data AND Bullets contain search … to say nothing of linguistic structure

More sophisticated semi-structured search
Title is about Object Oriented Programming AND Author something like stro*rup where * is the wild-card operator Issues: how do you process “about”? how do you rank results? The focus of XML search.

Clustering and classification
Given a set of docs, group them into clusters based on their contents. Given a set of topics, plus a new doc D, decide which topic(s) D belongs to. Not discussed in this course

The web and its challenges
Unusual and diverse documents Unusual and diverse users, queries, information needs Beyond terms, exploit ideas from social networks link analysis, clickstreams ... How do search engines work? And how can we make them better?

More sophisticated information retrieval not covered here
Cross-language information retrieval Question answering Summarization Text mining …

Search Subsystem query parse query query tokens ranked document set
stop list* non-stoplist tokens ranking* stemming* stemmed terms Boolean operations* retrieved document set *Indicates optional operation. Index database relevant document set

Lexical Analysis: Tokens
What is a token? Free text indexing A token is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word. Usually, tokens are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc.

Decisions in Building Inverted Files: What is the Term/Token?
• Underlying character set, e.g., printable ASCII, Unicode, UTF8. • Is there a controlled vocabulary? If so, what words are included? Stemming? • List of stopwords. • Rules to decide the beginning and end of words, e.g., spaces or punctuation. • Character sequences not to be indexed, e.g., sequences of numbers.

Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g., Regular expressions If inverted files are stored in lexicographic order, comp* can be processed efficiently *comp cannot be processed efficiently Boolean terms If A and B are search terms A or B can be processed by comparing two moderate sized lists (not A) or (not B) requires two very large lists

Major Indexing Methods
Inverted index effective for very large collections of documents associates lexical items to their occurrences in the collection Positional index Non-positional indexs Block-sort Suffix trees and arrays Faster for phrase searches; harder to build and maintain Signature files Word oriented index structures based on hashing (usually not used for large texts)

Sparse Vectors Terms (tokens) as vectors
Vocabulary and therefore dimensionality of vectors can be very large, ~104 . However, most documents and queries do not contain most words, so vectors are sparse (i.e. most entries are 0). Need efficient methods for storing and computing with sparse vectors.

Sparse Vectors as Lists
Store vectors as linked lists of non-zero-weight tokens paired with a weight. Space proportional to number of unique terms (n) in document. Requires linear search of the list to find (or change) the weight of a specific term. Requires quadratic time in worst case to compute vector for a document:

Sparse Vectors as Trees
Index tokens in a document in a balanced binary tree or trie with weights stored with tokens at the leaves. memory  < film variable Balanced Binary Tree   < < bit 2 film 1 memory 1 variable 2

Sparse Vectors as Trees (cont.)
Space overhead for tree structure: ~2n nodes. O(log n) time to find or update weight of a specific term. O(n log n) time to construct vector. Need software package to support such data structures.

Sparse Vectors as Hash Tables
Hashing: well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array Store tokens in hash table, with token string as key and weight as value. Storage overhead for hash table ~1.5n Table must fit in main memory. Constant time to find or update weight of a specific token (ignoring collisions). O(n) time to construct vector (ignoring collisions).

Implementation Based on Inverted Files
In practice, document vectors are not stored directly; an inverted organization provides much better efficiency. The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree). Critical issue is logarithmic or constant-time access to token information.

Index File Types - Lucene
Field File Field infos: .fnm Stored Fields field index: .fdx field data: .fdt Term Dictionary file Term infos: .tis Term info index: .tii Frequency file: .frq Position file: .prx

Representation of Inverted Files
Field File Field infos: Field names are stored in the field info file Stored Fields Field index: contains, for each document, a pointer to its field data Field data: contains the stored fields of each document Term Dictionary file (stores Document Frequency) Term infos: This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text Term info index: contains entries and locations in term infos file. This is designed to be read entirely into memory and used to provide random access to the term infos file. Frequency file (posting file): contains the lists of documents which contain each term, along with the frequency of the term in that document Ordered by term (the term is implicit, from the term infos file). Term Frequency entries are ordered by increasing document number. Position file (Inverted index file): contains the lists of positions that each term occurs at within documents. Can be used to highlight search results, etc. Positions entries are ordered by increasing document number

Organization of Inverted Files
Term Frequency (Posting file) Term Position File (Inverted index) Term Dictionary Prefix Length Suffix Field Num Doc Freq cat 1 2 3 hy dog 6 om Term (implicit) DocDelta, Freq? cat 3, 2, 2 cathy 3, 3, 7 dog 3, 3, 5, 3, 4, 2, 7 doom 7 Term(implicit) Doc1 Doc2 … DocN cat 1 2, 1 2 cathy 3 dog 6 doom Document Files

Efficiency Criteria Storage
Inverted files are big, typically 10% to 100% the size of the collection of documents. Update performance It must be possible, with a reasonable amount of computation, to: (a) Add a large batch of documents (b) Add a single document Retrieval performance Retrieval must be fast enough to satisfy users and not use excessive resources.

Document File The documents file stores the documents that are being indexed. The documents may be: • primary documents, e.g., electronic journal articles • surrogates, e.g., catalog records or abstracts

Postings File Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be long, it is important to match postings efficiently. Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.

Document File The storage of the document file may be:
Central (monolithic) - all documents stored together on a single server (e.g., library catalog) Distributed database - all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog) Highly distributed - documents are stored on independently managed servers (e.g., Web) Each requires: a document ID, which is a unique identifier that can be used by the inverted file system to refer to the document, and a location counter, which can be used to specify location within a document.

Documents File for Web Search System
For web search systems: • A document is a web page. • The documents file is the web. • The document ID is the URL of the document. Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache. (In addition to the usual index file and postings file the indexing system stores contextual information, which will be discussed in a later lecture.)

Term Frequency (Postings File)
The postings file stores the elements of a sparse matrix, the term assignment matrix. It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file. Each element in an inverted list is called a posting, i.e., the occurrence on a term in documents Each list consists of one or many individual postings.

Postings File in Lucene: A Linked List for Each Term
1 abacus : 3, 6, 5 meaning: “abacus” appeared once in document 1 ( |3/2| ), 5 times in document 4 ( 1+|6/2| ). 2 abc : 3, 5, 2, 3 meaning: “abc” appeared once in document 1 ( |3/2| ), once in document 3 ( 1+ |5/2| ), 3 times in document 4 ( 3+|2/2| ). A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.

Length of Postings File
For a common term there may be very large numbers of postings for a given term. Example: 1,000,000,000 documents 1,000,000 distinct words average length 1,000 words per document 1012 postings By Zipf's law, the 10th ranking word occurs, approximately: (1012/10)/10 times = 1010 times

Inverted Index Primary data structure for text indexes
Basically two elements: (Vocabulary, Occurrences) Main Idea: Invert documents into a big index Basic steps: Make a “dictionary” of all the terms/tokens in the collection For each term/token, list all the docs it occurs in. Possibly location in document (Lucene stores the positions) Compress to reduce redundancy in the data structure Also reduces I/O and storage required

Inverted Indexes We have seen “Vector files”. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

How Are Inverted Files Created
Documents are parsed one document at a time to extract tokens/terms. These are saved with the Document ID (DID). <term, DID> Doc 1 Doc 2 Now is the time for all good men to come to the aid of their country It was a dark and stormy night in the country manor. The time was past midnight

How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically and in document order.

Multiple term/token entries for a single document are merged. Within-document term frequency information is compiled. Result <term,DID,tf> <the,1,2>

Then the inverted file can be split into A Dictionary file File of unique terms Fit in memory if possible and A Postings file File of what document the term/token is in and how often. Sometimes where the term is in the document. Store on disk Worst case O(n); n size of tokens.

Dictionary and Posting Files
Dictionary Posting

Inverted indexes Permit fast search for individual terms
For each term/token, you get a list consisting of: document ID (DID) frequency of term in doc (optional, implied in Lucene) position of term in doc (optional, Lucene) <term,DID,tf,position> <term,(DIDi,tf,positionij),…> Lucene: <positionij,…> (term and DID are implied from other files) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2

How Inverted Files are Used
Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfied the query. Dictionary Postings

Inverted index Associates a posting list with each term
POSTING LIST example a (d1, 1) … the (d1,2) (d2,2) Replace term frequency(tf) with tfidf Lucene only uses tf, idf added at query time Compress index and put hash links Hashing (a type of compression) speeds up the access Match query to index and rank

Position in inverted file posting
POSTING LIST example now (1, 1) … time (1,4) (2,13) Doc 1 Doc 2 Now is the time for all good men to come to the aid of their country It was a dark and stormy night in the country manor. The time was past midnight

Change weight Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.

Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) About 7 terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM Long, precise queries; proximity operators; incrementally developed; not like web search

Time Complexity of Indexing
Complexity of creating vector and indexing a document of n terms is O(n). So building an index of m such documents is O(m n). Computing vector lengths is also O(m n), which is also the complexity of reading in the corpus.

Retrieval with an Inverted Index
Terms that are not in both the query and the document do not effect cosine similarity. Product of term weights is zero and does not contribute to the dot product. Usually the query is fairly short, and therefore its vector is extremely sparse. Use inverted index to find the limited set of documents that contain at least one of the query words. Retrieval time O(log M) due to hashing where M is the size of the document collection.

Inverted Query Retrieval Efficiency
Assume that, on average, a query word appears in B documents: Then retrieval time is O(|Q| B), which is typically, much better than naïve retrieval that examines all M documents, O(|V| M), because |Q| << |N| and B << M. |V| size of vocabulary Q = q q … qN D11…D1B D21…D2B Dn1…DnB

Processing the Query Incrementally compute cosine similarity of each indexed document as query words are processed one by one. To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where DocumentReference is the key and the partial accumulated score is the value.

Index Files On disk If an index is held on disk, search time is dominated by the number of disk accesses. In memory Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.

Index File Structures: Linear Index
Advantages Can be searched quickly, e.g., by binary search, O(log M) Good for sequential processing, e.g., comp* Convenient for batch updating Economical use of storage Disadvantages Index must be rebuilt if an extra term is added

Documents File for Web Search System
For Web search systems: • A Document is a Web page. • The Documents File is the Web. • The Document ID is the URL of the document. Indexes are built using a Web crawler, which retrieves each page on the Web (or a subset). After indexing each page is discarded, unless stored in a cache. (In addition to the usual index file and postings file the indexing system stores special information)

Index File Structures: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog elk bee hog ant cat fox gnu dog

Binary Tree Advantages Can be searched quickly
Convenient for batch updating Easy to add an extra term Economical use of storage Disadvantages Less good for lexicographic processing, e.g., comp* Tree tends to become unbalanced If the index is held on disk, important to optimize the number of disk accesses

Binary Tree Calculation of maximum depth of tree.
Illustrates importance of balanced trees. Worst case: depth = n O(n) Ideal case: depth = log(n + 1)/log 2 O(log n)

Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when index held in memory Knuth vol 1, 2.3.1, page 325.

Right Threaded Binary Tree
From: Robert F. Rossa

B-trees B-tree of order m: A balanced, multiway search tree:
• Each node stores many keys • Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. • If ki is the ith key in a given internal node -> all keys in the (i-1)th child are smaller than ki -> all keys in the ith child are bigger than ki • All leaves are at the same depth

B-trees B-tree example (order 2) 50 65 55 59 10 19 35 70 90 98 66 68
50 65 55 59 66 68 36 47 72 73 Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.

B+-tree • A B-tree is used as an index
• Data is stored in the leaves of the tree, known as buckets Example: B+-tree of order 2, bucket size 4 50 65 10 25 55 59 ... D9 D D54 D66... D81 ...

Time Complexity of Indexing
Complexity of creating vector and indexing a document of n tokens is O(n). So indexing m such documents is O(m n). Computing token IDFs for a vocabularly V is O(|V|). Computing vector lengths is also O(m n). Since |V|  m n, complete process is O(m n), which is also the complexity of just reading in the corpus.

Index on disk vs. memory Most retrieval systems keep the dictionary in memory and the postings on disk Web search engines frequently keep both in memory massive memory requirement feasible for large web service installations less so for commercial usage where query loads are lighter

Indexing in the real world
Typically, don’t have all documents sitting on a local filesystem Documents need to be crawled Could be dispersed over a WAN with varying connectivity Must schedule distributed spiders/indexers Could be (secure content) in Databases Content management applications applications

Content residing in applications
Mail systems/groupware, content management contain the most “valuable” documents http often not the most efficient way of fetching these documents - native API fetching Specialized, repository-specific connectors These connectors also facilitate document viewing when a search result is selected for viewing

Secure documents Each document is accessible to a subset of users
Usually implemented through some form of Access Control Lists (ACLs) Search users are authenticated Query should retrieve a document only if user can access it So if there are docs matching your search but you’re not privy to them, “Sorry no results found” E.g., as a lowly employee in the company, I get “No results” for the query “salary roster” Solr has this with post filtering

Users in groups, docs from groups
Index the ACLs and filter results by them Often, user membership in an ACL group verified at query time – slowdown Documents Users 0/1 0 if user can’t read doc, 1 otherwise.

Compound documents What if a doc consisted of components
Each component has its own ACL. Your search should get a doc only if your query meets one of its components that you have access to. More generally: doc assembled from computations on components e.g., in Lotus databases or in content management systems How do you index such docs? No good answers …

“Rich” documents (How) Do we index images?
Researchers have devised Query based on Image Content (QBIC) systems “show me a picture similar to this orange circle” Then use vector space retrieval In practice, image search usually based on meta- data such as file name e.g., monalisa.jpg New approaches exploit social tagging E.g., flickr.com

Passage/sentence retrieval
Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence - say, in a very long document Can index passages/sentences as mini- documents – what should the index units be? This is the subject of XML search

Indexing Subsystem documents Documents assign document IDs text
document numbers and *field numbers break into tokens tokens stop list* non-stoplist tokens stemming* *Indicates optional operation. stemmed terms term weighting* terms with weights Index database

Search Subsystem query parse query query tokens ranked document set
stop list* non-stoplist tokens ranking* stemming* stemmed terms Boolean operations* retrieved document set *Indicates optional operation. Index database relevant document set

Lucene’s index (conceptual)
Document Document Field Field Field Document Name Value Field Document Field

Create a Lucene index (step 1)
Create Lucene document and add fields import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; public void createDoc(String title, String body) { Document doc=new Document( ); doc.add(Field.Text(“Title”, title)); doc.add(Field.Text(“Body”,body); }

Create an Analyser Options WhitespaceAnalyzer divides text at whitespace SimpleAnalyzer divides text at non-letters convert to lower case StopAnalyzer removes stop words StandardAnalyzer good for most European Languages

Create an index writer, add Lucene document into the index import java.IOException; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.analysis.standard.StandardAnalyser; public void WriteDoc(Document doc, String idxPath) { try{ IndexWriter writer = new IndexWriter(idxPath, new StandardAnalyzer( ), true); writer.addDocument(doc); writer.close( ); } catch (IOException exp) {System.out.println(“I/O Error!”);}

Luence Index – Behind the Scene
Inverted Index (Inverted File) Posting id word doc offset 1 football Doc 1 3 67 Doc 2 2 penn players 4 state 13 Doc 1: Penn State Football … football Posting Table Doc 2: Football players … State

Posting table Posting table is a fast look-up mechanism
Key: word Value: posting id, satellite data (#df, offset, …) Lucene implements the posting table with Java’s hash table Objectified from java.util.Hashtable Hash function depends on the JVM hc2 = hc1 * 31 + nextChar Posting table usage Indexing: insertion (new terms), update (existing terms) Searching: lookup, and construct document vector

Search Lucene’s index (step 1)
Construct an query (automatic) import org.apache.lucene.search.Query; import org.apache.lucene.queryParser.QueryParser; import org.apache.lucene.analysis.standard.StandardAnalyser; public void formQuery(String ques) { Query query = new QueryParser.parse( ques, “body”, new StandardAnalyser( ) ); }

Types of query: Boolean: [IST441 Giles] [IST441 OR Giles] [java AND NOT SUN] wildcard: [nu?ch] [nutc*] phrase: [“JAVA TOMCAT”] proximity: [“lucene nutch” ~10] fuzzy: [roam~] matches roams and foam date range …

Search the index import org.apache.lucene.document.Document; import org.apache.lucene.search.*; import org.apache.lucene.store.*; public void searchIdx(String idxPath) { Directory fsDir=FSDirectory.getDirectory(idxPath,false); IndexSearcher is=new IndexSearcher(fsDir); Hits hits = is.search(query); }

Display the results for (int i=0;i<hits.length();i++) { Document doc=hits.doc(i); //show your results }

Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) About 7 terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM Long, precise queries; proximity operators; incrementally developed; not like web search 90

Lucene workflow Document super_name: Spider-Man name: Peter Parker
category: superhero powers: agility, spider-sense Hits (Matching Docs) Query (powers:agility) addDocument() search() Get Lucene jar file Write indexing code to get data and create Document objects Write code to create query objects Write code to use/display results IndexWriter IndexSearcher Get Lucene jar file Write indexing code to get data and create Document objects to index with the IndexWriter Write code to create query objects for search Write code to use/display results Only a single IndexWriter may be open on an index An IndexWriter is thread-safe, so multiple threads can add documents at the same time. Multiple IndexSearchers may be opened on an index IndexSearchers are also thread safe, and can handle multiple searches concurrently an IndexSearcher instance has a static view of the index... it sees no updates after it has been opened An index may be concurrently added to and searched, but new additions won’t show up until the IndexWriter is closed and a new IndexSearcher is opened. Lucene Index

What we covered Indexing modes Inverted files Examples from Lucene
Storage and access advantages Posting files Examples from Lucene 92

Manning, Raghavan, Schutze

Similar presentations

Presentation on theme: "Manning, Raghavan, Schutze"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Manning, Raghavan, Schutze

Similar presentations

Presentation on theme: "Manning, Raghavan, Schutze"— Presentation transcript:

Similar presentations

About project

Feedback