
Information Retrieval: Inverted Files

Document Vectors as Points on a Surface
Normalize all document vectors to be of length 1. Define $d' = d / |d|$. Then the ends of the vectors $d'$ all lie on a surface with unit radius.
For similar documents, we can represent parts of this surface as a flat region.
Similar documents are represented as points that are close together on this surface.
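The normalization step can be sketched in a few lines. This is a minimal illustration, assuming each document vector is a dictionary from term to weight; the function names `normalize` and `dot` and the sample vectors are hypothetical and not part of the slides.

```python
import math

def normalize(vec):
    """Return a copy of vec scaled to length 1 (vec maps term -> weight)."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    if length == 0:
        return dict(vec)  # a zero vector cannot be normalized
    return {term: w / length for term, w in vec.items()}

def dot(a, b):
    """Inner product over the terms that a and b share."""
    return sum(w * b[term] for term, w in a.items() if term in b)

# Two documents pointing in the same direction, one unrelated document.
d1 = normalize({"ant": 3.0, "bee": 4.0})
d2 = normalize({"ant": 6.0, "bee": 8.0})
d3 = normalize({"cat": 1.0})
```

After normalization, `d1` and `d2` coincide on the unit surface even though their original lengths differ, while `d3` shares no terms with either.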

Results of a Search
[Figure: scatter of documents found by the search (x), with the query and the hits from the search marked.]

Relevance Feedback (Concept)
[Figure: hits from the original search, with documents identified as non-relevant (x) and as relevant (o), the original query, and the reformulated query.]

Document Clustering (Concept)
[Figure: documents (x) grouped into clusters.]
Document clusters are a form of automatic classification.
A document may be in several clusters.

Use of Inverted Files for Calculating Similarities
In the term vector space, if $q$ is a query and $d_j$ a document, then $q$ and $d_j$ have no terms in common iff $q \cdot d_j = 0$.
1. To calculate all the non-zero similarities, find all the documents, $d_j$, that contain at least one term in the query:
   - Merge the inverted lists for each term $t_i$ in the query, with a logical OR, to establish a set of hits, $R$.
   - For each $d_j \in R$, calculate Similarity($q$, $d_j$), using appropriate weights.
2. Return the elements of $R$ in ranked order.
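The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: documents and queries are dictionaries from term to weight, similarity is the plain inner product, and the names `build_inverted_lists` and `rank` are hypothetical. The OR-merge is done implicitly, because a document enters the score table as soon as any query term's inverted list mentions it.

```python
from collections import defaultdict

def build_inverted_lists(docs):
    """docs: doc_id -> {term: weight}. Returns term -> {doc_id: weight}."""
    inverted = defaultdict(dict)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            inverted[term][doc_id] = weight
    return inverted

def rank(query, inverted):
    """Score only documents that share at least one term with the query."""
    scores = defaultdict(float)  # the keys of this table form the hit set R
    for term, q_weight in query.items():
        for doc_id, d_weight in inverted.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest similarity first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {
    "d1": {"ant": 1.0, "bee": 1.0},
    "d2": {"bee": 2.0},
    "d3": {"cat": 1.0},
    "d4": {"dog": 1.0},
}
inverted = build_inverted_lists(docs)
ranking = rank({"bee": 1.0, "cat": 0.5}, inverted)
```

Document `d4` never appears in `ranking`, because none of the query's inverted lists contains it, so its zero similarity is never computed.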

Representation of Inverted Files
- Document file: Stores the documents. Important for user interface design.
- Index (word list, vocabulary) file: Stores the list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). Often held in memory.
- Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.

Organization of Inverted Files
[Figure: the index file holds each term (ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its postings; the pointers lead to inverted lists in the postings file, which refer to the documents file.]

Decisions in Building Inverted Files: What is a Term?
- Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
- Is there a controlled vocabulary? If so, what words are included?
- List of stopwords.
- Rules to decide the beginning and end of words, e.g., spaces or punctuation.
- Character sequences not to be indexed, e.g., sequences of numbers.
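One possible set of answers to these questions is sketched below. Every choice here is an assumption made for illustration: lower-cased ASCII letters and digits as the character set, a tiny hypothetical stopword list, any other character as a word boundary, and pure digit sequences left unindexed.

```python
import re

# Assumed stopword list; a real system would use a much longer one.
STOPWORDS = {"the", "of", "a", "and", "in"}

# Assumed word rule: a run of lower-case ASCII letters or digits.
WORD = re.compile(r"[a-z0-9]+")

def terms(text):
    """Split text into index terms under the assumptions above."""
    result = []
    for token in WORD.findall(text.lower()):
        if token in STOPWORDS:
            continue  # stopwords are not indexed
        if token.isdigit():
            continue  # sequences of numbers are not indexed
        result.append(token)
    return result

sample = terms("The index of 1992 files, and B-trees in 2 parts")
```

Note how the punctuation rule splits "B-trees" into two terms; whether that is desirable is exactly the kind of decision the slide asks about.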

Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g.:
Regular expressions. If inverted files are stored in lexicographic order:
- comp* can be processed efficiently
- *comp cannot be processed efficiently
Boolean terms. If A and B are search terms:
- A or B can be processed by comparing two moderate-sized lists
- (not A) or (not B) requires two very large lists
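The difference between comp* and *comp can be seen in a short sketch. It assumes the vocabulary is a sorted Python list; the term list and the function names `prefix_matches` and `suffix_matches` are hypothetical. A prefix query finds its starting point by binary search and then reads a contiguous run, whereas a suffix query has to examine every term.

```python
import bisect

def prefix_matches(sorted_terms, prefix):
    """All terms starting with prefix: binary search, then a contiguous run."""
    start = bisect.bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]

def suffix_matches(sorted_terms, suffix):
    """All terms ending with suffix: no ordering helps, so scan everything."""
    return [term for term in sorted_terms if term.endswith(suffix)]

vocabulary = ["compact", "compile", "complete", "computer",
              "cone", "decompose", "recomp"]

comp_star = prefix_matches(vocabulary, "comp")
star_comp = suffix_matches(vocabulary, "comp")
```

A common workaround for suffix queries, not covered in the slides, is to keep a second index of reversed terms so that *comp becomes a prefix query on "pmoc".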

Efficiency Criteria
- Storage: Inverted files are big, typically 10% to 100% the size of the collection of documents.
- Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents, and (b) add a single document.
- Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.

Document File
The documents file stores the documents that are being indexed. The documents may be:
- primary documents, e.g., electronic journal articles
- surrogates, e.g., catalog records or abstracts

Document File
The storage of the document file may be:
- Central (monolithic): all documents stored together on a single server (e.g., library catalog)
- Distributed database: all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog)
- Highly distributed: documents are stored on independently managed servers (e.g., Web)
Each requires:
- a document ID, which is a unique identifier that can be used by the inverted file system to refer to the document, and
- a location counter, which can be used to specify a location within a document.

Docs File for Web Search System
For web search systems:
- A document is a web page.
- The documents file is the web.
- The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)

Postings File
The postings file stores the elements of a sparse matrix, the term assignment matrix.
It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
Each list consists of one or many individual postings.

Postings File: A Linked List for Each Term
[Figure: index terms (abacus, actor, aspen, atoll), each pointing to a linked list of postings.]
A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.

Length of Postings File
For a common term there may be a very large number of postings.
Example:
- 1,000,000,000 documents
- 1,000,000 distinct words
- average length 1,000 words per document
- $10^{12}$ postings
By Zipf's law, the 10th ranking word occurs approximately $(10^{12}/10)/10 = 10^{10}$ times.
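The arithmetic in this example can be checked directly. The sketch reads the slide as counting one posting per word occurrence (1,000,000,000 documents of 1,000 words each) and as using the form of Zipf's law in which rank times frequency is about one tenth of the total; the constant `c = 10` and the function name `zipf_occurrences` are assumptions taken from that reading.

```python
documents = 10**9
words_per_document = 1_000

# One posting per word occurrence in the collection (assumed reading).
postings = documents * words_per_document

def zipf_occurrences(rank, total, c=10):
    """Approximate occurrences of the word at a given rank: total / (c * rank)."""
    return total // (c * rank)

tenth_word = zipf_occurrences(10, postings)
```

Under these assumptions a single common term has an inverted list with on the order of ten billion postings, which is why the layout of the postings file matters so much.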

Postings File
Merging inverted lists is the most computationally intensive task in many information retrieval systems.
Since inverted lists may be long, it is important to match postings efficiently.
Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially.
For efficient matching, the inverted lists should all be sorted in the same sequence.
Inverted lists are commonly cached to minimize disk accesses.
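Sequential matching of two inverted lists can be sketched as a two-pointer merge. The example assumes each list is a Python list of document IDs sorted in ascending order, which is a simplification of real postings; the function names `intersect` and `union` are hypothetical.

```python
def intersect(a, b):
    """Documents in both sorted lists (Boolean AND), in one sequential pass."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1
    return result

def union(a, b):
    """Documents in either sorted list (Boolean OR), without duplicates."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])  # at most one of these two tails is non-empty
    result.extend(b[j:])
    return result

both = intersect([1, 4, 7, 9, 12], [2, 4, 9, 10, 12, 15])
either = union([1, 4, 7, 9, 12], [2, 4, 9, 10, 12, 15])
```

Each list is read once from front to back, which is why the common sort order matters: neither list needs random access while it is paged in from disk.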

Data for Calculating Weights
The calculation of weights requires extra data to be held in the inverted file system.
- For each term $t_j$ and document $d_i$: $f_{ij}$, the number of occurrences of $t_j$ in $d_i$
- For each term $t_j$: $n_j$, the number of documents containing $t_j$
- For each document $d_i$: $m_i$, the maximum frequency of any term in $d_i$
- For the entire document file: $n$, the total number of documents
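To show how these four quantities might be combined, here is a sketch of one common weighting scheme. The slide does not give a formula, so the choice below is an assumption: term frequency scaled by the document's maximum frequency, multiplied by a base-2 inverse document frequency, $w_{ij} = (f_{ij}/m_i) \cdot \log_2(n/n_j)$. The function name `weight` is hypothetical.

```python
import math

def weight(f_ij, m_i, n_j, n):
    """Assumed tf-idf style weight for term j in document i.

    f_ij: occurrences of the term in the document
    m_i:  maximum frequency of any term in the document
    n_j:  number of documents containing the term
    n:    total number of documents
    """
    term_frequency = f_ij / m_i
    inverse_document_frequency = math.log2(n / n_j)
    return term_frequency * inverse_document_frequency

# A term occurring twice in a document whose most frequent term occurs
# four times, found in 250 of 1,000 documents.
example = weight(f_ij=2, m_i=4, n_j=250, n=1000)
```

Whatever formula is chosen, it needs only the stored values listed above, so weights can be computed while the inverted lists are being merged.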

Index File: Individual Records for Each Term
The record for term $j$ in the index file contains:
- term $j$
- pointer to the inverted (postings) list for term $j$
- number of documents in which term $j$ occurs ($n_j$)

Index Files
On disk: If an index is held on disk, search time is dominated by the number of disk accesses.
In memory: Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, on average 100 characters. The size of the index is 100 megabytes, which can easily be held in the memory of a dedicated computer.

Index File Structures: Linear Index
Advantages:
- Can be searched quickly, e.g., by binary search, O(log n)
- Good for lexicographic processing, e.g., comp*
- Convenient for batch updating
- Economical use of storage
Disadvantages:
- Index must be rebuilt if an extra term is added
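A linear index can be sketched as a sorted array searched with the standard library's `bisect` module. The entries, the integer "pointers" (imagined here as offsets into a postings file) and the function name `lookup` are all hypothetical.

```python
import bisect

# Sorted (term, pointer) pairs; the pointers are made-up offsets.
index = [("ant", 0), ("bee", 120), ("cat", 310), ("dog", 335),
         ("elk", 600), ("fox", 642), ("gnu", 900), ("hog", 915)]
terms = [term for term, _ in index]

def lookup(term):
    """Binary search, O(log n): return the postings pointer or None."""
    position = bisect.bisect_left(terms, term)
    if position < len(terms) and terms[position] == term:
        return index[position][1]
    return None

cat_pointer = lookup("cat")
missing = lookup("cow")
```

The storage is compact because entries sit side by side with no links, but that is also why adding a term is awkward: every later entry has to move, or the array has to be rebuilt.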

Index File Structures: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
[Figure: the binary search tree produced by inserting the terms in the input order above, with elk at the root.]
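The tree for this input can be rebuilt with ordinary, unbalanced binary search tree insertion, which is what the sketch below assumes. The class and function names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the tree rooted at root and return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicates are ignored

def in_order(root):
    """Keys in lexicographic order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

def depth(root):
    """Number of nodes on the longest path from the root to a leaf."""
    if root is None:
        return 0
    return 1 + max(depth(root.left), depth(root.right))

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
```

With this insertion order, elk becomes the root with bee and hog as its children, and the eight terms end up spread over four levels.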

Binary Tree
Advantages:
- Can be searched quickly
- Convenient for batch updating
- Easy to add an extra term
- Economical use of storage
Disadvantages:
- Less good for lexicographic processing, e.g., comp*
- Tree tends to become unbalanced
- If the index is held on disk, it is important to optimize the number of disk accesses

Binary Tree
Calculation of maximum depth of tree. Illustrates the importance of balanced trees.
- Worst case: depth $= n$, which is $O(n)$
- Ideal case: depth $= \log(n + 1)/\log 2$, which is $O(\log n)$
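The two cases can be compared numerically. This sketch rounds the ideal depth up to a whole number of levels, i.e. it computes $\lceil \log_2(n + 1) \rceil$, and uses Python's `int.bit_length`, which equals that value for every non-negative integer and avoids floating-point rounding. The function names are hypothetical.

```python
def worst_case_depth(n):
    """Every node has one child, e.g. when terms are inserted in sorted order."""
    return n

def ideal_depth(n):
    """Smallest depth that can hold n nodes: ceil(log2(n + 1))."""
    return n.bit_length()

# Depths for an index of one million terms.
million_worst = worst_case_depth(1_000_000)
million_ideal = ideal_depth(1_000_000)
```

For a million terms the gap is a factor of fifty thousand, which on disk translates directly into the number of accesses per lookup.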

Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.
Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e., the link to the successor, of each node is maintained.
Can be used for lexicographic processing.
A good data structure when the index is held in memory.
Knuth vol 1, 2.3.1, page 325.
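A right-threaded tree can be sketched as below. This is an illustrative implementation, not Knuth's: each node carries a hypothetical `is_thread` flag that says whether its `right` link is a real child or a thread to the in-order successor, and all names are assumptions.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None       # right child, or thread to the successor
        self.is_thread = False  # True when right is a thread

def insert(root, key):
    """Insert key, maintaining right threads, and return the root."""
    new = Node(key)
    if root is None:
        return new
    current = root
    while True:
        if key < current.key:
            if current.left is None:
                new.right = current  # the parent is the new node's successor
                new.is_thread = True
                current.left = new
                return root
            current = current.left
        elif key > current.key:
            if current.right is None or current.is_thread:
                new.right = current.right  # inherit the parent's successor
                new.is_thread = current.is_thread
                current.right = new
                current.is_thread = False
                return root
            current = current.right
        else:
            return root  # duplicates are ignored

def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node

def in_order(root):
    """Lexicographic traversal with no recursion and no stack."""
    keys = []
    current = leftmost(root)
    while current is not None:
        keys.append(current.key)
        if current.is_thread:
            current = current.right  # follow the thread
        else:
            current = leftmost(current.right)
    return keys

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
ordered = in_order(root)
```

Because every node can reach its successor directly, a query such as comp* can locate the first matching term and then walk forward along the threads.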

Right Threaded Binary Tree
[Figure: example of a right-threaded binary tree. From: Robert F. Rossa]

B-trees
B-tree of order $m$: a balanced, multiway search tree.
- Each node stores many keys.
- Root has between 2 and $2m$ keys. All other internal nodes have between $m$ and $2m$ keys.
- If $k_i$ is the $i$-th key in a given internal node, then all keys in the $(i-1)$-th child are smaller than $k_i$, and all keys in the $i$-th child are bigger than $k_i$.
- All leaves are at the same depth.

B-trees
[Figure: B-tree example (order 2). Every arrow points to a node containing between 2 and 4 keys.]
A node with $k$ keys has $k + 1$ pointers.
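Searching a B-tree follows directly from the key ordering rule. The sketch below builds a small order-2 tree by hand rather than implementing insertion; the keys, the class name `BTreeNode` and the function name `search` are hypothetical. Children are numbered from 0, so `children[i]` holds the keys that fall between `keys[i - 1]` and `keys[i]`.

```python
import bisect

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys within this node
        self.children = children or []  # len(keys) + 1 children, or none for a leaf

def search(node, key):
    """Return True if key is stored anywhere in the tree rooted at node."""
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False  # reached a leaf without finding the key
        node = node.children[i]
    return False

# Order 2: every node holds between 2 and 4 keys.
root = BTreeNode(
    [20, 40],
    [BTreeNode([5, 10, 15]), BTreeNode([25, 30]), BTreeNode([45, 50, 55, 60])],
)
```

Each step down the tree inspects one node, so when a node is sized to fit a disk block, a lookup costs one disk access per level.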

B+-tree
A B-tree is used as an index.
Data is stored in the leaves of the tree, known as buckets.
[Figure: example of a B+-tree of order 2 with bucket size 4; the index nodes point to data buckets (D) at the leaves.]