1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.

Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4

2 Course Administration Assignment 1 has been posted. It is a programming assignment and is due on Saturday, September 17 at midnight. Follow the submission instructions carefully. Send questions to

3 Organization of Files for Full Text Searching [Figure: Word list (columns: Term, Pointer to postings; terms ant, bee, cat, dog, elk, fox, gnu, hog), Postings (inverted lists), and Documents store.]

4 Representation of Inverted Files Document store: Stores the documents. Important for user interface design. [Repositories for the storage of document collections are covered in CS 431.] Word list (vocabulary file): Stores list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries, (lexicographic index). Often held in memory. Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.

5 Document Store The documents store holds the corpus that is being indexed. The corpus may be: primary documents, e.g., electronic journal articles or Web pages; or surrogates, e.g., catalog records or abstracts, which refer to the primary documents.

6 Document Store The storage of the document store may be: Central (monolithic): all documents stored together on a single server (e.g., library catalog). Distributed database: all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog). Highly distributed: documents stored on independently managed servers (e.g., the Web). Each requires: a document ID, which is a unique identifier that can be used by the search system to refer to the document, and a location counter, which can be used to specify the location of words or characters within a document.

7 Documents Store for Web Search Systems For Web search systems: A document is a Web page. The documents store is the Web. The document ID is the URL of the document. Indexes are built using a web crawler, which retrieves each page on the Web for indexing. After indexing, the local copy of each page is discarded, unless stored in a cache. (In addition to the usual word list and postings file the indexing system stores contextual information, which will be discussed in a later lecture.)

8 Inverted File Inverted file: An inverted file is a list of search terms that are organized for associative look-up, i.e., to answer the questions: In which documents does a specified search term appear? Where within each document does each term appear? (There may be several occurrences.) The word list and the postings file together provide an inverted file system for free text searching. In addition, they contain the data needed to calculate weights and information that is used to display results.

9 Inverted File -- Basic Concept [Table: Word | Document. Rows: abacus; actor (document numbers for these two rows were lost in extraction); aspen | 5; atoll | 11, 34.] Stop words are removed before building the index.

10 Inverted List -- Definitions Inverted List: A list of all the entries in an inverted file that apply to a specific word, e.g., abacus. Posting: Entry in an inverted list that applies to a single instance of a term within a document, e.g., there are three postings for "abacus": abacus 3, abacus 19, abacus 22.
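The definitions above can be illustrated with a minimal sketch that builds one inverted list per term. The function name `build_inverted_file`, the stop list, and the sample documents are hypothetical, and the sketch assumes one posting per (term, document) pair.

```python
from collections import defaultdict

# Hypothetical stop list; a real system would use a much longer one.
STOP_WORDS = {"the", "a", "of", "and", "in"}

def build_inverted_file(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:      # stop words are not indexed
                inverted[word].add(doc_id)
    # Keep terms in alphabetical order and each inverted list in ID order.
    return {term: sorted(ids) for term, ids in sorted(inverted.items())}

# Hypothetical three-document corpus.
docs = {3: "abacus and actor", 5: "the aspen", 11: "atoll of abacus"}
index = build_inverted_file(docs)
```

Here `index["abacus"]` is the inverted list for "abacus", and each document ID in it stands for one posting.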

11 Use of Inverted Files for Calculating Similarities In the term vector space, if q is a query and d_j a document, then q and d_j have no terms in common iff q · d_j = 0. 1. To calculate all the non-zero similarities, find R, the set of all the documents, d_j, that contain at least one term in the query. 2. Merge the inverted lists for each term t_i in the query, with a logical or, to establish the set, R. 3. For each d_j ∈ R, calculate Similarity(q, d_j), using appropriate weights. 4. Return the elements of R in ranked order.
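The steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: the similarity is taken to be a plain unnormalized dot product, and the names `rank`, `inverted_lists`, and `doc_weights` as well as the sample data are hypothetical.

```python
def rank(query_weights, inverted_lists, doc_weights):
    """Return (doc_id, score) pairs for documents sharing a term with the query."""
    # Merge the inverted lists of the query terms with a logical OR to get R.
    candidates = set()
    for term in query_weights:
        candidates.update(inverted_lists.get(term, []))
    # Compute a similarity only for documents in R (assumed: dot product).
    scores = {}
    for doc_id in candidates:
        weights = doc_weights[doc_id]
        scores[doc_id] = sum(w * weights.get(t, 0.0)
                             for t, w in query_weights.items())
    # Return R in ranked order; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data: term -> sorted doc IDs, and doc ID -> term weights.
inverted_lists = {"ant": [1, 2], "bee": [2, 3], "cat": [4]}
doc_weights = {1: {"ant": 1.0}, 2: {"ant": 1.0, "bee": 2.0},
               3: {"bee": 1.0}, 4: {"cat": 1.0}}
ranked = rank({"ant": 1.0, "bee": 1.0}, inverted_lists, doc_weights)
```

Document 4 shares no term with the query, so it never enters R and no similarity is computed for it.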

12 Enhancements to Inverted Files -- Concept Location: Each posting holds information about the location of each term within the document. Uses: user interface design (highlighting the location of search terms); adjacency and near operators (in Boolean searching). Frequency: Each inverted list includes the number of postings for each term. Uses: term weighting; query processing optimization.

13 Inverted File -- Concept (Enhanced) [Table: Word | Postings | Document | Location, with rows for abacus, actor, aspen, and atoll; the inverted list for the term actor is highlighted. Cell values were lost in extraction.]
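A minimal sketch of the location enhancement, assuming locations are word offsets within each document. The names `build_positional_index` and `adjacent` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> doc ID -> list of word positions (the 'location' data)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def adjacent(index, first, second):
    """Return sorted IDs of documents where `second` directly follows `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical two-document corpus.
docs = {1: "information retrieval systems", 2: "retrieval of information"}
positional = build_positional_index(docs)
matches = adjacent(positional, "information", "retrieval")
```

The frequency of a term in a document is simply the length of its position list, e.g. `len(positional["information"][1])`.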

14 Lexicographic Order It is important that the word list can be processed sequentially, i.e., in alphabetic order. Uses: to search with wild cards, e.g., comp*, which expands to every term beginning with the letters "comp"; to list results for browsing lists of search terms. Alphabetic order is a special case of the mathematical concept of lexicographic order.
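A short sketch of why sequential (alphabetic) order helps with a trailing wild card such as comp*: all matching terms are contiguous in the sorted word list. The function name `expand_prefix` and the sample word list are hypothetical.

```python
from bisect import bisect_left

def expand_prefix(word_list, prefix):
    """Return every term in a sorted word list that begins with `prefix`."""
    start = bisect_left(word_list, prefix)   # first term >= prefix
    end = start
    # Matching terms are contiguous, so scan forward until one fails.
    while end < len(word_list) and word_list[end].startswith(prefix):
        end += 1
    return word_list[start:end]

# Hypothetical word list, already in alphabetic order.
word_list = ["come", "compact", "compile", "computer", "cone"]
expansion = expand_prefix(word_list, "comp")
```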

15 Postings File The postings file stores the elements of a sparse matrix, the term assignment matrix, with weights. It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file. Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. Each list consists of one or many individual postings.

16 Postings File: A Linked List for Each Term [Figure: a linked list of postings for each of the terms abacus, actor, aspen, and atoll.] A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.

17 Length of Postings File For a common term there may be a very large number of postings. Example: 1,000,000,000 documents; 1,000,000 distinct words; average length 1,000 words per document; 10^12 postings. By Zipf's law, the 10th ranking word occurs approximately (10^12/10)/10 = 10^10 times.
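The arithmetic on this slide can be checked with a few lines. The sketch reads the slide's expression as (total postings / 10) / rank, taking the first division by 10 to be a Zipf constant of about 0.1; that reading is an assumption, and the function name `zipf_occurrences` is hypothetical.

```python
documents = 10**9            # 1,000,000,000 documents
words_per_document = 10**3   # average length 1,000 words
total_postings = documents * words_per_document

def zipf_occurrences(total, rank):
    """Estimate occurrences of the rank-th most common word as (total/10)/rank."""
    return total // 10 // rank

tenth_word = zipf_occurrences(total_postings, 10)
```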

18 Postings File Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be long, it is important to match postings efficiently. Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.
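A minimal sketch of sequential matching, assuming both inverted lists are sorted by document ID. It shows an AND merge with two cursors that only move forward; the function name `intersect` and the sample lists are hypothetical.

```python
def intersect(list_a, list_b):
    """Return document IDs present in both sorted inverted lists."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1          # advance the list with the smaller ID
        else:
            j += 1
    return result

common = intersect([3, 19, 22, 40], [2, 19, 29, 40])
```

Each list is read once from front to back, which suits lists that are paged in from disk.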

19 Data for Calculating Weights The calculation of weights requires extra data to be held in the inverted file system. For each term t_j and document d_i: f_ij, the number of occurrences of t_j in d_i. For each term t_j: n_j, the number of documents containing t_j. For each document d_i: m_i, the maximum frequency of any term in d_i. For the entire document file: n, the total number of documents.
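The four quantities on this slide can be gathered in one pass over a corpus. This is a sketch assuming whitespace tokenization and non-empty documents; the function name `collect_statistics` and the sample documents are hypothetical.

```python
from collections import Counter

def collect_statistics(documents):
    """Return (f, n_j, m, n) as defined on the slide."""
    # f[i][t]: number of occurrences of term t in document i.
    f = {doc_id: Counter(text.lower().split())
         for doc_id, text in documents.items()}
    # n_j[t]: number of documents containing term t.
    n_j = Counter()
    for counts in f.values():
        n_j.update(counts.keys())
    # m[i]: maximum frequency of any term in document i (documents non-empty).
    m = {doc_id: max(counts.values()) for doc_id, counts in f.items()}
    # n: total number of documents.
    return f, n_j, m, len(documents)

f, n_j, m, n = collect_statistics({1: "ant bee ant", 2: "bee cat"})
```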

20 Word List: Individual Records for Each Term The record for term j in the word list contains: term j; a pointer to the inverted (postings) list for term j; the number of documents in which term j occurs (n_j).

21 Decisions in Building an Inverted File: Efficiency and Query Languages Some query options may require huge computation, e.g.: Regular expressions: if inverted files are stored in lexicographic order, comp* can be processed efficiently, but *comp cannot be processed efficiently. Logical operators: if A and B are search terms, A or B can be processed by comparing two moderate sized lists, but (not A) or (not B) requires two very large lists.

22 Efficiency Criteria Storage: Inverted files are big, typically 10% to 100% the size of the collection of documents. Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document. Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.

23 Word List On disk: If a word list is held on disk, search time is dominated by the number of disk accesses. In memory: Suppose that a word list has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.

24 File Structures for Inverted Files: Linear Index Advantages: Can be searched quickly, e.g., by binary search, O(log n). Good for lexicographic processing, e.g., comp*. Convenient for batch updating. Economical use of storage. Disadvantages: Index must be rebuilt if an extra term is added.
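A minimal sketch of a linear index held as two parallel sorted arrays and searched by binary search. The function name `lookup` and the pointer values (stand-ins for offsets into a postings file) are hypothetical.

```python
from bisect import bisect_left

def lookup(terms, pointers, term):
    """Binary-search a sorted term array; return its pointer or None."""
    i = bisect_left(terms, term)             # O(log n) comparisons
    if i < len(terms) and terms[i] == term:
        return pointers[i]
    return None

terms = ["ant", "bee", "cat", "dog", "elk", "fox", "gnu", "hog"]
pointers = [0, 1, 2, 3, 4, 5, 6, 7]   # hypothetical postings-file offsets
found = lookup(terms, pointers, "dog")
missing = lookup(terms, pointers, "emu")
```

Adding "emu" would mean shifting every later entry in both arrays, which is the rebuild cost listed as the disadvantage.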

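25 File Structures for Inverted Files: Binary Tree [Figure: binary search tree with root elk; elk's children are bee and hog; bee's children are ant and cat; cat's right child is dog; hog's left child is fox; fox's right child is gnu.] Input: elk, hog, bee, fox, cat, gnu, ant, dog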
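The tree on this slide can be reproduced by inserting the terms in the stated input order into an unbalanced binary search tree. The class and function names below are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into an unbalanced binary search tree; return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root   # duplicate keys are ignored

def in_order(node):
    """Return the keys in alphabetic order."""
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
```

The shape of the tree depends entirely on the input order; the first term inserted, elk, becomes the root.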

26 File Structures for Inverted Files: Binary Tree Advantages: Can be searched quickly. Convenient for batch updating. Easy to add an extra term. Economical use of storage. Disadvantages: Less good for lexicographic processing, e.g., comp*. Tree tends to become unbalanced. If the index is held on disk, it is important to optimize the number of disk accesses.

27 File Structures for Inverted Files: Binary Tree Calculation of maximum depth of tree. Illustrates importance of balanced trees. Worst case: depth = n, which is O(n). Ideal case: depth = log(n + 1)/log 2, which is O(log n).
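A small sketch comparing the two cases on this slide, assuming distinct keys and counting depth as the number of levels. The function names `depth_after_inserts` and `ideal_depth` are hypothetical.

```python
import math

def depth_after_inserts(keys):
    """Depth (number of levels) of an unbalanced BST built from distinct keys."""
    children = {}          # key -> [left child key, right child key]
    root = None
    deepest = 0
    for key in keys:
        if root is None:
            root, deepest = key, 1
            children[key] = [None, None]
            continue
        current, level = root, 1
        while True:
            side = 0 if key < current else 1
            nxt = children[current][side]
            if nxt is None:
                children[current][side] = key
                children[key] = [None, None]
                deepest = max(deepest, level + 1)
                break
            current, level = nxt, level + 1
    return deepest

def ideal_depth(n):
    """Depth of a perfectly balanced tree: log(n + 1) / log 2, rounded up."""
    return math.ceil(math.log2(n + 1))

worst = depth_after_inserts(range(1, 8))             # sorted input
balanced = depth_after_inserts([4, 2, 6, 1, 3, 5, 7])
```

Sorted input degenerates into a chain of n nodes, while a favorable insertion order reaches the ideal depth.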

28 File Structures for Inverted Files: Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e., the link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when held in memory. Reference: Knuth vol 1, 2.3.1, page 325.

29 File Structures for Inverted Files: Right Threaded Binary Tree [Figure: right-threaded binary tree over the terms ant, bee, cat, dog, elk, fox, gnu, hog, with one thread shown as NULL.]
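A minimal sketch of a right-threaded binary search tree, showing how the threads allow an alphabetic scan without recursion or a stack. The class, the `is_thread` flag, the function names, and the insertion order are all hypothetical; the shape of the slide's figure is not recoverable.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None       # right child, or thread to in-order successor
        self.is_thread = False  # True when `right` is a thread

def insert(root, key):
    """Insert key, maintaining right threads; return the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                new.right, new.is_thread = cur, True   # successor is parent
                cur.left = new
                return root
            cur = cur.left
        elif key > cur.key:
            if cur.right is None or cur.is_thread:
                # The new node inherits the parent's thread (or None).
                new.right, new.is_thread = cur.right, cur.is_thread
                cur.right, cur.is_thread = new, False
                return root
            cur = cur.right
        else:
            return root   # duplicate key: nothing to do

def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node

def in_order(root):
    """Walk the tree in alphabetic order by following threads."""
    keys = []
    cur = leftmost(root)
    while cur is not None:
        keys.append(cur.key)
        cur = cur.right if cur.is_thread else leftmost(cur.right)
    return keys

root = None
for term in ["dog", "bee", "gnu", "ant", "cat", "elk", "fox", "hog"]:
    root = insert(root, term)
```

The last node in alphabetic order has no successor, so its right link stays empty.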

30 File Structures for Inverted Files: B-trees B-tree of order m: A balanced, multiway search tree. Each node stores many keys. Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. If k_i is the i-th key in a given internal node, then all keys in the (i-1)-th child are smaller than k_i and all keys in the i-th child are bigger than k_i. All leaves are at the same depth.

31 File Structures for Inverted Files: B-trees B-tree example (order 2). [Figure: example B-tree.] Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.
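A minimal sketch of searching a B-tree, assuming each internal node with k keys holds k + 1 child pointers as stated above. The class `BTreeNode`, the function `search`, and the hand-built order-2 example tree are hypothetical; insertion and node splitting are not shown.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys held in this node
        self.children = children or []   # empty for a leaf, else len(keys) + 1

def search(node, key):
    """Return True if key is stored anywhere in the B-tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)  # first key >= search key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:            # reached a leaf without a match
            return False
        node = node.children[i]          # descend into the i-th child
    return False

# Hypothetical order-2 tree: every node holds between 2 and 4 keys.
tree = BTreeNode(["dog", "hog"], [
    BTreeNode(["ant", "bee", "cat"]),
    BTreeNode(["elk", "fox", "gnu"]),
    BTreeNode(["jay", "owl"]),
])
```

Each step down the tree inspects one node, so a lookup touches one node per level, which is what keeps the number of disk accesses small.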

32 File Structures for Inverted Files: B+-tree A B-tree is used as an index. Data is stored in the leaves of the tree, known as buckets. [Figure: example B+-tree of order 2, bucket size 4.] (Implementation of B+-trees is covered in CS 432.)