1 CS 430: Information Discovery Lecture 3 Inverted Files.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
Efficient access to TIN Regular square grid TIN Efficient access to TIN Let q := (x, y) be a point. We want to estimate an elevation at a point q: 1. should.
Chapter 4 : File Systems What is a file system?
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
January 11, Csci 2111: Data and File Structures Week1, Lecture 1 Introduction to the Design and Specification of File Structures.
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
CS 430 / INFO 430 Information Retrieval
CS 430 / INFO 430 Information Retrieval
Chapter 11: File System Implementation
Modern Information Retrieval
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
Database Systems: A Practical Approach to Design, Implementation and Management International Computer Science S. Carolyn Begg, Thomas Connolly Lecture.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
CS/Info 430: Information Retrieval
1 CS 430: Information Discovery Lecture 20 The User in the Loop.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Chapter 5: Information Retrieval and Web Search
Chapter 17 Methodology – Physical Database Design for Relational Databases Transparencies © Pearson Education Limited 1995, 2005.
Indexing and Complexity. Agenda Inverted indexes Computational complexity.
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW Indexing Large Data COMP
Chapter 10 File System Interface
Lecture 9 Methodology – Physical Database Design for Relational Databases.
January 11, Files – Chapter 1 Introduction to the Design and Specification of File Structures.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Chapter 16 Methodology – Physical Database Design for Relational Databases.
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
CS 430: Information Discovery
File System Interface. File Concept Access Methods Directory Structure File-System Mounting File Sharing (skip)‏ File Protection.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
CS246 Data & File Structures Lecture 1 Introduction to File Systems Instructor: Li Ma Office: NBC 126 Phone: (713)
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
File Organization Lecture 1
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Incremental Indexing Dr. Susan Gauch. Indexing  Current indexing algorithms are essentially batch processing  They start from scratch every time  What.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Evidence from Content INST 734 Module 2 Doug Oard.
Evidence from Content INST 734 Module 2 Doug Oard.
K-tree/forest: Efficient Indexes for Boolean Queries Rakesh M. Verma and Sanjiv Behl University of Houston
CIS 250 Advanced Computer Applications Database Management Systems.
FILE ORGANIZATION.
1 CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting.
1 Geog 357: Data models and DBMS. Geographic Decision Making.
Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.
Information Retrieval Inverted Files.. Document Vectors as Points on a Surface Normalize all document vectors to be of length 1 Define d' = Then the ends.
Why indexing? For efficient searching of a document
Information Retrieval in Practice
Welcome to ….. File Organization.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Information Retrieval in Practice
Text Based Information Retrieval
CS 430: Information Discovery
26 - File Systems.
Methodology – Physical Database Design for Relational Databases
Implementation Issues & IR Systems
CS 430: Information Discovery
Multimedia Information Retrieval
CSCE 561 Information Retrieval System Models
FILE ORGANIZATION.
Searching and Indexing
CS 430: Information Discovery
Advance Database System
CS 430: Information Discovery
Presentation transcript:

1 CS 430: Information Discovery Lecture 3 Inverted Files

2 Course Administration Reading for Wednesday -- since the textbooks have not yet arrived, photocopies of the reading are provided. Assignment 1 will be posted during the next couple of days.

3 Course Administration Course enrollment: We do not have enough laptops and wireless cards for everybody who has applied. Everybody who has both (a) pre-registered for the class and (b) applied for a laptop should have received an . Other people may enroll for the course, but without laptops. No part of the course will require use of laptops. If you drop the course, we will re-issue your laptop to another student. If we have any extra laptops we will allocate them to people who applied, by drawing lots.

4 Inverted File (Basic) Inverted file: a list of the words in a set of documents and the documents in which they appear. Word Document abacus actor aspen 5 atoll Stop words are removed before building the index.

5 Inverted List Inverted list: All the entries in an inverted file that apply to a specific word, e.g. abacus Posting: Entry in an inverted list, e.g., the postings for "abacus" are documents 3, 19, 22.

6 Boolean Search (Keyword) Boolean query: two or more search terms, related by logical operators, e.g., andornot Examples: abacus and actor abacus or actor (abacus and actor) or (abacus and atoll) not actor

7 Boolean Diagram A B A and B A or B not (A or B)

8 Evaluating a Boolean Query Examples: abacus and actor Postings for abacus Postings for actor Document 19 is the only document that contains both terms, "abacus" and "actor". Since inverted lists may be very long, it is important to match postings efficiently

9 Enhancements to Inverted Files Location: The inverted file holds information about the location of each term within the document. Uses adjacency and near operators user interface design Frequency: The inverted file includes the number of postings for each term. Uses term weighting query processing optimization user interface design

10 Inverted File (Enhanced) WordPostings DocumentLocation abacus actor aspen atoll

11 Adjacent and Near Operators abacus adj actor The terms abacus and actor are next to each other as in the string "abacus actor". abacus near 4 actor The terms abacus and actor are next to each other as in the string "abacus actor". Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

12 Evaluating an Adjacency Operation Examples: abacus adj actor Postings for abacus Postings for actor Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent

13 Evaluation of Boolean Operators Precedence of operators must be defined: adj, nearhigh and, not or low Example A and B or C and B is evaluated as (A and B) or (C and B)

14 Efficiency Criteria Storage Inverted files are big, typically 10% to 100% the size of the collection of documents. Update performance It must be possible, with a reasonable amount of computation, to: (a) Add a large batch of documents (b) Add a single document Retrieval performance Retrieval must be fast enough to satisfy users and not use excessive resource.

15 Efficiency and Query Languages Some query options may require huge computation, e.g., Regular expressions If inverted files are stored in alphabetical order, comp* can be processed efficiently *comp cannot be processed efficiently Boolean terms If A and B are search terms A or B can be processed by comparing two moderate sized lists (not A) or (not B) requires two very large lists

16 Index File Structures: Linear Index TermPointer to list of postings ant bee cat dog elk fox gnu hog Inverted lists

17 Linear Index Advantages Can be searched quickly, e.g., by binary search, O(log n) Good for sequential processing, e.g., comp* Convenient for batch updating Economical use of storage Disadvantages Index must be rebuilt if an extra term is added

18 Index File Structures: Binary Tree elk cathog beedogfox ant gnu

19 Binary Tree Advantages Can be searched quickly Convenient for batch updating Easy to add an extra term Economical use of storage Disadvantages Poor for sequential processing, e.g., comp* Tree tends to become unbalanced If the index is held on disk, each node may require a separate disk access

20 Binary Tree Calculation of maximum depth of tree. Illustrates importance of balanced trees.

21 Impact of Secondary Storage If an index is held on disk, search time is dominated by the number of disk accesses. Suppose that an index has 100,000 terms. Each index entry consists of the term and a pointer to the inverted list, average 50 characters. Size of index is 5 megabytes, which can easily be held in memory.