

Incremental Indexing Dr. Susan Gauch

Indexing  Current indexing algorithms are essentially batch processes  They start from scratch every time  What happens if we have already indexed a million documents and then add one document to the collection?  We do not want to index all 1,000,001 documents from scratch  Web search engines have spiders/crawlers/robots continually collecting new content  We need a way to add a new document to existing inverted files

Adding a document  This can cause two types of changes  Add a new word  Add an occurrence of an existing word

Adding a New Word  This is the easiest type of change  Fill in a new entry in the dict file for the word  Append its postings to the end of the post file  If the dict file is only 1/3 full after the indexing phase  We can add many words before the dict file's blank records are used up  Over time, the probability of a collision increases, slowing down retrieval  When the dict file is > 2/3 full, rehash on disk  Essentially, create a new dict file twice as big  Rehash all dict file records to their new locations  Lots of I/O, but it can be done in the background or on a separate computer
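
The grow-and-rehash step can be sketched with an in-memory stand-in for the dict file. The toy hash function, linear probing, and (token, payload) record layout here are assumptions for illustration, not the actual file format:

```python
# Sketch of rehashing an open-addressed dictionary "file" (here, a list)
# once it passes 2/3 full. Hash function and record layout are illustrative.

def hash_slot(token, size):
    return sum(ord(c) for c in token) % size   # toy hash, for illustration only

def insert(dict_file, token, payload):
    size = len(dict_file)
    i = hash_slot(token, size)
    while dict_file[i] is not None:            # linear probing past collisions
        i = (i + 1) % size
    dict_file[i] = (token, payload)

def maybe_rehash(dict_file):
    used = sum(1 for r in dict_file if r is not None)
    if used * 3 <= len(dict_file) * 2:         # still <= 2/3 full: nothing to do
        return dict_file
    bigger = [None] * (len(dict_file) * 2)     # new dict file, twice as big
    for rec in dict_file:                      # rehash every record to its new slot
        if rec is not None:
            insert(bigger, rec[0], rec[1])
    return bigger
```

On disk the rehash would stream records from the old file into the new one, which is why it is I/O-heavy but easy to run in the background.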

Adding a New Occurrence  The change to the dict file is trivial  Just increment numdocs  The change to the post file is catastrophic  We need to add a new posting record, but we cannot insert a record in the middle of a file  The idf for the word is now different (idf = log(N / numdocs), and numdocs just changed)  All existing postings for that word therefore have the wrong term weights

Adding Posting Records Option 1: Blank records  Write blank records after the existing postings  The number of blank records should be proportional to the number of existing postings  E.g., if “dog” has 3 postings, write scale_factor * 3 blank records after the 3 real postings; if “many” has 100 postings, write scale_factor * 100 blank records after the 100 real ones  Allows each word to grow by a factor of scale_factor  The first word to accumulate more than scale_factor * numdocs new postings forces the entire post file to be rewritten with new blanks inserted
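
A minimal sketch of the blank-record scheme, modeling the post file as a list; SCALE_FACTOR and the (docid, weight) record layout are illustrative assumptions:

```python
# Sketch of Option 1: after a word's real postings, reserve blank records
# proportional to how many postings it already has.
SCALE_FACTOR = 2           # illustrative expansion factor
BLANK = (None, None)       # placeholder (docid, weight) record

def write_postings_with_blanks(post_file, postings):
    start = len(post_file)
    post_file.extend(postings)                                   # the real postings
    post_file.extend([BLANK] * (SCALE_FACTOR * len(postings)))   # room to grow
    return start   # the dict file records this start offset
```

New occurrences then overwrite blanks in place until the reserved region runs out, which is the point at which the whole post file must be rewritten.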

Adding Posting Records Option 2: Move Postings  Copy the existing postings for the word to the end of the file  Append the new posting there  Update the dict record’s “start” index to point to the new location  Causes a lot of data movement  The post file becomes fragmented
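
The move-postings scheme can be sketched the same way; the (start, numdocs) dict entry and (docid, weight) records are illustrative, not the actual formats:

```python
# Sketch of Option 2: relocate a word's postings to the end of the post file
# so the new posting can be appended contiguously.

def add_posting_by_moving(post_file, dict_entry, new_posting):
    start, n = dict_entry
    moved = post_file[start:start + n]   # copy the existing postings...
    new_start = len(post_file)
    post_file.extend(moved)              # ...to the end of the file
    post_file.append(new_posting)        # append the new posting there
    return (new_start, n + 1)            # dict now points at the new location
```

Note that the old region is left behind as a dead fragment, which is exactly the fragmentation problem the slide describes.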

Adding Posting Records Option 3: Overflow pointer  Change the post record format to include an overflow pointer (a record number / block address)  Add the new posting at the end of the post file, or in a separate overflow file  While processing post records:  Loop over the numdocs records for the term  If overflow is null  Next = i + 1  Else  Next = overflow_location
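
The traversal loop above can be sketched as follows; each post record is modeled as (docid, wt, overflow), where overflow is the record number of the next posting or None. The layout is illustrative:

```python
# Sketch of Option 3: read a term's numdocs postings, following the
# overflow pointer when one is set, otherwise reading the next record.

def read_postings(post_file, start, numdocs):
    out, i = [], start
    for _ in range(numdocs):
        docid, wt, overflow = post_file[i]
        out.append((docid, wt))
        i = overflow if overflow is not None else i + 1  # follow pointer or step forward
    return out
```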

Adding Posting Records Option 4: Next pointers  A variation of Option 3  While processing post records:  Seek to start  Read >> docid >> wt >> next  While next != -1  Seek to next  Read >> docid >> wt >> next  Allows unlimited expansion  But can degenerate into the equivalent of a linked list on disk, with one seek per post record
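
A sketch of this chain traversal, mirroring the seek/read loop on the slide; every record is modeled as (docid, wt, next) with next = -1 marking the end of the chain:

```python
# Sketch of Option 4: follow next pointers from record to record until
# the chain ends at -1.

def read_chain(post_file, start):
    out, i = [], start
    while True:
        docid, wt, nxt = post_file[i]   # "seek" to record i, then "read" it
        out.append((docid, wt))
        if nxt == -1:
            break
        i = nxt                         # one seek per record in the worst case
    return out
```

On disk each step of this loop is a seek, which is why a long, scattered chain degenerates into linked-list performance.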

Handling idf  Updating numdocs changes idf, which in turn changes wt for all postings for the term  Read all postings for the term, change wt, rewrite the postings  If doing proper document-length normalization  All documents containing this term now have new lengths  Must recalculate the normalization factor and rewrite the postings for every term in those documents  Infeasible: we have no way to find all postings for a document without reading the whole file, or adding a new file that maps docid -> postings (doubling the inverted index size)

Better idea  Calculate term weights on the fly  Store rtf in the posting record  (already prenormalized by document length)  Then  Loop over postings  Acc[docid] += wt  becomes  Calculate idf from the current value of numdocs  Loop over postings  Acc[docid] += rtf * idf
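
The on-the-fly scoring loop can be sketched as below. The accumulator is a dict, postings store rtf (already prenormalized by document length), and N is the collection size; the natural-log idf and the exact record layout are illustrative assumptions:

```python
import math

# Sketch of scoring with term weights computed at query time: idf is
# derived from the current numdocs, so postings never go stale.

def score_term(acc, postings, N):
    numdocs = len(postings)
    idf = math.log(N / numdocs)             # computed fresh from current numdocs
    for docid, rtf in postings:
        acc[docid] = acc.get(docid, 0.0) + rtf * idf
```

Because idf is recomputed from whatever numdocs currently is, adding a new occurrence only requires appending one posting record, not rewriting old ones.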

Scalability (or how Google does it)  Create overflow areas that are larger than one record  Make them variable sizes  Store a few postings directly in the dict file  The dict record becomes  Token, numdocs, idf, P postings, Next  Pick P so that a dict record is 0.5 or 1 block in size (e.g., P = 100)  Create Small, Medium, and Large overflow files

Variable Overflows  If a word has > P postings  Allocate a record in the “Small” overflow file  Record format: S postings, Next  Pick S so that the record fits in 1 block  Or pick S so that 50% of all tokens can be processed without going to the Medium overflow file  If a word has > P + S postings  Allocate a record in the “Medium” overflow file  Record format: M postings, Next  Or pick M so that 90% of all tokens can be processed without going to the Large overflow file

Variable Overflows  If a word has > P + S + M postings  Allocate a record in the “Large” overflow file  Record format: L postings, Next  Pick L so that 99% of all tokens can be processed without going to a second Large overflow record  If a word has > P + S + M + L postings  Allocate another record at the end of the Large file  Its Next pointer just points to the next Large record
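
The tiering logic above can be sketched as a function that reports which areas a word's postings occupy as it grows. The capacities P, S, M, L below are arbitrary example values, not the percentile-tuned sizes the slides describe:

```python
# Sketch of the tiered overflow scheme: P postings live in the dict record,
# then Small, Medium, and Large overflow records are chained in as needed.
P, S, M, L = 100, 400, 2000, 10000   # illustrative capacities

def overflow_tiers(numdocs):
    """Return the sequence of areas needed to hold numdocs postings."""
    tiers, remaining = ["dict"], numdocs - P
    for name, cap in [("small", S), ("medium", M)]:
        if remaining <= 0:
            return tiers
        tiers.append(name)
        remaining -= cap
    while remaining > 0:             # chain extra Large records as needed
        tiers.append("large")
        remaining -= L
    return tiers
```

Most tokens stop at the dict record or the Small file, so the common case stays at one or two disk reads while the rare huge posting list can still grow without bound.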