Chapter 7: Indexing


Objectives
To become familiar with:
- Indexing
- Simple indexes
- Maintaining an index

Outline
- What is an index?
- A simple index for entry-sequenced files
- Basic operations on an indexed entry-sequenced file
- Indexing to provide access by multiple keys
- Improving the secondary index structure
- Selective indexes
- Binding

Indexes
- An index is a list of keys, each paired with a reference that points to the record where the information identified by the key can be found.
- A simple index can be represented as an array of structures, each containing a key and a reference.
- An index lets us impose order on a file without actually rearranging the file.
- The same data can have several indexes, giving multiple access paths. Books in a library, for example, may be ordered by author, title, call number, and/or subject.
- Indexing gives us direct, keyed access to files of variable-length records.
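The "array of structures" idea can be sketched in a few lines. This is only an illustration, assuming the index is held as a Python list of (key, offset) tuples sorted by key; the keys other than LON2321 and DG188807, all the offsets, and the name `find_offset` are invented for the example.

```python
from bisect import bisect_left

# Hypothetical index: (key, byte offset) pairs, kept sorted by key.
# The offsets are invented for illustration.
index = [
    ("ANG3795", 152),
    ("COL31809", 338),
    ("DG188807", 241),
    ("LON2321", 17),
]

def find_offset(index, key):
    """Binary-search the sorted index; return the byte offset, or None."""
    pos = bisect_left(index, (key,))  # (key,) sorts just before (key, offset)
    if pos < len(index) and index[pos][0] == key:
        return index[pos][1]
    return None

hit = find_offset(index, "DG188807")   # key present in the index
miss = find_offset(index, "RCA2626")   # key absent from the index
```

The data file itself is never touched by the search; only the small, sorted index is examined.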

A Simple Index for Entry-Sequenced Files
Suppose that you are looking at a collection of recordings with the following information about each of them:
- Identification number
- Title
- Composer or composers
- Artist or artists
- Label (publisher)
We choose to organize the file as a series of variable-length records with a size field preceding each record. The fields within each record are also of variable length but are separated by delimiters.
We form a primary key by concatenating the record company's label code and the record's ID number, e.g., LON2321.
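One way to picture this record layout is the sketch below. It assumes a 2-byte little-endian size field and "|" as the field delimiter; neither detail is fixed by the slides, and the function names and sample field values are placeholders.

```python
import io
import struct

DELIM = b"|"  # assumed field delimiter; fields must not contain it

def append_record(f, fields):
    """Append one record to the data file; return its byte offset."""
    body = DELIM.join(s.encode("utf-8") for s in fields)
    f.seek(0, io.SEEK_END)
    offset = f.tell()                       # address of the record's first byte
    f.write(struct.pack("<H", len(body)))   # size field (width is an assumption)
    f.write(body)
    return offset

def read_record(f, offset):
    """Read back the record that starts at the given byte offset."""
    f.seek(offset)
    (size,) = struct.unpack("<H", f.read(2))
    return [s.decode("utf-8") for s in f.read(size).split(DELIM)]

def primary_key(fields):
    """Label code followed by ID number, e.g. 'LON' + '2321'."""
    return fields[4] + fields[0]

# Field order: ID number, title, composer, artist, label (placeholder values).
data = io.BytesIO()  # stands in for the data file
rec1 = ["2321", "Title A", "Composer A", "Artist A", "LON"]
rec2 = ["188807", "Symphony No. 9", "Beethoven", "Artist B", "DG"]
off1 = append_record(data, rec1)
off2 = append_record(data, rec2)
```

The offsets returned by `append_record` are exactly the record addresses that an index would store next to each primary key.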

A Simple Index for Entry-Sequenced Files (Cont’d)
Record address: the record's byte offset, i.e., the address of the first byte of the record.

A Simple Index for Entry-Sequenced Files (Cont’d)
To provide rapid keyed access, we build a simple index.
- The index file has a fixed-length record structure; each index record contains a 12-character key.
- Each key is associated with a reference field giving the address of the first byte of the corresponding data record (the record address).
- The index may be sorted while the data file does not have to be. The data file can therefore be entry sequenced: the records occur in the order in which they were entered into the file.
A few comments about our index organization:
- The index is easier to use than the data file because (1) it uses fixed-length records and (2) it is likely to be much smaller than the data file.
- By requiring fixed-length records in the index file, we impose a limit on the size of the primary key field. This could cause problems for keys that do not fit.
- The index could carry more information than the key and reference fields (e.g., we could also keep the length of each data file record in the index).
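A fixed-length index record of this kind can be sketched with Python's `struct` module. The 12-byte key comes from the slide; the 4-byte unsigned offset, the NUL padding, and the function names are assumptions made for the example.

```python
import struct

# 12-byte key followed by a 4-byte unsigned offset (offset width is assumed).
ENTRY = struct.Struct("<12sI")

def pack_entry(key, offset):
    """Encode one index record; reject keys that exceed the fixed field."""
    raw = key.encode("ascii")
    if len(raw) > 12:
        raise ValueError("key longer than 12 characters")
    return ENTRY.pack(raw, offset)  # struct pads short keys with NUL bytes

def unpack_entry(buf):
    """Decode one index record back into (key, offset)."""
    raw, offset = ENTRY.unpack(buf)
    return raw.rstrip(b"\0").decode("ascii"), offset

packed = pack_entry("LON2321", 17)
```

Because every entry has the same size, the i-th entry of the index file starts at byte `i * ENTRY.size`, which is what makes direct access and binary search on the index straightforward. The explicit length check makes the key-size limit visible instead of silently truncating.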


Basic Operations on an Indexed Entry-Sequenced File
Assumption: the index is small enough to be held in memory.
- Create the original empty index and data files.
- Load the index into memory before using it.
- Rewrite the index file from memory after using it.
- Add records to the data file and index.
- Delete records from the data file.
- Update records in the data file.

Creating, Loading and Rewriting
- The index is represented as an array of records.
- Loading it into memory can be done sequentially, reading a large number of index records (which are short) at once.
- What happens if the index has changed but the rewrite does not take place, or takes place only partially?
  - Use a mechanism that indicates whether or not the index is out of date.
  - Have a procedure that reconstructs the index from the data file when it is out of date.
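The slides do not say what the out-of-date mechanism looks like. One possible sketch, assuming a one-byte status flag at the start of the index file and a caller-supplied `rebuild` function that rescans the data file (both hypothetical), is:

```python
import io
import struct

ENTRY = struct.Struct("<12sI")      # assumed index record layout
STALE, CLEAN = b"\x01", b"\x00"     # assumed one-byte status flag

def mark_stale(f):
    """Call when the in-memory index starts to diverge from the file."""
    f.seek(0)
    f.write(STALE)

def rewrite_index(f, entries):
    """Write all entries, then flip the flag to CLEAN as the very last step."""
    f.seek(0)
    f.truncate()
    f.write(STALE)
    for key, offset in entries:
        f.write(ENTRY.pack(key.encode("ascii"), offset))
    f.seek(0)
    f.write(CLEAN)

def load_index(f, rebuild):
    """Load the index, or rebuild it if the flag says it is out of date."""
    f.seek(0)
    if f.read(1) != CLEAN:
        return rebuild()            # hypothetical scan of the data file
    return [(k.rstrip(b"\0").decode("ascii"), off)
            for k, off in ENTRY.iter_unpack(f.read())]

index_file = io.BytesIO()           # stands in for the index file
rewrite_index(index_file, [("DG188807", 241), ("LON2321", 17)])
loaded = load_index(index_file, rebuild=lambda: [])
```

Writing the flag last means that an interrupted rewrite leaves the file marked stale, so the next load falls back to reconstruction rather than trusting a partial index.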

Record Addition
- When we add a record, both the data file and the index must be updated.
- In the data file, the record can be added anywhere. However, the byte offset of the new record must be saved.
- Since the index is sorted, the position of the new index entry does matter: we have to shift all the entries that belong after the one we are inserting to open up space for it. This operation is not too costly, however, because it is performed in memory.
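A minimal in-memory sketch of record addition follows. It assumes the data file is modeled as a `bytearray` and the index as a sorted list of (key, offset) tuples; `add_record` and the payload bytes are illustrative names and values only.

```python
from bisect import insort

def add_record(data, index, key, payload):
    """Append the record to the data file and insert its key into the index."""
    offset = len(data)            # new records go at the end of the data file
    data.extend(payload)
    insort(index, (key, offset))  # shifts later entries to keep the index sorted
    return offset

data = bytearray()   # stands in for the entry-sequenced data file
index = []           # sorted in-memory index
add_record(data, index, "LON2321", b"first record")
add_record(data, index, "DG188807", b"second record")
```

After the two calls the data file is still in entry order, while the index lists DG188807 before LON2321.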

Record Deletion and Updating
- Record deletion can be done using the methods discussed in Chapter 6. In addition, the index entry corresponding to the deleted data record must also be deleted. Once again, since this deletion takes place in memory, the shifting of index entries is not too costly.
- Record updating falls into two categories:
  - The update changes the value of the key field.
  - The update does not affect the key field.
- In the first case, both the index and the data file may need to be reordered. The update is easiest to handle if it is treated as a delete followed by an insert (the user need not know about this).
- In the second case, the index does not need reordering, but the data file may. If the updated record is no larger than the original, it can be rewritten at the same location. If it is larger, a new spot has to be found for it. Again, the delete/insert approach can be used.
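The second case (an update that leaves the key alone) might look like the sketch below. For brevity it assumes the index is a dictionary mapping each key to (offset, length), which is a simplification of the sorted array used elsewhere in the chapter, and it models the data file as a `bytearray`; all names are illustrative.

```python
def update_record(data, index, key, new_payload):
    """Rewrite in place if the new record fits, otherwise relocate it."""
    offset, length = index[key]
    if len(new_payload) <= length:
        # Fits in the old slot; any leftover bytes simply go unused.
        data[offset:offset + len(new_payload)] = new_payload
        index[key] = (offset, len(new_payload))
    else:
        # Too large: append elsewhere and repoint the index entry.
        new_offset = len(data)
        data.extend(new_payload)
        index[key] = (new_offset, len(new_payload))
    return index[key][0]

data = bytearray(b"AAAAAAAABBBB")
index = {"LON2321": (0, 8), "DG188807": (8, 4)}
same_place = update_record(data, index, "LON2321", b"aaaa")
moved = update_record(data, index, "DG188807", b"bbbbbbbb")
```

In the relocation branch the old slot becomes reclaimable space, which is where the deletion techniques of Chapter 6 come back in.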

Indexes That Are Too Large to Hold in Memory
When the index file is too large to fit into memory:
- Binary searching of the index becomes expensive, because it requires several accesses to the index file on disk.
- The index file still needs to be kept sorted, which is an expensive task on disk.
Solutions:
- Use a hashed organization.
- Use a tree-structured index (e.g., a B-tree).
Nonetheless, simple indexes should not be discarded completely:
- They allow the use of binary search on a file of variable-length records.
- If the index entries are significantly smaller than the data file records, sorting and file maintenance are faster.
- If there are pinned records in the data file, the keys can be rearranged without moving the data records.
- They can provide access by multiple keys.

Indexing to Provide Access by Multiple Keys
- So far, our index only allows access by primary key: you can retrieve record DG188807, but you cannot ask for a recording of Beethoven’s Symphony no. 9. That is not very useful.
- We need secondary key fields such as album titles, composers, and artists.
- Although it would be possible to relate a secondary key directly to a byte offset, this is usually not done (why?). Instead, we relate the secondary key to a primary key, which in turn points to the actual byte offset.
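The two-step lookup (secondary key to primary key, then primary key to byte offset) can be sketched as follows. The index contents, offsets, placeholder composer name, and the function name `lookup` are invented for illustration; the primary index is modeled as a dictionary for brevity.

```python
from bisect import bisect_left

# Primary index: primary key -> byte offset (offsets are made up).
primary = {"ANG3795": 152, "DG188807": 241, "LON2321": 17}

# Secondary index on composer: (secondary key, primary key), sorted.
composer_index = [
    ("BEETHOVEN", "ANG3795"),
    ("BEETHOVEN", "DG188807"),
    ("COMPOSER A", "LON2321"),
]

def lookup(sec_index, primary, sec_key):
    """Return byte offsets of all records that match the secondary key."""
    sec_key = sec_key.upper()                    # keys stored in canonical form
    pos = bisect_left(sec_index, (sec_key,))     # first entry with this key
    offsets = []
    while pos < len(sec_index) and sec_index[pos][0] == sec_key:
        pk = sec_index[pos][1]
        if pk in primary:                        # skip entries for deleted records
            offsets.append(primary[pk])
        pos += 1
    return offsets

beethoven = lookup(composer_index, primary, "Beethoven")
```

Because the secondary index never stores byte offsets, moving a record in the data file only requires a change to the primary index.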

Record Addition in Multiple Key Access Settings
- When a secondary index is used, adding a record involves updating the data file, the primary index, and the secondary index.
- The secondary index update is similar to the primary index update.
- Secondary keys are entered in canonical form (all capitals). The original upper- and lower-case form must be obtained from the data file.
- Because of the length restriction on keys, secondary keys may sometimes be truncated.
- The secondary index may contain duplicate keys (the primary index cannot).

Record Deletion in Multiple Key Access Settings
- Removing a record from the data file means removing its corresponding entry in the primary index, and may mean removing all of the entries in the secondary indexes that refer to this primary index entry.
- However, it is also possible to leave the secondary indexes alone, since secondary keys point to primary keys rather than to byte offsets: a lookup through a stale secondary entry simply fails to find the primary key.
  - Saving: the secondary index does not have to be rearranged.
  - Cost: the secondary index is not purged, so entries for deleted records keep occupying space.

Record Updating in Multiple Key Access Settings
Three situations are possible:
- The update changes the secondary key: the secondary index may have to be rearranged.
- The update changes the primary key: changes to the primary index are required, but very few are needed for the secondary index.
- The update is confined to other fields: no changes are necessary to either the primary or the secondary index.

Retrieval Using Combinations of Secondary Keys
- With secondary keys, we can now search for things like all the recordings of Beethoven’s works or all the recordings titled “Violin Concerto”.
- More importantly, we can use combinations of secondary keys (e.g., find all recordings of Beethoven’s Symphony no. 9).
- Without secondary indexes, such a request requires a very expensive sequential search through the entire file.
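A combined query amounts to intersecting the primary-key lists returned by each secondary index. The sketch below assumes each secondary index has already been reduced to a mapping from secondary key to a set of primary keys; the key values are invented for illustration.

```python
# Hypothetical secondary indexes: secondary key -> set of primary keys.
composer_index = {
    "BEETHOVEN": {"ANG3795", "DG139201", "DG188807"},
    "COMPOSER A": {"LON2321"},
}
title_index = {
    "SYMPHONY NO. 9": {"ANG3795", "COL31809", "DG188807"},
    "VIOLIN CONCERTO": {"RCA2626"},
}

def find_both(composer, title):
    """Primary keys of recordings matching both the composer and the title."""
    by_composer = composer_index.get(composer.upper(), set())
    by_title = title_index.get(title.upper(), set())
    return sorted(by_composer & by_title)   # set intersection = logical AND

matches = find_both("Beethoven", "Symphony No. 9")
```

Only the index entries are compared; the data file is read just once per matching record, after the intersection has been computed.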

Improving the Secondary Index Structure
Secondary indexes lead to two difficulties:
- The index file has to be rearranged every time a new record is added to the data file.
- If there are duplicate secondary keys, the secondary key field is repeated for each entry, so space is wasted.
Solution 1: change the secondary index structure so that it associates an array of references with each secondary key.
- Advantage: it avoids the need to rearrange the secondary index file so often.
- Disadvantages:
  - It may restrict the number of references that can be associated with each secondary key.
  - It may cause internal fragmentation, i.e., wasted space.

Improving the Secondary Index Structure: Inverted Lists
Solution 2: each secondary key points to its own list of primary key references. Each of these lists can grow as long as it needs to, and no space is lost to internal fragmentation.
Advantages:
- The secondary index file needs to be rearranged only when a record with a new secondary key is added.
- The rearranging is faster, since there are fewer records in the secondary index.
- Keeping the secondary index on disk is not that costly.
- The file holding the lists of primary key references is entry sequenced, so it never needs to be sorted.
- Space from deleted entries in that file can easily be reused.
Disadvantage:
- Locality (in the secondary index) has been lost, so more seeking may be necessary.
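The inverted-list arrangement can be sketched with two in-memory structures. This is a simplified illustration: the list file is modeled as a Python list of [primary key, next] entries, -1 marks the end of a chain, and new references are pushed at the head of the chain rather than kept in sorted order. All names and key values are assumptions.

```python
ref_list = []   # entry-sequenced list file: [primary_key, index_of_next_entry]
heads = {}      # secondary index: secondary key -> index of first list entry

def add_reference(sec_key, primary_key):
    """Append a reference and link it in front of the key's existing chain."""
    ref_list.append([primary_key, heads.get(sec_key, -1)])
    heads[sec_key] = len(ref_list) - 1

def references(sec_key):
    """Follow the chain for a secondary key and collect its primary keys."""
    found = []
    i = heads.get(sec_key, -1)
    while i != -1:
        primary_key, i = ref_list[i]
        found.append(primary_key)
    return found

add_reference("BEETHOVEN", "ANG3795")
add_reference("COMPOSER A", "LON2321")
add_reference("BEETHOVEN", "DG188807")
```

Adding a second Beethoven recording appends one entry to `ref_list` and changes one head pointer; the secondary index gains a new key only when a composer appears for the first time. The chain entries for one key end up scattered through `ref_list`, which is the loss of locality noted above.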

Selective Indexes
- Using secondary keys, you can divide the file into parts and provide a selective view of it.
- For example, you can build a selective index that contains only the titles of classical recordings, or separate indexes for recordings released prior to 1970 and for those released since 1970.
- A possible query could then be: “List all the recordings of Beethoven’s Symphony no. 9 released since 1970.”
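A selective index is an ordinary secondary index built over a filtered subset of the records. The sketch below assumes each record carries a release year, which is not among the fields listed earlier in the chapter, and all record values are invented.

```python
# Hypothetical records; the "year" field is an assumption for this example.
records = [
    {"key": "ANG3795", "title": "Symphony No. 9", "year": 1965},
    {"key": "DG188807", "title": "Symphony No. 9", "year": 1977},
    {"key": "LON2321", "title": "Title A", "year": 1982},
]

def build_selective_index(records, keep):
    """Sorted (title, primary key) entries for records accepted by keep()."""
    return sorted((r["title"].upper(), r["key"]) for r in records if keep(r))

since_1970 = build_selective_index(records, lambda r: r["year"] >= 1970)
before_1970 = build_selective_index(records, lambda r: r["year"] < 1970)

# Titles in the "since 1970" view that match the query.
ninth_since_1970 = [pk for title, pk in since_1970 if title == "SYMPHONY NO. 9"]
```

The query about recordings released since 1970 then consults only the `since_1970` index and never sees the older recordings.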

Binding
At what point is the key bound to the physical address of its associated record?
Answer so far:
- The binding of our primary keys takes place at construction time.
- The binding of our secondary keys takes place at the time they are used.
Trade-offs:
- Advantage of construction-time binding: faster access.
- Disadvantage of construction-time binding: reorganization of the data file must result in modifications to all bound index files.
- Advantage of retrieval-time binding: safer.
Tight, construction-time binding is preferable when:
- The data file is static or nearly static, requiring little or no adding, deleting, or updating.
- Rapid performance during actual retrieval is a high priority.
Retrieval-time binding (postponing the binding as long as possible) is simpler and safer when the data file requires a lot of adding, deleting, and updating.