File Storage and Indexing

Slides:



Advertisements
Similar presentations
Disk Storage, Basic File Structures, and Hashing
Advertisements

Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
CpSc 3220 File and Database Processing Lecture 17 Indexed Files.
1 Storing Data: Disks and Files Yanlei Diao UMass Amherst Feb 15, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.
1 CS 728 Advanced Database Systems Chapter 17 Database File Indexing Techniques, B- Trees, and B + -Trees.
DISK STORAGE INDEX STRUCTURES FOR FILES Lecture 12.
Secondary Storage Management Hank Levy. 8/7/20152 Secondary Storage • Secondary Storage is usually: –anything outside of “primary memory” –storage that.
Indexing structures for files D ƯƠ NG ANH KHOA-QLU13082.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 17 Disk Storage, Basic File Structures, and Hashing.
1 Physical Data Organization and Indexing Lecture 14.
Announcements Exam Friday Project: Steps –Due today.
Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing 
1 Chapter 17 Disk Storage, Basic File Structures, and Hashing Chapter 18 Index Structures for Files.
Chapter 9 Disk Storage and Indexing Structures for Files Copyright © 2004 Pearson Education, Inc.
OSes: 11. FS Impl. 1 Operating Systems v Objectives –discuss file storage and access on secondary storage (a hard disk) Certificate Program in Software.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
Chapter Ten. Storage Categories Storage medium is required to store information/data Primary memory can be accessed by the CPU directly Fast, expensive.
Indexing CS 400/600 – Data Structures. Indexing2 Memory and Disk  Typical memory access: 30 – 60 ns  Typical disk access: 3-9 ms  Difference: 100,000.
File Structures. 2 Chapter - Objectives Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
Chapter 5 Record Storage and Primary File Organizations
1 CSCE 520 Test 2 Info Indexing Modified from slides of Hector Garcia-Molina and Jeff Ullman.
1 CS122A: Introduction to Data Management Lecture #14: Indexing Instructor: Chen Li.
File organization Secondary Storage Devices Lec#7 Presenter: Dr Emad Nabil.
Indexing Structures for Files
Data Indexing Herbert A. Evans.
CPS216: Data-intensive Computing Systems
Indexing Structures for Files and Physical Database Design
CS522 Advanced database Systems
Chapter 11: File System Implementation
CS 540 Database Management Systems
Indexing Goals: Store large files Support multiple search keys
Indexing and hashing.
Multiway Search Trees Data may not fit into main memory
CS 728 Advanced Database Systems Chapter 18
Azita Keshmiri CS 157B Ch 12 indexing and hashing
CS522 Advanced database Systems
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Chapter 11: File System Implementation
B+-Trees.
Database Management Systems (CS 564)
Oracle SQL*Loader
B+-Trees.
B+ Tree.
9/12/2018.
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
Chapter 11: File System Implementation
Chapters 17 & 18 6e, 13 & 14 5e: Design/Storage/Index
Disk Storage, Basic File Structures, and Hashing
File organization and Indexing
Disk storage Index structures for files
B+ Trees What are B+ Trees used for What is a B Tree What is a B+ Tree
Chapter 11: File System Implementation
Indexing and Hashing Basic Concepts Ordered Indices
Secondary Storage Management Brian Bershad
File Storage and Indexing
Indexing and Hashing B.Ramamurthy Chapter 11 2/5/2019 B.Ramamurthy.
Storage and Indexing.
Indexing 4/11/2019.
Secondary Storage Management Hank Levy
Chapter 11: File System Implementation
Lecture 20: Indexes Monday, February 27, 2006.
Indexing, Access and Database System Architecture
Advance Database System
Presentation transcript:

File Storage and Indexing Chapters 4 and 5 12/1/97

DBs Reside on Disks Main reasons: (main) memory is too small and too volatile Understanding disk operation and performance is important Data structures and algorithms for disk are very different from those appropriate for memory The single most important fact about disks, relative to memory, is… (we shall see) 12/1/97

Review: Disk Organization Spinning magnetic surfaces, stacked R/W head per surface Track: circular path Cylinder: vertical stack of tracks Sectors: fixed length physical blocks along track separated by gaps overhead info (ID, error-correction, etc.) each sector has a hardware address 12/1/97

Disks and OS OS manages disks Files are allocated in blocks (groups of sectors) units may be scattered on disk OS makes file look contiguous no simple map between logical records (program/file view) and physical placement 12/1/97

Program Viewpoint Files are collections of logical "records" records are commonly (but not always) fixed in size and format fields in records are commonly (but not always) fixed in size and format Sequential access: program gets data one record at a time Random access: program can ask for a record by its relative block # (not disk address!) 12/1/97

Common Modes of Record Processing Single record (random) …WHERE SSN=345667899 Multiple records (no special order) SELECT * FROM DEPT... Multiple records (in order) …ORDER BY SSN An efficient storage scheme should support all of these modes 12/1/97

Naïve Relational View Each table is a separate file All rows in a table are of fixed size and format Rows are stored in order by primary key 12/1/97

What's the Big Deal? (actual numbers will vary) CPU speed: 100-500MHz Memory access speed: (100ns) how many CPU cycles is this? Disk speed has three components Seek time: to position the W/R head (10ms) how many memory cycles is this? Latency: rotational speed (3600rpm) Transfer time: movement of data to/from memory (5Mb/sec) 12/2/97

Fact of Life Seek time outweighs all other time factors Implication: Martin's three rules for efficient file access: 1. Avoid seeks 2. Avoid seeks 3. Avoid seeks Corollary: Any technique that takes more than one seek to locate a record is unsatisfactory. 12/1/97

Memory vs Disk Data Structures Example: Binary Search of a sorted array Needs O(Log2N) operations How many passes if 1,000,000 integers in memory? The answer (20) is acceptable How many seeks if 1,000,000 records on disk? The answer (20) in unacceptable And how did they get sorted (NlogN at best)? 12/1/97

Avoiding Seeks Overlapping disk operations Disk cache double buffering Disk cache locality principle: recently used pages are likely to be needed again maintained by OS or DBMS Sequential access Seek needed only for first record Smart file organizations 12/1/97

A Smart File Organization... One which needs very few seeks to locate any record Hashing: go directly to the desired record, based on a simple calculation Indexing: go directly to the desired record, based on a simple look-up 12/1/97

Review: Hashing Main idea: address of record is calculated from some value in the record usually based on primary key Zillions of possible hash functions Hard to find a good hash function 12/1/97

Pitfalls of Hashing Conflicts: more than one record hashing to the same address. schemes to overcome conflicts: overflow areas; chained records, etc. all involve more disk I/O Wasted space No efficient access for other than primary key No efficient way for sequential, especially sorted traversal 12/1/97

Index Main idea: A separate data structure used to locate records Frequently, index is a list of key/address pairs If index is small, a copy can be maintained in memory! Permanent disk copy is still needed Many, many flavors of index organization 12/1/97

Some Indexing Terminology Index field (key) Primary index Secondary index Dense index Non-dense (sparse) index Clustering index 12/1/97

Indexing Pitfalls Each index takes space Index file itself must permit efficient reorganization Large indices won't fit in memory May require multiple seeks to locate record entry Solution to some problems: Multilevel indexing each level is an index to the next level down 12/1/97

Requirements on Multilevel Indexes Must have low height Must be efficiently updatable Must be storage-efficient Top level(s) should fit in memory Should support efficient sequential access, if possible 12/1/97

B-Tree B-Tree is a type of multilevel index from another standpoint: it's a type of balanced tree Invented in 1972 by Boeing engineers R. Bayer and E. McCreight By 1979: "the standard organization for indexes in a database system" (Comer) 12/1/97

B-Tree Overview Assume for now that keys are fixed-length and unique A B-tree can be thought of as a generalized binary search tree multiple branches rather than just L or R Some wasted space in the nodes is tolerated Trees are always perfectly balanced 12/1/97

B-Tree Concepts Each node contains tree (node) pointers, and key values (and record pointers) Order p means (up to) p tree pointers, (up to) p-1 keys Given a key K and the two node pointers L and R around it All key values pointed to by L are < K All key values pointed to by R are > K 12/1/97

B-Tree Growth and Change When a node is full, it splits. middle value is propagated upward two new nodes are at same level as original node Height of tree increases only when the root splits Recommended: split only on the way down On deletion: two adjacent nodes recombine if both are < half full 12/1/97

B+ Tree Like B-Tree, except Allows sequential access of entire file! record pointers are only in the leaves left-side pointer has keys <= rather than < leaf nodes are linked together left to right Allows sequential access of entire file! "indexed sequential" Interior nodes have higher order than leaf nodes Fewer non-leaf levels than B-Tree 12/1/97

Variations Could store the whole record in the index block especially if records are few and small in a B+ tree, this would make sequential access especially efficient Could redistribute records between adjacent blocks esp. on deletion (B* tree) Variable order: accommodate varying key lengths 12/1/97

Other Forms of Indexing Bitmap indexes One index per value (property) of interest One bit per record TRUE if record has a particular property Indexed hash: hash function takes you to an entry in an index allows physical record locations to change Clever indexing schemes are useful in optimizing complex queries 12/1/97