CS 292: Special Topics on Big Data (Yuan Xue)



Part I Relational Database (Physical Database Design) Yuan Xue

Review and Look Forward
- What we know so far
  - Design: database schema
  - Optimization objective: minimum redundancy with information preservation
  - Operation: database access and manipulation via SQL
- Next step
  - Design: how is data stored in the database?
  - Operation: how data is accessed and manipulated; how to configure the data access design via SQL
  - Optimization: how the data storage and access methods can be designed so that the (average) execution time of applications (and SQL queries) is minimized
[Figure: design pipeline. Conceptual design (Entity/Relationship model) -> data model mapping -> logical design (logical schema) -> normalization (normalized schema) -> physical design (physical/internal schema). Physical design is the next step.]

Primitives
- Storage hierarchy in computer systems
  - Processor cache
  - Memory (data cache, volatile)
  - Disk (persistent)
  - Tapes
- Persistent data storage vs. transient (volatile) data storage
  - Storage speed currently correlates with price
    - E.g., DynamoDB uses solid state drives (SSDs) for speed
  - Faster storage is typically volatile
- Disk basics
  - Slow: rotational delay + seek time
  - The block is the unit of data access and transfer between disk and memory
  - Buffering is a low-level technique that reduces data access and transfer time

Primitives
- File system
  - A layer of the OS that transforms the block interface of disks into files (directories, etc.)
- File system components
  - Disk management: organizes disk blocks into files
    - Contiguous allocation, linked allocation, indexed allocation
    - E.g., inode in Unix
  - Naming: interface to find files by name, not by blocks
  - Protection: layers to keep data secure
  - Reliability and durability
- File operations
  - Open, close
  - Read (random, sequential), write (modification, insertion, append, delete)

Placing and Accessing File Records on Disk
- Record
  - Fixed-length records
  - Variable-length records
    - Separator
- BLOB (binary large object) in MySQL
  - Media files, images, etc.
  - Option 1: store the object separately from its record and keep a pointer in the record
  - Option 2: use the BLOB or TEXT type
    - Uses MySQL string handling
    - Better security support
- Putting records into blocks
  - Spanned vs. unspanned

  CREATE TABLE MiniTwitter.User (
    ID       VARCHAR(20) NOT NULL,
    Name     VARCHAR(20) NOT NULL,
    [...]    VARCHAR(20) NOT NULL,
    Password CHAR(10)    NOT NULL,
    [...]
  );

[Figure: sample User records (ID, Name, [...], Password) placed into Block 1 and Block 2.]
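
The spanned vs. unspanned choice can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original slides: the 512-byte block size and the helper names are hypothetical, and the 70-byte record size assumes the declared column widths (20 + 20 + 20 + 10) are stored as fixed-length fields. Per-block header and pointer overhead is ignored.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole records that fit in one block when records may not span blocks."""
    return block_size // record_size


def blocks_needed(num_records: int, block_size: int, record_size: int,
                  spanned: bool) -> int:
    """Blocks required to store the file (ceiling division in both cases)."""
    if spanned:
        # Records may cross block boundaries, so only total bytes matter.
        return -(-(num_records * record_size) // block_size)
    # Unspanned: each block holds at most blocking_factor records.
    return -(-num_records // blocking_factor(block_size, record_size))


BLOCK_SIZE = 512                 # hypothetical block size in bytes
RECORD_SIZE = 20 + 20 + 20 + 10  # assumed fixed-length record

bfr = blocking_factor(BLOCK_SIZE, RECORD_SIZE)
unused = BLOCK_SIZE - bfr * RECORD_SIZE  # bytes wasted per unspanned block
print(bfr, unused)
print(blocks_needed(1000, BLOCK_SIZE, RECORD_SIZE, spanned=False))
print(blocks_needed(1000, BLOCK_SIZE, RECORD_SIZE, spanned=True))
```

Under these assumptions the unspanned layout wastes a few bytes in every block, which is the price paid for never having to read two blocks to fetch one record.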

Database Physical Design Overview
- Two aspects
  - Storage organization
  - Access method
- Primary file organization
  - How file records are physically placed on the disk
  - How the records can be accessed
- Secondary organization (auxiliary access structure)
  - Allows efficient access to file records based on fields other than those used for the primary file organization
  - Most of these exist as indexes
- Database engine (or storage engine) in a DBMS
  - The component that handles create, read, update, and delete (CRUD) operations on a database
Successful design: perform the frequent operations as efficiently as possible.

Primary File Organization
- Several types
  - Heap file (unsorted)
  - Sorted file (ordered by a particular field, the sort key)
  - Hash file
  - B-tree, B+-tree: mainly used for index files

Heap file (unsorted)
- Operations
  - Insertion: efficient
  - Retrieval/search: inefficient, linear
  - Deletion
    - May lead to unused space
    - Mark selected records as "deleted"; requires periodic reorganization
- Advantages
  - Efficient for bulk loading data
  - Efficient for relatively small relations, as indexing overheads are avoided
  - Efficient when retrievals involve a large proportion of the stored records
- Disadvantages
  - Not efficient for selective retrieval using key values
  - Sorting may be time-consuming
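
The heap-file operations above can be sketched in a few lines. This is an illustrative in-memory model, not a disk implementation: the class name `HeapFile`, the slot layout, and the dictionary records are all assumptions made for the example.

```python
class HeapFile:
    """Unsorted file: records are appended in arrival order."""

    def __init__(self):
        self.slots = []  # each slot is [deleted_flag, record]

    def insert(self, record):
        # Insertion is cheap: append at the end of the file.
        self.slots.append([False, record])
        return len(self.slots) - 1  # position acts as a record pointer

    def search(self, field, value):
        # Retrieval is a linear scan over every live record.
        return [rec for deleted, rec in self.slots
                if not deleted and rec[field] == value]

    def delete(self, field, value):
        # Deletion only marks records, leaving unused space behind.
        count = 0
        for slot in self.slots:
            if not slot[0] and slot[1][field] == value:
                slot[0] = True
                count += 1
        return count

    def reorganize(self):
        # Periodic reorganization reclaims the space of deleted records.
        self.slots = [slot for slot in self.slots if not slot[0]]


heap = HeapFile()
for name in ("ann", "bob", "cat"):
    heap.insert({"name": name})
heap.delete("name", "bob")
print(len(heap.slots), heap.search("name", "bob"))
```

The slot for the deleted record stays allocated until `reorganize()` runs, which mirrors the "unused space" point on the slide.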

Sorted File
- Sorted file: ordered by a particular field, the ordering key
- Operations
  - Search based on the ordering key is efficient (binary search)
  - Insertion and deletion are expensive
    - Solution: overflow file (idea used in BigTable/HBase)
      - Create a temporary unordered file
      - New records are added to the overflow file
      - Periodically, the overflow file is sorted and merged with the master file
      - Increased complexity in search (both the overflow and master files have to be searched)
- Advantages
  - Efficient for range access on the ordering key
  - Efficient for (random) writes when an overflow file is used
- Disadvantages
  - A sequential scan performs the same as on a heap file
  - No benefit for access based on non-ordering fields
Sorted files are rarely used in database applications unless a primary index (coming up in slide 15) is defined as an additional access path.
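
The overflow-file idea can be sketched as follows. This is a simplified in-memory model under stated assumptions: records are reduced to their integer ordering keys, and the class and method names are hypothetical. It is not how BigTable or HBase are implemented; it only shows the master/overflow split described on the slide.

```python
import bisect


class SortedFileWithOverflow:
    """Sorted master file plus an unordered overflow file for new records."""

    def __init__(self, keys):
        self.master = sorted(keys)  # ordered on the ordering key
        self.overflow = []          # unordered, append-only

    def insert(self, key):
        # Cheap write: no shifting of records in the master file.
        self.overflow.append(key)

    def contains(self, key):
        # Binary search in the master file, then a linear scan of overflow.
        i = bisect.bisect_left(self.master, key)
        if i < len(self.master) and self.master[i] == key:
            return True
        return key in self.overflow

    def range_query(self, low, high):
        # A contiguous slice of the master file plus matching overflow records.
        i = bisect.bisect_left(self.master, low)
        j = bisect.bisect_right(self.master, high)
        extra = [k for k in self.overflow if low <= k <= high]
        return sorted(self.master[i:j] + extra)

    def merge(self):
        # Periodic reorganization: sort overflow and merge it into the master.
        self.master = sorted(self.master + self.overflow)
        self.overflow = []


f = SortedFileWithOverflow([10, 20, 30])
f.insert(25)
print(f.contains(25), f.range_query(15, 30))
f.merge()
print(f.master)
```

Every search pays for the overflow scan until the next merge, so the merge frequency trades write cost against read cost.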

Hash File
- Basic idea
  - Hash function: maps the hash field of a record to the location of this record
- Hash function
  - Even distribution
  - Collision handling: open addressing, chained overflow
- External hashing for disk files
  - The target address space is divided into buckets, each holding multiple records
  - Bucket: one block, or several contiguous disk blocks
  - The hash function maps the key to a bucket number; a table in the file header converts the bucket number into a disk block address
- Static hashing vs. extensible hashing
- Pros and cons
  - Efficient for exact matches on the key field
  - Not suitable for range retrieval, which requires sequential storage
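
External hashing with chained overflow can be illustrated with a small static-hashing sketch. The bucket count, bucket capacity, modulo hash function, and all names below are assumptions chosen for the example, and keys are assumed to be integers.

```python
class HashFile:
    """Static hashing: fixed number of buckets, each with a chained overflow."""

    def __init__(self, num_buckets=4, bucket_capacity=2):
        self.num_buckets = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]   # primary buckets
        self.overflow = [[] for _ in range(num_buckets)]  # overflow chains

    def _bucket(self, key):
        # Hash function: maps the key to a bucket number.
        return key % self.num_buckets

    def insert(self, key, record):
        b = self._bucket(key)
        if len(self.buckets[b]) < self.capacity:
            self.buckets[b].append((key, record))
        else:
            # Collision with a full bucket: chain into the overflow area.
            self.overflow[b].append((key, record))

    def lookup(self, key):
        # Exact match: only one bucket (and its chain) is examined.
        b = self._bucket(key)
        for k, record in self.buckets[b] + self.overflow[b]:
            if k == key:
                return record
        return None


hf = HashFile()
for key, rec in [(1, "a"), (5, "b"), (9, "c"), (2, "d")]:
    hf.insert(key, rec)
print(hf.lookup(9), hf.overflow[1])
```

Keys 1, 5, and 9 all hash to bucket 1, so the third one lands in the overflow chain. A range query such as "keys between 1 and 9" would have to visit every bucket, which is why hashing does not suit range retrieval.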

Row vs. Column
- Row-oriented storage
  - All data associated with a given row is stored together
- Column-oriented storage (adopted by Dremel, for example)
  - All data from a given column is stored together, in order
  - Quickly serves data warehouse-style queries
    - Read only
    - Access a large range of certain attributes together
- Comparison
  - Column-oriented organizations are more efficient when
    - an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data
    - new values of a column are supplied for all rows at once at write time
  - Row-oriented organizations are more efficient when
    - many columns of a single row are required at the same time
    - row size is relatively small, as the entire row can be retrieved with a single disk seek
    - a new row is written and all of the row data is supplied at the same time
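
The comparison can be seen in miniature with two in-memory layouts of the same data. This is only a sketch: the table contents and field names are made up for illustration, and real column stores add compression and block layout that are not modeled here.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 20},
    {"id": 3, "region": "east", "amount": 30},
]

# Column-oriented layout: each column keeps the values of all rows together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregate over one column.
total_from_rows = sum(row["amount"] for row in rows)  # walks every full row
total_from_columns = sum(columns["amount"])           # touches one column only

# Fetch one complete record.
record_from_rows = rows[1]                            # one contiguous read
record_from_columns = {name: values[1] for name, values in columns.items()}

print(total_from_columns, record_from_columns)
```

The aggregate reads a single list in the column layout, while reassembling one record from the column layout needs one access per column.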

Secondary Organization (Index)
- Index
  - Persistent data structure, stored in the database
  - Ordered file of fixed-length records
  - A binary search on the index yields a pointer to the file record
  - Also called an access path on the field (the index field)
  - The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller
- Primary mechanism for improving database performance
  - Difference between full table scans and immediate access
- Indexes can be added or removed without changing database application logic
- Indexes can also be characterized as dense or sparse
  - A dense index has an index entry for every search key value (and hence every record) in the data file
  - A sparse (or nondense) index has index entries for only some of the search values

Secondary Organization (Index)
- Downsides
  - Extra space: marginal
  - Index creation: medium
  - Index maintenance: slows data maintenance, which can offset the benefits
- Indexes can be implemented using a variety of data structures
  - Balanced trees (B-trees, B+-trees)
    - Logarithmic access time; support range queries
  - Hash tables
    - Constant-time access

Primary Index
- Defined on an ordered data file [primary organization method]
  - The data file is ordered on a key field
- Includes one index entry for each block in the data file
  - The index entry holds the key field value of the first record in the block, which is called the block anchor
- A primary index is a nondense (sparse) index: it has one entry per disk block of the data file, holding the key of that block's anchor record, rather than one entry for every search value
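
A sparse primary index built from block anchors can be sketched as follows. The block contents and the function name `find` are hypothetical, and each block is modeled as a small sorted list of keys rather than a real disk block.

```python
import bisect

# Ordered data file: three blocks, records reduced to their key values.
blocks = [
    [1, 3, 5],
    [8, 10, 12],
    [15, 18, 21],
]

# Sparse primary index: one entry per block, holding the block anchor
# (the key of the first record in that block).
anchors = [block[0] for block in blocks]


def find(key):
    """Return (block number, offset in block) for key, or None if absent."""
    # Binary search on the index for the last anchor <= key.
    i = bisect.bisect_right(anchors, key) - 1
    if i < 0:
        return None  # key is smaller than every anchor
    # Only this one block has to be read and searched.
    block = blocks[i]
    return (i, block.index(key)) if key in block else None


print(find(10), find(9))
```

The index holds three entries for nine records, which illustrates why a primary index occupies far fewer blocks than the data file it indexes.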

Secondary Index
- A secondary means of accessing a file for which some primary access already exists
- Defined on an ordered, unordered, or hashed data file [primary organization method]
- Defined on a field that can be
  - a candidate key, with a unique value in every record, or
  - a non-key field, with duplicate values
- Each record in the index has two fields
  - The first field holds the indexing field value
  - The second field is either a block pointer or a record pointer
- There can be many secondary indexes (and hence indexing fields) for the same file
When a block pointer is used in a secondary index, accessing a record means transferring the disk block to memory and then searching for the record in memory.
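
A secondary index on a non-key field can be sketched as a sorted mapping from field value to record pointers. The sample records, the choice of list positions as record pointers, and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Unordered data file; the list position stands in for a record pointer.
records = [
    ("u1", "Ann", "NYC"),
    ("u2", "Bob", "LA"),
    ("u3", "Cat", "NYC"),
]


def build_secondary_index(data, field_position):
    """Map each value of the indexing field to the pointers of its records."""
    index = defaultdict(list)
    for pointer, record in enumerate(data):
        index[record[field_position]].append(pointer)
    # Keep the index ordered on the indexing field, as an index file would be.
    return dict(sorted(index.items()))


city_index = build_secondary_index(records, 2)  # non-key field, duplicates
id_index = build_secondary_index(records, 0)    # candidate key, unique values

print(city_index)
print([records[p] for p in city_index["NYC"]])
```

Because the data file is not ordered on the indexing field, every record needs its own pointer, so a secondary index is dense.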

Multi-Level Indexes
- A single-level index is an ordered file
  - A primary index to the index itself can be created
  - The original index file is called the first-level index
  - The index to the index is called the second-level index
- Repeat the process
  - Create a third, fourth, ..., top level
  - Stop when all entries of the top level fit in one disk block
- A multi-level index can be created for any type of first-level index (primary, secondary)
- Such a multi-level index is a form of search tree
  - Insertion and deletion of index entries is a severe problem because every level of the index is an ordered file
  - Solution: B-tree and B+-tree
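
The "repeat until the top level fits in one block" rule determines the number of levels. The sketch below counts them under stated assumptions: `fan_out` is the number of index entries per block, each higher level holds one entry per block of the level below, and the sample numbers are hypothetical.

```python
import math


def index_levels(first_level_entries: int, fan_out: int) -> int:
    """Number of index levels needed until the top level fits in one block."""
    levels = 1
    blocks = math.ceil(first_level_entries / fan_out)  # first-level blocks
    while blocks > 1:
        # Build one more level with one entry per block of the level below.
        levels += 1
        blocks = math.ceil(blocks / fan_out)
    return levels


# Hypothetical figures: 30,000 first-level entries, 68 entries per block.
print(index_levels(30_000, 68))
```

With these figures the levels occupy 442, 7, and 1 blocks, so a lookup reads one block per level plus the data block, compared with roughly log2(442), about 9, block reads for a binary search on the first level alone.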

B-Trees and B+-Trees
- Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem
- In B-tree and B+-tree data structures, each node corresponds to a disk block
- Each node is kept between half full and completely full
- An insertion into a node that is not full is quite efficient
  - If a node is full, the insertion causes a split into two nodes; splitting may propagate to other tree levels
- A deletion is quite efficient if the node does not become less than half full
  - If a deletion causes a node to become less than half full, it must be merged with neighboring nodes
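
The split step can be sketched for a single B+-tree leaf. This is a deliberately partial illustration, not a full B+-tree: it ignores data pointers, parent updates, and propagation, and it assumes the convention in which the smallest key of the new right node is copied up as the separator (textbooks differ on this detail). The function name and parameters are hypothetical.

```python
import bisect


def insert_into_leaf(leaf, key, max_keys):
    """Insert key into a sorted leaf.

    Returns (left, right, separator). When no split is needed, right and
    separator are None and left is the updated leaf.
    """
    keys = list(leaf)          # work on a copy of the node
    bisect.insort(keys, key)   # keep the keys in sorted order
    if len(keys) <= max_keys:
        return keys, None, None  # node was not full: cheap insertion
    # Overflow: split into two nodes that are each at least half full.
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    # The separator would be inserted into the parent, which may split too.
    return left, right, right[0]


print(insert_into_leaf([1, 3], 2, max_keys=3))
print(insert_into_leaf([1, 3, 5], 4, max_keys=3))
```

The second call overflows a full leaf, so it yields two half-full nodes and a separator for the parent, which is where a split can propagate upward.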

B-tree Structures

Difference between B-tree and B+-tree
- In a B-tree, pointers to data records exist at all levels of the tree
- In a B+-tree, all pointers to data records exist at the leaf-level nodes
- A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree, because its internal nodes hold no data pointers and so fit more search values per block

The Nodes of a B+-tree
[Figure: the nodes of a B+-tree]
- (a) Internal node of a B+-tree with q - 1 search values
- (b) Leaf node of a B+-tree with q - 1 search values and q - 1 data pointers

An Example of an Insertion in a B+-tree

An Example of a Deletion in a B+-tree

Create Index
- How to create an index using SQL

  CREATE INDEX index_name ON table_name (column_name)

- Which attribute should be used?
Consider a simple relation V(M,N) with two attributes M and N; both take integer values in [1, 100]. For the following SQL queries:

  SELECT * FROM V WHERE M=?
  SELECT * FROM V WHERE N=?
  SELECT * FROM V WHERE M=? AND N=?

Which index is helpful for each query?
  1. Index on V(M)
  2. Index on V(N)
  3. Index on V(M,N)
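
One way to explore the question is to create the indexes and ask the engine which one it picks. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in; the slides target MySQL, whose planner and output differ. The index names `idx_m` and `idx_n` and the helper `plan` are made up for the example, and only the two single-column indexes are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE V (M INTEGER, N INTEGER)")
# Populate V with every (M, N) pair, both attributes ranging over [1, 100].
conn.executemany(
    "INSERT INTO V VALUES (?, ?)",
    [(m, n) for m in range(1, 101) for n in range(1, 101)],
)
conn.execute("CREATE INDEX idx_m ON V (M)")
conn.execute("CREATE INDEX idx_n ON V (N)")


def plan(sql):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail


plan_m = plan("SELECT * FROM V WHERE M = 3")
plan_n = plan("SELECT * FROM V WHERE N = 3")
print(plan_m)
print(plan_n)
```

The exact plan text varies between SQLite versions, so it is safer to look for the index name in the output than to compare whole strings.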

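Example: Index on V(M)
For the SQL query

  SELECT * FROM V WHERE M=3

[Figure: index on V(M), with one column for the value of V(M) (up to 100) and one for pointers to records (or blocks). The entry for M=3 holds a list of pointers to (3,1), (3,2), ..., (3,100).]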

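Example: Index on V(N)
For the SQL query

  SELECT * FROM V WHERE N=3

[Figure: index on V(N), with one column for the value of V(N) (up to 100) and one for pointers to records (or blocks). The entry for N=3 holds a list of pointers to (1,3), (2,3), ..., (100,3).]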

Example: Approach 1, Index on V(M) and Index on V(N)
For the SQL query

  SELECT * FROM V WHERE M=3 AND N=5

[Figure: two indexes. In the index on V(M), the entry for M=3 holds a list of pointers to (3,1), ..., (3,100). In the index on V(N), the entry for N=5 holds a list of pointers to (1,5), ..., (100,5).]
Take the set intersection of the two pointer lists: {(3,1), (3,2), ...} and {(1,5), (2,5), ...}.

Example: Approach 2, Index on V(M,N)
For the SQL query

  SELECT * FROM V WHERE M=3 AND N=5

[Figure: index on V(M,N), with one column for the value of V(M,N) (1,1; 2,1; ...; 3,5; ...) and one for pointers to records. The entry for (3,5) holds a single pointer to (3,5).]
Comparison:
  1. Overhead in index maintenance
  2. Efficiency in record access
  3. Flexibility in supporting different queries
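
The two approaches can be compared side by side in a small in-memory model. This is a sketch under stated assumptions: indexes are plain Python dictionaries, list positions stand in for record pointers, and all variable names are hypothetical.

```python
from collections import defaultdict

# Relation V(M, N): every pair with both attributes in [1, 100].
records = [(m, n) for m in range(1, 101) for n in range(1, 101)]

idx_m = defaultdict(set)  # index on V(M): value -> set of record pointers
idx_n = defaultdict(set)  # index on V(N): value -> set of record pointers
idx_mn = {}               # index on V(M,N): (m, n) -> single record pointer
for pointer, (m, n) in enumerate(records):
    idx_m[m].add(pointer)
    idx_n[n].add(pointer)
    idx_mn[(m, n)] = pointer

# Approach 1: fetch two pointer lists and take their set intersection.
via_intersection = idx_m[3] & idx_n[5]

# Approach 2: one lookup in the composite index.
via_composite = idx_mn[(3, 5)]

print(len(idx_m[3]), len(idx_n[5]), via_intersection, via_composite)
```

Approach 1 handles 200 pointers to find one record but its two indexes also serve the single-attribute queries. Approach 2 finds the record in one lookup, while a hash-style composite index like this dictionary cannot answer a query on N alone.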

Summary
[Figure: DBMS architecture. Users and application programs/queries -> query processing -> data access (database engine) -> meta-data and data.]
- Primary file organization
  - Heap, sorted, and hash files
- Secondary organization (index)
  - Primary, secondary
  - Multi-level index
    - B-tree, B+-tree
  - Create index in SQL
- Database engine in a DBMS
  - Handles create, read, update, and delete (CRUD) operations on a database
Physical design optimization: create indexes based on
  1. Database statistics
  2. Query patterns