Big Data Yuan Xue CS 292 Special topics on.


1 Big Data Yuan Xue (yuan.xue@vanderbilt.edu) CS 292 Special topics on

2 Part I Relational Database (Physical Database Design) Yuan Xue (yuan.xue@vanderbilt.edu)

3 Review and Look Forward
- What we know so far
  - Design: database schema
    - Optimization objective: minimum redundancy with information preservation
  - Operation: database access and manipulation via SQL
- Next step
  - Design: how is data stored in the database?
  - Operation: how data is accessed and manipulated; how to configure the data access design via SQL
  - Optimization: how data storage and access methods can be designed so that the (average) execution time of applications (and their SQL queries) is minimized
- Design pipeline (figure): Conceptual Design (Entity/Relationship model) -> data model mapping -> Logical Design (Logical Schema) -> normalization (Normalized Schema) -> Physical Design (Physical (Internal) Schema); physical design is the next step

4 Primitives
- Storage hierarchy in computer systems
  - Processor cache
  - Memory (data cache, volatile)
  - Disk (persistent)
  - Tapes
- Persistent data storage vs. transient (volatile) data storage
- Storage speed and price are currently correlated
  - E.g., DynamoDB uses Solid State Drives (SSDs) for speed
  - Faster storage is typically volatile
- Disk basics
  - Slow: rotational delay + seek time
  - A block is the unit of data access and transfer between disk and memory
  - Buffering is a low-level technique to save data access and transfer time

5 Primitives
- File system
  - A layer of the OS that transforms the block interface of disks into files (directories, etc.)
- File system components
  - Disk management: organize disk blocks into files
    - Contiguous allocation, linked allocation, indexed allocation
    - E.g., inode in Unix
  - Naming: interface to find files by name, not by blocks
  - Protection: layers to keep data secure
  - Reliability and durability
- File operations
  - Open, close
  - Read (random, sequential), write (modification, insertion, append, delete)

6 Placing and Accessing File Records on Disk
- Record
  - Fixed-length records
  - Variable-length records
    - Separator
- BLOB (binary large object) in MySQL
  - Media file, image, etc.
  - Option 1: store it separately from its record and keep a pointer in the record
  - Option 2: use the BLOB or TEXT type
    - Use MySQL string handling
    - Better security support
- Putting records into blocks
  - Spanned vs. unspanned

  CREATE TABLE MiniTwitter.User (
      ID       VARCHAR(20) NOT NULL,
      Name     VARCHAR(20) NOT NULL,
      Email    VARCHAR(20) NOT NULL,
      Password CHAR(10)    NOT NULL
  );

  ID        Name   Email              Password
  Alice00   Alice  alice00@gmail.com  Aadf1234
  Bob2013   Bob    bob13@gmail.com    qwer6789
  Cathy123  Cathy  cath@vandy         Tyuoa~!@

  Block 1:
    Alice00   Alice  alice00@gmail.com  Aadf1234
    Bob2013   Bob    bob13@gmail.com    qwer6789
  Block 2:
    Cathy123  Cathy  alice00@gmail.com  Aadf1234
    Dave00    Dave   bob13@gmail.com    qwer6789
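The spanned vs. unspanned choice can be made concrete with a little arithmetic. The sketch below is only illustrative: it treats the four columns of MiniTwitter.User as fixed-width fields (70 bytes per record), assumes a 512-byte block, and ignores per-block headers and the pointers that spanned records need. The function names are hypothetical.

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole records that fit in one block (unspanned organization)."""
    return block_size // record_size

def blocks_needed(num_records: int, block_size: int, record_size: int,
                  spanned: bool = False) -> int:
    """Blocks required to store num_records fixed-length records."""
    if spanned:
        # Records may cross block boundaries, so no space is wasted.
        return math.ceil(num_records * record_size / block_size)
    # Unspanned: the tail of each block that cannot hold a record is unused.
    return math.ceil(num_records / blocking_factor(block_size, record_size))

RECORD_SIZE = 20 + 20 + 20 + 10   # ID + Name + Email + Password, assumed fixed-width
BLOCK_SIZE = 512                  # assumed block size in bytes

print(blocking_factor(BLOCK_SIZE, RECORD_SIZE))                    # 7
print(blocks_needed(1000, BLOCK_SIZE, RECORD_SIZE))                # 143
print(blocks_needed(1000, BLOCK_SIZE, RECORD_SIZE, spanned=True))  # 137
```

Under these assumptions the unspanned layout wastes 22 bytes per block, which is why it needs a few more blocks than the spanned layout.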

7 Database Physical Design Overview
- Two aspects
  - Storage organization
  - Access method
- Primary file organization
  - How file records are physically placed on the disk
  - How the records can be accessed
- Secondary organization (auxiliary access structure)
  - Allows efficient access to file records based on fields other than those used for the primary file organization
  - Most of these exist as indexes
- Database engine (or storage engine) in a DBMS
  - The component that handles create, read, update, and delete (CRUD) operations on a database
- Successful design: perform as efficiently as possible on the frequent operations

8 Primary File Organization
- Several types
  - Heap file (unsorted)
  - Sorted file (ordered by a particular field, the sort key)
  - Hash file
  - B-tree, B+-tree: mainly used for index files

9 Heap File (Unsorted)
- Operations
  - Insertion: efficient
  - Retrieval/search: inefficient, linear
  - Deletion
    - May lead to unused space
    - Mark selected records as "deleted"; requires periodic reorganization
- Advantages
  - Efficient for bulk loading data
  - Efficient for relatively small relations, as indexing overheads are avoided
  - Efficient when retrievals involve a large proportion of the stored records
- Disadvantages
  - Not efficient for selective retrieval using key values
  - Sorting may be time-consuming
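The heap-file operations above can be sketched in a few lines. This is an in-memory toy, not how any particular DBMS lays out pages: the class name `HeapFile` and its methods are hypothetical, and a Python list stands in for the sequence of record slots on disk.

```python
class HeapFile:
    """Toy heap file: records are kept in arrival order."""

    def __init__(self):
        self.records = []     # record slots, in insertion order
        self.deleted = set()  # slot numbers marked as deleted

    def insert(self, record):
        # Insertion is cheap: append at the end of the file.
        self.records.append(record)
        return len(self.records) - 1

    def search(self, field, value):
        # Retrieval is a linear scan over every live slot.
        return [r for i, r in enumerate(self.records)
                if i not in self.deleted and r[field] == value]

    def delete(self, field, value):
        # Deletion only marks the slot; the space stays occupied.
        for i, r in enumerate(self.records):
            if i not in self.deleted and r[field] == value:
                self.deleted.add(i)

    def reorganize(self):
        # Periodic reorganization reclaims the space of deleted slots.
        self.records = [r for i, r in enumerate(self.records)
                        if i not in self.deleted]
        self.deleted.clear()


users = HeapFile()
for name in ("Alice", "Bob", "Cathy"):
    users.insert({"name": name})
users.delete("name", "Bob")
print(len(users.records))   # 3: Bob's slot is still occupied
users.reorganize()
print(len(users.records))   # 2
```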

10 Sorted File
- Sorted file: ordered by a particular field, the ordering key
- Operations
  - Search based on the ordering key is efficient (binary search)
  - Insertion and deletion are expensive
    - Solution: overflow file (idea used in BigTable/HBase)
      - Create a temporary unordered file
      - New records are added to the overflow file
      - Periodically, the overflow file is sorted and merged with the master file
      - Increased complexity in search (both the overflow and master files have to be searched)
- Advantages
  - Efficient for range access on the ordering key
  - Efficient for (random) writes when an overflow file is used
- Disadvantages
  - Sequential scan performs the same as on a heap file
  - No benefit for access based on non-ordering fields
- Sorted files are rarely used in database applications unless a primary index (coming up in slide 15) is defined as an additional access path
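The overflow-file idea can be sketched as follows. This is a simplified in-memory model with hypothetical names (`SortedFileWithOverflow`, `master`, `overflow`); records are reduced to their ordering keys, and the periodic merge is triggered by hand.

```python
import bisect

class SortedFileWithOverflow:
    """Toy sorted file: a sorted master file plus an unordered overflow file."""

    def __init__(self, keys=()):
        self.master = sorted(keys)  # ordered by the ordering key
        self.overflow = []          # new records, in arrival order

    def insert(self, key):
        # Cheap write: append to the overflow file, no shifting of the master.
        self.overflow.append(key)

    def contains(self, key):
        # Binary search on the master file ...
        i = bisect.bisect_left(self.master, key)
        if i < len(self.master) and self.master[i] == key:
            return True
        # ... plus a linear scan of the overflow file.
        return key in self.overflow

    def merge(self):
        # Periodic reorganization: sort the overflow and merge it into the master.
        self.master = sorted(self.master + self.overflow)
        self.overflow = []


f = SortedFileWithOverflow([10, 20, 30])
f.insert(15)
print(f.contains(15), f.master)   # True [10, 20, 30]
f.merge()
print(f.master)                   # [10, 15, 20, 30]
```

The search has to consult both files until the next merge, which is the extra complexity the slide refers to.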

11 Hash File
- Basic idea
  - Hash function: hash field of a record -> location of this record
- Hash function
  - Even distribution
  - Collision handling: open addressing, chained overflow
- External hashing for disk files
  - The target address space is divided into buckets, each holding multiple records
  - Bucket: one block, or contiguous disk blocks
  - The hash function maps the key to a bucket number; a table in the file header converts the bucket number into a disk block address
- Static hashing vs. extensible hashing
- Pros and cons
  - Efficient for exact matches on the key field
  - Not suitable for range retrieval, which requires sequential storage
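A minimal sketch of static external hashing with chained overflow is shown below. The names (`HashFile`, `bucket_capacity`) are hypothetical, keys are assumed to be integers, and the hash function is a plain modulo; a real system would also map bucket numbers to disk block addresses through the file header.

```python
class HashFile:
    """Toy static hash file: fixed number of buckets, each with an overflow chain."""

    def __init__(self, num_buckets=4, bucket_capacity=2):
        self.bucket_capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]   # primary buckets
        self.overflow = [[] for _ in range(num_buckets)]  # chained overflow

    def _bucket(self, key):
        # Hash function: integer key -> bucket number (assumed modulo hashing).
        return key % len(self.buckets)

    def insert(self, key, record):
        b = self._bucket(key)
        if len(self.buckets[b]) < self.bucket_capacity:
            self.buckets[b].append((key, record))
        else:
            # Collision with a full bucket: place the record in the overflow chain.
            self.overflow[b].append((key, record))

    def lookup(self, key):
        # Exact-match search touches one bucket and its overflow chain only.
        b = self._bucket(key)
        for k, record in self.buckets[b] + self.overflow[b]:
            if k == key:
                return record
        return None


hf = HashFile()
for key in (1, 5, 9):          # all three keys hash to bucket 1
    hf.insert(key, f"record-{key}")
print(hf.lookup(9))            # record-9 (found in the overflow chain)
print(hf.lookup(13))           # None
```

Because neighboring key values land in unrelated buckets, a range query would have to probe every bucket, which matches the "not suitable for range retrieval" point.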

12 Row vs. Column
- Row-oriented storage
  - All data associated with a given row is stored together
- Column-oriented storage (adopted by Dremel, for example)
  - Stores all data from a given column together, in order
  - Quickly serves data-warehouse-style queries
    - Read only
    - Access a large range of certain attributes together
- Comparison
  - Column-oriented organizations are more efficient when
    - an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data
    - new values of a column are supplied for all rows at once when writing
  - Row-oriented organizations are more efficient when
    - many columns of a single row are required at the same time
    - the row size is relatively small, as the entire row can be retrieved with a single disk seek
    - a new row is written with all of the row data supplied at the same time
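The two layouts can be compared with plain Python containers. The table contents and the `age` column below are made up for illustration; the point is only which values each access pattern has to touch.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Cathy", "age": 35},
]

# Column-oriented layout: each column keeps all of its values together, in row order.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Aggregate over one column: the column store reads a single list,
# while the row store has to walk through every full record.
avg_age_columns = sum(columns["age"]) / len(columns["age"])
avg_age_rows = sum(row["age"] for row in rows) / len(rows)

# Fetch one complete record: the row store returns it directly,
# while the column store has to visit every column.
record_from_rows = rows[1]
record_from_columns = {field: values[1] for field, values in columns.items()}

print(avg_age_columns, avg_age_rows)             # 30.0 30.0
print(record_from_rows == record_from_columns)   # True
```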

13 Secondary Organization (Index)
- Index
  - Persistent data structure, stored in the database
  - Ordered file of fixed-length records
  - A binary search on the index yields a pointer to the file record
  - Also called an access path on the field (the index field)
  - The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller
- Primary mechanism for improving database performance
  - The difference between full table scans and immediate access
  - Indexes can be added or removed without changing database application logic
- Indexes can also be characterized as dense or sparse
  - A dense index has an index entry for every search key value (and hence every record) in the data file
  - A sparse (or nondense) index has index entries for only some of the search values

14 Secondary Organization (Index)
- Downsides
  - Extra space: marginal
  - Index creation: medium
  - Index maintenance: slower data modification can offset the benefits
- Indexes can be implemented using a variety of data structures
  - Balanced trees (B-trees, B+-trees): logarithmic access, support range queries
  - Hash tables: constant-time access

15 Primary Index
- Defined on an ordered data file [primary organization method]
  - The data file is ordered on a key field
- Includes one index entry for each block in the data file
  - The index entry has the key field value of the first record in the block, which is called the block anchor
- A primary index is a nondense (sparse) index: it includes one entry per disk block of the data file, holding the key of that block's anchor record, rather than one entry per search value
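A sparse primary index over block anchors might be sketched like this. The block contents are made-up integer keys, and `find_block` is a hypothetical helper; it returns the block number that holds the key, or `None` if the key is absent.

```python
import bisect

# Data file ordered on the key field, three records per block (made-up keys).
blocks = [
    [1, 3, 5],
    [8, 10, 12],
    [15, 20, 25],
]

# Primary index: one entry per block, holding the key of the block anchor.
anchors = [block[0] for block in blocks]   # [1, 8, 15]

def find_block(key):
    """Return the number of the block containing key, or None."""
    # Binary search for the last anchor that is <= key.
    i = bisect.bisect_right(anchors, key) - 1
    if i < 0:
        return None            # key is smaller than every anchor
    # Only this one block has to be read and searched.
    return i if key in blocks[i] else None

print(find_block(10))   # 1
print(find_block(9))    # None
```

The index has three entries for nine records, which is what makes it sparse.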

16 Secondary Index
- Provides a secondary means of accessing a file for which some primary access already exists
- Defined on an ordered, unordered, or hashed data file [primary organization method]
- Defined on a field that can be
  - a candidate key with a unique value in every record, or
  - a non-key field with duplicate values
- Each record in the index has two fields
  - The first field holds the indexing field value
  - The second field is either a block pointer or a record pointer
- There can be many secondary indexes (and hence indexing fields) for the same file
- When a block pointer is used in a secondary index, accessing a record means transferring the disk block to memory and then searching for the record in memory
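A dense secondary index on a non-key field can be sketched as a sorted list of (value, record pointer) pairs. The data below reuses the user names from the earlier slide and adds one hypothetical record (`Bob77`) so that the indexed field has a duplicate; slot numbers in a Python list stand in for record pointers.

```python
import bisect

# Unordered data file of (ID, Name) records; the slot number acts as the record pointer.
data_file = [
    ("Cathy123", "Cathy"),
    ("Alice00", "Alice"),
    ("Bob2013", "Bob"),
    ("Dave00", "Dave"),
    ("Bob77", "Bob"),      # hypothetical record, added to show duplicate values
]

# Secondary index on Name: one (indexing field value, record pointer) entry per record,
# kept sorted on the indexing field.
name_index = sorted((name, slot) for slot, (_, name) in enumerate(data_file))
index_keys = [name for name, _ in name_index]

def lookup_by_name(name):
    """Return every record whose Name equals the given value."""
    lo = bisect.bisect_left(index_keys, name)
    hi = bisect.bisect_right(index_keys, name)
    return [data_file[slot] for _, slot in name_index[lo:hi]]

print(lookup_by_name("Bob"))   # [('Bob2013', 'Bob'), ('Bob77', 'Bob')]
print(lookup_by_name("Eve"))   # []
```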

17 Multi-Level Indexes
- A single-level index is an ordered file
  - So a primary index on the index itself can be created
  - The original index file is called the first-level index
  - The index on the index is called the second-level index
- Repeat the process
  - Create a third, fourth, ..., top level
  - Stop when all entries of the top level fit in one disk block
- A multi-level index can be created for any type of first-level index (primary, secondary)
- Such a multi-level index is a form of search tree
- Insertion and deletion of index entries is a severe problem because every level of the index is an ordered file
  - Solution: B-tree and B+-tree
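The "repeat until the top level fits in one block" rule translates directly into a loop. In the sketch below, `fan_out` is the number of index entries that fit in one block; the figures used (30,000 first-level entries, 68 entries per block) are illustrative assumptions, not values from the slides.

```python
import math

def index_levels(first_level_entries: int, fan_out: int) -> int:
    """Number of levels in a multi-level index.

    first_level_entries: entries in the first-level index
    fan_out: index entries that fit in one disk block
    """
    levels = 1
    blocks = math.ceil(first_level_entries / fan_out)  # blocks of the first level
    while blocks > 1:
        # The next level has one entry per block of the level below it.
        blocks = math.ceil(blocks / fan_out)
        levels += 1
    return levels

print(index_levels(30000, 68))   # 3  (442 blocks -> 7 blocks -> 1 block)
```

Each level costs one block access during a search, so the lookup here reads three index blocks plus the data block, instead of roughly nine for a binary search over the 442 first-level blocks.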

18 B-Trees and B+-Trees
- Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem
- In B-tree and B+-tree data structures, each node corresponds to a disk block
- Each node is kept between half full and completely full
- An insertion into a node that is not full is quite efficient
  - If a node is full, the insertion causes a split into two nodes; splitting may propagate to other tree levels
- A deletion is quite efficient if the node does not become less than half full
  - If a deletion causes a node to become less than half full, it must be merged with neighboring nodes

19 B-tree Structures

20 Difference between B-tree and B+-tree
- In a B-tree, pointers to data records exist at all levels of the tree
- In a B+-tree, all pointers to data records exist at the leaf-level nodes
- A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree
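The capacity claim follows from what has to fit in one block. A rough sketch, assuming a 512-byte block, 9-byte search values, 7-byte data pointers, and 6-byte tree pointers (all illustrative sizes), and ignoring any per-node header:

```python
def btree_order(block: int, key: int, data_ptr: int, tree_ptr: int) -> int:
    """Largest p such that p tree pointers plus (p - 1) (key, data pointer)
    pairs fit in one block: p*tree_ptr + (p-1)*(key + data_ptr) <= block."""
    return (block + key + data_ptr) // (tree_ptr + key + data_ptr)

def bplus_internal_order(block: int, key: int, tree_ptr: int) -> int:
    """Largest p such that p tree pointers plus (p - 1) keys fit in one block:
    p*tree_ptr + (p-1)*key <= block. Internal nodes hold no data pointers."""
    return (block + key) // (tree_ptr + key)

BLOCK, KEY, DATA_PTR, TREE_PTR = 512, 9, 7, 6   # assumed sizes in bytes

print(btree_order(BLOCK, KEY, DATA_PTR, TREE_PTR))   # 24
print(bplus_internal_order(BLOCK, KEY, TREE_PTR))    # 34
```

With these sizes a B+-tree internal node fans out to 34 children against 24 for a B-tree node, so the same number of search values needs fewer levels.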

21 The Nodes of a B+-tree
- FIGURE 14.11 The nodes of a B+-tree
  - (a) Internal node of a B+-tree with q – 1 search values
  - (b) Leaf node of a B+-tree with q – 1 search values and q – 1 data pointers

22 An Example of an Insertion in a B+-tree

23 An Example of a Deletion in a B+-tree

24 Create Index
- How to create an index using SQL

  CREATE INDEX index_name ON table_name (column_name)

- Which attribute should be used?
  - Consider a simple relation V(M,N) with two attributes M and N; both take integer values in [1, 100]
  - For the following SQL queries:

    SELECT * FROM V WHERE M=?
    SELECT * FROM V WHERE N=?
    SELECT * FROM V WHERE M=? AND N=?

  - Which index is helpful for each query?
    1. Index on V(M)
    2. Index on V(N)
    3. Index on V(M,N)
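One way to explore the question is to ask a query planner. The sketch below uses SQLite through Python's standard `sqlite3` module, which is an assumption made for convenience (the slides mention MySQL); the index name `idx_v_m` is hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE V (M INTEGER, N INTEGER)")
conn.executemany(
    "INSERT INTO V VALUES (?, ?)",
    [(m, n) for m in range(1, 101) for n in range(1, 101)],
)

# Index on V(M) only.
conn.execute("CREATE INDEX idx_v_m ON V (M)")

def plan(sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

plan_m = plan("SELECT * FROM V WHERE M = ?", (3,))
plan_n = plan("SELECT * FROM V WHERE N = ?", (3,))

print(plan_m)   # mentions idx_v_m: the index on M serves this query
print(plan_n)   # a full scan of V: the index on M does not help here
```

The same experiment can be repeated after adding an index on V(N) or on V(M,N) to see which queries each one serves.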

25 Example
- For SQL query: SELECT * FROM V WHERE M=3
- Index on V(M)

  Value of V(M)   Pointers to records (or blocks)
  1
  2
  3               list of pointers to (3,1), (3,2), ..., (3,100)
  ...
  100

26 Example
- For SQL query: SELECT * FROM V WHERE N=3
- Index on V(N)

  Value of V(N)   Pointers to records (or blocks)
  1
  2
  3               list of pointers to (1,3), (2,3), ..., (100,3)
  ...
  100

27 Example
- For SQL query: SELECT * FROM V WHERE M=3 AND N=5
- Approach 1: index on V(M) and index on V(N)

  Value of V(M)   Pointers to records
  1
  2
  3               list of pointers to (3,1), ..., (3,100)
  ...
  100

  Value of V(N)   Pointers to records
  1
  ...
  5               list of pointers to (1,5), ..., (100,5)
  ...
  100

- Take the set intersection of {(3,1), (3,2), ...} and {(1,5), (2,5), ...}

28 Example
- For SQL query: SELECT * FROM V WHERE M=3 AND N=5
- Approach 2: index on V(M,N)

  Value of V(M,N)   Pointers to records
  1,1
  2,1
  ...
  3,5               pointer to (3,5)
  ...
  100

- Comparison
  1. Overhead in index maintenance
  2. Efficiency in record access
  3. Flexibility in supporting different queries
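The two approaches for M=3 AND N=5 can be put side by side with dictionaries standing in for the indexes. This is a sketch under the assumption that V holds every (M, N) pair exactly once, so the composite index maps each pair to a single record; record IDs are positions in a Python list.

```python
from collections import defaultdict

# Relation V(M, N): every combination of M and N in [1, 100], once each (assumed).
records = [(m, n) for m in range(1, 101) for n in range(1, 101)]

index_m = defaultdict(set)   # index on V(M): value of M -> record IDs
index_n = defaultdict(set)   # index on V(N): value of N -> record IDs
index_mn = {}                # index on V(M,N): (M, N) -> record ID
for rid, (m, n) in enumerate(records):
    index_m[m].add(rid)
    index_n[n].add(rid)
    index_mn[(m, n)] = rid

# Approach 1: fetch two pointer lists of 100 entries each, then intersect them.
hits_two_indexes = index_m[3] & index_n[5]

# Approach 2: a single probe of the composite index.
hits_composite = {index_mn[(3, 5)]}

print(len(index_m[3]), len(index_n[5]))        # 100 100
print(hits_two_indexes == hits_composite)      # True
print([records[rid] for rid in hits_composite])  # [(3, 5)]
```

The composite index answers this query with one lookup, but a query on N alone cannot use it this way, while the two single-column indexes serve all three query shapes at the cost of maintaining two structures.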

29 Summary
- DBMS system (figure): Users -> Application Program/Queries -> Query Processing -> Data access (Database Engine) -> Meta-data, Data
- Primary file organization
  - Heap, sorted, hash files
- Secondary organization (index)
  - Primary, secondary
  - Multi-level index
    - B-tree, B+-tree
  - Create index in SQL
- Database engine in a DBMS
  - Handles create, read, update, and delete (CRUD) operations on a database
- Physical design optimization: create indexes based on
  1. Database statistics
  2. Query patterns

