Physical Design: Types of Indexes & Files

University of Manitoba, Asper School of Business
3500 DBMS
Bob Travica
Based on G. Post, DBMS: Designing & Building Business Applications
Updated 2020

Physical Data Storage
Topics of interest:
  File types for storing data
  Index types: data structures for retrieving data (index to sequential file, linked list, B+-tree, hash table)
  Additional physical design methods (file partitioning, clustering)

Terminology
Data entry (data element): a special short record that usually contains the key attribute, the address of the rest of the data (a pointer), and sometimes other addresses; used for creating indexes.
Pointer: the address of data, a designation of the data location.
DBMS task: any of the CRUD operations.

DBMS Tasks (CRUD)
Store (write, create) data: insert a row.
Retrieve (read) data: read the entire table (scan all rows), or read an arbitrary/random row.
Modify (update) data: change “Crag” into “Craig”; a random read comes first.
Delete data: 2 steps, mark + “pack”.

LastName   FirstName  Phone
Adams      Kimberly   (406)
Allbright  Searoba    (619)
Anderson   Charlotte  (701)
Baez       Bessie     (606)
Baez       Lou Ann    (502)
Bailey     Gayle      (360)
Bell       Luther     (717)
Carter     Phillip    (219)
Carver     Bernice    (804)
Crag       Melinda    (502)
x
Duvall     Pierre     (502)
Adkins     Inga       (706)

File Types & Access Methods (Indexes)
Indexed Sequential Access Method (ISAM) & sequential file
Linked list index
B+-tree index
Hash index

Sequential File
Two formats: sorted file and unsorted file (records are inserted by appending at the end).
Uses:
  When data don't change much
  When data are retrieved in the same order
  When the table is huge and space is expensive
  When transporting / converting data to a different system
Example: file Employee (sorted on employee ID)

Operations on Sequential Files
Read entire file sequentially: easy and fast.
Read next record: fast.
Random read / sequential read with pattern matching: slow.
Probability of any row lookup = 1/N (N = number of records).
Delete, insert, modify: first find the record, then make the change, so these are slow.

Row  Prob.  # Reads
A    1/N    1
B    1/N    2
C    1/N    3
D    1/N    4
E    1/N    5
…    1/N    i

Sequential Access to Sorted Sequential File
Sequential search: go one by one from the top.
Find Brown: 2 lookups. Find Jones: 10 lookups.
Min lookups = 1, max = 10.
On average = (N+1)/2 = (10+1)/2 = 11/2 = 5.5, i.e. about 6 lookups.
Key attribute LastName (N = 10 records): Adams, Brown, Cadiz, Dorfmann, Eaton, Farris, Goetz, Hanson, Inez, Jones
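The lookup counts on this slide can be checked with a minimal sketch of a sequential search. The function name sequential_search is an illustration, not something from the slides; the key list is the one shown above.

```python
def sequential_search(keys, target):
    """Scan keys from the top; return (position, lookups), position is None if absent."""
    lookups = 0
    for position, key in enumerate(keys):
        lookups += 1                      # one lookup per record examined
        if key == target:
            return position, lookups
    return None, lookups


last_names = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton",
              "Farris", "Goetz", "Hanson", "Inez", "Jones"]

brown = sequential_search(last_names, "Brown")
jones = sequential_search(last_names, "Jones")

# Average over all N keys, each equally likely: (1 + 2 + ... + N) / N = (N + 1) / 2
average = sum(sequential_search(last_names, k)[1] for k in last_names) / len(last_names)
print(brown[1], jones[1], average)
```

Brown is found on the 2nd lookup and Jones on the 10th, and the average over all ten keys comes out at 5.5, matching (N+1)/2.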

Insertion into Sorted Sequential File
Insert record Inez:
1. Find the insert location and mark the top and bottom parts of the old file.
2. Copy the top to a new file.
3. Add the new row.
4. Copy the bottom to the new file.
5. Delete the old file.

Old file:
Part    ID  LastName   FirstName  DateHired
Top     8   Carpenter  Carlos     12/29/2001
Top     6   Eaton      Anissa     8/23/2001
Top     7   Farris     Dustin     3/28/2001
Top     2   Gibson     Bill       3/31/2001
Bottom  5   James      Leisha     1/6/2001
Bottom  9   O’Connor   Jessica    7/23/2001
Bottom  3   Reasoner   Katy       2/17/2001
Bottom  1   Reeves     Keith      1/29/2001
Bottom  10  Shields    Howard     7/13/2001

New file: 1. copy the top, 2. insert the new row, 3. copy the bottom.
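The copy-insert-copy procedure can be sketched as follows. This is a minimal in-memory illustration that assumes each record is an (ID, LastName) pair; the ID 11 given to the Inez row is hypothetical, because the slide does not show that row's details.

```python
def insert_into_sorted_file(old_file, new_record):
    """Return a new file with new_record placed in LastName order."""
    new_key = new_record[1]
    # Find the insert location: first record whose key sorts after the new key.
    split = 0
    while split < len(old_file) and old_file[split][1] <= new_key:
        split += 1
    top, bottom = old_file[:split], old_file[split:]

    new_file = []
    new_file.extend(top)          # 1. copy the top part
    new_file.append(new_record)   # 2. insert the new row
    new_file.extend(bottom)       # 3. copy the bottom part
    return new_file               # the caller then discards the old file


old_file = [(8, "Carpenter"), (6, "Eaton"), (7, "Farris"), (2, "Gibson"),
            (5, "James"), (9, "O'Connor"), (3, "Reasoner"), (1, "Reeves"),
            (10, "Shields")]

new_file = insert_into_sorted_file(old_file, (11, "Inez"))  # ID 11 is hypothetical
print([name for _, name in new_file])
```

Every record is copied once, which is why a single insert into a sorted sequential file costs about as much as rewriting the whole file.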

Indexed Sequential Access Method (ISAM)
An ISAM index is a small file with two columns, key and pointer. It points to the records (table), which may be sorted or unsorted. An index can be put on multiple attributes.

Records (table) with their record addresses:
Address  ID  LastName   FirstName  DateHired
A11      1   Reeves     Keith      1/29/2001
A22      2   Gibson     Bill       3/31/2001
A32      3   Reasoner   Cathy      2/17/2001
A42      4   Hopkins    Alan       2/8/2001
A47      5   James      Leisha     1/6/2001
A58      6   Eaton      Anissa     8/23/2001
A63      7   Farris     Dustin     3/28/2001
A67      8   Carpenter  Carlos     12/29/2001
A78      9   O'Connor   Jessica    7/23/2001
A83      10  Shields    Howard     7/13/2001

Index on ID:
ID  Pointer
1   A11
2   A22
3   A32
4   A42
5   A47
6   A58
7   A63
8   A67
9   A78
10  A83

Index on LastName:
LastName   Pointer
Carpenter  A67
Eaton      A58
Farris     A63
Gibson     A22
Hopkins    A42
James      A47
O'Connor   A78
Reasoner   A32
Reeves     A11
Shields    A83

An index is sorted; the search is run on the index, and the pointer (address location) is used to fetch the record sought. Re-indexing (index refreshing) is needed when a new record is inserted.
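A minimal sketch of the same idea in code, assuming the table is held as a dictionary from record address to row. The names storage, build_index and lookup are illustrative; a real DBMS keeps both the records and the index in disk files.

```python
import bisect

# Records keyed by their storage address, as on the slide.
storage = {
    "A11": (1, "Reeves", "Keith", "1/29/2001"),
    "A22": (2, "Gibson", "Bill", "3/31/2001"),
    "A32": (3, "Reasoner", "Cathy", "2/17/2001"),
    "A42": (4, "Hopkins", "Alan", "2/8/2001"),
    "A47": (5, "James", "Leisha", "1/6/2001"),
    "A58": (6, "Eaton", "Anissa", "8/23/2001"),
    "A63": (7, "Farris", "Dustin", "3/28/2001"),
    "A67": (8, "Carpenter", "Carlos", "12/29/2001"),
    "A78": (9, "O'Connor", "Jessica", "7/23/2001"),
    "A83": (10, "Shields", "Howard", "7/13/2001"),
}


def build_index(storage, column):
    """Build a sorted list of (key, pointer) pairs on one column."""
    return sorted((record[column], address) for address, record in storage.items())


def lookup(index, storage, key):
    """Binary-search the index, then follow the pointer to the record."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return storage[index[i][1]]
    return None


id_index = build_index(storage, 0)         # index on ID
last_name_index = build_index(storage, 1)  # index on LastName
print(last_name_index[0], lookup(last_name_index, storage, "Gibson"))
```

Re-indexing after an insert corresponds to calling build_index again (or inserting the new pair in sorted position).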

Search of ISAM Index
Binary search method on the index on LastName. Task: find Jones.
1) Split & test the middle value, Goetz vs. Jones. Result: Jones comes after Goetz (Jones > Goetz), so look down and discard the upper half.
2) Split & test: Jones < Kalida, so look up.
3) Split & test: Jones > Inez, so look down.
4) Split & test: Jones = Jones, so match!
Index keys (N = 14): Adams, Brown, Cadiz, Dorfmann, Eaton, Farris, Goetz (test 1), Hanson, Inez (test 3), Jones (test 4), Kalida (test 2), Lomax, Miranda, Norman
Match in the 4th lookup.
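The four tests above can be traced with a small binary search sketch. The function name and the choice of the lower middle element on each split are assumptions of this sketch; with that choice it tests the same keys as the slide.

```python
def binary_search(keys, target):
    """Return (position, tested_keys); position is None if target is absent."""
    low, high = 0, len(keys) - 1
    tested = []
    while low <= high:
        mid = (low + high) // 2       # split: take the (lower) middle element
        tested.append(keys[mid])
        if keys[mid] == target:
            return mid, tested        # match
        if keys[mid] < target:
            low = mid + 1             # target sorts after mid: look down
        else:
            high = mid - 1            # target sorts before mid: look up
    return None, tested


index_keys = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
              "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]

position, tested = binary_search(index_keys, "Jones")
print(position, tested)
```

The keys tested are Goetz, Kalida, Inez and Jones, in that order, so the match comes on the 4th lookup.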

Binary Search (Cont.)
4 lookups in total. Manual calculation of lookups: 1) 14/2 = 7; 2) 7/2 = 3.5, round to 4; 3) 4/2 = 2; 4) 2/2 = 1.
Or by the formula log2 N: log2 14 = …, round up to 4.
Check: 2 x 2 = 4; 4 x 2 = 8; 8 x 2 = 16, that is, approximately N.
The number of lookups is the exponent to which 2 should be raised to get a value of about N.
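The formula can be evaluated directly. The sketch below computes the slide's estimate (log2 N rounded up) next to the exact worst case of a binary search, floor(log2 N) + 1; the two agree unless N is an exact power of two. Both function names are illustrative.

```python
import math


def slide_estimate(n):
    """The slide's rule: log2(N), rounded up."""
    return math.ceil(math.log2(n))


def worst_case_lookups(n):
    """Exact worst case for binary search over n sorted keys: floor(log2 n) + 1."""
    return n.bit_length()


for n in (10, 14, 16, 1000):
    print(n, slide_estimate(n), worst_case_lookups(n))
```

For N = 14 both give 4 lookups. For N = 16 the estimate is 4 while the worst case is 5, which is why the rule is best read as an approximation.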

Linked List Index
A linked list index consists of data entries with 3 pieces: key value, pointer to the next element, and pointer to a stored record.

Complete records, stored at random locations:
Location  Record
A67       8 Carpenter Carlos 12/29/2001
A22       2 Gibson Bill 3/31/2001
A58       6 Eaton Anissa 8/23/2001
A63       7 Farris Dustin 3/28/2001

Linked list index, sorted via pointers:
Key        Index element location  Pointer to next index item  Pointer to stored record
Carpenter  B87                     B29                         A67
Gibson     B38                     00 (indicates last record)  A22
Eaton      B29                     B71                         A58
Farris     B71                     B38                         A63
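Following the next pointers from the first element returns the keys in sorted order even though the elements sit at unrelated locations. A minimal sketch, assuming the index is a dictionary from element location to (key, next, record pointer) and that the list head is the Carpenter element at B87:

```python
END = "00"  # marker used on the slide for "no next element"

# element location -> (key, pointer to next element, pointer to stored record)
index = {
    "B87": ("Carpenter", "B29", "A67"),
    "B38": ("Gibson", END, "A22"),
    "B29": ("Eaton", "B71", "A58"),
    "B71": ("Farris", "B38", "A63"),
}


def scan_in_order(index, head):
    """Walk the next pointers from head, yielding (key, record pointer)."""
    location = head
    while location != END:
        key, next_location, record_pointer = index[location]
        yield key, record_pointer
        location = next_location


ordered = list(scan_in_order(index, "B87"))
print(ordered)
```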

Linked List: Insert Task
Task: insert the Eccles row.
Procedure:
1. Identify the place of the Eccles index element in the sorting order (Eccles comes after Eaton and before Farris).
2. Store the Eccles element at an available location (B14).
3. Move the next pointer from the Eaton element to the Eccles element: B71 (referencing the Farris element).
4. Set the next pointer in the Eaton element to the new Eccles element at location B14.

Index elements after the insert (key, element location, pointer to next, pointer to record):
Eaton   B29  B14 (was B71)  A58
Eccles  B14  B71            A97
Farris  B71  B38            A63
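The insert procedure can be sketched in a few lines. The dictionary layout and the function name insert_element are assumptions of this sketch; locations B14 and A97 are the ones used on the slide.

```python
END = "00"  # marker for "no next element"

# element location -> {"key", "next" (next element), "record" (stored record)}
index = {
    "B87": {"key": "Carpenter", "next": "B29", "record": "A67"},
    "B29": {"key": "Eaton", "next": "B71", "record": "A58"},
    "B71": {"key": "Farris", "next": "B38", "record": "A63"},
    "B38": {"key": "Gibson", "next": END, "record": "A22"},
}


def insert_element(index, head, new_location, key, record_pointer):
    """Link a new element into sorted position; return the (possibly new) head."""
    previous, current = None, head
    # Step 1: find the place in sorting order.
    while current != END and index[current]["key"] < key:
        previous, current = current, index[current]["next"]
    # Steps 2 and 3: store the new element; it takes over the old next pointer.
    index[new_location] = {"key": key, "next": current, "record": record_pointer}
    if previous is None:
        return new_location            # new element becomes the first one
    # Step 4: the predecessor now points to the new element.
    index[previous]["next"] = new_location
    return head


head = insert_element(index, "B87", "B14", "Eccles", "A97")
print(index["B29"], index["B14"])
```

Only two pointers change and no existing element moves, which is the main attraction of a linked list index for inserts.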

Tree (hierarchical) Indexes
Tree = a hierarchical structure with a root element on top, branches, nodes, and leaves.
Root = start point
Node = data entry
Leaf = bottom node with no children
Depth (n) = number of levels
Degree (m) = max. number of children per node. If m = 2, it is a binary tree; if m > 2, it is a multiway tree such as a B+-tree.
Branch labels under a node value: "<" on the left branch is the comparison symbol for locating smaller values; "<=" on the right branch is the comparison symbol for locating bigger values and an equal value.

B+-Tree
Increased retrieval power, and performs optimally on other tasks.
Typical index method in modern DBMSes.
Characteristics:
  Root, non-leaf nodes (some values of the key attribute, used for navigating through the tree), leaf nodes (all key values, pointing to records)
  Degree m >= 3
  Every non-leaf node (except the root) has between m/2 and m children
  Leaf nodes (leaves) are at the same level/depth and in sequential order

B+-Tree Example Degree = 3
Root:     315   (< 315 to the left, 315 <= to the right)
Level 2:  left node 231, 287   (< 231 | 231 <= .. < 287 | 287 <=)
          right node 458, 792  (< 458 | 458 <= .. < 792 | 792 <=)
Leaves:   156 | 231 | 287 | 315, 347 | 458, 692 | 792   (each leaf entry points to records)

Degree = 3. At minimum m/2 = 1.5, rounded up to 2 children; maximum 3 children.
Search procedure (e.g., find 692) using comparisons: less than; equal or greater than; between.
Note the sequential order at leaf level (156…792), as in an ISAM index.
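The search for 692 can be followed in a small sketch. The nested-dictionary layout below is an assumed reading of the slide's diagram (non-leaf nodes as dictionaries, leaves as plain lists of key values), not a full B+-tree implementation.

```python
import bisect

# Non-leaf node: {"keys": [...], "children": [...]}; leaf: list of key values.
tree = {
    "keys": [315],
    "children": [
        {"keys": [231, 287], "children": [[156], [231], [287]]},
        {"keys": [458, 792], "children": [[315, 347], [458, 692], [792]]},
    ],
}


def search(node, key):
    """Return (found, path); path lists (node keys, branch taken) per level."""
    path = []
    while isinstance(node, dict):
        # Keys <= the search key send us right ("<=" branch), the rest left ("<").
        branch = bisect.bisect_right(node["keys"], key)
        path.append((node["keys"], branch))
        node = node["children"][branch]
    return key in node, path


found, path = search(tree, 692)
print(found, path)
```

For 692 the root test (315 <= 692) takes the right branch, the second test (458 <= 692 < 792) takes the middle branch, and the leaf holding 458 and 692 contains the key.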

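B+-Tree: Insert Task
Insert 257. Find the location, starting from the root.
Easy with extra space: just insert 257 in the appropriate sequence.
Test 1: 257 vs. 315 at the root (257 < 315, so go left).
Test 2: 257 vs. 231 and 287 in the left node (231 <= 257 < 287, so take the middle branch).
Test 3: at the leaf level, place 257 after 231.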
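A sketch of the easy case, where the target leaf still has room. The tree layout and the leaf capacity of 2 are assumptions read off the example diagram; splitting a full leaf is deliberately left out.

```python
import bisect

# Non-leaf node: {"keys": [...], "children": [...]}; leaf: list of key values.
tree = {
    "keys": [315],
    "children": [
        {"keys": [231, 287], "children": [[156], [231], [287]]},
        {"keys": [458, 792], "children": [[315, 347], [458, 692], [792]]},
    ],
}


def find_leaf(node, key):
    """Descend from the root to the leaf where key belongs."""
    while isinstance(node, dict):
        node = node["children"][bisect.bisect_right(node["keys"], key)]
    return node


def insert_with_space(tree, key, leaf_capacity=2):
    """Insert key into its leaf if there is room (capacity 2 is an assumption)."""
    leaf = find_leaf(tree, key)
    if len(leaf) >= leaf_capacity:
        raise ValueError("leaf is full; a split and rebalance would be needed")
    bisect.insort(leaf, key)   # keep the leaf in sequential order
    return leaf


leaf = insert_with_space(tree, 257)
print(leaf)
```

Inserting 257 only touches the leaf that already holds 231. A key such as 400 would land in the full leaf holding 315 and 347, which is the case where the tree must be reorganized.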

Tree Creation
Either a B or B+ tree index is easiest to build (conceptually, manually) from the leaf level. The properties of the tree index must be respected.
1. Line up the values to be stored from min to max.
2. Pick a root value approximately in the middle of the leaf dataset (between the edge values).
3. Pick the comparison symbols; these must be used consistently across the tree. The root value must comply with the comparison symbols and the edge values in the leaf dataset (315 and 347). Example: 347 must be bigger than any value on the left side of the tree, and it must be equal to or smaller than any value on the right side of the tree.
4. Work out the node values. A node value is one of the leaf values satisfying the comparisons (<, =, >). Example: the value in the left-side node must be bigger than 156 and 231, and equal to or smaller than 287 and 315.

Example tree: root 347 (< to the left, <= to the right); left node 287 over the leaf values 156, 231, 287, 315; right node 692 over the leaf values 347, 458, 692, 792.
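Building from the leaf level can be sketched as a simple bottom-up loop. This is an illustration under stated assumptions, not the slides' procedure verbatim: leaves hold 2 values, each parent takes 2 children, and each separator is the smallest value of the child to its right (so "<" goes left and "<=" goes right).

```python
def build_tree(values, leaf_size=2, fanout=2):
    """Build a tree bottom-up; leaf_size and fanout are assumptions of this sketch."""
    values = sorted(values)                      # 1. line up the values, min to max
    level = [{"keys": values[i:i + leaf_size], "min": values[i], "children": None}
             for i in range(0, len(values), leaf_size)]
    while len(level) > 1:                        # group nodes until one root remains
        parents = []
        for i in range(0, len(level), fanout):
            children = level[i:i + fanout]
            parents.append({
                # separator = smallest value reachable through each right-hand child
                "keys": [child["min"] for child in children[1:]],
                "min": children[0]["min"],
                "children": children,
            })
        level = parents
    return level[0]


root = build_tree([156, 231, 287, 315, 347, 458, 692, 792])
left, right = root["children"]
print(root["keys"], left["keys"], right["keys"])
```

With the eight values from the slide this produces root 347, left node 287 and right node 692, the same node values as in the example.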

B+-Tree Strengths
Designed to give good performance for any type of data and usage.
Lookup speed is based on degree/depth.
Random and sequential retrieval are fast.
Insert, delete, and modify are fast.
Many changes are easy; occasionally large sections must be reorganized to balance the tree.

Direct Access / Hashed Example
Convert a PK value directly to a location address.
Prime modulus hash algorithm: choose a prime number*. Divide the key value by the prime number and use the modulus (remainder) as the storage address: 124 / 101 = 1, remainder 23. Take the left digit as the row number and the right digit as the column number.
Very fast random retrieval (e.g., a POS system retrieving a price by product number, the key).
Slower sequential access.
Collision/overflow space is needed for duplicate addresses.
Must reorganize if the hash table is full.

Example: prime = 101, key value = 124, modulus = 23, so the location in the hash table is row 2, column 3 (cell 23).
The hash table on the slide holds the key values 101, 102, 111, 112, 122, 123, 124 and 133; each cell holds pointers to records.
Overflow area: for values in collision, kept in sequential order (ex: 202).

* The prime should be such that the result of division is > 0; the modulus then increases by 1 from one key value to the next, reducing the possibility of colliding values. For example, a prime of 101 works well for a limited range of key values: a value of 202 is the first collision, and other values in the range up to 301 will collide.

Questions:
1. Where would the value 104 be stored?
2. Where would the value 200 be stored?
3. Which value can be stored in 65?
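A minimal sketch of prime modulus hashing with an overflow area, using the prime 101 from the example. The class name, the overflow list and the record pointers such as "rec-124" are hypothetical; the slide does not show how pointers are written.

```python
PRIME = 101


def cell(key, prime=PRIME):
    """Return (row, column): the two digits of key mod prime."""
    return divmod(key % prime, 10)


class HashTable:
    def __init__(self, prime=PRIME):
        self.prime = prime
        self.cells = {}      # address (remainder) -> (key, pointer to record)
        self.overflow = []   # colliding entries, kept in arrival order

    def store(self, key, pointer):
        address = key % self.prime
        if address in self.cells:
            self.overflow.append((key, pointer))   # collision: use overflow space
        else:
            self.cells[address] = (key, pointer)

    def find(self, key):
        entry = self.cells.get(key % self.prime)
        if entry is not None and entry[0] == key:
            return entry[1]                        # direct hit, one access
        for stored_key, pointer in self.overflow:  # otherwise scan the overflow
            if stored_key == key:
                return pointer
        return None


table = HashTable()
for key in (101, 102, 111, 112, 122, 123, 124, 133, 202):
    table.store(key, f"rec-{key}")   # "rec-..." pointers are hypothetical

print(cell(124), table.find(124), table.overflow)
```

Key 124 maps to row 2, column 3. Key 202 has remainder 0, the same as 101, so it ends up in the overflow area and costs an extra scan to find.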

Comparison of Index Types
Choice depends on data usage:
  How often do data change?
  What percent of the data is used at one time?
  How big are the tables?
  How many transactions are processed per second?
B+-tree is best overall.
Hashing is good for high-speed random access.
Sequential/ISAM is good if entire large tables are often used.

Storing Data Columns
Different methods of storing data within each row:
  Fixed (positional): simple, common
  Fixed with overflow (memo / highly variable text; VARCHAR data type)
Example from the slide diagram: overflow text is stored at separate addresses (A101: -Extra Large; A321: an-Premium; A532: r-Cat).

Data Clustering Grouping related data together to improve retrieval.
Data should be close to each other on one disk. Preferably within the same disk page or cylinder. Minimize disk reads and seeks. Example: cluster each invoice with the matching order.

Data Clustering Keeping data on the same drive
Keeping data close together:
  Same cylinder
  Same I/O page
  Consecutive sectors
Example layout: an Order row (Order#, Customer#, OrderDate) is stored next to its OrderItem rows (Order#, Item#, Quantity). For Order# 1124 the items are Item# 078 (Quantity 3), Item# 987 (Quantity 1) and Item# 240 (Quantity 2).

Data Partitioning Split table: Horizontally or Vertically
Use with infrequent access to some rows and with large tables: move less used rows to slower / cheaper storage.
Horizontal partition:
  High speed hard disk: active (current) customers
  Low cost optical disk: customers with no purchase in the last 3 years

Customer#  Name      Address         Phone
2234       Inouye    9978 Kahlea Dr
5532       Jones     887 Elm St
0087       Hardaway  112 West
0109       Pippen    873 Lake Shore

Data Partitioning Vertical Partition
Some columns are less used and large (long).
Store often used data on a high speed disk.
Store less used data on an optical disk.
The DBMS retrieves both automatically as needed.

Vertical partition:
High speed hard disk      Low cost optical disk
Item#  Name      QOH      Description    TechnicalSpecifications
875    Bolt      268      1/4” x 10      Hardened, meets standards ...
937    Injector  104      Fuel injector  Designed 1995, specs . . .
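The column split can be sketched with two dictionaries standing in for the fast and slow stores. The function names and the in-memory representation are assumptions of this sketch; in practice the DBMS does the split and the rejoin itself.

```python
items = [
    {"Item#": 875, "Name": "Bolt", "QOH": 268,
     "Description": "1/4” x 10",
     "TechnicalSpecifications": "Hardened, meets standards ..."},
    {"Item#": 937, "Name": "Injector", "QOH": 104,
     "Description": "Fuel injector",
     "TechnicalSpecifications": "Designed 1995, specs . . ."},
]


def partition_vertically(rows, key, fast_columns):
    """Split each row into a fast part and a slow part, both keyed by the PK."""
    fast, slow = {}, {}
    for row in rows:
        pk = row[key]
        fast[pk] = {c: row[c] for c in fast_columns}
        slow[pk] = {c: v for c, v in row.items()
                    if c != key and c not in fast_columns}
    return fast, slow


def fetch(pk, columns, fast, slow):
    """Read from the fast store only, unless a slow column is requested."""
    if all(c in fast[pk] for c in columns):
        return {c: fast[pk][c] for c in columns}
    merged = {**fast[pk], **slow[pk]}          # touch slow storage only when needed
    return {c: merged[c] for c in columns}


fast, slow = partition_vertically(items, "Item#", ["Name", "QOH"])
print(fetch(875, ["Name", "QOH"], fast, slow))
print(fetch(937, ["Name", "Description"], fast, slow))
```

Queries that need only Name and QOH never touch the slow store, which is the point of moving long, rarely used columns elsewhere.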

RAID and Disk Striping Redundant Array of Independent Drives - RAID
Instead of one massive drive, use many smaller drives.
Striping: split the table to store parts on different drives.
Drives can retrieve portions of data simultaneously (parallel processing).

Example table:
CustID  Name     Phone
115     Jones
225     Inez
333     Shigeta
938     Smith

Solutions to questions on the hash table:
1. 01 (102 / 101 = 1, remainder 1)
2. 10 (919 / 101 = 9, remainder 10)
3. 101 x 1 + 31 = 101 + 31 = 132
4. 101 x 2 + 31 = 202 + 31 = 233; 101 x 3 + 31 = 303 + 31 = 334
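The striping idea from the RAID slide can be sketched as a round-robin distribution of rows across drives. The drive count of 2 and the list-per-drive representation are assumptions; real RAID stripes fixed-size blocks rather than table rows.

```python
def stripe(rows, drive_count):
    """Distribute rows across drives in round-robin order."""
    drives = [[] for _ in range(drive_count)]
    for position, row in enumerate(rows):
        drives[position % drive_count].append(row)
    return drives


customers = [(115, "Jones"), (225, "Inez"), (333, "Shigeta"), (938, "Smith")]

drives = stripe(customers, drive_count=2)   # 2 drives is an assumption
for number, contents in enumerate(drives):
    print("drive", number, contents)
```

With two drives, each holds half of the rows, so a full scan can read both halves at the same time.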
