File Processing : Index and Hash


1 File Processing: Index and Hash
Spring 2018, Pusan National University, Ki-Joune Li

2 What is an index?
Index in a book
  Index: Keyword → Pages
  Without an index: exhaustive search, which is too expensive
Index for a file or database
  A function or mechanism
  F_Index : S_Predicate → B (block numbers on hard disk)
  e.g. find student records where student.GPA > 4.0

3 Data Retrieval Time
Data retrieval on disk: two phases
  1st phase: search with a condition (predicate), which yields a set of block numbers { Block# }
  2nd phase: data access, reading those blocks from the database on disk
Data access time depends on
  File structure
  Disk placement
  Clustering, etc.

4 Blocking Factor
Blocking factor (Bf): number of records in a block
Blocking factor and number of disk accesses
  N_D = N_record / Bf
By maximizing the blocking factor, we reduce the number of disk accesses
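The relation N_D = N_record / Bf can be checked with a few numbers. The sketch below is only an illustration: the function name `disk_accesses` is hypothetical, and rounding up is an added assumption (a partly filled last block still costs one access).

```python
def disk_accesses(n_records: int, blocking_factor: int) -> int:
    """Number of block reads needed to scan n_records records.

    Implements N_D = N_record / Bf, rounded up (assumption: a partly
    filled last block still has to be read).
    """
    if blocking_factor <= 0:
        raise ValueError("blocking factor must be positive")
    return (n_records + blocking_factor - 1) // blocking_factor


# The same file, stored with two different blocking factors.
print(disk_accesses(10_000, 10))   # 1000 block reads
print(disk_accesses(10_000, 100))  # 100 block reads
```

A tenfold larger blocking factor gives a tenfold smaller number of disk accesses for a full scan.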

5 How to Accelerate Phase 1?
Phase 1 can be accelerated by an index or by a hash
Index vs. hash
  Index: a type of data structure
    Needs additional data structures
  Hash: a type of mechanism
    May not need any additional data structure (not exactly true)

6 A Simple Idea on Index
Mapping table from keywords to block numbers
  Inverted file
Why is an inverted file better than nothing?
If the table is too large to fit in main memory
  It has to be stored on disk
  Accessing the index then costs disk accesses as well
Example table (Keyword, Block#): Juliet Romeo B26, Hamlet B22, Carmen B212
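A minimal sketch of the inverted-file idea, assuming the whole table fits in memory. The `DISK` contents and the function names are hypothetical; the keywords and block numbers only loosely follow the slide's table.

```python
# Hypothetical disk: block number -> records stored in that block
# (each record is reduced to a single keyword for brevity).
DISK = {
    "B22": ["Hamlet", "Macbeth"],
    "B26": ["Romeo", "Othello"],
    "B212": ["Carmen", "Tosca"],
}


def build_inverted_file(disk):
    """Build the mapping table: keyword -> list of block numbers."""
    index = {}
    for block_no, records in disk.items():
        for keyword in records:
            index.setdefault(keyword, []).append(block_no)
    return index


def exhaustive_search(disk, keyword):
    """Without an index: read every block and count the reads."""
    hits, blocks_read = [], 0
    for block_no, records in disk.items():
        blocks_read += 1
        if keyword in records:
            hits.append(block_no)
    return hits, blocks_read


def indexed_search(index, keyword):
    """With the inverted file: one table lookup yields the block numbers."""
    return index.get(keyword, [])


index = build_inverted_file(DISK)
print(exhaustive_search(DISK, "Carmen"))  # reads all 3 blocks
print(indexed_search(index, "Carmen"))    # goes straight to B212
```

The exhaustive search touches every block, while the indexed search only needs the blocks the table names. If the table itself lives on disk, reading it adds its own block accesses.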

7 Searching Algorithms and Index
A good way to accelerate searching
  Tree: O(log n)
Reorganize the inverted file into a tree
  Binary search tree: branching factor = 2
Tree in memory space vs. in disk space
  Memory space: number of comparisons
  Disk space: number of block accesses
Figure: binary search tree of (key, block number) nodes: 30, b27; 14, b17; 40, b26; 34, b17; 55, b26
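The following sketch stores (key, block number) pairs in a binary search tree and counts the nodes visited during a lookup. The pairs come from the slide's figure; the tree shape (30 at the root, inserted in the order shown) is an assumed reading of that figure, and the class and function names are illustrative.

```python
class BSTNode:
    def __init__(self, key, block_no):
        self.key = key
        self.block_no = block_no
        self.left = None
        self.right = None


def insert(root, key, block_no):
    """Insert a (key, block number) pair and return the subtree root."""
    if root is None:
        return BSTNode(key, block_no)
    if key < root.key:
        root.left = insert(root.left, key, block_no)
    elif key > root.key:
        root.right = insert(root.right, key, block_no)
    else:
        root.block_no = block_no  # same key: overwrite the block number
    return root


def search(root, key):
    """Return (block number or None, number of nodes visited)."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.block_no, visited
        node = node.left if key < node.key else node.right
    return None, visited


root = None
for key, block_no in [(30, "b27"), (14, "b17"), (40, "b26"), (34, "b17"), (55, "b26")]:
    root = insert(root, key, block_no)

print(search(root, 34))  # ('b17', 3): visits 30, 40, 34
print(search(root, 20))  # (None, 2): visits 30, 14
```

In memory, the visit count corresponds to comparisons. If every node sat in its own disk block, the same count would be the number of block accesses.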

8 Paged Tree: m-way search tree
How to determine m?
  One node: one disk page
  e.g. when 1 disk page is 4 Kbytes
    4 + 4m + 8(m-1) = 4096 → m = 341
  Very fat tree
Figure: m-way search tree nodes, each labeled with its number of delimiters, its delimiters, and its block numbers (entries such as 57, b27; 103, b28; 343, b14)
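The arithmetic behind m = 341 can be reproduced as follows. The byte sizes are an assumed reading of the slide's formula (a 4-byte delimiter count, 4 bytes per child block number, 8 bytes per delimiter entry); the function name and parameters are hypothetical.

```python
def max_branching_factor(page_size=4096, count_bytes=4,
                         child_ptr_bytes=4, delimiter_bytes=8):
    """Largest m with count + m * child_ptr + (m - 1) * delimiter <= page_size."""
    # Rearranged: m * (child_ptr + delimiter) <= page_size - count + delimiter
    return (page_size - count_bytes + delimiter_bytes) // (child_ptr_bytes + delimiter_bytes)


def node_bytes(m, count_bytes=4, child_ptr_bytes=4, delimiter_bytes=8):
    """Size of a node with m children and m - 1 delimiters."""
    return count_bytes + m * child_ptr_bytes + (m - 1) * delimiter_bytes


m = max_branching_factor()
print(m)                  # 341
print(node_bytes(m))      # 4088, fits in a 4096-byte page
print(node_bytes(m + 1))  # 4100, does not fit
```

Solving 12m - 4 = 4096 gives m ≈ 341.67, so 341 is the largest whole number of children that fits in one page.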

9 Problem of m-way search tree
Search performance: determined by the height
Not balanced
  Average: O(log n)
  Worst case: n / Bf → O(n)
Height: determined by insertion order
  e.g. insertion in ascending order
How to make it balanced?
  Balanced m-way search tree: B-tree
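The effect of insertion order on height can be seen in the simplest case, m = 2 (a plain binary search tree, used here as a stand-in for the general m-way tree). The helper names are hypothetical; `median_first` is just one insertion order that happens to produce a balanced tree.

```python
def bst_height(keys):
    """Insert keys in the given order into an unbalanced BST; return its height."""
    root = None  # each node is a list: [key, left, right]
    height = 0
    for key in keys:
        if root is None:
            root = [key, None, None]
            height = 1
            continue
        node, depth = root, 1
        while True:
            side = 1 if key < node[0] else 2  # 1 = left child, 2 = right child
            if node[side] is None:
                node[side] = [key, None, None]
                height = max(height, depth + 1)
                break
            node, depth = node[side], depth + 1
    return height


def median_first(keys):
    """Reorder sorted keys so that each subtree's median is inserted first."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + median_first(keys[:mid]) + median_first(keys[mid + 1:])


keys = list(range(1, 128))             # 127 keys in ascending order
print(bst_height(keys))                # 127: the tree degenerates into a list
print(bst_height(median_first(keys)))  # 7: a perfectly balanced tree
```

The same 127 keys give a height of 127 or 7 depending only on the order of insertion, which is the problem a B-tree removes.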

10 B-tree
B-tree: balanced m-way search tree
  Root node: no child node, or more than one child node
  Internal node: m/2 ~ m child nodes (block numbers)
  External node: data block numbers instead of child nodes
Balanced
  Upward split, instead of the downward split of a binary tree

11 Downward Split
Suppose m = 3
  Insert 10, 20: a single node [10 20]
  Insert 30: overflow in [10 20 30]; upward split moves 20 up, leaving root [20] with leaves [10] and [30]
  Insert 40: leaf [30] becomes [30 40]

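12 Downward Split
  Insert 50: overflow in [30 40 50]; 40 moves up, giving root [20 40] with leaves [10] [30] [50]
  Insert 60: leaf [50] becomes [50 60]
  Insert 70: overflow in [50 60 70]; 60 moves up, the root [20 40 60] overflows in turn, and 40 moves up
  Result: root [40], internal nodes [20] and [60], leaves [10] [30] [50] [70]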
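The insertion sequence on these slides (m = 3, keys 10 through 70) can be replayed with a small in-memory B-tree. This is a sketch under stated assumptions, not the lecture's implementation: nodes hold only keys (no block numbers), an overflowing node is split around its median key, and the names `Node`, `BTree`, and `levels` are hypothetical.

```python
import bisect


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list for a leaf

    @property
    def is_leaf(self):
        return not self.children


class BTree:
    def __init__(self, m=3):
        self.m = m  # maximum number of children; a node holds at most m - 1 keys
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:
            # The root itself overflowed: the tree grows upward by one level.
            median, right = split
            self.root = Node([median], [self.root, right])

    def _insert(self, node, key):
        i = bisect.bisect_left(node.keys, key)
        if node.is_leaf:
            node.keys.insert(i, key)
        else:
            split = self._insert(node.children[i], key)
            if split is not None:
                # A child overflowed: absorb its median key and new sibling.
                median, right = split
                node.keys.insert(i, median)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.m - 1:  # overflow
            return self._split(node)
        return None

    def _split(self, node):
        """Upward split: keep the left half in node, return (median, right half)."""
        mid = len(node.keys) // 2
        median = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return median, right


def levels(tree):
    """Return the keys of every node, level by level, from the root down."""
    out, frontier = [], [tree.root]
    while frontier:
        out.append([n.keys for n in frontier])
        frontier = [c for n in frontier for c in n.children]
    return out


tree = BTree(m=3)
for key in (10, 20, 30, 40, 50, 60, 70):
    tree.insert(key)
print(levels(tree))  # [[[40]], [[20], [60]], [[10], [30], [50], [70]]]
```

Although the keys arrive in ascending order, all leaves end up at the same depth, because the tree only grows when the root splits.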

13 Meaning of Downward Split
Always balanced
Not much influenced by the order of insertions
Internal nodes: m/2 ~ m child nodes (block numbers)
Figure: the example tree (keys 10 to 70) with its root node, internal nodes, and external nodes labeled

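14 Search by B-tree
Search for 45 in the tree with root 40, internal nodes 20 and 60, and leaves 10, 30, 50, 70
  45 > 40: follow the right branch to 60
  45 < 60: follow the left branch to 50
  45 is not in that leaf: Not Found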
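A lookup in the example tree can be traced with the sketch below. The tree is written out by hand as nested (keys, children) tuples, which is an illustrative representation only; the shape (root 40, then 20 and 60, then 10, 30, 50, 70) is taken from the slide's figure.

```python
import bisect

# Hand-built tree: each node is (keys, children); a leaf has no children.
TREE = ([40], [
    ([20], [([10], []), ([30], [])]),
    ([60], [([50], []), ([70], [])]),
])


def search(node, key):
    """Return (found, list of key lists of the nodes visited)."""
    visited = []
    while True:
        keys, children = node
        visited.append(keys)
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return True, visited
        if not children:          # reached a leaf without a match
            return False, visited
        node = children[i]        # descend into the i-th subtree


print(search(TREE, 45))  # (False, [[40], [60], [50]])
print(search(TREE, 30))  # (True, [[40], [20], [30]])
```

Both lookups visit three nodes, one per level, whether or not the key is present.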

15 Performance of B-tree
Number of comparisons within a node: trivial
Number of nodes to visit: depth

16 Problem of B-tree
Types of search
  Exact match search
  Range search, e.g. find students where 25 < student.GPA < 50
B-tree
  Good for exact match search
  Bad for range search
Figure: the example B-tree (keys 10 to 70)

17 B+-tree
A variant of B-tree
  Duplicate all elements at leaf nodes (external nodes)
  Linked list of leaf nodes
Performance
  Exact match search and insertion: a small performance sacrifice
  Range search: much more powerful than B-tree

18 B+-tree: Example
Figure: insertion steps for keys 10 to 60; on overflow, the key that moves up is duplicated in the leaf, and the leaves are chained in a linked list

19 Range Search with B+-tree
Find students where GPA > 3.5
Figure: the search key 35 descends the tree (keys 10 to 60) to a leaf, then the scan follows the linked list of leaves
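A range search over linked leaves might look like the sketch below. The layout is hypothetical and simplified: three leaves of two keys each and a single index level with separator keys 30 and 50, rather than the exact tree in the slide's figure. The names `Leaf`, `link`, and `range_search` are illustrative.

```python
import bisect


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # link to the next leaf in key order


def link(leaves):
    """Chain the leaves into a linked list and return them."""
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    return leaves


# Hypothetical layout: keys < 30 in leaf 0, 30 <= key < 50 in leaf 1, the rest in leaf 2.
LEAVES = link([Leaf([10, 20]), Leaf([30, 40]), Leaf([50, 60])])
SEPARATORS = [30, 50]


def range_search(low, high=float("inf")):
    """Return all keys k with low < k < high, in ascending order."""
    # Phase 1: one descent through the index finds the starting leaf.
    leaf = LEAVES[bisect.bisect_right(SEPARATORS, low)]
    # Phase 2: follow the leaf links instead of going back up the tree.
    result = []
    while leaf is not None:
        for k in leaf.keys:
            if k >= high:
                return result
            if k > low:
                result.append(k)
        leaf = leaf.next
    return result


print(range_search(35))      # [40, 50, 60]
print(range_search(25, 50))  # [30, 40]
```

Only the first leaf is located through the index; every further leaf is reached by a single link, which is why a range search costs far fewer node accesses here than in a plain B-tree.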

20 Performance of B+-tree
Determined by the depth d
  Exact match search and insertion (without split): d node (page) accesses
  Range search: node accesses as a function of nq (nq: number of records to retrieve)

