1 Indexing Methods

2 Storage Requirements of Databases
Data needs to be stored “permanently”, or persistently, for long periods of time.
Data is usually too big to fit in main memory.
Low cost of storage per unit of data, and the definition of “very large databases”.
The main cost incurred after storage is that of searching the database.
Primary and secondary (auxiliary) file organizations.

3 File Organizations
Relations are usually stored in files as logical “records” and read in terms of physical “blocks”.
File organization refers to the way records are stored in terms of blocks, and the way blocks are placed on the storage medium and interlinked.
Types of organizations:
  – Unsorted
  – Sorted
  – Hashing

4 Records
A record represents a tuple in a relation.
A file is a sequence of records.
Records can be either fixed-length or variable-length.
A record comprises a sequence of fields (columns, attributes).

5 Blocks
Blocks are the physical units of storage in storage devices (examples: sectors in hard disks, pages in virtual memory).
Blocks are of fixed length, based on physical characteristics of the storage/computing device and the operating system.
A storage device is either defragmented or fragmented, depending on whether contiguous sets of records lie in contiguous blocks.

6 Blocking Factor
The number of records stored in a block is called the “blocking factor”.
The blocking factor is constant across blocks if the record length is fixed, and variable otherwise.
If $B$ is the block size and $R$ is the record size, the blocking factor is $\mathit{bfr} = \lfloor B/R \rfloor$.
Since $R$ may not exactly divide $B$, there can be some left-over space in each block, equal to $B - (\mathit{bfr} \times R)$ bytes.
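A minimal sketch of the blocking-factor arithmetic, assuming fixed-length, unspanned records. The function name and the sample sizes (512-byte blocks, 100-byte records) are illustrative and not taken from the slides.

```python
def blocking_factor(block_size: int, record_size: int) -> tuple[int, int]:
    """Return (bfr, unused bytes per block) for fixed-length, unspanned records."""
    if record_size <= 0 or record_size > block_size:
        raise ValueError("record must be non-empty and fit in one block")
    bfr = block_size // record_size          # floor(B / R)
    unused = block_size - bfr * record_size  # B - (bfr * R)
    return bfr, unused


# Illustrative values: B = 512 bytes, R = 100 bytes.
print(blocking_factor(512, 100))  # (5, 12)
```

With these values, five records fit in each block and 12 bytes per block are left over.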

7 Spanned and Unspanned Records
When the extra space in a block is left unused, the record organization is said to be “unspanned”.
[Figure: a block holding Record 1, Record 2 and Record 3, followed by unused space.]

8 Spanned and Unspanned Records
In “spanned” record storage, records can be split so that they “span” across blocks.
[Figure: Block m holds Records 1 to 3 and the first part of Record 4; a pointer leads to Block p, which holds the remainder of Record 4.]

9 Spanned and Unspanned Records
When the record size is greater than the block size (i.e. $R > B$), the use of “spanned” record storage is compulsory.
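The difference between the two organizations can be made concrete by counting blocks. The sketch below is an illustration under simplifying assumptions: records are fixed-length, and the small pointer that links the parts of a spanned record is ignored. The function name is hypothetical.

```python
def blocks_needed(num_records: int, record_size: int, block_size: int, spanned: bool) -> int:
    """Count the blocks needed to store num_records fixed-length records."""
    if spanned:
        # Records are packed back to back across block boundaries.
        total_bytes = num_records * record_size
        return -(-total_bytes // block_size)      # integer ceiling
    bfr = block_size // record_size               # records per block
    if bfr == 0:
        raise ValueError("R > B: unspanned storage is impossible")
    return -(-num_records // bfr)                 # integer ceiling


print(blocks_needed(1000, 100, 512, spanned=False))  # 200
print(blocks_needed(1000, 100, 512, spanned=True))   # 196
```

When the record size exceeds the block size, the unspanned branch raises an error, which mirrors the statement that spanned storage is then compulsory.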

10 Indexes
Index files
  – Secondary or auxiliary files that help speed up data access in primary files
Indexes or access structures
  – Data structures (and search methods) used for fast access
Single-level index
  – The index file maps directly to the block or the address of the record
Multi-level index
  – Multiple levels of indirection among indexes

11 Definitions
Indexing field (indexing attribute): the field on which an index structure is built (searching is fast on this field).
Primary index: an index structure defined on the ordering key field (the field used to physically order records on disk in sorted file organizations).

12 Definitions
Clustering index: when the ordering field is not a key field (i.e. not unique), a clustering index is used instead of a primary index.
Secondary index: an index structure defined on a non-ordering field.

13 Primary Indexes
A primary index consists of an ordered file of fixed-length records with two fields.
The first field is of the same data type as the ordering key (primary key), and the second field is a block address.
Primary index records are represented by a pair (k(i), a(i)), where k(i) is the key in the i-th index record and a(i) is the address of the block containing the record with that key.

14 Primary Index
[Figure: an index file of (K(i), a(i)) entries, with keys such as 2003-0101, 2003-0121, 2003-0181, …, whose addresses point to blocks of a data file with columns RollNo, Name, Age, Gender, Grade.]

15 Primary Index
The number of entries in the index is equal to the number of disk blocks in the ordered data file.
The first record in each block of the file is indexed (in sparse indexes). These records are called anchor records.
A sparse index has index entries for only some of the search values.
A dense index has an index entry for every search key value (every record in the data file).
Dense indexes are not beneficial on ordered data files.

16 Primary Index
Search:
  – Easy. Perform a binary search on the index file to identify the block containing the required record.
Insertion / deletion:
  – Easy if key values in records are fixed length and statically allocated to blocks without block spanning (this results in wasted space, however).
  – Otherwise, re-computation of the index is required on insertion / deletion. Use of overflow buffers may be necessary.
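The search step on a sparse primary index can be sketched as follows. This is an in-memory illustration, not a disk implementation: the blocks are Python lists, the keys are made-up integers, and the names `data_blocks`, `index` and `lookup` are hypothetical.

```python
from bisect import bisect_right

# Data file: blocks of records sorted on the ordering key (illustrative keys).
data_blocks = [
    [101, 105, 120],
    [121, 130, 140],
    [181, 190, 200],
]

# Sparse primary index: one (anchor key, block number) entry per block.
index = [(block[0], block_no) for block_no, block in enumerate(data_blocks)]
anchor_keys = [key for key, _ in index]


def lookup(key):
    """Return (block number, position in block) for key, or None if absent."""
    pos = bisect_right(anchor_keys, key) - 1  # last anchor key <= search key
    if pos < 0:
        return None                           # smaller than every anchor key
    block_no = index[pos][1]
    block = data_blocks[block_no]             # one block access in a real system
    if key in block:
        return block_no, block.index(key)
    return None


print(lookup(130))  # (1, 1)
print(lookup(150))  # None
```

The binary search runs over the index only; exactly one data block is then read and scanned.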

17 Clustering Index
Clustering field: a non-key ordering field. That is, blocks are ordered on this field, which does not have the UNIQUE constraint.
The structure of the index file is similar to that of a primary index file, but each index entry points to the first block having the given value in its clustering field.
There is one index entry for every distinct value of the clustering field.

18 Clustering Index
[Figure: an index file of (K(I), A(I)) entries, one per distinct department number, each pointing to the first block of a data file (columns Dept No, Name, Gender, DOB, Job) that contains that value.]

19 Clustering Index
A clustering index is a sparse index, since only distinct values are indexed.
Insertion and deletion cause problems when a block can hold more than one value of the clustering field.
Alternative solution: allocate separate blocks for each value of the clustering field.

20 Clustering Index
[Figure: the same clustering index with separate blocks allocated for each Dept No value; blocks labelled “More 1 fields”, “More 2 fields” and “More 89 fields” hold further records with the same value.]
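As a rough illustration of the clustering index structure, the sketch below builds one entry per distinct clustering-field value, pointing at the first block that contains it. The blocks hold only the clustering-field values (made-up department numbers), and the function name is hypothetical.

```python
def build_clustering_index(blocks):
    """Map each distinct clustering-field value to the first block containing it."""
    first_block = {}
    for block_no, block in enumerate(blocks):
        for value in block:
            first_block.setdefault(value, block_no)  # keep only the first block seen
    return sorted(first_block.items())


# Data file ordered on a non-key field; a block may hold more than one value.
blocks = [[1, 1, 1, 2], [2, 2, 3, 3], [3, 3, 3, 5]]
print(build_clustering_index(blocks))  # [(1, 0), (2, 0), (3, 1), (5, 2)]
```

Note that values 1 and 2 share block 0, which is the situation that makes insertion and deletion awkward.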

21 Secondary Index
Used to index fields that are not ordering fields; the indexed field may be a key or a non-key field.
Many secondary indexes are possible on a single file.
One index entry for every record in the data file (dense index), containing the value of the indexed attribute and a pointer to the block / record.

22 Secondary Index on Key Field
[Figure: a dense index of (K(i), A(i)) entries sorted on RollNo (2003-0101, 2003-0102, 2003-0103, …), each pointing to a record of an unsorted data file with columns RollNo, Name, Age, Dept No, Job.]
Has as many index entries as the number of records.

23 Secondary Index on Key Field
Since key fields are unique, the number of index entries equals the number of records.
The data file need not be sorted on disk.
The index file uses fixed-length records.

24 Secondary Index on Non-key Field
When a non-key field is indexed, duplicate values have to be handled.
There are three different techniques for handling duplicates:
  – Duplicate index entries
  – Variable length records
  – Extra redirection levels

25 Duplicate Index Entries
[Figure: an index file of (K(i), A(i)) entries in which key values such as 2003-0102 and 2003-0103 appear more than once.]
Index entries are repeated for each duplicate occurrence of the non-key attribute.
Binary search becomes more complicated: the mid-point of a search may have duplicate entries on either side.
Insertion of records may need restructuring of the index table.
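One way to cope with duplicates on either side of the mid-point is to search for both ends of the run of equal keys. The sketch below uses Python's standard `bisect` module on an in-memory list; the entries and record addresses are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Index entries (key, record address), sorted by key; keys repeat.
entries = [(1, "r4"), (2, "r1"), (2, "r7"), (2, "r9"), (3, "r2"), (3, "r5")]
keys = [key for key, _ in entries]


def addresses_for(key):
    """Return the addresses of all records whose indexed field equals key."""
    lo = bisect_left(keys, key)    # first entry with this key
    hi = bisect_right(keys, key)   # one past the last entry with this key
    return [address for _, address in entries[lo:hi]]


print(addresses_for(2))  # ['r1', 'r7', 'r9']
print(addresses_for(4))  # []
```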

26 Variable Length Records
Use variable-length records for the index table in order to accommodate duplicate key entries.
For a given key K(i), there is a set of address pointers instead of a single address pointer.
Binary search becomes complicated, since address mid-points cannot be computed efficiently.
Insertion of records may need restructuring of the index table.

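27 Extra Redirection Levels
[Figure: index entries (K(I), A(I)) for LabId values 1 to 4; each A(I) points to an address block, and the addresses in that block point to the records of a data file (columns RollNo, Name, Age, LabId, Grade) that have that LabId.]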

28 Extra Indirection Levels
Most frequently used technique.
Index records are of fixed length.
A(i) in an index record points to a block of address fields.
Block overflows are handled by chaining.
Retrieval requires a sequential search within blocks.
Insertion of records is straightforward.
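The extra level of indirection can be sketched with one fixed-capacity address block per key value and overflow chaining. This is an in-memory illustration only: the capacity of two addresses per block, the class name and the function names are assumptions made for the example.

```python
ADDRS_PER_BLOCK = 2  # assumed capacity of one address block


class AddressBlock:
    """A fixed-capacity block of record addresses with an overflow chain."""

    def __init__(self):
        self.addresses = []
        self.next = None


def insert(index, key, address):
    """Add a record address under key, chaining a new block on overflow."""
    block = index.setdefault(key, AddressBlock())
    while len(block.addresses) == ADDRS_PER_BLOCK:
        if block.next is None:
            block.next = AddressBlock()
        block = block.next
    block.addresses.append(address)


def retrieve(index, key):
    """Collect all addresses for key by walking its chain of address blocks."""
    result, block = [], index.get(key)
    while block is not None:
        result.extend(block.addresses)
        block = block.next
    return result


lab_index = {}
for address in (10, 11, 12):
    insert(lab_index, 1, address)
insert(lab_index, 2, 20)
print(retrieve(lab_index, 1))  # [10, 11, 12]
```

Each index entry stays fixed-length (a key and one pointer), while the number of matching records can grow without restructuring the index itself.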

29 Multi-level Indexes
Binary search in a single-level index requires a search time of the order of $\log_2 b$ block accesses, where $b$ is the number of blocks in the index file.
If the bfr of the index file is greater than 2, the number of block accesses can be reduced even further.
Multi-level indexes are meant for such a reduction.

30 Multi-level Indexes
A multi-level index contains several levels of the index file.
Each index block at a given level connects to a maximum of $fo$ blocks at the next level, where $fo$ is called the “fan out” of the index structure.
Block accesses are reduced from $\log_2 b$ to $\log_{fo} b$ on average.

31 A Two-level Index Structure
[Figure: a second (top) level index whose entries point to Block 1 and Block 2 of the first (base) level index.]

32 Two-level Index Structure
The first (base) level is the usual primary index, maintained in a sorted file.
The second (top) level is a primary index into the first-level index file.
The process can be repeated to any number of levels.
Each level reduces the number of entries at the next level by a factor of $fo$.
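The effect of the fan out can be checked with a small calculation. The sketch below counts how many index levels are needed until the top level fits in a single block; the figures used (30,000 first-level entries, fan out 68) are illustrative and do not come from the slides.

```python
import math


def index_levels(first_level_entries: int, fan_out: int) -> int:
    """Number of index levels needed until the top level fits in one block."""
    if fan_out < 2:
        raise ValueError("fan out must be at least 2")
    levels = 1
    blocks = -(-first_level_entries // fan_out)  # blocks at the first level
    while blocks > 1:
        blocks = -(-blocks // fan_out)           # each level shrinks by a factor of fo
        levels += 1
    return levels


entries, fo = 30_000, 68
first_level_blocks = -(-entries // fo)                 # 442 blocks
print(math.ceil(math.log2(first_level_blocks)))        # binary search: 9 block accesses
print(index_levels(entries, fo))                       # multi-level: 3 levels
```

A multi-level search reads one block per level, so three block accesses replace about nine in this example.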

33 Summary: Types of Indexes

                 Ordering field      Non-ordering field
Key field        Primary index       Secondary index (key)
Non-key field    Clustering index    Secondary index (non-key)

34 Summary: Properties of Indexes

Index type            Number of (first-level) index entries                    Dense or non-dense
Primary               Number of blocks in data file                            Non-dense
Clustering            Number of distinct index field values                    Non-dense
Secondary (key)       Number of records in data file                           Dense
Secondary (non-key)   Number of records or number of distinct field values     Dense or non-dense

35 Summary
Multi-level indexes: several levels of index files.
Characteristic “fan out” property; the fan out $fo$ is preferably greater than 2.
Reduces the number of block accesses to the order of $\log_{fo} b$.

36 Dynamic Multi-level Indexes

52 Balanced and Unbalanced Index Trees
[Figure: an unbalanced index tree, with search cost $O(n)$, beside a balanced index tree, with search cost $O(\log_{fo} n)$.]

53 Insertions and Deletions
The balanced property of index trees should be maintained during insertions and deletions.
Insertions and deletions are problematic in multi-level indexes, since all index files are physically sorted files.
An approach to overcome this is to use dynamic multi-level indexes.

54 B-Trees
A tree data structure in which each node has a predetermined maximum fan-out $p$.
Terminology: root node, leaf nodes, internal nodes, parent, children.

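55 Structure of a Node
[Figure: a B-tree node with keys $K_1, K_2, \dots, K_{i-1}, K_i$, each with a data pointer. The left-most subtree holds values $X < K_1$, the subtree between $K_1$ and $K_2$ holds $K_1 < X < K_2$, and the right-most subtree holds values greater than the last key in the node.]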

56 B-Tree Constraints
For a node containing $p-1$ keys (and hence $p$ subtrees), the following condition must always hold:
  – $K_1 < K_2 < \dots < K_{p-1}$
For any data element $X$ in subtree $P_i$, it should always be the case that:
  – $K_{i-1} < X < K_i$

57 B-Tree Constraints
Each node has at most $p$ tree pointers.
Each node, except the root and leaf nodes, has at least $\lceil p/2 \rceil$ tree pointers (tree balancing constraint).
The root node has at least 2 tree pointers unless it is the only node in the tree.
All leaf nodes are at the same level. In a leaf node, all tree pointers are null.
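The per-node constraints can be expressed as a small check. This is a sketch under stated assumptions: the minimum is taken as the ceiling of p/2, a leaf is modelled as having zero non-null tree pointers, and the function name and parameters are hypothetical. It checks one node in isolation and does not verify that all leaves are at the same level.

```python
import math


def node_ok(keys, num_children, p, is_root=False, is_leaf=False):
    """Check the B-tree constraints that apply to a single node of order p."""
    if any(a >= b for a, b in zip(keys, keys[1:])):
        return False                          # keys must be strictly increasing
    if len(keys) > p - 1:
        return False                          # at most p - 1 keys
    if is_leaf:
        return num_children == 0              # all tree pointers are null
    if num_children != len(keys) + 1:
        return False                          # one more subtree than keys
    minimum = 2 if is_root else math.ceil(p / 2)
    return minimum <= num_children <= p


print(node_ok([5, 8], 3, p=3))                 # True
print(node_ok([5], 2, p=5))                    # False: an internal node needs 3 pointers
print(node_ok([5], 2, p=5, is_root=True))      # True: the root needs only 2
```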

58 B+ Trees
Most common index structure in RDBMSs.
Leaf and non-leaf nodes have different structures: data pointers are stored only at the leaf nodes.
Leaf nodes form a “dense index” containing an entry for every value of the search field and its corresponding record pointer.
Leaf nodes are linked to provide ordered access to data file records.

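59 Non-leaf Nodes in B+ Trees
[Figure: a non-leaf node with keys $K_1, K_2, \dots, K_{i-1}, K_i$ and tree pointers only (no data pointers). The left-most subtree holds values $X < K_1$, the subtree between $K_1$ and $K_2$ holds $K_1 < X < K_2$, and the right-most subtree holds values greater than the last key in the node.]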

60 Leaf Nodes in B+ Trees
[Figure: a leaf node with keys $K_1, K_2, \dots, K_{i-1}, K_i$, each paired with a data pointer, plus a pointer to the next leaf node in the tree.]

61 Properties of Leaf Nodes
Keys along the chain of leaf nodes are organized in sorted order:
  – $K_1 < K_2 < \dots < K_n$
Each leaf node has at least $\lceil p/2 \rceil$ values.
All leaf nodes are at the same level.

62 Searching in B+ Trees
Generalization of binary search.
1. Given a search key $k$, start from the root node.
2. If the current node is a leaf node: if the key is present in it then success; otherwise the key is not in the database.
3. Otherwise, search for a tree pointer $P_i$ such that $K_{i-1} < k \le K_i$, and follow it.
4. Return to step 2 to continue the search.
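The search procedure can be sketched in a few lines. The sketch assumes that data pointers live only in leaf nodes, so the search always descends to a leaf, and that a key equal to a separator is found in the subtree to its left. The `Node` class, its fields and the sample tree are hypothetical.

```python
from bisect import bisect_left


class Node:
    """A B+ tree node: non-leaf nodes have children, leaf nodes have data pointers."""

    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys
        self.children = children   # list of Node for non-leaf nodes, else None
        self.pointers = pointers   # list of data pointers for leaf nodes, else None

    @property
    def is_leaf(self):
        return self.children is None


def search(root, k):
    """Return the data pointer stored for key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        # bisect_left returns the first i with k <= keys[i],
        # i.e. the subtree for which K(i-1) < k <= K(i).
        node = node.children[bisect_left(node.keys, k)]
    i = bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]
    return None


left = Node([3, 5], pointers=["rec3", "rec5"])
right = Node([7, 8], pointers=["rec7", "rec8"])
root = Node([5], children=[left, right])
print(search(root, 5))  # rec5
print(search(root, 6))  # None
```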

63 Insertion
Initially, the tree begins with only the root node.
As and when nodes fill up, they are “split” and made children of a new node.
Keys are split uniformly across the three nodes.

64 Insertion
Let p = 2. Let the insertion sequence of keys be: 5, 8, 3, 7, 2, 9, 17, 10, …
[Figure: the tree after insertion of 5 and 8, a single node holding 5 and 8.]
Insertion of the next key, 3, causes an overflow requiring a split.

65 Insertion
[Figure: a root node holding 5, with one leaf holding 3 and 5 and another leaf holding 8.]
7 is inserted into the leaf holding 8. No overflow.

66 Insertion
[Figure: a root node holding 5, with one leaf holding 3 and 5 and another leaf holding 7 and 8.]
Insertion of 2 causes overflows that need to be cascaded to upper levels.

67 Insertion
[Figure: the tree after insertion of 2.]
Insertion of 9…

68 Insertion
[Figure: the tree after insertion of 9.]
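The leaf-level part of an insertion can be sketched as below. It assumes each leaf holds at most two keys, as in the example sequence 5, 8, 3, 7, and that the largest key of the left half is copied up to the parent as the separator. Cascading the split into upper levels is left out, and the function name is hypothetical.

```python
from bisect import insort


def insert_into_leaf(keys, new_key, capacity):
    """Insert new_key into a sorted leaf.

    Returns (left, right, separator). When no split is needed,
    right and separator are None.
    """
    keys = list(keys)
    insort(keys, new_key)
    if len(keys) <= capacity:
        return keys, None, None
    mid = (len(keys) + 1) // 2            # the left leaf keeps the extra key
    left, right = keys[:mid], keys[mid:]
    return left, right, left[-1]          # separator is copied to the parent


print(insert_into_leaf([5, 8], 3, capacity=2))  # ([3, 5], [8], 5)
print(insert_into_leaf([8], 7, capacity=2))     # ([7, 8], None, None)
```

Inserting 3 into the leaf holding 5 and 8 yields two leaves with 5 as the separator; inserting 7 afterwards fits without a split.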

69 Deletion
Deletion of keys may cause underflows, which have to be handled separately.
An underflow occurs when a node contains fewer than $\lceil p/2 \rceil$ keys.
Nodes are merged with their siblings when underflows occur.

70 Indexes on Multiple Attributes
All index structures explored so far assume simple attributes, comprising only one value.
Many applications require multi-attribute (composite) keys.

71 Ordered Index on Multiple Attributes
Considers a composite key as a tuple of simple keys $(k_1, k_2, \dots, k_n)$.
Ordered index files are maintained by ordering on each key in sequence: first on $k_1$, then on $k_2$ among entries with equal $k_1$, and so on.
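Python tuples compare component by component, which matches the ordering described for composite keys. The sketch below keeps index entries sorted on a two-attribute key and looks one up by binary search; the attribute values and record addresses are made up for illustration.

```python
from bisect import bisect_left

# Index entries: ((k1, k2), record address), kept sorted on the composite key.
entries = sorted([((4, 22), "r1"), ((5, 30), "r2"), ((4, 30), "r3"), ((5, 18), "r4")])
keys = [key for key, _ in entries]


def find(composite_key):
    """Binary search on the composite key; returns the record address or None."""
    i = bisect_left(keys, composite_key)
    if i < len(keys) and keys[i] == composite_key:
        return entries[i][1]
    return None


print(keys)            # [(4, 22), (4, 30), (5, 18), (5, 30)]
print(find((4, 30)))   # r3
```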

72 Partitioned Hashing
Given a composite key $(k_1, k_2, \dots, k_n)$, partitioned hashing returns $n$ different bucket numbers.
The hash bucket is determined by concatenating the $n$ numbers.
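A minimal sketch of the idea, assuming integer key components and a toy hash (the value modulo a power of two) for each component. The function name and the bit widths are illustrative assumptions.

```python
def partitioned_hash(composite_key, bits_per_part):
    """Concatenate one small hash per key component into a single bucket number."""
    bucket = 0
    for component, bits in zip(composite_key, bits_per_part):
        part = component % (1 << bits)     # toy hash for an integer component
        bucket = (bucket << bits) | part   # append this component's bits
    return bucket


# Two components hashed to 3 bits and 5 bits: 0b100 followed by 0b11011.
print(partitioned_hash((4, 59), (3, 5)))   # 155
```

Because each component owns a fixed group of bits, all keys that share the first component agree on the leading bits of the bucket number, which helps when searching on only one of the attributes.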

73 Grid Files
Partitions the range of key values for each key into several buckets.
Combinations of the buckets of each key form a “grid”.
A grid file stores a grid in either row-major or column-major form.

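74 Grid Files
[Figure: a grid with Roll No. buckets 1 to 5 along one axis and Grade values A to D along the other; the cells of the grid point into a bucket pool.]
Roll No. bucket ranges:
  1 → 001–025
  2 → 026–050
  3 → 051–075
  4 → 076–100
  5 → 101–125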
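Locating a grid cell for a pair of attribute values can be sketched as follows, using the roll-number ranges and grades from the slide. The row-major numbering with grades as rows, and the function name, are assumptions made for the example.

```python
from bisect import bisect_left

roll_upper = [25, 50, 75, 100, 125]   # upper bounds of Roll No. buckets 1 to 5
grades = ["A", "B", "C", "D"]


def cell(roll_no, grade):
    """Return the row-major cell number for (roll_no, grade)."""
    if not 1 <= roll_no <= roll_upper[-1]:
        raise ValueError("roll number outside the partitioned range")
    col = bisect_left(roll_upper, roll_no)   # 0-based Roll No. bucket
    row = grades.index(grade)                # 0-based Grade bucket
    return row * len(roll_upper) + col


print(cell(26, "A"))    # 1
print(cell(101, "D"))   # 19
```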

75 Summary
Multi-level indexes.
Trees: root node, leaf nodes, non-leaf (internal) nodes.
Dynamic multi-level indexes: B-trees and B+ trees.
Insertion and deletion in B+ trees.
Indexes on multiple attributes.

