Presentation is loading. Please wait.

Presentation is loading. Please wait.

YV - B+ Trees and Physical Design 326 Kεφάλαιο 8 ISAM και B-Δέντρα Φυσικός Σχεδιασμός για Βάσεις Δεδομένων.

Similar presentations


Presentation on theme: "YV - B+ Trees and Physical Design 326 Kεφάλαιο 8 ISAM και B-Δέντρα Φυσικός Σχεδιασμός για Βάσεις Δεδομένων."— Presentation transcript:

1 YV - B+ Trees and Physical Design 326 Kεφάλαιο 8 ISAM και B-Δέντρα Φυσικός Σχεδιασμός για Βάσεις Δεδομένων

2 YV - B+ Trees and Physical Design 327 Memory Management.. Online External Nearline External DBMS Application Database Access Methods Logging and Recovery Transaction Programs SET-ORIENTED TUPLE- ORIENTED Tuple Management Associative Access Record Management File Manager Buffer Manager Main Manages BLOCK-ORIENTED Archive Manager Manages Buffer Management File Management

3 YV - B+ Trees and Physical Design 328 Index Structures A (single-level) index is an auxiliary file structure that makes it more efficient to search for a record in the file The index is usually specified on one field of the file The index is often called access path on the field An index file usually occupies much less space than the actual data file (the entries are fewer and smaller) A pointer to the block where a record is stored is found with binary search on the index FORM of a record in the index file: key valuepointer to record’s block

4 YV - B+ Trees and Physical Design 329 Indexes Single-level Indexes can be distinguished in: –Primary Index: defined on a file which is ordered on the key field. It includes one entry for each file block. The entry has the key field value for the first record in the block. (also called, sparse index) –Clustering Index: defined on a file which is ordered on a non-key field. It includes one index entry for each distinct value of the field. The entry points to the first data block that contain records with that field value. –Secondary Index: defined on a file which is unordered. Contains one entry per each record in the data file (also called, dense index)

5 YV - B+ Trees and Physical Design 330 Multi-Level Indexes Index files are simply files, therefore they can also be indexed. We thus end-up with a hierarchy of index structures (first level, second level, etc.) Every level of the index is an ordered file, therefore, deletion and insertion require much work (maintain the index) A multi-level index forms a search tree and it is assumed that the top (first) level index fits in one disk block

6 YV - B+ Trees and Physical Design 331 Indexed Sequential Access Method (ISAM) ISAM is a multi-level index structure (tree) for direct access to tuples in a relation sorted on some attribute Each node of the tree is a page (block) on the disk The nodes of the tree hold pairs, sorted on key value. The internal nodes point to lower nodes in the tree (a lower level index), while the leaf nodes point to pages in the relation. A pointer points to a subtree with key values greater or equal the corresponding key value and less than the key value for the next pointer 18 55 60 78 909

7 YV - B+ Trees and Physical Design 332 ISAM Example ISAM on Salary 10 40 10 20 30 40 50 60 Gary 10 Mike 15 LIsa 20 Shirley 25 Bob 30 Robin 35 Ivan 40 Bill 45 Ron 50 Jill 55 Keith 60 Dan 65 Scott 43 OVERFLOW

8 YV - B+ Trees and Physical Design 333 Example ISAM Tree Each node can hold 2 entries 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root

9 YV - B+ Trees and Physical Design 334 After Inserting 23*, 48*, 41*, 42*... 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root 23* 48* 41* 42* Overflow Pages Leaf Index Pages Primary

10 YV - B+ Trees and Physical Design 335... Then Deleting 42*, 51*, 97* * Note that 51* appears in index levels, but not in leaf! 10*15*20*27*33*37*40* 46*55* 63* 20335163 40 Root 23* 48* 41*

11 YV - B+ Trees and Physical Design 336 ISAM Performance PERFORMANCE. Suppose that we have D data pages and k pointers in each node (assume that D is a power of k, say D = k L ) LOOKUP COST: Sequential Scan:D Binary Search on Relation:log 2 D + 1 Binary Search (single index):(log 2 (D/k) + 1) + 1 Traversing the ISAM tree:log k D + 1 = L + 1

12 YV - B+ Trees and Physical Design 337 ISAM Summary ADVANTAGES: –It provides a sorted (mostly) directory for a file (relation) –It is very good for exact queries (e.g., Salary = 40000) –ISAM facilitates the execution of range queries (e.g., Salary between 35000 and 60000) DISADVANTAGES –It is a static structure and may easily get unbalanced –If there are many updates (volatile data), then the relation does not remain “mostly” sorted –The index does consume some space which may be valuable

13 YV - B+ Trees and Physical Design 338 B + - Trees A B + -tree is a multilevel tree-structured directory to index a sorted file Leaf nodes contain tuples in sorted order, the other internal nodes have a special form Nodes correspond to a block (page) Each node is kept between half- and completely-full Insertions in nodes that are not full are very efficient; if a node is full, a split occurs Deletions are very efficient if the node does not become less than half-full The tree structure remains balanced at all times

14 YV - B+ Trees and Physical Design 339 B + - Trees -- Summary Form of internal nodes: Variations of B + -trees: –B-trees : They are like B + -trees, but the internal nodes also contain pointers to data. They can get deeper and are difficult to implement, but occasionally, they are faster. –B * -trees : They are like B + -trees, but keep each node 2/3 full (at least). Shorter trees, faster, but worse for updates P1P1 K1K1 P2P2 K2K2... K n-1 P n P n-1 K 1 < K 1 <... K n-1 P 1 points to a node containing key values n, n< K 1 P 2 points to a node containing key values n, K 1 < = n < K 1

15 YV - B+ Trees and Physical Design 340 B+ Tree: The Most Widely Used Index Insert/delete at log F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages) Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. Supports equality and range-searches efficiently. Index Entries Data Entries ("Sequence set") (Direct search)

16 YV - B+ Trees and Physical Design 341 Example B+ Tree Search begins at root, and key comparisons direct it to a leaf (as in ISAM). Search for 5*, 15*, all data entries >= 24*... * Based on the search for 15*, we know it is not in the tree! Root 1724 30 2* 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13

17 YV - B+ Trees and Physical Design 342 B+ Trees in Practice Typical order: 100. Typical fill-factor: 67%. –average fanout = 133 Typical capacities: –Height 4: 133 4 = 312,900,700 records –Height 3: 133 3 = 2,352,637 records Can often hold top levels in buffer pool: –Level 1 = 1 page = 8 Kbytes –Level 2 = 133 pages = 1 Mbyte –Level 3 = 17,689 pages = 133 MBytes

18 YV - B+ Trees and Physical Design 343 Inserting a Data Entry into a B+ Tree Find correct leaf L. Put data entry onto L. –If L has enough space, done! –Else, must split L (into L and a new node L2) »Redistribute entries evenly, copy up middle key. »Insert index entry pointing to L2 into parent of L. This can happen recursively – To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. –Tree growth: gets wider or one level taller at top.

19 YV - B+ Trees and Physical Design 344 Inserting 8* into Example B+ Tree Observe how minimum occupancy is guaranteed in both leaf and index pg splits. Note difference between copy-up and push-up; be sure you understand the reasons for this. 2* 3*5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is continues to appear in the leaf.) s copied up and appears once in the index. Contrast 52430 17 13 Entry to be inserted in parent node. (Note that 17 is pushed up and only this with a leaf split.)

20 YV - B+ Trees and Physical Design 345 Example B+ Tree After Inserting 8* v Notice that root was split, leading to increase in height. v In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice. 2*3* Root 17 24 30 14*16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

21 YV - B+ Trees and Physical Design 346 Deleting a Data Entry from a B+ Tree Start at root, find leaf L where entry belongs. Remove the entry. –If L is at least half-full, done! –If L has only d-1 entries, »Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). »If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height.

22 YV - B+ Trees and Physical Design 347 Example Tree After (Inserting 8*, Then) Deleting 19* and 20*... Deleting 19* is easy. Deleting 20* is done with re-distribution. Notice how middle key is copied up. 2*3* Root 17 30 14*16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29*

23 YV - B+ Trees and Physical Design 348... And Then Deleting 24* Must merge. Observe ` toss’ of index entry (on right), and `pull down’ of index entry (below). 30 22*27* 29*33*34* 38* 39* 2* 3* 7* 14*16* 22* 27* 29* 33*34* 38* 39* 5*8* Root 30 135 17

24 YV - B+ Trees and Physical Design 349 Example of Non-leaf Re-distribution Tree is shown below during deletion of 24*. (What could be a possible initial tree?) In contrast to previous example, can re-distribute entry from left child of root to right child. Root 135 1720 22 30 14*16* 17*18* 20*33*34* 38* 39* 22*27*29*21* 7*5*8* 3*2*

25 YV - B+ Trees and Physical Design 350 After Re-distribution Intuitively, entries are re-distributed by `pushing through’ the splitting entry in the parent node. It suffices to re-distribute index entry with key 20; we’ve re- distributed 17 as well for illustration. 14*16* 33*34* 38* 39* 22*27*29* 17*18* 20*21* 7*5*8* 2*3* Root 135 17 30 20 22

26 YV - B+ Trees and Physical Design 351 Prefix Key Compression Important to increase fan-out. (Why?) Key values in index entries only `direct traffic’; can often compress them. –E.g., If we have adjacent index entries with search key values Dannon Yogurt, David Smith and Devarakonda Murthy, we can abbreviate David Smith to Dav. (The other keys can be compressed too...) »Is this correct? Not quite! What if there is a data entry Davey Jones? (Can only compress David Smith to Davi) »In general, while compressing, must leave each index entry greater than every key value (in any subtree) to its left. Insert/delete must be suitably modified.

27 YV - B+ Trees and Physical Design 352 Bulk Loading of a B+ Tree If we have a large collection of records, and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow. Bulk Loading can be done much more efficiently. Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page. 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Sorted pages of data entries; not yet in B+ tree Root

28 YV - B+ Trees and Physical Design 353 Summary of Bulk Loading Option 1: multiple inserts. –Slow. –Does not give sequential storage of leaves. Option 2: Bulk Loading –Has advantages for concurrency control. –Fewer I/Os during build. –Leaves will be stored sequentially (and linked, of course). –Can control “fill factor” on pages.

29 YV - B+ Trees and Physical Design 354 A Note on `Order’ Order (d) concept replaced by physical space criterion in practice (`at least half-full’). –Index pages can typically hold many more entries than leaf pages. –Variable sized records and search keys mean differnt nodes will contain different numbers of entries. –Even with fixed length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries (if we use Alternative (3)).

30 YV - B+ Trees and Physical Design 355 Summary Tree-structured indexes are ideal for range-searches, also good for equality searches. ISAM is a static structure. –Only leaf pages modified; overflow pages needed. –Overflow chains can degrade performance unless size of data set and data distribution stay constant. B+ tree is a dynamic structure. –Inserts/deletes leave tree height-balanced; log F N cost. –High fanout (F) means depth rarely more than 3 or 4. –Almost always better than maintaining a sorted file.

31 YV - B+ Trees and Physical Design 356 Summary (Contd.) –Typically, 67% occupancy on average. –Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. –If data entries are data records, splits can change rids! Key compression increases fanout, reduces height. Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.

32 YV - B+ Trees and Physical Design 357 Overview of Physical Design After ER design, schema refinement, and the definition of views, we have the conceptual and external schemas for our database. The next step is to choose indexes, make clustering decisions, and to refine the conceptual and external schemas (if necessary) to meet performance goals. We must begin by understanding the workload: –The most important queries and how often they arise. –The most important updates and how often they arise. –The desired performance for these queries and updates.

33 YV - B+ Trees and Physical Design 358 Understanding the Workload For each query in the workload: –Which relations does it access? –Which attributes are retrieved? –Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? For each update in the workload: –Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? –The type of update ( INSERT/DELETE/UPDATE ), and the attributes that are affected.

34 YV - B+ Trees and Physical Design 359 Decisions to Make What indexes should we create? –Which relations should have indexes? What field(s) should be the search key? Should we build several indexes? For each index, what kind of an index should it be? –Clustered? Hash/tree? Dynamic/static? Dense/sparse? Should we make changes to the conceptual schema? –Consider alternative normalized schemas? (Remember, there are many choices in decomposing into BCNF, etc.) –Should we ``undo’’ some decomposition steps and settle for a lower normal form? (Denormalization.) –Horizontal partitioning, replication, views...

35 YV - B+ Trees and Physical Design 360 Choice of Indexes One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it. Before creating an index, must also consider the impact on updates in the workload! –Trade-off: indexes can make queries go faster, updates slower. Require disk space, too.

36 YV - B+ Trees and Physical Design 361 Issues to Consider in Index Selection Attributes mentioned in a WHERE clause are candidates for index search keys. –Exact match condition suggests hash index. –Range query suggests tree index. »Clustering is especially useful for range queries, although it can help on equality queries as well in the presence of duplicates. Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.

37 YV - B+ Trees and Physical Design 362 Issues in Index Selection (Contd.) Multi-attribute search keys should be considered when a WHERE clause contains several conditions. –If range selections are involved, order of attributes should be carefully chosen to match the range ordering. –Such indexes can sometimes enable index-only strategies for important queries. »For index-only strategies, clustering is not important! When considering a join condition: –Hash index on inner is very good for Index Nested Loops. »Should be clustered if join column is not key for inner, and inner tuples need to be retrieved. –Clustered B+ tree on join column(s) good for Sort-Merge.

38 YV - B+ Trees and Physical Design 363 Example Hash index on D.dname supports ‘Toy’ selection. –Given this, index on D.dno is not needed. Hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple. What if WHERE included: ``... AND E.age=25’’ ? –Could retrieve Emp tuples using index on E.age, then join with Dept tuples satisfying dname selection. Comparable to strategy that used E.dno index. –So, if E.age index is already created, this query provides much less motivation for adding an E.dno index. SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname=‘Toy’ AND E.dno=D.dno

39 YV - B+ Trees and Physical Design 364 Examples of Clustering B+ tree index on E.age can be used to get qualifying tuples. –How selective is the condition? –Is the index clustered? Consider the GROUP BY query. –If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly. –Clustered E.dno index may be better! Equality queries and duplicates: –Clustering on E.hobby helps! SELECT E.dno FROM Emp E WHERE E.age>40 SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno SELECT E.dno FROM Emp E WHERE E.hobby=Stamps

40 YV - B+ Trees and Physical Design 365 Multi-Attribute Index Keys To retrieve Emp records with age=30 AND sal=4000, an index on would be better than an index on age or an index on sal. –Such indexes also called composite or concatenated indexes. –Choice of index key orthogonal to clustering etc. If condition is: 20<age<30 AND 3000<sal<5000: –Clustered tree index on or is best. If condition is: age=30 AND 3000<sal<5000: –Clustered index much better than index! Composite indexes are larger, updated more often.

41 YV - B+ Trees and Physical Design 366 Summary Database design consists of several tasks: requirements analysis, conceptual design, schema refinement, physical design and tuning. –In general, have to go back and forth between these tasks to refine a database design, and decisions in one task can influence the choices in another task. Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design. –What are the important queries and updates? What attributes/relations are involved?

42 YV - B+ Trees and Physical Design 367 Summary (Contd.) Indexes must be chosen to speed up important queries (and perhaps some updates!). –Index maintenance overhead on updates to key fields. –Choose indexes that can help many queries, if possible. –Build indexes to support index-only strategies. –Clustering is an important decision; only one index on a given relation can be clustered! –Order of fields in composite index key can be important. Static indexes may have to be periodically re-built. Statistics have to be periodically updated.

43 YV - B+ Trees and Physical Design 368 Relation Implementations (a) STORE EACH RELATION IN A FILE –For small relations, a heap may suffice –For larger relations, ISAM, B-tree, or Hashing –Allow for secondary indexes on user-specified fields Example Commands: modify R to isam on A 1 index on R is S(A 1 ) (b) STORE EACH RELATION AS IN DBTG –Store related tuples from different relations together –Use multilist structures


Download ppt "YV - B+ Trees and Physical Design 326 Kεφάλαιο 8 ISAM και B-Δέντρα Φυσικός Σχεδιασμός για Βάσεις Δεδομένων."

Similar presentations


Ads by Google