
Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems By: Sang K. Cha, Sangyong Hwang, Kihong Kim and Kunjoo Kwon Presenter: Kaloian Manassiev Hello, my name is Kaloian and I will be presenting the paper titled "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems" by Sang K. Cha, Sangyong Hwang, Kihong Kim and Kunjoo Kwon.

Presentation Plan Need for main-memory DBs Special considerations for in-memory operation Main-memory indexing structures and concurrency control OLFIT Evaluation Conclusions This slide shows my presentation plan. First, I will talk about the motivations behind placing the entire database into main memory and the benefits of doing so. Next, I will discuss some special considerations that have to be taken into account so that an in-memory database actually delivers on its promise. Then I will give a brief overview of the main-memory indexing structures and outline the common ways of achieving concurrency control; this step presents the motivation behind the algorithm proposed in the paper. Next, I will present the actual Optimistic, Latch-Free Index Traversal algorithm proposed in the paper, followed by the evaluation results. Finally, I will outline some of the conclusions the paper draws and some observations of my own.

Main Memory DBMS (MMDBMS) Slide borrowed from time4change by Sang K. Cha. Database resident in memory: read transactions simply read the in-memory data; update transactions do in-memory updates and write an update log to the log disk; occasionally, the dirty pages of the in-memory database are checkpointed to the disk-resident backup DB to shorten recovery time. With the advances of today's memory technology, it becomes more and more feasible to place the entire database into main memory, even for domains with large working sets. Memories of 4GB and 8GB are already common, with TB-sized memories around the corner. Because of this memory residency, read transactions simply read the in-memory data and leave the system. Update transactions do their updates in memory, write an update log to the log disk, and leave. Occasionally, a background thread checkpoints the dirty pages of the in-memory database to the disk-resident backup DB, in order to shorten the recovery time. So you can see that the main-memory DB still needs the disk, for storing the persistent log and the backup database. [Diagram: primary DB in memory, with logging to a log disk and checkpointing to a disk-resident backup DB.]

Slide borrowed from time4change by Sang K. Cha. Q: Is a disk database with a large buffer the same as a main-memory database? No! There is a complex mapping between disk and memory: e.g., traversing index blocks in the buffer requires bookkeeping of the mapping between disk and memory addresses. [Diagram: a large buffer caching data blocks, index blocks and log records, referenced via disk addresses.] So, why exactly an in-memory database? Couldn't we just increase the buffer size of an ordinary disk database, let it use the large memory as a cache for disk blocks, and get the same performance as a main-memory database? The answer is no. With an on-disk database we keep all the layers of indirection and mapping. As shown on the previous slide, updates can be done much more efficiently in a main-memory database than in a disk database. For search-oriented applications there is also the issue of cache behaviour, which I will discuss on the next slide: disk index block design is not optimized against hardware cache misses.

Cache behavior of commercial DBMSs (on a uniprocessor Pentium II Xeon) Slide borrowed from time4change by Sang K. Cha. Source: Anastassia Ailamaki et al., "DBMSs on a Modern Processor: Where Does Time Go?", VLDB 99. This slide shows a breakdown of the memory stall times in a typical modern DBMS: memory-related delays account for 40-80% of execution time, and data accesses on caches for 19-86% of memory stalls. Most of the delays are caused by misses in the L2 data cache. This is because of the long row length and the query's selective touching of different fields of the row, which happen not to fall in the same cache line; data is constantly swapped in and out of the L2 data cache. On a multiprocessor, the cache behavior would probably be even worse, because of coherence cache misses. To my knowledge this has not been evaluated experimentally, but the simplified analysis presented in today's paper seems to show so.

Main-memory database index structures Plain old B+-Tree: too much data stored in the nodes => low fanout, which incurs cold and capacity cache misses. With the original B+-Tree, there is too much data stored within the nodes: the keys, the child pointers, possibly latch information, and so on. Since the size of an in-memory B+-Tree node is chosen to be a few L2 cache lines (typically 64B each), this gives the tree a low fanout, in other words, too few entries within a single node. The resulting pointer chasing and the need to visit many nodes cause frequent thrashing within the L2 cache, which degrades traversal performance.
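As a back-of-the-envelope illustration (my own numbers, not the paper's), the fanout problem is easy to quantify: with 4-byte keys and 8-byte child pointers, a node sized at two 64-byte cache lines holds fewer than ten entries.

```c
/* Back-of-the-envelope fanout of a plain B+-tree node sized to two
 * 64-byte L2 cache lines. Illustrative sizes, not the paper's. */
#include <stdio.h>

int main(void) {
    const int node_size = 128;  /* two 64-byte cache lines          */
    const int key_size  = 4;    /* 4-byte integer key               */
    const int ptr_size  = 8;    /* 8-byte child pointer             */
    const int header    = 16;   /* latch word, key count, etc.      */
    /* each entry costs one key plus one child pointer */
    int fanout = (node_size - header) / (key_size + ptr_size);
    printf("fanout = %d\n", fanout);  /* prints: fanout = 9         */
    return 0;
}
```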

Main-memory database index structures (2) T-Tree: a small amount of data stored in the nodes, but traversal mainly touches the two end keys in a node => poor L2 cache utilisation. The T-Tree is a data structure specifically designed for increased concurrency in the in-memory case. It was proposed in 1986, after the observation that RB-Trees and other balanced binary trees severely limit concurrency, and was later used in the Dali in-memory database system. Unfortunately, the T-Tree also turns out to be a poor user of the L2 cache. A T-Tree is basically a balanced binary tree with more than one entry per node: the leftmost entry is the smallest key value in the node, and the rightmost one is the largest. It thus has increased fanout, which intuitively should imply better cache utilisation. But, looking closer, as Rao and Ross did in their 1999 paper "Cache Conscious Indexing for Decision-Support in Main Memory," T-Tree operations in most cases touch only the two ends of a node, which hold the lowest and the highest key value in the node, and thus again cause turnarounds in the L2 cache. Since the number of comparisons stays the same, T-Trees are no better than an ordinary balanced binary tree in terms of cache-consciousness.
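A minimal sketch of a T-Tree node may make that access pattern concrete (field names and sizes are my own, illustrative choices):

```c
/* Sketch of a T-Tree node: a balanced binary tree node that holds
 * several sorted entries. A search compares the probe key only
 * against the two boundary keys -- keys[0] and keys[nkeys-1] -- to
 * decide whether to descend left, descend right, or search inside,
 * so the middle of the node's cache lines is rarely touched. */
typedef struct t_node {
    struct t_node *left, *right;
    int            nkeys;
    int            keys[8];   /* sorted; ends = min and max of node */
} t_node;
```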

Main-memory database index structures (3) CSB+-Tree: keeps only one child pointer per node and stores child nodes with a common parent in a contiguous group. Increased fanout, cache-conscious; reduces the cache miss rate and improves search performance. The CSB+-Tree (which stands for Cache-Sensitive B+-Tree) is an improved variant of the B+-Tree which keeps only one child pointer per node and places all children of a common parent into one contiguous group. This almost doubles the overall fanout of the tree, which significantly improves L2 cache utilisation at the cost of a slightly increased update time. However, as pointed out in the paper, the CSB+-Tree does not consider concurrent operations!
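A minimal sketch of the idea (field names and sizes are my own, not taken from the CSB+-Tree paper): since all children of a node sit contiguously in one node group, a single base pointer plus arithmetic replaces the per-entry child pointers.

```c
/* Sketch of a CSB+-tree internal node. All children live in one
 * contiguous node group, so a single base pointer replaces the
 * per-key child pointers, and the reclaimed space roughly doubles
 * the number of keys per cache-line-sized node. */
typedef struct csb_node {
    struct csb_node *first_child;  /* base of contiguous child group */
    int              nkeys;        /* number of keys in use          */
    int              keys[14];     /* space freed by dropped pointers*/
} csb_node;

/* Child i is located by arithmetic instead of a stored pointer. */
static inline csb_node *csb_child(const csb_node *n, int i) {
    return n->first_child + i;
}
```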

Concurrency control Lock coupling In this and the next few slides, I will discuss the concurrency control mechanisms used in main-memory databases. One of the most common ways to provide concurrency control for main-memory indexes is lock coupling, which avoids dangling pointers: the index traversal goes from a node to its child by holding a latch on the node while requesting a latch on the child. The problem with this scheme shows in the shared-memory multiprocessor case, where latching every visited node incurs a very high number of cache-line invalidations, as the animation shows.
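In code, lock coupling looks roughly like this (a simplified sketch with my own node layout, using a plain mutex as the latch); the point to notice is that every visited node is written to when its latch is taken, so its cache line is invalidated on all other processors:

```c
#include <pthread.h>

/* Sketch of latch-coupled descent. Each node carries its own latch;
 * acquiring it writes to the node's cache line, which invalidates
 * that line in every other processor's cache -- the coherence cost
 * that OLFIT sets out to remove. */
typedef struct lc_node {
    pthread_mutex_t latch;
    int             is_leaf;
    int             nkeys;
    int             keys[14];
    struct lc_node *children[15];
} lc_node;

static lc_node *find_child(lc_node *n, int key) {
    int i = 0;
    while (i < n->nkeys && key >= n->keys[i]) i++;  /* linear scan */
    return n->children[i];
}

/* Descend while holding parent and child latches together
 * ("coupling"), so the child cannot be split or freed under us. */
lc_node *traverse(lc_node *root, int key) {
    lc_node *cur = root;
    pthread_mutex_lock(&cur->latch);
    while (!cur->is_leaf) {
        lc_node *child = find_child(cur, key);
        pthread_mutex_lock(&child->latch);  /* latch child first ...  */
        pthread_mutex_unlock(&cur->latch);  /* ... then drop parent   */
        cur = child;
    }
    return cur;  /* returned with its latch still held */
}
```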

Concurrency control (2) Blink-Tree Removes the need for lock coupling by linking each node to its right neighbour. The Blink-Tree links each node to its right neighbour, which removes the need to hold the parent's latch while latching the child; this saves one latch operation per level. The traversal is immune to a concurrent split of the child node by another processor, because it may simply follow the link to reach keys that have moved right.

Concurrency control (3) Tree-Level Locking Tree-level locking is the simplest locking scheme: it locks the entire tree for the duration of an update operation, so that no concurrent readers or updaters may interfere with it.

Concurrency control (4) Physical Versioning Uses copy-on-write so that updaters do not interfere with concurrent readers. Severely limits performance when the update load is high. Needs a garbage collection mechanism to release dead versions. The physical versioning scheme was also introduced for the Dali in-memory database system. It uses a copy-on-write algorithm, which copies a tree node whenever an attempt is made to write it. This severely limits performance in the presence of a high update workload, and it also requires a garbage collection mechanism to release the dead versions.

OLFIT Probability of update with 100% insert workload (10 million keys) Now we've reached the core of the paper: the OLFIT algorithm, which stands for Optimistic, Latch-Free Index Traversal. The idea of OLFIT is to eliminate the cache coherence misses incurred by the latching schemes. For this purpose, it uses an optimistic traversal algorithm which performs only reads during traversal (without acquiring latches) and resorts to versioning and latching only when an update has to be performed. This chart shows the theoretical and experimental probability that a node is updated during an insert, under a 100% insert workload, as a function of its level in the tree. The tree has a node fanout of 15 (the number of keys in each node) and holds 10M keys in total, which results in a tree of height 7. The chart makes it clear that schemes which latch during index traversal are far too conservative.
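The shape of the theoretical curve can be reconstructed as follows (my own reading of the model, so treat it as a sketch): a node with fanout f splits roughly once per f/2 keys it absorbs, so an insert propagates an update one level up with probability about 2/f.

```latex
% Reconstruction of the update-probability model (my own sketch):
% a node with fanout f splits about once per f/2 inserts it absorbs,
% so a split propagates one level up with probability roughly 2/f.
P(\text{leaf updated}) = 1,
\qquad
P(\text{node $i$ levels above the leaves updated})
  \approx \left(\frac{2}{f}\right)^{i}
% With f = 15: one level up ~ 13\%, two levels up ~ 1.8\%, and so on,
% matching the steeply falling curve on the slide.
```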

OLFIT (1) Node structure CCINFO For the purposes of their latch-free algorithm, the authors augment the B-Tree node with latch and version fields, which are combined into one word, ccinfo, so that atomic instructions can operate on both at once. The latch field is acquired (set to one) when the node is about to be updated. The version field is incremented after the node actually gets updated. The <high key> and <link ptr> fields are used to ensure consistency while the node is being split; I will explain them later in the presentation.
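A minimal C sketch of such a node (field names and widths are my guesses, not the paper's exact layout): the key point is that the latch bit and the version counter share one word, ccinfo, so a single atomic read or compare-and-swap covers both.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of an OLFIT B+-tree node (illustrative layout). The latch
 * bit and the version counter are packed into one word, ccinfo, so
 * that one atomic instruction can observe or update both together. */
#define LATCH_BIT ((uint64_t)1 << 63)  /* set while an update runs   */

typedef struct olfit_node {
    _Atomic uint64_t   ccinfo;    /* [latch bit | version counter]   */
    int                level;     /* 0 = leaf                        */
    int                nkeys;     /* keys currently stored           */
    int                high_key;  /* upper bound of keys in node     */
    struct olfit_node *link;      /* right neighbour, same level     */
    int                keys[14];
    void              *ptrs[15];  /* children or records             */
} olfit_node;
```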

OLFIT (2) Node read This slide shows the node read algorithm, which proceeds as follows. Starting from the root, the traverse procedure first reads the current node's ccinfo field and stores it into a local variable. Then it reads the actual node content. Finally, it checks that what was read was not modified in the meantime, i.e. that the latch is free and the version is unchanged; if a modification is detected, the read is aborted and retried.
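Following that description, here is a hedged sketch of the read protocol, reusing the hypothetical node layout from the previous slide:

```c
/* Sketch of OLFIT's latch-free node read (uses olfit_node/LATCH_BIT
 * from the node sketch above): snapshot ccinfo, copy the node, then
 * verify that no writer was active in between. A concurrent update
 * either holds the latch bit or has bumped the version, so the check
 * fails and the read retries. No write occurs, so no cache-line
 * invalidation is sent to other processors. */
void read_node(olfit_node *n, olfit_node *out) {
    for (;;) {
        uint64_t before = atomic_load(&n->ccinfo);  /* R1: snapshot  */
        *out = *n;                                  /* R2: copy node */
        uint64_t after  = atomic_load(&n->ccinfo);  /* R3: re-check  */
        if (!(after & LATCH_BIT) && before == after)
            return;  /* consistent copy obtained */
        /* else a writer was active or finished in between; retry */
    }
}
```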

OLFIT (3) Node update This slide shows the node update algorithm, which is pretty simple. Basically, a latch is acquired on the node, so that no other reader or updater can make progress on it; they busy-wait, hopefully for only a short period of time. When the update has finished, the node is unlatched and its version is incremented.
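Again reusing the hypothetical node layout, the update side might look like this (a sketch, not the paper's exact code):

```c
/* Sketch of OLFIT's node update (uses olfit_node/LATCH_BIT from the
 * node sketch above): spin until a compare-and-swap sets the latch
 * bit, mutate the node, then clear the latch and bump the version in
 * one atomic store so retrying readers observe the change. */
void update_node_begin(olfit_node *n) {
    for (;;) {
        /* expect the latch bit clear; CAS it to set */
        uint64_t v = atomic_load(&n->ccinfo) & ~LATCH_BIT;
        if (atomic_compare_exchange_weak(&n->ccinfo, &v, v | LATCH_BIT))
            return;  /* latch acquired */
        /* busy-wait: latch was held (or CAS failed spuriously) */
    }
}

void update_node_end(olfit_node *n) {
    uint64_t v = atomic_load(&n->ccinfo);
    /* clear the latch bit and increment the version in one store */
    atomic_store(&n->ccinfo, (v & ~LATCH_BIT) + 1);
}
```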

OLFIT (4) Node split Node deletion: registers the node with a garbage collector. In order to deal with node splits, the authors borrow the idea of the B-link-Tree and include a link pointer to the next neighbour at the same level. Since, unlike the Blink-Tree, OLFIT does not latch during traversal, they had to introduce the high-key field and make the split propagate from left to right, so that it happens in the same order as a read. A reader uses the high-key value, which is an upper bound on all the keys in the node, to decide whether to follow the link pointer. To deal with node deletion, a deleted node is registered with a garbage collector, where it stays until no readers depend on its link pointer.
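The reader's side of the split protocol might look like this (a sketch built on the same hypothetical node layout): if the search key exceeds the node's high key, a concurrent split has moved part of the key range to the right sibling, and the reader follows the link instead of restarting from the root.

```c
/* Sketch of the B-link-style jump during traversal (uses olfit_node
 * from the node sketch above): a split moves the upper half of a
 * node's keys to a new right sibling, so a reader that observes
 * key > high_key simply chases the link pointer rightwards. */
olfit_node *adjust_for_splits(olfit_node *n, int key) {
    while (key > n->high_key)  /* range moved right by a split */
        n = n->link;
    return n;
}
```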

Evaluation Algorithms & parameters The experimental evaluation compares OLFIT against the different latch-based concurrency control schemes. The authors implemented the algorithms on two index variants: the ordinary B+-Tree, to which all of the described techniques apply, and the CSB+-Tree, which groups nodes into contiguous space and uses a single link pointer per node group to the neighbours, protected by a separate latch. For keys, they used 10 million uniformly distributed 4-byte integers and ran four different workloads: 100% search, 100% insert, 50% insert + 50% delete (update only), and a mixture of search, insert and delete. For the node size, they chose 128B, twice the L2 cache line size of their platform; this choice is justified by the chart of the performance of the two OLFIT variants. The table on the top left shows the fanout of the B-Tree nodes.

Evaluation (1) Search performance These charts compare the search performance of the different concurrency control schemes. From these charts it can be seen that the latching schemes hurt performance significantly. For both variants, no concurrency control performs best, because it carries no overhead. TODO: Complete the explanation

Evaluation (2) Insert & delete (pure update) performance TODO: Complete the explanation

Evaluation (3) Varying update ratio performance (ST) TODO: Complete the explanation

Evaluation (4) Varying update ratio performance (MT) TODO: Complete the explanation

Conclusions (pros) A good algorithm: updaters do not interfere with readers or with other updaters. Minimises L2 cache misses. Avoids operating-system locking calls. If used in a database, the transactional concurrency control should be layered on top of it.

Conclusions (cons) Uses busy waiting. The evaluation considers only very small key sizes, for which busy waiting is not a problem. It would be interesting, and more convincing, to see the performance of this algorithm with longer keys, as is common in databases; there, the cost of busy waiting and retries would be more pronounced.

Questions?