1 Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems
By: Sang K. Cha, Sangyong Hwang, Kihong Kim and Kunjoo Kwon
Presenter: Kaloian Manassiev
Hello, my name is Kaloian and I will be presenting the paper titled "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems" by Sang K. Cha, Sangyong Hwang, Kihong Kim and Kunjoo Kwon.

2 Presentation Plan
Need for main-memory DBs
Special considerations for in-memory operation
Main-memory indexing structures and concurrency control
OLFIT
Evaluation
Conclusions
This slide shows my presentation plan. First, I will talk about the motivation for placing the entire database in main memory and the benefits of doing so. Next, I will discuss some special considerations that must be addressed for an in-memory database to actually exploit its memory residency and justify its use. Then I will give a brief overview of the main-memory indexing structures and outline the common ways of achieving concurrency control; this step presents the motivation behind the algorithm proposed in the paper. After that, I will present the actual Optimistic, Latch-Free Index Traversal algorithm and discuss the evaluation results from the paper. Finally, I will outline the conclusions the paper draws and some observations of my own.

3 Main Memory DBMS
Slide borrowed from time4change by Sang K. Cha
Database resident in memory
Read transactions simply read the in-memory data.
Update transactions do in-memory updates and write an update log to the log disk.
Occasionally, checkpoint the dirty pages of the in-memory database to the disk-resident backup DB to shorten the recovery time.
(Diagram: MMDBMS with the primary DB in memory, checkpointing to a backup DB and logging to a log disk.)
With the advances of today's memory technology, it becomes increasingly feasible to place the entire database in main memory, even for domains with large working sets. Memories of 4GB and 8GB are already common, with TB-sized memories around the corner. Because of this memory residency, read transactions simply read the in-memory data and leave the system. Update transactions do their updates in memory, write an update log to the log disk, and leave. Occasionally, a background thread checkpoints the dirty pages of the in-memory database to the disk-resident backup DB in order to shorten the recovery time. So you can see here that the main memory DB still needs the disk for storing the persistent log and the backup database.

4 Slide borrowed from time4change by Sang K. Cha
Q: Is a disk database with a large buffer the same as a main memory database? No!
Complex mapping between disk and memory: e.g., traversing index blocks in the buffer requires bookkeeping of the mapping between disk and memory addresses.
Disk index block design is not optimized against hardware cache misses.
(Diagram: large buffer caching data blocks, index blocks and log records, addressed by disk address.)
So, why exactly an in-memory database? Couldn't we simply increase the buffer size of an ordinary disk database, let it use the large memory as a cache for disk blocks, and get the same performance as a main memory database? The answer is no. With an on-disk database we keep all the layers of indirection and mapping. As shown on the previous slide, updates can be done much more efficiently in a main memory database than in a disk database. For search-oriented applications there is also the issue of cache coherence, which I will discuss on the next slide.

5 Cache behavior of commercial DBMS (on uniprocessor Pentium II Xeon)
Slide borrowed from time4change by Sang K. Cha
Anastassia Ailamaki et al., "DBMSs on a Modern Processor: Where Does Time Go?", VLDB 1999
Memory-related delays: 40-80% of execution time.
Data accesses on caches: 19-86% of memory stalls.
Multiprocessor cache behavior? Probably worse, because of coherence cache misses.
This slide shows a breakdown of the memory stall times in a typical modern DBMS. It can be seen that most of the delays are caused by misses in the L2 data cache. This is because of the long row length and the query's selective touching of different fields of the row, which happen not to be in the same cache line; data is constantly swapped in and out of the L2 data cache. On a multiprocessor, the cache behavior would probably become even worse because of coherence cache misses. To my knowledge, this has not been evaluated experimentally, but the simplified analysis presented in today's paper seems to indicate so.

6 Main-memory database index structures
Plain old B+-Tree: too much data stored in the nodes => low fanout, which incurs cold and capacity cache misses.
With the original B+-Tree, there is too much data stored within each node: the keys, the child pointers, and possibly latch information and so on. Since the size of an in-memory B+-Tree node is chosen to be a few L2 cache lines (typically 64 bytes each), this results in a low fanout, in other words, too few entries within a single node. The traversal therefore visits many nodes and chases many pointers, causing frequent thrashing of the L2 cache, which degrades performance in the traversal case. A rough node-layout sketch follows below.
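As a rough illustration of why the fanout ends up low (my own sketch, not a structure from the paper): assuming a 128-byte node (two 64-byte cache lines), 4-byte keys and 8-byte child pointers, only about nine keys fit per node.

    #include <cstdint>

    // Hypothetical in-memory B+-Tree node sized to two 64-byte cache lines.
    // Assumed layout: 4-byte keys and 8-byte child pointers; the per-key child
    // pointer plus the header leave room for only ~9 keys, i.e. a low fanout.
    struct BPlusNode {
        uint16_t   nkeys;          // number of keys currently stored
        uint16_t   flags;          // leaf/internal flag, latch bits, ...
        uint32_t   level;          // level of this node in the tree
        int32_t    keys[9];        // 36 bytes of keys
        BPlusNode* children[10];   // 80 bytes of child pointers (nkeys + 1)
    };
    static_assert(sizeof(BPlusNode) <= 128, "node should fit in two cache lines");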

7 Main-memory database index structures (2)
T-Tree: a small amount of data is stored in the nodes, but traversal mainly touches the two end keys of a node => poor L2 cache utilisation.
The T-Tree is a data structure specifically designed for in-memory indexing. It was proposed in 1986, after the observation that balanced binary trees severely limit concurrency, and was used in the Dali in-memory database system. Unfortunately, the T-Tree also turns out to be a poor user of the L2 cache. A T-Tree is basically a balanced binary tree with more than one entry per node: the leftmost entry is the smallest key value in the node, and the rightmost one is the largest. Thus it has increased fanout, which intuitively should imply better cache utilisation. But, if you look closer, as Rao and Ross did in their 1999 paper "Cache Conscious Indexing for Decision-Support in Main Memory," T-Tree operations in most cases only touch the two ends of a node, which contain the lowest and the highest key value, and thus they again cause turnarounds in the L2 cache. Since the number of comparisons stays the same, T-Trees are no better than an ordinary binary search tree in terms of cache-consciousness.
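A hedged sketch (my own illustration, not from the paper) of a T-Tree node and its descent step; only the two boundary keys of each node are compared during traversal, so most of the cache lines fetched for the key array are never used.

    #include <cstdint>

    // Hypothetical T-Tree node: a sorted key array plus two child pointers.
    struct TTreeNode {
        int32_t    keys[31];    // sorted keys; keys[0] is the minimum,
        uint16_t   nkeys;       // keys[nkeys - 1] is the maximum
        TTreeNode* left;
        TTreeNode* right;
    };

    // Descend until a node whose [min, max] range bounds the key is found.
    const TTreeNode* ttree_descend(const TTreeNode* n, int32_t key) {
        while (n) {
            if (key < n->keys[0])                 n = n->left;    // below the minimum
            else if (key > n->keys[n->nkeys - 1]) n = n->right;   // above the maximum
            else                                  return n;       // bounded: search inside this node
        }
        return nullptr;
    }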

8 Main-memory database index structures (3)
CSB+-Tree: keeps only one child pointer per node and places child nodes with a common parent into a contiguous group.
Increased fanout, cache-conscious; reduces the cache miss rate and improves search performance.
The CSB+-Tree (which stands for Cache-Sensitive B+-Tree) is an improved variant of the B+-Tree that keeps only one child pointer per node and stores child nodes with a common parent in a contiguous group. Thus it almost doubles the overall fanout of the tree, which yields significantly better L2 cache utilisation at the cost of a slightly increased update time. However, as pointed out in the paper, the CSB+-Tree does not consider concurrent operations. A node-layout sketch follows below.
(Diagram: example CSB+-tree with keys 23, 34, 47, 58 pointing to a contiguous group of children.)
Does not consider concurrent operations!
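To make the fanout gain concrete, here is a hypothetical CSB+-Tree internal node under the same 128-byte budget as the B+-Tree sketch above; because only one child pointer is kept (the children live in one contiguous group), almost the whole node can be spent on keys. With 8-byte pointers the gain is even larger than the usual factor of two.

    #include <cstdint>

    // Hypothetical CSB+-Tree internal node, again sized to 128 bytes with
    // 4-byte keys. Only one pointer is stored: the address of the contiguous
    // group holding all children. The i-th child is simply first_child + i.
    struct CSBNode {
        uint16_t nkeys;            // number of keys in use
        uint16_t flags;
        uint32_t pad;
        CSBNode* first_child;      // start of the contiguous child node group
        int32_t  keys[28];         // 112 bytes of keys vs. ~36 in the B+-Tree sketch
    };
    static_assert(sizeof(CSBNode) <= 128, "node should fit in two cache lines");

    // Descending to the child chosen by a search over keys is just pointer math:
    //   CSBNode* child = node->first_child + slot;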

9 Concurrency control
Lock coupling
In this and the next few slides, I will discuss the concurrency control mechanisms used with main-memory indexes. One of the most common ways to provide concurrency control for main-memory indexes is lock coupling, which avoids following dangling pointers: the index traversal goes from a node to its child by holding a latch on the node while requesting a latch on the child. The problem with this scheme shows up in the shared-memory multiprocessor case, where every latch acquisition writes to the node and therefore incurs a very high number of cache-line invalidations, as the animation on the slide shows.
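A minimal sketch of lock-coupled traversal (my own illustration, assuming a simple mutex per node); note that even a pure read has to write to every node it visits in order to take the latch, which is exactly what causes the coherence traffic.

    #include <mutex>
    #include <vector>

    struct Node {
        std::mutex         latch;
        bool               leaf = false;
        std::vector<int>   keys;        // sorted separator keys
        std::vector<Node*> children;    // children.size() == keys.size() + 1
    };

    // Pick the child whose key range covers 'key'.
    static Node* find_child(Node* n, int key) {
        std::size_t i = 0;
        while (i < n->keys.size() && key >= n->keys[i]) ++i;
        return n->children[i];
    }

    // Lock-coupled descent: latch the child before releasing the parent.
    // Every visited node's latch word is written, invalidating that cache
    // line on all other processors.
    Node* traverse_with_lock_coupling(Node* root, int key) {
        Node* cur = root;
        cur->latch.lock();
        while (!cur->leaf) {
            Node* child = find_child(cur, key);
            child->latch.lock();          // latch the child before...
            cur->latch.unlock();          // ...releasing the parent
            cur = child;
        }
        return cur;                       // returned still latched; the caller unlocks
    }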

10 Concurrency control (2)
Blink-Tree
Removes the need for lock coupling by linking each node to its right neighbour.
The Blink-Tree removes the need for lock coupling by linking each node to its right neighbour, which saves one latch operation per level. It is also immune to a concurrent split of the child node by another processor, because the traversal can simply follow the link to the right neighbour.

11 Concurrency control (3)
Tree-Level Locking
Tree-level locking is the simplest locking scheme: it locks the entire tree for the duration of an update operation, so that no concurrent readers or updaters can interfere with it.

12 Concurrency control (4)
Physical Versioning
Uses copy-on-write so that updaters do not interfere with concurrent readers.
Severely limits the performance when the update load is high.
Needs a garbage collection mechanism to release the dead versions.
The physical versioning scheme was also introduced for the Dali in-memory database system. It uses a copy-on-write approach: the tree node is copied when an attempt is made to update it, the copy is modified, and the new version is then installed for subsequent readers, so readers never see a partially updated node. This scheme severely limits performance in the presence of a heavy update workload, and it needs a garbage collection mechanism to release the dead versions.
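A hedged copy-on-write sketch (my own illustration, not the Dali implementation): the updater works on a private copy and publishes it with a single atomic pointer store, so readers that still hold the old pointer keep seeing a consistent version, which later has to be garbage-collected.

    #include <atomic>
    #include <vector>

    struct VNode {
        std::vector<int> keys;          // node contents (simplified)
    };

    // Pointer through which readers reach the current version of the node.
    // Assumed to be initialised with an empty node before any update runs,
    // and updaters are assumed to be serialised by a latch elsewhere.
    std::atomic<VNode*> slot{nullptr};

    // Copy-on-write update: never modify the published node in place.
    void cow_insert(int key) {
        VNode* old_node = slot.load(std::memory_order_acquire);
        VNode* new_node = new VNode(*old_node);          // private copy
        new_node->keys.push_back(key);                   // modify the copy only
        slot.store(new_node, std::memory_order_release); // publish atomically
        // old_node cannot be freed here: readers may still be using it.
        // A garbage collector reclaims it once no reader can reach it.
    }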

13 OLFIT
Probability of update with 100% insert workload (10 million keys)
Now we have reached the core of the paper: the OLFIT algorithm, which stands for Optimistic, Latch-Free Index Traversal. The idea of OLFIT is to eliminate the cache coherence misses incurred by the latching schemes. To this end, it uses an optimistic traversal algorithm that only performs reads during traversal (without acquiring latches) and resorts to versioning and latching only when an update actually has to be performed. The chart on the slide shows the theoretical and experimental probability that a node is updated during an insert, under a 100% insert workload, as a function of its level in the tree. The tree has a node fanout of 15 (the number of keys in each node) and holds 10 million keys in total, which results in a tree of height 7. The chart makes it clear that latching schemes, which latch every node during index traversal, are far too conservative.
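As a rough back-of-the-envelope check of that chart (my own reasoning, not the paper's derivation): every insert updates a leaf, but a node one level higher is only updated when a leaf below it splits, and a node two levels higher only when that split cascades further up. Writing s for the average number of inserts a node absorbs before it splits (on the order of half the fanout), the update probability at level l, counting l = 1 at the leaves, is roughly

    P_{\mathrm{update}}(l) \approx s^{-(l-1)}

so with a fanout of 15 (s ≈ 7), already two levels above the leaves only about one insert in fifty touches the node, and latching every node on the way down is clearly overly conservative.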

14 OLFIT (1) Node structure
CCINFO: latch and version combined in one word
For the purposes of their latch-free algorithm, the authors augment the index node with a latch and a version field, which are combined into one word, ccinfo, so that atomic instructions can operate on both at once. The latch is acquired (set to one) when the node is about to be updated; the version is incremented after the node has actually been updated. The high-key and link-pointer fields are used to ensure consistency when the node is being split; I will explain them later in the presentation. A sketch of such a node is shown below.
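A minimal sketch of what such a node might look like, under my own assumptions about field sizes and layout (the slide only fixes that the latch and the version share one word; here the low bit is the latch and the remaining bits are the version, and the array sizes are illustrative).

    #include <atomic>
    #include <cstdint>

    // Hypothetical OLFIT-style node. Latch and version share one word (ccinfo)
    // so both can be read or updated with a single atomic instruction.
    struct OlfitNode {
        std::atomic<uint64_t> ccinfo;    // bit 0: latch, bits 1..63: version counter
        int32_t               high_key;  // upper bound of all keys stored in this node
        OlfitNode*            link;      // right neighbour at the same level
        uint16_t              nkeys;
        bool                  leaf;
        int32_t               keys[15];       // keys (sizes illustrative)
        OlfitNode*            children[16];   // child pointers for internal nodes
    };

    constexpr uint64_t LATCH_BIT = 1;                            // low bit of ccinfo
    inline uint64_t    version_of(uint64_t cc) { return cc >> 1; }
    inline bool        latched(uint64_t cc)    { return cc & LATCH_BIT; }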

15 OLFIT (2) Node read
This slide shows the node read operation. Traversal starts from the root. For each node, the read procedure first reads the node's ccinfo word and stores it in a local variable, then reads the actual node content, and finally checks that what was read in the previous step was not modified in the meantime, i.e. that the latch was free and that ccinfo has not changed. If the check fails, the read of this node is aborted and retried.
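A hedged sketch of the optimistic node read, following the slide's description and reusing the hypothetical OlfitNode above (the snapshot callback and the memory fence are my own choices; a production version would also need to make the content reads themselves race-free).

    #include <atomic>

    // Optimistic, latch-free node read: the snapshot is accepted only if the
    // node was not latched and its version did not change while the content
    // was being copied.
    template <typename SnapshotFn>
    void read_node(const OlfitNode* node, SnapshotFn copy_out) {
        for (;;) {
            uint64_t before = node->ccinfo.load(std::memory_order_acquire);
            copy_out(*node);                                      // copy the node content locally
            std::atomic_thread_fence(std::memory_order_acquire);  // content reads complete before...
            uint64_t after = node->ccinfo.load(std::memory_order_relaxed);  // ...the re-read of ccinfo
            if (!latched(before) && before == after)
                return;                                           // consistent snapshot obtained
            // a concurrent update interfered with this node: retry the read
        }
    }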

16 OLFIT (3) Node update
This slide shows the node update algorithm. It is quite simple: a latch is acquired on the node, so that no other updater can proceed and no concurrent read of the node will succeed; they busy-wait and retry, hopefully only for a short period of time. When the update has finished, the version is incremented and the node is unlatched.
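A hedged sketch of the update path under the same assumptions (latch in the low bit of ccinfo, version in the remaining bits); the latch is taken with a compare-and-swap, and releasing it and bumping the version happen in one store.

    // Acquire the node latch by atomically setting the latch bit of ccinfo.
    void latch_node(OlfitNode* node) {
        for (;;) {
            uint64_t cc = node->ccinfo.load(std::memory_order_relaxed);
            if (!latched(cc) &&
                node->ccinfo.compare_exchange_weak(cc, cc | LATCH_BIT,
                                                   std::memory_order_acquire))
                return;
            // latch held by another updater: busy-wait and retry
        }
    }

    // Release the latch and increment the version in a single atomic store,
    // so readers that overlapped the update see a changed ccinfo and retry.
    void unlatch_node(OlfitNode* node) {
        uint64_t cc = node->ccinfo.load(std::memory_order_relaxed);
        node->ccinfo.store((version_of(cc) + 1) << 1, std::memory_order_release);
    }

    // Usage sketch for a node update:
    //   latch_node(n);   ...modify n's keys/children in place...   unlatch_node(n);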

17 OLFIT (4) Node split, node deletion
Node deletion registers the node with a garbage collector.
In order to deal with node splits, the authors borrow the idea of the Blink-Tree and include in every node a link pointer to its right neighbour at the same level. Since, unlike the Blink-Tree, OLFIT does not use latching during traversal at all, they also introduce the high-key field and make the split propagate from left to right, so that it happens in the same order in which a reader visits the nodes. The reader uses the high key, which is an upper bound on all keys in the node, to decide whether to follow the link pointer. To deal with node deletion, the deleted node is registered with a garbage collector, where it stays until there are no more readers that may depend on its link pointer. A sketch of the link-following step is shown below.
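A hedged sketch of how a reader could compensate for a concurrent split using the high key and the link pointer, operating on the locally copied snapshot obtained by read_node above (again my own illustration based on the hypothetical OlfitNode).

    // If the search key is above this node's high key, the node has been split
    // and the sought entries have moved right: follow the link pointer instead
    // of descending. Otherwise descend to the child chosen by the key.
    OlfitNode* next_node(const OlfitNode& snapshot, int32_t key) {
        if (key > snapshot.high_key && snapshot.link != nullptr)
            return snapshot.link;               // catch up with a concurrent split
        int slot = 0;                           // linear search for the child slot
        while (slot < snapshot.nkeys && key >= snapshot.keys[slot]) ++slot;
        return snapshot.children[slot];
    }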

18 Evaluation Algorithms & parameters
The experimental evaluation compares OLFIT against the latch-based concurrency control schemes described earlier. The authors implemented all the concurrency control algorithms on two index variants:
Ordinary B+-Tree, to which all the described techniques apply;
CSB+-Tree, which groups nodes into a contiguous space and uses a single link pointer per node group to the right neighbour, with a separate latch for the link pointer.
For keys, they used 10 million uniformly distributed 4-byte integers and ran four different workloads: 100% search, 100% insert, 50% insert + 50% delete (update only), and a mixture of search, insert and delete. For the node size they chose 128 bytes, which is twice the L2 cache line size of their platform; this choice is justified by the chart showing the performance of the two OLFIT variants as a function of node size. The table at the top left shows the resulting fanout of the tree nodes.

19 Evaluation (1) Search performance
These charts show the search performance comparison for the different concurrency control schemes. From these charts it can be seen that the latching schemes hurt performance significantly. In both variants, "no concurrency control" performs best, because it has no overhead.
TODO: Complete the explanation

20 Evaluation (2) Insert & delete (pure update) performance
TODO: Complete the explanation

21 Evaluation (2) Varying update ratio performance (ST)
TODO: Complete the explanation

22 Evaluation (3) Varying update ratio performance (MT)
TODO: Complete the explanation

23 Conclusions (pros)
Good algorithm; does not interfere with readers or other updaters
Minimises L2 cache misses
Avoids operating-system locking calls
If used in a database, the transactional concurrency control should be put on top of it

24 Conclusions (cons)
Uses busy waiting
The evaluation only considers very small key sizes, so busy waiting is not a problem there. It would be interesting, and more convincing, to see the performance of this algorithm with longer keys, as is commonly the case in databases; then the cost of busy waiting and retries would be more pronounced.

25 Questions?

