Buffering Database Operations for Enhanced Instruction Cache Performance
Jingren Zhou, Kenneth A. Ross
SIGMOD International Conference on Management of Data, 2004
Presented by Nihan Özman

Outline
Introduction
Related Work
Memory Hierarchy
Pipelined Query Execution
A New Buffer Operator
Buffering Strategies
Instruction Footprint Analysis
Overall Algorithm
Experimental Validation
Conclusion and Future Work

Introduction
Recent database research has demonstrated that most memory stalls are due to data cache misses on the second-level cache and instruction cache misses on the first-level instruction cache.
With random access memory (RAM) getting cheaper and new 64-bit CPUs joining the PC family, it becomes affordable and feasible to build computers with large main memories, which means that more and more query processing work can be done in main memory.
Advances in CPU speed have far outpaced advances in memory latency, so main memory access is becoming a significant cost component of database operations.

Introduction
Relatively little research has been done on improving instruction cache performance.
This paper focuses on improving the instruction cache performance of conventional demand-pull database query engines, achieving fast query execution throughput, and doing so without substantial modifications to existing database implementations.

Related Work
Many techniques have been proposed to improve instruction cache performance at the compiler level.
For database systems, these techniques do not solve the fundamental problem: large instruction footprints during query execution, leading to a large number of instruction cache misses.
The challenge is to reduce the effective size of the query execution footprints without sacrificing performance.
This issue should be resolved at the "query compiler" level, during query optimization.

Related Work
A block-oriented processing technique for aggregation, expression evaluation and sorting is proposed by Padmanabhan, Malkemus, Agarwal and Jhingran.
Each operation is performed on a block of records using a vector-style processing strategy to achieve better instruction pipelining and to minimize instruction counts and function calls.
The impact on instruction cache performance is not studied.
The technique requires a complete redesign of database operations (all operators return blocks of tuples).
The technique does not consider the operator footprint size when deciding whether to process data in blocks.

Related Work
Zhou and Ross use a similar buffering strategy to batch accesses to a tree-based index ("Buffering accesses to memory-resident index structures", in Proceedings of the VLDB Conference, 2003).
The performance benefits of that buffering are due to improved data cache hit rates; the impact of buffering on instruction cache behavior is not described.

Memory Hierarchy
Modern computer architectures have a hierarchical memory system, where access by the CPU to main memory is accelerated by several levels of cache memories.
Cache memories are based on the principles of spatial and temporal locality.
A cache hit happens when the requested data (or instructions) are found in the cache; a cache miss occurs when the CPU must load data (or instructions) from a lower-level cache or from main memory.

Memory Hierarchy
There are typically two or three cache levels.
Most CPUs now have the Level 1 and Level 2 caches integrated on die.
Instruction and data caches are usually separate in the first level and shared in the second level.
Caches are characterized by capacity, cache-line size, and associativity.

Memory Hierarchy
Latency is the time that passes between issuing a memory access and the requested data becoming available in the CPU.
In hierarchical memory systems, latency increases with distance from the CPU.
L2 cache sizes are expected to increase (2 MB for the Pentium M), but L1 cache sizes will not grow at the same rate: larger L1 caches are slower than smaller ones and may slow down the processor clock.
The aim in this paper is to minimize L1 instruction cache misses.

Memory Hierarchy
Modern processors implement a trace cache instead of a conventional L1 instruction cache.
A trace cache is an instruction cache that stores instructions either after they have been decoded or as they are retired. This allows the instruction fetch unit of a processor to fetch several basic blocks without having to worry about branches in the execution flow.
Upon a trace cache miss, the instruction address is submitted to the instruction translation lookaside buffer (ITLB), which translates the address into a physical memory address before the cache lookup is performed.
The TLB (translation lookaside buffer) is responsible for translating virtual addresses into physical addresses. It is a small cache that contains recent translations; if a translation is not in the TLB, it must be loaded from the memory hierarchy, possibly all the way from main memory.

Memory Hierarchy
Data memory latency can be hidden by correctly prefetching data into the cache.
Most modern CPUs are pipelined. The term pipeline refers to splitting a job into sub-processes in which the output of one sub-process feeds into the next (for the Pentium 4 the pipeline is 20 stages deep).
Conditional branch instructions present another significant problem for modern pipelined CPUs, because the CPU does not know in advance which of the two possible outcomes of the comparison will occur.

Memory Hierarchy
CPUs try to predict the outcome of branches and have special hardware for maintaining the branching history of many branch instructions.
For Pentium 4 processors, the trace cache integrates branch prediction into the instruction cache by storing traces of instructions that have previously been executed in sequence, including branches.

Pipelined Query Execution
Each operator supports an open-next-close iterator interface.
Open(): initializes the state of the iterator by allocating buffers for its inputs and output, and is also used to pass in arguments such as selection predicates that modify the behavior of the operator.
Next(): calls the next() function on each input node recursively and processes the input tuples until one output tuple is generated. The state of the operator is updated to keep track of how much input has been consumed.
Close(): deallocates the state information and performs final housekeeping.
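A minimal C++ sketch of such a demand-pull iterator interface; the Tuple and Operator type names and the example selection operator are illustrative assumptions, not the paper's actual engine code:

struct Tuple { /* attribute values */ };

// Generic demand-pull (Volcano-style) iterator interface.
class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;     // allocate state, receive arguments such as predicates
    virtual Tuple* next() = 0;   // produce one output tuple, or nullptr when input is exhausted
    virtual void close() = 0;    // release state, perform final housekeeping
};

// Example: a selection operator pulls tuples from its child until the predicate holds.
class Selection : public Operator {
    Operator* child;
    bool (*pred)(const Tuple&);
public:
    Selection(Operator* c, bool (*p)(const Tuple&)) : child(c), pred(p) {}
    void open() override  { child->open(); }
    void close() override { child->close(); }
    Tuple* next() override {
        while (Tuple* t = child->next())   // consume input until one output tuple is ready
            if (pred(*t)) return t;
        return nullptr;                    // input exhausted
    }
};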

Pipelined Query Execution
Pipelining requires much less space for intermediate results during execution, and this query execution model can be executed by a single process or thread.
However, demand-driven pipelined query plans exhibit poor instruction locality and bad instruction cache behavior.
For a query plan with a pipeline of operators, the instructions of every operator on the pipeline must be executed at least once to generate one output tuple, so query execution has a large instruction footprint.

Instruction Cache Thrashing
If the instruction cache is smaller than the total instruction footprint, every operator finds that some of its instructions are not in the cache.
Loading its own instructions into the instruction cache causes capacity misses that evict instructions belonging to other operators.
The evicted instructions are required by subsequent requests to those operators and have to be loaded again in the future. This happens for every output tuple and for every operator in the pipeline.
The resulting instruction cache thrashing can have a severe impact on overall performance.

Simple Query
SELECT SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       AVG(l_quantity) as avg_qty,
       COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= date ' ';
(TPC-H schema)

A New Buffer Operator
Given a query plan, a special buffer operator is added in certain places between a parent and a child operator.
During query execution, each buffer operator stores a large array of pointers to intermediate tuples generated by the child operator.
Rather than returning control to the parent operator after one tuple, the buffer operator fills its array with intermediate results by repeatedly calling the child operator.
Control returns to the parent operator once the buffer's array is filled (or when there are no more tuples).
When the parent operator requests additional tuples, they are returned from the buffer's array without executing any code from the child operator.

An Example of Buffer Operator Usage

A New Buffer Operator
To avoid the instruction cache thrashing problem, a new light-weight buffer operator implemented over the conventional iterator interface is proposed.
A buffer operator simply batches the intermediate results of the operators below it.
The tree of operators is organized into execution groups; execution groups are candidate units of buffering.
The new buffer operator supports the open-next-close interface.

Query Execution Plan

Pseudocode For Buffer Operator
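The slide presents the pseudocode as a figure. Below is a hedged C++ sketch of the buffer operator's next() logic, consistent with the description on the preceding slides and reusing the Operator and Tuple types from the iterator sketch above; the member names and end-of-input handling are assumed details, not the authors' code:

#include <cstddef>
#include <vector>

class BufferOperator : public Operator {
    Operator* child;
    std::vector<Tuple*> buf;   // pointers to intermediate tuples from the child
    std::size_t pos = 0;       // next buffered tuple to hand to the parent
    std::size_t capacity;      // buffer size, fixed when the operator is initialized
public:
    BufferOperator(Operator* c, std::size_t cap) : child(c), capacity(cap) {}
    void open() override  { child->open(); buf.reserve(capacity); }
    void close() override { child->close(); }
    Tuple* next() override {
        if (pos == buf.size()) {              // buffer drained: refill it
            buf.clear();
            pos = 0;
            while (buf.size() < capacity) {   // child's instructions stay hot in the cache
                Tuple* t = child->next();
                if (t == nullptr) break;      // child exhausted
                buf.push_back(t);
            }
            if (buf.empty()) return nullptr;  // no more tuples at all
        }
        return buf[pos++];                    // serve from the array; no child code runs
    }
};

Calls to next() above the buffer are then served from the array, so the code below the buffer executes only once per batch of (up to) capacity tuples.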

Buffer Operator
Here is an example of a buffer operator with size 8.
The operator maintains a buffer array which stores pointers to the intermediate tuples previously produced by the child operator.

Point of Buffering
The point of buffering is that it increases temporal and spatial instruction locality below the buffer operator.
When the child operator is executed, its instructions are loaded into the L1 instruction cache; instead of being used to generate one tuple, these instructions are used to generate many tuples.
As a result, the memory overhead of bringing the instructions into the cache and the CPU overhead of decoding them are smaller, amortized over many tuples.
Buffering also improves query throughput.

Buffering Strategies
Every buffer operator incurs the cost of maintaining operator state and pointers to tuples during execution.
If the cost of buffering is significant compared with the performance gain, then there is too much buffering.
If the total footprint of an execution group is larger than the L1 instruction cache, the result is a large number of instruction cache misses.
Operators with small cardinality estimates are unlikely to benefit from added buffers.
The cardinality threshold at which the benefits outweigh the costs can be determined using a calibration experiment on the target architecture.

Buffering Strategies
This threshold can be determined once, in advance, by the database system.
The calibration experiment consists of running a single query with and without buffering at various cardinalities; the cardinality at which the buffered plan begins to beat the unbuffered plan is the cardinality threshold for buffering.
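A rough sketch of such a calibration loop, assuming a hypothetical helper run_calibration_query() that executes a fixed test query at a given input cardinality and returns its elapsed time; the doubling cardinality range is an arbitrary illustrative choice:

#include <cstddef>
#include <limits>

// Hypothetical helper: runs the calibration query at `cardinality`,
// with or without a buffer operator, and returns elapsed time in seconds.
double run_calibration_query(std::size_t cardinality, bool buffered);

std::size_t calibrate_buffering_threshold() {
    for (std::size_t card = 1; card <= 10'000'000; card *= 2) {
        double unbuffered = run_calibration_query(card, false);
        double buffered   = run_calibration_query(card, true);
        if (buffered < unbuffered)
            return card;   // smallest tested cardinality where the buffered plan wins
    }
    return std::numeric_limits<std::size_t>::max();   // buffering never paid off in range
}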

Instruction Footprint Analysis
The basic strategy is to break a pipeline of operators into execution groups so that the instruction footprint of each execution group, combined with the footprint of a new buffer operator, is less than the L1 instruction cache size. This eliminates instruction cache thrashing within an execution group.
An ideal footprint estimate can only be measured by actually running the query in an environment typical of the query being posed; however, it would be too expensive to run the whole query first just to estimate footprint sizes.
No matter which data is used, almost the same set of functions is executed for each module. Thus, the core database system can be calibrated once by running a small query set that covers all kinds of operators.

Overall Algorithm
The database system is first calibrated by running a small set of simple queries covering all the operator types, and the instruction footprint of each operator is measured.
The plan refinement algorithm accepts a query plan tree from the optimizer as input and produces as output an equivalent, enhanced plan tree with buffer operators added.

Overall Algorithm
1. Consider only nonblocking operators with output cardinalities exceeding the calibration threshold.
2. Make a bottom-up pass over the query plan. Each leaf operator is initially an execution group. Try to enlarge each execution group by including parent operators or merging adjacent execution groups until any further action would lead to a combined footprint larger than the L1 instruction cache. When that happens, finish the current execution group, label the parent operator as a new execution group, and continue the bottom-up traversal.
3. Add a buffer operator above each execution group and return the new plan.
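A simplified C++ sketch of this bottom-up pass; the plan representation, the constants, and insert_buffer_above() are assumptions, and the merging of adjacent sibling groups (which the paper's algorithm also performs) is omitted for brevity:

#include <cstddef>
#include <memory>
#include <vector>

struct PlanNode {
    std::vector<std::unique_ptr<PlanNode>> children;
    std::size_t footprint_bytes;   // calibrated instruction footprint of this operator
    std::size_t est_cardinality;   // optimizer's output cardinality estimate
    bool blocking;                 // e.g. the build phase of a sort or hash join
};

constexpr std::size_t kCacheBudget     = 16 * 1024;  // assumed L1-I / trace cache size
constexpr std::size_t kBufferFootprint = 1024;       // footprint of the buffer operator itself
constexpr std::size_t kCardThreshold   = 1000;       // from the calibration experiment

void insert_buffer_above(PlanNode* group_root);      // hypothetical tree-rewrite helper

// Grows an execution group around `node` and returns its combined footprint.
std::size_t refine(PlanNode* node) {
    std::size_t group = node->footprint_bytes;
    for (auto& child : node->children) {
        std::size_t child_group = refine(child.get());
        if (group + child_group + kBufferFootprint <= kCacheBudget) {
            group += child_group;              // the child's group still fits: merge it
        } else if (!child->blocking && child->est_cardinality >= kCardThreshold) {
            insert_buffer_above(child.get());  // close the child's group behind a buffer
        }
    }
    return group;
}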

Specification of Experimental System

Experimental Validation
A large buffer pool is allocated so that all tables can be memory-resident.
Enough memory for sorting and hashing operations is also allocated.

Validating Buffer Strategies
SELECT COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= date ' ';
The results confirm that:
The overhead of buffering is small.
Buffering within a group of operators that already fits in the trace cache does not improve instruction cache performance.

Validating Buffer Strategies
SELECT SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       AVG(l_quantity) as avg_qty,
       COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= date ' ';
The results show that:
The combined footprint is 23 KB.
The number of trace cache misses is reduced by 80%.
The number of branch mispredictions is reduced by 21%.

Cardinality Effects
Query 1 is used as a query template to calibrate the system, in order to determine a cardinality threshold for buffering.
The threshold is not sensitive to the choice of operator.
The benefits of buffering become more obvious as the predicate becomes less selective.

Buffer Size
The buffering parameter is the size of the array used to buffer tuple pointers; the size is set during operator initialization.
The number of trace cache misses is roughly proportional to 1/buffer size, since the operators below the buffer reload their instructions once per buffer refill rather than once per tuple. Once the buffer is of moderate size, there is only a small incentive to make it bigger.
Buffered query performance as a function of the buffer size for Query 1

Buffer Size
Execution Time Breakdown for Varied Buffer Sizes
As the buffer size increases, the trace cache miss penalty drops and the buffered plan shows better performance.
Buffer operators incur more L2 data cache misses, but hardware prefetching hides most of this miss latency, since the data is allocated or accessed sequentially.

Complex Queries
SELECT sum(o_totalprice), count(*), avg(l_discount)
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
  AND l_shipdate <= date ' ';

Nested Loop Join
Trace cache misses reduced by 53%.
Branch mispredictions reduced by 26%.

Hash Join
Trace cache misses reduced by 70%.
Branch mispredictions reduced by 44%.

Merge Join
Trace cache misses reduced by 79%.
Branch mispredictions reduced by 30%.

General Results
Buffered plans lead to more L2 cache misses, but this overhead is much smaller than the performance gain from the trace cache and branch prediction.
Overall Improvement

General Results
Better instruction cache performance leads to lower CPI (cycles per instruction).
The original and buffered plans execute almost the same number of instructions (less than 1% difference), showing that buffer operators are light-weight.
CPI Improvement

Conclusion
Techniques for buffering the execution of query plans to exploit instruction cache spatial and temporal locality are proposed.
A new buffer operator is implemented over the existing open-next-close interface, without changing the implementation of other operators.
Buffer operators are useful for complex queries which have large instruction footprints and large output cardinalities.

Conclusion
The plan refinement algorithm traverses the query plan in a bottom-up fashion and makes buffering decisions based on instruction footprints and output cardinalities.
Instruction cache misses are reduced by up to 80%.
Query performance is improved by up to 15%.

Future Work
Integrate buffering within the plan generator of a query optimizer.
Break operators with very large footprints into several sub-operators and execute them in stages.
For complex query plans, reschedule the execution so that operators of the same type are scheduled together.

References
S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In Proceedings of the ICDE Conference, 2001.
J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In Proceedings of the VLDB Conference, 2003.
K. A. Ross, J. Cieslewicz, J. Rao, and J. Zhou. Architecture Sensitive Database Design: Examples from the Columbia Group. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2005.

References
Transaction Processing Performance Council. TPC Benchmark H. Available via the TPC web site.
A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proceedings of the VLDB Conference, 1999.

Questions?

Thank You

Pentium 4 Cache Memory System

Instruction Footprints

TPC-H Schema