Slide 1: Staged Database Systems
Thesis Oral, Stavros Harizopoulos
Carnegie Mellon Databases
Slide 2: Database world: a 30,000 ft view
- OLTP (Online Transaction Processing): many short-lived requests (e.g., Sarah: "Buy this book")
- DSS (Decision Support Systems): few long-running queries (e.g., Jeff: "Which store needs more advertising?")
- DB systems fuel most e-applications; improved performance has an impact on everyday life
Slide 3: New HW/SW requirements
- More capacity, throughput efficiency
- CPUs run much faster than they can access data (CPU-to-memory latency: about 1 cycle in the '80s, hundreds of cycles today)
- DSS workloads stress the I/O subsystem; today we need to optimize all levels of the memory hierarchy
Slide 4: The further, the slower
- Keep data close to the CPU: locality and predictability are key
- Core DBMS design contradicts these goals
- Remedies: overlap memory accesses with computation; modify algorithms and structures to exhibit more locality
Slide 5: Thread-based execution in DBMS
- Queries are handled by a pool of threads
- Threads execute independently, with no coordination and no means to exploit common operations
- StagedDB: a new design that exposes locality across threads
Slide 6: Staged Database Systems
- Organize system components into stages; queries queue up at each stage
- No need to change algorithms / structures
- Result: high concurrency plus locality across requests
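The staged organization above can be sketched in a few lines. This is a minimal illustration, not the actual prototype: the `Stage` class, its methods, and the two-stage "parse" example are all hypothetical names chosen for the sketch. The key idea it shows is that a stage batches the requests queued at it, so one pass over the stage's code and data serves many requests.

```python
class Stage:
    """One self-contained StagedDB-style stage: a queue of requests plus
    the operator code that serves them (illustrative, not the real system)."""
    def __init__(self, name, handler):
        self.name = name
        self.queue = []          # requests waiting at this stage
        self.handler = handler   # the stage's operator code

    def enqueue(self, request):
        self.queue.append(request)

    def run_batch(self):
        # Serve all queued requests back-to-back, so they reuse the same
        # instructions and hot data structures (locality across requests)
        # before control moves on to the next stage.
        batch, self.queue = self.queue, []
        return [self.handler(r) for r in batch]

# Hypothetical two-stage pipeline: parse, then execute.
parse = Stage("parse", lambda q: ("plan", q))
parse.enqueue("SELECT 1")
parse.enqueue("SELECT 2")
plans = parse.run_batch()    # one pass over parser code serves both queries
```

The design choice this illustrates: instead of one thread carrying one query through every component, each component carries many queries through itself.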
Slide 7: Thesis
"By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby improving performance."
Slide 8: Summary of main results
- STEPS: full-system evaluation on Shore, 56%-96% fewer I-misses
- QPipe: full-system evaluation on BerkeleyDB, 1.2x-2x throughput
- (figure: memory hierarchy from L1 I/D caches through L2-L3 caches, RAM, and disks)
Slide 9: Contributions and dissemination
- Introduced the StagedDB design, with scheduling algorithms for staged systems
- Built a novel query engine design: the QPipe engine maximizes data and work sharing
- Addressed the instruction cache in OLTP: STEPS applies to any DBMS with few changes
- Publications: CMU-TR '02, CIDR'03, VLDB'04, SIGMOD'05, IEEE Data Eng. '05, HDMS'05, CMU-TR '05, ICDE'06 demo (submitted), VLDB J. (submitted), TODS (submitted)
Slide 10: Outline
- Introduction
- QPipe (DSS)
- STEPS
- Conclusions
Slide 11: Query-centric design of DB engines
- Queries are evaluated independently; no means to share across queries
- Need a new design to exploit common data, instructions, and work across operators
Slide 12: QPipe: operator-centric engine
- Conventional: "one query, many operators"; QPipe: "one operator, many queries"
- Relational operators become engines
- At run time, queries break up into tasks and queue up at the engines
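The "one operator, many queries" inversion can be sketched as follows. This is a toy model, not QPipe itself: the engine names, the `submit` helper, and the packet representation are assumptions made for illustration. It shows how two queries' plans break into per-operator packets, so an engine ends up holding work from many queries at once.

```python
class OperatorEngine:
    """Hypothetical micro-engine for one relational operator.
    Each engine owns a queue; every query with this operator in its
    plan deposits one packet here."""
    def __init__(self, op_name):
        self.op_name = op_name
        self.queue = []          # packets from many different queries

    def dispatch(self, query_id, args):
        self.queue.append((query_id, args))

engines = {op: OperatorEngine(op) for op in ("scan", "join", "aggregate")}

def submit(query_id, plan):
    # Break the plan into per-operator packets and queue each packet
    # at the matching engine ("one operator, many queries").
    for op, args in plan:
        engines[op].dispatch(query_id, args)

submit("Q1", [("scan", "LINEITEM"), ("aggregate", "count")])
submit("Q2", [("scan", "LINEITEM"), ("join", "ORDERS")])

# Both queries' scan packets now sit in the same queue, where the scan
# engine can notice that they read the same table.
```

Because both scan packets name the same table, the engine is in a natural position to detect and exploit the overlap, which the conventional one-thread-per-query design never sees.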
Slide 13: QPipe design
- A packet dispatcher routes incoming query plans to per-operator engines (e.g., Engine-S for scan, Engine-J for join, Engine-A for aggregate), in contrast to the conventional design's single shared thread pool
- Each engine has its own queue and threads, reading from and writing to the storage engine
Slide 14: Reusing data & work in QPipe
- Detect overlap between in-progress queries at run time (e.g., Q1 and Q2 overlap in one operator)
- Shared pages and intermediate results are simultaneously pipelined to the parent nodes of all overlapping queries
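A tiny sketch of the simultaneous-pipelining idea, under simplifying assumptions of my own (a scan modeled as a list of pages, consumers modeled as output lists, and a latecomer hard-wired to attach after the third page): the late query first receives a copy of the pages produced so far, then rides the same pipeline as the first query for the rest of the scan.

```python
def shared_scan(pages, consumers):
    """Toy simultaneous pipelining: one pass over `pages` feeds every
    attached consumer. A consumer that attaches mid-scan first catches
    up on past output, then shares the pipeline for the remainder."""
    produced = []
    for i, page in enumerate(pages):
        if i == 2:                      # a hypothetical late arrival
            late = []
            late.extend(produced)       # catch up: copy pages so far
            consumers.append(late)      # then join the live pipeline
        produced.append(page)
        for out in consumers:           # pipeline the page to everyone
            out.append(page)
    return consumers

streams = shared_scan(["p0", "p1", "p2", "p3"], [[]])
# Both consumers end up with every page, in scan order, from one pass.
```

Note that the latecomer still observes pages in the original order, which is why this style of sharing also works for order-sensitive scans.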
Slide 15: Mechanisms for sharing
- Existing approaches: multi-query optimization, materialized views, buffer-pool management, shared scans (RedBrick, Teradata, SQL Server)
- Their limitations: they require workload knowledge, are merely opportunistic, see limited use, or are not used in practice
- QPipe complements all of the above approaches
Slide 16: Experimental setup
- QPipe prototype: built on top of BerkeleyDB, 7,000 lines of C++; shared-memory buffers, native OS threads
- Platform: 2GHz Pentium 4, 2GB RAM, 4 SCSI disks
- Benchmark: TPC-H (4GB)
Slide 17: Sharing order-sensitive scans
- TPC-H Query 4: index scans of ORDERS and LINEITEM feed a merge-join, then an aggregate
- Two clients send the query at different intervals; QPipe performs the two merge-joins separately while sharing the scans beneath them, whether order-sensitive or order-insensitive
Slide 18: Sharing order-sensitive scans (results)
- Two clients send the query at different intervals; QPipe performs the two joins separately
- (graph: total response time in seconds vs. time difference between arrivals)
Slide 19: TPC-H workload
- Clients draw from a pool of 8 TPC-H queries
- QPipe reuses large scans and runs up to 2x faster, while maintaining low response times
- (graph: throughput in queries/hr vs. number of clients)
Slide 20: QPipe: conclusions
- DB engines evaluate queries independently; existing mechanisms for sharing are limited
- QPipe requires few code changes
- Simultaneous pipelining (SP) is a simple yet powerful technique that allows dynamic sharing of data and work
- Other benefits (not described here): I-cache and D-cache performance, efficient execution of MQO plans
Slide 21: Outline
- Introduction
- QPipe
- STEPS (OLTP)
- Conclusions
Slide 22: Online Transaction Processing
- On high-end, non-I/O-bound servers, L1-I stalls are 20-40% of execution time
- Instruction caches cannot grow (figure: L1-I sizes for various CPUs, 1996-2004, stay near 10-100 KB while max on-chip L2/L3 caches grow toward 10 MB)
- Need a solution for instruction-cache residency
Slide 23: Related work
- Hardware and compiler approaches: increased block size, stream buffers [Ranganathan98]; code layout optimizations [Ramirez01]
- Database software approaches: instruction cache for DSS [Padmanabhan01][Zhou04]
- Instruction cache for OLTP: challenging!
Slide 24: STEPS for cache-resident code
- STEPS: Synchronized Transactions through Explicit Processor Scheduling
- Keep the thread model, but insert synchronization points into transaction code (Begin, Select, Update, Insert, Delete, Commit) that is still larger than the I-cache
- Multiplex execution across transactions to reuse instructions
- Microbenchmark: eliminates 96% of L1-I misses; TPC-C: eliminates 2/3 of misses, 1.4x speedup
Slide 25: I-cache aware context-switching
- Break code such as select() into pieces (s1..s7), each of which fits in the I-cache, with a context-switch (CTX) point after each piece
- Without STEPS: each thread runs select() to completion, so every thread misses on every piece (all misses)
- With STEPS: after thread 1 faults a piece into the cache, control switches to thread 2, which executes the same piece and hits
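The miss pattern on this slide can be reproduced with a toy simulation. The cache model, block counts, and thread counts below are illustrative choices, not measurements from the thesis; the point is only to show why switching threads at cache-sized code boundaries turns repeated misses into hits.

```python
def run(schedule, cache_size):
    """Count I-cache misses for a trace of code-block ids,
    using a simple LRU cache holding `cache_size` blocks."""
    cache, misses = [], 0
    for block in schedule:
        if block in cache:
            cache.remove(block)          # refresh LRU position
        else:
            misses += 1
            if len(cache) == cache_size:
                cache.pop(0)             # evict least recently used
        cache.append(block)
    return misses

code = list(range(7))   # s1..s7: the pieces of select()
threads = 4
cache_size = 3          # toy cache holds only 3 pieces at once

# Without STEPS: each thread runs all of select() before the next starts,
# so each piece is evicted before anyone can reuse it.
naive = [b for _ in range(threads) for b in code]
# With STEPS: context-switch after each piece, so all threads execute it
# while it is still resident.
steps = [b for b in code for _ in range(threads)]

print(run(naive, cache_size), run(steps, cache_size))  # 28 misses vs. 7
```

With 4 threads and 7 pieces, the naive schedule misses on all 28 accesses, while the STEPS-style schedule misses only on the 7 first touches: every additional thread runs entirely out of the cache, mirroring the slide's M/H table.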
Slide 26: Placing CTX calls in source (AutoSTEPS tool)
- valgrind collects the instruction memory references of the DBMS binary
- A STEPS simulation over the trace selects memory addresses for CTX calls
- gdb maps those addresses back to source lines (e.g., file1.c:30, file2.c:40) where CTX calls are inserted
- Evaluation: comparable performance to manual placement, while being more conservative
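The middle step of this pipeline (deciding where CTX calls go, given an instruction trace) can be sketched as below. This is my own simplified guess at the idea, not the actual AutoSTEPS algorithm: it walks the trace and emits a switch point whenever the code footprint since the last point would exceed the I-cache capacity.

```python
def place_ctx_points(trace, cache_blocks, block_size=64):
    """Toy CTX placement: given a trace of instruction addresses, emit a
    context-switch point each time the distinct-code footprint since the
    last point would overflow an I-cache of `cache_blocks` blocks."""
    ctx_points, footprint = [], set()
    for addr in trace:
        block = addr // block_size
        footprint.add(block)
        if len(footprint) > cache_blocks:
            ctx_points.append(addr)   # switch just before this address
            footprint = {block}       # new window starts here
    return ctx_points

# Straight-line code touching five 64-byte blocks, cache holds two:
points = place_ctx_points([0, 64, 128, 192, 256], cache_blocks=2)
```

In the real tool these addresses would then be handed to gdb to recover source locations for the inserted CTX calls.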
Slide 27: Experimental setup (1st part)
- Implemented on top of Shore
- AMD AthlonXP: 64KB L1-I + 64KB L1-D, 256KB L2
- Microbenchmark: index fetch on an in-memory index; fast CTX for both systems, warm cache
Slide 28: Microbenchmark: L1-I misses
- STEPS eliminates 92-96% of misses for additional threads
- (graph: L1-I cache misses, 1K-4K, vs. 1-10 concurrent threads; AthlonXP)
Slide 29: L1-I misses & speedup
- STEPS achieves maximum performance for 6-10 threads; no need for larger thread groups
- (graphs: speedup of 1.1-1.4 and miss reduction of 40-100% vs. 10-80 concurrent threads; AthlonXP)
Slide 30: Challenges in full-system operation
- So far: threads interested in the same Op, uninterrupted flow, no thread scheduler
- Full-system requirements: high concurrency on similar Ops; handling exceptions (disk I/O, locks, latches, aborts); co-existing with system threads (deadlock detection, buffer-pool housekeeping)
Slide 31: System design
- Fast CTX through fixed scheduling
- Repair thread structures at exceptions: a stray thread leaves its execution team and moves to another Op
- Modify only the thread package; each Op (X, Y, Z) gets a STEPS wrapper
Slide 32: Experimental setup (2nd part)
- AMD AthlonXP: 64KB L1-I + 64KB L1-D, 256KB L2
- TPC-C (wholesale parts supplier): 2GB RAM, 2 disks, 10-30 warehouses (1-3GB), 100-300 users
- Zero think time, in-memory, lazy commits
Slide 33: One transaction: payment
- STEPS outperforms the baseline system: 1.4x speedup, 65% fewer L1-I misses
- (graph: normalized cycles and L1-I misses vs. number of users)
Slide 34: Mix of four transactions
- The transaction mix reduces team size; still, 56% fewer L1-I misses
- (graph: normalized cycles and L1-I misses vs. number of users)
Slide 35: STEPS: conclusions
- STEPS can handle full OLTP workloads
- Significant improvements in TPC-C: 65% fewer L1-I misses, 1.2-1.4x speedup
- STEPS minimizes both capacity and conflict misses without increasing I-cache size or associativity
Slide 36: StagedDB: future work
- Promising platform for chip multiprocessors: DBMS suffer from CPU-to-CPU cache misses, and StagedDB allows work to follow data rather than the other way around
- Resource scheduling: stages cluster requests for DB locks and I/O, with potential for deeper, more effective scheduling
Slide 37: Conclusions
- New hardware brings new requirements, yet server core design has remained the same
- A new design is needed to fit modern hardware
- StagedDB optimizes all levels of the memory hierarchy and is a promising design for future installations
Slide 38: Acknowledgments
The speaker would like to thank his academic advisor Anastassia Ailamaki; his thesis committee members Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker; and his coauthors Kun Gao, Vladislav Shkapenyuk, and Ryan Williams. Thank you.
Slide 39: QPipe backup
Slide 40: An Engine in detail
- Engine components: an engine queue, a scheduling thread, a thread pool (free and busy threads), the relational operator code as the main routine, and engine parameters
- Related techniques: tuple batching and query grouping for I- and D-cache performance [Harizopoulos04 VLDB, Zhou03 VLDB, Padmanabhan01 ICDE, Zhou04 SIGMOD]; simultaneous pipelining
Slide 41: Simultaneous Pipelining in QPipe
- Without SP: Q1 and Q2 each run their own join; Q1 writes its results while Q2 independently reads and recomputes
- With SP: (1) Q2 attaches to Q1's in-progress join; (2) when the partial results so far are COMPLETE, (3) they are copied to Q2; (4) the remaining results are pipelined to both queries via an SP coordinator
Slide 42: Sharing data & work across queries
- Query 1: "Find average age of students enrolled in both class A and class B" (scans of TABLE A and TABLE B feed a merge-join, then an aggregate)
- Query 2: max over a scan of TABLE A, a data-sharing opportunity with Query 1's scan
- Query 3: min over the same merge-join of TABLE A and TABLE B, a work-sharing opportunity with Query 1
Slide 43: Sharing opportunities at run time
- Q1 executes operator R; Q2 arrives with R in its plan
- The sharing potential is the overlap between result production for R in Q1 and in Q2
- Without SP: Q1 writes R's output and Q2 separately reads and recomputes; with SP: an SP coordinator pipelines R's output to both queries
Slide 44: TPC-H workload (backup)
- Clients use a pool of 8 TPC-H queries; QPipe reuses large scans and runs up to 2x faster while maintaining low response times
- (graphs: throughput in queries/hr vs. number of clients; average response time vs. think time in seconds)
Slide 45: STEPS backup
Slide 46: Smaller L1-I cache
- STEPS outperforms Shore even on smaller caches (Pentium III)
- 62-64% fewer mispredicted branches on both CPUs
- (graph: normalized cycles, L1-I misses, L1-D misses, branches, mispredicted branches, branches missing the BTB, and instruction stall cycles, 20-120%, one bar at 209%; AthlonXP and Pentium III, 10 threads)
Slide 47: SimFlex: L1-I misses
- STEPS eliminates all capacity misses (16KB and 32KB caches)
- Up to 89% overall miss reduction (the upper limit is 90%)
- (graph: L1-I cache misses, 2K-10K, vs. associativity from direct-mapped through 2-, 4-, and 8-way to fully associative; AthlonXP, 64B cache block, 10 threads)
Slide 48: One Xaction: payment (backup)
- STEPS outperforms Shore: 1.4x speedup, 65% fewer L1-I misses, 48% fewer mispredicted branches
- (graph: normalized cycles, L1-I, L1-D, L2-I, and L2-D misses, and mispredicted branches vs. number of warehouses)
Slide 49: Mix of four Xactions (backup)
- The transaction mix reduces average team size (4.3 in 10W); still, STEPS has 56% fewer L1-I misses (out of a 77% maximum)
- (graph: normalized cycles, L1-I, L1-D, L2-I, and L2-D misses, and mispredicted branches vs. number of warehouses; two bars at 121% and 125%)