Slide 1: Staged Database Systems
Thesis Oral, Stavros Harizopoulos
Carnegie Mellon Databases
Slide 2: Database world: a 30,000 ft view
- OLTP (Online Transaction Processing): many short-lived requests (e.g., Sarah: "Buy this book")
- DSS (Decision Support Systems): few long-running queries (e.g., Jeff: "Which store needs more advertising?")
- DB systems fuel most e-applications; improved performance has an impact on everyday life
Slide 3: New HW/SW requirements
- More capacity, throughput efficiency
- CPUs run much faster than they can access data (CPU-to-memory latency: about 1 cycle in the '80s, hundreds of cycles today)
- DSS workloads stress the I/O subsystem; today we need to optimize all levels of the memory hierarchy
Slide 4: The further, the slower
- Keep data close to the CPU: locality and predictability are key
- Core DBMS design contradicts these goals
- Remedies: overlap memory accesses with computation; modify algorithms and structures to exhibit more locality
Slide 5: Thread-based execution in DBMS
- Queries are handled by a pool of threads
- Threads execute independently, with no coordination and no means to exploit common operations
- StagedDB: a new design that exposes locality across threads
Slide 6: Staged Database Systems
- Organize system components into stages; queries queue up at each stage
- No need to change algorithms / structures
- Result: high concurrency plus locality across requests
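The staged organization above can be sketched in a few lines. This is a minimal illustration, not the actual prototype: the `Stage` class, its methods, and the two-stage "parse" example are all hypothetical names chosen for the sketch. The key idea it shows is that a stage batches the requests queued at it, so one pass over the stage's code and data serves many requests.

```python
class Stage:
    """One self-contained StagedDB-style stage: a queue of requests plus
    the operator code that serves them (illustrative, not the real system)."""
    def __init__(self, name, handler):
        self.name = name
        self.queue = []          # requests waiting at this stage
        self.handler = handler   # the stage's operator code

    def enqueue(self, request):
        self.queue.append(request)

    def run_batch(self):
        # Serve all queued requests back-to-back, so they reuse the same
        # instructions and hot data structures (locality across requests)
        # before control moves on to the next stage.
        batch, self.queue = self.queue, []
        return [self.handler(r) for r in batch]

# Hypothetical two-stage pipeline: parse, then execute.
parse = Stage("parse", lambda q: ("plan", q))
parse.enqueue("SELECT 1")
parse.enqueue("SELECT 2")
plans = parse.run_batch()    # one pass over parser code serves both queries
```

The design choice this illustrates: instead of one thread carrying one query through every component, each component carries many queries through itself.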
Slide 7: Thesis
"By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby improving performance."
Slide 8: Summary of main results
- STEPS: full-system evaluation on Shore, 56%-96% fewer I-misses
- QPipe: full-system evaluation on BerkeleyDB, 1.2x-2x throughput
- (figure: memory hierarchy from L1 I/D caches through L2-L3 caches, RAM, and disks)
Slide 9: Contributions and dissemination
- Introduced the StagedDB design, with scheduling algorithms for staged systems
- Built a novel query engine design: the QPipe engine maximizes data and work sharing
- Addressed the instruction cache in OLTP: STEPS applies to any DBMS with few changes
- Publications: CMU-TR '02, CIDR'03, VLDB'04, SIGMOD'05, IEEE Data Eng. '05, HDMS'05, CMU-TR '05, ICDE'06 demo (submitted), VLDB J. (submitted), TODS (submitted)
Slide 10: Outline
- Introduction
- QPipe (DSS)
- STEPS
- Conclusions
Slide 11: Query-centric design of DB engines
- Queries are evaluated independently; no means to share across queries
- Need a new design to exploit common data, instructions, and work across operators
Slide 12: QPipe: operator-centric engine
- Conventional: "one query, many operators"; QPipe: "one operator, many queries"
- Relational operators become engines
- At run time, queries break up into tasks and queue up at the engines
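The "one operator, many queries" inversion can be sketched as follows. This is a toy model, not QPipe itself: the engine names, the `submit` helper, and the packet representation are assumptions made for illustration. It shows how two queries' plans break into per-operator packets, so an engine ends up holding work from many queries at once.

```python
class OperatorEngine:
    """Hypothetical micro-engine for one relational operator.
    Each engine owns a queue; every query with this operator in its
    plan deposits one packet here."""
    def __init__(self, op_name):
        self.op_name = op_name
        self.queue = []          # packets from many different queries

    def dispatch(self, query_id, args):
        self.queue.append((query_id, args))

engines = {op: OperatorEngine(op) for op in ("scan", "join", "aggregate")}

def submit(query_id, plan):
    # Break the plan into per-operator packets and queue each packet
    # at the matching engine ("one operator, many queries").
    for op, args in plan:
        engines[op].dispatch(query_id, args)

submit("Q1", [("scan", "LINEITEM"), ("aggregate", "count")])
submit("Q2", [("scan", "LINEITEM"), ("join", "ORDERS")])

# Both queries' scan packets now sit in the same queue, where the scan
# engine can notice that they read the same table.
```

Because both scan packets name the same table, the engine is in a natural position to detect and exploit the overlap, which the conventional one-thread-per-query design never sees.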
Slide 13: QPipe design
- A packet dispatcher routes incoming query plans to per-operator engines (e.g., Engine-S for scan, Engine-J for join, Engine-A for aggregate), in contrast to the conventional design's single shared thread pool
- Each engine has its own queue and threads, reading from and writing to the storage engine
Slide 14: Reusing data & work in QPipe
- Detect overlap between in-progress queries at run time (e.g., Q1 and Q2 overlap in one operator)
- Shared pages and intermediate results are simultaneously pipelined to the parent nodes of all overlapping queries
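A tiny sketch of the simultaneous-pipelining idea, under simplifying assumptions of my own (a scan modeled as a list of pages, consumers modeled as output lists, and a latecomer hard-wired to attach after the third page): the late query first receives a copy of the pages produced so far, then rides the same pipeline as the first query for the rest of the scan.

```python
def shared_scan(pages, consumers):
    """Toy simultaneous pipelining: one pass over `pages` feeds every
    attached consumer. A consumer that attaches mid-scan first catches
    up on past output, then shares the pipeline for the remainder."""
    produced = []
    for i, page in enumerate(pages):
        if i == 2:                      # a hypothetical late arrival
            late = []
            late.extend(produced)       # catch up: copy pages so far
            consumers.append(late)      # then join the live pipeline
        produced.append(page)
        for out in consumers:           # pipeline the page to everyone
            out.append(page)
    return consumers

streams = shared_scan(["p0", "p1", "p2", "p3"], [[]])
# Both consumers end up with every page, in scan order, from one pass.
```

Note that the latecomer still observes pages in the original order, which is why this style of sharing also works for order-sensitive scans.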
Slide 15: Mechanisms for sharing
- Existing approaches: multi-query optimization, materialized views, buffer-pool management, shared scans (RedBrick, Teradata, SQL Server)
- Their limitations: they require workload knowledge, are merely opportunistic, see limited use, or are not used in practice
- QPipe complements all of the above approaches
Slide 16: Experimental setup
- QPipe prototype: built on top of BerkeleyDB, 7,000 lines of C++; shared-memory buffers, native OS threads
- Platform: 2GHz Pentium 4, 2GB RAM, 4 SCSI disks
- Benchmark: TPC-H (4GB)
Slide 17: Sharing order-sensitive scans
- TPC-H Query 4: index scans of ORDERS and LINEITEM feed a merge-join, then an aggregate
- Two clients send the query at different intervals; QPipe performs the two merge-joins separately while sharing the scans beneath them, whether order-sensitive or order-insensitive
Slide 18: Sharing order-sensitive scans (results)
- Two clients send the query at different intervals; QPipe performs the two joins separately
- (graph: total response time in seconds vs. time difference between arrivals)
Slide 19: TPC-H workload
- Clients draw from a pool of 8 TPC-H queries
- QPipe reuses large scans and runs up to 2x faster, while maintaining low response times
- (graph: throughput in queries/hr vs. number of clients)
Slide 20: QPipe: conclusions
- DB engines evaluate queries independently; existing mechanisms for sharing are limited
- QPipe requires few code changes
- Simultaneous pipelining (SP) is a simple yet powerful technique that allows dynamic sharing of data and work
- Other benefits (not described here): I-cache and D-cache performance, efficient execution of MQO plans
Slide 21: Outline
- Introduction
- QPipe
- STEPS (OLTP)
- Conclusions
Slide 22: Online Transaction Processing
- On high-end, non-I/O-bound servers, L1-I stalls are 20-40% of execution time
- Instruction caches cannot grow (figure: L1-I sizes for various CPUs, 1996-2004, stay near 10-100 KB while max on-chip L2/L3 caches grow toward 10 MB)
- Need a solution for instruction-cache residency
Slide 23: Related work
- Hardware and compiler approaches: increased block size, stream buffers [Ranganathan98]; code layout optimizations [Ramirez01]
- Database software approaches: instruction cache for DSS [Padmanabhan01][Zhou04]
- Instruction cache for OLTP: challenging!
Slide 24: STEPS for cache-resident code
- STEPS: Synchronized Transactions through Explicit Processor Scheduling
- Keep the thread model, but insert synchronization points into transaction code (Begin, Select, Update, Insert, Delete, Commit) that is still larger than the I-cache
- Multiplex execution across transactions to reuse instructions
- Microbenchmark: eliminates 96% of L1-I misses; TPC-C: eliminates 2/3 of misses, 1.4x speedup
Slide 25: I-cache aware context-switching
- Break code such as select() into pieces (s1..s7), each of which fits in the I-cache, with a context-switch (CTX) point after each piece
- Without STEPS: each thread runs select() to completion, so every thread misses on every piece (all misses)
- With STEPS: after thread 1 faults a piece into the cache, control switches to thread 2, which executes the same piece and hits
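The miss pattern on this slide can be reproduced with a toy simulation. The cache model, block counts, and thread counts below are illustrative choices, not measurements from the thesis; the point is only to show why switching threads at cache-sized code boundaries turns repeated misses into hits.

```python
def run(schedule, cache_size):
    """Count I-cache misses for a trace of code-block ids,
    using a simple LRU cache holding `cache_size` blocks."""
    cache, misses = [], 0
    for block in schedule:
        if block in cache:
            cache.remove(block)          # refresh LRU position
        else:
            misses += 1
            if len(cache) == cache_size:
                cache.pop(0)             # evict least recently used
        cache.append(block)
    return misses

code = list(range(7))   # s1..s7: the pieces of select()
threads = 4
cache_size = 3          # toy cache holds only 3 pieces at once

# Without STEPS: each thread runs all of select() before the next starts,
# so each piece is evicted before anyone can reuse it.
naive = [b for _ in range(threads) for b in code]
# With STEPS: context-switch after each piece, so all threads execute it
# while it is still resident.
steps = [b for b in code for _ in range(threads)]

print(run(naive, cache_size), run(steps, cache_size))  # 28 misses vs. 7
```

With 4 threads and 7 pieces, the naive schedule misses on all 28 accesses, while the STEPS-style schedule misses only on the 7 first touches: every additional thread runs entirely out of the cache, mirroring the slide's M/H table.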
Slide 26: Placing CTX calls in source (AutoSTEPS tool)
- valgrind collects the instruction memory references of the DBMS binary
- A STEPS simulation over the trace selects memory addresses for CTX calls
- gdb maps those addresses back to source lines (e.g., file1.c:30, file2.c:40) where CTX calls are inserted
- Evaluation: comparable performance to manual placement, while being more conservative
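The middle step of this pipeline (deciding where CTX calls go, given an instruction trace) can be sketched as below. This is my own simplified guess at the idea, not the actual AutoSTEPS algorithm: it walks the trace and emits a switch point whenever the code footprint since the last point would exceed the I-cache capacity.

```python
def place_ctx_points(trace, cache_blocks, block_size=64):
    """Toy CTX placement: given a trace of instruction addresses, emit a
    context-switch point each time the distinct-code footprint since the
    last point would overflow an I-cache of `cache_blocks` blocks."""
    ctx_points, footprint = [], set()
    for addr in trace:
        block = addr // block_size
        footprint.add(block)
        if len(footprint) > cache_blocks:
            ctx_points.append(addr)   # switch just before this address
            footprint = {block}       # new window starts here
    return ctx_points

# Straight-line code touching five 64-byte blocks, cache holds two:
points = place_ctx_points([0, 64, 128, 192, 256], cache_blocks=2)
```

In the real tool these addresses would then be handed to gdb to recover source locations for the inserted CTX calls.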
Slide 27: Experimental setup (1st part)
- Implemented on top of Shore
- AMD AthlonXP: 64KB L1-I + 64KB L1-D, 256KB L2
- Microbenchmark: index fetch on an in-memory index; fast CTX for both systems, warm cache
Slide 28: Microbenchmark: L1-I misses
- STEPS eliminates 92-96% of misses for additional threads
- (graph: L1-I cache misses, 1K-4K, vs. 1-10 concurrent threads; AthlonXP)
Slide 29: L1-I misses & speedup
- STEPS achieves maximum performance for 6-10 threads; no need for larger thread groups
- (graphs: speedup of 1.1-1.4 and miss reduction of 40-100% vs. 10-80 concurrent threads; AthlonXP)
Slide 30: Challenges in full-system operation
- So far: threads interested in the same Op, uninterrupted flow, no thread scheduler
- Full-system requirements: high concurrency on similar Ops; handling exceptions (disk I/O, locks, latches, aborts); co-existing with system threads (deadlock detection, buffer-pool housekeeping)
Slide 31: System design
- Fast CTX through fixed scheduling
- Repair thread structures at exceptions: a stray thread leaves its execution team and moves to another Op
- Modify only the thread package; each Op (X, Y, Z) gets a STEPS wrapper
Slide 32: Experimental setup (2nd part)
- AMD AthlonXP: 64KB L1-I + 64KB L1-D, 256KB L2
- TPC-C (wholesale parts supplier): 2GB RAM, 2 disks, 10-30 warehouses (1-3GB), 100-300 users
- Zero think time, in-memory, lazy commits
Slide 33: One transaction: payment
- STEPS outperforms the baseline system: 1.4x speedup, 65% fewer L1-I misses
- (graph: normalized cycles and L1-I misses vs. number of users)
Slide 34: Mix of four transactions
- The transaction mix reduces team size; still, 56% fewer L1-I misses
- (graph: normalized cycles and L1-I misses vs. number of users)
Slide 35: STEPS: conclusions
- STEPS can handle full OLTP workloads
- Significant improvements in TPC-C: 65% fewer L1-I misses, 1.2-1.4x speedup
- STEPS minimizes both capacity and conflict misses without increasing I-cache size or associativity
Slide 36: StagedDB: future work
- Promising platform for chip multiprocessors: DBMS suffer from CPU-to-CPU cache misses, and StagedDB allows work to follow data rather than the other way around
- Resource scheduling: stages cluster requests for DB locks and I/O, with potential for deeper, more effective scheduling
Slide 37: Conclusions
- New hardware brings new requirements, yet server core design has remained the same
- A new design is needed to fit modern hardware
- StagedDB optimizes all levels of the memory hierarchy and is a promising design for future installations
Slide 38: Acknowledgments
The speaker would like to thank his academic advisor Anastassia Ailamaki; his thesis committee members Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker; and his coauthors Kun Gao, Vladislav Shkapenyuk, and Ryan Williams. Thank you.
Slide 39: QPipe backup
Slide 40: An Engine in detail
- Engine components: an engine queue, a scheduling thread, a thread pool (free and busy threads), the relational operator code as the main routine, and engine parameters
- Related techniques: tuple batching and query grouping for I- and D-cache performance [Harizopoulos04 VLDB, Zhou03 VLDB, Padmanabhan01 ICDE, Zhou04 SIGMOD]; simultaneous pipelining
Slide 41: Simultaneous Pipelining in QPipe
- Without SP: Q1 and Q2 each run their own join; Q1 writes its results while Q2 independently reads and recomputes
- With SP: (1) Q2 attaches to Q1's in-progress join; (2) when the partial results so far are COMPLETE, (3) they are copied to Q2; (4) the remaining results are pipelined to both queries via an SP coordinator
Slide 42: Sharing data & work across queries
- Query 1: "Find average age of students enrolled in both class A and class B" (scans of TABLE A and TABLE B feed a merge-join, then an aggregate)
- Query 2: max over a scan of TABLE A, a data-sharing opportunity with Query 1's scan
- Query 3: min over the same merge-join of TABLE A and TABLE B, a work-sharing opportunity with Query 1
Slide 43: Sharing opportunities at run time
- Q1 executes operator R; Q2 arrives with R in its plan
- The sharing potential is the overlap between result production for R in Q1 and in Q2
- Without SP: Q1 writes R's output and Q2 separately reads and recomputes; with SP: an SP coordinator pipelines R's output to both queries
Slide 44: TPC-H workload (backup)
- Clients use a pool of 8 TPC-H queries; QPipe reuses large scans and runs up to 2x faster while maintaining low response times
- (graphs: throughput in queries/hr vs. number of clients; average response time vs. think time in seconds)
Slide 45: STEPS backup
Slide 46: Smaller L1-I cache
- STEPS outperforms Shore even on smaller caches (Pentium III)
- 62-64% fewer mispredicted branches on both CPUs
- (graph: normalized cycles, L1-I misses, L1-D misses, branches, mispredicted branches, branches missing the BTB, and instruction stall cycles, 20-120%, one bar at 209%; AthlonXP and Pentium III, 10 threads)
Slide 47: SimFlex: L1-I misses
- STEPS eliminates all capacity misses (16KB and 32KB caches)
- Up to 89% overall miss reduction (the upper limit is 90%)
- (graph: L1-I cache misses, 2K-10K, vs. associativity from direct-mapped through 2-, 4-, and 8-way to fully associative; AthlonXP, 64B cache block, 10 threads)
Slide 48: One Xaction: payment (backup)
- STEPS outperforms Shore: 1.4x speedup, 65% fewer L1-I misses, 48% fewer mispredicted branches
- (graph: normalized cycles, L1-I, L1-D, L2-I, and L2-D misses, and mispredicted branches vs. number of warehouses)
Slide 49: Mix of four Xactions (backup)
- The transaction mix reduces average team size (4.3 in 10W); still, STEPS has 56% fewer L1-I misses (out of a 77% maximum)
- (graph: normalized cycles, L1-I, L1-D, L2-I, and L2-D misses, and mispredicted branches vs. number of warehouses; two bars at 121% and 125%)