1. Western Research Laboratory
Design and Evaluation of Architectures for Commercial Applications
Luiz André Barroso
Part III: architecture studies

2. Overview (3)
• Day III: architecture studies
  – Memory system characterization
  – Impact of out-of-order processors
  – Simultaneous multithreading
  – Final remarks

3. Memory system performance studies
• Collaboration with Kourosh Gharachorloo and Edouard Bugnion
• Presented at ISCA’98

4. Motivations
• Market shift for high-performance systems
  – yesterday: technical/numerical applications
  – today: databases, Web servers, e-mail services, etc.
• Bottleneck shift in commercial applications
  – yesterday: I/O
  – today: memory system
• Lack of data on the behavior of commercial workloads
• Re-evaluate memory system design trade-offs

5. Bottleneck Shift
• Just a few years back [Thakkar&Sweiger90], I/O was the only important bottleneck
• Since then, several improvements:
  – better DB engines can tolerate I/O latencies
  – better OSes do more efficient I/O operations and are more scalable
  – better parallelism in the disk subsystem (RAIDs) provides more bandwidth
• … and memory keeps getting “slower”
  – faster processors
  – bigger machines
• Result: the memory system is a primary factor today

6. Workloads
• OLTP (on-line transaction processing)
  – modeled after TPC-B, using the Oracle7 DB engine
  – short transactions, intense process communication & context switching
  – multiple transactions in transit
• DSS (decision support systems)
  – modeled after TPC-D, using Oracle7
  – long-running transactions, low process communication
  – parallelized queries
• AltaVista
  – Web index search application using a custom threads package
  – medium-sized transactions, low process communication
  – multiple transactions in transit

7. Methodology: Platform
• AlphaServer 4100 5/300
  – 4x 300 MHz processors (8KB/8KB I/D caches, 96KB L2 cache)
  – 2MB board-level cache
  – 2GB main memory
  – latencies: 1:7:21:80/125 cycles
  – 3-channel HSZ disk array controller
• Digital Unix 4.0B

8. Methodology: Tools
• Monitoring tools:
  – IPROBE
  – DCPI
  – ATOM
• Simulation tools:
  – tracing: preliminary user-level studies
  – SimOS-Alpha: full-system simulation, including the OS

9. Scaling
• Workload sizes make them difficult to study
• Scaling down the problem size is critical
• Validation criterion: memory system behavior similar to that of larger runs
• Requires a good understanding of the workload
  – make sure the system is well tuned
  – keep the SGA many times larger than the hardware caches (1GB)
  – use the same number of servers per processor as audit-sized runs (4-8/CPU)

10. CPU Cycle Breakdown
• Instruction- and data-related stalls are equally important
• Very high CPI for OLTP

11. Cache behavior

12. Stall Cycle Breakdown
• OLTP is dominated by non-primary cache and memory stalls
• DSS and AltaVista stalls are mostly Scache (on-chip L2) hits

13. Impact of On-Chip Cache Size
• 64KB on-chip caches are enough for DSS
[Chart; P=4; 2MB, 2-way off-chip cache]

14. OLTP: Effect of Off-Chip Cache Organization
• Significant benefits from large off-chip caches (up to 8MB)
[Chart; P=4]

15. OLTP: Impact of System Size
• Communication misses become dominant for larger systems
[Chart; P=4; 2MB, 2-way off-chip cache]

16. OLTP: Contribution of Dirty Misses
• Shared metadata is the important region
  – 80% of off-chip misses
  – 95% of dirty misses
• The fraction of dirty misses increases with cache and system size
[Chart; P=4, 8MB Bcache]

17. OLTP: Impact of Off-Chip Cache Line Size
• Good spatial locality on communication for OLTP
• Very little false sharing in Oracle itself
[Chart; P=4; 2MB, 2-way off-chip cache]

18. Summary of Results
• On-chip cache
  – 64KB I/D sufficient for DSS & AltaVista
• Off-chip cache
  – OLTP benefits from larger caches (up to 8MB)
• Dirty misses
  – can become dominant for OLTP

19. Conclusion
• The memory system is the current challenge in DB performance
• Careful scaling enables detailed studies
• The combination of monitoring and simulation is very powerful
• Diverging memory system designs
  – OLTP benefits from large off-chip caches, fast communication
  – DSS & AltaVista may perform better without an off-chip cache

20. Impact of out-of-order processors
• Collaboration with:
  – Kourosh Gharachorloo (Compaq)
  – Parthasarathy Ranganathan and Sarita Adve (Rice)
• Presented at ASPLOS’98

21. Motivation
• Databases are the fastest-growing market for shared-memory servers
  – Online transaction processing (OLTP)
  – Decision-support systems (DSS)
• But current systems are optimized for engineering/scientific workloads
  – aggressive use of instruction-level parallelism (ILP): multiple issue, out-of-order issue, non-blocking loads, speculative execution
• Need to re-evaluate system design for database workloads

22. Contributions
• Detailed simulation study of Oracle with ILP processors
• Is ILP design complexity warranted for database workloads?
  – improves performance (1.5X OLTP, 2.6X DSS)
  – reduces the performance gap between consistency models
• How can we improve performance for OLTP workloads?
  – OLTP is limited by instruction and migratory data misses
  – a small stream buffer comes close to a perfect instruction cache
  – prefetching/flush appears promising

23. Simulation Environment - Workloads
• Oracle 7.3.2 commercial DBMS engine
• Database workloads
  – Online transaction processing (OLTP) - TPC-B-like: day-to-day business operations
  – Decision-support system (DSS) - TPC-D/Query 6: offline business analysis

24. Simulation Environment - Methodology
• Used RSIM - Rice Simulator for ILP Multiprocessors
  – detailed simulation of processor, memory, and network
• But simulating a commercial-grade database engine is hard
  – some simplifications
  – similar to Lo et al. and Barroso et al., ISCA’98

25. Simulation Methodology - Simplifications
• Trace-driven simulation
• OS/system-call simulation
  – OS is not a large component
  – model only key effects:
    – page mapping, TLB misses, process scheduling
    – system-call and I/O time-dilation effects
    – multiple processes per processor to hide I/O latency
• Database scaling

26. Simulated Environment - Hardware
• 4-processor shared-memory system, 8 processes per processor
  – directory-based MESI protocol with invalidations
• Next-generation processing nodes
  – aggressive ILP processor
  – 128KB 2-way separate instruction and data L1 caches
  – 8MB 4-way unified L2 cache
• Representative miss latencies
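
For concreteness, here is a minimal sketch (illustrative only, not RSIM's actual implementation) of what "directory-based MESI with invalidations" means: the directory tracks which nodes hold a line, a read of a line held exclusively must get the data from its owner (the dirty misses studied here), and a write invalidates all other copies.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum class State { Invalid, Shared, Exclusive };  // directory-side view of one line

struct DirEntry {
    State state = State::Invalid;
    uint32_t sharers = 0;                      // bit i set => node i holds a copy

    std::vector<int> read(int node) {          // read miss from `node`
        std::vector<int> downgrades;
        if (state == State::Exclusive)         // owner supplies data: a dirty miss
            for (int i = 0; i < 4; ++i)
                if (sharers & (1u << i)) downgrades.push_back(i);
        sharers |= 1u << node;
        state = State::Shared;
        return downgrades;                     // nodes downgraded to Shared
    }

    std::vector<int> write(int node) {         // write miss/upgrade from `node`
        std::vector<int> invalidations;
        for (int i = 0; i < 4; ++i)
            if ((sharers & (1u << i)) && i != node) invalidations.push_back(i);
        sharers = 1u << node;                  // requester becomes exclusive owner
        state = State::Exclusive;
        return invalidations;
    }
};

int main() {
    DirEntry line;
    line.read(0); line.read(1);                // two nodes share the line
    auto invs = line.write(2);                 // node 2 writes: invalidate nodes 0, 1
    std::cout << invs.size() << " invalidations sent\n";
}
```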

27. Outline
• Motivation
• Simulation Environment
• Impact of ILP on Database Workloads
  – Multiple issue and OOO issue for OLTP
  – Multiple outstanding misses for OLTP
  – ILP techniques for DSS
  – ILP-enabled consistency optimizations
• Improving Performance of OLTP
• Conclusions

28. Multiple Issue and OOO Issue for OLTP
• Multiple issue and OOO improve performance by 1.5X
  – but 4-way issue and a 64-element window are enough
• Instruction misses and dirty misses are the key bottlenecks
[Chart: normalized execution time; in-order processors 100.0, 92.1, 90.1, 86.8; out-of-order processors 88.8, 74.3, 68.4, 67.8]

29. Multiple Outstanding Misses for OLTP
• Support for two distinct outstanding misses is enough
  – data-dependent computation
[Chart: normalized execution time with increasing outstanding misses: 100.0, 83.2, 79.4]

30. Impact of ILP Techniques for DSS
• Multiple issue and OOO improve performance by 2.6X
• 4-way issue, a 64-element window, and 4 outstanding misses are enough
• Memory is not a bottleneck
[Chart: normalized execution time; in-order processors 100.0, 74.1, 68.4, 68.1; out-of-order processors 89.2, 52.1, 39.7, 39.0]

31. ILP-Enabled Consistency Optimizations
• Memory consistency model of a shared-memory system
  – specifies ordering and overlap of memory operations
  – performance/programmability tradeoff:
    – Sequential consistency (SC)
    – Processor consistency (PC)
    – Release consistency (RC)
• ILP-enabled consistency optimizations
  – hardware prefetching, speculative loads
• Impact on database workloads?
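
As a modern, hedged analogy (not from the original talk): C++11 atomics express roughly the same spectrum, with memory_order_seq_cst playing the role of SC and the release/acquire pair below playing the role of RC, where ordering is enforced only at synchronization points. The ILP optimizations named above (prefetching from the instruction window, speculatively issuing loads) let SC hardware overlap memory operations while preserving the appearance of strict order, which is why the gap between the models shrinks.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> flag{false};

void producer() {
    data.store(42, std::memory_order_relaxed);    // ordinary write, free to overlap
    flag.store(true, std::memory_order_release);  // release: earlier writes ordered before
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) // acquire: later reads ordered after
        ;
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```

Under SC every access would be ordered as if seq_cst; RC pays that cost only at the release and acquire points.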

32. ILP-Enabled Consistency Optimizations (2)
• ILP-enabled optimizations
  – OLTP: RC only 1.1X better than SC (was 1.4X)
  – DSS: RC only 1.18X better than SC (was 1.85X)
• The consistency model choice in hardware is less important
[Chart: normalized execution time without vs. with optimizations, values 100, 88, 72, 74, 68; SC: sequential consistency, PC: processor consistency, RC: release consistency]

33. Outline
• Motivation
• Simulation Environment
• Impact of ILP on Database Workloads
• Improving Performance of OLTP
  – Improving OLTP - Instruction Misses
  – Improving OLTP - Dirty Misses
• Conclusions

34. Improving OLTP - Instruction Misses
• 4-element instruction-cache stream buffer
  – hardware prefetching of instructions
  – 1.21X performance improvement
• Simple and effective for database servers
[Chart: normalized execution time 100, 83, 71]
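
To make the mechanism concrete, here is a toy sketch (mine, not the paper's simulator) of a 4-entry sequential stream buffer: an I-cache miss checks the buffer, and a buffer miss flushes and refills it with the next four blocks. A real stream buffer also prefetches ahead as entries are consumed; this sketch refills only on misses.

```cpp
#include <array>
#include <cstdint>
#include <iostream>

class StreamBuffer {
    std::array<uint64_t, 4> blocks{};
    bool valid = false;
public:
    bool lookup(uint64_t block) {
        if (valid)
            for (uint64_t b : blocks)
                if (b == block) return true;   // hit: instruction block already prefetched
        refill(block + 1);                     // miss: prefetch the sequential successors
        return false;
    }
    void refill(uint64_t start) {
        for (size_t i = 0; i < blocks.size(); ++i) blocks[i] = start + i;
        valid = true;
    }
};

int main() {
    StreamBuffer sb;
    int hits = 0;
    for (uint64_t b = 100; b < 110; ++b)       // a sequential instruction stream
        hits += sb.lookup(b);
    std::cout << hits << " of 10 blocks served by the stream buffer\n";
}
```

Because OLTP instruction streams are largely sequential between taken branches, even this small buffer captures most I-cache misses.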

35. Improving OLTP - Dirty Misses
• Dirty misses
  – mostly to migratory data
  – due to few instructions in critical sections
• Solutions for migratory reads
  – software prefetching + producer-initiated flushes
• Preliminary results without access to source code
  – 1.14X performance improvement
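
A hedged sketch of how the two techniques combine for a migratory datum guarded by a lock. __builtin_prefetch is a real GCC/Clang intrinsic; flush_to_sharers() stands in for a producer-initiated flush/write-back primitive that the studied hardware would supply, and is purely hypothetical here.

```cpp
#include <mutex>

struct Account {
    std::mutex lock;
    long balance;
};

// Hypothetical hardware primitive: push the dirty line toward the next
// reader (producer-initiated flush). Stubbed out in this sketch.
void flush_to_sharers(void* /*line*/) {}

void credit(Account& a, long amount) {
    __builtin_prefetch(&a.balance, 1);  // prefetch-exclusive early, overlapping
                                        // the line transfer with lock acquisition
    a.lock.lock();
    a.balance += amount;                // short critical section: the data is
    a.lock.unlock();                    // migratory, touched briefly per CPU
    flush_to_sharers(&a.balance);       // hand the line off instead of waiting
                                        // for the next processor's dirty miss
}

int main() {
    Account a{};
    credit(a, 10);
    return a.balance == 10 ? 0 : 1;
}
```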

36. Summary
• Detailed simulation study of Oracle with out-of-order processors
• Impact of ILP techniques on database workloads
  – improves performance (1.5X OLTP, 2.6X DSS)
  – reduces the performance gap between consistency models
• Improving performance of OLTP
  – OLTP is limited by instruction and migratory data misses
  – a small stream buffer comes close to a perfect instruction cache
  – prefetching/flush appears promising

37. Simultaneous Multithreading (SMT)
• Collaboration with:
  – Kourosh Gharachorloo (Compaq)
  – Jack Lo, Susan Eggers, Hank Levy, Sujay Parekh (U. Washington)
• Exploit the multithreaded nature of commercial applications
• An aggressive wide-issue OOO superscalar saturates at 4 issue slots
• Potential to increase utilization of issue slots
• Potential to exploit parallelism in the memory system

38. SMT: what is it?
• SMT enables multiple threads to issue instructions to multiple functional units in a single cycle
• SMT exploits instruction-level & thread-level parallelism
  – hides long latencies
  – increases resource utilization and instruction throughput
[Figure: issue-slot utilization over time for a superscalar, fine-grain multithreading, and SMT, with four threads]
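
A toy model of the figure's point, with made-up numbers for illustration only: if each thread can supply at most about 3 ready instructions per cycle, a single-threaded superscalar and a fine-grain multithreaded machine (one thread per cycle) both leave most of an 8-wide machine idle, while SMT fills slots from all threads at once.

```cpp
#include <algorithm>
#include <iostream>

int main() {
    const int width = 8, threads = 4, ready_per_thread = 3, cycles = 100;

    // Superscalar: slots come from one thread only.
    int ss = cycles * std::min(width, ready_per_thread);

    // Fine-grain multithreading: a different single thread each cycle,
    // still limited by that one thread's ready instructions.
    int fg = cycles * std::min(width, ready_per_thread);

    // SMT: all threads may fill slots in the same cycle.
    int smt = cycles * std::min(width, threads * ready_per_thread);

    std::cout << "issue slots used of " << cycles * width
              << ": superscalar " << ss
              << ", fine-grain "  << fg
              << ", SMT "         << smt << "\n";
}
```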

39. SMT and database workloads
• Pro: SMT is a good match; databases can take advantage of SMT’s multithreading HW
  – low throughput
  – high cache miss rates
• Con: fine-grain interleaving can cause cache interference
• What software techniques can help avoid interference?

40. SMT studies: methodology
• Trace-driven simulation
• Same traces used in the previous ILP study
• New front-end to the SMT simulator
• Used OLTP and DSS workloads

41. SMT Configuration
• 21264-like superscalar base, augmented with:
  – up to 8 hardware contexts
  – 8-wide superscalar issue
  – 128KB, 2-way I and D L1 caches, 2-cycle access
  – 16MB, direct-mapped L2 cache, 12-cycle access
  – 80-cycle memory latency
  – 10 functional units (6 integer (4 ld/st), 4 FP)
  – 100 additional integer & FP renaming registers
  – integer and FP instruction queues, 32 entries each

42. OLTP Characterization
• Memory behavior (1 context, 16 server processes)
• High miss rates & large footprints

43. Cache interference (16 server processes)
• With 8-context SMT, many conflict misses
• The DSS data set fits in the L2$

44. Where are the misses?
• L1 and L2 misses are dominated by PGA references
• Misses result from unnecessary address conflicts
[Charts: misses to the PGA and misses to instructions; 16 server processes, 8-context SMT]

45. L2$ conflicts: page mapping
• Page coloring can be augmented with a random first seed
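
A minimal sketch of the idea, with illustrative parameters (8KB pages and a 16MB direct-mapped L2, hence 2048 colors): standard sequential page coloring, but each process starts at a random color so that identical virtual layouts in different server processes do not collide in the same cache bins.

```cpp
#include <cstdint>
#include <iostream>
#include <random>

// Illustrative: 16MB direct-mapped L2 / 8KB pages = 2048 page colors.
constexpr uint64_t kColors = (16u << 20) / 8192;

class PageColorer {
    uint64_t seed_;
public:
    PageColorer() {
        std::random_device rd;
        seed_ = rd() % kColors;            // random first color, distinct per process
    }
    // Sequential coloring from the per-process seed: a process's own pages
    // spread across bins, and two processes with identical virtual layouts
    // start in different bins.
    uint64_t colorFor(uint64_t vpn) const { return (vpn + seed_) % kColors; }
};

int main() {
    PageColorer p1, p2;                    // e.g., two server processes
    std::cout << "vpage 0 maps to color " << p1.colorFor(0)
              << " in one process, "      << p2.colorFor(0) << " in another\n";
}
```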

46. Results for different page mapping schemes
[Chart: global L2 cache miss rate for different page mapping schemes; DSS; 16MB, direct-mapped L2 cache, 16 server processes]

47. Why the steady L2$ miss rates?
• Not all of the footprint has temporal locality
  – the critical working sets are being cached
  – 87% of instruction refs are to 31% of the I-footprint
  – 41% of metadata refs are to 26KB
• SMT and superscalar cache misses are comparable
  – SMT changes the interleaving, not the total footprint
• With proper global policies, working sets still fit in caches: SMT is effective

48. L1$ conflicts: application-level offsetting
• The base of each thread’s PGA is at the same virtual address
  – causes unnecessary conflicts in a virtually-indexed cache
• Address offsets can avoid interference
  – offset by thread id * 8KB
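
A hedged sketch of the offsetting trick: the 8KB stride is from the slide, while the allocation scheme and region size are illustrative. Staggering each thread's PGA base by thread_id * 8KB makes the hot per-thread structures land in different index bits of the virtually-indexed L1.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr size_t kOffsetStride = 8 * 1024;  // 8KB per thread id (from the slide)
constexpr size_t kPgaSize      = 1 << 20;   // per-thread region size (illustrative)

// Return a PGA base whose L1 index bits differ per thread, so threads'
// identically-laid-out private regions stop conflicting in the cache.
void* allocate_pga(int thread_id) {
    size_t pad = kOffsetStride * static_cast<size_t>(thread_id);
    char* raw  = static_cast<char*>(std::malloc(kPgaSize + pad));
    return raw ? raw + pad : nullptr;       // usable base shifted by the offset
}

int main() {
    void* pga0 = allocate_pga(0);
    void* pga1 = allocate_pga(1);           // bases now 8KB apart in index bits
    return (pga0 && pga1) ? 0 : 1;
}
```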

49. Offsetting results
[Chart: bin hopping with no offset vs. bin hopping with offset; 128KB, 2-way set-associative L1 cache]

50. SMT: constructive interference
• Cache interference can also be beneficial
• The instruction segment is shared
  – SMT exploits instruction sharing
  – improves I-cache locality
  – reduces the I-cache miss rate (OLTP): 14% with superscalar → 9% with 8-context SMT

51. SMT: overall performance
[Chart: instructions per cycle (0 to 4); OLTP, 16 server processes]

52. Why SMT is effective
• Exploits memory-system concurrency
• Improves instruction fetch
• Improves instruction issue

53. Exploiting memory system concurrency
• OLTP has lots of pointer chasing
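
A small illustration (not from the talk) of why pointer chasing starves a single thread: each load's address comes from the previous load, so even a large out-of-order window cannot overlap the misses. With SMT, several threads each chase independent chains, so their misses overlap in the memory system even though each chain is serial.

```cpp
struct Node { Node* next; long payload; };

// Each load's address depends on the previous load, so a single thread's
// cache misses serialize: no memory-level parallelism within one chain.
long chase(const Node* n) {
    long sum = 0;
    while (n) {
        sum += n->payload;
        n = n->next;       // dependent load: must wait for the previous miss
    }
    return sum;
}

int main() {
    Node c{nullptr, 3}, b{&c, 2}, a{&b, 1};
    return chase(&a) == 6 ? 0 : 1;
}
```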

54. Improving instruction fetch
• SMT can fetch from multiple threads
• Tolerates I-cache misses and branch mispredictions
• Fetches fewer speculative instructions

55. Improving instruction issue
• SMT exposes more parallelism: it uses both instruction-level and thread-level parallelism

56. SMT Performance

57. Summary
• Critical working sets for DB workloads
  – can still fit in caches, even for SMT
  – fine-granularity interleaving can be accommodated
• Cache interference can be avoided with simple policies
  – page mapping and application-level offsetting
  – SMT miss rates comparable to superscalar
• SMT is effective
  – 4.6x speedup on OLTP, 1.5x on DSS

58. Final remarks
• We understand the architectural requirements of commercial applications better than we did a couple of years ago
• But both technology and applications are moving targets
• Lots to be done!

59. Final remarks (2)
• Important emerging workloads:
  – ERP benchmarks from software vendors
    – more representative of end-user OLTP performance
  – better decision support system algorithms
  – many Web-based applications
    – very young field
    – good benchmarks are still to come
    – TPC-W may be a good start
  – enterprise-scale mail servers
  – packet-switching servers for high-bandwidth subscriber connections (e.g., ADSL)

60. Final remarks (3)
• New technological/architectural challenges
  – large-scale NUMA architectures
    – worsen the dirty-miss problem
    – reliability and fault containment
  – increased integration
    – what’s the next subsystem to move on-chip?
  – explicitly parallel ISAs?
  – impact of next-generation Direct Rambus DRAMs
    – very low latency
  – logic+DRAM integration
  – what if memory were non-volatile?

61. Final remarks (4)
• More short-term issues
  – How to reduce I-stream-related stalls for OLTP?
  – How to reduce communication penalties in OLTP?
    – Prefetch/post-store?
    – Smarter coherence protocols?
  – How to deal with 100s of threads per processor?
  – Innovative ways to reduce the latency of pointer-based access patterns?
  – Can clusters become competitive in REAL OLTP environments?

