
1 The Stanford Hydra Chip Multiprocessor
Kunle Olukotun and the Hydra Team
Computer Systems Laboratory, Stanford University

2 Technology → Architecture
- Transistors are cheap, plentiful, and fast
  - Moore's law: 100 million transistors by 2000
- Wires are cheap, plentiful, and slow
  - Wires get slower relative to transistors
  - Long cross-chip wires are especially slow
- Architectural implications
  - Plenty of room for innovation
  - Single-cycle communication requires localized blocks of logic
  - High communication bandwidth across the chip is easier to achieve than low latency

3 Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions]

4 Hydra Approach
- A single-chip multiprocessor architecture composed of simple, fast processors
- Multiple threads of control
  - Exploits parallelism at all levels
- Memory renaming and thread-level speculation
  - Makes it easy to develop parallel programs
- Keep the design simple by taking advantage of a single-chip implementation

5 Outline
- Base Hydra architecture
- Performance of the base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions

6 The Base Hydra Design
- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared 2nd-level cache
- Separate read and write buses
- Low-latency interprocessor communication (10 cycles)

7 Hydra vs. Superscalar
- ILP only: superscalar is 30-50% better than a single Hydra processor
- ILP & fine-grained threads: superscalar and Hydra are comparable
- ILP & coarse-grained threads: Hydra is 1.5-2× better
- "The Case for a CMP," ASPLOS '96
[Chart: speedup (0-4) of a 6-way-issue superscalar vs. Hydra (4 × 2-way issue) on compress, m88ksim, eqntott, MPEG2, applu, apsi, swim, tomcatv, pmake, and OLTP]

8 Problem: Parallel Software
- Parallel software is limited
  - Hand-parallelized applications
  - Auto-parallelized dense-matrix FORTRAN applications
- Traditional auto-parallelization of C programs is very difficult
  - Threads have data dependencies → synchronization
  - Pointer disambiguation is difficult and expensive (see the sketch below)
  - Compile-time analysis is too conservative
- How can hardware help?
  - Remove the need for pointer disambiguation
  - Allow the compiler to be aggressive
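To make the pointer problem concrete, here is a minimal sketch (the function and names are illustrative, not from the Hydra compiler): unless the compiler can prove the two pointers never overlap, it must assume each store may feed a later load and serialize the loop.

```c
/* Without knowing that dst and src never overlap, the compiler must
 * assume dst[i] may alias some src[i+k] read by a later iteration,
 * so it cannot safely run the iterations in parallel. */
void scale(float *dst, const float *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```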

9 Solution: Data Speculation
- Data speculation enables parallelization without regard for data dependencies
  - Loads and stores follow the original sequential semantics
  - Speculation hardware ensures correctness
  - Add synchronization only for performance
  - Loop parallelization is now easily automated
- Other ways to parallelize code
  - Break code into arbitrary threads (e.g., speculative subroutines)
  - Parallel execution with sequential commits
- Data speculation support
  - Wisconsin Multiscalar
  - Hydra provides low-overhead support for a CMP
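A sketch of the kind of loop this enables (illustrative code, not from the Hydra tool chain): statically, any two iterations may touch the same bucket, so the compiler must serialize the loop; under speculation the iterations run in parallel and only the rare colliding pair is squashed and re-executed.

```c
#include <stddef.h>

#define NBUCKETS 1024

/* Most iterations update different buckets, so speculative threads
 * usually commit without conflict; when two iterations do hit the
 * same bucket, the hardware detects the violation and re-runs the
 * later iteration, preserving sequential semantics. */
void histogram(const int *key, size_t n, int *bucket) {
    for (size_t i = 0; i < n; i++)
        bucket[key[i] % NBUCKETS]++;
}
```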

10 Data Speculation Requirements I
- Forward data between parallel threads
- Detect violations when reads occur too early

11 Data Speculation Requirements II
- Safely discard bad state after a violation
- Correctly retire speculative state

12 Data Speculation Requirements III
- Maintain multiple "views" of memory

13 Hydra Speculation Support
- Write bus and L2 buffers provide forwarding
- "Read" L1 tag bits detect violations
- "Dirty" L1 tag bits and write buffers provide backup
- Write buffers reorder and retire speculative state
- Separate L1 caches with pre-invalidation and smart L2 forwarding provide the "views"
- Speculation coprocessors control the threads
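A minimal software model of the violation-detection mechanism (plain C standing in for the tag-bit hardware; thread IDs encode speculation order, lower = less speculative; all names are hypothetical):

```c
#include <stdbool.h>
#include <stdio.h>

#define NCPUS  4
#define NLINES 256

static bool read_bit[NCPUS][NLINES];   /* set on each speculative load  */
static bool dirty_bit[NCPUS][NLINES];  /* set on each speculative store */

static void spec_load(int cpu, int line) {
    read_bit[cpu][line] = true;
}

/* Returns the ID of a more-speculative CPU that read the line too
 * early (and must be squashed), or -1 if the store is safe. */
static int spec_store(int cpu, int line) {
    dirty_bit[cpu][line] = true;
    for (int later = cpu + 1; later < NCPUS; later++)
        if (read_bit[later][line])
            return later;
    return -1;
}

int main(void) {
    spec_load(2, 7);                   /* CPU 2 reads line 7 early  */
    int victim = spec_store(1, 7);     /* earlier CPU 1 then writes */
    printf("violation on CPU %d\n", victim);  /* violation on CPU 2 */
    return 0;
}
```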

14 Speculative Reads
- L1 hit: the read bits are set
- L1 miss: the L2 and the write buffers are checked in parallel
  - The newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
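The per-byte merge can be modeled in software roughly as follows (a sketch, assuming lower CPU numbers are less speculative; Hydra does this with per-byte priority encoders in hardware, not a loop):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32
#define NCPUS      4

typedef struct {
    bool    valid[LINE_BYTES];   /* bytes this CPU has written */
    uint8_t data[LINE_BYTES];
} wbuf_line_t;

/* On a speculative L1 miss, build the reader's view of the line:
 * start from the committed L2 copy, then overlay bytes buffered by
 * each less-speculative CPU, closest predecessor last, so every byte
 * comes from the newest write that precedes the reader. */
void merge_line(uint8_t out[LINE_BYTES],
                const uint8_t l2[LINE_BYTES],
                const wbuf_line_t wb[NCPUS],
                int reader_cpu) {
    memcpy(out, l2, LINE_BYTES);
    for (int cpu = 0; cpu < reader_cpu; cpu++)   /* oldest first */
        for (int b = 0; b < LINE_BYTES; b++)
            if (wb[cpu].valid[b])
                out[b] = wb[cpu].data[b];        /* newer wins   */
}
```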

15 Speculative Writes
- A CPU writes to its own L1 cache and write buffer
- Writes from "earlier" CPUs invalidate our L1 and trigger RAW hazard checks
- Writes from "later" CPUs just pre-invalidate our L1
- The non-speculative write buffer drains out into the L2
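A sketch of the snoop rule as seen by one CPU (a hypothetical model, not the Hydra RTL; lower CPU numbers are earlier in sequential order):

```c
#include <stdbool.h>

#define NLINES 256

static bool l1_valid[NLINES];   /* our L1 line-valid bits          */
static bool l1_read[NLINES];    /* our speculatively-read tag bits */

typedef enum { SNOOP_OK, SNOOP_VIOLATION } snoop_result_t;

/* Called when this CPU (`me`) snoops a write by CPU `writer` to
 * `line`.  Earlier writers both invalidate our copy and check for a
 * RAW hazard; later writers only pre-invalidate. */
snoop_result_t snoop_write(int writer, int me, int line) {
    l1_valid[line] = false;           /* either way, drop our copy */
    if (writer < me && l1_read[line])
        return SNOOP_VIOLATION;       /* we read too early: squash */
    return SNOOP_OK;                  /* later write: pre-invalidation */
}
```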

16 Speculation Runtime System
- Software handlers
  - Control speculative threads through the CP2 (coprocessor) interface
  - Track the order of all speculative threads
  - Exception routines recover from data dependence violations
- Adds more overhead to speculation than a pure-hardware approach, but is more flexible and simpler to implement
- Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99)
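The exception path might look roughly like this in software (a sketch only; the cp2_* calls stand in for the coprocessor-2 control interface and are stubbed assumptions, not the real RTS API):

```c
#include <stdio.h>

/* Stubs standing in for the CP2 control interface (assumptions for
 * the sketch); they print instead of touching hardware. */
static void cp2_discard_write_buffer(int id) { printf("discard write buffer of thread %d\n", id); }
static void cp2_restart_speculative(int id)  { printf("restart speculative thread %d\n", id); }

typedef struct {
    int  id;              /* position in sequential thread order */
    int  iter;            /* loop iteration this thread executes */
    void (*body)(int);    /* the iteration body                  */
} spec_thread_t;

/* Exception routine invoked on a data dependence violation: throw
 * away the thread's speculative writes, then re-run the iteration. */
static void on_violation(spec_thread_t *t) {
    cp2_discard_write_buffer(t->id);
    cp2_restart_speculative(t->id);
    t->body(t->iter);
}

static void demo_body(int i) { printf("re-running iteration %d\n", i); }

int main(void) {
    spec_thread_t t = { .id = 2, .iter = 7, .body = demo_body };
    on_violation(&t);
    return 0;
}
```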

17 Creating Speculative Threads
- Speculative loops
  - for and while loop iterations
  - Typically one speculative thread per iteration
- Speculative procedures
  - Execute the code after a procedure call speculatively
  - Procedure calls generate a speculative thread
- Compiler support
  - C source-to-source translator (pfor, pwhile; see the sketch below)
  - Analyze the loop body and globalize any local variables that could cause loop-carried dependencies
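A sketch of the transformation such a translator might perform for a pfor loop (the spec_pfor entry point and the generated iter_body function are hypothetical; speculative execution is modeled by a plain sequential loop so the sketch runs anywhere):

```c
#include <stdio.h>

/* Hypothetical RTS entry point: runs body(i) for i in [lo, hi) as
 * speculative threads committing in sequential order.  Modeled here
 * by an ordinary loop. */
static void spec_pfor(int lo, int hi, void (*body)(int)) {
    for (int i = lo; i < hi; i++)
        body(i);
}

static int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static int sum;

/* What the translator might emit for:
 *     for (i = 0; i < 8; i++) { int tmp = 2 * a[i]; sum += tmp; }
 * The local tmp is private to each call; sum carries a dependence,
 * and if it had been a local of the enclosing function the translator
 * would globalize it (as the slide describes) so the dependence flows
 * through memory, where the speculation hardware can track it. */
static void iter_body(int i) {
    int tmp = 2 * a[i];
    sum += tmp;
}

int main(void) {
    spec_pfor(0, 8, iter_body);
    printf("sum = %d\n", sum);   /* prints: sum = 72 */
    return 0;
}
```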

18 Base Speculative Thread Performance
- Entire applications, compiled with GCC 2.7.2 -O2
- 4 single-issue processors
- Accurate modeling of all aspects of the Hydra architecture and the real runtime system
[Chart: baseline speedup (0-4) on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]

19 Improving the Speculative Runtime System
- Procedure support adds overhead to loops
  - Threads are not created sequentially, so dynamic thread scheduling is necessary
  - Start and end of loop: 75 cycles
  - End of iteration: 80 cycles
- Performance
  - The best-performing speculative applications use loops
  - Procedure speculation often lowers performance
  - Need to optimize the RTS for the common case
- Lower speculative overheads
  - Start and end of loop: 25 cycles
  - End of iteration: 12 cycles (almost a factor of 7 lower)
  - Limit procedure speculation to specific procedures

20 Improved Speculative Performance
- The optimized RTS improves the performance of all applications
- Most improvement comes in applications with fine-grained threads
- eqntott uses procedure speculation
[Chart: speedup (0-4) of the base vs. optimized RTS on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]

21 Optimizing Parallel Performance
- Cache-coherent shared memory
  - No explicit data movement
  - 100+ cycle communication latency
  - Need to optimize for data locality
  - Look at cache misses (MemSpy, Flashpoint)
- Speculative threads
  - No explicit data independence
  - Frequent dependence violations limit performance
  - Need to optimize to reduce the frequency and impact of violations; dependence prediction can help
  - Look at violation statistics (requires some hardware support)

22 Feedback and Code Transformations
- Feedback tool
  - Collects violation statistics (PCs, frequency, work lost)
  - Correlates read and write PC values with the source code
- Synchronization (sketched below)
  - Synchronize frequently occurring violations
  - Use non-violating loads
- Code motion
  - Find dependent load-store pairs
  - Move loads down in the thread
  - Move stores up in the thread
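A sketch of synchronizing a frequently violating dependence (illustrative C11 code; Hydra's non-violating load is a hardware primitive, modeled here with a portable flag-based handoff):

```c
#include <stdatomic.h>

static atomic_int ready;   /* set once the earlier thread has produced `last` */
static int last;

/* Earlier (less speculative) thread publishes the value. */
void produce(int v) {
    last = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Later thread waits instead of speculating: the load of `last` can
 * no longer occur too early, so this dependence never violates. */
int consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;   /* spin: trades a short stall for a squash-and-restart */
    return last;
}
```

The design point, per the slide, is that an explicit wait is cheaper than a violation only when the violation is frequent; rarely violating dependences are better left to speculation.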

23 Code Motion
- Rearrange reads and writes to increase parallelism
- Delay reads and advance writes
- Create local copies to allow earlier data forwarding (see the C sketch below)
[Figure: iterations i and i+1 before and after code motion; publishing the write of x earlier via a local copy x' lets iteration i+1's read of x be forwarded sooner]
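In C, the transformation might look like this (illustrative names; long_computation stands for the independent work in each iteration):

```c
static int x;

static int  g(int t)                { return t + 1; }
static void long_computation(int t) { (void)t; /* independent work */ }

/* Before: x is written at the end of the iteration, so iteration
 * i+1's early read of x nearly always arrives too soon and violates. */
void loop_before(int n) {
    for (int i = 0; i < n; i++) {
        int t = x;             /* read x (early) */
        long_computation(t);
        x = g(t);              /* write x (late) */
    }
}

/* After: compute the new value into a local copy and publish it
 * first; x is forwarded to later iterations much sooner, shrinking
 * the window in which their reads can violate. */
void loop_after(int n) {
    for (int i = 0; i < n; i++) {
        int t = x;             /* read x              */
        x = g(t);              /* advanced write (x') */
        long_computation(t);   /* bulk of the work moved after the publish */
    }
}
```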

24 Optimized Speculative Performance
- Base performance
- Optimized RTS with no manual intervention
- Violation statistics used to manually transform the code
[Chart: speedup (0-4) of the three configurations on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]

25 Size of Speculative Write State
- The maximum write-state size determines the write-buffer size needed for maximum performance
  - A non-head processor stalls when its write buffer fills up
- Small write buffers (< 64 lines) will achieve good performance
[Chart: maximum number of lines of write state per application, with 32-byte cache lines]

26 Hydra Prototype
- Design based on the Integrated Device Technology (IDT) RC32364
- 88 mm² in a 0.25 µm process, with 8 KB instruction and data caches and a 128 KB L2

27 Conclusions
- Hydra offers a new way to design microprocessors
  - A single-chip MP exploits parallelism at all levels
  - Low-overhead support for speculative parallelism
  - Provides high performance on applications with medium- to large-grain parallelism
  - Allows a performance-optimization migration path for hard-to-parallelize fine-grain applications
- Prototype implementation
  - Work out implementation details
  - Provide a platform for application and compiler development
  - Realistic performance evaluation

28 The Hydra Team
- Team: Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prahbu, Mike Siu, Melvyn Lim, and Maciek Kozyrczak (IDT)
- URL: http://www-hydra.stanford.edu

