Presentation is loading. Please wait.

Presentation is loading. Please wait.

Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin.

Similar presentations

Presentation on theme: "Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin."— Presentation transcript:

1 Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin

2 2 Outline Waves of innovation in architecture Innovation in uniprocessors Opportunities for innovation Example CMP innovations  Parallel processing models  CMP memory hierarcies

3 3 Waves of Research and Innovation A new direction is proposed or new opportunity becomes available The center of gravity of the research community shifts to that direction  SIMD architectures in the 1960s  HLL computer architectures in the 1970s  RISC architectures in the early 1980s  Shared-memory MPs in the late 1980s  OOO speculative execution processors in the 1990s

4 4 Waves Wave is especially strong when coupled with a “step function” change in technology  Integration of a processor on a single chip  Integration of a multiprocessor on a chip

5 5 Uniprocessor Innovation Wave Integration of processor on a single chip  The inflexion point  Argued for different architecture (RISC) More transistors allow for different models  Speculative execution Then the rebirth of uniprocessors  Continue the journey of innovation  Totally rethink uniprocessor microarchitecture

6 6 The Next Wave Can integrate a simple multiprocessor on a chip  Basic microarchitecture similar to traditional MP Rethink the microarchitecture and usage of chip multiprocessors

7 7 Broad Areas for Innovation Overcoming traditional barriers New opportunities to use CMPs for parallel/multithreaded execution Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)

8 8 Remainder of Talk Roadmap Summary of traditional parallel processing Revisiting traditional barriers and overheads to parallel processing Novel CMP applications and workloads Novel CMP microarchitectures

9 9 Multiprocessor Architecture Take state-of-the-art uniprocessor Connect several together with a suitable network  Have to live with defined interfaces Expend hardware to provide cache coherence and streamline inter-node communication  Have to live with defined interfaces

10 10 Software Responsibilities Reason about parallelism, execution times and overheads  This is hard Use synchronization to ease reasoning  Parallel trends towards serial with the use of synchronization Very difficult to parallelize transparently

11 11 Net Result Difficult to get parallelism speedup  Computation is serial  Inter-node communication latencies exacerbate problem Multiprocessors rarely used for parallel execution Typical use: improve throughput This will have to change  Will need to rethink “parallelization”

12 12 Rethinking Parallelization Speculative multithreading Speculation to overcoming other performance barriers Revisiting computation models for parallelization New parallelization opportunities  New types of workloads (e.g., multimedia)  New variants of parallelization models

13 13 Speculative Multithreading Speculatively parallelize an application  Use speculation to overcome ambiguous dependences  Use hardware support to recover from mis- speculation E.g., multiscalar Use hardware to overcome barriers

14 14 Overcoming Barriers: Memory Models Weak models proposed to overcome performance limitations of SC Speculation used to overcome “maybe” dependences Series of papers showing SC can achieve performance of weak models

15 15 Implications Strong memory models not necessarily low performance Programmer does not have to reason about weak models More likely to have parallel programs written

16 16 Overcoming Barriers: Synchronization Synchronization to avoid “maybe” dependences  Causes serialization Speculate to overcome serialization Recent work on techniques to dynamically elide synchronization constructs

17 17 Implications Programmer can make liberal use of synchronization to ease programming Little performance impact of synchronization More likely to have parallel programs written

18 18 Revisiting Parallelization Models Transactions  simplify writing of parallel code  very high overhead to implement semantics in software Hardware support for transactions will exist  Speculative multithreading is ordered transactions  No software overhead to implement semantics More applications likely to be written with transactions

19 19 New Opportunities for CMPs New opportunities for parallelism  Emergence of multimedia workloads Amenable to traditional parallel processing  Parallel execution of overhead code  Program demultiplexing  Separating user and OS

20 20 New Opportunities for CMPs New opportunities for microarchitecture innovation  Instruction memory hierarchies  Data memory hierarchies

21 21 Reliable Systems Software is unreliable and error-prone Hardware will be unreliable and error-prone Improving hardware/software reliability will result in significant software redundancy  Redundancy will be source of parallelism

22 22 Software Reliability & Security Reliability & Security via Dynamic Monitoring -Many academic proposals for C/C++ code - Ccured, Cyclone, SafeC, etc… -VM performs checks for Java/C# High Overheads! -Encourages use of unsafe code

23 23 Software Reliability & Security - Divide program into tasks -Fork a monitor thread to check computation of each task - Instrument monitor thread with safety checking code A B C D A’ B’ C’ Monitoring Code Program

24 24 Software Reliability & Security -Commit/abort at task granularity - Precise error detection achieved by re-executing code w/ in-lined checks C D B B’ A A’ COMMIT C’ ABORT C’ D

25 25 Software Reliability & Security Fine-grained instrumentation - Flexible, arbitrary safety checking Coarse-grained verification - Amortizes thread startup latency over many instructions Initial Results: Software-based Fault Isolation (Wahbe et al. SOSP 1993) - Assumes no misspeculation due to traps, interrupts, speculative storage overflow

26 26 Software Reliability & Security

27 27 Program De-multiplexing New opportunities for parallelism  2-4X parallelism Program is a multiplexing of methods (or functions) onto single control flow De-multiplex methods of program Execute methods in parallel in dataflow fashion

28 28 Data-flow Execution Data-flow Machines - Dataflow in programs - No control flow, PC - Nodes are instrs. on FU OoO Superscalar Machines - Sequential programs - Limited dataflow with ROB - Nodes are instrs. on FU 6 7

29 29 Program Demultiplexing Demux’ed Execution Nodes - methods (M) Processors - FUs Sequential Program

30 30 Program Demultiplexing Sequential Program Triggers - usually fire after data dep of M - Chosen with software support - Handlers generate params PD - Data-flow based spec. execution - Differs from control-flow spec. 6 On Trigger- Execute On Call- Use Demux’ed exec. (6)

31 31 Data-flow Execution of Methods Earliest possible Exec. time of M Exec. Time + overheads of M

32 32 Impact of Parallelization Expect different characteristics for code on each core More reliance on inter-core parallelism Less reliance on intra-core parallelism May have specialized cores

33 33 Microarchitectural Implications Processor Cores  Skinny, less complex Will SMT go away?  Perhaps specialized Memory structures (i-caches, TLBs, d- caches)  Different organizations possible

34 34 Microarchitectural Implications Novel memory hierarchy opportunities  Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth Pressure on non-core techniques to tolerate longer latencies Helper threads, pre-execution

35 35 Instruction Memory Hierarchy Computation spreading

36 36 Conventional CMP Organizations P L1I L1D L2 Interconnect Network P L1I L1D L2 P L1I L1D L2 P L1I L1D L2 ……

37 37 Code Commonality Large fraction of the code executed is common to all the processors both at 8KB page (left) and 64-byte cache block (right) granularity Poor utilization of L1 I-cache due to replication

38 38 Removing Duplication Avoiding duplication significantly reduces misses Shared L1 I-cache may be impractical MPKI (64K)

39 39 Computation Spreading Avoid reference spreading by distributing the computation based on code regions  Each processor is responsible for a particular region of code (for multiple threads)  References to a particular code region is localized  Computation from one thread is carried out in different processors

40 P1P2P3 AB’C’’ BC’A’’ CA’B’’ P1P2P3 AB’C’’ A’BC’ A’’B’’C T1T2T3 ABCABC B’ C’ A’ C’’ A’’ B’’ Canonical Model Computation Spreading Example TIME

41 41 L1 cache performance Significant reduction in instruction misses Data Cache Performance is deteriorated  Migration of computation disrupts locality of data reference

42 42 Data Memory Hierarchy Co-operative caching

43 43 Conventional CMP Organizations P L1I L1D L2 Interconnect Network P L1I L1D L2 P L1I L1D L2 P L1I L1D L2 ……

44 44 Conventional CMP Organizations P L1I L1D Interconnect Network P L1I L1D P L1I L1D P L1I L1D …… Shared L2

45 45 A Spectrum of CMP Cache Designs Shared Caches Private Caches Configurable/malleable Caches (Liu et al. HPCA’04, Huh et al. ICS’05, etc) Cooperative Caching Unconstrained capacity sharing; Best capacity No capacity sharing; Best latency Hybrid Schemes (CMP-NUCA, Victim replication, CMP-NuRAPID, etc) …

46 46 CMP Cooperative Caching Basic idea: Private caches cooperatively form an aggregate global cache  Use private caches for fast access / isolation  Share capacity through cooperation  Mitigate interference via cooperation throttling Inspired by cooperative file/web caches P L1IL1D L2 P L1IL1D L2 L1IL1DL1IL1D PP

47 47 Reduction of Off-chip Accesses Trace-based simulation: 1MB L2 cache per core Shared (LRU) Most reduction C-Clean: good for commercial workloads C-Singlet: Good for all benchmarks C-1Fwd: good for heterogeneous workloads MPKI (private): Reduction of off-chip access rate

48 48 Cycles Spent on L1 Misses Trace-based simulation: no miss overlapping Latencies: 10 cycles Local L2; 30 cycles next L2, 40 cycles remote L2, 300 cycles off-chip accesses PCFS OLTPApacheECperfJBBBarnesOceanMix1Mix2 PCFS P:Private C:C-Clean F:C-1Fwd S:Shared

49 49 Summary Start of a new wave of innovation to rethink parallelism and parallel architecture New opportunities for innovation in CMPs New opportunities for parallelizing applications Expect little resemblance between MPs today and CMPs in 15 years We need to invent and define differences

Download ppt "Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin."

Similar presentations

Ads by Google