Adaptive Single-Chip Multiprocessing

1 Adaptive Single-Chip Multiprocessing
Dan Gibson, University of Wisconsin-Madison, Department of Electrical and Computer Engineering

2 Introduction
Moore’s Law continues to provide more transistors:
- Devices are getting smaller
- Devices are getting faster, leading to increases in clock frequency
- Memories are getting bigger, and large memories often require more time to access
- RC circuits continue to charge exponentially
- Long-wire signal propagation time is not improving as rapidly as switching speed
- On-chip communication is becoming slower relative to processor clock speeds

3 The Memory Wall
Processors grow faster, memory grows slower:
- Off-chip cache misses can halt even aggressive out-of-order processors
- On-chip cache accesses are becoming long-latency events
Latency can sometimes be tolerated by:
- Caching
- Prefetching
- Speculation
- Out-of-order execution
- Multithreading (a back-of-envelope sketch follows this list)
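As a rough illustration of why multithreading helps here, the toy calculation below estimates how many hardware threads it takes to keep a core busy across off-chip misses. The cycle counts are invented for illustration; the talk gives no specific figures.

```python
# Back-of-envelope: how many hardware threads does it take to hide
# off-chip miss latency? All numbers below are illustrative assumptions,
# not measurements from the talk.

compute_cycles = 50    # cycles a thread runs between cache misses (assumed)
miss_latency   = 300   # cycles to service an off-chip miss (assumed)

# While one thread waits on its miss, other threads can run. The core
# stays busy if the other threads' compute time covers the miss latency:
threads_needed = 1 + miss_latency / compute_cycles
print(f"~{threads_needed:.0f} threads to fully overlap misses")  # ~7
```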

4 The “Power” Wall
More devices and faster clocks mean more power:
- Power supply accounts for a large share of pins in chip packaging (3,057 of 5,370 pins on the POWER5)
- Heat dissipation increases total cost of ownership (~34 W of cooling power required to remove 100 W of heat)
Dynamic power in CMOS follows P = α · C_L · V² · f:
- Devices get smaller, faster, and more numerous: more capacitance (C_L) and higher frequency (f)
- Architects can constrain α, C_L, and f (a worked example follows)
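Since the slide leans on the dynamic-power relation, here is a short worked example of P = α · C_L · V² · f. All numeric values are assumptions chosen for illustration, not figures from the talk.

```python
# A minimal worked example of the CMOS dynamic-power relation
# P = alpha * C_L * V^2 * f. All values are illustrative assumptions.

alpha  = 0.15     # activity factor: fraction of capacitance switched per cycle (assumed)
c_load = 2e-9     # total switched capacitance in farads (assumed)
v_dd   = 1.2      # supply voltage in volts (assumed)
freq   = 2.0e9    # clock frequency in hertz (assumed)

p_dynamic = alpha * c_load * v_dd**2 * freq
print(f"Dynamic power: {p_dynamic:.2f} W")       # 0.86 W

# Halving frequency halves dynamic power; lowering V_DD helps
# quadratically, which is why architects target alpha, C_L, and f.
p_half_freq = alpha * c_load * v_dd**2 * (freq / 2)
print(f"At half frequency: {p_half_freq:.2f} W")  # 0.43 W
```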

5 Enter Chip Multiprocessors (CMPs)
One chip, many processors:
- Multiple cores per chip
- Often multiple threads per core
[Die photo: Dual-Core AMD Opteron. From Microprocessor Report: Best Servers of 2004]

6 CMPs
CMPs can have good performance:
- Explicit thread-level parallelism
- Related threads experience constructive prefetching
CMPs can tolerate long-latency events well:
- Many concurrent threads mean long-latency memory accesses can be overlapped
CMPs can be power-efficient:
- They enable the use of simpler cores
- They distribute “hot spots”

7 CMPs
CMPs are very specialized:
- They assume a (highly) threaded workload
Parallel machines are difficult to use:
- Parallel programming is not (yet) commonplace
Many problems are similar to traditional multiprocessors:
- Cache coherence
- Memory consistency
Many new opportunities:
- Cache sharing
- More integration

8 Adaptive CMPs
To combat specialization, adapt a CMP dynamically to its current workload and system:
- Adapt caching policy (Beckmann et al., Chang et al., and more)
- Adapt cache structure (Alameldeen et al., and more)
- Adapt thread scheduling (Kihm et al., in the SMT space)
Current idea: adaptive thread scheduling over the space of un-stalled and stalled threads; a union of single-core multithreading and runahead execution in the context of CMPs.

9 Single-Core Multithreading
Allow multiple (HW) threads within the same execution pipeline:
- Shares processor resources: FUs, decode logic, ROB, etc.
- Shares local memory resources: L1 caches, LSQ, etc.
- Can increase processor and memory utilization (a sketch of thread selection follows)
[Figure: Sun’s Niagara pipeline block diagram (Kongetira et al.)]
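To make the sharing concrete, the sketch below cycles a simple round-robin selector over hardware threads, skipping any thread stalled on a miss, loosely in the spirit of Niagara's thread-select stage. The Thread class and its fields are hypothetical; this is an illustration, not the Niagara logic itself.

```python
# A minimal sketch of fine-grained thread selection in a multithreaded
# pipeline. The Thread type and its fields are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    stalled: bool = False  # e.g., waiting on a cache miss

def select_next(threads, last_tid):
    """Round-robin over hardware threads, skipping stalled ones.

    Returns the thread to issue from this cycle, or None if every
    thread is stalled (the pipeline would simply bubble).
    """
    n = len(threads)
    for offset in range(1, n + 1):
        candidate = threads[(last_tid + offset) % n]
        if not candidate.stalled:
            return candidate
    return None

# Example: thread 1 is stalled on a miss, so issue skips from 0 to 2.
threads = [Thread(0), Thread(1, stalled=True), Thread(2), Thread(3)]
t = select_next(threads, last_tid=0)
print(t.tid if t else "bubble")  # 2
```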

10 Runahead Execution
Continue execution in the face of a cache miss:
- “Checkpoint” architectural state
- Continue execution speculatively
- Convert memory accesses to prefetches
“Runahead” prefetches can be highly accurate and can greatly improve cache performance (Mutlu et al.)
- It is possible to issue useless prefetches
- Runahead can be power-inefficient (Mutlu et al.)
A toy sketch of the runahead loop follows.
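The sketch below walks a toy address trace through this checkpoint / run-ahead / roll-back cycle. The trace format, the fixed runahead window, and the set-based cache model are all assumptions made for illustration; they are not the design from Mutlu et al.

```python
# A minimal, self-contained sketch of the runahead idea on a toy trace:
# on a "miss", checkpoint state, keep executing ahead so later loads
# become prefetches, then roll back when the miss returns.

RUNAHEAD_WINDOW = 4  # instructions to run ahead past each miss (assumed)

def execute(trace, cache):
    """trace: list of load addresses; None marks a non-memory instruction."""
    prefetched = set()
    pc = 0
    while pc < len(trace):
        addr = trace[pc]
        if addr is not None and addr not in cache and addr not in prefetched:
            checkpoint = pc                  # "checkpoint" architectural state
            # Runahead: results are thrown away; the only goal is to
            # convert upcoming loads into prefetches that warm the cache.
            for spec in trace[pc + 1 : pc + 1 + RUNAHEAD_WINDOW]:
                if spec is not None:
                    prefetched.add(spec)     # may also prefetch useless lines
            cache.add(addr)                  # the blocking miss returns
            pc = checkpoint                  # discard all speculative work
        pc += 1
    return prefetched

cache = {0x10}
print([hex(a) for a in execute([0x10, None, 0x20, 0x30, None, 0x30], cache)])
# ['0x30']: 0x20 misses; runahead prefetches 0x30, so both later
# accesses to 0x30 hit instead of stalling.
```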

11 Runahead/Multithreaded Core Interaction
Similar hardware requirements:
- Additional register files
- Additional LSQ entries
Competition for similar resources:
- Execution time (processor pipeline, functional units, etc.)
- Memory bandwidth
- TLB entries, cache space, etc.

12 Runahead/Multithreaded Core Interaction
A multithreaded core in a CMP, with runahead, must make difficult scheduling decisions.
Thread scheduling considerations:
- Which thread should run?
- Should the thread use runahead?
- How long should the thread run (or runahead)?
Scheduling implications:
- Is an idle thread making forward progress at the expense of a useful thread?
- Is a thread spinning on a lock held by another thread?
- Is runahead effective for a given thread?
- Is a given thread causing performance problems elsewhere in the CMP?

13 Proposed Mechanism
Track per-thread state on:
- Runahead prefetching accuracy: high accuracy favors allowing the thread to runahead
- HW-assigned thread priority: highly “useful” threads are preferred
Selection criteria (sketched in code below):
- Heuristic-guided: select the best priority/accuracy pair
- Probabilistically-guided: select a thread with likelihood proportional to its priority/accuracy
- Useful-first: select non-runahead threads first, then select runahead threads
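The sketch below gives one plausible reading of the three policies. The score() combination of priority and accuracy, the Thread fields, and all numbers are assumptions; the slide does not specify the exact arithmetic.

```python
# A minimal sketch of the three selection policies named on this slide.
# The scoring function and Thread fields are illustrative assumptions.

import random
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    priority: float        # HW-assigned usefulness, higher is better (assumed scale)
    accuracy: float        # measured runahead prefetch accuracy in [0, 1]
    in_runahead: bool = False

def score(t):
    # Hypothetical combination of the two tracked per-thread quantities.
    return t.priority * t.accuracy

def heuristic_guided(threads):
    # Select the best priority/accuracy pair outright.
    return max(threads, key=score)

def probabilistic(threads):
    # Select with likelihood proportional to priority/accuracy.
    return random.choices(threads, weights=[score(t) for t in threads])[0]

def useful_first(threads):
    # Prefer non-runahead ("useful") threads; fall back to runahead ones.
    normal = [t for t in threads if not t.in_runahead]
    pool = normal if normal else threads
    return max(pool, key=score)

threads = [Thread(0, 0.9, 0.4), Thread(1, 0.5, 0.9, in_runahead=True)]
print(heuristic_guided(threads).tid)  # 1 (score 0.45 beats 0.36)
print(useful_first(threads).tid)      # 0 (only non-runahead thread)
```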

14 Future Directions
Dynamically adaptable CMPs offer several future areas of research:
- Adapt for power savings / heat dissipation: computation relocation, load balancing, automatic low-power modes, etc.
- Adapt to error conditions: dynamically allocate backup threads
- Automatically relocate threads to improve resource sharing
- Combined HW/SW/VM approaches

15 Summary
Latency now dominates off-chip communication, and on-chip communication isn’t far behind:
- Many techniques exist to tolerate latency, including multithreading
CMPs provide new challenges and opportunities for computer architects:
- Latency tolerance
- Potential for power savings
- A CMP’s behavior can be adapted to its workload
- Dynamic management of shared resources

