1 CS 7960-4 Lecture 20 The Case for a Single-Chip Multiprocessor
K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K-Y. Chang, Proceedings of ASPLOS-VII, October 1996

2 CMP vs. Wide-Issue Superscalar
What is the best use of on-chip real estate?
A wide-issue processor (complex design/clock, diminishing ILP returns)
A CMP (simple design, high TLP, lower ILP)
Contributions:
Takes area and latencies into account
Attempts fine-grain parallelization

3 Scalability of Superscalars
Properties of large-window processors:
Require good branch prediction and high-bandwidth fetch
High rename complexity
High issue queue complexity (grows with issue width and window size; rough scaling sketch below)
High bypassing complexity
High port requirements in the register file and cache
→ Necessitates partitioned architectures
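To make the scaling concrete, here is a back-of-the-envelope sketch (mine, not the paper's) of how wakeup-comparator and register-file port counts grow with issue width and window size, assuming two source operands per instruction and two read ports plus one write port per issue slot:

```c
/* Rough complexity-scaling sketch (assumptions: 2 source tags compared per
 * window entry per result broadcast, 2R + 1W register-file ports per issue slot). */
#include <stdio.h>

int main(void) {
    int widths[]  = {2, 4, 6};      /* issue widths */
    int windows[] = {32, 64, 128};  /* in-flight window sizes */

    for (int i = 0; i < 3; i++) {
        int w = widths[i], win = windows[i];
        /* Each window entry compares both source tags against every result
         * tag broadcast per cycle: window * 2 * issue_width comparators. */
        int comparators = win * 2 * w;
        /* Register file: 2 read ports + 1 write port per issued instruction. */
        int rf_ports = 3 * w;
        printf("%d-wide, %3d-entry window: ~%4d wakeup comparators, %2d register-file ports\n",
               w, win, comparators, rf_ports);
    }
    return 0;
}
```

The growth of width times window for wakeup, plus wiring that grows with port count, is what pushes wide superscalars toward partitioned designs.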

4 Application Requirements
Low-ILP programs (SPEC-Int) benefit little from wide-issue superscalar machines (1-wide R5000 is within 30% of 4-wide R10000)
High-ILP programs (SPEC-FP) benefit from large windows: typically, loop-level parallelism that might be easy to extract (example below)
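As a purely illustrative example (not one of the paper's benchmarks), a daxpy-style FP kernel shows the loop-level parallelism typical of SPEC-FP: every iteration is independent, so a large window can overlap many iterations and a parallelizing compiler such as SUIF can split them across cores:

```c
/* Hypothetical FP kernel: no cross-iteration dependences, so iterations can
 * be overlapped in a wide window or distributed across CMP cores. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```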

5 The CMP Argument
Build many small CPU cores
The small cores are enough for low-ILP programs (high throughput under multiprogramming)
For high-ILP programs, the compiler parallelizes the application into multiple threads; since the cores are on a single die, the cost of communication is affordable (see the sketch below)
Low communication cost → even integer programs with moderate ILP could be parallelized
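A minimal sketch of what that thread-level decomposition could look like for the daxpy-style kernel above, assuming four cores and POSIX threads (an illustration, not the paper's compiler output):

```c
/* Sketch: split a data-parallel loop into one thread per CMP core.
 * Assumes 4 cores; chunk boundaries are the only coordination needed. */
#include <pthread.h>
#include <stdio.h>

#define N     (1 << 20)
#define CORES 4

static double a = 2.0, x[N], y[N];

typedef struct { int lo, hi; } range_t;

static void *worker(void *arg) {
    range_t *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        y[i] = a * x[i] + y[i];      /* each thread touches a disjoint chunk */
    return NULL;
}

int main(void) {
    pthread_t tid[CORES];
    range_t   chunk[CORES];

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    for (int c = 0; c < CORES; c++) {
        chunk[c].lo = c * (N / CORES);
        chunk[c].hi = (c + 1) * (N / CORES);
        pthread_create(&tid[c], NULL, worker, &chunk[c]);
    }
    for (int c = 0; c < CORES; c++)
        pthread_join(tid[c], NULL);

    printf("y[0] = %.1f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```

Compile with -pthread. The point of the CMP argument is that the fork/join and any sharing stay on-chip, so even a coarse decomposition like this pays off.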

6 The CMP Approach
Wide-issue superscalar → the brute-force method: extracts parallelism by blindly increasing the in-flight window size and using more hardware
CMP → extracts parallelism by static analysis; minimum hardware complexity and maximum compiler smarts
CMP can exploit far-flung ILP and has low hardware cost
But far-flung ILP and SPEC-Int threads are hard to extract automatically → memory disambiguation, control flow (illustrated below)
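A small hypothetical example of the memory-disambiguation problem: unless the compiler can prove the pointers never alias, the loop may carry a dependence and cannot safely be split into threads. C99's restrict qualifier is one way the programmer can supply that guarantee:

```c
/* If dst and src can alias (e.g., dst == src + 1), iteration i writes what a
 * later iteration reads, so automatic parallelization is unsafe. */
void copy_maybe_aliased(int n, int *dst, const int *src) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* With 'restrict' the compiler may assume no aliasing: iterations are
 * provably independent and can be distributed across cores. */
void copy_no_alias(int n, int * restrict dst, const int * restrict src) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}
```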

7 Area Extrapolations
                                      4-wide   6-wide SS   4x2-way CMP   Comments
32KB DL1                              13       17          4x 3          Banking/muxing
32KB IL1                              14       18
TLB                                   5        15          4x 5
Bpred                                 9        28          4x 7
Decode                                11       38                        Quadratic effect
Queues                                         50          4x 4
ROB/Regs                                       34          4x 2
Int FUs                               10       31          4x 10         More FUs in CMP
FP FUs                                12       37          4x 12
Crossbar                                                                 Multi-L1s → L2
L2, clock, external interface unit             163                       Remains unchanged

8 Processor Parameters

9 Applications
Benchmark   Description                                    Parallelism
Integer:
  compress  Compresses and uncompresses file in memory     None
  eqntott   Translates logic eqns into truth tables        Manual
  m88ksim   Motorola CPU simulator
  MPsim     Verilog simulation of a multiprocessor
FP:
  applu     Solver for partial differential eqns           SUIF
  apsi      Temp, wind, velocity models
  swim      Shallow water model
  tomcatv   Mesh-generation with Thompson solver
Multiprogramming:
  pmake     Parallel compilation for gnuchess              Multi-task

10 2-Wide → 6-Wide
No change in branch prediction accuracy → area penalty for 6-wide?
More speculation → more cache misses
IPC improvements of at least 30% for all programs

11 CMP Statistics
Application   Icache   L1D 2-way   4x2-way   L2
Compress      3.5   3.5   1.0
Eqntott       0.6   0.8   5.4   0.7   1.2
M88ksim       2.3   0.4   3.3
MPsim         4.8   2.5   3.4
Applu         2.0   2.1   1.7   1.8
Apsi          2.7   4.1   6.9
Swim          1.5
Tomcatv       7.7   7.8   2.2
pmake         2.4   4.6

12 Results

13 Clustered SMT vs. CMP: Single-Thread Performance
(Diagram: a 4-core CMP, each core with its own Fetch, Proc, and DL1, contrasted with a clustered SMT: a shared fetch front-end, four execution clusters with private DL1s, an interconnect for register traffic, and an interconnect for cache-coherence traffic.)

14 Clustered SMT vs. CMP: Multi-Program Performance
(Same CMP vs. clustered SMT diagram as the previous slide, here for the multi-programmed case.)

15 Clustered SMT vs. CMP: Multi-Thread Performance
(Same CMP vs. clustered SMT diagram, here for the multi-threaded case.)

16 Clustered SMT vs. CMP: Multi-Thread Performance
(Continuation of the multi-threaded comparison on the same CMP vs. clustered SMT diagram.)

17 Conclusions
CMP reduces hardware/power overhead
Clustered SMT can yield better single-thread and multi-programmed performance (at high cost)
CMP can improve application performance if the compiler can extract thread-level parallelism
What is the most effective use of on-chip real estate?
Depends on the workload
Depends on compiler technology

18 Next Class' Paper
"The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization", J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998
