CS 7960-4 Lecture 20 The Case for a Single-Chip Multiprocessor K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K-Y. Chang Proceedings of ASPLOS-VII October 1996

CMP vs. Wide-Issue Superscalar
What is the best use of on-chip real estate?
• Wide-issue processor (complex design/clock, diminishing ILP returns)
• CMP (simple design, high TLP, lower ILP)
Contributions:
• Takes area and latencies into account
• Attempts fine-grain parallelization

Scalability of Superscalars
Properties of large-window processors:
• Require good branch prediction and fetch
• High rename complexity
• High issue queue complexity (grows with issue width and window size)
• High bypassing complexity
• High port requirements in the register file and cache
→ Necessitates partitioned architectures

Application Requirements
• Low-ILP programs (SPEC-Int) benefit little from wide-issue superscalar machines (1-wide R5000 is within 30% of 4-wide R10000)
• High-ILP programs (SPEC-FP) benefit from large windows – typically, loop-level parallelism that might be easy to extract
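To make the low-ILP vs. high-ILP contrast concrete, here is a minimal sketch of a toy issue model. It is not from the paper: `schedule_cycles`, `chain`, and `independent` are hypothetical helpers, every instruction is assumed to take one cycle, and the window is assumed to be unlimited. A serial dependence chain stands in for a low-ILP program, and a set of independent instructions stands in for easily extracted loop-level parallelism.

```python
def schedule_cycles(deps, width):
    """Count cycles to issue all instructions on a `width`-wide machine.

    deps[i] lists the instructions that instruction i depends on.
    Toy assumptions: unit latency, unlimited window, acyclic dependences.
    """
    done_cycle = {}                      # instruction -> cycle it issued in
    remaining = set(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        # Ready = all predecessors issued in an earlier cycle.
        ready = [i for i in sorted(remaining)
                 if all(d in done_cycle and done_cycle[d] < cycle
                        for d in deps[i])]
        for i in ready[:width]:          # issue at most `width` per cycle
            done_cycle[i] = cycle
            remaining.discard(i)
    return cycle


def chain(n):
    """n instructions, each depending on the previous one (no ILP)."""
    return [[]] + [[i - 1] for i in range(1, n)]


def independent(n):
    """n instructions with no dependences (ideal loop-level parallelism)."""
    return [[] for _ in range(n)]


if __name__ == "__main__":
    for width in (1, 2, 4, 6):
        print(width,
              schedule_cycles(chain(12), width),
              schedule_cycles(independent(12), width))
```

In this toy model the chain takes 12 cycles at every width, while the independent instructions drop from 12 cycles at width 1 to 2 cycles at width 6, which is the shape of the argument on the slide.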

The CMP Argument
• Build many small CPU cores
• The small cores are enough to optimize low-ILP programs (high throughput with multiprogramming)
• For high-ILP programs, the compiler parallelizes the application into multiple threads – since the cores are on a single die, cost of communication is affordable
• Low communication cost → even integer programs with moderate ILP could be parallelized
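As a rough illustration of "the compiler parallelizes the application into multiple threads", the sketch below splits the iterations of a loop with no cross-iteration dependences into one contiguous chunk per core. The names `split_iterations` and `parallel_sum_of_squares` are hypothetical, the loop body is a made-up example, and Python threads only show the decomposition (they do not model on-chip communication cost or deliver real speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def split_iterations(n, cores):
    """Split iterations 0..n-1 into `cores` contiguous, near-equal chunks."""
    base, extra = divmod(n, cores)
    chunks, start = [], 0
    for c in range(cores):
        size = base + (1 if c < extra else 0)   # first `extra` chunks get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks


def parallel_sum_of_squares(n, cores=4):
    """Run one chunk of the loop per worker thread, then combine the partial sums."""
    def work(chunk):
        return sum(i * i for i in chunk)

    with ThreadPoolExecutor(max_workers=cores) as pool:
        return sum(pool.map(work, split_iterations(n, cores)))


if __name__ == "__main__":
    print(split_iterations(10, 4))
    print(parallel_sum_of_squares(100))
```

The only communication in this example is the final combination of four partial sums; loops that share more data between iterations are where the cheap on-die communication matters.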

The CMP Approach
• Wide-issue superscalar → the brute force method that extracts parallelism by blindly increasing in-flight window size and using more hardware
• CMP → extract parallelism by static analysis; minimum hardware complexity and maximum compiler smarts
+ CMP can exploit far-flung ILP, has low hw cost
- Far-flung ILP and SPEC-Int threads are hard to automatically extract → memory disambiguation, control flow

Area Extrapolations
Component                            4-wide   6-wide SS   4x2-way CMP   Comments
32KB DL1                             13       17          4x 3          Banking/muxing
32KB IL1                             14       18          4x 3          Banking/muxing
TLB                                  5        15          4x 5
Bpred                                9        28          4x 7
Decode                               11       38          4x 5          Quadratic effect
Queues                               14       50          4x 4          Quadratic effect
ROB/Regs                             9        34          4x 2          Quadratic effect
Int FUs                              10       31          4x 10         More FUs in CMP
FP FUs                               12       37          4x 12         More FUs in CMP
Crossbar                                                  50            Multi-L1s → L2
L2, clock, external interface unit   163      163         163           Remains unchanged
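The per-block figures above can be totalled to compare the three designs. The sketch below is only a back-of-the-envelope check, and it rests on two assumptions the slide does not state outright: that the crossbar area (50) applies to the CMP alone, and that the L2, clock, and external interface area (163) is the same in all three designs. The slide gives no units, so none are assumed here.

```python
# Per-block areas from the slide: (4-wide, 6-wide SS, one 2-way CMP core).
BLOCKS = {
    "32KB DL1": (13, 17, 3),
    "32KB IL1": (14, 18, 3),
    "TLB":      (5, 15, 5),
    "Bpred":    (9, 28, 7),
    "Decode":   (11, 38, 5),
    "Queues":   (14, 50, 4),
    "ROB/Regs": (9, 34, 2),
    "Int FUs":  (10, 31, 10),
    "FP FUs":   (12, 37, 12),
}
CMP_CORES = 4
CROSSBAR = 50   # assumption: present only in the CMP
SHARED = 163    # assumption: L2, clock, external interface identical everywhere


def area_totals():
    """Sum the slide's per-block areas for each of the three designs."""
    four_wide = sum(b[0] for b in BLOCKS.values()) + SHARED
    six_wide = sum(b[1] for b in BLOCKS.values()) + SHARED
    cmp_total = CMP_CORES * sum(b[2] for b in BLOCKS.values()) + CROSSBAR + SHARED
    return {"4-wide": four_wide, "6-wide SS": six_wide, "4x2-way CMP": cmp_total}


if __name__ == "__main__":
    print(area_totals())  # {'4-wide': 260, '6-wide SS': 431, '4x2-way CMP': 417}
```

Under these assumptions the 6-wide superscalar and the 4x2-way CMP come out at roughly the same total area (431 vs. 417), which is what makes the performance comparison between them fair.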

Processor Parameters

Applications
Benchmark    Description                                   Parallelism
Integer
compress     Compresses and uncompresses file in memory    None
eqntott      Translates logic eqns into truth tables       Manual
m88ksim      Motorola 88000 CPU simulator                  Manual
MPsim        Verilog simulation of a multiprocessor        Manual
FP
applu        Solver for partial differential eqns          SUIF
apsi         Temp, wind, velocity models                   SUIF
swim         Shallow water model                           SUIF
tomcatv      Mesh-generation with Thompson solver          SUIF
Multiprogramming
pmake        Parallel compilation for gnuchess             Multi-task

2-Wide → 6-Wide
• No change in branch prediction accuracy → area penalty for 6-wide?
• More speculation → more cache misses
• IPC improvements of at least 30% for all programs

CMP Statistics
Application   Icache   L1D 2-way   L1D 4x2-way   L2 2-way   L2 4x2-way
Compress      0        3.5                       1.0
Eqntott       0.6      0.8         5.4           0.7        1.2
M88ksim       2.3      0.4         3.3           0          0
MPsim         4.8      2.3         2.5           2.3        3.4
Applu         0        2.0         2.1           1.7        1.8
Apsi          2.7      4.1         6.9           2.1        2.0
Swim          0        1.2                       1.5
Tomcatv       0        7.7         7.8           2.2        2.5
pmake         2.4      2.1         4.6           0.4        0.7
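One way to read the table is to look at how much the L1D figure grows when an application moves from a single 2-way core to the 4x2-way CMP. The sketch below does that arithmetic. It assumes the columns are cache miss statistics (the slide does not label the units) and it skips Compress and Swim, for which only one L1D value survives in the transcript. `L1D` and `l1d_increase` are illustrative names.

```python
# (L1D 2-way, L1D 4x2-way) pairs copied from the slide.
L1D = {
    "Eqntott": (0.8, 5.4),
    "M88ksim": (0.4, 3.3),
    "MPsim":   (2.3, 2.5),
    "Applu":   (2.0, 2.1),
    "Apsi":    (4.1, 6.9),
    "Tomcatv": (7.7, 7.8),
    "pmake":   (2.1, 4.6),
}


def l1d_increase():
    """Increase in the L1D figure from the 2-way core to the 4x2-way CMP."""
    return {app: round(cmp_val - single, 1)
            for app, (single, cmp_val) in L1D.items()}


if __name__ == "__main__":
    deltas = l1d_increase()
    # List applications from largest to smallest increase.
    for app in sorted(deltas, key=lambda a: -deltas[a]):
        print(f"{app:8s} +{deltas[app]}")
```

The integer and multiprogrammed workloads (Eqntott, M88ksim, pmake) and Apsi show the largest jumps, while Applu, Tomcatv, and MPsim barely change.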

Results

Clustered SMT vs. CMP: Single-Thread Performance
[Figure: Clustered SMT (Fetch, Cluster, DL1; interconnect for register traffic) beside CMP (Fetch, Proc, DL1; interconnect for cache coherence traffic)]

Clustered SMT vs. CMP: Multi-Program Performance
[Figure: Clustered SMT (Fetch, Cluster, DL1; interconnect for register traffic) beside CMP (Fetch, Proc, DL1; interconnect for cache coherence traffic)]

Clustered SMT vs. CMP: Multi-thread Performance
[Figure: Clustered SMT (Fetch, Cluster, DL1; interconnect for register traffic) beside CMP (Fetch, Proc, DL1; interconnect for cache coherence traffic)]

Clustered SMT vs. CMP: Multi-thread Performance
[Figure: same diagram as the previous slide, with an additional Cluster and DL1 label]

Conclusions
• CMP reduces hardware/power overhead
• Clustered SMT can yield better single-thread and multi-programmed performance (at high cost)
• CMP can improve application performance if the compiler can extract thread-level parallelism
What is the most effective use of on-chip real estate?
• Depends on the workload
• Depends on compiler technology

Next Class’ Paper “The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization”, J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998
