CS 7960-4 Lecture 20: The Case for a Single-Chip Multiprocessor
K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K-Y. Chang
Proceedings of ASPLOS-VII, October 1996

CMP vs. Wide-Issue Superscalar
What is the best use of on-chip real estate?
- Wide-issue processor: complex design/clock, diminishing ILP returns
- CMP: simple design, high TLP, lower ILP
Contributions:
- Takes area and latencies into account
- Attempts fine-grain parallelization

Scalability of Superscalars
Properties of large-window processors:
- Require good branch prediction and fetch
- High rename complexity
- High issue queue complexity (grows with issue width and window size)
- High bypassing complexity
- High port requirements in the register file and cache
→ Necessitates partitioned architectures
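A rough first-order model of why these terms blow up (a sketch, not from the paper): with issue width IW, each of the IW results produced per cycle may need forwarding to either source operand of each of the IW instructions issuing next cycle, and every extra instruction adds about three register-file ports, each of which grows a RAM cell in both wire dimensions:

\[
\#\text{bypass paths} \;\approx\; 2\,\mathrm{IW}^2,
\qquad
\text{register-file area} \;\propto\; \text{ports}^2 \;\propto\; \mathrm{IW}^2 .
\]

The same quadratic flavor shows up in the area table below, where decode is marked with a "quadratic effect" from cross-checking dependences among the IW instructions renamed together.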

Application Requirements
- Low-ILP programs (SPEC-Int) benefit little from wide-issue superscalar machines (a 1-wide R5000 is within 30% of a 4-wide R10000)
- High-ILP programs (SPEC-FP) benefit from large windows: typically loop-level parallelism that might be easy to extract
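To make the contrast concrete, a minimal illustrative sketch (hypothetical code, not from the paper): the integer idiom below is a serial pointer chase that a wide window cannot speed up, while the FP idiom has fully independent iterations.

    #include <stddef.h>

    struct node { double value; struct node *next; };

    /* Low ILP: pointer chasing forms a serial dependence chain;
       the next load cannot start until the current one finishes. */
    double sum_list(const struct node *n) {
        double sum = 0.0;
        while (n != NULL) {
            sum += n->value;
            n = n->next;          /* each load depends on the previous one */
        }
        return sum;
    }

    /* High ILP: loop-level parallelism with no cross-iteration
       dependences, easy for a large window (or a compiler) to exploit. */
    void vec_mul(double *c, const double *a, const double *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }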

The CMP Argument
- Build many small CPU cores
- The small cores are enough for low-ILP programs (high throughput with multiprogramming)
- For high-ILP programs, the compiler parallelizes the application into multiple threads; since the cores are on a single die, the cost of communication is affordable
- Low communication cost → even integer programs with moderate ILP could be parallelized
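A hand-written sketch of the thread-level decomposition the paper expects a parallelizing compiler to produce automatically; the 4-thread split mirrors the 4x2-way CMP, and all names here are illustrative.

    #include <pthread.h>
    #include <stddef.h>

    #define NCORES 4   /* one thread per CMP core (illustrative) */

    struct slice { double *c; const double *a, *b; size_t lo, hi; };

    static void *worker(void *p) {
        struct slice *s = p;
        for (size_t i = s->lo; i < s->hi; i++)
            s->c[i] = s->a[i] * s->b[i];
        return NULL;
    }

    /* Split the iteration space into one contiguous slice per core. */
    void vec_mul_par(double *c, const double *a, const double *b, size_t n) {
        pthread_t tid[NCORES];
        struct slice s[NCORES];
        for (int t = 0; t < NCORES; t++) {
            s[t] = (struct slice){ c, a, b,
                                   n * t / NCORES, n * (t + 1) / NCORES };
            pthread_create(&tid[t], NULL, worker, &s[t]);
        }
        for (int t = 0; t < NCORES; t++)
            pthread_join(tid[t], NULL);
    }

On a single die, the fork/join and sharing costs here are cheap enough that even moderately parallel regions are worth distributing.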

The CMP Approach
- Wide-issue superscalar → the brute-force method: extract parallelism by blindly increasing the in-flight window size and using more hardware
- CMP → extract parallelism by static analysis: minimum hardware complexity and maximum compiler smarts
- CMP can exploit far-flung ILP at low hardware cost
- But far-flung ILP and SPEC-Int threads are hard to extract automatically → memory disambiguation, control flow
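Memory disambiguation is the classic obstacle; a minimal illustration (hypothetical code): unless the compiler can prove the two pointers never overlap, it must assume a cross-iteration dependence and cannot distribute the loop across cores.

    /* If a and b may alias (e.g., b == a - 1, so b[i] is a[i-1]),
       iteration i reads what iteration i-1 wrote: a serial chain. */
    void update(int *a, const int *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1;
    }

    /* With C99 'restrict' the programmer asserts no overlap, and the
       compiler is free to parallelize or vectorize the loop. */
    void update_restrict(int * restrict a, const int * restrict b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1;
    }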

Area Extrapolations

Structure                            4-wide   6-wide SS   4x2-way CMP   Comments
32KB DL1                             13       17          4x 3          Banking/muxing
32KB IL1                             14       18
TLB                                  5        15          4x 5
Bpred                                9        28          4x 7
Decode                               11       38                        Quadratic effect
Queues                                        50          4x 4
ROB/Regs                                      34          4x 2
Int FUs                              10       31          4x 10         More FUs in CMP
FP FUs                               12       37          4x 12
Crossbar                                                                Multi-L1s → L2
L2, clock, external interface unit   163                                Remains unchanged

Processor Parameters

Applications

Class              Benchmark   Description                                     Parallelism
Integer            compress    Compresses and uncompresses a file in memory    None
                   eqntott     Translates logic equations into truth tables    Manual
                   m88ksim     Motorola 88000 CPU simulator                    Manual
                   MPsim       Verilog simulation of a multiprocessor          Manual
FP                 applu       Solver for partial differential equations       SUIF
                   apsi        Temperature, wind, velocity models              SUIF
                   swim        Shallow water model                             SUIF
                   tomcatv     Mesh generation with Thompson solver            SUIF
Multiprogramming   pmake       Parallel compilation of gnuchess                Multi-task

2-Wide → 6-Wide
- No change in branch prediction accuracy → area penalty for 6-wide?
- More speculation → more cache misses
- IPC improvements of at least 30% for all programs

CMP Statistics

Application   Icache   L1D 2-way   4x2-way   L2
Compress      3.5      3.5         1.0
Eqntott       0.6      0.8         5.4       0.7   1.2
M88ksim       2.3      0.4         3.3
MPsim         4.8      2.5         3.4
Applu         2.0      2.1         1.7       1.8
Apsi          2.7      4.1         6.9
Swim          1.5
Tomcatv       7.7      7.8         2.2
pmake         2.4      4.6

Results

Clustered SMT vs. CMP: Single-Thread Performance
[Diagram: a 4-core CMP (per-core fetch, processor, and DL1, with an interconnect for cache-coherence traffic) beside a clustered SMT (four fetch units, four execution clusters with per-cluster DL1s, and an interconnect for register traffic)]

Clustered SMT vs. CMP: Multi-Program Performance
[Diagram: the same CMP and clustered SMT structures under a multi-programmed workload]

Clustered SMT vs. CMP: Multi-Thread Performance
[Diagram: the same CMP and clustered SMT structures under a multi-threaded workload]

Conclusions
- CMP reduces hardware/power overhead
- Clustered SMT can yield better single-thread and multi-programmed performance (at high cost)
- CMP can improve application performance if the compiler can extract thread-level parallelism
- What is the most effective use of on-chip real estate?
  - Depends on the workload
  - Depends on compiler technology
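The workload dependence is essentially Amdahl's law. A quick worked check using the slides' numbers (the 1.3x figure is the at-least-30% IPC gain from the 2-wide to 6-wide comparison; the parallel-fraction threshold is arithmetic added here, not from the paper): with parallel fraction f on p cores,

\[
S(p) = \frac{1}{(1-f) + f/p}, \qquad
S(4) > 1.3 \;\Longleftrightarrow\; (1-f) + \frac{f}{4} < \frac{1}{1.3} \;\Longleftrightarrow\; f > 0.31,
\]

so the 4x2-way CMP overtakes the 6-wide superscalar once the compiler can parallelize roughly a third of the execution, and at f = 0.9 it wins by about 3.1x versus 1.3x.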

Next Class' Paper
"The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization", J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998
