www.compaq.com Simultaneous Multithreading: Multiplying Alpha Performance Dr. Joel Emer Principal Member Technical Staff Alpha Development Group Compaq.

Presentation transcript:

Simultaneous Multithreading: Multiplying Alpha Performance Dr. Joel Emer Principal Member Technical Staff Alpha Development Group Compaq Computer Corporation

Outline
- Alpha Processor Roadmap
- Motivation for Introducing SMT
- Implementation of an SMT CPU
- Performance Estimates
- Architectural Abstraction

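Alpha Microprocessor Overview
[Figure: Alpha processor roadmap plotted against first system ship date, trending toward higher performance and lower cost; labels for EV-series generations (including EV7) and process sizes (including 0.18 µm).]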

EV8 Technology Overview
- Leading edge process technology – GHz
  - 0.125µm CMOS
  - SOI-compatible
  - Cu interconnect
  - low-k dielectrics
- Chip characteristics
  - ~1.2V Vdd
  - ~250 Million transistors
  - ~1100 signal pins in flip chip packaging

EV8 Architecture Overview
- Enhanced out-of-order execution
- 8-wide superscalar
- Large on-chip L2 cache
- Direct RAMBUS interface
- On-chip router for system interconnect
- Glueless, directory-based, ccNUMA for up to 512-way SMP
- 4-way simultaneous multithreading (SMT)

Goals
- Leadership single stream performance
- Extra multistream performance with multithreading
  - Without major architectural changes
  - Without significant additional cost

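Instruction Issue
Reduced function unit utilization due to dependencies
[Figure: function unit utilization diagram; axis: Time]

Superscalar Issue
Superscalar leads to more performance, but lower utilization
[Figure: function unit utilization diagram; axis: Time]

Predicated Issue
Adds to function unit utilization, but results are thrown away
[Figure: function unit utilization diagram; axis: Time]

Chip Multiprocessor
Limited utilization when only running one thread
[Figure: function unit utilization diagram; axis: Time]

Fine Grained Multithreading
Intra-thread dependencies still limit performance
[Figure: function unit utilization diagram; axis: Time]

Simultaneous Multithreading
Maximum utilization of function units by independent operations
[Figure: function unit utilization diagram; axis: Time]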
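
The contrast drawn across the issue slides above can be illustrated with a toy model. The sketch below is not taken from the presentation: the per-cycle "ready instruction" counts in `READY`, the helper names, and the simplification that unissued instructions do not carry over to later cycles are all assumptions made for illustration. The width of 8 and the four threads mirror the EV8 figures quoted earlier in the deck.

```python
from typing import List

WIDTH = 8  # issue slots per cycle (EV8 is described as 8-wide)

# Hypothetical data: READY[t][c] = independent instructions thread t
# could issue in cycle c, already limited by its own dependencies.
READY: List[List[int]] = [
    [2, 1, 3, 0],
    [1, 2, 0, 3],
    [3, 0, 2, 1],
    [2, 2, 1, 2],
]


def utilization(issued_per_cycle: List[int], width: int) -> float:
    """Fraction of all issue slots that were filled over the run."""
    return sum(issued_per_cycle) / (width * len(issued_per_cycle))


def superscalar(ready: List[List[int]], width: int) -> List[int]:
    """Single-stream superscalar: only thread 0 issues."""
    return [min(width, r) for r in ready[0]]


def fine_grained(ready: List[List[int]], width: int) -> List[int]:
    """Fine-grained multithreading: one thread per cycle, round-robin."""
    n_threads = len(ready)
    n_cycles = len(ready[0])
    return [min(width, ready[c % n_threads][c]) for c in range(n_cycles)]


def smt(ready: List[List[int]], width: int) -> List[int]:
    """SMT: instructions from any thread may fill slots in the same cycle."""
    n_cycles = len(ready[0])
    return [min(width, sum(t[c] for t in ready)) for c in range(n_cycles)]


for name, policy in [("superscalar", superscalar),
                     ("fine-grained", fine_grained),
                     ("SMT", smt)]:
    print(name, utilization(policy(READY, WIDTH), WIDTH))
```

With this made-up workload, the single-thread and round-robin policies leave most slots empty in every cycle because one thread rarely has eight independent instructions, while the SMT policy fills slots from whichever threads have work. The absolute numbers carry no meaning beyond the toy data.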

Basic Out-of-order Pipeline
Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
Structures: PC, Icache, Register Map, Regs, Dcache
Annotation: Thread-blind

SMT Pipeline
Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
Structures: Icache, Dcache, PC, Register Map, Regs

Changes for SMT
- Basic pipeline – unchanged
- Replicated resources
  - Program counters
  - Register maps
- Shared resources
  - Register file (size increased)
  - Instruction queue
  - First and second level caches
  - Translation buffers
  - Branch predictor
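
The split between replicated and shared resources listed above can be sketched as a small data structure. This is a minimal illustration, not the EV8 design: the class names `ThreadContext` and `SharedRegisterFile`, the helper `rename_dest`, and the register file size are hypothetical, and the sketch never frees a physical register, which a real renamer must do at retirement.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThreadContext:
    """Replicated per thread: a program counter and a register map."""
    pc: int = 0
    # architectural register name -> physical register index
    reg_map: Dict[str, int] = field(default_factory=dict)


class SharedRegisterFile:
    """Shared by all threads: a single pool of physical registers."""

    def __init__(self, size: int) -> None:
        self.values: List[int] = [0] * size
        self.free: List[int] = list(range(size))

    def allocate(self) -> int:
        """Hand out the lowest-numbered free physical register."""
        if not self.free:
            raise RuntimeError("no free physical registers")
        return self.free.pop(0)


def rename_dest(ctx: ThreadContext, regs: SharedRegisterFile,
                arch_reg: str) -> int:
    """Point a thread's architectural destination at a fresh physical register."""
    phys = regs.allocate()
    ctx.reg_map[arch_reg] = phys
    return phys


# Two threads write the same architectural register "r1".
regs = SharedRegisterFile(size=16)  # assumed size, for illustration only
t0, t1 = ThreadContext(), ThreadContext()
p0 = rename_dest(t0, regs, "r1")
p1 = rename_dest(t1, regs, "r1")
print(p0, p1)  # distinct physical registers
```

Because each thread's map resolves "r1" to a different physical register, later stages that operate only on physical register numbers need not know which thread an instruction came from. That is one plausible reading of why a shared register file needs to grow while the basic pipeline stays unchanged.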

Multiprogrammed workload

Decomposed SPEC95 Applications

Multithreaded Applications

Architectural Abstraction
- 1 CPU with 4 Thread Processing Units (TPUs)
- Shared hardware resources
[Figure: TPU0, TPU1, TPU2, and TPU3 sharing Icache, TLB, Dcache, and Scache]

System Block Diagram
[Figure: repeated nodes labeled "EV8", "M", and "IO"; index labels 0 to 3]

Quiescing Idle Threads
- Problem: Spin looping thread consumes resources
- Solution: Provide quiescing operation that allows a TPU to sleep until a memory location changes
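
The problem and solution above can be mimicked in software. The sketch below is only an analogy using Python's `threading.Condition`; it does not represent the Alpha quiescing operation itself, and `WatchedLocation` and `quiesce_until_changed` are hypothetical names. The waiter blocks instead of spinning and wakes when a store changes the watched value.

```python
import threading
from typing import List, Optional


class WatchedLocation:
    """A value a thread can sleep on until it changes (software analogy)."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._cond = threading.Condition()

    def store(self, value: int) -> None:
        """Write the location and wake any sleeping waiters."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def quiesce_until_changed(self, old: int,
                              timeout: Optional[float] = None) -> int:
        """Sleep until the value differs from `old`, then return it.

        If the timeout expires first, the current value is returned.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._value != old, timeout=timeout)
            return self._value


# A waiter sleeps on a lock word (1 = held) instead of spin looping.
lock_word = WatchedLocation(1)
results: List[int] = []
waiter = threading.Thread(
    target=lambda: results.append(
        lock_word.quiesce_until_changed(1, timeout=5.0)))
waiter.start()
lock_word.store(0)  # releasing the lock changes the location
waiter.join()
print(results)
```

The predicate is checked before sleeping, so the waiter returns correctly whether the store happens before or after it starts waiting. In the hardware setting described on the slide, the benefit is that a sleeping TPU stops competing for the shared issue slots and queues.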

Summary
- Alpha will maintain single stream performance leadership
- SMT will significantly enhance multistream performance
  - Across a wide range of applications,
  - Without significant hardware cost, and
  - Without major architectural changes

References
- "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95.
- "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreaded Processor" by Tullsen, Eggers, Emer, Levy, Lo and Stamm in ISCA96.
- "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading" by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August
- "Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October, 1997.