Simultaneous Multithreading

Simultaneous Multithreading. Pratyusa Manadhata (pratyus@cs), Vyas Sekar (vyass@cs). Carnegie Mellon, 15740, Fall 03.

References
Susan Eggers, Joel Emer, Henry Levy, Jack Lo, Rebecca Stamm, and Dean Tullsen. Simultaneous Multithreading: A Platform for Next-generation Processors. IEEE Micro, September/October 1997, pages 12-18.
Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, August 1997, pages 322-354.
Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Motivation: Improving the memory subsystem or increasing system integration is not sufficient for a significant performance improvement. So increase parallelism in all its available forms: instruction-level parallelism (ILP) and thread-level parallelism (TLP).

Architectural Alternatives: superscalar, multithreaded superscalar, and multiprocessors. Neither a superscalar nor an SMP can capture ILP and TLP in their entirety; both are incapable of adapting to dynamic levels of ILP and TLP.

Simultaneous Multithreading: TLP comes from either multithreaded parallel programs or a multiprogramming workload; ILP comes from each thread. Characteristics of SMT processors: from superscalar, issuing multiple instructions per cycle; from multithreaded, hardware state for multiple threads.

[Figure: issue slots for superscalar, multithreaded, and SMT processors]

Comparison: Superscalar: looks at multiple instructions from the same process; suffers both horizontal and vertical waste. Multithreaded: minimizes vertical waste by tolerating long-latency operations. SMT: selects instructions from any "ready" thread.
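
The two kinds of waste in the comparison above can be counted on a toy issue-slot grid. This is only a sketch under stated assumptions: it uses the usual definitions from the SMT literature (vertical waste is the slots lost in completely idle cycles, horizontal waste is the unused slots in cycles that issue at least one instruction), and the grid layout, the function name `count_waste`, and the sample data are made up for illustration.

```python
from typing import Dict, List, Optional

# Assumed layout: one row per cycle, one entry per issue slot.
# An entry is the id of the thread that used the slot, or None if unused.
Grid = List[List[Optional[int]]]


def count_waste(grid: Grid) -> Dict[str, int]:
    """Count wasted issue slots, split into vertical and horizontal waste."""
    vertical = 0
    horizontal = 0
    for cycle in grid:
        empty = sum(1 for slot in cycle if slot is None)
        if empty == len(cycle):
            vertical += empty    # nothing issued: the whole cycle is lost
        else:
            horizontal += empty  # some slots used, the rest are lost
    return {"vertical": vertical, "horizontal": horizontal}


# Hypothetical 4-wide machine over three cycles.
superscalar = [
    [0, 0, None, None],        # limited ILP in the single thread
    [None, None, None, None],  # long-latency stall
    [0, None, None, None],
]
smt = [
    [0, 0, 1, 2],
    [1, 2, 2, None],
    [0, 1, 1, 2],
]

print(count_waste(superscalar))
print(count_waste(smt))
```

In this made-up data the single-threaded grid loses slots both ways, while the SMT grid fills almost every slot by drawing from whichever thread is ready.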

SMT Model: A minimal extension of a superscalar processor, with changes in the IF stage and register files only. There is no static partitioning of resources, so most of the hardware is still available to a single thread.

SMT Model (continued): Per thread: state for the hardware context (PC, registers); instruction retirement, trapping, and subroutine return; a per-thread ID in the BTB and TLB; I-cache port. Large register file: number of physical registers = 8 * 32 + registers for renaming, which means a longer access time.
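
The register-file sizing on this slide is simple arithmetic; a minimal sketch follows. The 8 contexts and 32 architectural registers come from the slide. The renaming-register count of 100 is an assumed figure chosen only to make the example concrete, and the function name `physical_registers` is hypothetical.

```python
def physical_registers(contexts: int, arch_regs: int, rename_regs: int) -> int:
    """Physical registers = contexts * architectural registers + renaming registers."""
    return contexts * arch_regs + rename_regs


# 8 hardware contexts and 32 architectural registers per context (from the slide);
# 100 renaming registers is an illustrative assumption.
smt_total = physical_registers(contexts=8, arch_regs=32, rename_regs=100)
single_total = physical_registers(contexts=1, arch_regs=32, rename_regs=100)

print(smt_total, single_total)
```

Under that assumption the SMT register file is several times larger than the single-context one, which is why the slide flags a longer access time.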

[Figure: pipeline, superscalar vs. SMT]

Fetch Mechanism (2.8 scheme): Select 2 threads that are not incurring an I-cache miss and read 8 instructions from each. Take as many instructions as possible from the first thread and the rest from the second, up to 8 in total. Alternatives: 1.8, 2.4, 4.2.
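
The 2.8 selection rule can be sketched in a few lines. This is a toy model, not the hardware design: the name `fetch_2_8`, the `ready` dictionary (thread id mapped to the instructions that thread can usefully supply this cycle, with threads in an I-cache miss simply absent), and the `priority` list are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def fetch_2_8(
    ready: Dict[int, List[str]],
    priority: List[int],
    width: int = 8,
) -> List[Tuple[int, str]]:
    """Pick 2 threads, read up to `width` instructions from each,
    and fill the fetch bandwidth from the first thread before the second."""
    # Threads missing in the I-cache are assumed to be absent from `ready`.
    chosen = [t for t in priority if t in ready][:2]
    fetched: List[Tuple[int, str]] = []
    for t in chosen:
        room = width - len(fetched)
        block = ready[t][:width]  # read up to 8 instructions from this thread
        fetched.extend((t, ins) for ins in block[:room])
    return fetched


# Hypothetical cycle: thread 0 can supply 5 instructions, thread 1 can supply 8,
# and thread 2 is also ready but is not among the two selected threads.
ready = {
    0: [f"a{i}" for i in range(5)],
    1: [f"b{i}" for i in range(8)],
    2: [f"c{i}" for i in range(8)],
}
result = fetch_2_8(ready, priority=[0, 1, 2])
print(result)
```

Here the first thread contributes all 5 of its instructions and the second fills the remaining 3 slots, so the total never exceeds 8.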

ICOUNT: Decides which threads to fetch from: prefer the threads that have the fewest instructions in the decode, rename, and queue pipeline stages. This gives an even distribution across threads and prevents starvation.
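
A minimal sketch of this selection policy, assuming the per-thread counts are already available as a dictionary. The function name `icount_pick`, the `in_flight` mapping, and the tie-break by lower thread id are illustrative assumptions, not details from the slides.

```python
from typing import Dict, List


def icount_pick(in_flight: Dict[int, int], n: int = 2) -> List[int]:
    """Return the n threads with the fewest instructions in the
    decode, rename, and queue stages (ties broken by lower thread id)."""
    return sorted(in_flight, key=lambda t: (in_flight[t], t))[:n]


# Hypothetical counts of in-flight front-end instructions per thread.
counts = {0: 12, 1: 3, 2: 7, 3: 3}
picked = icount_pick(counts)
print(picked)
```

A thread that clogs the front end accumulates a high count and drops in priority, while a thread that has been fetched rarely has a low count and rises, which is how the policy avoids starvation.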

Results/Observations: Superscalars give an IPC of about 1-2. SMT gives an IPC significantly higher than the values reported for superscalars. Is there longer latency for a single thread, and why? It is not a significant performance effect.

Results/Observations (continued): SMT absorbs additional conflicts: it has a greater ability to hide latency by issuing instructions from multiple threads. The SMP configurations MP2 and MP4 are hindered by static resource partitioning, whereas SMT dynamically partitions resources among threads.
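
The difference between static and dynamic partitioning can be illustrated with a toy one-cycle model. This is a deliberate simplification, not the evaluation behind the slides: each thread has some number of instructions ready to issue, a statically partitioned machine gives every thread an equal fixed share of the issue slots, and a dynamically partitioned machine lets threads draw from one shared pool. Both function names and the demand figures are hypothetical.

```python
from typing import List


def issued_static(demand: List[int], total_slots: int) -> int:
    """Each thread may use only its own equal share of the issue slots."""
    share = total_slots // len(demand)
    return sum(min(d, share) for d in demand)


def issued_dynamic(demand: List[int], total_slots: int) -> int:
    """All threads draw from one shared pool of issue slots."""
    return min(sum(demand), total_slots)


# Hypothetical cycle: one thread has plenty of ILP, the others have little.
demand = [6, 1, 0, 1]
print(issued_static(demand, total_slots=8))
print(issued_dynamic(demand, total_slots=8))
```

With uneven demand the fixed shares leave slots idle next to a thread that could have used them, while the shared pool is fully used.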

Results/Observations (continued): Multithreading can increase cache misses and conflicts, requires more memory, and puts more stress on the branch prediction hardware. The impact on program performance is not significant: SMT, together with hardware and compiler optimizations, can hide the latency.

Future Directions: Each processor in an SMP can use SMT. Next-generation architectures: SMP on a chip instead of wider superscalars. Is the performance gain adequate given the additional resource cost? Processor cycle/design time: cost vs. performance. Writing optimizing compilers to take advantage of SMT. OS support for thread scheduling, thread priority, etc.

Q & A

Thank You.