SIMULTANEOUS MULTITHREADING
Ting Liu, Liu Ren, Hua Zhong


Contemporary forms of parallelism

Instruction-level parallelism (ILP)
- Wide-issue superscalar processors (SS)
- Issue 4 or more instructions per cycle
- Execute a single program or thread
- Attempt to find multiple independent instructions to issue each cycle

Thread-level parallelism (TLP)
- Fine-grained multithreaded superscalars (FGMT)
  - Contain hardware state for several threads
  - Execute multiple threads
  - On any given cycle the processor issues instructions from only one thread
- Multiprocessor (MP)
  - Performance is improved by adding more CPUs

Simultaneous Multithreading

Key idea: issue multiple instructions from multiple threads each cycle.

Features
- Fully exploits both thread-level and instruction-level parallelism
- Better performance for:
  - A mix of independent programs
  - Programs that are parallelizable
  - A single-threaded program
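
The contrast between the three issue policies can be sketched in a toy issue-slot model. The per-thread ILP distribution and all sizes below are assumptions for illustration, not measurements from the paper:

```python
import random

random.seed(0)

ISSUE_WIDTH = 4    # 4-wide machine, as on the slides
N_THREADS = 4
CYCLES = 10_000

def ready_count():
    """Instructions one thread has ready this cycle (its exploitable ILP).
    The distribution is an assumption: real threads rarely fill 4 slots."""
    return random.choice([0, 1, 1, 2, 2, 3])

def ipc(policy):
    issued = 0
    for cycle in range(CYCLES):
        ready = [ready_count() for _ in range(N_THREADS)]
        if policy == "SS":
            # Superscalar: only one thread exists; both vertical waste
            # (empty cycles) and horizontal waste (empty slots) remain.
            issued += min(ISSUE_WIDTH, ready[0])
        elif policy == "FGMT":
            # Fine-grained MT: one thread per cycle; rotating past stalled
            # threads removes vertical waste, horizontal waste remains.
            runnable = [r for r in ready if r > 0]
            if runnable:
                issued += min(ISSUE_WIDTH, runnable[cycle % len(runnable)])
        else:  # "SMT": fill the slots from *all* threads each cycle.
            slots = ISSUE_WIDTH
            for r in ready:
                take = min(slots, r)
                issued += take
                slots -= take
    return issued / CYCLES

print(ipc("SS"), ipc("FGMT"), ipc("SMT"))
```

Under these assumptions the model reproduces the qualitative ordering on the slides: SMT sustains the highest IPC because it attacks both vertical and horizontal waste at once.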

[Figure: issue-slot utilization compared across a superscalar (SS), a fine-grained multithreaded processor (FGMT), and SMT]

Multiprocessor vs. SMT

[Figure: a two-processor multiprocessor (MP2) compared with a single SMT processor]

SMT Architecture (1)

Base processor: an out-of-order superscalar processor, similar to the MIPS R10000.

Changes: with N simultaneously running threads, the processor needs N program counters, N subroutine return stacks, and more than N*32 physical registers in total for register renaming.
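
The replicated per-thread state can be sketched as a data structure. This assumes 8 hardware contexts and a renaming pool of 100 extra registers; the pool size is an assumption, since the slide only requires more than N*32 physical registers:

```python
from dataclasses import dataclass, field

N_THREADS = 8        # N simultaneously running threads (assumed)
ARCH_REGS = 32       # MIPS defines 32 architectural integer registers
EXTRA_RENAME = 100   # extra physical registers for renaming (assumed)

@dataclass
class HWContext:
    """State replicated once per thread: PC and subroutine return stack."""
    pc: int = 0
    return_stack: list[int] = field(default_factory=list)

contexts = [HWContext() for _ in range(N_THREADS)]

# "More than N*32 physical registers": N*32 architectural copies
# plus a shared pool for register renaming.
physical_int_regs = N_THREADS * ARCH_REGS + EXTRA_RENAME
print(physical_int_regs)
```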

SMT Architecture (2)

- The larger register files lengthen register access time, so pipeline stages are added (register reads and writes each take 2 stages).
- All threads share the cache hierarchy and the branch prediction hardware.
- Each cycle, select up to 2 threads and fetch up to 4 instructions from each (the 2.4 scheme).

Pipeline: Fetch → Decode → Rename → Queue → Reg Read → Exec → Reg Write → Commit
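
The 2.4 fetch scheme can be sketched as follows. The priority function used here is the ICOUNT heuristic from the SMT literature (favor the threads with the fewest instructions in the pre-issue stages); that choice is an assumption relative to this slide, which only specifies the 2.4 shape:

```python
def fetch_2_4(front_end_counts, n_select=2, fetch_width=4):
    """Pick the n_select threads with the fewest instructions waiting in
    the pre-issue stages and fetch up to fetch_width instructions from each.

    front_end_counts: {thread_id: instructions in decode/rename/queue}
    Returns {thread_id: instructions to fetch this cycle}.
    """
    chosen = sorted(front_end_counts, key=front_end_counts.get)[:n_select]
    return {t: fetch_width for t in chosen}

# Threads 3 and 1 have the emptiest front ends, so they fetch this cycle.
print(fetch_2_4({0: 5, 1: 2, 2: 9, 3: 1}))   # {3: 4, 1: 4}
```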

Effectively Using Parallelism on an SMT Processor

[Figure: instruction throughput executing a parallel workload as the number of threads grows, compared across SS, MP2, MP4, FGMT, and SMT]

Effects of Thread Interference in Shared Structures
- Interthread cache interference
- Increased memory requirements
- Interference in branch prediction hardware

Interthread Cache Interference

Because the threads share the caches, more threads means a lower hit rate. Two reasons why this is not a significant problem:
1. L1 cache misses can almost entirely be covered by the 4-way set-associative L2 cache.
2. Out-of-order execution, write buffering, and the use of multiple threads allow SMT to hide the small increase in memory latency.

Eliminating interthread cache misses would yield only a 0.1% speedup.
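
A toy shared-cache model illustrates the mechanism; the cache geometry, working-set sizes, and address layout are all assumptions. Threads with disjoint working sets evict each other's lines, so the hit rate falls as threads are added; that loss is what the L2 and latency hiding must absorb:

```python
import random

random.seed(1)

CACHE_LINES = 256   # direct-mapped L1, arbitrary toy size

def hit_rate(n_threads, accesses=20_000, working_set=200):
    """Hit rate of a direct-mapped cache shared by n_threads, each thread
    touching its own working_set of addresses in a private region."""
    cache = {}          # line index -> tag currently resident
    hits = 0
    for i in range(accesses):
        t = i % n_threads                        # threads interleave
        addr = t * 10_000 + random.randrange(working_set)
        idx, tag = addr % CACHE_LINES, addr // CACHE_LINES
        if cache.get(idx) == tag:
            hits += 1
        else:
            cache[idx] = tag                     # miss: fill the line
    return hits / accesses

print(hit_rate(1), hit_rate(4))   # more threads -> lower hit rate
```

One thread's working set fits entirely, so only cold misses occur; with four threads the regions alias into the same lines and conflict misses mount.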

Increased Memory Requirements

The more threads are used, the more memory references occur per cycle. Bank conflicts in the L1 cache account for most of the additional memory-access cost. This is negligible because:
1. With longer cache lines, the gains from better spatial locality outweigh the costs of L1 bank contention.
2. Eliminating interthread contention would yield only a …% speedup.

Interference in Branch Prediction Hardware

Since all threads share the branch prediction hardware, it experiences interthread interference. The effect is negligible because:
1. The speedup from multithreading outweighs the additional misprediction latency.
2. Going from 1 to 8 threads, misprediction rates range only from 2.0% to 2.8% for branches and from 0.0% to 0.1% for jumps.
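
A toy shared branch-history table shows where the interference comes from; the table size, branch counts, and biases are assumptions, and the resulting numbers are not those of the paper. With more threads, unrelated branches alias into the same 2-bit counters and perturb each other (the slides report the net effect is small in practice):

```python
import random

random.seed(2)

def mispredict_rate(n_threads, table_size=1024, static_branches=400,
                    executions=50_000):
    """Misprediction rate of a shared table of 2-bit saturating counters.
    Each thread runs its own strongly biased static branches; with many
    threads they alias into common table entries."""
    # Per-thread (address, taken-probability) pairs; biases are assumed.
    branches = [
        [(t * 7_919 + k * 4, random.choice([0.05, 0.95]))
         for k in range(static_branches)]
        for t in range(n_threads)
    ]
    table = [1] * table_size          # counters start weakly not-taken
    mispredicts = 0
    for i in range(executions):
        t = i % n_threads
        addr, p_taken = random.choice(branches[t])
        idx = (addr // 4) % table_size
        taken = random.random() < p_taken
        predicted = table[idx] >= 2
        mispredicts += predicted != taken
        # Update the 2-bit saturating counter.
        table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    return mispredicts / executions

print(mispredict_rate(1), mispredict_rate(8))
```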

Discussion

Questions?