Simultaneous Multithreading CMPE 511 BOĞAZİÇİ UNIVERSITY
AGENDA INTRODUCTION Motivation Types of Parallelism Vertical and Horizontal Wasted Slots Superscalar Processors Multithreading Simultaneous Multithreading The Idea SMT Model Issues: What to Fetch and What to Issue? Caching Performance Analysis Simulation Results Comparison Drawbacks Commercial Examples IBM POWER5 Future Tendencies
INTRODUCTION: Motivation Microprocessor Design Optimization Some Focus Areas: 1. Memory latency Increased processor speeds make memory appear further away Longer stalls possible 2. Branch Processing Mispredictions become more costly as pipeline depth increases, resulting in stalls and wasted power Predication drives increased power and larger chip area 3. Execution Unit Utilization 20-25% execution unit utilization common SMT addresses these areas!
INTRODUCTION: Motivation Improving the memory subsystem or increasing system integration is not sufficient for significant performance improvement. Solution: increase parallelism in all its available forms Combine the multiple-issue-per-cycle features of modern superscalar processors With the latency-hiding ability of multithreaded architectures
INTRODUCTION: Types of Parallelism Bit-level Wider processor datapaths (8, 16, 32, 64…) Word-level (SIMD) Vector processors Multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.) Instruction-level Pipelining Superscalar VLIW and EPIC Task and Application-levels Explicit parallel programming Multiple threads Multiple applications
INTRODUCTION: Vertical & Horizontal Wasted Slots Vertical waste is introduced when the processor issues no instructions in a cycle Horizontal waste is introduced when not all issue slots can be filled in a cycle. 61% of the wasted cycles are vertical waste.
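The two categories can be illustrated with a toy cycle-accounting sketch (the per-cycle issue counts below are made up for illustration, not measured data):

```python
# Classify wasted issue slots as vertical (empty cycle) or horizontal
# (partially filled cycle) for a hypothetical 4-issue processor.
ISSUE_WIDTH = 4
issued_per_cycle = [4, 2, 0, 0, 3, 1, 0, 4]  # toy trace, not real data

# An all-zero cycle wastes the whole issue width vertically.
vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
# A partially filled cycle wastes only the unfilled slots horizontally.
horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if 0 < n < ISSUE_WIDTH)

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
print(vertical, horizontal, total_slots)  # 12 12 slots vertical, 6 horizontal, of 32
```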
INTRODUCTION: Superscalar Issues multiple instructions in each cycle. Typically 4. Several functional units of the same type, e.g. ALUs Dispatcher reads instructions, decides which can run in parallel Limited by instruction dependencies and long-latency operations Exhibits both horizontal & vertical waste Low utilization even with higher-issue machines; an 8-issue machine with only 20% utilization
INTRODUCTION: Superscalar Many slots in the execution core are unused.
MULTITHREADING The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock. Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading. SMT uses both types of multithreading
MULTITHREADING
What does a processor need for Multithreading? 1. Processor must be aware of several independent states, one per thread: Program Counter Register File (and Flags) Memory 2. Either multiple resources in the processor or a fast way to switch across states
MULTITHREADING: Coarse-Grain Multithreading Switch between threads only on costly stalls This form of multithreading only hides long-latency events. Easy to implement, but operates at a coarse grain
MULTITHREADING: Coarse-Grain
MULTITHREADING: Fine-Grain Multithreading Context switches between threads on every clock cycle. Occupancy of the execution core is now much higher Hides both long- and short-latency events Vertical waste is eliminated but horizontal waste is not. If a thread has few or no operations to execute, issue slots will be wasted.
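Fine-grain switching can be sketched as a per-cycle round-robin pick among the threads that are not stalled (a minimal model, assuming a single issue slot per cycle):

```python
# Minimal sketch of fine-grain multithreading: every cycle the processor
# switches to the next thread that is not stalled (hypothetical model).
def fine_grain_schedule(ready, cycles):
    """ready[t][c] is True if thread t can issue in cycle c."""
    order = []
    n = len(ready)
    last = -1
    for c in range(cycles):
        for off in range(1, n + 1):   # round-robin starting after the last thread
            t = (last + off) % n
            if ready[t][c]:
                order.append(t)
                last = t
                break
        else:
            order.append(None)        # vertical waste: every thread is stalled

    return order

ready = [[True, False, True, True],
         [True, True, False, True]]
print(fine_grain_schedule(ready, 4))  # [0, 1, 0, 1]
```

With two threads whose stalls interleave, the core stays busy every cycle, which is exactly how vertical waste disappears.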
MULTITHREADING: Fine-Grain
Simultaneous Multithreading: Idea Combine Superscalar and Multithreading such that: 1. Issue multiple instructions per cycle – Superscalar 2. Hardware state for several programs/threads – Multithreading So: issue multiple instructions from multiple threads in each cycle
Simultaneous Multithreading: Idea
Simultaneous Multithreading: Model Extend, replicate and redesign some units of the superscalar to achieve multithreading Resources replicated State for hardware contexts (registers, PCs) Per-thread mechanisms for pipeline flushing and subroutine returns Per-thread identifiers for the branch target buffer and translation lookaside buffer
Simultaneous Multithreading: Model Resources redesigned Instruction fetch unit Processor pipeline Instruction scheduling Register renaming does not require additional hardware (same as superscalar)
Simultaneous Multithreading: Model SuperScalar Architecture
Simultaneous Multithreading: Model Block Diagram
Simultaneous Multithreading: Model Instruction Fetch Unit Takes advantage of inter-thread competition by partitioning bandwidth and fetching the threads that give the maximum local benefit 2.8 fetching Fetch 1 inst. per logical processor, for 2 threads Decode 1 thread until a branch/end of cache line, then jump to the other ICOUNT feedback Highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages Small hardware addition to track queue lengths
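The ICOUNT feedback heuristic reduces to sorting threads by their front-end occupancy; a minimal sketch (thread names and counts are illustrative):

```python
# Sketch of the ICOUNT fetch policy: give fetch priority to the threads
# with the fewest instructions in the decode/rename/queue stages.
def icount_priority(in_flight, fetch_threads=2):
    """in_flight: dict thread -> instruction count in the front end.
    Returns the threads to fetch from this cycle, fewest-first."""
    ranked = sorted(in_flight, key=lambda t: in_flight[t])
    return ranked[:fetch_threads]

counts = {"T0": 12, "T1": 3, "T2": 7, "T3": 9}
print(icount_priority(counts))  # ['T1', 'T2']
```

The only hardware this implies is a small per-thread counter, incremented on fetch and decremented on issue, which matches the "small hardware addition" on the slide.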
Simultaneous Multithreading: Model Register File Each thread has 32 registers Register File: 32 * #threads + rename registers So a larger register file means a longer access time
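The sizing formula above is simple arithmetic; a sketch (the rename-pool size of 100 is an illustrative assumption, not a figure from the slide):

```python
# Register file sizing: 32 architectural registers per thread plus a
# shared pool of rename registers (pool size here is illustrative).
ARCH_REGS_PER_THREAD = 32

def regfile_size(threads, rename_regs):
    return ARCH_REGS_PER_THREAD * threads + rename_regs

print(regfile_size(1, 100))  # 132: single-threaded superscalar baseline
print(regfile_size(8, 100))  # 356: 8 hardware contexts, hence slower access
```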
Simultaneous Multithreading: Model Pipeline Format Superscalar SMT
Simultaneous Multithreading: Model Pipeline Format To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes 2-cycle reads/writes increase the branch misprediction penalty
Simultaneous Multithreading: Where to Fetch Static solutions: Round-robin Each cycle 8 instructions from 1 thread Each cycle 4 instructions from 2 threads, 2 from 4,… Each cycle 8 instructions from 2 threads, and forward as many as possible from #1, then when a long-latency instruction appears in #1 pick the rest from #2 Dynamic solutions: Check execution queues! Favour threads with minimal # of in-flight branches Favour threads with minimal # of outstanding misses Favour threads with minimal # of in-flight instructions Favour threads with instructions far from queue head
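The static round-robin schemes (8 from 1 thread, 4 from 2, 2 from 4, …) all split a fixed fetch bandwidth evenly across the threads chosen each cycle; a hypothetical sketch:

```python
# Static round-robin fetch partitioning: split an 8-slot fetch bandwidth
# evenly across a chosen number of threads each cycle (hypothetical sketch).
FETCH_WIDTH = 8

def rr_partition(thread_ids, cycle, threads_per_cycle):
    """Return (thread, slots) pairs for this cycle's fetch."""
    n = len(thread_ids)
    start = (cycle * threads_per_cycle) % n          # rotate every cycle
    chosen = [thread_ids[(start + i) % n] for i in range(threads_per_cycle)]
    slots = FETCH_WIDTH // threads_per_cycle
    return [(t, slots) for t in chosen]

threads = [0, 1, 2, 3]
print(rr_partition(threads, cycle=0, threads_per_cycle=2))  # [(0, 4), (1, 4)]
print(rr_partition(threads, cycle=1, threads_per_cycle=2))  # [(2, 4), (3, 4)]
```

The dynamic solutions differ only in replacing the rotation with a ranking over queue state, as in the ICOUNT sketch earlier.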
Simultaneous Multithreading: What to Issue Not exactly the same as in superscalars… In a superscalar: oldest is best (least speculation, more dependent ones waiting, etc.) In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads Candidate selection strategies: Oldest first Cache-hit speculated last Branch speculated last Branches first… Important result: it doesn’t matter too much!
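Two of the strategies can be written as sort orders over the ready instructions (the record fields are illustrative, not a real issue-queue format):

```python
# Issue-selection sketch: order ready instructions by two of the slide's
# strategies. Field names are illustrative; age = cycles since dispatch.
instrs = [
    {"age": 5, "branch_spec": True,  "cache_spec": False},
    {"age": 9, "branch_spec": False, "cache_spec": False},
    {"age": 2, "branch_spec": False, "cache_spec": True},
]

# Oldest first: highest age wins.
oldest_first = sorted(instrs, key=lambda i: -i["age"])
# Speculated last: non-speculative work first, then oldest-first as a tiebreak.
spec_last = sorted(instrs, key=lambda i: (i["branch_spec"], i["cache_spec"], -i["age"]))

print([i["age"] for i in oldest_first])  # [9, 5, 2]
print([i["age"] for i in spec_last])     # [9, 2, 5]
```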
Simultaneous Multithreading: Compiler Optimizations Should try to minimize cache interference Latency-hiding techniques like speculation should be enhanced Sharing optimization techniques from multiprocessors change: data sharing is now beneficial
Simultaneous Multithreading: Caching Same cache shared among threads Performance degradation due to cache sharing Possibility of cache thrashing
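Cache thrashing between threads can be demonstrated with a toy direct-mapped cache model (addresses and geometry are made up for illustration):

```python
# Toy demonstration of inter-thread cache conflict: two threads whose hot
# lines map to the same set of a direct-mapped cache evict each other.
NUM_SETS = 64

def run(addresses):
    cache = {}                           # set index -> cached address
    hits = 0
    for a in addresses:
        s = (a // 64) % NUM_SETS         # 64-byte lines, direct-mapped
        if cache.get(s) == a:
            hits += 1
        cache[s] = a                     # fill/evict on miss
    return hits

thread_a = [0x10000] * 8                  # one thread reuses a single line
thread_b = [0x10000 + NUM_SETS * 64] * 8  # same set, different line
alone = run(thread_a)
interleaved = run([x for pair in zip(thread_a, thread_b) for x in pair])
print(alone, interleaved)  # 7 hits alone, 0 hits once the threads thrash
```

A thread that hits 7 of 8 accesses alone gets zero hits once the conflicting thread shares the cache, which is the degradation the slide warns about.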
PERFORMANCE ANALYSIS Four models are selected Base machine has 10 FUs, 8-issue 1. Fine-Grain Multithreading 2. SM:Full Simultaneous Issue: Eight threads compete for each of the issue slots each cycle. 3. SM:Single Issue, SM:Dual Issue, SM:Four Issue: Limit the number of instructions each thread can issue e.g.: each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle. 4. SM:Limited Connection: Each hardware context is directly connected to exactly one of each type of functional unit.
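The per-thread issue limits of the SM models reduce to a cap applied while filling the 8 slots; a hypothetical sketch (the ready counts are invented):

```python
# Sketch of the issue-limit models: cap how many instructions each thread
# may issue per cycle, then fill up to 8 slots total (hypothetical model).
ISSUE_SLOTS = 8

def issued(ready_per_thread, per_thread_cap):
    total = 0
    for ready in ready_per_thread:
        take = min(ready, per_thread_cap, ISSUE_SLOTS - total)
        total += take
        if total == ISSUE_SLOTS:
            break
    return total

ready = [6, 3, 1, 4]        # ready instructions per thread this cycle
print(issued(ready, 8))     # SM:Full Simultaneous Issue -> 8
print(issued(ready, 2))     # SM:Dual Issue -> 2+2+1+2 = 7
print(issued(ready, 1))     # SM:Single Issue -> 4
```

Tighter caps leave slots unfilled when few threads have work, which is why the single- and dual-issue variants need more threads to reach full utilization.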
PERFORMANCE ANALYSIS
PERFORMANCE ANALYSIS: H/W COMPLEXITY
COMPARISON SMT vs. Multiprocessing Multiprocessing statically assigns functional units to threads SMT allows threads to expand into available resources
COMPARISON
DRAWBACKS Two main drawbacks 1. Single-thread performance decreases due to architectural constraints 2. Additional contexts will increase power consumption
Commercial Examples Compaq Alpha (EV8) 4T SMT Project killed June 2001 Intel Pentium IV (Xeon) 2T SMT Availability in 2002 (already there before, but not enabled) 10-30% gains expected Also known as Hyper-Threading SUN Ultra IV 2-core CMP, 2T SMT IBM POWER5 Dual processor core 8-way superscalar Simultaneous multithreaded (SMT) core: Up to 2 virtual processors per real processor 24% area growth per core for SMT
Commercial Examples: IBM POWER5
SMT added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPR/FPR rename mapper expanded to map a second set of registers (high-order address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most address/tag buses
Commercial Examples: IBM POWER5
Includes: 1. Thread Priority Mechanism: power efficiency, 8 levels 2. Dynamic Thread Switching Used if no task is ready for the second thread to run Allocates all machine resources to one thread Initiated by SW
Commercial Examples: IBM POWER5 Dormant thread wakes up on: 1. External interrupt 2. Decrementer interrupt 3. Special instruction from active thread
Future Tendencies Simultaneous & Redundantly Threaded Processors (SRT) Increase reliability with fault detection and correction. Run multiple copies of the same program simultaneously Software Pre-Execution in SMT: In some cases the data address is extremely hard to predict. Prefetching is useless Use an idle SMT thread for pre-execution. A complete software solution Speculation More techniques on speculation, e.g. Speculative Data-Driven Multithreading, Threaded Multiple Path Execution, Simultaneous Subordinate Microthreading and Thread-Level Speculation
REFERENCES “Simultaneous Multithreading: Maximizing On-Chip Parallelism” by Tullsen, Eggers and Levy, ISCA ’95. “Simultaneous Multithreading: Present Developments and Future Directions” by Miquel Peric, June 2003. “Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor” by IBM, Aug 2004. “Simultaneous Multithreading: A Platform for Next-Generation Processors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen, IEEE Micro, October 1997.
Q&A THANKS!