CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
Aleksandar Milenkovic
Electrical and Computer Engineering
University of Alabama in Huntsville

Outline
- Trends in microarchitecture
- Exploiting thread-level parallelism
- Exploiting TLP within a processor
- Resource sharing
- Performance implications
- Design challenges
- Intel's HT technology

Trends in Microarchitecture
- Higher clock speeds
  - To achieve high clock frequency, make the pipeline deeper (superpipelining)
  - Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles
- ILP: Instruction-Level Parallelism
  - Extract parallelism within a single program
  - Superscalar processors have multiple execution units working in parallel
  - The challenge is to find enough instructions that can be executed concurrently
  - Out-of-order execution: instructions are sent to execution units based on instruction dependences rather than program order

Trends in Microarchitecture
- Cache hierarchies
  - Processor-memory speed gap
  - Use caches to reduce effective memory latency
  - Multiple levels of caches: smaller and faster closer to the processor core
- Thread-level parallelism
  - Multiple programs execute concurrently
  - Web servers have an abundance of software threads
  - Users: surfing the web, listening to music, encoding/decoding video streams, etc.

Exploiting Thread-Level Parallelism
- CMP: Chip Multiprocessing
  - Multiple processors, each with a full set of architectural resources, reside on the same die
  - Processors may share an on-chip cache, or each can have its own cache
  - Examples: HP Mako, IBM Power4
  - Challenges: power, die area (cost)
- Time-slice multithreading
  - The processor switches between software threads after a predefined time slice
  - Can minimize the effects of long-lasting events
  - Still, some execution slots are wasted

Multithreading Within a Processor
- Until now we have executed multiple threads of an application on different processors. Can multiple threads execute concurrently on the same processor?
- Why is this desirable?
  - Inexpensive: one CPU, no external interconnects
  - No remote or coherence misses (though more capacity misses)
- Why does this make sense?
  - Most processors cannot find enough work: peak IPC is 6, but average IPC is only 1.5!
  - Threads can share resources, so we can increase the number of threads without a corresponding linear increase in area

What Resources Are Shared?
- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC and its own logical registers (and its own mapping from logical to physical registers)
- For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference); note, however, that more sharing means better utilization of resources
- Each additional thread costs only a PC, a rename table, and a ROB: cheap! (see the sketch below)
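A minimal Python sketch of this split; all class and field names are illustrative, not taken from any real design. Per-thread state is just a PC, a rename map, and a ROB, while the physical register file (and, implicitly, caches and execution units) lives in the shared core.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    # Replicated for correctness: each thread needs its own PC and its
    # own mapping from logical (architectural) to physical registers.
    pc: int = 0
    rename_map: dict = field(default_factory=lambda: {f"R{i}": None for i in range(32)})
    # Replicated for performance: a private ROB keeps one thread's stall
    # from blocking commit in the others.
    rob: list = field(default_factory=list)

@dataclass
class SmtCore:
    # Shared for utilization: one pool of physical registers (caches,
    # predictors, and functional units would be shared the same way).
    num_physical_regs: int = 128
    threads: list = field(default_factory=list)

    def add_thread(self):
        # Adding a thread costs only a PC, a rename table, and a ROB.
        self.threads.append(ThreadContext())

core = SmtCore()
core.add_thread()
core.add_thread()
print(len(core.threads), "hardware threads share", core.num_physical_regs, "physical registers")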

Approaches to Multithreading Within a Processor
- Fine-grained multithreading: switches threads on every clock cycle
  - Pro: hides the latency of both short and long stalls
  - Con: slows down the execution of individual threads that are ready to go
- Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 misses)
  - Pros: no switching every clock cycle, no slowdown for ready-to-go threads
  - Con: limited ability to hide shorter stalls
- Simultaneous multithreading (SMT): exploits TLP at the same time it exploits ILP (the first two policies are sketched below)
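A toy Python sketch of the two switching policies; the thread count and the miss pattern are made up for illustration. Fine-grained rotates threads every cycle, while coarse-grained stays with one thread until it hits a long-latency event such as an L2 miss.

def fine_grained(num_threads, cycle):
    # Round-robin: a different thread issues each cycle, so even a
    # ready-to-go thread only gets every Nth cycle.
    return cycle % num_threads

def coarse_grained(current, stalled_on_l2_miss, num_threads):
    # Stay with the running thread until it takes a costly stall.
    return (current + 1) % num_threads if stalled_on_l2_miss else current

# Fine-grained schedule over 6 cycles with 3 threads.
print([fine_grained(3, c) for c in range(6)])   # [0, 1, 2, 0, 1, 2]

# Coarse-grained: thread 0 runs until it misses in L2 at cycle 3.
t = 0
for cycle, miss in enumerate([False, False, False, True, False]):
    t = coarse_grained(t, miss, 3)
    print("cycle", cycle, "-> thread", t)       # thread 0 ... then thread 1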

How Are Resources Shared?
- Each box in the figure represents an issue slot for a functional unit; peak throughput is 4 IPC
- A superscalar processor has high under-utilization: it cannot find enough work every cycle, especially when there is a cache miss
- Fine-grained multithreading can issue instructions from only a single thread in a given cycle: it cannot fill every slot every cycle, but cache misses can be tolerated
- Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot (see the calculation below)
[Figure: issue-slot diagrams for a superscalar processor, coarse-grained multithreading, fine-grained multithreading, and simultaneous multithreading; legend: Thread 1 through Thread 4, idle cycles]
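A back-of-the-envelope Python calculation of the same point; the per-cycle ready-instruction counts are invented. A 4-wide superscalar can only issue from one thread, while SMT may draw from all threads each cycle.

WIDTH = 4
# ready[c][t] = instructions thread t has ready in cycle c (assumed values).
ready = [
    [1, 0, 3, 2],
    [0, 2, 1, 0],
    [3, 1, 0, 2],
    [2, 2, 2, 2],
    [0, 0, 4, 1],
]

superscalar = sum(min(WIDTH, cyc[0]) for cyc in ready)   # issue from thread 0 only
smt = sum(min(WIDTH, sum(cyc)) for cyc in ready)         # issue from any thread
print(f"superscalar: {superscalar}/{5 * WIDTH} issue slots used")   # 6/20
print(f"SMT:         {smt}/{5 * WIDTH} issue slots used")           # 19/20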

Resource Sharing
Pipeline: Instruction Fetch -> Instruction Rename -> Issue Queue -> Register File -> FU

Thread 1 (architectural)    Thread 1 (after renaming)
R1 ← R1 + R2                P73 ← P1 + P2
R3 ← R1 + R4                P74 ← P73 + P4
R5 ← R1 + R3                P75 ← P73 + P74

Thread 2 (architectural)    Thread 2 (after renaming)
R2 ← R1 + R2                P76 ← P33 + P34
R5 ← R1 + R2                P77 ← P33 + P76
R3 ← R5 + R3                P78 ← P77 + P35

Both threads use the same architectural register names but are renamed into disjoint physical registers in the shared register file.
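The renaming above can be reproduced with a minimal Python sketch: each thread has its own rename map, while physical registers come from one shared free list. The names are illustrative and the free-list management is heavily simplified (no register reclamation); the initial mappings are chosen to match the figure.

class Renamer:
    def __init__(self, first_free):
        self.free = iter(range(first_free, 1000))   # shared free list (simplified)

class ThreadMap:
    def __init__(self, initial):
        self.map = dict(initial)                    # architectural -> physical

def rename(instr, tmap, renamer):
    dst, src1, src2 = instr
    p1, p2 = tmap.map[src1], tmap.map[src2]         # read current source mappings
    pd = next(renamer.free)                         # allocate a fresh physical reg
    tmap.map[dst] = pd                              # destination gets a new name
    return f"P{pd} <- P{p1} + P{p2}"

shared = Renamer(first_free=73)
t1 = ThreadMap({"R1": 1, "R2": 2, "R3": 3, "R4": 4, "R5": 5})
for ins in [("R1", "R1", "R2"), ("R3", "R1", "R4"), ("R5", "R1", "R3")]:
    print("Thread 1:", rename(ins, t1, shared))     # P73, P74, P75 as in the figure

t2 = ThreadMap({"R1": 33, "R2": 34, "R3": 35})
for ins in [("R2", "R1", "R2"), ("R5", "R1", "R2"), ("R3", "R5", "R3")]:
    print("Thread 2:", rename(ins, t2, shared))     # P76, P77, P78 as in the figure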

Performance Implications of SMT
- Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by prioritizing one thread
- While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT) fetches from the thread with the fewest instructions in flight, so that each thread has an equal share of processor resources (sketched below)
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x
- The Alpha 21464 and Intel Pentium 4 are examples of SMT designs
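A minimal sketch of the ICOUNT idea, assuming a simple per-thread count of in-flight instructions; real implementations count instructions in specific front-end stages (decode, rename, issue queue).

def icount_pick(in_flight):
    """in_flight[t] = instructions thread t currently holds in the
    front end; fetch from the least-represented thread, which tends
    to equalize each thread's share of processor resources."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

print(icount_pick([12, 5, 9, 5]))   # -> 1 (ties broken by lowest thread id)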

Design Challenges
- How many threads?
  - Many threads are needed to find enough parallelism
  - However, mixing many threads compromises the execution of the individual threads
- Processor front end (instruction fetch)
  - Fetch as far ahead as possible within a single thread (to maximize that thread's performance)
  - However, this limits the number of instructions available for scheduling from other threads
- Larger register files (multiple contexts)
- Minimizing clock cycle time
- Cache conflicts

Pentium 4: Hyper-Threading Architecture
- One physical processor appears as multiple logical processors
- The HT implementation on the NetBurst microarchitecture has 2 logical processors
- Each logical processor keeps its own architectural state; the processor execution resources are shared
- Architectural state:
  - general-purpose registers
  - control registers
  - APIC: advanced programmable interrupt controller

Pentium 4: Hyper-Threading Architecture
- Main processor resources are shared
  - caches, branch predictors, execution units, buses, control logic
- Duplicated resources
  - register alias tables (map the architectural registers to physical rename registers)
  - next-instruction pointer and associated control logic
  - return stack pointer
  - instruction streaming buffer and trace cache fill buffers

Pentium 4: Die Size and Complexity

Pentium 4: Resource-Sharing Schemes
- Partition: dedicate equal resources to each logical processor
  - Good when utilization is expected to be high and somewhat unpredictable
- Threshold: flexible resource sharing with a limit on maximum resource usage
  - Good for small resources with bursty utilization, where micro-ops stay in the structure for short, predictable periods
- Full sharing: flexible sharing with no limits (all three schemes are sketched below)
  - Good for large structures with variable working-set sizes
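A Python sketch of the three schemes for a hypothetical 32-entry structure shared by two logical processors; the entry count and the threshold cap are assumptions for illustration, not Intel's actual values.

ENTRIES = 32

def can_allocate(scheme, occupancy, lp):
    """occupancy[lp] = entries logical processor lp currently holds."""
    if scheme == "partition":        # hard split: each LP gets half
        return occupancy[lp] < ENTRIES // 2
    if scheme == "threshold":        # flexible, but capped per LP
        cap = int(ENTRIES * 0.75)    # assumed cap, not Intel's setting
        return occupancy[lp] < cap and sum(occupancy) < ENTRIES
    if scheme == "full":             # first come, first served
        return sum(occupancy) < ENTRIES
    raise ValueError(scheme)

occ = [20, 4]   # LP0 is bursty and already holds 20 entries
for scheme in ("partition", "threshold", "full"):
    print(scheme, "-> LP0 may allocate:", can_allocate(scheme, occ, 0))
# partition -> False (LP0 exceeded its half), threshold -> True, full -> True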

Pentium 4: Shared vs. Partitioned Queues
[Figure: the same queue implemented as one shared structure vs. partitioned between the logical processors]

NetBurst Pipeline
[Figure: the NetBurst pipeline, with partitioned queues and threshold-limited structures marked]

Pentium 4: Shared vs. Partitioned Resources
- Partitioned resources
  - e.g., the major pipeline queues
- Threshold resources
  - a limit is placed on the number of entries a logical processor can occupy
  - e.g., the scheduler
- Fully shared resources
  - e.g., caches
  - modest interference in practice
  - can benefit when threads share code and/or data

Pentium 4: Scheduler Occupancy

Pentium 4: Shared vs. Partitioned Cache

Pentium 4: Performance Improvements

Multi-Programmed Speedup