Levels of Parallelism within a Single Processor

Presentation transcript:

Levels of Parallelism within a Single Processor
- ILP: the smallest grain of parallelism
  - Resources are not that well utilized (far from ideal CPI)
  - Stalls on operations with long latencies (cache miss, division)
- Multiprogramming: several applications (or large sections of applications) running concurrently
  - O.S.-directed activity
  - Changing applications requires a context switch (e.g., on a page fault)
- Multithreading
  - Main goal: tolerate the latency of long operations without paying the price of a full context switch
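As a rough illustration of why a single instruction stream ends up far from ideal CPI, here is a back-of-the-envelope sketch in C; the miss rate and penalty are made-up numbers, not measurements of any particular processor.

```c
#include <stdio.h>

/* Back-of-the-envelope model of how long-latency events pull a single
 * instruction stream away from its ideal CPI.  All numbers are
 * illustrative assumptions. */
int main(void) {
    double ideal_cpi    = 1.0;   /* one instruction per cycle if nothing stalls */
    double miss_rate    = 0.02;  /* cache misses per instruction (assumed)      */
    double miss_penalty = 100.0; /* stall cycles per miss (assumed)             */

    double effective_cpi = ideal_cpi + miss_rate * miss_penalty;

    printf("effective CPI = %.1f (vs. ideal %.1f)\n", effective_cpi, ideal_cpi);
    /* -> 3.0: two thirds of the cycles are spent waiting, which is the
     *    utilization gap that multithreading tries to fill. */
    return 0;
}
```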

Multithreading
- The processor supports several instruction streams running "concurrently"
- Each instruction stream has its own context (process state)
  - Registers
  - PC, status register, special control registers, etc.
- The multiple streams are multiplexed by the hardware onto a set of common functional units
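A minimal sketch of what "its own context" means in hardware: the state below is replicated per stream, while the functional units are shared. Field names and sizes are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Per-stream state a multithreaded processor replicates.
 * Field names and sizes are hypothetical; real designs differ. */
struct hw_context {
    uint64_t gpr[32];   /* general-purpose registers          */
    uint64_t fpr[32];   /* floating-point registers           */
    uint64_t pc;        /* program counter                    */
    uint64_t status;    /* status register (flags, mode bits) */
    uint64_t ctrl[8];   /* special control registers          */
};

/* The functional units are NOT replicated: the hardware multiplexes the
 * contexts onto one shared set of execution resources. */
struct mt_core {
    struct hw_context ctx[4];   /* e.g., 4 hardware thread contexts */
    /* shared: functional units, caches, ... */
};

int main(void) {
    struct mt_core core = {0};
    printf("replicated state per thread: %zu bytes\n", sizeof core.ctx[0]);
    return 0;
}
```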

Fine-Grain Multithreading
- Conceptually, at every cycle a new instruction stream dispatches an instruction to the functional units
- If enough instruction streams are present, long latencies can be hidden
  - For example, if 32 streams can each dispatch an instruction, latencies of 32 cycles could be tolerated
- For a single application, this requires highly sophisticated compiler technology to discover many threads within that one application
- Basic idea behind the Tera (now Cray) MTA
  - The MTA is Burton Smith's third such machine (he started in the late 70's)
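A small sketch of the round-robin intuition, purely illustrative: with NSTREAMS hardware streams issuing one per cycle, each stream is revisited only every NSTREAMS cycles, so any operation whose latency fits in that gap never stalls the pipeline.

```c
#include <stdio.h>

#define NSTREAMS 32   /* assumed number of hardware streams */

/* Fine-grain multithreading sketch: one stream issues per cycle, round-robin.
 * A given stream is revisited every NSTREAMS cycles, so any operation whose
 * latency is <= NSTREAMS cycles has finished by the time the stream issues
 * again -- the latency is hidden rather than stalled on. */
int main(void) {
    int latencies[] = { 1, 4, 20, 32, 100 };   /* example operation latencies */
    for (int i = 0; i < 5; i++) {
        int lat = latencies[i];
        printf("latency %3d cycles: %s\n", lat,
               lat <= NSTREAMS ? "hidden by round-robin issue"
                               : "still stalls (not enough streams)");
    }
    return 0;
}
```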

Tera's MTA
- Each processor can support 16 multiprogrammed applications and 128 streams
- At every clock cycle, the processor selects a ready stream and issues an instruction from that stream
- An instruction is a "LIW": memory, arithmetic, control
- Several instructions of the same stream can be in flight simultaneously (ILP)
- Instructions of different streams can be in flight simultaneously (multithreading, or thread-level parallelism, TLP)
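A hypothetical sketch of that per-cycle selection: each instruction word bundles a memory, an arithmetic, and a control operation, and every cycle the processor issues from any ready stream. The encoding and the rotating search are illustrative, not the MTA's actual implementation.

```c
#include <stdbool.h>
#include <stdio.h>

#define NSTREAMS 128   /* streams per processor */

/* Hypothetical sketch of MTA-style issue: a "LIW" bundles a memory, an
 * arithmetic, and a control operation; each cycle one word issues from
 * any ready stream. */
struct liw {
    int mem_op;    /* memory operation     */
    int arith_op;  /* arithmetic operation */
    int ctrl_op;   /* control operation    */
};

struct stream {
    bool ready;        /* not waiting on an outstanding long-latency op */
    struct liw next;   /* next instruction word to issue                */
};

static int pick_ready(struct stream s[], int start) {
    for (int i = 0; i < NSTREAMS; i++) {       /* simple rotating search */
        int idx = (start + i) % NSTREAMS;
        if (s[idx].ready) return idx;
    }
    return -1;                                 /* no ready stream: bubble */
}

int main(void) {
    struct stream s[NSTREAMS] = {0};
    s[3].ready = true;
    s[42].ready = true;
    printf("cycle 0 issues from stream %d\n", pick_ready(s, 0));   /* -> 3  */
    printf("cycle 1 issues from stream %d\n", pick_ready(s, 4));   /* -> 42 */
    return 0;
}
```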

Tera's MTA (cont'd)
- Since several streams can belong to the same application, synchronization is very important (will be discussed later in the quarter)
- Needs instructions – and compiler support – to allocate, activate, and deallocate streams
  - Compiler support: loop-level parallelism and software pipelining
  - Hardware support: dynamic allocation of streams (depending on the mix of applications, etc.)

Coarse-Grain Multithreading
- Switch threads (contexts) only on certain events
- Changing thread context takes a few (10-20?) cycles
- Used for long-latency memory operations, e.g., access to a remote memory in a shared-memory multiprocessor (100's of cycles)
- Of course, context switches also occur on exceptions such as page faults
- Many fewer contexts are needed than in fine-grain multithreading
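A sketch of the coarse-grain policy, to contrast with the per-cycle rotation above: the running thread keeps issuing until it hits a long-latency event, and only then does the hardware pay the (small) context-switch cost. The cycle counts are illustrative assumptions.

```c
#include <stdio.h>

#define SWITCH_COST   15   /* cycles to swap hardware contexts (assumed)  */
#define REMOTE_MISS  300   /* cycles for a remote-memory access (assumed) */

/* Coarse-grain multithreading sketch: run one thread until it hits a
 * long-latency event, then switch to another ready context.  Switching
 * pays off whenever the event latency dwarfs the switch cost. */
int main(void) {
    int event_latency = REMOTE_MISS;
    if (event_latency > SWITCH_COST) {
        int cycles_hidden = event_latency - SWITCH_COST;
        printf("switch threads: hide %d of %d stall cycles\n",
               cycles_hidden, event_latency);
    } else {
        printf("stay on the same thread: switching costs more than the stall\n");
    }
    return 0;
}
```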

Simultaneous Multithreading (SMT)
- Combines the advantages of ILP and fine-grain multithreading
- Horizontal waste is still present, but not as much
- Vertical waste does not necessarily disappear completely, as the slide's figure implies
  - Vertical waste is of the order of 60% of overall waste
[Figure: issue-slot diagrams contrasting horizontal and vertical waste in an ILP-only superscalar vs. an SMT processor]
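To make the horizontal/vertical waste terminology concrete, here is a small sketch that classifies empty issue slots in a made-up issue diagram: a cycle in which nothing issues contributes vertical waste, while an empty slot in a cycle that did issue something contributes horizontal waste.

```c
#include <stdio.h>

#define CYCLES 6
#define WIDTH  4   /* issue width */

/* Classify empty issue slots in a made-up issue diagram.
 * A fully empty cycle is vertical waste; an empty slot in a cycle that
 * issued something is horizontal waste. */
int main(void) {
    /* issued[c] = instructions issued in cycle c (illustrative values) */
    int issued[CYCLES] = { 2, 0, 3, 0, 0, 1 };
    int horizontal = 0, vertical = 0;

    for (int c = 0; c < CYCLES; c++) {
        if (issued[c] == 0) vertical   += WIDTH;
        else                horizontal += WIDTH - issued[c];
    }
    int waste = horizontal + vertical;
    printf("horizontal waste: %d slots, vertical waste: %d slots (%.0f%% of waste)\n",
           horizontal, vertical, 100.0 * vertical / waste);
    return 0;
}
```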

SMT (a UW invention)
- Needs one context per thread
  - But fewer threads are needed than in fine-grain multithreading
- Can issue simultaneously from distinct threads in the same cycle
- Can share resources
  - For example: physical registers for renaming, caches, the BPT, etc.
- A future generation of the Alpha was to be based on SMT
- Intel Hyperthreading is SMT with … 2 threads
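A minimal sketch of what "issue simultaneously from distinct threads" means: in a single cycle, the issue logic fills its slots from whichever hardware threads have ready instructions, instead of being limited to one thread. The thread counts and ready counts are hypothetical.

```c
#include <stdio.h>

#define WIDTH    4   /* issue width                 */
#define NTHREADS 2   /* hardware threads (contexts) */

/* SMT issue sketch: in one cycle, fill issue slots from any thread that
 * has ready instructions, rather than from a single thread only. */
int main(void) {
    int ready[NTHREADS] = { 1, 3 };   /* ready instructions per thread this cycle */
    int slots = WIDTH;

    for (int t = 0; t < NTHREADS && slots > 0; t++) {
        int take = ready[t] < slots ? ready[t] : slots;
        slots -= take;
        printf("issue %d instruction(s) from thread %d\n", take, t);
    }
    printf("%d slot(s) left empty this cycle (horizontal waste)\n", slots);
    return 0;
}
```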

SMT (cont'd)
- Compared with an ILP superscalar of the same issue width:
  - Requires about 5% more real estate
  - More complex to design (thread scheduling, identifying which thread raised an exception, etc.)
  - Drawback (common to all wide-issue processors): centralized design
- Benefits
  - Increases the throughput of applications running concurrently (but can also increase the latency of a given application)
  - Dynamic scheduling
  - No partitioning of many resources (in contrast with chip multiprocessors)

Speculative Multithreading
- A current area of research; several approaches
  - Try to identify the "critical path" of instructions within a loop and have them run speculatively, in advance of the main thread, in further iterations
  - Spawn speculative threads on events such as:
    - Procedure call: continue with the caller and callee concurrently
    - In loops: generate a speculative thread from the loop exit
    - On hard-to-predict branches: have threads execute both sides
- Main advantage: warms up caches and branch predictors

Trace Caches
- Filling the instruction buffer of wide-issue processors is a challenge (even more so in SMT)
- Instead of fetching from the I-cache, fetch from a trace cache
  - The trace cache is a complement to (or a replacement for) the instruction cache; it stores sequences of instructions organized in dynamic program execution order
- Implemented in the Intel Pentium 4 and in some Sun SPARC architectures
- One way to do dynamic optimization of programs and to find the critical path of instructions for speculative multithreading
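A sketch of the trace-cache idea: lines are tagged by the PC that started the trace (real designs also fold in predicted branch outcomes), and each line holds instructions in the order the program actually executed them, so the fetch unit can supply a wide block even across taken branches. All structures and sizes below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_LEN  8    /* instructions per trace line (assumed) */
#define NLINES    64    /* number of trace-cache lines (assumed) */

/* Hypothetical trace-cache line: a sequence of instructions captured in
 * dynamic execution order, tagged by the PC that started the trace. */
struct trace_line {
    bool     valid;
    uint64_t start_pc;           /* tag                           */
    uint64_t insts[TRACE_LEN];   /* instructions in dynamic order */
};

static struct trace_line tcache[NLINES];

/* Fetch: if a trace starting at 'pc' is present, return the whole trace in
 * one shot (possibly spanning taken branches); otherwise fall back to the
 * ordinary I-cache and start building a new trace. */
static const struct trace_line *trace_fetch(uint64_t pc) {
    struct trace_line *line = &tcache[pc % NLINES];
    if (line->valid && line->start_pc == pc)
        return line;             /* hit: one wide fetch this cycle          */
    return NULL;                 /* miss: fetch from I-cache, build a trace */
}

int main(void) {
    tcache[0x400 % NLINES] = (struct trace_line){ .valid = true, .start_pc = 0x400 };
    printf("trace %s for PC 0x400\n", trace_fetch(0x400) ? "hit" : "miss");
    printf("trace %s for PC 0x800\n", trace_fetch(0x800) ? "hit" : "miss");
    return 0;
}
```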