Levels of Parallelism within a Single Processor


Levels of Parallelism within a Single Processor

- ILP: the smallest grain of parallelism
  - Resources are not that well utilized (far from ideal CPI)
  - Stalls on operations with long latencies (division, cache miss)
- Multiprogramming: several applications (or large sections of applications) running concurrently
  - An O.S.-directed activity
  - Changing applications requires a context switch (e.g., on a page fault)
- Multithreading
  - Main goal: tolerate the latency of long operations without paying the price of a full context switch

4/6/2019 Multithr. CSE 471 Autumn 01

Multithreading

- The processor supports several instruction streams running "concurrently"
- Each instruction stream has its own context (process state):
  - Registers
  - PC, status register, special control registers, etc.
- The multiple streams are multiplexed by the hardware onto a set of common functional units

Fine-Grain Multithreading

- Conceptually, at every cycle a new instruction stream dispatches an instruction to the functional units
- If enough instruction streams are present, long latencies can be hidden
  - For example, if 32 streams can each dispatch an instruction, latencies of 32 cycles can be tolerated
- For a single application, this requires highly sophisticated compiler technology to discover many threads in that application
- Basic idea behind the Tera (now Cray) MTA
  - Burton Smith's third such machine (he started in the late '70s)
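The cycle-by-cycle issue policy described above can be made concrete with a toy simulator (not from the slides; the stream counts, instruction counts, and latency are arbitrary illustrative numbers). Each cycle, the first ready stream issues one instruction; a stream whose previous instruction is still completing is skipped:

```python
# Toy model of fine-grain multithreading: one instruction issues per
# cycle, scanning over the hardware streams. A stream whose last
# instruction is still completing (long latency) is skipped, so the
# latency is hidden as long as enough other streams are ready.

def run(num_streams, instrs_per_stream, latency):
    """Return the number of cycles in which no stream could issue."""
    remaining = [instrs_per_stream] * num_streams  # instructions left
    busy_until = [0] * num_streams  # cycle when each stream is ready again
    cycle, stalls = 0, 0
    while any(remaining):
        issued = False
        # A true round-robin scan starting after the last issuing stream
        # would be closer to real hardware; a full scan keeps this simple.
        for s in range(num_streams):
            if remaining[s] and busy_until[s] <= cycle:
                remaining[s] -= 1
                busy_until[s] = cycle + latency  # next instr waits this long
                issued = True
                break
        if not issued:
            stalls += 1
        cycle += 1
    return stalls

# With 32 streams, a 32-cycle latency is fully hidden (zero stall cycles);
# a single stream stalls for 31 cycles after every instruction.
print(run(32, 10, 32))  # -> 0
print(run(1, 10, 32))   # -> 279
```

This is exactly the slide's arithmetic: 32 streams each supplying one instruction per rotation cover a 32-cycle latency completely.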

Tera’s MTA Each processor can execute 16 applications in parallel (multiprogramming) 128 streams At every clock cycle, processor selects a ready stream and issues an instruction from that stream An instruction is a “LIW”: Memory, arithmetic, control Several instructions of the same stream can be in flight simultaneously (ILP) Instructions of different streams can be in flight simultaneously (multithreading) 4/6/2019 Multithr. CSE 471 Autumn 01

Tera’s MTA (c’ed) Since several streams belong to the same application, synchronization is very important (will be discussed later in the quarter) Needs instructions – and compiler support – to allocate, activate, and deallocate streams Compiler support: loop level parallelism and software pipelining Hardware support: dynamic allocation of streams (depending on mix of applications etc.) 4/6/2019 Multithr. CSE 471 Autumn 01

Coarse-Grain Multithreading

- Switch threads (contexts) only at certain events
- Changing thread context takes a few (10-20?) cycles
- Used for long-latency memory operations, e.g., access to a remote memory in a shared-memory multiprocessor (100s of cycles)
- Of course, context switches also occur on exceptions such as page faults
- Many fewer contexts are needed than in fine-grain multithreading
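A back-of-envelope model shows why a modest switch penalty is acceptable when the tolerated latency is hundreds of cycles. This is a rough steady-state sketch, not from the slides; the miss interval, remote latency, and penalty are illustrative numbers chosen to match the orders of magnitude quoted above:

```python
# Coarse-grain multithreading, steady state: a thread runs MISS_EVERY
# useful cycles, hits a remote-memory access, and the hardware switches
# to another ready thread. Latency the other threads cannot cover is
# exposed as stall time. All constants are illustrative assumptions.

MISS_EVERY = 100       # useful cycles between long-latency events
MISS_LATENCY = 400     # remote-memory access latency (100s of cycles)
SWITCH_PENALTY = 15    # cycles to change thread context (the "10-20?" range)

def utilization(num_threads):
    """Fraction of cycles doing useful work in steady state."""
    # While one thread waits out MISS_LATENCY, each other thread can
    # contribute one run segment plus one switch penalty.
    cover = (num_threads - 1) * (MISS_EVERY + SWITCH_PENALTY)
    exposed = max(MISS_LATENCY - cover, 0)
    return MISS_EVERY / (MISS_EVERY + SWITCH_PENALTY + exposed)

print(round(utilization(1), 2))  # -> 0.19: miss latency fully exposed
print(round(utilization(5), 2))  # -> 0.87: latency hidden; only switch cost remains
```

With these numbers, five contexts suffice to hide the remote latency, which is why coarse-grain designs need far fewer contexts than a fine-grain machine like the MTA.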

Simultaneous Multithreading (SMT)

- Combines the advantages of ILP and fine-grain multithreading
- Horizontal waste is still present, but much reduced
- Vertical waste does not necessarily all disappear, as the figure implies; vertical waste is on the order of 60% of overall waste

(Figure: issue-slot diagrams contrasting horizontal and vertical waste under ILP and SMT)
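The horizontal/vertical distinction in the figure can be made concrete by counting empty issue slots in a small diagram (rows are cycles of a 4-wide machine; the slot patterns below are made up for illustration):

```python
# Count horizontal and vertical waste in an issue-slot diagram.
# Each row is one cycle of a 4-wide machine; 1 = slot used, 0 = empty.
# Vertical waste: all slots of a cycle empty (whole cycle lost to a stall).
# Horizontal waste: unused slots in a cycle that issues at least one instr.

def waste(grid):
    horizontal = vertical = 0
    for cycle in grid:
        empty = cycle.count(0)
        if empty == len(cycle):
            vertical += empty     # fully idle cycle
        else:
            horizontal += empty   # partially filled cycle
    return horizontal, vertical

superscalar = [            # single thread on a 4-wide ILP machine
    [1, 1, 0, 0],
    [0, 0, 0, 0],          # stalled: vertical waste
    [0, 0, 0, 0],          # stalled: vertical waste
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
print(waste(superscalar))  # -> (6, 8): most waste is vertical

smt = [                    # same slots filled from several threads
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(waste(smt))          # -> (4, 0): vertical waste eliminated
```

SMT attacks vertical waste by issuing from other threads during a stall, and horizontal waste by filling leftover slots in the same cycle; some horizontal waste remains whenever no thread has a ready instruction for a slot.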

SMT (a UW invention)

- Needs one context per thread
  - But fewer threads are needed than in fine-grain multithreading
- Can issue from distinct threads simultaneously, in the same cycle
- Can share resources
  - For example: physical registers for renaming, caches, the BPT (branch prediction table), etc.
- A future generation of the Alpha was to be based on SMT
- Intel Hyper-Threading is SMT with ... 2 threads

SMT (c’ed) Compared with an ILP superscalar of same issue width Requires 5% more real estate More complex to design (thread scheduling, identifying threads that raise exceptions etc.) Drawback (common to all wide-issue processors): centralized design Benefits Increases throughput of applications running concurrently Dynamic scheduling No partitioning of many resources (in contrast with chip multiprocessors) 4/6/2019 Multithr. CSE 471 Autumn 01

Speculative Multithreading

- A current area of research; several approaches:
  - Try to identify the "critical path" of instructions within a loop and run it speculatively (in advance of the main thread) in later iterations
  - Spawn speculative threads on events such as:
    - Procedure calls: continue with the caller and the callee concurrently
    - Loops: generate a speculative thread from the loop exit
    - Hard-to-predict branches: have threads execute both sides
- Main advantage: warms up caches and branch predictors

Trace Caches

- Filling the instruction buffer of a wide-issue processor is a challenge (even more so in SMT)
- Instead of fetching from the I-cache, fetch from a trace cache
- The trace cache is a complementary instruction cache that stores sequences of instructions organized in dynamic program-execution order
- Implemented in the forthcoming Intel Willamette (Pentium 4) and some Sun SPARC architectures
- One way to do dynamic optimization of programs and to find the critical path of instructions for speculative multithreading
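A minimal sketch of the idea, assuming a hypothetical line format keyed by the starting PC plus the taken/not-taken outcomes of the branches inside the trace; real trace caches (e.g., Willamette's) differ considerably in line format, sizing, and fill policy:

```python
# Toy trace cache: lines are keyed by (starting PC, branch-outcome
# pattern) and hold a short instruction sequence in dynamic execution
# order, so the fetch unit can supply a whole trace (spanning taken
# branches) in one access instead of several I-cache fetches.
# TRACE_LEN and the instruction strings are illustrative only.

TRACE_LEN = 4  # instructions per trace line (real designs store more)

class TraceCache:
    def __init__(self):
        self.lines = {}

    def lookup(self, pc, branch_pattern):
        """Return a cached trace, or None on a trace-cache miss."""
        return self.lines.get((pc, branch_pattern))

    def fill(self, pc, branch_pattern, instrs):
        # Lines are built as instructions retire, in dynamic (not
        # static program) order, which is what makes this a trace.
        self.lines[(pc, branch_pattern)] = instrs[:TRACE_LEN]

tc = TraceCache()
# A dynamic sequence that crosses a taken branch partway through:
tc.fill(0x100, "TN", ["ld", "beq(T)", "add", "bne(N)"])

print(tc.lookup(0x100, "TN"))  # hit: the whole trace in one fetch
print(tc.lookup(0x100, "NN"))  # -> None: predictor chose a different path
```

The branch-outcome key is what lets the same starting PC map to different traces depending on the predicted path, which is also what makes trace caches useful for the dynamic-optimization and critical-path ideas mentioned above.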