CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Multithreading II Steve Ko Computer Sciences and Engineering University at Buffalo.

Slides:



Advertisements
Similar presentations
Symmetric Multiprocessors: Synchronization and Sequential Consistency.
Advertisements

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache III Steve Ko Computer Sciences and Engineering University at Buffalo.
4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
4/4/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 17: Synchronization and Sequential Consistency Krste Asanovic Electrical.
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 12 CS252 Graduate Computer Architecture Spring 2014 Lecture 12: Synchronization and Memory Models Krste.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
Multithreading processors Adapted from Bhuyan, Patterson, Eggers, probably others.
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
1 Burroughs B5500 multiprocessor. These machines were designed to support HLLs, such as Algol. They used a stack architecture, but part of the stack was.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency II Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Address Translation and Protection Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Virtual Memory I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Putting it all together: Intel Nehalem Steve Ko Computer Sciences and Engineering University.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency I Steve Ko Computer Sciences and Engineering University at Buffalo.
CS 252 Graduate Computer Architecture Lecture 13: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 6, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer.
1 Lecture 11: ILP Innovations and SMT Today: out-of-order example, ILP innovations, SMT (Sections 3.5 and supplementary notes)
CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California,
CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer.
CS 162 Computer Architecture Lecture 10: Multithreading Instructor: L.N. Bhuyan Adopted from Internet.
Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Multithreading I Steve Ko Computer Sciences and Engineering University at Buffalo.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
1 Lecture 12: ILP Innovations and SMT Today: ILP innovations, SMT, cache basics (Sections 3.5 and supplementary notes)
CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer.
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
March 16, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer.
April 13, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
April 8, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical.
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 4, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 17: Synchronization and Sequential Consistency Krste Asanovic Electrical.
April 15, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
Multi-core Processing The Past and The Future Amir Moghimi, ASIC Course, UT ECE.
CPE 731 Advanced Computer Architecture Thread Level Parallelism Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.
Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.
© Krste Asanovic, 2015CS252, Spring 2015, Lecture 13 CS252 Graduate Computer Architecture Fall 2015 Lecture 13: Multithreading Krste Asanovic
Hyper-Threading Technology Architecture and Microarchitecture
Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.
12/14 Multi-Hyper thread.1 Mutli-threading, Hyperthreading & Chip Multiprocessing (CMP) Beyond ILP: thread level parallelism (TLP) Multithreaded microarchitectures.
CENG331 Computer Organization Multithreading
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.
Ch3. Limits on Instruction-Level Parallelism 1. ILP Limits 2. SMT (Simultaneous Multithreading) ECE562/468 Advanced Computer Architecture Prof. Honggang.
COSC3330 Computer Architecture Lecture 21. TLP and Multi-core
Symmetric Multiprocessors: Synchronization and Sequential Consistency
COMP 740: Computer Architecture and Implementation
Electrical and Computer Engineering
CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches
Krste Asanovic Electrical Engineering and Computer Sciences
CPE 731 Advanced Computer Architecture Thread Level Parallelism
Simultaneous Multithreading
Computer Structure Multi-Threading
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Lec 10 – Example Architectures - Simultaneous Multithreading
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Krste Asanovic Electrical Engineering and Computer Sciences
Krste Asanovic Electrical Engineering and Computer Sciences
Levels of Parallelism within a Single Processor
Krste Asanovic Electrical Engineering and Computer Sciences
Symmetric Multiprocessors: Synchronization and Sequential Consistency
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
CSC3050 – Computer Architecture
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 14 – Multithreading Krste Asanovic Speaker: David Biancolin.
Presentation transcript:

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Multithreading II Steve Ko Computer Sciences and Engineering University at Buffalo

CSE 490/590, Spring Last time… Multithreading executes instructions from different threads Coarse-grained multithreading switches threads on cache misses Most of the OoO superscalar units are idle.

CSE 490/590, Spring Superscalar Machine Efficiency Issue width Time Completely idle cycle (vertical waste) Instruction issue Partially filled cycle, i.e., IPC < 4 (horizontal waste)

CSE 490/590, Spring Vertical Multithreading What is the effect of cycle-by-cycle interleaving? Issue width Time Second thread interleaved cycle-by-cycle Instruction issue Partially filled cycle, i.e., IPC < 4 (horizontal waste)

CSE 490/590, Spring Chip Multiprocessing (CMP) What is the effect of splitting into multiple processors? Issue width Time

CSE 490/590, Spring Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995] Interleave multiple threads to multiple issue slots with no restrictions Issue width Time

CSE 490/590, Spring O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996] Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads OOO instruction window already has most of the circuitry required to schedule from multiple threads Any single thread can utilize whole machine

CSE 490/590, Spring IBM Power 4 Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

CSE 490/590, Spring Power 4 Power 5 2 fetch (PC), 2 initial decodes 2 commits (architected register sets)

CSE 490/590, Spring Power 5 data flow... Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

CSE 490/590, Spring Changes in Power 5 to support SMT Increased associativity of L1 instruction cache and the instruction address translation buffers Added per thread load and store queues Increased size of the L2 (1.92 vs MB) and L3 caches Added separate instruction prefetch and buffering per thread Increased the number of virtual registers from 152 to 240 Increased the size of several issue queues The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

CSE 490/590, Spring CSE 490/590 Administrivia Quiz 2 (next Friday 4/8): After midterm until next Monday

CSE 490/590, Spring Pentium-4 Hyperthreading (2002) First commercial SMT design (2-way SMT) –Hyperthreading == SMT Logical processors share nearly all resources of the physical processor –Caches, execution units, branch predictors Die area overhead of hyperthreading ~ 5% When one logical processor is stalled, the other can make progress –No logical processor can use all entries in queues when two threads are active Processor running only one active software thread runs at approximately same speed with or without hyperthreading Hyperthreading dropped on OoO P6 based followons to Pentium- 4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines in Intel Atom (in-order x86 core) has two-way vertical multithreading

CSE 490/590, Spring Initial Performance of SMT Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate –Pentium 4 is dual threaded SMT –SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26 2 runs) speed-ups from 0.90 to 1.58; average was 1.20 Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate Power 5 running 2 copies of each app speedup between 0.89 and 1.41 –Most gained some –Fl.Pt. apps had most cache conflicts and least gains

CSE 490/590, Spring SMT adaptation to parallelism type For regions with high thread level parallelism (TLP) entire machine width is shared by all threads Issue width Time Issue width Time For regions with low thread level parallelism (TLP) entire machine width is available for instruction level parallelism (ILP)

CSE 490/590, Spring Icount Choosing Policy Why does this enhance throughput? Fetch from thread with the least instructions in flight.

CSE 490/590, Spring Summary: Multithreaded Categories Time (processor cycle) SuperscalarFine-GrainedCoarse-Grained Multiprocessing Simultaneous Multithreading Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Idle slot

CSE 490/590, Spring Uniprocessor Performance (SPECint) VAX : 25%/year 1978 to 1986 RISC + x86: 52%/year 1986 to 2002 RISC + x86: ??%/year 2002 to present From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, X

CSE 490/590, Spring Parallel Processing: Déjà vu all over again? “… today’s processors … are nearing an impasse as technologies approach the speed of light..” David Mitchell, The Transputer: The Time Is Now (1989) Transputer had bad timing (Uniprocessor performance  )  Procrastination rewarded: 2X seq. perf. / 1.5 years “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” Paul Otellini, President, Intel (2005) All microprocessor companies switch to MP (2X CPUs / 2 yrs)  Procrastination penalized: 2X sequential perf. / 5 yrs Manufacturer/Year AMD/’09Intel/’09IBM/’09Sun/’09 Processors/chip Threads/Processor 1248 Threads/chip

CSE 490/590, Spring symmetric All memory is equally far away from all processors Any processor can do any I/O (set up a DMA transfer) Symmetric Multiprocessors Memory I/O controller Graphics output CPU-Memory bus bridge Processor I/O controller I/O bus Networks Processor

CSE 490/590, Spring Synchronization The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system) Producer-Consumer: A consumer process must wait until the producer process has produced data Mutual Exclusion: Ensure that only one process uses a resource at a given time producer consumer Shared Resource P1 P2

CSE 490/590, Spring A Producer-Consumer Example The program is written assuming instructions are executed in order. Producer posting Item x: Load R tail, (tail) Store (R tail ), x R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head =R head +1 Store (head), R head process(R) Producer Consumer tailhead R tail R head R Problems?

CSE 490/590, Spring A Producer-Consumer Example continued Producer posting Item x: Load R tail, (tail) Store (R tail ), x R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head =R head +1 Store (head), R head process(R) Can the tail pointer get updated before the item x is stored? Programmer assumes that if 3 happens after 2, then 4 happens after 1. Problem sequences are: 2, 3, 4, 1 4, 1, 2,

CSE 490/590, Spring Sequential Consistency A Memory Model “ A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program” Leslie Lamport Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs M PPPPPP

CSE 490/590, Spring Acknowledgements These slides heavily contain material developed and copyright by –Krste Asanovic (MIT/UCB) –David Patterson (UCB) And also by: –Arvind (MIT) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) MIT material derived from course UCB material derived from course CS252