On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.

Presentation transcript:

On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2

2 © Alvin R. Lebeck 1999 CPS 221 Administrivia
Today
simultaneous multithreading, MP on a chip
project presentations (10-15 minutes)
midterm II, Wed April 29, in class
project write-up due Friday May 1, noon
–approximately 8 pages

3 Review: Software Coherence Protocols
Requires
Access Control
Messaging System
–small control messages
–large bulk transfer
Programmable Processor
–Support for Protocol operations
Questions
Kernel-based vs. User-Level?
Integration of processor with other requirements?

4 Review: Typhoon
Fully Integrated (processor, access control, NI)
[Figure: block diagram with Mem, P, $, RTLB, and NI blocks]

5 Software Fine-Grain Access Control
Low cost, can run on network of workstations
Flexibility of software protocol processing
Like SW Dirty Bits, but more general
For each load/store, check access bits
–if access fault, invoke fault handler
Lookup Options
–table lookup (Blizzard-S)
–magic cookie (Shasta, Blizzard-COW)
Instrumentation Options
–compiler
–executable editing

6 Blizzard-S
Supports Tempest Interface
Executable Editing (EEL)
Fast Table Lookup
–mask, shift, add
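The "mask, shift, add" lookup can be illustrated with a small sketch. This is not Blizzard-S code: the block size, address width, tag encoding, and every name below (`AccessTable`, `BLOCK_SHIFT`, and so on) are assumptions chosen only to show how an address could be turned into an access-tag index before each load or store.

```python
# Hypothetical parameters: none of these values come from the slides.
BLOCK_SHIFT = 5                  # assume 32-byte coherence blocks
ADDR_MASK = 0xFFFF_FFFF          # assume 32-bit virtual addresses
INVALID, READ_ONLY, READ_WRITE = 0, 1, 2   # assumed tag encoding


class AccessTable:
    """Per-block access tags, indexed by mask, shift, add."""

    def __init__(self, base=0):
        self.base = base         # index of the first table entry
        self.tags = {}           # sparse table; missing entries are INVALID

    def index(self, addr):
        # mask the address, shift out the block offset, add the table base
        return self.base + ((addr & ADDR_MASK) >> BLOCK_SHIFT)

    def set_tag(self, addr, tag):
        self.tags[self.index(addr)] = tag

    def check(self, addr, is_store, fault_handler):
        """Check inserted before a load/store; calls the handler on a fault."""
        tag = self.tags.get(self.index(addr), INVALID)
        needed = READ_WRITE if is_store else READ_ONLY
        if tag < needed:
            fault_handler(addr, is_store)
            return False
        return True
```

In this sketch a load to a read-only block passes the check, while a store to the same block invokes the fault handler, which is where a software coherence protocol would run.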

7 Shasta
Executable Editing (variant of ATOM)
Magic Cookie
    ld r1, r2[300]
    if r1 == magic_cookie
        do_out_of_line_check(x);
    add r3, r1, r4
Incorporates several optimizations
–code scheduling
–batching checks (refs to same cache lines)
–3% overhead on uniprocessor code
Multiple coherence granularity
Supports Release Consistency
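The magic cookie check can be sketched in a few lines. This is a simplified model, not Shasta's implementation: the cookie value, the line size, and the names `Memory`, `invalidate_line`, and `fetch_line` are hypothetical. The idea shown is that invalid lines are filled with a flag value, so the common-case check is a single compare on the loaded value, and only a match falls through to the slower out-of-line check.

```python
MAGIC_COOKIE = 0xDEADBEEF   # hypothetical flag value
LINE_WORDS = 4              # hypothetical line size, in words


class Memory:
    def __init__(self, n_words):
        self.words = [0] * n_words
        self.valid = [True] * (n_words // LINE_WORDS)
        self.slow_checks = 0            # how often the out-of-line path ran

    def invalidate_line(self, line):
        # Mark the line invalid and overwrite its data with the cookie.
        self.valid[line] = False
        start = line * LINE_WORDS
        for i in range(start, start + LINE_WORDS):
            self.words[i] = MAGIC_COOKIE

    def out_of_line_check(self, addr, fetch_line):
        # Slow path: consult the real state; fetch the line only if invalid.
        self.slow_checks += 1
        line = addr // LINE_WORDS
        if not self.valid[line]:
            start = line * LINE_WORDS
            self.words[start:start + LINE_WORDS] = fetch_line(line)
            self.valid[line] = True
        return self.words[addr]

    def load(self, addr, fetch_line):
        value = self.words[addr]        # ld r1, r2[300]
        if value == MAGIC_COOKIE:       # if r1 == magic_cookie
            value = self.out_of_line_check(addr, fetch_line)
        return value
```

A valid word that happens to equal the cookie still takes the slow path, but the out-of-line check finds the line valid and returns the data unchanged, so the scheme stays correct at the cost of an occasional extra check.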

8 Future Directions
Simultaneous Multithreading
Single-Chip MP
MultiScalar Processors (Wednesday)

9 Multithreaded Processors
Exploit thread-level parallelism to improve performance
–Multiple Program Counters
Thread
–independent programs (multiprogramming)
–threads from same program

10 Denelcor HEP
General purpose scientific computer
Organized as MP
–up to 16 processors
–each processor multithreaded
–up to 128 memory modules
–up to 4 I/O cache modules
–Three-input switches and chaotic routing

11 HEP Processor Organization
Multiple contexts (threads)
–each has own Program Status Word (PSW)
PSWs circulate in control loop
–control and data loops pipelined 8 deep
–PSW in control loop can circulate no faster than data in data loop
–PSW at queue head fetches and starts execution of next instruction
Clock period: 100 ns
–8 PSWs in control loop => 10 MIPS
–Each thread gets 1/8 of the processor
–Maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer)
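The HEP throughput figures follow from simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes one instruction starts per clock whenever the control loop is full, which is how the slide's numbers work out.

```python
def hep_throughput(clock_period_ns, threads_in_loop):
    """Return (peak MIPS, MIPS per thread) for a full control loop."""
    # 1e9 ns per second / period gives instructions per second;
    # dividing by 1e6 converts to MIPS, hence the factor of 1000.
    peak_mips = 1_000 / clock_period_ns
    # Threads take turns, so each gets an equal share of the issue slots.
    per_thread_mips = peak_mips / threads_in_loop
    return peak_mips, per_thread_mips


peak, per_thread = hep_throughput(clock_period_ns=100, threads_in_loop=8)
print(peak, per_thread)   # 10.0 1.25
```

The processor only reaches its peak when at least 8 threads are available; a single thread can never exceed its 1/8 share, because its PSW must travel the full 8-stage loop between instructions.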

12 Simultaneous Multithreading
Goal: use hardware resources more efficiently
–especially for superscalar processors
Assume 4-issue superscalar
[Figure: issue slots per cycle for one thread's instructions, labeled Horizontal Waste and Vertical Waste]
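The two kinds of waste can be counted from a per-cycle issue trace. The sketch below uses the usual definitions: a cycle that issues nothing wastes all of its slots vertically, and a cycle that issues something but less than the full width wastes the remainder horizontally. The function name and the example trace are made up for illustration.

```python
def issue_slot_waste(issued_per_cycle, issue_width=4):
    """Classify unused issue slots as vertical or horizontal waste."""
    used = vertical = horizontal = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        used += issued
        if issued == 0:
            vertical += issue_width               # whole cycle is empty
        else:
            horizontal += issue_width - issued    # partially filled cycle
    return {
        "total": issue_width * len(issued_per_cycle),
        "used": used,
        "vertical": vertical,
        "horizontal": horizontal,
    }


# Hypothetical 6-cycle trace on a 4-issue machine.
print(issue_slot_waste([2, 0, 1, 4, 0, 3]))
# {'total': 24, 'used': 10, 'vertical': 8, 'horizontal': 6}
```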

13 Operation of Simultaneous Multithreading
Standard multithreading can reduce vertical waste
Issue from multiple threads in same clock cycle
Eliminate both horizontal and vertical waste
[Figure: issue slots filled by thread instructions under Standard Multithreading and under Simultaneous Multithreading]
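A toy model makes the difference concrete. It is a deliberate simplification, not a processor simulator: `ready[t][c]` is an assumed count of instructions thread `t` could issue in cycle `c`, unissued instructions do not carry over to later cycles, and the function names and example numbers are hypothetical.

```python
def standard_multithreading(ready, width=4):
    """One thread issues per cycle; pick the thread with the most ready work."""
    cycles = len(ready[0])
    return [min(max(thread[c] for thread in ready), width)
            for c in range(cycles)]


def simultaneous_multithreading(ready, width=4):
    """Any mix of threads may fill the issue slots of a cycle."""
    cycles = len(ready[0])
    return [min(sum(thread[c] for thread in ready), width)
            for c in range(cycles)]


def unused_slots(issued, width=4):
    """Total issue slots left empty over the trace."""
    return sum(width - n for n in issued)


# Hypothetical ready counts for three threads over three cycles.
ready = [[2, 0, 1],
         [1, 3, 0],
         [2, 1, 2]]

single = ready[0]                              # thread 0 running alone
coarse = standard_multithreading(ready)        # [2, 3, 2]
smt = simultaneous_multithreading(ready)       # [4, 4, 3]
print(unused_slots(single), unused_slots(coarse), unused_slots(smt))  # 9 5 1
```

In this example, switching among threads removes the empty cycle that thread 0 suffers alone, but every cycle still has unused slots. Issuing from several threads in the same cycle fills most of those as well.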

14 Limitations of SuperScalar Architectures
Instruction Fetch
–branch prediction
–alignment of packet of instructions
Dynamic Instruction Issue
Need to identify ready instructions
Rename Table
–No compares
–Large number of ports (Operands x Width)
Reorder Buffer
–n x Q x O x W 1 bit comparators (src and dest)
–Quadratic increase in queue size with issue width
–PA % of die area to issue queue (56 instruction window)

15 SuperScalar Limitations (Continued)
Instruction Execute
Register File
–more rename registers
–more access ports
–complexity quadratic with issue width
Bypass logic
–complexity quadratic with issue width
–wire delays
Functional Units
–replicate
–add ports to data cache (complexity adds to access time)

16 Why Single Chip MP?
Technology Push
–Benefits of wide issue are limited
–Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
Application Pull
–Applications exhibit parallelism at different grains
–< 10 instructions per cycle (Integer codes)
–> 40 instructions per cycle (FP loops)

17 A 6-Way SuperScalar Processor
[Figure: 21 mm die floorplan with Instruction Fetch; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; Integer Unit; Floating Point Unit; I-Cache (32 KB); D-Cache (32 KB); TLB; L2 Cache (256 KB); External Interface; Clocking & Pads]

18 A 4 x 2 Single Chip Multiprocessor
[Figure: 21 mm x 21 mm die floorplan with Processor #1 to #4; Icache 1 to 4; Dcache 1 to 4; L2 Communication Crossbar; L2 Cache (256 KB); External Interface; Clocking & Pads]

19 Performance Comparison

20 Summary of Performance
4 x 2 MP works well for coarse grain apps
–How well would Message Passing Architecture do?
–Can SUIF handle pointer intensive codes?
For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue