On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.

2 © Alvin R. Lebeck 2006 CPS 220
Administrivia
Projects
Presentations Dec 5 & 7
Documents ~10 pages
– Good writing is important
– Progress is important
Final is Dec 11 (7pm to 10pm)

Multithreaded Processors
Exploit thread-level parallelism to improve performance
– Multiple Program Counters
Thread
– independent programs (multiprogramming)
– threads from same program

Denelcor HEP
General purpose scientific computer
Organized as MP
– up to 16 processors
– each processor multithreaded
– up to 128 memory modules
– up to 4 I/O cache modules
– Three-input switches and chaotic routing

HEP Processor Organization
Multiple contexts (threads)
– each has its own Program Status Word (PSW)
PSWs circulate in control loop
– control and data loops pipelined 8 deep
– PSW in control loop can circulate no faster than data in data loop
– PSW at queue head fetches and starts execution of next instruction
Clock period: 100 ns
– 8 PSWs in control loop => 10 MIPS
– Each thread gets 1/8 of the processor
– Maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer)
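
The HEP numbers above follow from simple arithmetic, sketched here in Python. This is a minimal model, not a description of the real machine: it assumes one instruction completes per clock when the loop is full and that each thread has at most one instruction in flight. The function and parameter names are made up for illustration.

```python
def hep_throughput(clock_period_ns, pipeline_depth, active_threads):
    """Return (aggregate_mips, per_thread_mips) for a barrel-style pipeline.

    Assumptions (not from the slides): one instruction completes per
    clock when the loop is full, and a thread cannot issue again until
    its previous instruction has gone around the whole loop.
    """
    # One instruction per clock: a 100 ns clock gives 10 MIPS at peak.
    peak_mips = 1000.0 / clock_period_ns
    # A thread issues at most once every `pipeline_depth` cycles; with
    # more threads than pipeline stages, it waits its turn even longer.
    per_thread = peak_mips / max(pipeline_depth, active_threads)
    aggregate = per_thread * active_threads
    return aggregate, per_thread


if __name__ == "__main__":
    print(hep_throughput(100, 8, 8))  # full loop
    print(hep_throughput(100, 8, 4))  # half-empty loop
```

Under these assumptions a single thread never exceeds 1.25 MIPS, and the processor only reaches 10 MIPS once at least 8 threads are ready to run.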

Simultaneous Multithreading
Goal: use hardware resources more efficiently
– especially for superscalar processors
Assume 4-issue superscalar Alpha
[Figure: issue slots per cycle for one thread; labels: Thread, Instruction, Horizontal Waste, Vertical Waste]

Operation of Simultaneous Multithreading
Standard multithreading can reduce vertical waste
Issue from multiple threads in same clock cycle
Eliminate both horizontal and vertical waste
Larger Register Files
[Figure: thread instructions in issue slots; panels: Simultaneous Multithreading, Standard Multithreading]
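
The difference between the two kinds of waste can be illustrated with a toy issue-slot model in Python. It is only a sketch under strong assumptions: each thread is a list giving how many instructions it has ready in each cycle, unissued instructions are not carried over, and "standard multithreading" is modelled as picking the single thread with the most ready instructions each cycle. All names are hypothetical.

```python
def waste(issued_per_cycle, width):
    """Return (horizontal, vertical) waste, both counted in issue slots."""
    # Vertical waste: cycles in which nothing issues at all.
    vertical = sum(width for n in issued_per_cycle if n == 0)
    # Horizontal waste: unused slots in cycles that issue something.
    horizontal = sum(width - n for n in issued_per_cycle if n > 0)
    return horizontal, vertical


def single_thread(ready, width):
    """Superscalar issue from one thread."""
    return [min(n, width) for n in ready]


def standard_mt(threads, width):
    """One thread per cycle: pick the thread with the most ready work."""
    return [min(max(cycle), width) for cycle in zip(*threads)]


def smt(threads, width):
    """All threads may share the issue slots of a single cycle."""
    return [min(sum(cycle), width) for cycle in zip(*threads)]


if __name__ == "__main__":
    width = 4
    threads = [[2, 0, 1, 2],   # ready instructions per cycle, thread 0
               [1, 3, 0, 2]]   # ready instructions per cycle, thread 1
    print("single  ", waste(single_thread(threads[0], width), width))
    print("standard", waste(standard_mt(threads, width), width))
    print("smt     ", waste(smt(threads, width), width))
```

In this example, switching threads each cycle removes the empty cycle (vertical waste) but leaves partly filled cycles, while sharing slots across threads also shrinks the horizontal waste.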

Limitations of SuperScalar Architectures
Instruction Fetch
– branch prediction
– alignment of packet of instructions
Dynamic Instruction Issue
Need to identify ready instructions
Rename Table
– No compares
– Large number of ports (Operands x Width)
Issue Queue Size
– n x Q x O x W 1 bit comparators (src and dest)
– Quadratic increase in queue size with issue width
– PA % of die area to issue queue (56 instruction window)
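
A rough Python sketch of the comparator count above. The slide does not define its symbols, so this reads n as register-tag bits, Q as issue-queue entries, O as operands per instruction, and W as issue width; that reading is an assumption. The 7-bit tag and the 14 queue entries per issue slot are also made-up illustration values, chosen only so that a 4-wide machine has a 56-entry window.

```python
def issue_queue_comparators(tag_bits, queue_entries, operands, issue_width):
    """1-bit comparators needed to match result tags against queued operands.

    Symbol meanings are assumed: n=tag_bits, Q=queue_entries,
    O=operands, W=issue_width.
    """
    return tag_bits * queue_entries * operands * issue_width


def comparators_for_width(issue_width, entries_per_slot=14, tag_bits=7, operands=2):
    """Assume the queue grows in proportion to the issue width."""
    queue_entries = entries_per_slot * issue_width
    return issue_queue_comparators(tag_bits, queue_entries, operands, issue_width)


if __name__ == "__main__":
    for w in (2, 4, 8):
        print(w, comparators_for_width(w))
```

If the queue has to grow in step with the issue width, doubling the width quadruples the comparator count, which is one way to read the quadratic growth noted above.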

SuperScalar Limitations (Continued)
Instruction Execute
Register File
– more rename registers
– more access ports
– complexity quadratic with issue width
Bypass logic
– complexity quadratic with issue width
– wire delays
Functional Units
– replicate
– add ports to data cache (complexity adds to access time)

Why Single Chip MP?
Technology Push
– Benefits of wide issue are limited
– Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
Application Pull
– Applications exhibit parallelism at different grains
– < 10 instructions per cycle (Integer codes)
– > 40 instructions per cycle (FP loops)

A 6-Way SuperScalar Processor
[Figure: floorplan of a 21 mm die; blocks: Instruction Fetch; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; Integer Unit; Floating Point Unit; I-Cache (32 KB); D-Cache (32 KB); TLB; L2 Cache (256 KB); External Interface; Clocking & Pads]

A 4 x 2 Single Chip Multiprocessor
[Figure: floorplan of a 21 mm x 21 mm die; blocks: Processor #1 to #4, Icache 1 to 4, Dcache 1 to 4, L2 Communication Crossbar, L2 Cache (256 KB), External Interface, Clocking & Pads]

Performance Comparison

Summary of Performance
4 x 2 MP works well for coarse grain apps
– How well would Message Passing Architecture do?
– Can SUIF handle pointer intensive codes?
For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue