EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011
Lectures 8 & 9 – Parallelism in Hardware
Mattan Erez, The University of Texas at Austin
(c) Mattan Erez

Simplified view of a pipelined processor
Stages: fetch, decode, dispatch (issue), reg. access, execute, write-back, commit
Plus a sequencer and the memory hierarchy

Simplified view of a pipelined processor
  1: add r4, r1, r2
  2: add r5, r1, r3
  3: add r6, r2, r3
All three instructions are independent; each moves through the stages F D I R E W C one cycle behind the one before it
What are the parallel resources?

Simplified view of a pipelined processor
  1: add r4, r1, r2
  2: add r5, r1, r4
  3: add r6, r5, r3
Here instruction 2 reads r4 produced by instruction 1, and instruction 3 reads r5 produced by instruction 2, so the pipeline (F D I R E W C) must forward or stall to respect the dependences

Simplified view of a pipelined processor
  1: add r4, r1, r2
  2: add r5, r1, r3
  3: add r6, r2, r3
(Pipeline stages F D I R E W C as before)
Communication and synchronization mechanisms?

Simplified view of a pipelined processor
  1: add r4, r1, r2
  2: ld  r5, r4
  3: add r6, r5, r3
  4: add r7, r1, r3
The load (stage L, memory access) adds latency; instruction 3 waits for the loaded value, while instruction 4 is independent of it

Simplified view of an OOO pipelined processor
Stages: fetch, decode, rename, dispatch (issue), issue (dispatch), reg. access, execute, write-back, commit, with a sequencer and a scheduler
  1: add r4, r1, r2
  2: ld  r5, r4
  3: add r6, r5, r3
  4: add r7, r1, r3
Communication and synchronization mechanisms?

Pipelining Summary
Pipelining is using parallelism to hide latency: do useful work while waiting for other work to finish
Multiple parallel components, not multiple instances of the same component
Examples:
  Execution pipeline
  Memory pipelines: issue multiple requests to memory without waiting for previous requests to complete
  Software pipelines: overlap different software blocks to hide latency (computation/communication); see the sketch below
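To make the software-pipelining point concrete, here is a minimal C sketch (not from the slides) of double buffering: while block i is being processed, block i+1 is already being fetched, so computation overlaps with communication. The fetch_block and process_block helpers and the BLOCK size are hypothetical placeholders; in this sketch the fetch is a plain synchronous call, and the comments mark where an asynchronous fetch (DMA, prefetch, non-blocking receive) would be started on a real system.

#include <stddef.h>
#include <string.h>

#define BLOCK 1024                    /* hypothetical block size */

/* Hypothetical stand-in for a slow producer (DMA, network, disk). */
static void fetch_block(float *dst, size_t n) { memset(dst, 0, n * sizeof *dst); }

/* Hypothetical compute kernel that consumes one block. */
static float process_block(const float *src, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += src[i];
    return sum;
}

/* Software pipeline via double buffering: the fetch of block i+1 is issued
   before block i is processed. Here the fetch is synchronous; on a real
   system it would be started asynchronously so the two stages overlap. */
float pipelined_sum(size_t nblocks) {
    float buf[2][BLOCK];
    float total = 0.0f;

    fetch_block(buf[0], BLOCK);                    /* prologue: fill first buffer */
    for (size_t i = 0; i < nblocks; i++) {
        if (i + 1 < nblocks)
            fetch_block(buf[(i + 1) & 1], BLOCK);  /* stage 1: fetch next block   */
        total += process_block(buf[i & 1], BLOCK); /* stage 2: process this block */
    }
    return total;
}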

Parallel Execution
Concurrency: what are the multiple resources?
Communication and storage: what is being shared? What is being partitioned?
Synchronization: implicit? explicit?

Resources in a parallel processor/system
Execution: ALUs, cores/processors
Control: sequencers, instructions, OOO schedulers
State: registers, memories
Networks

Communication and synchronization
Synchronization:
  Clock – explicit compiler order
  Explicit signals (e.g., dependences)
  Implicit signals (e.g., flush/stall) – more for pipelining than multiple ALUs
Communication:
  Bypass networks
  Registers
  Memory
  Explicit (over some network)

Organizations for ILP (for multiple ALUs)

Superscalar (ILP for multiple ALUs)
Diagram: one sequencer, one scheduler, multiple ALUs, memory hierarchy
Synchronization: explicit signals (dependences)
Communication: bypass, registers, memory
Shared: sequencer, OOO, registers, memories, net, ALUs
Partitioned: instructions
How many ALUs?
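As a hedged illustration (not from the slides) of what the superscalar hardware exploits: the four multiplies below are independent, so an OOO core can issue them to different ALUs in the same cycle; the final adds form a short dependence tree, so they only partially overlap, with the hardware tracking readiness through explicit dependence signals. The dot4 name is mine.

/* Four independent multiplies: candidates for parallel issue on a
   superscalar core with multiple ALUs. */
float dot4(const float a[4], const float b[4]) {
    float p0 = a[0] * b[0];
    float p1 = a[1] * b[1];
    float p2 = a[2] * b[2];
    float p3 = a[3] * b[3];
    /* The reduction is a two-level dependence tree: (p0+p1) and (p2+p3)
       can execute in parallel, but the final add must wait for both. */
    return (p0 + p1) + (p2 + p3);
}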

SMT/TLS (ILP for multiple ALUs)
Diagram: two sequencers feeding one scheduler, multiple ALUs, memory hierarchy
Synchronization: explicit signals (dependences)
Communication: bypass, registers, memory
Shared: OOO, registers, memories, net, ALUs
Partitioned: sequencer, instructions, arch. registers
Why is this ILP? How many threads?

VLIW (ILP for multiple ALUs)
Diagram: one sequencer, multiple ALUs, memory hierarchy
Synchronization: clock + compiler
Communication: registers, memory, bypass
Shared: sequencer, OOO, registers, memories, net
Partitioned: instructions, ALUs
How many ALUs?
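A hedged sketch (not from the slides) of what VLIW asks of the compiler: the loop body below has independent operations that a VLIW compiler could schedule statically into one wide instruction under the clock + compiler synchronization model. The bundle shown in the comment is illustrative pseudo-ISA, not a real encoding, and the function name is mine.

#include <stddef.h>

/* Independent operations per iteration; a VLIW compiler could pack them
   into one bundle, e.g. (illustrative only):
     { fmul p = a[i]*b[i] | fadd s = c[i]+d[i] | fload q = e[i] | nop } */
void vliw_friendly(float *restrict out,
                   const float *restrict a, const float *restrict b,
                   const float *restrict c, const float *restrict d,
                   const float *restrict e, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float p = a[i] * b[i];   /* multiply unit slot */
        float s = c[i] + d[i];   /* adder slot         */
        float q = e[i];          /* load slot          */
        out[i] = p + s + q;      /* dependent adds land in later bundles */
    }
}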

Explicit Dataflow (ILP for multiple ALUs)
Diagram: one sequencer, a scheduler per ALU, memory hierarchy
Synchronization: explicit signals
Communication: registers + explicit
Shared: sequencer, memories, net
Partitioned: instructions, OOO, ALUs

DLP for multiple ALUs: from HW – this is SIMD

SIMD (DLP for multiple ALUs)
Diagram: one sequencer, multiple ALUs each with a local memory, memory hierarchy
Synchronization: clock + compiler
Communication: explicit
Shared: sequencer, instructions
Partitioned: registers, memories, ALUs
Sometimes: memories, net
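A hedged, concrete example (not from the lecture) of the SIMD organization, in C with SSE intrinsics: a single instruction performs the same add across four lanes, so one sequencer and one instruction stream drive multiple ALUs. The add_simd name and the multiple-of-4 assumption on n are mine.

#include <immintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i]; assumes n is a multiple of 4 to keep the sketch short. */
void add_simd(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats (one lane group)   */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 parallel adds */
        _mm_storeu_ps(c + i, vc);          /* store 4 results                  */
    }
}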

Vectors (DLP for multiple ALUs)
Diagram: one sequencer, multiple ALUs each with a local memory, memory hierarchy
Vectors: memory addresses are part of single-instruction and not part of multiple-data
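To illustrate the address distinction (a hedged sketch, not from the slides): in a classic vector ISA, a whole strided access is one instruction whose base address and stride come with the instruction, rather than being computed per lane from data. The loop below is the kind of access a single strided vector load could cover; the commented vld form is illustrative pseudo-ISA and the function name is mine.

#include <stddef.h>

/* out[i] = k * a[i * stride]. A vector machine could express all of the
   loads as one instruction, e.g. (illustrative only):
     vld v1, (a), stride      ; base + stride encoded with the instruction
   so the memory addresses are part of the single instruction,
   not part of the multiple data. */
void scale_strided(float *out, const float *a, size_t stride, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        out[i] = k * a[i * stride];
}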

TLP for multiple ALUs

MIMD – shared memory (TLP for multiple ALUs)
Diagram: a sequencer and scheduler per ALU/core, all sharing one memory hierarchy
Synchronization: explicit, memory
Communication: memory
Shared: memories, net
Partitioned: sequencer, instructions, OOO, ALUs, registers, some nets
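A hedged sketch of the shared-memory MIMD style in C with POSIX threads (not from the lecture): each thread runs its own instruction stream on its own core, communicates through the shared partial array, and synchronizes explicitly, here with pthread_join (a mutex or barrier would play the same role). The worker logic and names are mine.

#include <pthread.h>
#include <stdio.h>

#define N 4
static int partial[N];          /* shared memory used for communication */

static void *worker(void *arg) {
    int id = (int)(long)arg;    /* each thread has its own sequencer/registers */
    partial[id] = id * id;      /* write result to shared memory */
    return NULL;
}

int main(void) {
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    int sum = 0;
    for (int i = 0; i < N; i++) {
        pthread_join(t[i], NULL);   /* explicit synchronization */
        sum += partial[i];
    }
    printf("sum = %d\n", sum);
    return 0;
}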

MIMD – distributed memory
Diagram: a sequencer, scheduler, and private memory hierarchy per ALU/core, connected by a network
Synchronization: explicit
Communication: explicit (messages over the network)
Shared: net
Partitioned: sequencer, instructions, OOO, ALUs, registers, some nets, memories
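For the distributed-memory case, a hedged MPI sketch (not from the lecture): each process has a private address space, so all communication is explicit messages over the network, and the receive itself acts as synchronization. The reduction pattern shown here is mine, chosen only to illustrate the organization.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = rank * rank;      /* lives in this process's private memory */
    if (rank != 0) {
        /* Explicit communication over the network: send to rank 0. */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        int size, sum = value, v;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        for (int src = 1; src < size; src++) {
            /* Receiving doubles as synchronization with each sender. */
            MPI_Recv(&v, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += v;
        }
        printf("sum = %d\n", sum);
    }
    MPI_Finalize();
    return 0;
}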

Summary of communication and synchronization

Style         Synchronization          Communication
Superscalar   explicit signals (RS)    registers + bypass
VLIW          clock + compiler         registers (bypass?)
Dataflow      explicit signals         registers + explicit
SIMD          clock + compiler         explicit
MIMD          explicit, memory         memory + explicit

Summary of sharing in ILP HW (S = shared, P = partitioned, B = both)

Style         Seq   Inst   OOO   Regs   Mem   ALUs   Net
Superscalar   S     P      S     S      S     S      S
SMT/TLS       P     P      S     S/P    S     S      S
VLIW          S     P      N/A   S      S     P      S
Dataflow      S     P      P     B      S     P      S

Summary of sharing in DLP and TLP (S = shared, P = partitioned, B = both)

Style    Seq   Inst   OOO   Regs   Mem   ALUs   Net
Vector   S     S      N/A   P      s     P      B
SIMD     S     S      N/A   P      p     P      B
MIMD     P     P      P     P      S/P   P      S