Slide 1: Review of Chapters 3 & 4
Copyright © 2012, Elsevier Inc. All rights reserved.
Slide 2: Chapter 3 Review
- Baseline: simple MIPS five-stage pipeline (IF, ID, EX, MEM, WB)
- How can Instruction-Level Parallelism (ILP) be exploited to improve performance?
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
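The CPI decomposition on this slide can be sketched directly; the stall rates in the example below are hypothetical numbers for illustration, not figures from the text:

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal pipeline CPI plus stall cycles per instruction
    from structural, data, and control hazards."""
    return ideal_cpi + structural + data_hazard + control

# Hypothetical example: ideal CPI of 1.0 with 0.05 / 0.20 / 0.15
# stall cycles per instruction from the three hazard classes.
cpi = pipeline_cpi(1.0, 0.05, 0.20, 0.15)
```

Each term the slide names maps to one argument, which is why reducing any single stall category lowers overall CPI independently of the others.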
Slide 3: Hazards & Stalls
- Structural hazards
  - Cause: resource contention
  - Solution: add more resources and better scheduling
- Control hazards
  - Cause: branch instructions change the program flow
  - Solutions: loop unrolling, branch prediction, hardware speculation
- Data hazards
  - Cause: dependences
    - True data dependence: a property of the program (RAW)
    - Name dependence: reuse of registers (WAR and WAW)
  - Solutions: loop unrolling, dynamic scheduling, register renaming, hardware speculation
Slide 4: Multiple Issue (reducing the ideal CPI)
Slide 5: Loop Unrolling (p. 161)
- Finds that the loop iterations are independent
- Uses different registers to avoid unnecessary constraints (name dependences)
- Eliminates the extra test and branch instructions (control dependences)
- Interchanges load and store instructions where possible (to fill stall cycles)
- Schedules the code to avoid or mitigate stalls while preserving true data dependences
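The book's example is MIPS assembly; as a language-neutral sketch, the same transformation (unroll by four, separate accumulators to break name dependences, one cleanup loop for leftover iterations) looks like:

```python
def sum_unrolled(a):
    """Sum a list with the loop body unrolled four times: one test-and-branch
    per four elements, and four independent accumulators expose ILP."""
    s0 = s1 = s2 = s3 = 0
    n4 = len(a) - len(a) % 4       # largest multiple of 4
    i = 0
    while i < n4:                  # unrolled body: no name dependences
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    for x in a[n4:]:               # cleanup loop for the remainder
        total += x
    return total
```

In hardware terms, the four adds in the body target distinct "registers" and can be scheduled in parallel, which is the point of the transformation.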
Slide 6: Branch Prediction
- 1-bit or 2-bit predictor: a local predictor
  - Uses the past outcomes of the branch itself as the indicator
- Correlated predictor: a global predictor
  - Uses the past outcomes of correlated branches as the indicator
- (m, n) predictor: a two-level predictor
  - Total bits in an (m, n) predictor: 2^m * n * number of prediction entries
- Tournament predictor: an adaptive predictor
  - Combines a local and a global predictor
  - Selects the right predictor for a particular branch
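Two of these points can be made concrete with a small sketch (the function names are mine): a 2-bit saturating-counter predictor, and the bit-count formula for an (m, n) predictor.

```python
def predict_2bit(outcomes):
    """Simulate one 2-bit saturating counter. States 0-1 predict not-taken,
    2-3 predict taken; two consecutive mispredictions are needed to flip
    the prediction. Returns the number of correct predictions."""
    state = 0                      # start in strongly not-taken
    correct = 0
    for taken in outcomes:
        prediction = state >= 2    # predict taken in states 2 and 3
        correct += (prediction == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

def mn_predictor_bits(m, n, entries):
    """Storage for an (m, n) predictor: 2^m n-bit counters per entry."""
    return (2 ** m) * n * entries
```

For a loop branch taken nine times and then not taken once, the counter mispredicts twice while warming up and once at loop exit, getting 7 of 10 right.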
Slide 7: Dynamic Scheduling
- Hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior
- Simple pipeline: in-order issue, in-order execution, in-order completion
- Dynamic scheduling: in-order issue, out-of-order execution, out-of-order completion
  - Out-of-order execution introduces WAR and WAW hazards
  - Out-of-order completion leads to unexpected exception behavior
Slide 8: Dynamic Scheduling
- Addressing the WAW and WAR hazards introduced by out-of-order execution
- Tomasulo's approach: register renaming (reservation stations, common data bus)
  - Stages: Issue, Execute, Write Result
- Basic structure of Tomasulo's algorithm: p. 173
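A minimal sketch of the renaming idea behind Tomasulo's reservation stations (the tuple representation and function name are mine, not the book's): every new definition gets a fresh physical register, so WAR and WAW hazards on architectural names simply disappear.

```python
def rename(instrs):
    """Rename architectural destination registers to fresh physical ones.
    Each instruction is (dest, src1, src2, ...) using architectural names;
    sources read the *current* mapping, removing WAR and WAW hazards."""
    mapping = {}                   # architectural name -> physical name
    next_phys = 0
    renamed = []
    for dest, *srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # read before redefining
        phys = f"p{next_phys}"                     # fresh destination
        next_phys += 1
        mapping[dest] = phys
        renamed.append((phys, *srcs))
    return renamed
```

After renaming, two writes to the same architectural register (a WAW hazard) target different physical registers and may complete in either order.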
Slide 9: Dynamic Scheduling
- Addressing the unexpected exception behavior caused by out-of-order completion
- Hardware speculation: reorder buffer (passes results along, guaranteeing in-order completion)
  - Stages: Issue, Execute, Write Result, Commit
- Basic structure of hardware speculation: p. 185
- Result: a pipeline with dynamic scheduling and in-order issue, out-of-order execution, in-order completion
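The reorder buffer's guarantee can be sketched as follows (a toy model, not the book's structure): instructions finish execution in any order, but commit only from the head of the buffer, so the committed order is always program order.

```python
def commit_in_order(n, finished_order):
    """Instructions 0..n-1 finish execution in `finished_order` (possibly
    out of order); the reorder buffer commits only its head entry once
    that entry has finished. Returns the commit order."""
    done = [False] * n
    head = 0
    committed = []
    for i in finished_order:           # out-of-order completion events
        done[i] = True
        while head < n and done[head]: # drain finished entries at the head
            committed.append(head)
            head += 1
    return committed
```

Even if instruction 2 finishes first, it waits in the buffer and commits only after 0 and 1, which is what makes precise exceptions possible.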
Slide 10: Decreasing the CPI: Multiple Issue
- Statically scheduled superscalar processors
- VLIW (very long instruction word) processors
- Dynamically scheduled superscalar processors
- See the summary table on p. 194
Slide 11: Chapter 4 Review
- SISD (single instruction, single data) architecture: the examples in Chapter 3
- SIMD (single instruction, multiple data) architecture: exploits data-level parallelism
  - Vector architectures
  - Multimedia SIMD instruction set extensions
  - Graphics processing units (GPUs)
- Data independence
Slide 12: Vector Architecture
- Primary components of VMIPS:
  - Vector registers
  - Vector functional units
  - Vector load/store unit
  - A set of scalar registers
- Basic structure of a vector architecture: p. 265
Slide 13: Vector Architecture
- Execution time depends on:
  - Length of the operand vectors
  - Structural hazards among the operations
  - Data dependences
- Convoy: the set of vector instructions that could potentially execute together (no structural hazards)
  - Chaining: addresses data dependences within a convoy
- Chime: the unit of time taken to execute one convoy
Slide 14: Vector Architecture
- Execution time of a vector sequence with m convoys and a vector length of n: approximately m * n clock cycles
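The chime estimate is a one-line calculation; as a sketch (ignoring start-up overhead, which the model deliberately leaves out):

```python
def vector_cycles(m_convoys, vector_length):
    """Chime-model execution time: m convoys, each taking about one cycle
    per element for a vector of length n, so roughly m * n cycles."""
    return m_convoys * vector_length

# Hypothetical example: 3 convoys over 64-element vectors.
cycles = vector_cycles(3, 64)
```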
Slide 15: Vector Architecture
- Multiple lanes: execute a vector faster than one element per clock cycle
- Vector-Length Register (MTC1 VLR, R1): handles programs where the vector length differs from the length of the vector register
  - Strip mining: used when the vector length exceeds MVL
- Vector mask registers (CVM, POP): handle IF statements in vector loops
- Memory banks: supply bandwidth for the vector load/store units by allowing multiple independent data accesses
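Strip mining can be sketched as a small planner (a sketch of the standard scheme: an odd-sized first piece of n mod MVL elements, then full-MVL strips; the function name is mine):

```python
def strip_mine(n, mvl):
    """Split a loop of n iterations into strips of at most MVL elements.
    Returns (start, length) pairs: the first strip takes the n mod MVL
    remainder (if any), and every later strip is exactly MVL long."""
    strips = []
    low = 0
    first = n % mvl or mvl         # remainder first; a full strip if none
    while low < n:
        size = first if low == 0 else mvl
        strips.append((low, size))
        low += size
    return strips
```

Each strip then runs with VLR set to the strip length, so no vector operation ever exceeds the physical register length.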
Slide 16: Vector Architecture
- Stride: handles multidimensional arrays
  - LVWS V1, (R1, R2)
  - SVWS (R1, R2), V1
- Gather-scatter: handles sparse matrices
  - LVI V1, (R1, V2)
  - SVI (R1, V2), V1
- Programming vector architectures: program structure affects performance
  - Most optimizations improve memory accesses, and most are modifications to the vector instruction set
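The behavior of the indexed loads and stores above can be sketched in a few lines (memory modeled as a flat list; base and index registers as plain integers):

```python
def gather(memory, base, index_vector):
    """LVI-style gather: load elements at base + index for each index in
    the index vector, collecting scattered data into a dense vector."""
    return [memory[base + k] for k in index_vector]

def scatter(memory, base, index_vector, values):
    """SVI-style scatter: store a dense vector of values back to the same
    scattered locations base + index."""
    for k, v in zip(index_vector, values):
        memory[base + k] = v
```

This is how sparse-matrix code touches only the nonzero entries: the index vector holds their offsets, and gather/scatter move them to and from dense vector registers.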
Slide 17: SIMD Instruction Set Extensions
- Observation: many media applications operate on narrower data types than the 32-bit processors were optimized for
  - 8 bits for each of the three primary colors
  - 8 bits for transparency
- Limitations:
  - The number of data operands is fixed in the opcode
  - No sophisticated addressing modes of vector architectures (stride, gather-scatter)
  - No mask registers
- Roofline visual performance model
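The roofline model mentioned on this slide reduces to taking the minimum of two bounds; a sketch with hypothetical machine numbers:

```python
def roofline(peak_gflops, peak_bw_gbs, arithmetic_intensity):
    """Attainable GFLOP/s is the lower of the compute roof (peak FLOP/s)
    and the memory roof (peak bandwidth * FLOPs per byte)."""
    return min(peak_gflops, peak_bw_gbs * arithmetic_intensity)

# Hypothetical machine: 100 GFLOP/s peak, 10 GB/s memory bandwidth.
# At 2 FLOPs/byte the kernel is memory-bound; at 50 it is compute-bound.
```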
Slide 18: SIMD Instruction Set Extensions for Multimedia: Implementations
- Intel MMX (1996): eight 8-bit integer ops or four 16-bit integer ops
- Streaming SIMD Extensions (SSE) (1999): eight 16-bit integer ops; four 32-bit or two 64-bit integer/FP ops
- Advanced Vector Extensions (AVX, 2010): four 64-bit integer/FP ops
- Operands must be in consecutive, aligned memory locations
- Generally designed to accelerate carefully written libraries rather than compiler-generated code
- Advantages over vector architectures:
  - Cost little to add to the standard ALU and are easy to implement
  - Require little extra state, so context switches are cheap
  - Require little extra memory bandwidth
  - No virtual-memory problems with cross-page accesses and page faults
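What "eight 8-bit integer ops" in one instruction means can be simulated directly (a behavioral sketch of an MMX-style wraparound packed add, not real intrinsics):

```python
def packed_add8(x, y):
    """Simulate a packed add of eight independent 8-bit lanes inside one
    64-bit word: each lane wraps modulo 256, with no carry propagating
    into its neighbor."""
    result = 0
    for lane in range(8):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)
    return result
```

Note how lane 0 overflowing (0xFF + 0x01) wraps to 0x00 without disturbing lane 1, which is exactly the per-lane isolation a scalar 64-bit add would violate.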
Slide 19: Graphics Processing Units
- Challenge: not simply getting good performance on the GPU, but coordinating the scheduling of computation on the system processor and the GPU, and the transfer of data between system memory and GPU memory
- Heterogeneous architecture and computing: CPU + GPU, each with its own memory, like a distributed system on a node
- CUDA or OpenCL languages
- Programming model: Single Instruction, Multiple Thread (SIMT)
Slide 20: Threads, Blocks, Grids
- A thread is associated with each data element
- Threads are organized into blocks; blocks are organized into a grid
- GPU hardware handles thread management, not the application or the OS
Slide 21: NVIDIA GPU Architecture
- Similarities to vector machines:
  - Works well on data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Slide 22: Terminology
- Threads of SIMD instructions
  - Each has its own PC
  - The thread scheduler uses a scoreboard to dispatch (no data dependences between threads)
  - Keeps track of up to 48 threads of SIMD instructions to hide memory latency
- The thread block scheduler schedules blocks to SIMD processors
- Within each SIMD processor: 32 SIMD lanes (wide and shallow compared to vector processors)