Pipelining and Parallelism
Mark Staveley

Outline
° Quantitative Analysis of Program Execution
° CISC vs RISC
° Pipelining
° Superscalar and VLIW Architectures
° Parallel Architecture
° Interconnection Networks
References: Murdocca/Heuring (ch. 10), Heuring/Jordan (ch. 3)

Quantitative Analysis of Program Execution
° Prior to the late 1970s, computer architects focused on increasing the complexity of instructions and addressing modes to improve computer performance.
- This approach led to the Complex Instruction Set Computer (CISC).
° Knuth showed that most of the statements in a typical Fortran program are simple assignments; arithmetic and other powerful instructions account for only about 7%.
° Later research by Hennessy and Patterson showed that most complex instructions and addressing modes go unused in a typical program. They pioneered the use of program analysis and benchmarking to evaluate the impact of architecture on performance.

Quantitative Analysis of Program Execution (Cont’d)
° More quantitative metrics: all of these metrics show that there is little or no payoff in increasing the complexity of instructions. Moreover, analysis showed that compilers usually do not take advantage of complex instructions and addressing modes. These observations brought about an evolution from CISC to the Reduced Instruction Set Computer (RISC).
° The focus is to make the frequent case fast and simple:
- Make assignments fast.
- Use only LOAD and STORE to access memory.

Quantitative Analysis of Program Execution (Cont’d)
° Load/Store machine: the typical RISC architecture. Only the LOAD and STORE instructions communicate with memory; all other instructions operate on registers. Memory accesses can be overlapped because instructions have fewer side effects, but a large number of registers is needed.
- A simple instruction set results in a simpler CPU, which frees up space on the microprocessor for other purposes, such as registers and caches.
° Quantitative performance analysis: execution time is the most important performance factor.
Speedup: S = T_wo / T_w; percent speedup = (T_wo - T_w) / T_w × 100,
where T_wo is the execution time without the enhancement and T_w is the execution time with it.

Quantitative Analysis of Program Execution (Cont’d)
° Example: adding a 1 MB cache reduces the execution time of a benchmark from 12 seconds to 8 seconds.
- S = 12/8 = 1.5, or a 50% speedup.
° Suppose t is the machine’s clock period, CPI is the average number of clock cycles per instruction, and IC is the instruction count. Then the total execution time is T = IC × CPI × t. Thus,
S(%) = (IC_wo × CPI_wo × t_wo - IC_w × CPI_w × t_w) / (IC_w × CPI_w × t_w) × 100
° Example: moving from a CPU with a CPI of 5 to a CPU with a CPI of 3.5, with the clock period increased from 100 ns to 120 ns, gives a speedup of:
S(%) = (5 × 100 - 3.5 × 120) / (3.5 × 120) × 100 ≈ 19%
° Reduce execution time by reducing the number of instructions, the average cycles per instruction, or the clock cycle time.
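A minimal sketch of the calculation above in Python (the function name is illustrative; the values are taken from the example):

    def percent_speedup(ic_wo, cpi_wo, t_wo, ic_w, cpi_w, t_w):
        """Percent speedup from the total execution time T = IC * CPI * t."""
        T_wo = ic_wo * cpi_wo * t_wo  # time without the enhancement
        T_w = ic_w * cpi_w * t_w      # time with the enhancement
        return (T_wo - T_w) / T_w * 100

    # CPI drops from 5 to 3.5 while the clock period grows from 100 ns
    # to 120 ns; the instruction count is unchanged.
    print(round(percent_speedup(1, 5, 100, 1, 3.5, 120)))  # -> 19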

CISC vs RISC
° CISC: Complex Instruction Set Computer
- Many complex instructions and addressing modes.
- Suited to a time when memory access times were very high and the number of registers in the CPU was low.
- Some instructions take many steps to execute.
- Not well suited to pipelining, because its instructions vary widely in complexity.
° RISC: Reduced Instruction Set Computer
- All instructions are of fixed length, which simplifies fetching and decoding.
- Few, simple instructions and addressing modes.
- Instructions can be issued into the pipeline at a rate of one per clock cycle. Pipelining allows different instructions to use different parts of the execution unit on each clock cycle, so an instruction can be completed on every clock cycle.
- A Load/Store architecture: all operands must be in registers, so a large number of registers is needed.
- A program may contain more instructions than its CISC counterpart, but it runs faster because the instructions are simpler.

CISC vs RISC (Cont’d)
° Use hardwired control; avoid microcode. Let the compiler do the complex things.
° Other RISC benefits:
- Prefetching: fetch instructions ahead of time to hide the latency of fetching instructions and data.
- Pipelining: begin executing an instruction before the previous instructions have completed.
- Superscalar operation: issue more than one instruction simultaneously (instruction-level parallelism, ILP).
- Delayed loads, stores, and branches: operands may not be available when an instruction attempts to access them.
- Register windows: the ability to switch to a different set of CPU registers with a single command, which alleviates procedure call/return overhead.

Pipelining
° Pipelining takes an assembly-line approach to instruction execution.
° One instruction enters the pipeline at each clock cycle.
° At the end of each stage, on the clock edge, the output is latched and passed to the next stage.
° Once the pipeline is full, one instruction completes on every clock cycle.
° Different pipelines may have different numbers of stages.
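To make the fill behavior concrete, here is a small sketch (not from the slides) of the ideal pipeline timing formula: n instructions on a k-stage pipeline take k + (n - 1) cycles when nothing stalls.

    def pipeline_cycles(n_instructions, n_stages):
        """Cycles to run n instructions on an ideal k-stage pipeline:
        k cycles to fill the pipe, then one completion per cycle."""
        return n_stages + (n_instructions - 1)

    # 100 instructions on a 5-stage pipeline finish in 104 cycles,
    # versus 500 cycles if each instruction used all 5 stages serially.
    print(pipeline_cycles(100, 5))  # -> 104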

Pipelining (Cont’d)
° What if the instruction is a branch?
- Once the pipeline is full and a branch is taken, the pipeline has to be flushed by filling it with no-operations (NOPs), also called pipeline bubbles. This also allows execution to be delayed until it is known whether the branch is taken.
° When a LOAD or STORE occurs, we may have to expand the execute phase from one clock cycle to two. This is known as a delayed load.
° A delayed branch likewise inserts NOPs after the branch. The other approach is branch prediction, or speculatively executing the instructions after the branch.
° Bubbles are also inserted when an interrupt occurs.

Pipelining (Cont’d)
° Analysis of pipeline efficiency
° Example: a CPU has a 5-stage pipeline. When a branch is taken, 4 cycles must be flushed, so the branch penalty is b = 4. The probability that an instruction is a branch is P_b = 0.25, and the probability that a branch is taken is P_t = 0.5. Compute the average number of cycles needed to execute an instruction and the execution efficiency.
- CPI_no-branch = 1. When there are branches, then
CPI_avg = (1 - P_b)(CPI_no-branch) + P_b [P_t (1 + b) + (1 - P_t)(CPI_no-branch)] = 1 + b × P_b × P_t
- Thus, CPI_avg = (1 - 0.25)(1) + 0.25 [0.5 (1 + 4) + (1 - 0.5)(1)] = 1.5 cycles.
- Execution efficiency = CPI_no-branch / CPI_avg = 1/1.5 = 67%.
° The processor runs at 67% of its potential speed as a result of branches, which is still much better than the five cycles per instruction that might be needed without pipelining.
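The same calculation as a short Python sketch (the parameter names are illustrative):

    def avg_cpi(p_branch, p_taken, penalty, cpi_base=1.0):
        """Average CPI when each taken branch flushes `penalty` cycles."""
        return (1 - p_branch) * cpi_base + p_branch * (
            p_taken * (cpi_base + penalty) + (1 - p_taken) * cpi_base)

    cpi = avg_cpi(p_branch=0.25, p_taken=0.5, penalty=4)
    print(cpi)      # -> 1.5
    print(1 / cpi)  # execution efficiency -> 0.666..., about 67%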

Superscalar and VLIW Architectures
° Superscalar architecture: may have one or more separate integer units (IUs), floating-point units (FPUs), and branch processing units (BPUs). With separate functional units, several instructions can execute at the same time. Instructions must be scheduled onto the various execution units and might be executed out of order.
- Out-of-order execution means that instructions need to be examined prior to dispatching them to an execution unit, not only to determine which unit should execute them but also to determine whether executing them out of order would produce an incorrect program because of dependencies between the instructions. A sketch of such a dependency check follows this slide.
- Out-of-order issue, in-order issue, retiring instructions.
° Very Long Instruction Word (VLIW): multiple operations are packed into a single instruction word. The compiler is responsible for organizing multiple operations into the instruction word.
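A minimal sketch of that dependency check (the instruction encoding here is hypothetical, not from the slides): two instructions may be reordered only if they carry no register dependence.

    def can_reorder(instr1, instr2):
        """True if instr2 may safely issue ahead of instr1.
        Each instruction is (dest_reg, set_of_source_regs).
        Checks RAW, WAR, and WAW register dependences."""
        d1, srcs1 = instr1
        d2, srcs2 = instr2
        raw = d1 in srcs2   # instr2 reads what instr1 writes
        war = d2 in srcs1   # instr2 overwrites what instr1 reads
        waw = d1 == d2      # both write the same register
        return not (raw or war or waw)

    # r3 = r1 + r2 followed by r5 = r3 * r4: RAW dependence, keep in order.
    print(can_reorder(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})))  # -> False
    # r3 = r1 + r2 followed by r6 = r4 * r5: independent, may reorder.
    print(can_reorder(("r3", {"r1", "r2"}), ("r6", {"r4", "r5"})))  # -> True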

Parallel Architecture
° Parallel processing: a number of processors work collectively, in parallel, on a common problem.
- MIMD machines include shared-memory multiprocessor systems and message-passing multiprocessor systems.
° Flynn’s taxonomy:
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams
- MIMD: multiple instruction streams, multiple data streams
- MISD: multiple instruction streams, single data stream

Parallel Architecture (Cont’d)
° Speedup = T_sequential / T_parallel
° Amdahl’s law says that if there are even a small number of sequential operations in a given program, then the speedup can be significantly limited. With a sequential fraction f running on P processors:
S = 1 / (f + (1 - f) / P)
° Example: if f = 10% of the program is sequential, then the speedup can be no greater than 10.
- S = 1 / (0.1 + 0.9/10) ≈ 5.3 for P = 10 processors
- S = 1 / (0.1 + 0) = 10 for P = ∞ processors
° Efficiency = Speedup / P
° However, note that we can always increase the problem size of an application.
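A quick sketch of Amdahl’s law in Python, using the example’s numbers:

    def amdahl_speedup(f_sequential, n_procs):
        """Amdahl's law: speedup with sequential fraction f on P processors."""
        return 1.0 / (f_sequential + (1.0 - f_sequential) / n_procs)

    s = amdahl_speedup(0.10, 10)
    print(round(s, 1))       # -> 5.3
    print(round(s / 10, 2))  # efficiency -> 0.53
    print(round(amdahl_speedup(0.10, 10**9), 2))  # -> 10.0, the 1/f ceiling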

Interconnection Networks