Pipelining and Parallelism Mark Staveley

1 Pipelining and Parallelism Mark Staveley Mark.Staveley@mun.ca

2 Outline
° Quantitative Analysis of Program Execution
° CISC vs RISC
° Pipelining
° Superscalar and VLIW Architectures
° Parallel Architecture
° Interconnection Networks
References: Murdocca/Heuring (ch. 10), Heuring/Jordan (ch. 3)

3 Quantitative Analysis of Program Execution
° Prior to the late 1970s, computer architects focused on increasing the complexity of instructions and addressing modes to improve computer performance → Complex Instruction Set Computer (CISC).
° Knuth showed that most of the statements in a typical Fortran program are simple assignments.
° Later research by Hennessy and Patterson showed that the most complex instructions and addressing modes are rarely used in typical programs; arithmetic and other powerful instructions account for only about 7% of statements. They pioneered the use of program analysis and benchmarking to evaluate the impact of architecture on performance.

4 Quantitative Analysis of Program Execution (Cont'd)
° More quantitative metrics: all of these metrics show that there is little or no payoff in increasing the complexity of instructions. Moreover, analysis showed that compilers usually do not take advantage of complex instructions and addressing modes. These observations brought about an evolution from CISC to the Reduced Instruction Set Computer (RISC).
° The focus is to make the frequent case fast and simple → make assignments fast → use only LOAD and STORE to access memory.

5 Quantitative Analysis of Program Execution (Cont'd)
° Load/Store machine: a typical RISC architecture. Only these two instructions communicate with memory; all other instructions operate on registers. Accesses to memory can be overlapped because there are fewer side effects. A large number of registers is needed. → A simple instruction set results in a simpler CPU, which frees up space on the microprocessor for other purposes, such as registers and caches.
° Quantitative performance analysis: execution time is the most important performance factor. With T_wo the execution time without a modification and T_w the time with it:
Speedup S = T_wo / T_w;  percent speedup = (T_wo - T_w) / T_w × 100

6 Quantitative Analysis of Program Execution (Cont'd)
° Example: adding a 1 MB cache reduces the execution time of a benchmark from 12 seconds to 8 seconds → S = 1.5, or 50% speedup.
° Suppose t is the machine's clock period, CPI is the average number of clock cycles per instruction, and IC is the instruction count. Then the total execution time is T = IC × CPI × t. Thus,
%S = (IC_wo × CPI_wo × t_wo - IC_w × CPI_w × t_w) / (IC_w × CPI_w × t_w) × 100
Example: moving from a CPU with a CPI of 5 to a CPU with a CPI of 3.5, with the clock period increased from 100 ns to 120 ns (and the same instruction count), the speedup is:
%S = (5 × 100 - 3.5 × 120) / (3.5 × 120) × 100 ≈ 19%
Execution time can be reduced by reducing the number of instructions, the average cycles per instruction, or the clock cycle time.
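The slide's arithmetic can be reproduced with a short sketch (the helper names here are mine, not from the lecture):

```python
def exec_time(ic, cpi, tau):
    """Total execution time T = IC * CPI * tau (clock period)."""
    return ic * cpi * tau

def percent_speedup(t_without, t_with):
    """Percent speedup = (T_wo - T_w) / T_w * 100."""
    return (t_without - t_with) / t_with * 100.0

# Cache example from the slide: 12 s reduced to 8 s -> 50% speedup.
cache_speedup = percent_speedup(12, 8)

# CPI example: CPI 5 at 100 ns vs CPI 3.5 at 120 ns. The instruction
# count is unchanged, so IC cancels and we can use IC = 1.
cpu_speedup = percent_speedup(exec_time(1, 5, 100), exec_time(1, 3.5, 120))
```

Note that the clock period got worse (100 ns to 120 ns), yet the lower CPI still wins: all three factors of T = IC × CPI × t matter jointly.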

7 CISC vs RISC
° CISC: Complex Instruction Set Computer. Many complex instructions and addressing modes. Suited to an era when memory access times were very high and the number of CPU registers low. Some instructions take many steps to execute. Not well suited to pipelining, because instructions differ widely in complexity.
° RISC: Reduced Instruction Set Computer. All instructions are of fixed length, which simplifies fetching and decoding. Few, simple instructions and addressing modes. Instructions can be issued into the pipeline at a rate of one per clock cycle. Pipelining allows different instructions to use different parts of the execution unit on each clock cycle, so an instruction can complete every clock cycle. A load/store architecture: all operands must be in registers, so a large number of registers is needed. A program may have more instructions than its CISC equivalent but runs faster because the instructions are simpler.

8 CISC vs RISC (Cont'd)
° Use hardwired control; avoid microcode. Let the compiler do the complex things.
° Other RISC benefits:
Prefetching: fetch instructions ahead of time to hide the latency of fetching instructions and data.
Pipelining: begin execution of an instruction before the previous instructions have completed.
Superscalar operation: issuing more than one instruction simultaneously (instruction-level parallelism, ILP).
Delayed loads, stores, and branches: operands may not be available when an instruction attempts to access them.
Register windows: the ability to switch to a different set of CPU registers with a single command; alleviates procedure call/return overhead.

9 Pipelining
° Pipelining takes an assembly-line approach to instruction execution.
° One instruction enters the pipeline at each clock cycle.
° At the end of each stage, with the clock, the stage's output is latched and passed to the next stage.
° Once the pipeline is full, one instruction completes in every clock cycle.
° Different pipelines may have different numbers of stages.
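The fill-then-stream behavior above has a simple cycle count: an ideal pipeline takes one cycle per stage to fill, after which one instruction completes per cycle. A minimal sketch (function names are mine):

```python
def pipelined_cycles(n_instructions, n_stages=5):
    """Ideal pipeline: n_stages cycles until the first instruction
    completes, then one additional completion per cycle."""
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, n_stages=5):
    """Without pipelining, each instruction occupies all n_stages cycles."""
    return n_instructions * n_stages
```

For 100 instructions on a 5-stage pipeline this gives 104 cycles versus 500 without pipelining, approaching the ideal of one cycle per instruction as the instruction count grows.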

10 Pipelining (Cont'd)
° What if the instruction is a branch? Once the pipeline is full and a branch is taken, the pipeline has to be flushed by filling it with no-operations (NOPs), also called pipeline bubbles. Inserting NOPs also lets the pipeline delay until it is known whether the branch is taken (a delayed branch). The other approach is branch prediction, or speculatively executing the instructions after the branch.
° When a LOAD or STORE occurs, the execute phase may have to be expanded from one clock cycle to two. This is known as a delayed load. Bubbles are also inserted when an interrupt occurs.

11 Pipelining (Cont'd)
° Analysis of pipeline efficiency
° Example: a CPU has a 5-stage pipeline. When a branch is taken, 4 cycles must be flushed, so the branch penalty is b = 4. The probability that an instruction is a branch is P_b = 0.25; the probability that a branch is taken is P_t = 0.5. Compute the average number of cycles needed to execute an instruction, and the execution efficiency.
CPI_No-Branch = 1. When there are branches,
CPI_AVG = (1 - P_b)(CPI_No-Branch) + P_b[P_t(1 + b) + (1 - P_t)(CPI_No-Branch)] = 1 + b × P_b × P_t (when CPI_No-Branch = 1)
Thus, CPI_AVG = (1 - 0.25)(1) + 0.25[0.5(1 + 4) + (1 - 0.5)(1)] = 1.5 cycles.
Execution efficiency = CPI_No-Branch / CPI_AVG = 1/1.5 ≈ 67% → the processor runs at 67% of its potential speed as a result of branches. Still much better than the five cycles per instruction that might be needed without pipelining.
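The branch-penalty formula above is easy to sanity-check in code (a sketch; parameter names are mine):

```python
def avg_cpi(cpi_no_branch, p_branch, p_taken, penalty):
    """CPI_AVG = (1 - Pb)*CPI_nb + Pb*[Pt*(1 + b) + (1 - Pt)*CPI_nb]."""
    return ((1 - p_branch) * cpi_no_branch
            + p_branch * (p_taken * (1 + penalty)
                          + (1 - p_taken) * cpi_no_branch))

# Slide's numbers: CPI_nb = 1, Pb = 0.25, Pt = 0.5, b = 4.
cpi = avg_cpi(1, 0.25, 0.5, 4)   # 1.5 cycles
efficiency = 1 / cpi             # about 0.67
```

Halving either P_b (fewer branches, e.g. via loop unrolling) or the effective penalty (e.g. via branch prediction) drops CPI_AVG from 1.5 to 1.25, which is why both techniques matter in practice.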

12 Superscalar and VLIW Architectures
° Superscalar architecture: may have one or more separate integer units (IUs), floating-point units (FPUs), and branch processing units (BPUs). With separate functional units, several instructions can be executed at the same time. Instructions must be scheduled onto the various execution units and may be executed out of order. Out-of-order execution means that instructions need to be examined prior to dispatching them to an execution unit, not only to determine which unit should execute them but also to determine whether executing them out of order would result in an incorrect program because of dependencies between the instructions. → Out-of-order issue, in-order issue, retiring instructions.
° Very Long Instruction Word (VLIW): multiple operations are packed into a single instruction word. The compiler is responsible for organizing multiple operations into the instruction word.
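The dependency check that out-of-order hardware performs can be illustrated with a toy read-after-write (RAW) test. This is a deliberately simplified sketch: the tuple-of-register-sets representation is invented here, and real hardware also tracks write-after-read and write-after-write hazards.

```python
def raw_hazard(earlier, later):
    """True if `later` reads a register that `earlier` writes
    (a read-after-write dependency), so the pair must not be
    reordered or issued in the same cycle without forwarding."""
    writes, _reads = earlier
    _writes, reads = later
    return bool(writes & reads)

# Each instruction is modeled as ({registers written}, {registers read}):
i1 = ({"r1"}, {"r2", "r3"})   # r1 = r2 + r3
i2 = ({"r4"}, {"r1", "r5"})   # r4 = r1 + r5  (depends on i1 through r1)
i3 = ({"r6"}, {"r2", "r7"})   # r6 = r2 + r7  (independent of i1)
```

Here i1 and i3 could be dispatched to two integer units simultaneously, while i2 must wait for i1 to produce r1.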

13 Parallel Architecture
° Parallel processing: a number of processors work collectively, in parallel, on a common problem. MIMD examples: shared-memory multiprocessor systems and message-passing multiprocessor systems.
° Flynn's taxonomy: SISD, SIMD, MIMD, MISD.

14 Parallel Architecture (Cont'd)
° Speedup = T_sequential / T_parallel
° Amdahl's law says that even a small number of sequential operations in a program can significantly limit the speedup. With f the sequential fraction and P the number of processors:
S = 1 / (f + (1 - f)/P)
° Example: if f = 10% of the program is sequential, then the speedup can be no greater than 10. → S = 1/(0.1 + 0.9/10) ≈ 5.3 for P = 10 processors → S = 1/(0.1 + 0.9/∞) = 10 for P = ∞ processors
° Efficiency = speedup / P
° However, note that we can often increase the problem size of an application.
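Amdahl's law and the slide's numbers can be checked directly (a sketch; function names are mine):

```python
def amdahl_speedup(f, p):
    """S = 1 / (f + (1 - f)/p), with f the sequential fraction
    of the program and p the number of processors."""
    return 1.0 / (f + (1.0 - f) / p)

def efficiency(f, p):
    """Efficiency = speedup / p: how well the processors are used."""
    return amdahl_speedup(f, p) / p

s_10 = amdahl_speedup(0.1, 10)   # about 5.26 on 10 processors
s_cap = 1 / 0.1                  # limiting speedup as p -> infinity: 10
```

Note how quickly efficiency decays: with f = 0.1, ten processors already run at only about 53% efficiency, which is the motivation for scaling the problem size along with the processor count.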

15 Interconnection Networks

