LOGO Computer Architecture Dr. Esam Al_Qaralleh Princess Sumaya University for Technology.

1 LOGO Computer Architecture Dr. Esam Al_Qaralleh Princess Sumaya University for Technology

2 2 Data path

3 3 The Processor: Data Path and Control  Two types of functional units:  elements that operate on data values (combinational)  elements that contain state (state elements)

4 4 Single Cycle Implementation  Typical execution:  read contents of some state elements,  send values through some combinational logic  write results to one or more state elements  Using a clock signal for synchronization  Edge triggered methodology

5 5 A portion of the datapath used for fetching instructions

6 6 The datapath for R-type instructions

7 7 The datapath for load and store instructions

8 8 The datapath for branch instructions

9 9 Complete Data Path

10 10 Control  Selecting the operations to perform (ALU, read/write, etc.)  Controlling the flow of data (multiplexor inputs)  Information comes from the 32 bits of the instruction  Example: lw $1, 100($2)  Value of control signals is dependent upon:  what instruction is being executed  which step is being performed
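
The example instruction on this slide, lw $1, 100($2), can illustrate how control information is read from fixed bit positions of the 32-bit word. The following is a minimal Python sketch, not part of the original slides; the function name decode_fields and the constant LW_EXAMPLE are hypothetical, and the encoding assumes the standard MIPS I-format with opcode 35 for lw.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,   # bits 31-26
        "rs":     (instr >> 21) & 0x1F,   # bits 25-21
        "rt":     (instr >> 16) & 0x1F,   # bits 20-16
        "rd":     (instr >> 11) & 0x1F,   # bits 15-11 (meaningful for R-type)
        "imm":    instr & 0xFFFF,         # bits 15-0 (meaningful for I-type)
    }

# lw $1, 100($2): opcode 35, base register rs = 2, destination rt = 1, offset 100
LW_EXAMPLE = (35 << 26) | (2 << 21) | (1 << 16) | 100
fields = decode_fields(LW_EXAMPLE)
```

Because every field sits at a fixed position, the register numbers can be extracted before the opcode has been interpreted.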

11 11 Data Path with Control

12 12 Single Cycle Implementation  Calculate cycle time assuming negligible delays except:  memory (2ns), ALU and adders (2ns), register file access (1ns)
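
One way to work this exercise is to add up the delays of the units each instruction class passes through and take the worst case as the cycle time. The sketch below is illustrative only; the per-class unit lists in PATHS are an assumption (the usual textbook breakdown), not something stated on the slide.

```python
# Delays from the slide, in nanoseconds; everything else is treated as free.
MEM, ALU, REG = 2, 2, 1

# Assumed unit usage per instruction class (hypothetical table for illustration).
PATHS = {
    "R-type": [MEM, REG, ALU, REG],        # fetch, read regs, ALU, write reg
    "load":   [MEM, REG, ALU, MEM, REG],   # fetch, read regs, address, data read, write reg
    "store":  [MEM, REG, ALU, MEM],        # fetch, read regs, address, data write
    "branch": [MEM, REG, ALU],             # fetch, read regs, compare
    "jump":   [MEM],                       # fetch only
}

# Time each class needs, and the single clock period that must fit all of them.
latency = {name: sum(units) for name, units in PATHS.items()}
cycle_time = max(latency.values())
```

Under these assumptions the load sets the cycle time, so every other instruction class wastes part of each cycle.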

13 13 Multicycle approach  Single Cycle Problems:  what if we had a more complicated instruction like floating point?  wasteful of area  One Solution:  use a “smaller” cycle time  have different instructions take different numbers of cycles  a “multicycle” datapath:

14 14 Multicycle Approach  Break up the instructions into steps, each step takes a cycle  balance the amount of work to be done  restrict each cycle to use only one major functional unit  At the end of a cycle  store values for use in later cycles (easiest thing to do)  introduce additional “internal” registers

15 15 Multicycle Approach Implementation

16 16 Five Execution Steps  Instruction Fetch  Instruction Decode and Register Fetch  Execution, Memory Address Computation, or Branch Completion  Memory Access or R-type instruction completion  Write-back step  INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

17 17 Five Execution Steps
Step 1, Instruction fetch (all instructions): IR = Mem[PC]; PC = PC + 4
Step 2, Instruction decode/register fetch (all instructions): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Step 3, Execution, address computation, branch/jump completion: R-type: ALUOut = A op B. Memory reference: ALUOut = A + sign-extend(IR[15-0]). Branch: if (A == B) then PC = ALUOut. Jump: PC = PC[31-28] || (IR[25-0] << 2)
Step 4, Memory access or R-type completion: R-type: Reg[IR[15-11]] = ALUOut. Load: MDR = Mem[ALUOut]. Store: Mem[ALUOut] = B
Step 5, Memory read completion: Load: Reg[IR[20-16]] = MDR
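
The branch and jump address computations in the step table can be checked with a few lines of Python. This is a sketch under the assumption of 32-bit addresses; the helper names sign_extend16, branch_target and jump_target are hypothetical, and pc stands for the already incremented PC.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc: int, ir: int) -> int:
    """ALUOut = PC + (sign-extend(IR[15-0]) << 2), wrapped to 32 bits."""
    return (pc + (sign_extend16(ir) << 2)) & MASK32

def jump_target(pc: int, ir: int) -> int:
    """PC = PC[31-28] || (IR[25-0] << 2)."""
    return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
```

A 16-bit offset of 0xFFFF is -1 after sign extension, so the branch target is one word before the incremented PC.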

18 18 pipelining

19 19 Pipelining is Natural!  Pipelining provides a method for executing multiple instructions at the same time.  Laundry Example  Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold  Washer takes 30 minutes  Dryer takes 40 minutes  “Folder” takes 20 minutes

20 20 Sequential Laundry  Sequential laundry takes 6 hours for 4 loads  If they learned pipelining, how long would laundry take? [Figure: timeline from 6 PM to midnight; loads A, B, C, D run back to back, each taking 30 + 40 + 20 minutes]

21 21 Pipelined Laundry: Start work ASAP  Pipelined laundry takes 3.5 hours for 4 loads [Figure: timeline from 6 PM to midnight; loads A, B, C, D overlap, with stage times of 30, 40, and 20 minutes]
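
The 6 hour and 3.5 hour figures can be reproduced with a small scheduling sketch. It assumes each machine handles one load at a time and loads are processed in order; the function name pipelined_finish is hypothetical.

```python
def pipelined_finish(stage_minutes, n_loads):
    """Finish time when each stage handles one load at a time, in order."""
    stage_free = [0] * len(stage_minutes)    # when each machine is next free
    done = 0
    for _ in range(n_loads):
        t = 0                                 # every load is ready at time 0
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])     # wait for the load and for the machine
            t = start + minutes
            stage_free[s] = t
        done = t
    return done

STAGES = [30, 40, 20]                 # washer, dryer, folder (minutes)
sequential = 4 * sum(STAGES)          # one load at a time, start to finish
pipelined = pipelined_finish(STAGES, 4)
```

Here sequential is 360 minutes and pipelined is 210 minutes. The 40 minute dryer sets the pace once the pipeline is full.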

22 22 Pipelining Lessons  Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload  Pipeline rate is limited by the slowest pipeline stage  Multiple tasks operate simultaneously using different resources  Potential speedup = number of pipe stages  Unbalanced lengths of pipe stages reduce speedup  Time to “fill” the pipeline and time to “drain” it reduce speedup  Stall for dependences [Figure: pipelined laundry timeline for loads A, B, C, D starting at 6 PM, with stage times of 30, 40, and 20 minutes]

23 23 Basic DataPath What do we need to add to actually split the datapath into stages?

24 24 Pipeline DataPath

25 25 The Five Stages of the Load Instruction  Ifetch: Instruction Fetch  Fetch the instruction from the Instruction Memory  Reg/Dec: Registers Fetch and Instruction Decode  Exec: Calculate the memory address  Mem: Read the data from the Data Memory  Wr: Write the data back to the register file  Load timeline: Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec, Cycle 4 Mem, Cycle 5 Wr

26 26 Pipelined Execution  On a pipelined processor, multiple instructions are in various stages at the same time.  Assume each instruction takes five cycles [Figure: six instructions, each passing through IFetch, Dcd, Exec, Mem, WB and each starting one cycle after the previous one; axes: Program Flow versus Time]

27 27 Single Cycle, Multiple Cycle, vs. Pipeline

28 28 Graphically Representing Pipelines Can help with answering questions like:  How many cycles does it take to execute this code?  What is the ALU doing during cycle 4?  Are two instructions trying to use the same resource at the same time?
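
The first two questions above can also be answered programmatically for an ideal pipeline. The sketch below assumes five stages, one instruction entering per cycle, and no stalls; the stage names follow the earlier Pipelined Execution slide, and the function names are hypothetical.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def stage_of(instr: int, cycle: int):
    """Stage occupied by instruction instr (0-based) in cycle (1-based), or None."""
    k = cycle - 1 - instr             # instruction i enters IFetch in cycle i + 1
    return STAGES[k] if 0 <= k < len(STAGES) else None

def total_cycles(n_instr: int) -> int:
    """Cycles needed to run n_instr instructions with no stalls."""
    return n_instr + len(STAGES) - 1 if n_instr else 0

def instr_in_stage(stage: str, cycle: int, n_instr: int):
    """Index of the instruction occupying stage during cycle, or None if idle."""
    for i in range(n_instr):
        if stage_of(i, cycle) == stage:
            return i
    return None
```

For six instructions, total_cycles(6) gives 10, and instr_in_stage("Exec", 4, 6) gives 1, meaning the ALU is working on the second instruction during cycle 4.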

29 29 Why Pipeline? Because the resources are there!

30 30 Pipelining MIPS Execution

31 31 Observations  Use separate instruction and data memories  To reduce memory conflicts  The memory system must deliver 5 times the bandwidth if the pipelined processor has the same clock cycle as the unpipelined processor  Register file is used in two stages  One for reading in ID and one for writing in WB  2 reads and 1 write in every clock cycle  Maintaining the PC  IF stage: preparing the PC for the next instruction  ID stage: computing the branch target address  Problem: a branch does not change the PC until the ID stage

32 32 Why Pipeline?  Suppose  100 instructions are executed  The single cycle machine has a cycle time of 45 ns  The multicycle and pipeline machines have cycle times of 10 ns  The multicycle machine has a CPI of 4.6  Single Cycle Machine  45 ns/cycle x 1 CPI x 100 inst = 4500 ns  Multicycle Machine  10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns  Ideal pipelined machine  10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns  Ideal pipelined vs. single cycle speedup  4500 ns / 1040 ns = 4.33  What has not yet been considered?
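
The single-cycle versus ideal-pipeline comparison above generalizes to any instruction count. This is a minimal sketch using the slide's numbers; the function names are hypothetical, and the 4 drain cycles assume a 5-stage pipeline with no stalls.

```python
CYCLE_SINGLE_NS = 45
CYCLE_PIPE_NS = 10
DRAIN_CYCLES = 4      # extra cycles for the last instruction to leave the pipeline

def single_cycle_time(n: int) -> int:
    """Total time in ns for n instructions at 1 CPI on the single-cycle machine."""
    return CYCLE_SINGLE_NS * n

def pipelined_time(n: int) -> int:
    """Total time in ns for n instructions on the ideal pipeline, including drain."""
    return CYCLE_PIPE_NS * (n + DRAIN_CYCLES)

def speedup(n: int) -> float:
    return single_cycle_time(n) / pipelined_time(n)
```

For 100 instructions the speedup is about 4.33. As the instruction count grows, the drain cost is amortized and the speedup approaches the ratio of cycle times, 4.5.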

33 33 Compare Performance  Compare single-cycle, multicycle and pipelined control using SPECint2000  Single-cycle: memory access = 200ps, ALU = 100ps, register file read and write = 50ps  200 + 50 + 100 + 200 + 50 = 600ps  Multicycle: 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU  CPI = 4.12, clock cycle = 200ps (longest functional unit)  Pipelined  Loads take 1 clock cycle when there is no load-use dependence and 2 when there is, average 1.5 per load  Stores and ALU instructions take 1 clock cycle  Branches take 1 when predicted correctly and 2 when not, average 1.25  Jumps take 2  CPI = 1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17  Average instruction time: single-cycle = 600ps, multicycle = 4.12 x 200 = 824ps, pipelined = 1.17 x 200 = 234ps  Memory access at 200ps is the bottleneck. How to improve?
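
Both CPI figures are weighted averages over the instruction mix. The sketch below recomputes them; the multicycle cycle counts in MULTI_CYCLES (5 for loads, 4 for stores and ALU operations, 3 for branches and jumps) are an assumption that is consistent with the slide's CPI of 4.12 but is not stated on it.

```python
MIX = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}

# Assumed cycles per instruction class on the multicycle machine.
MULTI_CYCLES = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}

# Average cycles per class on the pipelined machine, as given on the slide.
PIPE_CYCLES = {"load": 1.5, "store": 1, "alu": 1, "branch": 1.25, "jump": 2}

def weighted_cpi(cycles: dict) -> float:
    """Average CPI over the instruction mix."""
    return sum(MIX[k] * cycles[k] for k in MIX)

CLOCK_PS = 200
multi_cpi = weighted_cpi(MULTI_CYCLES)
pipe_cpi = weighted_cpi(PIPE_CYCLES)
```

The unrounded pipelined CPI is 1.1725; the 234ps figure on the slide uses the CPI rounded to 1.17.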

34 34 Can pipelining get us into trouble?  Yes: Pipeline Hazards  structural hazards: attempt to use the same resource two different ways at the same time. E.g., two instructions try to read the same memory at the same time  data hazards: attempt to use an item before it is ready; an instruction depends on the result of a prior instruction still in the pipeline, e.g. add r1, r2, r3 followed by sub r4, r2, r1  control hazards: attempt to make a decision before the condition is evaluated; branch instructions, e.g. beq r1, loop followed by add r1, r2, r3  Can always resolve hazards by waiting  pipeline control must detect the hazard  take action (or delay action) to resolve hazards
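
The data hazard in the add/sub pair on this slide is a read-after-write dependence, which can be detected by comparing register names. The following is a simplified sketch; the Instr record and the raw_hazard function are hypothetical and ignore how far apart the two instructions are in the pipeline.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str        # register written
    srcs: tuple      # registers read

def raw_hazard(earlier: Instr, later: Instr) -> bool:
    """True if later reads the register that earlier writes."""
    return earlier.dest in later.srcs

add = Instr("add", "r1", ("r2", "r3"))   # add r1, r2, r3
sub = Instr("sub", "r4", ("r2", "r1"))   # sub r4, r2, r1
```

raw_hazard(add, sub) is True because sub reads r1 before add has written it back.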

35 35 Single Memory is a Structural Hazard [Figure: pipeline diagram, instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time in clock cycles; each instruction passes through Mem, Reg, ALU, Mem, Reg] Detection is easy in this case! (right half highlight means read, left half write)

36 36 Structural Hazards limit performance  Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then  average CPI ≥ 1.3  otherwise the resource is more than 100% utilized  Solution 1: Use separate instruction and data memories  Solution 2: Allow memory to read and write more than one word per cycle  Solution 3: Stall

37 37 Control Hazard Solutions  Stall: wait until decision is clear  It’s possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read  Impact: 2 clock cycles per branch instruction => slow [Figure: pipeline diagram of Add, Beq, Load; instruction order versus time in clock cycles]

38 38 Control Hazard Solutions  Predict: guess one direction then back up if wrong  Predict not taken  Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)  More dynamic scheme: history of 1 branch (≈ 90%) [Figure: pipeline diagram of Add, Beq, Load; instruction order versus time in clock cycles]
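
A "history of 1 branch" scheme predicts that a branch will do what it did last time. The sketch below simulates such a predictor and applies the slide's costs of 1 cycle when right and 2 when wrong; the function names and the loop trace are hypothetical examples, not measurements.

```python
def one_bit_predictor(outcomes, initial=False):
    """Predict each branch as its previous outcome; return the number of correct guesses."""
    last, correct = initial, 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken
    return correct

def cycles_per_branch(accuracy: float) -> float:
    """Average cost: 1 cycle when predicted correctly, 2 when not."""
    return accuracy * 1 + (1 - accuracy) * 2

# Hypothetical trace: a loop branch taken 9 times, then not taken, executed twice.
trace = ([True] * 9 + [False]) * 2
hits = one_bit_predictor(trace)
```

On this trace the predictor misses twice per loop execution, once on entry and once on exit, giving 16 correct guesses out of 20. With the slide's accuracies, cycles_per_branch gives 1.5 cycles at 50% and 1.1 cycles at 90%.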

39 39 Control Hazard Solutions  Redefine branch behavior (takes place after next instruction): “delayed branch”  Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time)  As more instructions are launched per clock cycle, the delayed branch becomes less useful [Figure: pipeline diagram of Add, Beq, Misc, Load; instruction order versus time in clock cycles]

40 40 Data Hazard

41 41 Data Hazard---Forwarding  Use temporary results, don’t wait for them to be written  register file forwarding to handle read/write to same register  ALU forwarding
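
ALU forwarding amounts to a comparison between the register a later instruction reads and the destination registers of the instructions ahead of it. The sketch below uses the pipeline-register names common in textbooks (EX/MEM, MEM/WB, ID/EX); those names, the function forward_a, and its return labels are assumptions for illustration, not taken from the slides.

```python
def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs):
    """Select the source of the first ALU operand.

    Returns "EX/MEM" or "MEM/WB" when a newer value must be forwarded,
    otherwise "REG" (use the value read from the register file).
    """
    # Register 0 is never forwarded because it always reads as zero in MIPS.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"      # the most recent result takes priority
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "REG"
```

When both earlier instructions write the same register, the EX/MEM check comes first so the newest value is used.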

42 42 Can’t always forward  Load word can still cause a hazard:  an instruction tries to read a register following a load instruction that writes to the same register.  Thus, we need a hazard detection unit to “stall” the instruction that follows the load
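
The load-use check performed by a hazard detection unit can be sketched as a single condition. The signal names below (ID/EX.MemRead, ID/EX.rt, IF/ID.rs, IF/ID.rt) follow common textbook convention and are assumptions here, as is the function name load_use_stall.

```python
def load_use_stall(id_ex_memread: bool, id_ex_rt: int, if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in decode reads the register a load in execute will write."""
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))

# Hypothetical sequence: lw $2, 20($1) followed by and $4, $2, $5
needs_stall = load_use_stall(True, 2, 2, 5)
```

The loaded value is not available until the end of the memory stage, so forwarding alone cannot supply it in time and one stall cycle is inserted.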

43 43 Stalling  We can stall the pipeline by keeping an instruction in the same stage

44 44 Designing Instruction Set for Pipelining  MIPS instructions are all the same length.  This makes it much easier to fetch instructions in the first stage and to decode them in the second stage.  In IA-32, instructions vary from 1 to 17 bytes, so pipelining is considerably more challenging.  All recent IA-32 architectures translate instructions into micro-operations, which are pipelined  MIPS has only a few instruction formats, with the source register fields in the same place  This symmetry means the second stage can read the register file at the same time that the hardware is decoding the type of instruction  Without this symmetry we would need to split stage 2

45 45 Designing Instruction Set for Pipelining  Memory operands only appear in loads or stores in MIPS  We can use the execute stage to calculate the memory address and then access memory in the following stage  If we could operate on operands in memory, stages 3 and 4 would expand to an address stage, memory stage and then execute stage  Operands must be aligned in memory  We never have a single instruction requiring two data memory accesses, so the transfer can be done in a single pipeline stage

46 46 Summary: Pipelining  What makes it easy?  all instructions are the same length  just a few instruction formats  memory operands appear only in loads and stores  What makes it hard?  structural hazards: suppose we had only one memory  control hazards: need to worry about branch instructions  data hazards: an instruction depends on a previous instruction  We’ll talk about modern processors and what really makes it hard:  exception handling  trying to improve performance with out-of-order execution, etc.

47 47 Summary  Pipelining is a fundamental concept  multiple steps using distinct resources  Utilize capabilities of the datapath by pipelined instruction processing  start next instruction while working on the current one  limited by length of longest stage (plus fill/flush)  detect and resolve hazards
