Morgan Kaufmann Publishers The Processor


1 Morgan Kaufmann Publishers The Processor
11 September, 2018 Chapter 4 The Processor Chapter 4 — The Processor

2 ALU Control
§4.4 A Simple Implementation Scheme
The ALU is used differently by each instruction class:
- Load/Store (LDUR/STUR): the ALU computes the memory address by addition.
- R-type instructions: the ALU performs one of four actions (AND, OR, subtract, or add), depending on the value of the 11-bit opcode field in the instruction.
- Compare and branch on zero (CBZ): the ALU just passes the register input value through.
A small ALU control unit generates the ALU control lines:
- Input: the opcode field of the instruction and a 2-bit control field called ALUOp, with the following values:
  - 00: the operation is add, for loads and stores
  - 01: pass input b, for CBZ
  - 10: the operation is determined by the opcode field
- Output: a 4-bit signal that directly controls the ALU by generating one of the 6 combinations shown below.

ALU control lines   Function
0000                AND
0001                OR
0010                add
0110                subtract
0111                pass input b
1100                NOR

3 ALU Control
The ALU control inputs are set from the 2-bit ALUOp control and the 11-bit opcode. The ALUOp bits are generated by the main control unit.
Multiple levels of decoding is a common implementation technique:
- it can reduce the size of the main control unit
- it can potentially reduce the latency of the control unit

Instruction  ALUOp  Operation                   Opcode field  ALU function  ALU control
LDUR         00     load register               XXXXXXXXXXX   add           0010
STUR         00     store register              XXXXXXXXXXX   add           0010
CBZ          01     compare and branch on zero  XXXXXXXXXXX   pass input b  0111
R-type       10     ADD                         10001011000   add           0010
R-type       10     SUB                         11001011000   subtract      0110
R-type       10     AND                         10001010000   AND           0000
R-type       10     ORR                         10101010000   OR            0001
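The two-level decoding described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the function name `alu_control` and the constant names are hypothetical, and the 11-bit R-type opcode values are the LEGv8 encodings, stated here as an assumption.

```python
# 4-bit ALU control encodings, taken from the "ALU control lines" table.
ALU_AND = 0b0000
ALU_OR = 0b0001
ALU_ADD = 0b0010
ALU_SUB = 0b0110
ALU_PASS_B = 0b0111

# 11-bit R-type opcodes (assumed LEGv8 encodings) mapped to ALU actions.
R_TYPE_ALU = {
    0b10001011000: ALU_ADD,  # ADD
    0b11001011000: ALU_SUB,  # SUB
    0b10001010000: ALU_AND,  # AND
    0b10101010000: ALU_OR,   # ORR
}


def alu_control(alu_op: int, opcode: int) -> int:
    """Return the 4-bit ALU control value for a 2-bit ALUOp and 11-bit opcode."""
    if alu_op == 0b00:       # LDUR/STUR: opcode is a don't-care, always add
        return ALU_ADD
    if alu_op == 0b01:       # CBZ: opcode is a don't-care, pass input b
        return ALU_PASS_B
    if alu_op == 0b10:       # R-type: second-level decode on the opcode
        return R_TYPE_ALU[opcode]
    raise ValueError(f"unused ALUOp value: {alu_op:02b}")


print(format(alu_control(0b10, 0b11001011000), "04b"))
```

The main control unit only has to produce the 2-bit ALUOp; the opcode is examined a second time only for R-type instructions.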

4 The Main Control Unit
Control signals are derived from the instruction fields:
- Opcode field: 6 to 11 bits wide, in bit positions 31:26 to 31:21
- First register operand: bit positions 9:5 (Rn)
- Other register operand: bit positions 20:16 (Rm) or 4:0 (Rt)
- Another operand: 19-bit offset (CBZ) or 9-bit offset (load/store)
- Destination register: bit positions 4:0, both for R-type instructions (Rd) and for loads (Rt)
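As a rough illustration of these bit positions, the sketch below slices the fields out of a 32-bit instruction word. The helper names `bits` and `decode_fields` are hypothetical. The slide gives only the widths of the two offset fields; their positions (20:12 for the 9-bit load/store offset, 23:5 for the 19-bit CBZ offset) are assumptions based on the LEGv8 D and CB formats.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract bit positions hi:lo (inclusive) from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(instr: int) -> dict:
    """Slice out every field the control unit and datapath may need."""
    return {
        "opcode": bits(instr, 31, 21),      # widest (11-bit) view of the opcode
        "Rm": bits(instr, 20, 16),          # second register operand (R-type)
        "Rn": bits(instr, 9, 5),            # first register operand
        "Rt_Rd": bits(instr, 4, 0),         # Rt or destination register Rd
        "offset9": bits(instr, 20, 12),     # load/store offset (assumed position)
        "offset19": bits(instr, 23, 5),     # CBZ offset (assumed position)
    }


# ADD X1, X2, X3 assembled by hand: opcode | Rm=3 | shamt=0 | Rn=2 | Rd=1
add_x1_x2_x3 = (0b10001011000 << 21) | (3 << 16) | (2 << 5) | 1
print(decode_fields(add_x1_x2_x3))
```

All fields are extracted unconditionally, mirroring the hardware, where the wires are always connected and the control signals decide which values are actually used.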

5 Datapath with Multiplexors and Control Lines

6 Control Signals

7 Datapath with control unit and control signals

8 Setting Control Signals
The setting of the control lines depends only on the opcode. The table shows whether each control signal should be 0, 1, or don’t care (X) for each of the opcode values.

9 R-Type Instruction
ADD X1,X2,X3
Four steps to execute the instruction:
1. The instruction is fetched, and the PC is incremented.
2. Two registers, X2 and X3, are read from the register file; the main control unit also computes the setting of the control lines during this step.
3. The ALU operates on the data read from the register file, using portions of the opcode to generate the ALU function.
4. The result from the ALU is written into the destination register (X1) in the register file.

10 Load Instruction
LDUR X1, [X2, offset]
Five steps to execute the instruction:
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. A register (X2) value is read from the register file.
3. The ALU computes the sum of the value read from the register file and the sign-extended 9 bits of the instruction (offset).
4. The sum from the ALU is used as the address for the data memory.
5. The data from the memory unit is written into the register file (X1).
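Steps 2 to 5 of the load can be modelled with a short sketch. The names `sign_extend` and `ldur` are hypothetical, and the register file and data memory are represented as plain Python dictionaries purely for illustration.

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def ldur(regs: dict, memory: dict, rt: int, rn: int, offset9: int) -> int:
    """Model LDUR Xrt, [Xrn, offset]; returns the computed memory address."""
    base = regs[rn]                            # step 2: read the base register
    address = base + sign_extend(offset9, 9)   # step 3: ALU adds the offset
    data = memory[address]                     # step 4: address the data memory
    regs[rt] = data                            # step 5: write back to register
    return address


regs = {1: 0, 2: 100}
memory = {92: 7}
# 0x1F8 is the 9-bit two's-complement encoding of -8.
addr = ldur(regs, memory, rt=1, rn=2, offset9=0x1F8)
print(addr, regs[1])
```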

11 CBZ Instruction
CBZ X1, offset
Five steps to execute the instruction:
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. The register X1 is read from the register file using bits 4:0 of the instruction (Rt).
3. The ALU passes through the data value read from the register file.
4. The value of the PC is added to the sign-extended 19-bit offset field of the instruction, shifted left by two; the result is the branch target address.
5. The Zero status information from the ALU is used to decide which adder result to store in the PC.
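The choice between the two adder results can be sketched as follows. The function names are hypothetical, and the PC increment of 4 assumes the usual 4-byte instruction size.

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def cbz_next_pc(pc: int, rt_value: int, offset19: int) -> int:
    """Return the next PC for CBZ, given the register value and raw offset."""
    incremented = pc + 4                                   # first adder: PC + 4
    target = pc + (sign_extend(offset19, 19) << 2)         # second adder: branch target
    zero = rt_value == 0                                   # Zero output of the ALU
    return target if zero else incremented


print(cbz_next_pc(100, 0, 3))   # register is zero: branch taken
print(cbz_next_pc(100, 5, 3))   # register is nonzero: fall through
```

The shift by two converts the word offset stored in the instruction into a byte offset.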

12 Control Function for the simple single-cycle implementation
The outputs of the control function are the control lines, and the input is the opcode field

13 Implementing Unconditional Branch
Jump instruction format: opcode 2 in bits 31:26, address in bits 25:0
- Jump uses a word address
- Update the PC with the concatenation of:
  - the top 4 bits of the old PC
  - the 26-bit jump address
  - 00
- Needs an extra control signal decoded from the opcode
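The concatenation on this slide amounts to a mask, a shift, and an OR. A minimal sketch, assuming a 32-bit PC (4 + 26 + 2 bits); the function name `jump_target` is hypothetical.

```python
def jump_target(old_pc: int, address26: int) -> int:
    """Concatenate PC[31:28], the 26-bit jump address, and two zero bits."""
    top4 = old_pc & 0xF0000000                     # keep the top 4 bits of the old PC
    word_address = (address26 & 0x3FFFFFF) << 2    # append 00: word to byte address
    return top4 | word_address


print(hex(jump_target(0x40000010, 0x100)))
```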

14 Datapath With B Added
- A branch is implemented by storing into the PC the sum of the PC and the sign-extended, shifted 26-bit offset.
- An additional OR gate, driven by a control signal, makes the datapath always select the branch target as the next PC.

15 Performance Issues
- The longest delay determines the clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary the period for different instructions
- Violates the design principle: making the common case fast
- We will improve performance by pipelining

16 Pipelining Analogy
§4.5 An Overview of Pipelining
Pipelined laundry: overlapping execution
- Parallelism improves performance
- Pipelining improves the throughput of the laundry system. When there are many loads of laundry to do, the improvement in throughput decreases the total time to complete the work.
- Four loads: Speedup = 8/3.5 = 2.3
- Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
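The two speedup figures can be reproduced with a small calculation. The sketch assumes four laundry stages of half an hour each, which is what the 8/3.5 figure implies (2 hours per load when done sequentially); the function name `laundry_speedup` is hypothetical.

```python
def laundry_speedup(n_loads: int, stage_hours: float = 0.5, n_stages: int = 4) -> float:
    """Speedup of pipelined over sequential laundry for n_loads loads."""
    # Sequential: every load runs through all stages before the next one starts.
    sequential = n_loads * n_stages * stage_hours
    # Pipelined: the first load fills the pipeline, then one load finishes per stage time.
    pipelined = (n_stages - 1) * stage_hours + n_loads * stage_hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))          # four loads
print(round(laundry_speedup(1_000_000), 3))  # approaches the number of stages
```

The fixed 1.5 hours needed to fill the pipeline is why four loads give a speedup well below 4, while a very long run approaches it.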

17 LEGv8 Pipeline
Five stages, one step per stage:
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

18 Pipeline Performance
Assume the time for stages is
- 100ps for register read or write
- 200ps for other stages
Compare the pipelined datapath with the single-cycle datapath. The single-cycle design must allow for the slowest instruction, LDUR, so the time required for every instruction is 800 ps.

Instr                            Instr fetch  Register read  ALU op  Memory access  Register write  Total time
LDUR                             200ps        100ps          200ps   200ps          100ps           800ps
STUR                             200ps        100ps          200ps   200ps                          700ps
R-format (ADD, SUB, AND, ORR)    200ps        100ps          200ps                  100ps           600ps
CBZ                              200ps        100ps          200ps                                  500ps
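The totals in this table, and the two clock periods they imply, follow directly from the stage times. A minimal sketch; the dictionary and variable names are hypothetical.

```python
# Stage latencies in picoseconds, from the assumptions on this slide.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which stages each instruction class actually uses.
STAGES_USED = {
    "LDUR": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "STUR": ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "CBZ": ["fetch", "reg_read", "alu"],
}

# Total time per instruction is the sum of the stages it uses.
TOTAL_PS = {
    instr: sum(STAGE_PS[stage] for stage in stages)
    for instr, stages in STAGES_USED.items()
}

# Single-cycle: the clock must fit the slowest instruction.
single_cycle_clock_ps = max(TOTAL_PS.values())
# Pipelined: the clock must fit the slowest stage.
pipelined_clock_ps = max(STAGE_PS.values())

print(TOTAL_PS, single_cycle_clock_ps, pipelined_clock_ps)
```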

19 Pipeline Performance
- Single-cycle (Tc = 800ps)
- Pipelined (Tc = 200ps): all the pipeline stages take a single clock cycle, so the clock cycle must be long enough to accommodate the slowest operation, giving a worst-case clock cycle of 200 ps

20 Pipeline Speedup
- If all stages are balanced, i.e., all take the same time:

$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$$

- If not balanced, speedup is less
- Speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease
- Pipelining improves performance by increasing instruction throughput, in contrast to decreasing the execution time of an individual instruction.
- Instruction throughput is the important metric because real programs execute billions of instructions.
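To see why the speedup falls short of the number of stages when the stages are unbalanced, the sketch below compares total run time for the 800 ps single-cycle clock and the 200 ps five-stage pipelined clock from the earlier slides. The function names are hypothetical, and the model ignores hazards and stalls.

```python
def single_cycle_time_ps(n_instr: int, clock_ps: int = 800) -> int:
    """Each instruction takes one long clock cycle."""
    return n_instr * clock_ps


def pipelined_time_ps(n_instr: int, n_stages: int = 5, clock_ps: int = 200) -> int:
    """The first instruction takes n_stages cycles; each later one adds a cycle."""
    return (n_stages + n_instr - 1) * clock_ps


def speedup(n_instr: int) -> float:
    return single_cycle_time_ps(n_instr) / pipelined_time_ps(n_instr)


print(single_cycle_time_ps(3), pipelined_time_ps(3))
print(round(speedup(1_000_000), 3))
```

For long instruction streams the speedup approaches 800/200 = 4 rather than 5, because the 100 ps register stages are padded out to the 200 ps clock. The latency of a single instruction rises from 800 ps to 5 × 200 ps = 1000 ps.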

21 Pipelining and ISA Design
The LEGv8 ISA is designed for pipelining:
- All instructions are 32 bits
  - Easier to fetch and decode in one cycle
  - cf. x86: 1- to 15-byte instructions
- Few and regular instruction formats
  - Can decode and read registers in one step
- Load/store addressing
  - Can calculate the address in the 3rd stage and access memory in the 4th stage
- Alignment of memory operands
  - Memory access takes only one cycle

