1 The Processor: Datapath and Control We will design a microprocessor that includes a subset of the MIPS instruction set: –Memory access: load/store word.
1 The Processor: Datapath and Control
We will design a microprocessor that implements a subset of the MIPS instruction set:
–Memory access: load/store word (lw, sw).
–Arithmetic-logical (AL) instructions: add, sub, and, or, and slt.
–Branch instructions: beq and jump (j).
The subset omits some of the integer instructions and all of the floating-point (fp) instructions, but the principles are the same. For every instruction the first two steps are identical:
–Fetch an instruction from the memory location the PC points to.
–Decode the instruction and read the registers or memory contents specified.
2 Abstract View of the DataPath
The data path contains 2 types of logic elements:
–Combinational: Elements that operate on data values. Their outputs depend only on their current inputs. The ALU is a combinational element.
–State: Elements with internal storage. Their state is defined by the values they contain (memory and registers).
3 Clocking Methodology A state element has at least two inputs and one output. The inputs are the data value to be written into the element and the clock signal, which determines when the value is written. The output is the data value stored in the element. Thus a state element can be read at any time but is written only when the clock allows it. A clocking methodology defines when signals can be read and written. This is crucial to the correct design of a computer. We will assume an edge-triggered clocking methodology: any values stored in the machine are updated only on a clock edge.
4 Edge-Triggered Clocking Because only state elements can store values, any collection of combinational logic must have its inputs coming from a set of state elements and its outputs written to a set of state elements. The time necessary for the signals to reach state element 2 defines the length of the clock cycle. An edge-triggered methodology allows us to read the contents of a register, send the value through some combinational logic, and write that register in the same clock cycle. We assume that state elements have implicit clock signals.
5 Fetching an Instruction A memory unit holds the instructions that are to be executed. The address of the next instruction is in the PC. We need an ALU that performs only addition (an adder) in order to calculate the address of the next instruction to fetch. Thick arrows symbolize 32-bit buses unless specified differently. Thin arrows specify 1-bit lines, and colored lines specify control lines.
6 The Register File The R-type instructions (also called the arithmetic-logical instructions) read the contents of 2 registers, perform an ALU op., and write the result back into a third register. The 32 registers are stored in the register file. The register file has 3 5-bit inputs to specify the registers, 2 32-bit outputs for the data read, 1 32-bit input for the data written and 1 control signal to decide if data should be written in. In addition we will need an ALU to perform the operations.
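The register file's ports can be sketched in a few lines of Python. This is only an illustrative model, not the slide's hardware: the class and parameter names are hypothetical, and the rule that register 0 always reads as zero is a general MIPS convention that the slide does not state.

```python
class RegisterFile:
    """Hypothetical model: 32 registers, 2 read ports, 1 write port."""

    def __init__(self):
        self.regs = [0] * 32  # thirty-two 32-bit registers

    def read(self, read_reg1, read_reg2):
        # The two 5-bit register numbers select the two 32-bit read outputs.
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def write(self, write_reg, write_data, reg_write):
        # Data is stored only when the RegWrite control signal is asserted.
        # Assumption (MIPS convention, not on the slide): $zero stays 0.
        if reg_write and (write_reg & 0x1F) != 0:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 7, reg_write=True)    # written
rf.write(10, 5, reg_write=False)  # ignored: RegWrite is deasserted
print(rf.read(9, 10))
```

Masking with 0x1F and 0xFFFFFFFF keeps register numbers to 5 bits and data to 32 bits, matching the port widths listed above.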
7 Data Memory The 2 elements needed to implement load and store instructions are the data memory and a unit that sign-extends the 16-bit constant in an I-type instruction. In addition we use the existing ALU to compute the address to access. The data memory has 2 32-bit inputs, the address and the write data, and 1 32-bit output, the read data. In addition it has 2 control lines: MemWrite and MemRead.
8 Branch Equal The beq instruction has 3 operands: two registers that are compared for equality and a 16-bit offset used to compute the branch address relative to the PC. To implement this instruction we must add the sign-extended offset to the PC. There are 2 important details: 1. The base for the address calculation is the address of the instruction after the current one. Since we compute PC+4 when fetching, we already have this address. 2. The offset is in words, not bytes, so we have to shift the offset left by 2.
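As a rough illustration of the two details above, the branch target can be computed as follows. The function names are hypothetical, and Python integers stand in for 32-bit buses (hence the final mask).

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Target of beq: (PC + 4) plus the sign-extended word offset shifted left by 2."""
    pc_plus_4 = pc + 4                       # already computed during fetch
    byte_offset = sign_extend16(offset16) << 2  # words -> bytes
    return (pc_plus_4 + byte_offset) & 0xFFFFFFFF


print(hex(branch_target(0x1000, 3)))       # forward branch by 3 words
print(hex(branch_target(0x1000, 0xFFFF)))  # offset -1: back to the beq itself
```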
9 Combining ALU and Memory Instructions The ALU datapath (slide 6) and the Memory datapath (slide 7) are similar. The differences are: –The second input to the ALU is a register (R-type) or the sign-extended offset (I-type). –The value stored into the destination register comes from the ALU (R-type) or from memory (I-type). Using 2 multiplexors (Mux) we can combine both datapaths.
10 The Complete Datapath This simple processor can compute ALU instructions, access memory or compute the next instruction's address in a single cycle.
11 ALU Control
The ALU has 3 control inputs; we use 5 of the 8 possible input combinations:
ALU control  Function
000          AND
001          OR
010          add
110          subtract
111          slt
The ALU control uses as its inputs the funct field of the instruction and a 2-bit control field called the ALUOp. For lw/sw the ALU computes the address using addition (ALUOp=00), for the R-type instructions the ALU performs one of 5 actions depending on the funct field of the instruction (ALUOp=10), and for beq the ALU performs a subtraction (ALUOp=01). The ALU control is a large truth table that, given the funct field and ALUOp, outputs the 3-bit control for the ALU.
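The truth table described above can be written as a small lookup. This is a sketch: the function and constant names are hypothetical, and the funct encodings are the standard MIPS values, which the slide itself does not list.

```python
# 3-bit ALU control codes from the slide.
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b000, 0b001, 0b010, 0b110, 0b111

# Assumption: standard MIPS funct field values (not given on the slide).
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to the 3-bit ALU control."""
    if alu_op == 0b00:   # lw/sw: address = base + offset
        return ALU_ADD
    if alu_op == 0b01:   # beq: subtract and test for zero
        return ALU_SUB
    if alu_op == 0b10:   # R-type: the funct field decides
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("unused ALUOp value")


print(format(alu_control(0b10, 0b101010), "03b"))  # slt
```

For ALUOp 00 and 01 the funct field is a don't-care, which is why the function ignores it in those branches.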
12 Main Control
Look at the formats of the R-type and I-type instructions:
R-type:
Field  opcode  rs     rt     rd     shamt  funct
Bits   31-26   25-21  20-16  15-11  10-6   5-0
I-type:
Field  opcode  rs     rt     address
Bits   31-26   25-21  20-16  15-0
The following observations can be made:
–The opcode is always in bits 31-26.
–The 2 registers to be read are always the rs (25-21) and rt (20-16) fields (R-type, beq, and store).
–The base register for load/store instructions is always rs (25-21).
–The 16-bit offset for beq, lw, and sw is always in bits 15-0.
–The destination register is in one of two places: for a lw it is rt (20-16), for an R-type it is rd (15-11). Thus we need a MUX to select which field of the instruction is written.
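The field positions listed above translate directly into shifts and masks. The sketch below uses a hypothetical `decode` helper; the register numbers for $t1, $t2, and $t3 (9, 10, 11) and the funct value for add (0x20) are standard MIPS conventions rather than something stated on the slide.

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode":  (instr >> 26) & 0x3F,  # bits 31-26
        "rs":      (instr >> 21) & 0x1F,  # bits 25-21
        "rt":      (instr >> 16) & 0x1F,  # bits 20-16
        "rd":      (instr >> 11) & 0x1F,  # bits 15-11 (R-type only)
        "shamt":   (instr >> 6) & 0x1F,   # bits 10-6  (R-type only)
        "funct":   instr & 0x3F,          # bits 5-0   (R-type only)
        "address": instr & 0xFFFF,        # bits 15-0  (I-type only)
    }


# add $t1, $t2, $t3 -> opcode 0, rs=$t2 (10), rt=$t3 (11), rd=$t1 (9), funct 0x20
add_word = (0 << 26) | (10 << 21) | (11 << 16) | (9 << 11) | 0x20
fields = decode(add_word)
print(fields["rs"], fields["rt"], fields["rd"])
```

All fields are extracted unconditionally, just as the hardware wires every bit range out of the instruction; the control decides later which ones are meaningful.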
13 The Main Control Signals
There are 7 control signals in our microprocessor. Let's see what happens when they are asserted (set to 1) and deasserted (set to 0):
Signal   | Deasserted | Asserted
RegDst   | The Write register is rt | The Write register is rd
RegWrite | None | The Write register is written with the Write data
ALUSrc   | The 2nd ALU operand comes from the register file | The 2nd ALU operand is the 16-bit address
PCSrc    | PC = PC + 4 | PC = Branch target
MemRead  | None | Memory contents at the address input are put on the Read data output
MemWrite | None | Memory contents at the address input are replaced by the Write data input
MemtoReg | The value of the register Write data input is from the ALU | The value of the register Write data input is from memory
15 Opcode to Control
The control lines are determined by the opcodes of the instructions. The exception is the PCSrc line, which also depends on the outcome of the beq comparison: it is asserted only when Branch is set and the ALU reports that the two registers are equal (x means don't care).
Line      R-type  lw  sw  beq
RegDst    1       0   x   x
ALUSrc    0       1   1   0
MemtoReg  0       1   x   x
RegWrite  1       1   0   0
MemRead   0       1   0   0
MemWrite  0       0   1   0
Branch    0       0   0   1
ALUOp     10      00  00  01
At this stage the Control is a black box, which receives inputs and gives outputs.
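One way to picture the control "box" is as a lookup from opcode to control lines. The following is a minimal sketch, not the actual gate-level control: names such as `main_control` and `pc_src` are hypothetical, the opcode numbers are the standard MIPS encodings (not shown on the slide), and don't-care entries are represented by None.

```python
X = None  # don't care

CONTROL_TABLE = {
    "R-type": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":     dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":     dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":    dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

# Assumption: standard MIPS opcode values.
OPCODE_TO_CLASS = {0: "R-type", 35: "lw", 43: "sw", 4: "beq"}


def main_control(opcode):
    """Return the control lines for a 6-bit opcode."""
    return CONTROL_TABLE[OPCODE_TO_CLASS[opcode]]


def pc_src(branch, zero):
    """PCSrc is asserted only for a beq whose operands compared equal."""
    return branch & zero


signals = main_control(35)  # lw
print(signals["MemRead"], signals["RegWrite"], pc_src(signals["Branch"], 1))
```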
16 Operation of the Datapath
Let's see the stages of execution of an R-type instruction, add $t1,$t2,$t3:
1. An instruction is fetched from memory and the PC is incremented.
2. Two registers, $t2 and $t3, are read from the register file.
3. The ALU operates on the data read from the register file.
4. The result of the ALU is written into register $t1.
This doesn't really happen in 4 steps because the implementation is combinational, but at the end of the clock cycle the result is written into the destination register.
Let's look at lw $t1,offset($t2):
1. An instruction is fetched from memory and the PC is incremented.
2. The register $t2 is read from the register file.
3. The ALU computes the sum of $t2 and the sign-extended offset.
4. The sum from the ALU is used as the address for the data memory.
5. The data from memory is written into register $t1.
17 Adding the Jump Instruction The j instruction uses pseudodirect addressing: the upper 4 bits of PC+4 are concatenated with the 26-bit address field of the J-type instruction, shifted left by 2.
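The concatenation can be sketched with masks and shifts. The function name is hypothetical, and the opcode value 2 for j in the usage lines is the standard MIPS encoding rather than something given on the slide.

```python
def jump_target(pc, instr):
    """Pseudodirect address: PC+4 bits 31-28, then the 26-bit field, then 00."""
    upper4 = (pc + 4) & 0xF0000000        # upper 4 bits of PC+4
    field26 = instr & 0x03FFFFFF          # bits 25-0 of the J-type instruction
    return upper4 | (field26 << 2)        # shifting left by 2 gives a 28-bit byte address


j_word = (2 << 26) | 0x0100000            # j with address field 0x0100000
print(hex(jump_target(0x00400018, j_word)))
```

Because only 28 bits come from the instruction, a jump cannot leave the 256 MB region selected by the upper 4 bits of PC+4.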
18 Performance of Single-Cycle Machines
Let's assume that the operation time for the following units is: Memory - 2 nanoseconds (ns), ALU and adders - 2 ns, Register file - 1 ns. We will assume that MUXs, control, sign-extension, PC accesses, and wires have no delays. Which implementation is faster?
1. Every instruction executes in 1 clock cycle of fixed length.
2. Every instruction executes in 1 clock cycle whose length varies with the instruction.
Let's look at the time (in ns) needed by each instruction:
Inst.   Fetch  Reg. Rd  ALU op  Memory  Reg. Wr  Total
R-type  2      1        2       0       1        6 ns
Load    2      1        2       2       1        8 ns
Store   2      1        2       2       -        7 ns
Branch  2      1        2       -       -        5 ns
Jump    2      -        -       -       -        2 ns
19 Fixed vs. Variable Cycle Length Let's assume a program has the following instruction mix: 24% loads, 12% stores, 44% R-type, 18% branches, 2% jumps. CPU execution time = Instruction count * Cycle time. For the fixed cycle length the cycle time is 8 ns, long enough for the longest instruction (load). Thus each instruction takes 8 ns to execute. For the variable cycle time the average CPU clock cycle is: 8*24% + 7*12% + 6*44% + 5*18% + 2*2% = 6.3 ns. The variable clock implementation is therefore faster, by a factor of 8/6.3 = 1.27, but it is extremely hard to implement. So why not use the single-cycle implementation, which is only about 27% slower? Because once we add instructions such as multiply and divide, which take many times longer than the other instructions, the fixed cycle must stretch to fit them and this scheme becomes too slow.
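The comparison can be checked with a few lines of arithmetic. The variable names are illustrative; the latencies and the instruction mix are the ones given on the slides. Note that the unrounded average is 6.34 ns, which the slide rounds to 6.3 ns.

```python
# Per-instruction latency in ns (from the single-cycle timing table).
latency_ns = {"load": 8, "store": 7, "rtype": 6, "branch": 5, "jump": 2}

# Instruction mix as fractions of the dynamic instruction count.
instr_mix = {"load": 0.24, "store": 0.12, "rtype": 0.44, "branch": 0.18, "jump": 0.02}

# Fixed clock: every instruction pays for the slowest one.
fixed_cycle_ns = max(latency_ns.values())

# Variable clock: each instruction pays only for what it uses.
variable_cycle_ns = sum(latency_ns[k] * instr_mix[k] for k in instr_mix)

# Same instruction count in both cases, so the ratio of cycle times is the speedup.
speedup = fixed_cycle_ns / variable_cycle_ns

print(f"fixed: {fixed_cycle_ns} ns, variable: {variable_cycle_ns:.2f} ns, "
      f"speedup: {speedup:.2f}x")
```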
20 A Multicycle Implementation We broke each instruction into several steps; we can use these steps to build a multicycle implementation. Each step takes 1 cycle. The multicycle implementation allows a functional unit to be used more than once per instruction, as long as it is used on different clock cycles. We now have only a single memory unit and a single ALU. In addition we need registers to hold the output of each stage.
21 New Registers and MUXs
We have now added several new registers (which are transparent to the programmer) and some new MUXs:
–Instruction Register (IR) - the instruction fetched
–Memory Data Register (MDR) - data read from memory
–A, B - registers read from the register file
–ALUOut - result of the ALU operation
The new MUXs added are:
–An additional MUX on the 1st ALU input, which chooses between the A register and the PC.
–The MUX on the 2nd ALU input is changed from a 2-way to a 4-way MUX. The additional inputs are the constant 4 (used to increment the PC) and the sign-extended and shifted offset field (used in beq).
22 Multicycle Diagram There are 3 possible sources for the PC value: 1. The output of the ALU which is PC+4; 2. The register ALUOut which is the address of the computed branch target; 3. The lower 26 bits of the IR shifted left by 2, concatenated with the 4 upper bits of the PC.
23 The Instruction Execution Stages (1,2)
1. Instruction Fetch (IF) - Fetch the instruction from memory and compute the address of the next sequential instruction:
IR = Memory[PC];
PC = PC + 4;
2. Instruction Decode (ID) and register fetch - Get the registers from the register file and compute the potential branch address (even if it turns out not to be needed):
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extended(IR[15-0])<<2);
24 The Instruction Execution Stages (3)
3. Execution (EX), memory address computation, or branch completion - In this stage the operation is determined by the instruction class:
A. Memory reference: ALUOut = A + sign-extended(IR[15-0]);
B. R-type: ALUOut = A op B;
C. Branch: if (A == B) PC = ALUOut;
D. Jump: PC = PC[31-28] cat (IR[25-0]<<2)
25 The Instruction Execution Stages (4,5)
4. Memory access (Mem) or R-type completion - During this step the load/store instruction accesses memory or the AL instruction writes its result.
A. Memory reference:
MDR = Memory[ALUOut]; (load)
Memory[ALUOut] = B; (store)
B. R-type: Reg[IR[15-11]] = ALUOut;
5. Memory read completion step - The load completes by writing the value from memory into a register.
Reg[IR[20-16]] = MDR;
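The five steps can be traced in software. The sketch below is a simplified model under several assumptions: the function names are hypothetical, memory is a Python dict from byte address to 32-bit word, only add is modeled among the R-type operations, and the opcode and funct numbers are the standard MIPS encodings. It returns the new PC and the number of steps (cycles) the instruction needed.

```python
def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute(pc, reg, mem):
    """Run one instruction through the multicycle steps; return (new_pc, cycles)."""
    # Step 1 (IF): IR = Memory[PC]; PC = PC + 4
    ir = mem[pc]
    pc = (pc + 4) & 0xFFFFFFFF
    # Step 2 (ID): read A and B, speculatively compute the branch target
    opcode = (ir >> 26) & 0x3F
    a = reg[(ir >> 21) & 0x1F]
    b = reg[(ir >> 16) & 0x1F]
    alu_out = (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF
    # Step 3 (EX): depends on the instruction class
    if opcode == 2:                                   # j completes here
        return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2), 3
    if opcode == 4:                                   # beq completes here
        return (alu_out if a == b else pc), 3
    if opcode == 0 and (ir & 0x3F) == 0x20:           # R-type (add only in this sketch)
        alu_out = (a + b) & 0xFFFFFFFF
        reg[(ir >> 11) & 0x1F] = alu_out              # Step 4: R-type completion
        return pc, 4
    if opcode in (35, 43):                            # lw / sw
        alu_out = (a + sign_extend16(ir)) & 0xFFFFFFFF
        if opcode == 43:
            mem[alu_out] = b                          # Step 4: store
            return pc, 4
        mdr = mem[alu_out]                            # Step 4: load reads memory
        reg[(ir >> 16) & 0x1F] = mdr                  # Step 5: write back
        return pc, 5
    raise ValueError("instruction not modeled in this sketch")


regs = [0] * 32
regs[10] = 0x100                                      # $t2 holds the base address
memory = {0: (35 << 26) | (10 << 21) | (9 << 16) | 8,  # lw $t1, 8($t2)
          0x108: 42}
print(execute(0, regs, memory), regs[9])
```

The model collapses each instruction into one function call, so it shows which steps an instruction passes through rather than the cycle-by-cycle timing of the real datapath.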
26 Cycles Per Instruction (CPI) The CPI of a program defines how many cycles an average instruction takes. Assuming an instruction mix (for the gcc compiler) of 22% loads, 11% stores, 49% R-type, 16% branches, and 2% jumps, what is the CPI, assuming each state requires one clock cycle? The number of clock cycles for each instruction format is: Loads: 5; Stores: 4; R-type: 4; Branches: 3; Jumps: 3. Thus the CPI = 0.22*5 + (0.11 + 0.49)*4 + (0.16 + 0.02)*3 = 4.04. This is better than the worst-case CPI of 5, in which each instruction would take the same number of clock cycles as a load.
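The weighted average above is easy to reproduce. The dictionary names are illustrative; the cycle counts and the gcc mix are taken from the slide.

```python
# Clock cycles per instruction class in the multicycle design.
cycles = {"load": 5, "store": 4, "rtype": 4, "branch": 3, "jump": 3}

# gcc instruction mix as fractions.
gcc_mix = {"load": 0.22, "store": 0.11, "rtype": 0.49, "branch": 0.16, "jump": 0.02}

# CPI is the mix-weighted average of the per-class cycle counts.
cpi = sum(cycles[k] * gcc_mix[k] for k in gcc_mix)
worst_case_cpi = max(cycles.values())

print(f"CPI = {cpi:.2f}, worst case = {worst_case_cpi}")
```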
27 Exceptions One of the hardest parts of control is implementing exceptions and interrupts, events other than branches and jumps which change the normal flow of instruction execution. An exception is an unexpected event that happens during program execution, such as an arithmetic overflow or an illegal instruction (these are the only 2 in our design). An interrupt is an event that is external to the processor, such as a request by an I/O device. When an exception occurs the machine must save the address of the offending instruction in the exception program counter (EPC), and then transfer execution to the OS. The OS might service the exception and return control to the program, or terminate execution.
28 Causes of Exceptions In order for the OS to handle the exception it must know the cause of the exception. MIPS has a register called the Cause register which holds the reason for the exception. A second method is called vectored interrupts. In a vectored interrupt the address to which control is transferred is determined by the exception cause, so the OS knows the cause of the exception by the address that is jumped to. We need two additional registers: the EPC, which holds the address of the offending instruction, and the Cause register, which holds 0 for an undefined instruction and 1 for arithmetic overflow. We will need 2 control signals to write to the EPC and Cause registers (EPCWrite and CauseWrite) and a signal to set the LSB of the Cause register (IntCause).
29 Datapath with Exceptions The control sets IntCause when it can't decode the instruction or when the ALU signals an overflow. The next-PC MUX now has 4 inputs; the exception handler address is the added input.