Presentation is loading. Please wait.

Presentation is loading. Please wait.

Designing a Multicycle Processor

Similar presentations


Presentation on theme: "Designing a Multicycle Processor"— Presentation transcript:

1 Designing a Multicycle Processor
Adapted from Dave Patterson (http.cs.berkeley.edu/~patterson) lecture slides

2 Dave Patterson serving as ACM president

3 Recap: Processor Design is a Process
Bottom-up assemble components in target technology to establish critical timing Top-down specify component behavior from high-level requirements Iterative refinement establish partial solution, expand and improve => Instruction Set Architecture processor datapath control Reg. File Mux ALU Reg Mem Decoder Sequencer Cells Gates

4 Recap: A Single Cycle Datapath
We have everything except control signals (underlined) How to generate the control signals? 32 ALUctr Clk busW RegWr busA busB 5 Rw Ra Rb 32 32-bit Registers Rs Rt Rd RegDst Extender Mux 16 imm16 ALUSrc ExtOp MemtoReg Data In WrEn Adr Data Memory MemWr ALU Instruction Fetch Unit Zero Instruction<31:0> 1 <21:25> <16:20> <11:15> <0:15> Imm16 nPC_sel The result of the last lecture is this single-cycle datapath. +1 = 6 min. (X:46)

5 The “Truth Table” for the Main Control
op 6 ALU (Local) func 3 ALUop ALUctr RegDst ALUSrc : Conditions op R-type ori lw sw beq jump RegDst 1 x x x ALUSrc 1 1 1 x Now that we have taken care of the Local Control (ALU Control), let’s refocus our attention to the Main Controller. The job of the Main Control is to look at the Opcode field of the instruction and generate these control signals for the datapath (RegDst, ... ExtOp) as well as the 3-bit ALUop field for the ALU Control. Here, I have shown you the symbolic value of the ALUop field as well as the actual bit assignment. For example here (2nd column), the R-type ALUop is encode as 100 and the Add operation (3rd column) is encoded as 000.. This is call a quote “Truth Table” unquote because if you think about it, this is like having the truth table rotates 90 degrees. Let me show you what I mean by that. +3 = 65 min. (Y:45) MemtoReg 1 x x x RegWrite 1 1 1 MemWrite 1 Branch 1 Jump 1 ExtOp x 1 1 x x ALUop (Symbolic) “R-type” Or Add Add Subtract xxx ALUop <2> 1 x ALUop <1> 1 x ALUop <0> 1 x

6 PLA Implementation of the Main Control
op<0> op<5> . <0> R-type ori lw sw beq jump RegWrite ALUSrc Similarly, for ALUSrc, we need to OR the ori, load, and store terms together because we need to assert the ALUSrc signals whenever we have the Ori, load, or store instructions. The RegDst, MemtoReg, MemWrite, Branch, and Jump signals are very simple. They don’t need to OR any product terms together because each is asserted for only one instruction. For example, RegDst is asserted ONLY for R-type instruction and MemtoReg is asserted ONLY for load instruction. ExtOp, on the other hand, needs to be set to 1 for both the load and store instructions so the immediate field is sign extended properly. Therefore, we need to OR the load and store terms together to form the signal ExtOp. Finally, we have the ALUop signals. But clever encoding of the ALUop field, we are able to keep them simple so that no OR gates is needed. If you don’t already know, this regular structure with an array of AND gates followed by another array of OR gates is called a Programmable Logic Array, or PLA for short. It is one of the most common ways to implement logic function and there are a lot of CAD tools available to simplify them. +3 = 70 min. (Y:50) RegDst MemtoReg MemWrite Branch Jump ExtOp ALUop<2> ALUop<1> ALUop<0>

7 Systematic Generation of Control
OPcode Control Logic / Store (PLA, ROM) Decode microinstruction Conditions Instruction Control Points Datapath In our single-cycle processor, each instruction is realized by exactly one control command or “microinstruction” in general, the controller is a finite state machine microinstruction can also control sequencing

8 Behavioral models of Datapath Components
entity adder16 is generic (ccOut_delay : TIME := 12 ns; adderOut_delay: TIME := 12 ns); port(A, B: in vlbit_1d(15 downto 0); DOUT: out vlbit_1d(15 downto 0); CIN: in vlbit; COUT: out vlbit); end adder16; skipped architecture behavior of adder16 is begin adder16_process: process(A, B, CIN) variable tmp : vlbit_1d(18 downto 0); variable adder_out : vlbit_1d(15 downto 0); variable carry : vlbit; tmp := addum (addum (A, B), CIN); adder_out := tmp(15 downto 0); carry :=tmp(16); COUT <= carry after ccOut_delay; DOUT <= adder_out after adderOut_delay; end process; end behavior; 16 A B DOUT Cin Cout addum adds two n-bit vectors to produce an n+1 bit vector

9 skipped Note on VHDL variables count: process (x)
Can only be used in processes Are declared before the begin keyword of a process statement Assignment operator is := Can be assigned an initial value by adding the := and some constant expression after the type part of the declaration Do not have or cause events and are modified differently than signals In the following code, if x is a bit signal, then the process will count the number of rising and falling edges that occur on the signal x skipped count: process (x) variable changes : integer := -1; begin changes:=changes+1; end process;

10 Behavioral Specification of Control Logic
entity maincontrol is port(opcode: in vlbit_1d(5 downto 0); equal_cond: in vlbit; extop out vlbit; ALUsrc out vlbit; ALUop out vlbit_1d(2 downto 0); MEMwr out vlbit; MemtoReg out vlbit; RegWr out vlbit; RegDst out vlbit; nPC out vlbit; end maincontrol; skipped Decode / Control-store address modeled by Case statement Each arm drives control signals for that operation just like the microinstruction either can be symbolic

11 Abstract View of our single cycle processor
Main Control op ALU control fun Equal ALUSrc ExtOp MemRd MemWr nPC_sel RegDst RegWr MemWr ALUctr Reg. Wrt Register Fetch ALU Ext Mem Access PC Next PC Instruction Fetch Result Store Data Mem looks like a FSM with PC as state

12 What’s wrong with our CPI=1 processor?
Arithmetic & Logical PC Inst Memory Reg File ALU mux mux setup Load PC Inst Memory Reg File ALU Data Mem mux mux setup Critical Path Store PC Inst Memory Reg File ALU Data Mem mux Branch PC Inst Memory Reg File cmp mux Long Cycle Time All instructions take as much time as the slowest Real memory is not so nice as our idealized memory cannot always get the job done in one (short) cycle It seems the timing of Branch has been deliberately shortened with the addition of a highly efficient comparator.

13 Memory Access Time Physics => fast memories are small (large memories are slow) question: register file vs. memory => Use a hierarchy of memories Storage Array selected word line storage cell address bit line address decoder sense amps mem. bus proc. bus memory L2 Cache Cache Processor 1 cycle 2-3 cycles cycles

14 => Reducing Cycle Time
Cut combinational dependency graph and insert register / latch Do same work in two fast cycles, rather than one slow one storage element storage element Acyclic Combinational Logic (A) Acyclic Combinational Logic => storage element Acyclic Combinational Logic (B) storage element storage element

15 Basic Limits on Cycle Time
Next address logic PC = (branch == 1) ? PC offset : PC + 4 Instruction Fetch InstructionReg <= Mem[PC] Register Access A <= R[rs] ALU operation S <= A + B Control MemRd MemWr RegWr MemWr nPC_sel RegDst ALUSrc ALUctr ExtOp Reg. File Operand Fetch Exec Instruction Fetch Mem Access PC Next PC Result Store Data Mem

16 Partitioning the CPI=1 Datapath
Add registers between smallest steps MemRd MemWr nPC_sel RegDst RegWr MemWr ExtOp ALUSrc ALUctr Reg. File Operand Fetch Exec Instruction Fetch Mem Access PC Next PC Result Store Data Mem

17 Example Multicycle Datapath
MemToReg RegWr MemRd MemWr RegDst nPC_sel ALUSrc ALUctr ExtOp Equal Reg. File A Ext ALU Reg File S PC IR Next PC B Mem Access M Data Mem Instruction Fetch Result Store Operand Fetch

18 Recall: Step-by-step Processor Design
Step 1: ISA => Logical Register Transfers Step 2: Components of the Datapath Step 3: RTL + Components => Datapath Step 4: Datapath + Logical RTs => Physical RTs Step 5: Physical RTs => Control

19 Step 4: R-type (add, sub, . . .) Logical Register Transfer
Physical Register Transfers inst Logical Register Transfers ADDU R[rd] <– R[rs] + R[rt]; PC <– PC + 4 inst Physical Register Transfers IR <– MEM[PC] ADDU A<– R[rs]; B <– R[rt] S <– A + B R[rd] <– S; PC <– PC + 4 Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem

20 Step 4:Logical immed Logical Register Transfer
Physical Register Transfers inst Logical Register Transfers ORi R[rt] <– R[rs] OR zx(Im16); PC <– PC + 4 inst Physical Register Transfers IR <– MEM[pc] ORi A<– R[rs]; B <– R[rt] S <– A OR ZeroExt(Im16) R[rt] <– S; PC <– PC + 4 Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem

21 Step 4 : Load Logical Register Transfer Physical Register Transfers
inst Logical Register Transfers LW R[rt] <– MEM(R[rs] + sx(Im16)); PC <– PC + 4 Logical Register Transfer Physical Register Transfers inst Physical Register Transfers IR <– MEM[PC] LW A<– R[rs]; B <– R[rt] S <– A + SignEx(Im16) M <– MEM[S] R[rt] <– M; PC <– PC + 4 Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem

22 Step 4 : Store Logical Register Transfer Physical Register Transfers
inst Logical Register Transfers SW MEM(R[rs] + sx(Im16) <– R[rt]; PC <– PC + 4 Logical Register Transfer Physical Register Transfers inst Physical Register Transfers IR <– MEM[PC] SW A<– R[rs]; B <– R[rt] S <– A + SignEx(Im16); MEM[S] <– B PC <– PC + 4 Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem

23 Step 4 : Branch Logical Register Transfer Physical Register Transfers
inst Logical Register Transfers BEQ if R[rs] == R[rt] then PC <= PC sx(Im16) << 2 else PC <= PC + 4 inst Physical Register Transfers IR <– MEM[PC] A<– R[rs]; B <– R[rt] BEQ|Eq PC <– PC + 4 inst Physical Register Transfers IR <– MEM[PC] A<– R[rs]; B <– R[rt] BEQ|Eq PC <– PC sx(Im16)<<2 Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem

24 Alternative datapath: Multiple Cycle Datapath
Minimizes Hardware: 1 memory (Princeton organization), 1 adder (ALU) PCWr PCWrCond PCSel BrWr Zero IorD MemWr IRWr RegDst RegWr ALUSelA 1 Target 32 32 Mux PC Mux 1 32 Zero Rs Mux 1 Ra 32 RAdr 5 32 Rt Rb busA Putting it all together, here it is: the multiple cycle datapath we set out to built. +1 = 47 min. (Y:47) 32 32 Ideal Memory ALU Instruction Reg Mux 1 5 Reg File 32 ALU Out Rt 4 Rw 32 WrAdr 32 1 32 32 Rd Din Dout busW busB 32 2 32 ALU Control Mux 1 3 << 2 2 Extend Imm 16 32 ALUOp ExtOp MemtoReg ALUSelB Note: On p. 323 of COD ed. 3, RAdr & WrAdr are combined into Address.

25 Seminal works of Mealy and Moore
G.H. Mealy, A method for synthesizing sequential circuits, Bell Technical Journal, vol. 34, pp. 1045~1079, 1955. E.F. Moore, Gedanken-experiments on sequential machines, Automata Studies (C.E. Shannon & J. McCarthy, eds), pp.29~153, Princeton University Press, 1956. Note: Gedanken = thought, mind Aphorisms: "Anything on earth you want to do is play. Anything on earth you have to do is work. Play will never kill you, work will. I never worked a day in my life." - Dr. Leila Denmark, 105 (in 2004), USA's oldest practicing physician. “The purpose of computing is insight, not numbers” and “Anything a faculty member can learn, a student can easily.” - Richard Hamming, inventor of Hamming code.

26 Control Model State specifies control points for Register Transfer
Transfer occurs upon exiting state (same falling edge) inputs (conditions) Next State Logic State X Register Transfer Control Points Control State Depends on Input Output Logic outputs (control points)

27 Step 4 => Control Specification for multicycle proc
IR <= MEM[PC] R-type A <= R[rs] B <= R[rt] S <= A fun B R[rd] <= S PC <= PC + 4 S <= A or ZX R[rt] <= S ORi S <= A + SX R[rt] <= M M <= MEM[S] LW MEM[S] <= B BEQ & Equal BEQ & ~Equal +SX << 2 SW “instruction fetch” “decode / operand fetch” Execute read/write Memory Write-back

28 Traditional FSM Controller
(current) state next state op cond control signals Truth Table next state control signals 11 Equal 6 state 4 op datapath state

29 Step 5: datapath + state diagram => control
Translate RTs (register transfers) into control points Assign states Then go build the controller

30 Mapping RTs to Control Signals
IR <= MEM[PC] “instruction fetch” imem_rd, IRen A <= R[rs] B <= R[rt] “decode / operand fetch” Aen, Ben LW BEQ & Equal R-type ORi SW BEQ & ~Equal Execute S <= A fun B PC <= PC + 4 +SX << 2 S <= A or ZX S <= A + SX S <= A + SX PC <= PC + 4 ALUfun, Sen read/write Memory M <= MEM[S] MEM[S] <= B PC <= PC + 4 R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 Write-back RegDst, RegWr, PCen

31 Assigning States (State Assignment)
IR <= MEM[PC] “instruction fetch” 0000 A <= R[rs] B <= R[rt] “decode / operand fetch” 0001 LW BEQ & Equal R-type ORi SW BEQ & ~Equal PC <= PC + 4 +SX << 2 Execute S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= PC + 4 0100 0110 1000 1011 0011 0010 read/write Memory M <= MEM[S] MEM[S] <= B PC <= PC + 4 1001 1100 R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 Write-back 0101 0111 1010

32 Detailed Control Specification
State Op field Eq Next IR PC Ops Exec Mem Write-Back en en sel AenBenEx Sr ALU SenR W MenM-R Wr Dst 0000 ?????? ? 0001 BEQ 0001 BEQ 0001 R-type x 0001 orI x 0001 LW x 0001 SW x 0010 xxxxxx x 0011 xxxxxx x 0100 xxxxxx x fun 1 0101 xxxxxx x 0110 xxxxxx x or 1 0111 xxxxxx x 1000 xxxxxx x add 1 1001 xxxxxx x 1010 xxxxxx x 1011 xxxxxx x add 1 1100 xxxxxx x -all same in Moore machine R: ORi: LW: SW:

33 Performance Evaluation
What is the average CPI? state diagram gives CPI for each instruction type workload gives frequency of each type Type CPIi for type Frequency CPIi x freqi Arith/Logic 4 40% 1.6 Load 5 30% 1.5 Store 4 10% 0.4 branch 3 20% 0.6 Average CPI: 4.1

34 Controller Design The state diagrams that define the controller for an instruction set processor are highly structured Use this structure to construct a simple “microsequencer” Control reduces to programming this very simple device microprogramming sequencer control datapath control microinstruction micro-PC sequencer

35 Example: Jump-Counter
i i 0000 i+1 Mapping ROM op-code zero inc load Counter

36 Using a Jump Counter “instruction fetch” IR <= MEM[PC] 0000 inc
A <= R[rs] B <= R[rt] “decode” 0001 load inc LW BEQ & Equal R-type ORi SW BEQ & ~Equal Execute PC <= PC + SX || 00 S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= PC + 4 0100 0110 1000 1011 0011 0010 inc inc inc inc zero zero Memory M <= MEM[S] MEM[S] <= B PC <= PC + 4 1001 1100 inc R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 Write-back 0101 0111 1010 zero zero zero zero

37 Our Microsequencer 注意:此圖之控制記憶體已較 Traditional FSM Controller
5 taken datapath control Z I L Micro-PC op-code 注意:此圖之控制記憶體已較 Traditional FSM Controller 投影片所示者縮小甚多,請注意 比較此二圖之差異。 Mapping ROM

38 Microprogram Control Specification
µPC Taken Next IR PC Ops Exec Mem Write-Back en sel A B Ex Sr ALU S R W M M-R Wr Dst 0000 ? inc 1 inc load zero zero 0100 x inc fun 1 0101 x zero 0110 x inc or 1 0111 x zero 1000 x inc add 1 1001 x inc x zero 1011 x inc add 1 1100 x zero BEQ R: ORi: LW: SW: 註:表中各暫存器(IR、A、B、S、M)之載入致能訊號皆省略en。

39 Mapping ROM– mapping opcode to PC
opcode PC (input) (output) R-type BEQ Ori LW SW 註:Mapping ROM 輸出須配合load訊號存入PC中

40 Example: Controlling Memory
PC addr InstMem_rd Instruction Memory IM_wait Data (instruction) Inst. Reg IR_en

41 Controller handles non-ideal memory
“instruction fetch” IR <= MEM[PC] wait ~wait A <= R[rs] B <= R[rt] “decode / operand fetch” LW BEQ & Equal R-type ORi SW BEQ & ~Equal Execute PC <= PC + SX || 00 S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= PC + 4 Memory M <= MEM[S] MEM[S] <= B ~wait wait wait ~wait R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 Write-back PC <= PC + 4

42 Really Simple Time-State Control
IR <= MEM[PC] R-type A <= R[rs] B <= R[rt] S <= A fun B R[rd] <= S PC <= PC + 4 S <= A or ZX R[rt] <= S ORi S <= A + SX R[rt] <= M M <= MEM[S] LW MEM[S] <= B BEQ & Equal BEQ & ~Equal PC <= PC + SX || 00 SW ~wait wait instruction fetch decode Execute Memory write-back

43 Time-state Control Path
Local decode and control at each stage Valid IRex IR IRwb Inst. Mem IRmem WB Ctrl Dcd Ctrl Ex Ctrl Mem Ctrl Equal Reg. File Reg File A S Exec PC Next PC B Mem Access M Data Mem Valid 位元指示IR、IRex、IRmem、IRwb 中之控制位元 是否已有效;此種方式常用於pipeline控制單元之設計。

44 “microprogrammed control”
Overview of Control Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique. Initial Representation Finite State Diagram Microprogram Sequencing Control Explicit Next State Microprogram counter Function + Dispatch ROMs (即Mapping ROMs) Logic Representation Logic Equations Truth Tables Implementation PLA ROM Technique “hardwired control” “microprogrammed control”

45 Summary Disadvantages of the Single Cycle Processor Long cycle time
Cycle time is too long for all instructions except the Load Multiple Cycle Processor: Divide the instructions into smaller steps Execute each step (instead of the entire instruction) in one cycle Partition datapath into equal-size chunks to minimize cycle time ~10 levels of logic between latches Follow same 5-step method for designing “real” processors

46 Summary (cont’d) Control is specified by finite state diagram
Specialize state-diagrams easily captured by microsequencer simple increment & “branch” fields datapath control fields Control design reduces to Microprogramming Control is more complicated with: complex instruction sets restricted datapaths (see the book) Simple Instruction set and powerful datapath => simple control could try to reduce hardware (see the book) rather go for speed => many instructions at once (Pipeline) !


Download ppt "Designing a Multicycle Processor"

Similar presentations


Ads by Google