1 1999 ©UCB CS 161 Review for Test 2
Instructor: L.N. Bhuyan (www.cs.ucr.edu/~bhuyan)
Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)

2 1999 ©UCB How to Study for Test 2: Chap 5
°Single-cycle (CPI=1) processor: know how to reason about processor organization (datapath, control)
 -e.g., how to add another instruction? (may require modifying the datapath, the control, or both)
 -How to add multiplexors to the datapath
 -How to design the hardwired control unit
°Multicycle (CPI>1) processor
 -Changes to the single-cycle datapath
 -Control design through an FSM
 -How to add a new instruction to the multicycle machine?

3 1999 ©UCB Putting Together a Datapath for MIPS
[Figure: the five-step MIPS datapath — PC, instruction memory (Imem), registers, ALU, and data memory (Dmem).]
°Question: Which instruction uses which steps, and what is its execution time?

4 1999 ©UCB Datapath Timing: Single-cycle vs. Pipelined
°Suppose the following delays for the major functional units:
 -2 ns for a memory access or ALU operation
 -1 ns for a register file read or write
°Total datapath delay for single-cycle:

  Insn     Insn    Reg    ALU    Data    Reg    Total
  Type     Fetch   Read   Oper   Access  Write  Time
  beq      2 ns    1 ns   2 ns                  5 ns
  R-form   2 ns    1 ns   2 ns           1 ns   6 ns
  sw       2 ns    1 ns   2 ns   2 ns           7 ns
  lw       2 ns    1 ns   2 ns   2 ns    1 ns   8 ns

°What about the multicycle datapath?
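A small Python sketch (not from the slides; the stage names and dictionary layout are illustrative) of the timing arithmetic above: each instruction's delay is the sum of the stage delays it uses, and the single-cycle clock period must cover the slowest instruction.

```python
# Per-instruction delay = sum of the stages it uses; the single-cycle clock
# period must cover the slowest instruction, so every instruction pays 8 ns.

STAGE_DELAY_NS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

INSTR_STAGES = {
    "beq":    ["fetch", "reg_read", "alu"],
    "R-form": ["fetch", "reg_read", "alu", "reg_write"],
    "sw":     ["fetch", "reg_read", "alu", "mem"],
    "lw":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
}

def instr_delay(name):
    return sum(STAGE_DELAY_NS[stage] for stage in INSTR_STAGES[name])

for name in INSTR_STAGES:
    print(f"{name:7s} {instr_delay(name)} ns")              # 5, 6, 7, 8 ns

single_cycle_period = max(instr_delay(n) for n in INSTR_STAGES)
print("single-cycle clock period:", single_cycle_period, "ns")   # 8 ns
```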

5 1999 ©UCB Implementing Main Control
°Main Control has one 6-bit input (op) and 9 outputs: RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite (7 are 1-bit; ALUOp is 2 bits)
°To build Main Control as a sum of products:
 (1) Construct a minterm for each different instruction (or for R-type as a group); each minterm corresponds to a single instruction (or to all of the R-type instructions), e.g., M_R-format, M_lw
 (2) Determine each main control output by forming the logical OR of the relevant minterms (instructions), e.g., RegWrite = M_R-format OR M_lw
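Below is a hedged software sketch of the sum-of-products idea, not the gate-level PLA: exactly one minterm fires per opcode class, and each control output is the OR of the minterms that need it. The opcode encodings are the standard MIPS ones, and the signal values follow the usual single-cycle control table.

```python
OPCODES = {"R-format": 0b000000, "lw": 0b100011, "sw": 0b101011, "beq": 0b000100}

def minterms(op):
    """One-hot minterms: exactly one is True for a supported opcode."""
    return {name: (op == code) for name, code in OPCODES.items()}

def main_control(op):
    m = minterms(op)
    return {
        "RegDst":   m["R-format"],
        "ALUSrc":   m["lw"] or m["sw"],
        "MemtoReg": m["lw"],
        "RegWrite": m["R-format"] or m["lw"],   # RegWrite = M_R-format OR M_lw
        "MemRead":  m["lw"],
        "MemWrite": m["sw"],
        "Branch":   m["beq"],
        "ALUOp":    2 if m["R-format"] else (1 if m["beq"] else 0),
    }

print(main_control(OPCODES["lw"]))   # ALUSrc, MemtoReg, RegWrite, MemRead asserted
```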

6 1999 ©UCB Single-Cycle MIPS-lite CPU
[Figure: the complete single-cycle datapath — PC, Imem, register file, sign extend, ALU and ALU control (funct bits 5:0), Dmem, and the branch adder — with Main Control (op = bits [31:26]) driving RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, Branch, and PCSrc.]

7 1999 ©UCB R-format Execution Illustration (step 4)
[Figure: the same single-cycle datapath with the R-format control settings highlighted — RegDst=1, ALUSrc=0, MemtoReg=1, PCSrc=0 — as the ALU computes [r1] + [r2].]

8 1999 ©UCB Multicycle Datapath (overview) — MIPS-lite Multicycle Version
[Figure: the multicycle datapath — PC, a single combined instruction/data memory, the register file, one ALU, and the temporary registers IR, MDR, A, B, and ALUOut.]
°One ALU (no extra adders)
°One memory (no separate Imem, Dmem)
°New temporary registers ("clocked"/require a clock input)

9 1999 ©UCB Cycle 3 Datapath (R-format) — MIPS-lite Multicycle Version
[Figure: in cycle 3 of an R-format instruction, the ALU inputs come from the A and B registers, the ALU control decodes the funct field (bits 5:0), and the result is latched into ALUOut: ALUOut = A op B.]

10 1999 ©UCB FSM diagram for Multicycle Machine
°Cycle 1 — state 0 (instruction fetch): MemRead, IorD=0, IRWrite, ALUSrcA=0, ALUSrcB=1, ALUOp=0, PCWrite, PCSrc=0
°Cycle 2 — state 1 (decode/register fetch): ALUSrcA=0, ALUSrcB=3, ALUOp=0; then branch on the opcode
°Cycle 3:
 -lw/sw (memory access path) — state 2: ALUSrcA=1, ALUSrcB=2, ALUOp=0
 -R-format execution — state 6: ALUSrcA=1, ALUSrcB=0, ALUOp=2
 -beq (branch completion) — state 8: ALUSrcA=1, ALUSrcB=0, ALUOp=1, PCWriteCond, PCSrc=1
°After the instruction completes, start a new instruction (return to state 0)
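A minimal Python sketch of the FSM above, assuming the state numbering shown (0, 1, 2, 6, 8); the lw/sw memory-access and write-back states are elided to keep it short, so those paths simply return to fetch here.

```python
# Each entry: asserted control signals for the state, plus a function giving
# the next state as a function of the opcode class.

FSM = {
    0: ({"MemRead": 1, "IorD": 0, "IRWrite": 1, "ALUSrcA": 0, "ALUSrcB": 1,
         "ALUOp": 0, "PCWrite": 1, "PCSrc": 0},
        lambda op: 1),                                          # fetch -> decode
    1: ({"ALUSrcA": 0, "ALUSrcB": 3, "ALUOp": 0},
        lambda op: {"lw": 2, "sw": 2, "R": 6, "beq": 8}[op]),   # dispatch on opcode
    2: ({"ALUSrcA": 1, "ALUSrcB": 2, "ALUOp": 0}, lambda op: 0),  # lw/sw address
    6: ({"ALUSrcA": 1, "ALUSrcB": 0, "ALUOp": 2}, lambda op: 0),  # R-format execute
    8: ({"ALUSrcA": 1, "ALUSrcB": 0, "ALUOp": 1,
         "PCWriteCond": 1, "PCSrc": 1},          lambda op: 0),   # branch completion
}

def step(state, opcode):
    signals, next_state = FSM[state]
    return signals, next_state(opcode)

state = 0
for _ in range(3):                      # cycles 1-3 of an R-format instruction
    signals, state = step(state, "R")
    print(signals)
```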

11 1999 ©UCB Implementing the FSM controller (C.3)
°PLA or ROM implementation of both the next-state and output functions
°Inputs: the instruction register opcode field (Op5–Op0) and the state register (S3–S0)
°Outputs: the datapath control points (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSrc, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) plus the next-state bits (NS3–NS0)

12 1999 ©UCB Micro-programmed Control (Chap. 5.5)
°In microprogrammed control, FSM states become microinstructions of a microprogram ("microcode")
 -one FSM state = one microinstruction
 -each microinstruction is usually represented textually, like an assembly instruction
°The FSM current-state register becomes the microprogram counter (micro-PC)
 -normal sequencing: add 1 to the micro-PC to get the next microinstruction
 -microprogram branch: separate dispatch logic determines the next microinstruction

13 1999 ©UCB Micro-program for Multi-cycle Machine

            ALU                 Reg    Memory         PC      Next
  Label     Op     In1   In2    File   Op    Src      Write   µ-Instr
  -------------------------------------------------------------------
  Fetch:    Add    PC    4             Rd    PC       ALU
            Add    PC    SE*4   Rd                            [D1]
  Mem:      Add    A     SE                                   [D2]
  LW:                                  Rd    ALU
                                 Wr                           Fetch
  SW:                                  Wr    ALU              Fetch
  Rform:    funct  A     B       Wr                           Fetch
  BEQ:      Sub    A     B                            Equ     Fetch

  D1 = { Mem, Rform, BEQ }    D2 = { LW, SW }
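The sketch below illustrates the sequencing mechanism from slide 12 applied to this microprogram: the micro-PC either advances by 1, dispatches through D1 or D2 on the opcode, or returns to Fetch. The textual microinstruction fields and dispatch-table indices are illustrative simplifications, not a hardware encoding.

```python
# MICROCODE rows: (label, textual microinstruction fields, sequencing control).
# next_upc implements the sequencing rules: sequential, dispatch, or back to Fetch.

MICROCODE = [
    ("Fetch", "ALU: Add PC 4;   Mem: Rd PC;   PCWrite <- ALU", "seq"),        # 0
    ("",      "ALU: Add PC SE*4;   Reg: Rd",                   "dispatch1"),  # 1
    ("Mem",   "ALU: Add A SE",                                 "dispatch2"),  # 2
    ("LW",    "Mem: Rd ALU",                                   "seq"),        # 3
    ("",      "Reg: Wr",                                       "fetch"),      # 4
    ("SW",    "Mem: Wr ALU",                                   "fetch"),      # 5
    ("Rform", "ALU: funct A B;   Reg: Wr",                     "fetch"),      # 6
    ("BEQ",   "ALU: Sub A B;   PCWrite if Equ",                "fetch"),      # 7
]
DISPATCH1 = {"lw": 2, "sw": 2, "R": 6, "beq": 7}   # D1 = { Mem, Rform, BEQ }
DISPATCH2 = {"lw": 3, "sw": 5}                     # D2 = { LW, SW }

def next_upc(upc, sequencing, opcode):
    if sequencing == "seq":       return upc + 1
    if sequencing == "dispatch1": return DISPATCH1[opcode]
    if sequencing == "dispatch2": return DISPATCH2[opcode]
    return 0                      # "fetch": back to the Fetch microinstruction

upc, opcode = 0, "lw"
for _ in range(5):                # the five microinstructions of a lw
    label, fields, seqctl = MICROCODE[upc]
    print(upc, label or "-", fields)
    upc = next_upc(upc, seqctl, opcode)
```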

14 1999 ©UCB How to Study for Test 2: Chap 6
°Pipelined processor: how do the pipelined datapath and control differ from the Chapter 5 architectures?
 -All instructions execute the same 5 stages
 -Pipeline registers separate the stages of the datapath and control
°Problems for pipelining
 -Pipeline hazards: structural, data, control (how is each solved?)

15 1999 ©UCB Pipelining Lessons
[Figure: the laundry analogy — four tasks A–D with 30-minute stages, overlapped between 6 PM and 9 PM.]
°Pipelining doesn't help the latency (execution time) of a single task; it helps the throughput of the entire workload
°Multiple tasks operate simultaneously using different resources
°Potential speedup = number of pipe stages
°What is the real speedup?
°Time to "fill" the pipeline and time to "drain" it reduce the speedup

16 1999 ©UCB Space-Time Diagram
°To simplify the pipeline, every instruction takes the same number of steps, called stages
°One clock cycle per stage
[Figure: space-time diagram — five instructions flowing through the IFtch, Dcd, Exec, Mem, WB stages, each shifted one clock cycle later than the previous one.]

17 1999 ©UCB Problems for Pipelining
°Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup
 -Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
 -Control hazards: conditional branches and other instructions may stall the pipeline, delaying later instructions (must check the detergent level before washing the next load)
 -Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (matching socks in a later load)

18 1999 ©UCB Control Hazard: Solution 1
°Guess the branch outcome, then back up if wrong: "branch prediction"
 -For example, predict not taken
 -Impact: 1 clock per branch instruction if right, 2 if wrong (static prediction: right ~50% of the time)
 -More dynamic scheme: keep a history for each branch instruction (right ~90% of the time)
[Figure: pipeline diagram for add, beq, and a load — the instruction after the branch is fetched before the branch outcome is known.]
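A small sketch of the "keep history" idea, using a 2-bit saturating counter per branch indexed by the low PC bits; the table size and indexing scheme are assumptions for illustration, not something the slide specifies.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per table entry; a count >= 2 predicts taken."""

    def __init__(self, entries=256):
        self.counters = [1] * entries            # start "weakly not taken"

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)    # word-aligned PC indexes the table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = min(3, self.counters[i] + 1) if taken \
                           else max(0, self.counters[i] - 1)

predictor = TwoBitPredictor()
hits = 0
for taken in [True, True, False, True, True]:    # a mostly-taken branch
    hits += predictor.predict(0x400048) == taken
    predictor.update(0x400048, taken)
print(f"{hits}/5 predictions correct")           # accuracy improves as history builds
```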

19 1999 ©UCB Control Hazard: Solution 2
°Redefine branch behavior (the branch takes effect after the next instruction): "delayed branch"
°Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the "delay slot" (about 50% of the time)
[Figure: pipeline diagram for add, beq, a miscellaneous instruction in the delay slot, and a load.]

20 1999 ©UCB Data Hazard on $1: Illustration
°Dependencies backwards in time are hazards

  add $1,$2,$3
  sub $4,$1,$3
  and $6,$1,$7
  or  $8,$1,$9
  xor $10,$1,$11

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) — sub, and, and or read $1 before the add writes it back.]

21 1999 ©UCB Data Hazard: Solution
°"Forward" the result from one stage to another
°The "or" is OK if the register file is implemented properly (write in the first half of the cycle, read in the second half)

  add $1,$2,$3
  sub $4,$1,$3
  and $6,$1,$7
  or  $8,$1,$9
  xor $10,$1,$11

[Figure: the same pipeline diagram with the add's result forwarded to the EX stages of the dependent instructions.]
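A hedged sketch of the forwarding decision for the ALU's first operand, following the usual textbook conditions (the EX/MEM result takes priority over MEM/WB); the pipeline-register fields are plain dictionaries here, purely for illustration.

```python
def forward_a(id_ex, ex_mem, mem_wb):
    """Decide where the ALU's first (rs) operand comes from this cycle."""
    rs = id_ex["rs"]
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == rs:
        return "EX/MEM.ALUOut"      # result computed last cycle, not yet written back
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == rs:
        return "MEM/WB.WriteData"   # result being written back this cycle
    return "ID/EX.ReadData1"        # no hazard: use the register-file value

# add $1,$2,$3 followed by sub $4,$1,$3: sub's rs ($1) matches add's rd in EX/MEM
print(forward_a({"rs": 1},
                {"reg_write": True,  "rd": 1},
                {"reg_write": False, "rd": 0}))   # -> EX/MEM.ALUOut
```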

22 1999 ©UCB Data Hazard Even with Forwarding
°Must stall the pipeline 1 cycle (insert 1 bubble)

  lw  $1, 0($2)
  sub $4,$1,$6
  and $6,$1,$7
  or  $8,$1,$9

[Figure: pipeline diagram — the lw data is not available until MEM, so the dependent sub must wait one cycle (a bubble) even with forwarding.]
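And the matching load-use check, as a short sketch: if the instruction in EX is a load whose destination is a source of the instruction in ID, forwarding cannot help, so the pipeline stalls one cycle. The field names are again illustrative.

```python
def need_stall(id_ex, if_id):
    # Stall if the load's destination (rt) is a source of the decoding instruction.
    return id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

# lw $1,0($2) in EX, sub $4,$1,$6 in ID  ->  one bubble needed
print(need_stall({"mem_read": True, "rt": 1}, {"rs": 1, "rt": 6}))   # True
```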

23 1999 ©UCB How to Study for Test 2: Chap 7
°Processor-memory performance gap: a problem for hardware designers and software developers alike
°Memory hierarchy — the goal: create the illusion of a single large, fast memory
 -Accesses that hit in the highest level are processed most quickly
 -Exploit the principle of locality to obtain a high hit rate
°Caches vs. virtual memory: how are they similar? different?

24 1999 ©UCB Memory Hierarchy: Terminology
°Hit Time: time to access the upper level, which consists of the time to determine hit/miss + the memory access time
°Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
°Note: Hit Time << Miss Penalty ("<<" here means "much less than")

25 1999 ©UCB Issues with Direct-Mapped
°If the block size > 1, the rightmost bits of the index are really the offset within the indexed block

  ttttttttttttttttt  iiiiiiiiii  oooo
  tag: to check if we have the correct block
  index: to select the block
  offset: byte within the block

°Q: How do set-associative and fully-associative designs look?
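A quick sketch of splitting an address into the three fields above, assuming 32-bit addresses with 4 offset bits and 10 index bits as drawn (so the remaining 18 bits form the tag).

```python
OFFSET_BITS, INDEX_BITS = 4, 10        # 16-byte blocks, 1024 blocks

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x00000014))       # -> (0, 1, 4): tag 0, index 1, offset 0x4
```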

26 1999 ©UCB Read from cache at offset, return word b
°Example address: 000000000000000000 0000000001 0100 (tag = 0, index = 1, offset = 0x4)
[Figure: a direct-mapped cache with 1024 entries (indices 0–1023); entry 1 is valid with tag 0 and data words a, b, c, d in bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f — the offset 0x4 selects word b.]

27 1999 ©UCB Miss Rate Versus Block Size
[Figure 7.12: miss rate (0%–40%) versus block size (4 to 256 bytes) for direct-mapped caches with total sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]

28 1999 ©UCB Compromise: N-way Set Associative Cache
°N-way set associative: N cache blocks for each cache index
 -Like having N direct-mapped caches operating in parallel
°Example: 2-way set associative cache
 -The cache index selects a "set" of 2 blocks from the cache
 -The 2 tags in the set are compared in parallel
 -Data is selected based on the tag result (whichever matched the address)
°Where is the data written? Based on the replacement policy: FIFO, LRU, or random
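A minimal sketch of a 2-way set-associative lookup with LRU replacement; the sizes and the use of Python lists in place of parallel comparators are illustrative assumptions, not a hardware design.

```python
OFFSET_BITS, INDEX_BITS, WAYS = 4, 6, 2          # 16 B blocks, 64 sets, 2-way

sets = [[] for _ in range(1 << INDEX_BITS)]      # each set: tags in LRU order (MRU last)

def access(addr):
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag   = addr >> (OFFSET_BITS + INDEX_BITS)
    tags  = sets[index]
    if tag in tags:                              # both tags compared "in parallel"
        tags.remove(tag); tags.append(tag)       # refresh LRU order
        return "hit"
    if len(tags) == WAYS:
        tags.pop(0)                              # evict the least recently used block
    tags.append(tag)
    return "miss"

for a in [0x0000, 0x1000, 0x2000, 0x0000]:       # three different tags, same set
    print(hex(a), access(a))                     # the final 0x0000 was evicted -> miss
```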

29 1999 ©UCB Improving Cache Performance
°In general, we want to minimize the Average Access Time:
   Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
 (recall Hit Time << Miss Penalty)
°Generally, two ways to improve it:
 -Reduce the Miss Rate: larger block size, larger cache, higher associativity
 -Reduce the Miss Penalty: reduce DRAM latency, add an L2 cache
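A quick worked instance of the average-access-time formula exactly as written on this slide, with made-up numbers (1 ns hit time, 5% miss rate, 40 ns miss penalty).

```python
hit_time_ns, miss_rate, miss_penalty_ns = 1.0, 0.05, 40.0

avg_access_time = hit_time_ns * (1 - miss_rate) + miss_penalty_ns * miss_rate
print(f"average access time = {avg_access_time:.2f} ns")   # 2.95 ns
```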

30 1999 ©UCB Virtual Memory has its own terminology
°Each process has its own private "virtual address space" (e.g., 2^32 bytes); the CPU actually generates "virtual addresses"
°Each computer has a "physical address space" (e.g., 128 megabytes of DRAM), also called "real memory"
°Library analogy:
 -a virtual address is like the title of a book
 -a physical address is the location of the book in the library, as given by its Library of Congress call number

31 1999 ©UCB Mapping Virtual to Physical Address
[Figure: with a 1 KB page size, the virtual address splits into a virtual page number (bits 31..10) and a page offset (bits 9..0); translation replaces the virtual page number with a physical page number (bits 29..10 of the physical address) while the page offset passes through unchanged.]
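A sketch of the translation on this slide, assuming the 1 KB pages shown (10-bit page offset) and a plain dictionary standing in for the page table; the mapping values are made up and page-fault handling is omitted.

```python
PAGE_OFFSET_BITS = 10                 # 1 KB pages

page_table = {0x3FFFFF: 0x1A2B3}      # virtual page number -> physical page number (made up)

def translate(virtual_address):
    vpn    = virtual_address >> PAGE_OFFSET_BITS                # bits 31..10
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)    # bits 9..0
    ppn    = page_table[vpn]                  # page-fault handling omitted
    return (ppn << PAGE_OFFSET_BITS) | offset # offset passes through untranslated

print(hex(translate(0xFFFFFDAB)))             # -> 0x68acdab
```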

32 1999 ©UCB How to Translate Fast?
°Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages!
°Why not create a cache of virtual-to-physical address translations to make translation fast? (smaller is faster)
°For historical reasons, such a "page table cache" is called a Translation Lookaside Buffer, or TLB
°TLB organization is the same as an Icache or Dcache: direct-mapped or set-associative

33 1999 ©UCB Access TLB and Cache in Parallel?
°Recall: address translation applies only to the virtual page number, not the page offset
°If the cache index bits of the PA "fit within" the page offset of the VA, then the index is not translated, so we can read the cache block while simultaneously accessing the TLB
°"Virtually indexed, physically tagged cache" (avoids the aliasing problem)
[Figure: the VA split into virtual page number and page offset; the cache tag comes from the translated physical page number, while the index and offset come from the untranslated page offset.]
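A small sketch of the "index fits within the page offset" condition: if the index plus block-offset bits do not exceed the page-offset bits, the cache index comes entirely from untranslated bits and the cache read can overlap the TLB lookup. The cache and page sizes below are illustrative.

```python
def can_index_in_parallel(cache_bytes, block_bytes, ways, page_bytes):
    sets = cache_bytes // (block_bytes * ways)
    index_bits       = sets.bit_length() - 1         # log2 for powers of two
    offset_bits      = block_bytes.bit_length() - 1
    page_offset_bits = page_bytes.bit_length() - 1
    return index_bits + offset_bits <= page_offset_bits

print(can_index_in_parallel(4096, 32, 1, 4096))    # True:  7 + 5 <= 12
print(can_index_in_parallel(16384, 32, 1, 4096))   # False: 9 + 5 >  12
```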

