Published by Fred Boynton. Modified over 9 years ago.
1
1 COSC 3P92 Cosc 3P92 Week 5 Lecture slides Voters quickly forget what a man says. Richard M. Nixon (1913-1994) Former U.S. President
2
2 COSC 3P92 Hardware components MIC (overview) MAR and MDR are registers which latch the addresses and data prior to processing
3
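3 COSC 3P92 Hardware components MIC (overview) MAR holds word addresses 0, 1, 2, 3…, which must be translated to byte addresses of 4-byte words. –Shift the MAR value 2 bits left when it is placed on the address bus. –Causes word 0, 1, 2, 3… to be addressed at bytes 0, 4, 8, 12… –Keeps words aligned.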
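A minimal sketch of the address translation on this slide, assuming MAR holds a word address, memory is byte addressed, and words are 4 bytes. The function name is hypothetical and only for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit address bus


def word_to_byte_address(word_address: int) -> int:
    """Shift a word address 2 bits left to get the byte address of that word."""
    return (word_address << 2) & MASK32


# Words 0, 1, 2, 3 start at bytes 0, 4, 8, 12.
byte_addresses = [word_to_byte_address(w) for w in range(4)]
print(byte_addresses)
```

Because the two low-order bits of the result are always zero, every access made this way is aligned to a word boundary.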
4
4 COSC 3P92 Hardware components MIC (overview) Each micro instruction controls –register enables –bus enables –ALU –Memory –Next Micro instruction address
5
5 COSC 3P92 Hardware components MIC (overview)
6
6 COSC 3P92 Memory control MAR - memory address register –CPU writes addresses of memory to read, write MBR - memory buffer register –contains data for write or read both act as ‘latches’ to hold addr, data until memory finished using them.
7
7 COSC 3P92 Control unit Main functions of a control unit: –instruction interpretation –instruction sequencing The control unit is a finite-state machine. [Figure: CPU block diagram with a control unit and an execution unit; labels: status signals, control signals, external command signals, master clock.]
8
8 COSC 3P92 Typical CPU model An execution unit consists of: –a register section –an ALU –some dedicated hardware or firmware [Figure: control unit and execution unit. General purpose registers: R0, R1, …, Rn-1. Dedicated registers: SR (status reg), IR (instn reg), PC (prog cntr), SP (stack ptr), MAR (mem addr reg), MBR (mem buffer reg), etc. ALU (arithmetic logic unit). Dedicated multiply, division, firmware (FP).]
9
9 COSC 3P92 Data transfer within a CPU A single-bus architecture: To compute R2 <– R0 + R1: 1. A <– R0, 2. B <– R1, 3. R2 <– A+B [Figure: single bus connecting general purpose regs (R0, R1, etc), special purpose regs (PC, etc), and the ALU with its buffer regs A and B.]
10
10 COSC 3P92 Data transfer within a CPU A two-bus architecture: To compute R2 <– R0 + R1: 1. Buffer <– R0 + R1 (via Bus A and Bus B), 2. R2 <– Buffer (via either Bus A or Bus B). [Figure: Bus A and Bus B connecting general regs (R0, R1, etc), special regs (PC, etc; Special I, Special II, MBR), the ALU, and the buffer reg.]
11
11 COSC 3P92 Data transfer within a CPU A three-bus architecture: To compute R2 <– R0 + R1: 1. R2 <– R0 + R1 (via Bus A, Bus B and Bus C). [Figure: Bus A, Bus B and Bus C connecting general regs (R0, R1, etc), special regs (PC, etc; Special I, Special II, MBR), and the ALU.]
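The three bus organisations can be compared with a small sketch that records the register-transfer steps each one needs for R2 <– R0 + R1. This is an illustrative model only: the function names and the dictionary register file are assumptions, and real hardware does these transfers with enable lines rather than assignments.

```python
def single_bus_add(regs):
    """One shared bus: each operand is staged in a buffer register first."""
    steps = []
    a = regs["R0"]
    steps.append("A <- R0")
    b = regs["R1"]
    steps.append("B <- R1")
    regs["R2"] = a + b
    steps.append("R2 <- A + B")
    return steps


def two_bus_add(regs):
    """Two buses: both operands reach the ALU at once, result is buffered."""
    steps = []
    buffer = regs["R0"] + regs["R1"]
    steps.append("Buffer <- R0 + R1")
    regs["R2"] = buffer
    steps.append("R2 <- Buffer")
    return steps


def three_bus_add(regs):
    """Three buses: operands and result travel in the same step."""
    regs["R2"] = regs["R0"] + regs["R1"]
    return ["R2 <- R0 + R1"]


for add in (single_bus_add, two_bus_add, three_bus_add):
    regs = {"R0": 2, "R1": 3, "R2": 0}
    steps = add(regs)
    print(add.__name__, len(steps), "step(s), R2 =", regs["R2"])
```

All three produce the same result; only the number of steps differs (3, 2, and 1).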
12
12 COSC 3P92 Design of control units Hardwired approach The control unit is treated as a synchronous (i.e., clocked) sequential circuit and is implemented as a hardwired state machine. [Figure: Register Transfer Model of Finite State Machine: register, combinational logic, inputs, outputs, feedback paths.] [Figure: PLA Implementation of a Finite State Machine: register, AND plane, OR plane, inputs, outputs, next state.]
13
13 COSC 3P92 Microprogramming Use of memory to implement the control unit Instructions are implemented as sequences of microinstructions stored in control memory Each machine language instruction is interpreted by circuitry, and executed using sequences of microprogram instructions Micro-programs are much like assembled code, except: –direct mapping between instruction fields and hardware components of the CPU. –control fields are specified. –timing is critical; parallelism can be exploited.
14
14 COSC 3P92 Microprogramming What is being controlled? –data paths: inter-register connections –control points: hardware enabling lines which govern register-to-register communications The idea is that we can control the operation of the ALU and micro-control unit using combinations of control fields encoded in micro-instructions. [Figure: register feeding combinational logic, which produces the control values.]
15
15 COSC 3P92 Microprogramming Each control point specifies a micro-operation –All micro operations which may be executed in parallel can be specified in a single micro instruction. Factors which determine parallel operations. –Buses must only have 1 input active at a time. –Registers can be either read/written »Not both at the same time.
16
16 COSC 3P92 Microprogramming Basic microinstruction formats: {Over heads}
17
17 COSC 3P92 Data path 32-bit registers (none are user-accessible) B bus: main one to ALU C bus: from ALU back to registers H reg: contains other operand for ALU –loaded by performing null op on data, and sending it to H
18
18 COSC 3P92 Data path ALU control: 6 control lines shifter: 2 control lines –1. SLL8: logical shift left 8 bits –2. SRA1: arithmetic shift right 1 bit
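A short sketch of the two shifter operations on 32-bit values. The shift amount is left as a parameter; the calls below assume the Mic-1 shifter as described in Tanenbaum's text (SLL8 shifts left by 8, SRA1 shifts right by 1 with sign extension). Function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def sll(value: int, bits: int) -> int:
    """Logical shift left: zeros enter on the right, overflow bits are dropped."""
    return (value << bits) & MASK32


def sra(value: int, bits: int) -> int:
    """Arithmetic shift right: the sign bit is replicated on the left."""
    value &= MASK32
    if value & SIGN_BIT:
        value -= 1 << 32  # reinterpret as a negative two's-complement number
    return (value >> bits) & MASK32


print(hex(sll(0x01234567, 8)))  # low byte becomes zero, top byte is lost
print(hex(sra(0x80000000, 1)))  # sign bit is preserved
```

SLL8 is what lets a microprogram assemble a 16-bit offset from two bytes, as in the goto example later in the slides.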
19
19 COSC 3P92 Four sub-cycles: –1. control signals set up (w) –2. registers loaded on B bus (x) –3. ALU and shifter (y) –4. results available to registers on C (z) Data path timing
20
20 COSC 3P92 Data path timing These are implicit sub-cycles: they rely on timing of previous steps Only real clock signals used: –falling edge of clock (starts the cycle) –rising edge (loading from C in step 4) ALU is continually processing all intermediate values it sees. Its output only makes sense at the appropriate time above (after 3) Can operate and save a register in 1 clock cycle: –load PC to B –inc –save to PC
21
21 COSC 3P92 Memory again 2 memory buffers: –32 bit port: MAR, MDR (read, write) »word addresses –8-bit: MBR »low byte from PC (read only) »byte addresses »can be loaded signed, unsigned onto B bus »call reads into MBR “fetches” control: –black arrow: enable from C bus –white arrow: enable onto B bus 2 bus control: –out B –in C –out B / in C –none
22
22 COSC 3P92 Memory again MAR aligned to words (32 bits, 4 bytes): [4.4] Memory is available 2 cycles from when read was initiated –avail. at end of 2nd cycle, so 3rd cycle can use them
23
23 COSC 3P92 Microinstructions 29 signals for data path: –1. 9 signals to control C bus output into registers –2. 9 signals to enable registers onto B bus –3. 8 signals for ALU, shifter functions (6 ALU + 2 shifter) –4. 2 signals for memory W/R via MAR/MDR –5. 1 signal for memory fetch via PC/MBR Issues: –may load more than 1 reg from C (9 bits) –but never load more than 1 reg onto B (4 bits, encoded will force this) --> 4 signals. Need 2 more fields for determining next m.i.: –NextAddr (9 bits, addr space of 512) –conditional jumps (3 bits)
24
24 COSC 3P92 Microinstructions Fields: –Addr: address of next micro-instruction –JAM: determines how next m.i. selected –ALU: ALU, shifter control –C: which registers written from C bus –Mem: memory functions –B: B source (encoded)
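A hedged sketch of how the six fields could be packed into one microinstruction word. The widths are assumed to be Addr 9, JAM 3, ALU 8, C 9, Mem 3, B 4, which sum to the 36-bit word mentioned in these slides; the field order and the helper names are assumptions for illustration.

```python
# (name, width in bits), most significant field first. Order is assumed.
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]
WORD_BITS = sum(width for _, width in FIELDS)


def pack(values: dict) -> int:
    """Pack field values into a single integer microinstruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a microinstruction word back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


mi = {"addr": 0x092, "jam": 0b001, "alu": 0x3C, "c": 0b000000100, "mem": 0b010, "b": 4}
word = pack(mi)
print(WORD_BITS, hex(word), unpack(word) == mi)
```

Encoding B in 4 bits rather than 9 one-hot lines is what guarantees that at most one register drives the B bus.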
25
25 COSC 3P92 Example microarchitecture: Mic-1
26
26 COSC 3P92 Example microarchitecture: Mic-1 sequencer: executes microinstructions Two tasks: –set control signals for system –determine next m.i. to execute control store: contains m.i. for interpreting ISA instns. –each instn a 36-bit word like [4.5] –each m.i specifies its successor MPC: MicroProgram Counter –9-bit address of next m.i. to execute MIR: MicroInstruction Register –36-bit m.i. being executed Note that bits in MIR may directly control other parts of the circuit –eg. C
27
27 COSC 3P92 Mic-1 operation cycle Basic ALU cycle: – 1. set up the inputs to the ALU – 2. let the ALU do its computation – 3. store the results Clock cycles for Mic-1 –1. MIR enabled (during subcycle w) –2. MIR signals control data path (B bus; note H always enabled) (subcycle x) –3. B and H inputs are stable, and ALU computes output; shifter finishes; N, Z bits stable (subcycle y) –4. shifter, N, Z outputs loaded from C bus into registers »rising clock edge determines end »MIR is reloaded and calculated at this point as well »Memory read is initiated at end too Note that all the above will complete in 1 cycle –microinstructions can specify all these operations in parallel
28
28 COSC 3P92 Mic-1 sequencing First, 9-bit next addr field copied into MPC JAM inspected: –000 = use MPC as it is –if JAMN (or JAMZ) set, then N bit (or Z) are ORed with high-bit of MPC »hence next address is either: MPC, MPC with high-bit ORed with 1 –JMPC set: MBR byte ORed with low byte of NextAddr field »permits multiway jumps »can quickly branch to instn for just-loaded opcodes (ie. opcode number = address in control store!)
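The sequencing rules above can be written as one small function. This is a sketch under the assumptions stated on the slide (9-bit MPC, N or Z ORed into the high bit, MBR ORed into the low byte); the function and parameter names are hypothetical.

```python
def next_mpc(next_addr, jamn=False, jamz=False, jmpc=False, n=0, z=0, mbr=0):
    """Compute the 9-bit address of the next microinstruction."""
    mpc = next_addr & 0x1FF  # start from the NextAddr field
    if (jamn and n) or (jamz and z):
        mpc |= 0x100  # OR the condition into the high bit
    if jmpc:
        mpc |= mbr & 0xFF  # OR the MBR byte into the low 8 bits
    return mpc


# JAM = 000: use NextAddr unchanged.
print(hex(next_mpc(0x092)))
# JAMZ set and Z = 1: branch to the address with the high bit set.
print(hex(next_mpc(0x092, jamz=True, z=1)))
# JMPC set: multiway jump on the opcode byte held in MBR.
print(hex(next_mpc(0x000, jmpc=True, mbr=0x60)))
```

With JMPC and a NextAddr whose low byte is zero, the opcode value itself becomes the control store address, which is the dispatch trick the slide describes.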
29
29 COSC 3P92 Microinstructions and notation As in assembler programming, helps to use higher-level notation instead of raw numeric m.i. fields can specify everything that happens in 1 clock cycle: –permits parallelism: eg. prefetch next instns Notation: high-level, but directly translatable to single m.i.’s Examples: –SP=SP+1: incr SP by 1 –MDR = SP: copy SP into MDR –MDR = SP+H; rd : add SP and H, save in MDR, and initiate a read –SP=MDR=SP+1: incr SP, load into both MDR, SP
30
30 COSC 3P92 Microinstructions and notation Memory takes 2 cycles: –MAR=SP; rd : starts a read that assigns a value into MDR –(another instn) –* memory ready now! Next addresses: assume it is the labeled next m.i. after current one (unless a conditional jump) –if (Z) goto L1; else goto L2 : sets JAMZ »L1 and L2 have the same low 8 bits (set by assembler) Summary of legal operations on operands:
31
31 COSC 3P92 Example M.I. implementation: IJVM A stack-based virtual machine that Mic-1 is designed to implement. All instructions access the stack: no general registers are used by compiler –eg. parameter passing [4.8] –eg. arithmetic [4.9] Recall: –JVM instruction formats: [5.15] –Java memory usage, registers: [4.10] Complete instruction set: [4.11] Example translated code: [4.14]
32
32 COSC 3P92
33
33 COSC 3P92 JVM Instruction Formats
34
34 COSC 3P92 Memory area of IJVM
35
35 COSC 3P92 IJVM Instruction Set
36
36 COSC 3P92 Translating Java to IJVM
37
37 COSC 3P92 Implementation (cont) See overheads (book page 234-236) Note: –each m.i. contains address of next instn –micro-assembler labels all instns appropriately, and must put them in right control store addresses (equiv. to opcode) –the sequenced instns may reside in any free area of control store! Microassembler auto sets ‘next address fields’. –only explicit ‘goto’s will override this sequencing Two parts: –1. fetch next byte for next instn (done at Main1) –2. branch to that opcode address and carry out instruction Fetching instructions (Main1) –PC always points to next instruction in Java application program –can be reset by branches (see goto5, T, F,...) –When Main1 executed, assumed next opcode ready. The fetch at Main1 is for next opcode. Hence instns must fetch it if necessary (eg. see bipush2)
38
38 COSC 3P92 Implementation (cont) Example 1: iadd (“pop 2 words from stack, push their sum”) –iadd1: reads next-to-top word in stack (TOS register already contains top of stack word); bumps down the SP for writing result –iadd2: sets TOS ready for addition (put in H) –iadd3: add next-to-top value (read in iadd1) to H, update TOS, save result in MDR for writing Example 2: dup (“copy top stack word and push it”) –dup1: incr SP pointer, copy to MAR –dup2: save TOS (top stack word) to new SP, write it –note: can’t write it in dup1, because both SP and MDR must be updated thru data path, and not both at once
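The three iadd steps can be traced with a simplified register-level model. This sketch assumes a word-addressed list for memory, a stack that grows toward higher addresses, and a read that completes during the cycle after it is issued; the dictionary layout and function name are illustrative, not the actual microcode.

```python
MASK32 = 0xFFFFFFFF


def iadd(state, memory):
    """Pop two words, push their sum. TOS already caches the top word."""
    # iadd1: MAR = SP = SP - 1; rd  (start reading next-to-top word)
    state["SP"] -= 1
    state["MAR"] = state["SP"]
    pending_read = state["MAR"]

    # iadd2: H = TOS  (the read started in iadd1 completes in this cycle)
    state["H"] = state["TOS"]
    state["MDR"] = memory[pending_read]

    # iadd3: MDR = TOS = MDR + H; wr  (write the sum to the new top of stack)
    total = (state["MDR"] + state["H"]) & MASK32
    state["MDR"] = state["TOS"] = total
    memory[state["MAR"]] = state["MDR"]


memory = [0] * 16
memory[4], memory[5] = 7, 3  # next-to-top and top of stack
state = {"SP": 5, "TOS": 3, "MAR": 0, "MDR": 0, "H": 0}
iadd(state, memory)
print(state["SP"], state["TOS"], memory[4])
```

Only one memory read is needed because TOS holds the top word in a register, which is why iadd2 can do useful work while the read is in flight.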
39
39 COSC 3P92 Implementation (cont) Example 3: goto offset (“unconditional branch”) –[Fig 4.22] –goto1: save addr of opcode to OPC (old PC) –goto2: get the 2nd byte of offset (1st byte already in MBR) –goto3: shift 1st byte left 8 bits –goto4: OR low byte into high byte –goto5: add 16-bit offset to (old) PC; get next opcode –goto6: goto Main1 –Note: pause needed in goto6 (must wait 2 extra cycle)
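The net effect of goto1 to goto5 is a signed 16-bit offset added to the address of the goto opcode. A minimal sketch, assuming the first offset byte is treated as signed and the second as unsigned; the function name is hypothetical.

```python
def goto_target(opcode_address: int, offset_byte1: int, offset_byte2: int) -> int:
    """Return the branch target for 'goto offset'."""
    # goto3: the first (high) byte is sign-extended, then shifted left 8 bits.
    high = offset_byte1 & 0xFF
    if high & 0x80:
        high -= 0x100
    # goto4: OR in the second (low) byte, taken as unsigned.
    offset = (high << 8) | (offset_byte2 & 0xFF)
    # goto5: add the offset to the saved address of the opcode (OPC).
    return opcode_address + offset


print(goto_target(100, 0x00, 0x0A))  # forward branch
print(goto_target(100, 0xFF, 0xFB))  # backward branch
```

The offset is relative to the opcode's own address, which is why goto1 saves it in OPC before PC moves on.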
40
40 COSC 3P92
41
41 COSC 3P92 Improving performance 1. Faster clock, transistors, electrical circuits 2. simpler organization yields shorter clock cycles –eg. get rid of (B bus) decoder 3. Merge interpreter loop with microcode (pt 2) –[4.23], [4.24] –saves extra cycles if done in all instns –significant speedup! 4. Three-busses –[4.25], [4.26] –reduces need for separate instns to load H reg
42
42 COSC 3P92
43
43 COSC 3P92 2 Bus v.s. 3 Bus
44
44 COSC 3P92 Improving performance 5. Instruction fetch unit [4.27] –in Mic-1, ALU is used to increment PC and fetch instns –this uses up instn. cycles –IFU can be used: »1. pre-fetches all instns outside of main data path »2. pre-fetches operands: if they are required, they are there (else garbage, but ignored anyway)
45
45 COSC 3P92 Fetch Unit
46
46 COSC 3P92 Improving performance Instruction fetch unit (cont) –shift register: always loaded with next bytes from memory –MBR1 (1 byte, as before); and new MBR2 (2 bytes) –values from shift reg dumped into both MBR1, MBR2 after every instn read; if needed, they are quickly put onto data path as req’d –need some fetching logic to know when to read more bytes into shift register, when to refresh MBR1, MBR2 –IMAR: separate memory addr reg (separate from MAR) »own dedicated incrementer (no need for ALU) –IFU must keep PC incremented properly, depending on instn length (if MBR1, MBR2 used) »branches may reset PC as well (from C)
47
47 COSC 3P92 Improving performance Mic-2: –A, B buses –IFU –new IJVM [4.30, See overheads] »smaller, faster »MBR1 always has next opcode (due to IFU)
48
48 COSC 3P92 Mic-2
49
49 COSC 3P92 Improving performance: 6. Pipelining divide instn. execution into modular steps and carry out different steps for seql. instns simultaneously “instruction-level parallelism” superscalar: single pipeline with parallel functional units most instns take more than 1 cycle to complete with pipelining: n instns in n cycles To implement it: [4.31] –add latch to A, B, C buses –they keep values stable during sub-cycles: can use values in 3 sections of the data path »(i) loading before ALU (A, B) »(ii) doing ALU, shift, and loading C latch »(iii) storing C back into registers
50
50 COSC 3P92 Mic-3
51
51 COSC 3P92 Improving performance: 6. Pipelining need 3 cycles now to complete 1 instn –but maximum delay between all components is shorter (1/3) so can speed up clock –advantage: throughput -- 3 instns can be processed simult. –all parts of data path are busy... none are idle (usually) best analogy: car factory assembly line
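A small idealised calculation of the throughput claim, assuming a pipeline with no stalls or flushes in which one instruction enters per cycle. Function names are hypothetical.

```python
def pipelined_cycles(instructions: int, stages: int = 3) -> int:
    """Cycles to finish all instructions: fill the pipe, then one per cycle."""
    if instructions <= 0:
        return 0
    return stages + instructions - 1


def unpipelined_cycles(instructions: int, stages: int = 3) -> int:
    """Cycles if each instruction must pass through every stage alone."""
    return max(instructions, 0) * stages


for n in (1, 3, 100):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

For large n the pipelined count approaches n, so with a 3-stage pipeline the speedup approaches 3 in terms of the shorter cycles, which matches the assembly-line analogy.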
52
52 COSC 3P92 Pipelining (cont) [4.32, 4.33, 4.44] interpreting instns in pipelined processor (Mic-4): –new sub-cycles: microsteps –takes 3 cycles to process instn (steps i, ii, iii from earlier) –call latches A, B, C (like registers) –advantage [4.33] is that different stages can work independently of one another now more stages in pipeline means higher efficiency
53
53 COSC 3P92
54
54 COSC 3P92
55
55 COSC 3P92 Pipelining (cont) One complication: memory reads –takes 2 cycles to get word from memory –hence a m.i. that uses a word in MDR must wait until it’s available –called a true or RAW (read after write) dependence –pipeline must stall until it is ready –ideally, put other m.i. instns in wait states Another complication: conditional branches –cannot predict which instn to fetch/put into pipeline –have to “squash” or “flush” pipeline when a jump ruins sequence of instns
56
56 COSC 3P92 Pipelines and branch prediction unconditional branches –fetch unit needs to know in advance where to access instns –a jump instn. isn’t decoded right away, and so F.U. won’t know branch location until later: called the delay slot –soln: compiler places other executable instns in delay, that it knows can be executed conditional branches –dynamic prediction: carried out during run time –keep a running table of branched instn addresses, along with a “branch/no branch” bit –if branch in table, and branch bit set, then predict it will be taken --> fetch it –can use 2 prediction bits: the prediction changes only after it has been wrong twice in a row (extra logic)
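A sketch of dynamic prediction with 2 prediction bits, implemented as a table of saturating counters keyed by branch address. The class name, the initial counter value, and the unbounded dictionary table are assumptions for illustration; real hardware uses a fixed-size table.

```python
class TwoBitPredictor:
    """Counter values 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, address: int) -> bool:
        # Unknown branches start at 1 (weakly not taken); an assumption.
        return self.counters.get(address, 1) >= 2

    def update(self, address: int, taken: bool) -> None:
        counter = self.counters.get(address, 1)
        if taken:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
        self.counters[address] = counter


predictor = TwoBitPredictor()
for _ in range(3):
    predictor.update(0x40, True)  # loop branch taken repeatedly
predictor.update(0x40, False)  # loop exit: one misprediction
print(predictor.predict(0x40))  # still predicts taken
```

A single loop exit does not flip the prediction, so the next time the loop is entered the branch is again predicted correctly.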
57
57 COSC 3P92 Pipelines and branch prediction static branch prediction: carried out during compile time –if a loop nearly always done, then have a field in the instn. which tells CPU that branch should be fetched (eg. UltraSPARC) –can do simulations to determine how cond. branches executed
58
58 COSC 3P92 Improving performance: out-of-order exec, reg renaming instruction ops can take varying # clock cycles –superscalar systems mean those functional units need more time to process their instns problem: can’t exec one instn that requires results of another –means the pipeline stalls until register values are computed when subsequent instns require them. soln: move instruction order, so that no idle waiting –overall exec must be identical to “linear” order dependencies: –RAW (read after write): try to read reg before another instn has written it. –WAR (write after read): try to write before another has read it –WAW (write after write): try to write a reg before an earlier instn has written it, so the writes land in the wrong order.
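The three dependence types can be checked mechanically for a pair of instructions in program order. In this sketch each instruction is a hypothetical `(destination, sources)` tuple; the representation and function name are assumptions for illustration.

```python
def hazards(first, second):
    """Return the dependences of `second` on the earlier instruction `first`."""
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RAW")  # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        found.add("WAR")  # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found


add_r3 = ("R3", {"R0", "R1"})  # R3 = R0 + R1
print(hazards(add_r3, ("R4", {"R3", "R2"})))  # R4 = R3 + R2
print(hazards(add_r3, ("R0", {"R5", "R6"})))  # R0 = R5 + R6
print(hazards(add_r3, ("R3", {"R5", "R6"})))  # R3 = R5 + R6
```

Only RAW is a true data dependence; WAR and WAW come from reusing register names, which is what register renaming removes.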
59
59 COSC 3P92 In-order exec, in-order completion –decode in cyc n, exec n+1, writeback n+2 (except multiply in n+3) –2 instns decoded simult. –uses scoreboard: 1 counter per reg keeping track of # instns using it as a source or destination –keeps track of max # regs that can be processed concurrently
60
60 COSC 3P92 –idea: execute instns so long as resources are available, and no conflicts –move order of instns to permit this –registers are renamed automatically to reduce conflicts: “secret regs” »eg. if a register is in conflict, rename it so conflict is removed. »copy values to original named reg later if required. –result: huge performance gain (we’re trying to make pipeline maximally useful!) Out-of-order exec, reg renaming (cont)
61
61 COSC 3P92 Improving performance: speculative exec block: a section of sequential code [4.45] Can increase throughput by moving instructions beyond their blocks –hoisting: moving an instruction over a branch speculative execution: executing an instruction before it is known whether it will be needed –OK to do it so long as there is no side effect (eg. write to memory, trap/interrupt) –may sometimes cause slowdown if spec. exec fetches an instn from memory that isn’t needed –otherwise, idea is to move slower instructions up the queue so that their processing can occur in the interim some solns: –speculative instns: only fetch/exec instructions that are in the cache –poison bits: don’t set traps automatically; wait until that instn actually executed, and if a poison bit is set, then set the trap
62
62 COSC 3P92 Speculative exec
63
63 COSC 3P92 Example 1: Pentium II 1. Fetch/decode [4.46] –fetches instns and breaks them into m.i.’s 2. Dispatch/exec –takes m.i.’s and execs them 3. Retirement unit –completes exec, stores reg values (speculative exec) 1, 2, 3 above act as high-level pipeline ROB (reorder buffer): table of m.i.’s to execute Fetch/decode [4.47] –7-stage pipeline –multiple formats, sizes means instn decoding is involved –analyzes instns to determine: size, branch-prediction –usually between 1 and 4 m.i.’s per ISA instn. –uses reg renaming –both static, dynamic branch prediction used Dispatch/exec [4.48] –5 m.i.’s can be exec’d at once
64
64 COSC 3P92 P2-micro architecture
65
65 COSC 3P92
66
66 COSC 3P92 Example 2: UltraSPARC II [4.49] RISC: all instns are 3-register microinstns already branch prediction: (i) cache flags; (ii) 2-bit prediction; (iii) compiler directions in instns tries to exec 4 instns in parallel all the time –instns may be executed out of order 9-stage pipeline [4.50] –split integer, float pipelines –int adds 2 stages (N1, N2) to keep it same as fp
67
67 COSC 3P92 UltraSPARC
68
68 COSC 3P92 UltraSPARC Pipeline
69
69 COSC 3P92 Example 3: picoJava II [4.51] instn, data caches are optional register file (64 entries) –contains top 64 words of stack –dribbling: reg file read/written to memory when it gets too empty/full –“free” access, w/o accessing caches (which may not be used)
70
70 COSC 3P92
71
71 COSC 3P92 6-stage pipeline [4.52] –CISC instns –not superscalar: instns fetched, retired inorder (unlike Pentium II) no branch prediction alg (economy)
72
72 COSC 3P92 Folding Folding [4.53, 4.54, 4.55] –replace a set of m.i.’s with one m.i. –looks up patterns in a table [4.55], and replaces with equivalent m.i. –only possible if operands are high in stack, in register file –huge gain in speed, like RISC performance
73
73 COSC 3P92
74
74 COSC 3P92
75
75 COSC 3P92 Comparing these examples common features –all m.i.’s contain opcode, 2 source regs, dest reg –1 m.i. per cycle –deep pipelines –split instn and data caches Pentium II: complexity is in deconstructing its CISC instns into micro-operations JVM: complexity is in folding sets of m.i.’s into single operations UltraSparc most straight-forward to implement, because instns require minimal decoding (all RISC instructions are micro-operations already!)
76
76 COSC 3P92 The end