CMPE 382 / ECE 510 Computer Organization & Architecture Appendix B- Instruction Set Architectures based on text: Computer Architecture : A Quantitative.


1 CMPE 382 / ECE 510 Computer Organization & Architecture Appendix B- Instruction Set Architectures
Based on the text: Computer Architecture: A Quantitative Approach (paperback), John L. Hennessy and David A. Patterson, Morgan Kaufmann, 4th edition, 2006.
Many lecture slides are courtesy of or based on the work of Drs. Asanovic, Patterson, Culler and Amaral.
CS252 S05

2 Mental Exercise (not handed in)
In some machines, R0 will always contain zero, regardless of what is written to it. How many (or other CISC) instructions can you implement using ADD with “R0”? cmpe382/ece510 ch B

3 Mental Homework Solution - ADD
• MOVE R1,R2
• CLR R1
• NOP
As well as (with immediate addressing mode, without R0):
• INC R1
• DEC R1
• ADD R1,R2 (two operand)
Flexible! Why invent new instructions when a few instructions can do a lot?
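A minimal sketch of the idea, assuming a hypothetical three-operand ADD and a register file whose R0 always reads as zero. The class and function names are illustrative only, and the destination-first operand order of `move` is an assumption; the slide does not say which operand of MOVE is the destination.

```python
class RegFile:
    """Hypothetical 32-entry register file; R0 is hardwired to zero."""

    def __init__(self, n=32):
        self.r = [0] * n

    def read(self, i):
        return 0 if i == 0 else self.r[i]

    def write(self, i, value):
        if i != 0:                      # writes to R0 are silently dropped
            self.r[i] = value & 0xFFFFFFFF


def add(rf, rd, rs1, rs2):
    """ADD rd, rs1, rs2"""
    rf.write(rd, rf.read(rs1) + rf.read(rs2))


def addi(rf, rd, rs1, imm):
    """ADD with an immediate second operand"""
    rf.write(rd, rf.read(rs1) + imm)


# Instructions synthesized from ADD with R0
def move(rf, rd, rs):
    add(rf, rd, rs, 0)                  # rd = rs + R0

def clr(rf, rd):
    add(rf, rd, 0, 0)                   # rd = R0 + R0

def nop(rf):
    add(rf, 0, 0, 0)                    # result discarded by R0

# Instructions synthesized from ADD with an immediate
def inc(rf, rd):
    addi(rf, rd, rd, 1)

def dec(rf, rd):
    addi(rf, rd, rd, -1)
```

Each synthesized instruction is a single ADD, so none of them needs its own opcode or datapath support.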

4 Flavours of Instruction Set Architectures
cmpe382/ece510 ch B

5 Memory addressing
Byte:
• formerly the minimum unit you can access in computer memory
• now 8 bits, thanks to IBM
• Byte size on first 64-bit Alpha?
Data sizes:
• 1 byte = byte (char)
• 2 bytes = half word, 16 bits (short int)
• 4 bytes = word, 32 bits (int, float)
• 8 bytes = double word, 64 bits (long int, double)
• 16 bytes = quad word, 128 bits

6 Which Endian?
Arabic numerals are big endian: big end first, e.g. “1984”.
Endianness is only distinguishable when the same memory contents are accessed as different data types, or when data is passed between machines of different endianness.
• Little endian: byte 0 is the least significant byte
• Big endian: byte 0 is the most significant byte
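The two byte orders can be made concrete with a short sketch. It uses only Python's built-in `int.to_bytes` and `int.from_bytes`; the helper name `bytes_in_memory` and the sample value are illustrative assumptions.

```python
def bytes_in_memory(value, order, size=4):
    """Return the bytes of a `size`-byte integer as they sit at addresses 0, 1, 2, ..."""
    return list(value.to_bytes(size, order))


WORD = 0x12345678

little = bytes_in_memory(WORD, "little")   # byte 0 holds 0x78, the least significant byte
big = bytes_in_memory(WORD, "big")         # byte 0 holds 0x12, the most significant byte

# The difference shows up when the same bytes are read back with the other convention,
# e.g. after passing data between machines of different endianness.
misread = int.from_bytes(bytes(little), "big")
```

Here `misread` is the byte-swapped word, which is what a big-endian machine sees if it receives a little-endian word without conversion.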

7 Data alignment
Options for handling data misaligned to machine word boundaries:
• Abort (illegal alignment)
• Trap to OS to fix in software
• Provide hardware to hide the two necessary memory accesses
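A small sketch of why a misaligned access costs two memory accesses, assuming a memory that is only accessed in whole aligned words (4 bytes here). The function names and the word size are assumptions for illustration.

```python
def is_aligned(addr, size):
    """An access of `size` bytes is aligned when its address is a multiple of `size`."""
    return addr % size == 0


def word_accesses(addr, size, word=4):
    """Return the aligned word addresses that must be read to cover [addr, addr + size)."""
    first = (addr // word) * word
    last = ((addr + size - 1) // word) * word
    return list(range(first, last + word, word))
```

For example, a 4-byte load at address 8 touches only the word at 8, while the same load at address 6 straddles the words at 4 and 8.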

8 H&P assume you already read P&H Computer Organization & Design
So, we’ll mention the basics of pipelining now and look at the details in Appendix A

9 Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction

10 What’s a Clock Cycle?
Figure labels: D-flipflop or register; combinational logic.
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays (clock propagation, wire lengths, drivers)
• D-FF delay = Tsetup + Tprop_clock-Q

11 Classic 5-stage pipeline
• IF: instruction fetch
• ID/RD: instruction decode and register read
• EX: execute = ALU operation
• MEM: read or write memory
• WB: register write back of results

12 Sequential Laundry
Figure: timeline from 6 PM to midnight; task order A, B, C, D; each load passes through stages of 30, 40 and 20 minutes, one load after another.
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

13 Pipelined Laundry
Start work ASAP.
Figure: timeline from 6 PM to midnight; task order A, B, C, D; the 30, 40 and 20 minute stages of successive loads overlap.
Pipelined laundry takes 3.5 hours for 4 loads (4 h synchronous).
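The three laundry times can be checked with a short sketch. It assumes stage times of 30, 40 and 20 minutes as in the figure, one load per stage at a time, and that a finished load can wait between stages; the function names are illustrative.

```python
def sequential_minutes(stages, loads):
    """Each load runs through all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Each load enters a stage as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stages)
    for _ in range(loads):
        t = 0
        for s, duration in enumerate(stages):
            t = max(t, stage_free_at[s]) + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


def synchronous_minutes(stages, loads):
    """All stages advance in lockstep, clocked by the slowest stage."""
    return (len(stages) + loads - 1) * max(stages)


STAGES = [30, 40, 20]   # minutes per stage, from the figure
LOADS = 4               # loads A, B, C, D
```

With these numbers the sequential schedule takes 360 minutes (6 h), the pipelined one 210 minutes (3.5 h) and the synchronous one 240 minutes (4 h), matching the figures on the slides.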

14 Pipelining Lessons
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

15 Instruction Pipelining
Execute billions of instructions, so throughput is what matters, except when (name ~4)?
What is desirable in instruction sets for pipelining?
• Variable length instructions vs. all instructions same length?
• Memory operands part of any operation vs. memory operands only in loads or stores?
• Register operand many places in instruction format vs. registers located in same place?

16 A "Typical" RISC ISA
• 32-bit fixed format instruction (3 formats)
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store: base + displacement, no indirection
• Simple branch conditions
• Delayed branch in some
See: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

17 Datapath vs Control
Figure: Datapath and Controller blocks, connected by signals and Control Points.
• Datapath: storage, FU, interconnect sufficient to perform the desired functions
  • Inputs are Control Points
  • Outputs are signals
• Controller: state machine to orchestrate operation on the data path
  • Based on desired function and signals

18 Approaching an ISA
Instruction Set Architecture
• Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
Meaning of each instruction is described by RTL on architected registers and memory
Given technology constraints, assemble adequate datapath
• Architected storage mapped to actual storage
• Function units to do all the required operations
• Possible additional storage (e.g. MAR, …)
• Interconnect to move information among regs and FUs
Map each instruction to sequence of RTL operations
Collate sequences into symbolic controller state transition diagram (STD)
Implement controller

19 Example: MIPS (Note register location)
Register-Register: Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [low-order bits; boundaries marked at 10-6 and 5-0]
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch: Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call: Op [31:26] | target [25:0]
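Because the register fields sit in the same bit positions in every format, decoding reduces to a few shifts and masks. The following is a sketch only: the function names are hypothetical, and splitting the low 11 bits of the register-register format into `shamt` [10:6] and `funct` [5:0] follows the standard MIPS R-format rather than anything stated on the slide.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend16(value):
    """Interpret a 16-bit field as a two's complement number."""
    return (value ^ 0x8000) - 0x8000


def decode_register_register(word):
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2": bits(word, 20, 16),
        "rd": bits(word, 15, 11),
        "shamt": bits(word, 10, 6),    # assumption: standard MIPS shift amount
        "funct": bits(word, 5, 0),     # assumption: standard MIPS function code
    }


def decode_register_immediate(word):
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rd": bits(word, 20, 16),
        "imm": sign_extend16(bits(word, 15, 0)),
    }


def decode_jump(word):
    return {"op": bits(word, 31, 26), "target": bits(word, 25, 0)}
```

Note that `rs1` is read from bits [25:21] in every format that has it, so the register file read can start before the opcode is fully decoded.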

20 How to design an ISA that resists pipelining and superscalar
• Use a processor status word
• Variable length instructions
  • Opcodes
  • Variable length operands
  • Variable length operands whose length can’t be determined by the first word of the instruction: how can we determine where the next instruction is, to put it in the pipeline?
• Lots of addressing modes on every instruction
• Anything else to help make the CPI highly variable

21 Architecture Methodology Dynamic usage
“A program executes 90% of its instructions in 10% of its code.” Can’t determine dynamic usage from static code or compiler information Require execution traces or simulation using real data cmpe382/ece510 ch B

22 Memory Allocation
Global Data Area
• constants and strings
• statically declared variables
• BSS: variables initialized to zero
• fixed size at compile and link time
Heap
• dynamically declared objects
• new, malloc
Stack (not the stack-machine evaluation stack)
• JSR return address
• saved registers from procedure call
• automatic variables (local to procedure or function)

23 Numerous CISC addressing modes
Addressing mode | Example | Meaning
Register | Add R4,R3 | R4 ← R4+R3
Immediate | Add R4,#3 | R4 ← R4+3
Displacement | Add R4,100(R1) | R4 ← R4+Mem[100+R1]
Register indirect | Add R4,(R1) | R4 ← R4+Mem[R1]
Indexed / Base | Add R3,(R1+R2) | R3 ← R3+Mem[R1+R2]
Direct or absolute | Add R1,(1001) | R1 ← R1+Mem[1001]
Memory indirect | Add | R1 ← R1+Mem[Mem[R3]]
Auto-increment | Add R1,(R2)+ | R1 ← R1+Mem[R2]; R2 ← R2+d
Auto-decrement | Add R1,–(R2) | R2 ← R2–d; R1 ← R1+Mem[R2]
Scaled | Add R1,100(R2)[R3] | R1 ← R1+Mem[100+R2+R3*d]
Why Auto-increment/decrement? Scaled?
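The "Meaning" column can be read as a small operand-fetch routine. The sketch below assumes registers and memory modelled as Python dictionaries and an operand size `d` of 4 bytes; the function name, mode strings and keyword parameters are all illustrative, not part of any real ISA definition.

```python
def fetch_operand(mode, regs, mem, *, reg=None, index=None, disp=0, imm=None, d=4):
    """Return the source operand for one addressing mode, updating regs where the mode requires."""
    if mode == "register":
        return regs[reg]
    if mode == "immediate":
        return imm
    if mode == "displacement":
        return mem[disp + regs[reg]]
    if mode == "register_indirect":
        return mem[regs[reg]]
    if mode == "indexed":
        return mem[regs[reg] + regs[index]]
    if mode == "absolute":
        return mem[disp]
    if mode == "memory_indirect":
        return mem[mem[regs[reg]]]          # two memory reads
    if mode == "auto_increment":
        value = mem[regs[reg]]
        regs[reg] += d                      # register write after the access
        return value
    if mode == "auto_decrement":
        regs[reg] -= d                      # register write before the access
        return mem[regs[reg]]
    if mode == "scaled":
        return mem[disp + regs[reg] + regs[index] * d]
    raise ValueError(f"unknown addressing mode: {mode}")
```

Auto-increment and auto-decrement step a pointer through an array or stack as a side effect, and scaled mode indexes an array of `d`-byte elements, which is one way to answer the question on the slide.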

24 Cost to implement addressing modes
Operations (& hardware) to fetch operand: Dmem + Imem(imm) + ALU + regRead + regWrite
Table rows: Register, Immediate, Displacement, Register indirect, Indexed / Base, Direct or absolute, Memory indirect, Auto-increment, Auto-decrement, Scaled, Maximum of all above
All of these resources are needed for each operand for CPI=1

25 Are all of those addressing modes used?
VAX running SPEC89 - dynamic usage

26 Addressing mode usage
3 desktop programs:
• Displacement: 42% avg, 32% to 55%
• Immediate: 33% avg, 17% to 43%
• Register deferred (indirect): 13% avg, 3% to 24%
• Scaled: 7% avg, 0% to 16%
• Memory indirect: 3% avg, 1% to 6%
• Misc: 2% avg, 0% to 3%
75% displacement & immediate
88% displacement, immediate & register indirect
Optimize the common case

27 Size of displacement and offset
• Too many bits increases size of all instructions
• Too few bits requires piecing together literals
• Instruction should be a “nice” size

28 Typical RISC addressing modes
Fewer than the up to 16 CISC addressing modes:
• Register and immediate addressing mode on all ALU instructions
• Load/store architecture (the only route to data memory is via these instructions)
• Implement displacement addressing mode: LD R1,314(R2)
  • for free: register indirect, LD R1,0(R2)
  • also for free: absolute, LD R1,42(R0)

29 Categories of instructions H&P
• Integer ALU (arithmetic & logical)
• Data transfer (load, store)
• Control (branch, jump, call, return, trap)
• System (interrupts, IO, VM, protection)
• Floating Point
• Decimal (any COBOL programmers in the room?)
• String
• Graphics (fast pixel & vertex operations)

30 MIPS64 data transfer instructions
Instruction | Meaning
SD 500(R4), R3 | Store double word (64 bits)
SW 500(R4), R3 | Store word
SH 502(R2), R3 | Store half
SB 41(R3), R2 | Store byte
LD R1, 30(R2) | Load double word
LW R1, 30(R2) | Load word
LH R1, 40(R3) | Load halfword
LHU R1, 40(R3) | Load halfword unsigned
LB R1, 40(R3) | Load byte
LBU R1, 40(R3) | Load byte unsigned
LUI R1, 40 | Load Upper Immediate (16 bits shifted left by 16)
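The signed and unsigned load variants differ only in how the upper register bits are filled. A minimal sketch, assuming memory modelled as a Python `bytes` object; the helper names are illustrative, and `load_upper_immediate` models only the "16 bits shifted left by 16" description from the table.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def load_byte(mem, addr):
    """LB: the byte's top bit is copied into all upper bits."""
    return sign_extend(mem[addr], 8)


def load_byte_unsigned(mem, addr):
    """LBU: upper bits are filled with zeros."""
    return mem[addr] & 0xFF


def load_upper_immediate(imm16):
    """LUI as described in the table: the 16-bit immediate shifted left by 16."""
    return (imm16 & 0xFFFF) << 16
```

The same byte 0x80 therefore loads as -128 with LB and as 128 with LBU.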

31 MIPS arithmetic instructions
Instruction | Example | Meaning | Comments
add | add $1,$2,$3 | $1 = $2 + $3 | 3 operands
subtract | sub $1,$2,$3 | $1 = $2 – $3 | 3 operands
add immediate | addi $1,$2,100 | $1 = $2 + 100 | constant
add unsigned | addu $1,$2,$3 | $1 = $2 + $3 | 3 operands
subtract unsigned | subu $1,$2,$3 | $1 = $2 – $3 | 3 operands
add imm. unsign. | addiu $1,$2,100 | $1 = $2 + 100 | constant
multiply | mult $2,$3 | Hi, Lo = $2 x $3 | 64-bit signed product
multiply unsigned | multu $2,$3 | Hi, Lo = $2 x $3 | 64-bit unsigned product
divide | div $2,$3 | Lo = $2 ÷ $3, Hi = $2 mod $3 | Lo = quotient, Hi = remainder
divide unsigned | divu $2,$3 | Lo = $2 ÷ $3, Hi = $2 mod $3 | Unsigned
Move from Hi | mfhi $1 | $1 = Hi | Used to get copy of Hi
Move from Lo | mflo $1 | $1 = Lo | Used to get copy of Lo
Why the departure from 32 elegant registers?
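A sketch of how the 64-bit product and the quotient/remainder pair land in Hi and Lo, assuming 32-bit registers. The function names mirror the mnemonics but are plain Python helpers; the signed divide truncates toward zero, and division by zero simply raises Python's `ZeroDivisionError` here rather than modelling hardware behavior.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mult(a, b):
    """Signed multiply: returns (Hi, Lo) halves of the 64-bit product."""
    product = (to_signed32(a) * to_signed32(b)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32


def multu(a, b):
    """Unsigned multiply: returns (Hi, Lo) halves of the 64-bit product."""
    product = (a & MASK32) * (b & MASK32)
    return product >> 32, product & MASK32


def div(a, b):
    """Signed divide: returns (Hi, Lo) = (remainder, quotient), truncating toward zero."""
    a, b = to_signed32(a), to_signed32(b)
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    remainder = a - quotient * b
    return remainder & MASK32, quotient & MASK32
```

Both operations produce two 32-bit results from one instruction, which does not fit a format with a single destination register; that is one plausible reading of the question that closes the slide.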

32 MIPS logical instructions
Instruction | Example | Meaning | Comments
and | and R1,R2,R3 | R1 = R2 & R3 | 3 reg. operands; Logical AND
or | or R1,R2,R3 | R1 = R2 | R3 | 3 reg. operands; Logical OR
xor | xor R1,R2,R3 | R1 = R2 XOR R3 | 3 reg. operands; Logical XOR
nor | nor R1,R2,R3 | R1 = ~(R2 | R3) | 3 reg. operands; Logical NOR
and immediate | andi R1,R2,10 | R1 = R2 & 10 | Logical AND reg, constant
or immediate | ori R1,R2,10 | R1 = R2 | 10 | Logical OR reg, constant
xor immediate | xori R1,R2,10 | R1 = R2 XOR 10 | Logical XOR reg, constant
shift left logical | sll R1,R2,10 | R1 = R2 << 10 | Shift left by constant
shift right logical | srl R1,R2,10 | R1 = R2 >> 10 | Shift right by constant
shift right arithm. | sra R1,R2,10 | R1 = R2 >> 10 | Shift right (sign extend)
shift left logical | sllv R1,R2,R3 | R1 = R2 << R3 | Shift left by variable
shift right logical | srlv R1,R2,R3 | R1 = R2 >> R3 | Shift right by variable
shift right arithm. | srav R1,R2,R3 | R1 = R2 >> R3 | Shift right arith. by variable
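The table writes both right shifts as `>>`, so the difference between logical and arithmetic shifting is easy to miss. A minimal sketch on 32-bit values follows; the helper names mirror the mnemonics but are ordinary Python functions written for illustration.

```python
MASK32 = 0xFFFFFFFF


def sll(x, n):
    """Shift left logical: bits shifted past bit 31 are lost."""
    return (x << n) & MASK32


def srl(x, n):
    """Shift right logical: zeros enter from the left."""
    return (x & MASK32) >> n


def sra(x, n):
    """Shift right arithmetic: copies of the sign bit enter from the left."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32            # reinterpret as a negative Python integer
    return (x >> n) & MASK32


def nor(a, b):
    """Logical NOR on 32-bit values."""
    return ~(a | b) & MASK32
```

For a value with the top bit set, such as 0x80000000, `srl` by 4 gives 0x08000000 while `sra` by 4 gives 0xF8000000.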

33 Branch offset size vs. frequency H&P fig 20
cmpe382/ece510 ch B

34 MIPS jump, branch, compare instructions
No processor status word
Instruction | Example | Meaning | Comments
branch on equal | beq R1,R2,100 | if (R1 == R2) go to PC+4+100*4 | Equal test; PC relative branch
branch on not eq. | bne R1,R2,100 | if (R1 != R2) go to PC+4+100*4 | Not equal test; PC relative
set on less than | slt R1,R2,R3 | if (R2 < R3) R1=1; else R1=0 | Compare less than; 2’s comp.
set less than imm. | slti R1,R2,100 | if (R2 < 100) R1=1; else R1=0 | Compare < constant; 2’s comp.
set less than uns. | sltu R1,R2,R3 | if (R2 < R3) R1=1; else R1=0 | Compare less than; natural numbers
set l. t. imm. uns. | sltiu R1,R2,100 | if (R2 < 100) R1=1; else R1=0 | Compare < constant; natural numbers
jump | j 10000 | go to 10000*4 | Jump to target address
jump register | jr R31 | go to R31 | For switch, procedure return
jump and link | jal 10000 | R31 = PC + 4; go to 10000*4 | For procedure call
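The branch and compare rows can be sketched as plain functions. This assumes 32-bit registers and follows the simplified meanings in the table (for instance, the jump target is taken as the word address times 4, ignoring how a real MIPS combines it with the upper PC bits); all names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def branch_target(pc, offset_words):
    """PC-relative target: offset counts instructions from the instruction after the branch."""
    return (pc + 4 + offset_words * 4) & MASK32


def beq(pc, r1, r2, offset_words):
    """Return the next PC for a branch-on-equal."""
    return branch_target(pc, offset_words) if r1 == r2 else pc + 4


def slt(r2, r3):
    """Two's complement compare; the result goes into a GPR instead of a status word."""
    return 1 if to_signed32(r2) < to_signed32(r3) else 0


def sltu(r2, r3):
    """Compare as natural (unsigned) numbers."""
    return 1 if (r2 & MASK32) < (r3 & MASK32) else 0


def jump_target(word_address):
    return word_address * 4
```

The pattern 0xFFFFFFFF is less than 1 under `slt` (it is -1) but not under `sltu`, which is why both variants exist.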

35 Test and branch methods
• Compute condition (<, <=, >=, >), leave in single GPR: MIPS, Alpha
• Compare and branch (<, <=, ==, !=, >=, >): VAX, PA-RISC, later MIPS (==, !=); cost?
• Condition code bits in PSW: 80x86, 68000, SPARC, PowerPC
• Conditional select/move: Alpha, MIPS
• Prefixed conditional execution of instructions: Itanium

36 5 Stages of Classical MIPS Int. Datapath
Figure: datapath divided into five stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, Next SEQ PC, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), MUXes, Zero? test, ALU, data memory, LMD, WB Data.

37 Utilization of hardware resources?
Figure: the same five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages; Next SEQ PC and RD appear in several stages.
Data stationary control: local decode for each instruction phase / pipeline stage

38 Visualizing Pipelining
Figure: instructions in program order (vertical axis) against time in clock cycles 1 to 7 (horizontal axis); each instruction uses the resources Ifetch, Reg, ALU and DMem in successive cycles.
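The diagram can be regenerated as text. The sketch below assumes the five stage names from the classic pipeline slide (IF, ID, EX, MEM, WB), one instruction issued per cycle and no stalls; the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    n_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * n_cycles          # "." marks a cycle in which the instruction is not in the pipe
        for s, name in enumerate(stages):
            row[i + s] = name           # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(3), start=1):
        print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Three instructions through five stages occupy 3 + 5 - 1 = 7 cycles, the same span as the cycle axis in the figure.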

39 MIPS integer datapath includes:
• “3-port” register file: 2 read, 1 write
• ALU (2 operands, 1 result)
• comparator (equal / not-equal / full >, >=... in ALU)
• read/write data memory port
• instruction memory port & decode

40 MIPS Floating Point
• 64 and 32 bit precision (double, single)
• 32 FP 64-bit registers F0..F31
• separate load and store instructions, e.g. L.D, S.D
• early versions put a 64-bit DP value in a pair of 32-bit registers
• + - * /, transfers, compares: ADD.D, ADD.S, MULT.D
• multiply accumulate: MADD.D fd, fr, fs, ft // fd = (fs*ft)+fr
• paired operation (pairs of 32-bit single precision operands simultaneously in a 64-bit datapath): MULT.PS (paired single precision)

41 Embedded Processors
• Big market: “Cars have more value in Silicon than steel”
• ~No new applications after manufacture
• Single address space, often no virtual memory
• Floating point rare
• Memory (+ code size), cost, size, power sensitive
• Virtual memory ??
• Controllers: simple RISC processors
• Digital Signal Processors: special instructions for DSP inner loops
• Multiplier accumulator at the heart of it all

42 Multimedia SIMD instructions
SIMD = Single Instruction stream, Multiple Data stream
• Graphics (RGB), some signal processing
• If you have the hardware (ALU, registers) for 64 bit, you can also do:
  • 2 32-bit (integer or single precision floating point)
  • 4 16-bit (integer)
  • 8 8-bit operands
• “just uncouple the carry circuits” (and a bit more)
• “Divided word operations”, “vector” (not)
• HP PA-RISC MAX2, Alpha MAX, MIPS MDMX, SPARC VIS, Intel MMX, PowerPC AltiVec
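"Uncoupling the carry circuits" can be imitated in software. The sketch below adds four 16-bit lanes packed into one 64-bit word without letting a carry cross a lane boundary; it is a software illustration of the idea, not a model of any of the listed instruction set extensions, and the helper names are assumptions.

```python
HIGH = 0x8000800080008000   # top bit of each 16-bit lane
LOW = 0x7FFF7FFF7FFF7FFF    # lower 15 bits of each lane


def pack16(lanes):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFFFF) << (16 * i)
    return word


def unpack16(word):
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


def packed_add16(a, b):
    """Lane-wise add with wraparound; carries never propagate between lanes."""
    # Adding only the low 15 bits of each lane cannot carry out of the lane.
    partial = (a & LOW) + (b & LOW)
    # The top bit of each lane is then fixed up with XOR, which discards the lane's carry-out.
    return partial ^ ((a ^ b) & HIGH)
```

For instance, adding lanes [0xFFFF, 1, 2, 0x8000] and [1, 1, 1, 0x8000] gives [0, 2, 3, 0]: lane 0 wraps around without disturbing lane 1.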

43 The Role of Compilers - wrapping up
If a compiler won’t use an instruction or addressing mode, who will? cmpe382/ece510 ch B

44 Optimizing Compilers
• Language front end
  • interchangeable, language specific
• High level optimization
  • mostly machine independent
  • loop transformations, procedure inlining
  • output may even be source code
• Global optimizer
  • register allocation
  • local (within straight-line basic blocks)
  • global (between basic blocks)
• Code generator
  • highly machine dependent
  • generates assembler code or object code directly

45 Register Allocation
• Avoid unnecessary memory traffic (also reduces code size)
• 16+ registers handy
• Special purpose registers are a pain
• Never enough registers; some variables must spill into memory sometimes (local variables stored in the stack)
• Interprocedural register optimization is possible
  • not compatible with shared libraries
  • MIPS used this for SPECmarks only
• Aliasing (with pointers) can defeat allocating variables in registers

46 Local optimization
Within a basic block (between the branches):
• Common subexpression elimination
• Constant propagation
• Expression stack height reduction

47 Global Optimization
Across branches:
• Copy propagation
• Code motion (move code out of loops)
• Induction variable elimination
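A hand-worked before/after pair may help with two of these. This is a sketch of what an optimizer could do, written out manually in Python; the function names and the 16-byte header offset are made-up examples, not output from any real compiler.

```python
def addresses_naive(base, n, elem_size):
    """Element addresses as a programmer might first write them."""
    out = []
    for i in range(n):
        header = base + 16                      # loop-invariant: same value every iteration
        out.append(header + i * elem_size)      # multiply on every iteration
    return out


def addresses_optimized(base, n, elem_size):
    """The same computation after code motion and induction variable rewriting."""
    header = base + 16                          # code motion: computed once, outside the loop
    addr = header                               # running address replaces i * elem_size
    out = []
    for _ in range(n):
        out.append(addr)
        addr += elem_size                       # one add per iteration, no multiply
    return out
```

Both functions return the same list; the second does less work per iteration and no longer needs the counter `i` in the address computation.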

48 Processor-Dependent
• Strength reduction
• Pipeline scheduling
• Branch offset optimization

49 Hardware to keep compiler writers happy
• Regularity: orthogonal architecture (registers, opcodes, addressing modes)
• Primitives, not application level solutions: not FFT addressing modes, eval polynomial, SIMD instructions
• Simplify trade-offs
• Let constants be constants: don’t interpret values at run time if those values are known at compile time

