What is an ISA? Hardware-software interface
Instruction Set Architecture (ISA) defines:
- State of the program: processor registers, memory
- What instructions do: the semantics of instructions, how they update state
- How instructions are represented: the syntax (bit encodings)
...all selected so that their implications for hardware design and compiler design are optimal.
Example: if the register specifier sits in different places in different instruction formats, the hardware needs extra wires and a mux before the register file.
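The three ingredients above can be made concrete with a toy interpreter (purely illustrative, not any real ISA): the state is a register file plus memory, the semantics is the `step` function, and the encoding is the tuple format of each instruction.

```python
# Toy ISA sketch: state = (registers, memory), semantics = step(),
# encoding = the tuple layout of each instruction. Not a real ISA.

def step(state, insn):
    regs, mem = state
    op = insn[0]
    if op == "load":          # ("load", rd, addr): rd <- Mem[addr]
        _, rd, addr = insn
        regs[rd] = mem[addr]
    elif op == "add":         # ("add", rd, rs1, rs2): rd <- rs1 + rs2
        _, rd, rs1, rs2 = insn
        regs[rd] = regs[rs1] + regs[rs2]
    elif op == "store":       # ("store", rs, addr): Mem[addr] <- rs
        _, rs, addr = insn
        mem[addr] = regs[rs]
    return regs, mem

regs, mem = [0] * 4, {0: 5, 4: 7}       # A at address 0, B at address 4
program = [("load", 1, 0), ("load", 2, 4), ("add", 3, 1, 2), ("store", 3, 8)]
for insn in program:
    regs, mem = step((regs, mem), insn)
print(mem[8])  # C = A + B -> 12
```

Changing any one ingredient (adding a register, a new opcode, or a different tuple layout) changes the hardware-software contract, which is exactly why these decisions are hard to revisit later.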
2
Why is the ISA important?
Fixed hardware/software interface for a generation of processors:
- IBM realized early the value of a fixed ISA
- But: "stuck" with bad decisions for a long time
- Recent developments mitigate ISA problems (e.g., x86 micro-ops, Transmeta, virtual machines)
ISA decisions affect: (revisit RISC vs. CISC...)
- Memory cost of the machine: short vs. long bit encodings; high vs. low semantic meaning per instruction
- Hardware design: simple, uniform-complexity ops => efficient pipeline; don't build hardware for instructions that never get used
- Compiler and programming language issues: how much the compiler can exploit the ISA to optimize performance; how well the ISA supports high-level language constructs
- Choice of hand coding vs. compiler-generated code: semantics that are easy to use vs. easy to generate code for
3
ISA Design Decisions & Outline:
- Style of operand specification: stack, accumulator, registers, etc.
- Operand access limitations
- Addressing modes for operands
- Semantics: mix of operations supported; control transfers
- Encoding tradeoffs
- Compiler influence
- Example: MIPS
4
Styles of ISAs: all implement C = A + B
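The classic comparison here shows how each style (stack, accumulator, register-memory, load/store) computes C = A + B. As a sketch of the zero-address (stack) style, with illustrative opcode names:

```python
def run_stack(program, mem):
    """Evaluate a zero-address (stack) program: ALU ops name no operands."""
    stack = []
    for op, *arg in program:
        if op == "push":            # push Mem[addr] onto the stack
            stack.append(mem[arg[0]])
        elif op == "add":           # pop two values, push their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "pop":           # pop the top of stack into Mem[addr]
            mem[arg[0]] = stack.pop()
    return mem

# C = A + B: push A, push B, add, pop C
mem = run_stack([("push", "A"), ("push", "B"), ("add",), ("pop", "C")],
                {"A": 3, "B": 4})
print(mem["C"])  # 7
```

Note that the `add` itself names zero operands, which is what makes stack code so compact.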
Why stacks, accumulators
Stack:
- Very compact format: all calculation operations take zero explicit operands
- Example use: Java bytecode (low network bandwidth)
- Theoretically shortest code for implementing arithmetic expressions (all HP calculator fanatics know this)
Accumulator:
- Also a very compact format
- Less dependence on memory than stack-based
For both:
- Compact implies memory efficient: good if memory is expensive
Why registers? Faster than memory
Latency: raw access time (once the address is known)
- Cache access: 2-3 cycles (typical); register access: 1 cycle
- Register file is typically smaller than the data cache
- Register file doesn't need tag-check logic
Bandwidth: more practical to multiport a register file
- ILP requires a large number of operand ports
ILP requirements:
- High-performance scheduling (ILP) requires detecting data-dependent/independent operations early in the pipeline
- Register "addresses" are known at instruction decode time
- Memory addresses are known quite late due to address computation
Why Registers? (cont.) Less memory traffic if values are in registers
- Program runs faster if variables are in registers (the compiler does "register allocation")
- The bus can be used for other things (e.g., I/O)
- More flexible for compiler/hardware scheduling:
  - (A*B) - (C*D) - (E*F): with A*B in R1, -(C*D) in R2, -(E*F) in R3, the ADD instructions can easily be rearranged
  - With A through F on the stack: less flexible — need to add swaps/rotates or completely rewrite the code
How many registers? Depends on:
- Compiler ability
- Program characteristics
Lots of registers enable two important optimizations:
- Register allocation (more variables can live in registers)
- Limiting reuse of registers improves parallelism
Reuse example:
  Load R2, A; Load R3, B; Load R4, C; Load R5, D
  Add R1, R2, R3
  Add R2, R5, R4 (reuse of R2)   vs.   Add R6, R4, R5 (no reuse: use R6)
- Without reuse, the two Adds are "parallelizable" if there are two adders; with reuse, the conflict on R2 artificially serializes the two instructions
- Instruction-level parallelism (ILP): ILP ~ average CPI^-1 ~ number of registers
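The serialization argument can be checked mechanically: two instructions can issue in parallel only if neither reads or writes a register the other writes. A small sketch (the instruction tuple format is hypothetical):

```python
def conflicts(i1, i2):
    """True if the pair has a RAW, WAR, or WAW register dependence."""
    (d1, s1), (d2, s2) = (i1[0], set(i1[1:])), (i2[0], set(i2[1:]))
    return (d1 in s2) or (d2 in s1) or (d1 == d2)

# Each instruction: (dest, src1, src2)
add1 = ("R1", "R2", "R3")   # Add R1, R2, R3
reuse = ("R2", "R5", "R4")  # Add R2, R5, R4 -- reuses R2
fresh = ("R6", "R4", "R5")  # Add R6, R4, R5 -- fresh destination

print(conflicts(add1, reuse))  # True: R2 is read by one, written by the other
print(conflicts(add1, fresh))  # False: independent, can issue together
```

With more registers available, the compiler can pick `fresh` destinations and expose the parallelism.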
Operand access limitations
Load/store (0,3):
(+) Fixed-length instructions possible: easy fetch/decode
(+) Simpler hardware: efficient pipeline and potentially lower cycle time
(-) Higher instruction count (IC)
(-) Fixed-length instructions are wasteful of memory
Register/memory (1,2):
(+) No need for extra loads
(+) "A few lengths" makes better use of bits: good code density
(-) Destroys a source operand (e.g., Add R1,R2)
(-) May impact CPI
Memory/memory:
(+) Most compact (best code density)
(-) High memory traffic (memory bottleneck)
Alignment
- Byte alignment: any access is accommodated
- Word alignment: only accesses aligned at natural word boundaries are accommodated, due to DRAM/SRAM organization
  - Reduces the number of reads/writes to memory
  - Eliminates hardware for alignment (typically expensive)
- Misalignment is often handled in software:
  - The compiler detects it and generates appropriate instructions
  - ...or the O/S detects it and runs a "fixit" routine
[Figure: byte-addressed memory with word size = 4 bytes. Asking for a word beginning at byte 0 or 4 is OK; asking for any other word (e.g., one starting at byte 2) requires two aligned reads (bytes 0-3 and 4-7) plus a reorder to extract bytes 2, 3, 4, 5.]
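The two-reads-plus-reorder sequence described above can be sketched directly (4-byte words, byte-addressed memory):

```python
WORD = 4  # word size in bytes

def read_word(mem, addr):
    """Read a 4-byte word; unaligned addresses cost two aligned reads."""
    base = addr - addr % WORD
    if addr % WORD == 0:
        return mem[base:base + WORD]           # one aligned read
    lo = mem[base:base + WORD]                 # read #1 (bytes base..base+3)
    hi = mem[base + WORD:base + 2 * WORD]      # read #2 (next aligned word)
    off = addr % WORD
    return (lo + hi)[off:off + WORD]           # reorder: extract wanted bytes

mem = bytes(range(16))
print(read_word(mem, 4))   # aligned:   bytes 4..7, one read
print(read_word(mem, 2))   # unaligned: bytes 2..5, built from two reads
```

This is essentially what a software "fixit" routine does when the hardware refuses unaligned accesses.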
Endian-ness Where is the most-significant byte (MSB) in a word?
- Little-endian (e.g., x86): byte address 0 holds the "least"-significant byte
- Big-endian (e.g., IBM PowerPC): byte address 0 holds the "most"-significant byte
[Figure: an 8-byte word with byte addresses 0-7; little-endian places the LSB at address 0 (MSB at 7), big-endian places the MSB at address 0 (LSB at 7).]
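Python's `int.to_bytes` shows both byte orders directly: the same word lands in memory in opposite orders.

```python
value = 0x01020304

little = value.to_bytes(4, "little")   # LSB at byte address 0 (x86 style)
big = value.to_bytes(4, "big")         # MSB at byte address 0 (PowerPC style)

print(little.hex())  # 04030201 -> address 0 holds the least-significant byte
print(big.hex())     # 01020304 -> address 0 holds the most-significant byte
assert little == big[::-1]             # same word, reversed byte order
```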
Common addressing modes
Register:      Add R4, R3         R4 = R4 + R3
- Used when the value is in a register
Immediate:     Add R4, #3         R4 = R4 + 3
- Useful for small constants, which occur frequently
Displacement:  Add R4, 100(R1)    R4 = R4 + Mem[100 + R1]
- Accesses the frame (arguments, local variables), the global data segment, and fields of a data struct
Addressing modes (cont.)
Register deferred/register indirect:  Add R3, (R1)        R3 = R3 + Mem[R1]
- Access using a computed address
Indexed:                              Add R3, (R1 + R2)   R3 = R3 + Mem[R1 + R2]
- Array accesses: R1 = base, R2 = index
Direct/absolute:                      Add R1, (1001)      R1 = R1 + Mem[1001]
- Accessing global ("static") data
Addressing modes (cont.)
Memory indirect/memory deferred:  Add R1, @(R3)        R1 = R1 + Mem[Mem[R3]]
- Pointer dereferencing: x = *p (if p is not register-allocated)
Autoincrement/postincrement:      Add R1, (R2)+        R1 = R1 + Mem[R2]; R2 = R2 + d   (d is the size of the operation)
- Looping through arrays, stack pop
Autodecrement/predecrement:       Add R1, -(R2)        R2 = R2 - d; R1 = R1 + Mem[R2]
- Same uses as autoincrement, plus stack push
Scaled:                           Add R1, 100(R2)[R3]  R1 = R1 + Mem[100 + R2 + R3*d]
- Array accesses for non-byte-sized elements
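The update rules for these modes are short state transitions over a register file and memory; a sketch with `d = 4` assumed (registers and memory as dicts, names hypothetical):

```python
d = 4  # operand size in bytes (assumption for this sketch)

def displacement(regs, mem, base, off):
    return mem[off + regs[base]]                     # Mem[off + R[base]]

def indexed(regs, mem, base, index):
    return mem[regs[base] + regs[index]]             # Mem[R[base] + R[index]]

def postincrement(regs, mem, r):
    val = mem[regs[r]]                               # use Mem[R[r]] ...
    regs[r] += d                                     # ... then bump the pointer
    return val

def scaled(regs, mem, off, base, index):
    return mem[off + regs[base] + regs[index] * d]   # Mem[off + R[b] + R[i]*d]

regs = {"R2": 100, "R3": 1}
mem = {100: 11, 104: 22, 108: 33}
print(displacement(regs, mem, "R2", 4))   # 22 = Mem[4 + 100]
print(scaled(regs, mem, 0, "R2", "R3"))   # 22 = Mem[0 + 100 + 1*4]
print(postincrement(regs, mem, "R2"))     # 11, and R2 advances to 104
```

Note that postincrement is the only one here with a side effect on the register file, which is why it maps so naturally onto loop and stack idioms.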
Wisdom about modes
Need: register, displacement, immediate, and optionally indexed (indexed simplifies array accesses)
- Displacement size: chosen empirically
- Immediate: 8 to 16 bits (empirical)
- The rest can be synthesized from simpler instructions
Example, the MIPS architecture: register, displacement, and immediate modes only; both the immediate and displacement fields are 16 bits
Choice depends on workload! For example, floating-point codes might require larger immediates, or 64-bit-word machines might also require larger immediates (for *p++-style operations)
Control transfer semantics
Types of branches:
- Conditional
- Unconditional: normal, call, return
PC-relative (branch) vs. absolute (jump):
- PC-relative allows relocatable ("position-independent") code
- Absolute allows branching further than PC-relative reach
Parts of a control transfer
WHERE:   determine the target address
WHETHER: determine if the transfer should occur or not
WHEN:    determine when in time the transfer should occur
Each of the three decisions can be decoupled.
Types of control transfer (cont.)
All three together: compare-and-branch instruction, e.g., Br (R1 = R2), destination
(+) A single instruction
(-) Heavy hardware requirement, inflexible scheduling
WHETHER separate from WHERE/WHEN:
- Condition code register (CMP R1,R2 ... BEQ dest)
  (+) Sometimes the test happens "for free"
  (-) Hard for the compiler to figure out which instructions depend on the CC register
- Condition register (SUB R1,R2 ... BEQ R1, dest)
  (+) Simple to implement; dependences between instructions are obvious to the compiler
  (-) Uses up a register ("register pressure")
Prepare-to-branch Decouple all three of WHERE / WHETHER / WHEN
WHERE:   PBR BTR1 = destination   (BTR1 = "branch target register #1")
WHETHER: CMP PR2 = (R1 = R2)      (PR2 = "predicate register #2")
WHEN:    BR BTR1 if PR2
(+) Each instruction can be scheduled into "free time" when the hardware is otherwise idle
(-) Three instructions: higher IC
From the HP Labs PlayDoh architecture.
Instruction Encoding tradeoffs
Variable width:
- Common instructions are short (1-2 bytes); less common or more complex instructions are long (>2 bytes)
(+) Very versatile, uses memory efficiently
(-) Instruction words must be decoded before the number of instructions (and their boundaries) is known
Fixed width:
- Typically 1 instruction per 32-bit word (Alpha: two instructions per 64-bit word)
(+) Every instruction word is an instruction: easier fetch/decode
(-) Uses memory inefficiently
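The fetch/decode asymmetry shows up in a sketch: with fixed width the next PC is known immediately, while with variable width the length must be decoded before the PC can advance (the opcode-to-length table here is made up):

```python
# Hypothetical length table for a variable-width encoding:
# the first byte of each instruction determines its total length.
LENGTHS = {0x90: 1, 0x05: 5, 0x0F: 3}

def split_variable(code):
    """Instruction boundaries are known only after decoding each length."""
    pc, insns = 0, []
    while pc < len(code):
        n = LENGTHS[code[pc]]          # must decode before advancing the PC
        insns.append(code[pc:pc + n])
        pc += n
    return insns

def split_fixed(code, width=4):
    """Every boundary is known up front: the PC advances by a constant."""
    return [code[i:i + width] for i in range(0, len(code), width)]

var = split_variable(bytes([0x90, 0x05, 1, 2, 3, 4, 0x0F, 0, 0]))
print([len(i) for i in var])           # [1, 5, 3]
print(len(split_fixed(bytes(12))))     # 3 instructions, no decoding needed
```

The sequential dependence inside `split_variable` is exactly what makes wide parallel fetch hard for variable-width ISAs.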
Addressing mode encoding
Each operand has a "mode" field (also called "address specifiers"): VAX, 68000
(+) Very versatile
(-) Encourages variable-width instructions (hard decode)
Opcode specifies the addressing mode: most RISCs
(+) Encourages fixed-width instructions (easy decode)
(+) "Natural" for a load/store ISA
(-) Limits what every instruction can do — but that only matters for loads and stores
Compiler impact
High-level optimization:
- Uses a "virtual source-level" representation
- Loop interchange, etc.
Low-level optimization:
- Cleans up parser refuse
- Each "optimization pass" runs as a filter
- Enhances parallelism
Code generation:
- Allocates registers
- Schedules code for high performance
- (More on this later)
[Flow: Parse -> high-level intermediate language -> high-level optimize -> low-level intermediate language -> low-level optimize -> code generation (allocate, schedule, translate) -> assembly code]
Example: MIPS
A load/store, fixed-encoding architecture with "condition registers"
I-type instruction:  Opcode (6) | rs1 (5) | rd (5) | Immediate (16)
- Loads, stores, all immediate operations, conditional branches (rd unused)
- Jump through register, call through register ("jump and link register")
R-type instruction:  Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | Shamt (5) | Func (6)
- Register-register ALU operations; "Func" is an opcode extension
- The opcode is in the same place for every instruction
J-type instruction:  Opcode (6) | Offset (26, added to PC)
- Jump, call ("jump and link"), trap, and return from exception
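The R-type layout above can be packed and unpacked with shifts and masks; a sketch using the field widths from the slide (fields ordered opcode | rs1 | rs2 | rd | shamt | func, high bits to low):

```python
def encode_rtype(opcode, rs1, rs2, rd, shamt, func):
    """Pack opcode(6)|rs1(5)|rs2(5)|rd(5)|shamt(5)|func(6) into 32 bits."""
    return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) \
         | (shamt << 6) | func

def decode_rtype(word):
    """Unpack the six fields; the opcode is always the top 6 bits."""
    return ((word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F,
            (word >> 11) & 0x1F, (word >> 6) & 0x1F, word & 0x3F)

w = encode_rtype(0, rs1=2, rs2=3, rd=1, shamt=0, func=0x20)
print(f"{w:032b}")
print(decode_rtype(w))  # (0, 2, 3, 1, 0, 32)
```

Because the opcode occupies the same bit positions in every format, the decoder can start reading the register file before it even knows which format it has.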
ISA of MIPS
- Integer registers: R0-R31, each 64 bits; R0 is permanently 0; plus the PC
- Floating-point registers: F0-F31, each 64 bits; use one half of Fi for single-precision ops
- Load/store architecture
- Transfer sizes: B (byte), H (halfword), W (word), D (double word)
- No unaligned accesses allowed
- Only 3 addressing modes: register, immediate, displacement
MIPS example code

        DADDI R1,R0,10    ; put 10 into R1 (R0 = 0)
        LD    R2,A        ; put the address of A into R2
Loop:   L.D   F0,0(R2)    ; load double FP value into F0
        ADD.D F4,F0,F2    ; add F2 to F0
        S.D   0(R2),F4    ; store the result back to memory
        DADDI R1,R1,-1    ; decrement the loop counter
        DADDI R2,R2,8     ; increment the loop pointer
        BNE   R1,R0,Loop  ; loop until R1 reaches 0
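The loop above (ten iterations, adding F2 to each 8-byte element of A) corresponds to roughly this source-level code:

```python
def scale_loop(A, f2):
    """Source-level equivalent of the MIPS loop: add f2 to 10 elements of A."""
    for i in range(10):       # R1 counts 10 down to 0
        A[i] = A[i] + f2      # L.D / ADD.D / S.D, with R2 walking by 8 bytes
    return A

print(scale_loop([float(i) for i in range(10)], 2.5))
```

Notice how the displacement mode `0(R2)` plus the `DADDI R2,R2,8` pointer bump together implement the array indexing that the high-level loop expresses with `A[i]`.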