Published by Thomas Willis. Modified over 7 years ago.
Chapter 5: ISAs

In MARIE, we had simple instructions: a 4-bit op code followed by either
- a 12-bit address for load, store, add, subt and jump
- a 2-bit condition code for skipcond
- 12 0s for instructions that did not need a datum

However, most ISAs are much more complex, with many more op codes and possibly more than one operand.
- How do we specify the operation? Each operation has a unique op code, although op codes might not be of equal length (in MARIE, all were 4 bits; in some ISAs, op codes range from 8 bits to 16 or more).
- How do we specify the number of operands? This is usually determined by the op code, although it could also be specified in the instruction as an added piece of instruction information.
- How do we specify the location of each operand? We need addressing information.

ISA stands for instruction set architecture. Computer architects must make decisions not only about hardware, but about the instruction set itself. This includes:
- What is the instruction format?
- Can instructions vary in length (the number of bits used to specify the instruction)?
- How many operands can an instruction have?
- Of those operands, can any require main memory accesses?
- What types of addressing modes are available for memory accesses?
The decisions we make for our instruction set can negatively impact the performance of the processor when we have a pipeline and/or parallel processing. We study some of those issues here.
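The MARIE format just described (a 4-bit op code over a 12-bit address field) can be pulled apart with shifts and masks. This is a sketch for illustration, not MARIE's actual toolchain:

```python
def decode(instr):
    """Split a 16-bit MARIE-style instruction into its fields."""
    opcode  = (instr >> 12) & 0xF    # upper 4 bits: the operation
    address = instr & 0xFFF          # lower 12 bits: address (or other datum)
    return opcode, address

# e.g. 0x1234: op code 1 (Load, in MARIE's encoding), address 0x234
print(decode(0x1234))  # (1, 564)
```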
Instruction Formats

PDP-10 – fixed-length instructions: a 9-bit op code (512 operations) followed by 2 operands, one in a register and the other in memory.
PDP-11 – variable-length instructions with 13 different formats: op codes vary from 4 bits to 16 bits, and 0, 1, 2 or 3 operands can be specified depending on the format.

Here we see two very different formats and ideologies. The PDP-10 used a rigid fixed-length instruction format. All instructions were 36 bits long and consisted of an op code (the operation) and two operands. The two operands are denoted by a register and a memory location. The memory location is the sum of the memory address specified in the instruction and an index register (which could be 0). The I bit denotes whether the memory location stores the datum (I=0) or a pointer (I=1; I stands for indirect). The PDP-11 instead used 13 different formats; the programmer selected the format that best fit the given instruction, depending on whether it needed 0, 1 or 2 operands and whether the operands should come from registers, a register and memory, a register and an immediate datum, or two memory locations. Different addressing modes were available as well, such as one that combined an index register with an address to compute the actual memory location. CC means condition codes (status flag tests).
Two More Formats

The variable-length Intel (Pentium) format is shown above: instructions can vary from 1 to 17 bytes, with op codes being 1 or 2 bytes long, and all instructions have up to 2 operands. The fixed-length PowerPC format is shown on the right: all instructions are 32 bits, but there are five basic forms, with up to 3 operands as long as all 3 are stored in registers.

The Intel x86 format is deeply convoluted. An instruction can range from 1 byte to 17 bytes. The op code is 1 or 2 bytes, followed by addressing information including the address mode and the address specification for the 1 or 2 operands. An operand can be an immediate datum, a register, an address that consists of an address stored in an addressing register added to a displacement, or some combination. In addition, to accommodate changes made to the architecture starting with the 386, prefixes can be added that change defaults, such as the segment (all memory accesses go to one of four implied segments, but this can be overridden) or the size of the displacement and/or immediate datum from 1 or 2 bytes to 4 bytes.

RISC instruction sets purposefully limit instructions to one fixed size, usually 32 bits. The PowerPC is such a machine: there are numerous instruction formats, but they all must fit within the 32-bit limitation. Most consist of an op code, 2 registers and either a 3rd register or an immediate datum. Most have other bits to denote a specific type of operation. For instance, the op code may merely indicate "arithmetic operation" and later bits specify whether it is an integer addition, integer subtraction, integer comparison, floating point addition, etc. It should be noted that almost nobody programs in machine language or assembly language these days, so a convoluted or hard-to-remember instruction format is immaterial; compilers take care of this for us.
Instruction Format Decisions
Length decisions:
- Fixed length makes instruction fetching predictable (which helps in pipelining).
- Variable length gives flexible instructions that can accommodate up to 3 operands (including 3 memory references), and length is determined by need, so memory space is not wasted.
Number of addressing modes:
- Fewer addressing modes make things easier on the architect, but possibly harder on the programmer.
- Simple addressing modes make pipelining easier.
How many registers?
- Generally, the more the better, but with more registers there is less space available for other circuits or cache (more registers = more expense).

We will visit, later in these notes, why a fixed-length instruction is desirable. However, the fixed-length instruction requires us to make some sacrifices. For instance, the VAX instruction set includes an operation that allows you to do ADD X, Y, Z – that is, specify 3 memory addresses in one operation. A fixed-length instruction set usually limits the number of memory addresses in one operation to 0 unless the operation is a load or a store (this is known as a load-store instruction set). The reason: if we want to specify 3 memory operands, we don't have enough space in a 32-bit fixed-length instruction.

To determine how many bits an operand specification takes up in an instruction, take log2 of the number of options (rounded up). Assume an instruction format is:

op code | num operands | operand mode | op1 specification | op2 spec | op3 spec

Assume 100 instructions (op codes), up to 3 operands, and 8 addressing modes. This takes up log2 100 (rounded up to 7 bits) + log2 4 (= 2 bits, to distinguish 0-3 operands) + log2 8 (= 3 bits) = 12 bits. In a 32-bit instruction, we would only have 20 bits left for the 3 operand specifications. If we restricted all 3 to be in registers, this would give us 20 / 3 = 6 bits per operand, so we could have 2^6 = 64 registers. If we had 2 operands in registers and had 64 registers, the final operand specification would be limited to 32 – (12 + 6 + 6) = 8 bits.
If this was to be a memory address, we would only be able to specify an address from 0 to 2^8 – 1 = 255. Or, if it was an immediate datum in 2's complement, then it would be a value from -128 to +127.
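The bit-budget arithmetic above can be checked in a few lines. The field sizes are the hypothetical ones from the example (100 op codes, up to 3 operands, 8 addressing modes, 32-bit instructions):

```python
import math

# Bit budget for the hypothetical 32-bit format:
# op code | num operands | operand mode | up to 3 operand specs
op_bits   = math.ceil(math.log2(100))  # 100 op codes -> 7 bits
num_bits  = math.ceil(math.log2(4))    # 0..3 operands -> 2 bits
mode_bits = math.ceil(math.log2(8))    # 8 addressing modes -> 3 bits

fixed     = op_bits + num_bits + mode_bits  # 12 bits of fixed fields
remaining = 32 - fixed                      # 20 bits left for 3 operand specs
per_op    = remaining // 3                  # 6 bits per register operand
print(fixed, remaining, 2 ** per_op)        # 12 20 64
```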
Alignment

Another question is what byte ordering should be used. Recall that most machines today have word sizes of 32 or 64 bits, and the CPU fetches or stores 1 word at a time. Yet memory is organized in bytes. Should we allow the CPU to access something smaller than a word? If so, we have to worry about how bytes are ordered within a word. Two methods are used:
- Big endian – bytes are placed in order in the word (most significant byte first)
- Little endian – bytes are placed in the opposite order (least significant byte first)
Different architectures use different orderings.

This is a topic that we don't spend time covering in detail even though the authors do. The basic idea is that we usually store information in a word (often 32 bits). But information is stored in individual bytes, so we group bytes together to make up a word (for 32 bits, that is 4 bytes). Some computers store the data in order; that is, the word is split into 4 segments and stored in consecutive bytes. This is known as big endian. Some computers reverse the order, so that the most significant byte is stored in the last byte of the word and the least significant byte in the first byte. This is known as little endian. (The slide shows a sample hex value stored in a word both ways.) This topic is irrelevant unless two computers that use different byte orders need to communicate with each other. For instance, if I store a bitmap in little endian order and transfer that bitmap to a big endian machine, each word's storage must first be altered. Intel uses little endian, and bitmaps were developed this way, so a bitmap must be converted before it can be viewed on a big endian machine!
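The two byte orders can be demonstrated with Python's struct module; the 32-bit word 0x12345678 here is an arbitrary sample value, not the one from the slide's figure:

```python
import struct

word = 0x12345678  # an arbitrary 32-bit sample word

big    = struct.pack(">I", word)  # big endian: most significant byte first
little = struct.pack("<I", word)  # little endian: byte order reversed

print(big.hex())     # 12345678
print(little.hex())  # 78563412
```

Converting between the two is exactly a byte reversal of each word, which is what a bitmap viewer on a big endian machine would have to do.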
Type of CPU Storage

Although all architectures today use register storage, other approaches have been tried:
- Accumulator-based – a single data register, the accumulator (MARIE is like this). This was common in early computers, when register storage was very expensive.
- General-purpose registers – many data registers are available for the programmer's use. Most RISC architectures are of this form.
- Special-purpose registers – many data registers, but each has its own implied use (e.g., a counter register for loops, an I/O register for I/O operations, a base register for arrays, etc.). The Pentium is of this form.
- Stack-based – instead of general-purpose registers, storage is a stack, and operations are rearranged to be performed in postfix order. An early alternative to accumulator-based architectures, obsolete now.

An accumulator-based CPU uses the accumulator as an implied register for all operations. For instance, Add X means AC ← AC + X. This is very primitive but simplifies instructions, as you only have to specify 1 operand in any instruction. The problem is that you have to do a lot of memory movement to store temporary values. Consider trying to compute A = B + C * D / (E – F). We would have to break this into: C * D, store the value (say as T1); E – F, store the value (say as T2); do T1 / T2; and finally add B to the result. The stack-based machine was an early competitor to the accumulator-based machine: loads push data onto a stack, stores pop off the top of the stack, and arithmetic operations pop the top 2 items off the stack, perform the operation and push the result. So arithmetic operations are 0-operand instructions, since all operands are implied to be the top (2 items) of the stack. Today, computers have multiple registers, so the question is whether the registers can be used any way we want, or whether each has an implied usage. We will see, when we visit the Intel architecture, that it uses a special-purpose register set.
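The stack-based idea can be made concrete with a toy postfix evaluator for the expression A = B + C * D / (E – F). The variable values are made up for illustration; loads push, stores pop, and every arithmetic op is a 0-operand instruction:

```python
env = {"B": 10, "C": 6, "D": 4, "E": 11, "F": 3}  # made-up sample values

def run(program):
    stack = []
    for token in program:
        if token in env:                     # a load: push the datum
            stack.append(env[token])
        else:                                # 0-operand arithmetic op
            b, a = stack.pop(), stack.pop()  # pop the top two items
            stack.append({"+": a + b, "-": a - b,
                          "*": a * b, "/": a / b}[token])
    return stack.pop()                       # a store: pop the result

# B + C*D / (E - F) in postfix: B C D * E F - / +
print(run(["B", "C", "D", "*", "E", "F", "-", "/", "+"]))  # 13.0
```

Note there are no temporaries like T1 and T2: the stack itself holds the intermediate values the accumulator machine had to spill to memory.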
Load-Store Architectures
When deciding on the number of registers to make available, architects also decide whether to support a load-store instruction set. In a load-store instruction set, the only operations allowed to reference memory are loads and stores. All other operations (ALU operations, branches) may reference only values in registers or immediate data (data in the instruction itself). This makes programming more difficult, because a simple operation like inc X must now first cause X to be loaded into a register, incremented, and stored back to X. But it is necessary to support a pipeline, which ultimately speeds up processing! All RISC architectures are load-store instruction sets and require at least 16 registers (hopefully more!). Many CISC architectures permit memory-memory and memory-register ALU operations, so these machines can get by with fewer registers; Intel has 4 general-purpose data registers.

We will see why a load-store instruction set is desirable for pipelining later in these notes. All RISC machines are load-store. This means that an operation like A = B + C must be performed like this:
  Load R1, B
  Load R2, C
  Add R3, R1, R2
  Store R3, A
Whereas on the VAX we can do this with a single Add A, B, C, and on Intel it is slightly shorter than RISC:
  Mov Eax, B
  Add Eax, C
  Mov A, Eax
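The load-store discipline can be sketched as a tiny interpreter. Register names, memory labels and data values here are invented for illustration; the point is that only LOAD and STORE touch memory, while ADD works on registers alone:

```python
mem = {"A": 0, "B": 7, "C": 5}  # sample data; labels stand in for addresses
reg = {}

# The RISC sequence for A = B + C
program = [
    ("LOAD",  "R1", "B"),          # R1 <- mem[B]   (memory access)
    ("LOAD",  "R2", "C"),          # R2 <- mem[C]   (memory access)
    ("ADD",   "R3", "R1", "R2"),   # R3 <- R1 + R2  (registers only)
    ("STORE", "R3", "A"),          # mem[A] <- R3   (memory access)
]

for op, *args in program:
    if op == "LOAD":
        reg[args[0]] = mem[args[1]]
    elif op == "STORE":
        mem[args[1]] = reg[args[0]]
    elif op == "ADD":                  # no memory operands allowed here
        reg[args[0]] = reg[args[1]] + reg[args[2]]

print(mem["A"])  # 12
```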
Number of Operands

The number of operands that an instruction can specify has an impact on instruction sizes. Consider the instruction Add R1, R2, R3:
- the Add op code is 6 bits
- assuming 32 registers, each register takes 5 bits
- this instruction is 6 + 3 * 5 = 21 bits long
Consider Add X, Y, Z:
- assume 256 MBytes of memory, so each memory reference is 28 bits
- this instruction is 6 + 3 * 28 = 90 bits long!
However, we do not necessarily want to limit our instructions to having 1 or 2 operands, so we must either permit long instructions or find a compromise. The load-store instruction set is a compromise: 3 operands can be referenced as long as they are all in registers, and 1 operand can be referenced in memory as long as it is in an instruction by itself (a load or store uses 1 memory reference only).

Consider this format:

op code | op mode 1 | op spec 1 | op mode 2 | op spec 2

The machine has 315 instructions (typical of a CISC instruction set) and 12 addressing modes. The shortest mode is a register and the longest is an index register + memory address. The machine has 32 registers and an addressable memory of 1G. How long will the instruction be?
- Op code = 9 bits (2^9 = 512 ≥ 315)
- Op mode = 4 bits (2^4 = 16 ≥ 12)
- Shortest operand specification = 5 bits (a register)
- Longest operand specification = 5 bits + 30 bits (register + address) = 35 bits
- Shortest instruction (one register operand) = 9 + 4 + 5 = 18 bits
- Longest instruction (two register + address operands) = 9 + 2 * (4 + 35) = 87 bits
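The field widths for this hypothetical CISC format can be double-checked with the same log2 arithmetic:

```python
import math

# Field widths for the hypothetical CISC format above
op_bits   = math.ceil(math.log2(315))  # 315 op codes -> 9 bits
mode_bits = math.ceil(math.log2(12))   # 12 addressing modes -> 4 bits
reg_bits  = math.ceil(math.log2(32))   # 32 registers -> 5 bits
addr_bits = 30                         # 1G addressable memory -> 30 bits

# Shortest: one operand, register mode; longest: two reg+address operands
shortest = op_bits + (mode_bits + reg_bits)
longest  = op_bits + 2 * (mode_bits + reg_bits + addr_bits)
print(shortest, longest)  # 18 87
```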
1, 2 and 3 Operand Examples

Each sequence below computes Y = (A – B) / (C + D * E).

Using three addresses:
  SUB Y, A, B    ; Y ← A – B
  MPY T, D, E    ; T ← D * E
  ADD T, T, C    ; T ← T + C
  DIV Y, Y, T    ; Y ← Y / T

Using two addresses:
  MOVE Y, A      ; Y ← A
  SUB Y, B       ; Y ← Y – B
  MOVE T, D      ; T ← D
  MPY T, E       ; T ← T * E
  ADD T, C       ; T ← T + C
  DIV Y, T       ; Y ← Y / T

Using one address:
  LOAD D         ; AC ← D
  MPY E          ; AC ← AC * E
  ADD C          ; AC ← AC + C
  STOR Y         ; Y ← AC
  LOAD A         ; AC ← A
  SUB B          ; AC ← AC – B
  DIV Y          ; AC ← AC / Y
  STOR Y         ; Y ← AC

Here we compare the length of the code with one-address, two-address and three-address instructions. Notice: one- and two-address instructions write over a source operand, thus destroying data. See the textbook for another example.
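A quick check (with made-up values) that the three-address sequence really computes Y = (A – B) / (C + D * E):

```python
A, B, C, D, E = 20, 4, 2, 3, 2  # made-up sample values

Y = A - B      # SUB Y, A, B
T = D * E      # MPY T, D, E
T = T + C      # ADD T, T, C
Y = Y / T      # DIV Y, Y, T

print(Y)  # 16 / 8 = 2.0
```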
Addressing Modes

In our instruction, how do we specify the data? We have different addressing modes that specify how to find the data. Most modes generate memory addresses; some reference registers instead. The most common modes, covered on the next slides, are Immediate, Register, Direct, Indirect, Register Indirect, Indexed and Stack (we have already used Direct and Indirect in our MARIE examples of the last chapter). Note: the Indexed mode is also called Base-displacement or Displacement.
Computing These Modes

In Register mode, the operand is stored in a register, and the register is specified in the instruction. Example: Add R1, R2, R3.
In Immediate mode, the operand is in the instruction itself, such as Add #5. This is used when the datum is known at compile time; it is the quickest form of addressing.
In Direct mode, the operand is in memory and the instruction contains a reference to the memory location. Because there is a memory access, this method is slower than Register. Example: Add Y (in assembly) becomes Add <address of Y> (in machine language).
In Indirect mode, the memory reference is to a pointer; this requires two memory accesses and so is the slowest of all addressing modes.

The immediate addressing mode requires that the datum be specified as a literal in the instruction, as in x++; (x = x + 1;) or x = 5;. This is the fastest addressing mode because the datum is already in the CPU (stored in part of the IR). The Register mode is next fastest and also permits short instructions, as the number of bits required to specify a register is just log2 of the number of registers; most machines have 64 or fewer, so this takes only 6 or fewer bits. The disadvantages are that you must first move the datum from memory into a register (and later move it back to memory if needed) and that there are a limited number of registers. The direct addressing mode is the only one we used in MARIE (although you may have also used immediate). The main problems with direct are that a memory access is time consuming, so the operation takes longer than one that uses only immediate or register operands, and that the address may require many bits to specify. Indirect is the same as direct except that the memory location stores a pointer, so it requires 2 memory accesses and thus takes even longer to execute. Consider the operation *x = *y + *z; — this requires 6 total memory accesses (not including the original instruction fetch!).
Continued

Indexed (or Based) mode is like Direct except that the address referenced is computed as the combination of a base value stored in a register and an offset in the instruction. Example: Add R3(300). This mode is also called Displacement or Base-displacement.
Register Indirect mode is like Indirect except that the instruction references a pointer in a register, not in memory, so one memory access is saved. Notice that Register and Register Indirect permit shorter instructions, because a register specification is shorter than a memory address specification.
In Stack mode, the operand is at the top of the stack, where the stack is pointed to by a special register called the Stack Pointer. This is like Register Indirect in that it accesses a register followed by memory.

Register indirect is a reasonable compromise over direct in that you can keep your instruction length down because you only need to specify a register. The register stores a pointer to the datum in memory, so this takes slightly longer than direct to execute. Indexed/Based looks up the address in the register and adds it to the address in the instruction. The result is not the datum, but the address of the datum. We use this mode for obtaining an array value. For instance, a[i] requires that we take the starting address of the array, a, and add to it the offset of i (actually i*size, where size is the size of each item in the array; for instance a + i*4 if a stores int values). The addition doesn't take much time, so this mode takes very much the same time as register indirect. There are two basic approaches here: the register stores an offset and the instruction stores the starting point (used for arrays), or the register stores the starting point and the instruction stores the offset (often used for obtaining a datum in relocatable code). Stack mode uses the implied stack pointer to obtain the datum at the top of the stack.
Example

Assume our instruction is Load 800, that memory stores the values shown in the slide's figure, and that register R1 stores 800. The value loaded into the accumulator depends on the addressing mode used:
- Immediate: the datum is 800 itself
- Direct: the datum is at location 800
- Indirect: the datum's location is pointed to by the value stored at location 800
- Indexed: the datum's location is at R1 + 800 (= 1600)

The example should be straightforward. Be warned, though, that to obtain the datum for Indexed we add 800 to the value in R1 (800), which gives us 1600. This is not the datum but the address of the datum; the datum is whatever is stored at location 1600. That's why the answer in the figure is 700.
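The modes in this example can be sketched as a small resolver. The memory contents below are invented (the slide's figure did not survive), except that they reproduce the slide's indexed answer: R1 + 800 = 1600, whose content is 700:

```python
mem = {800: 900, 900: 1000, 1600: 700}  # invented memory contents
reg = {"R1": 800}

def load(mode, field):
    if mode == "immediate": return field                   # datum is 800 itself
    if mode == "direct":    return mem[field]              # datum at location 800
    if mode == "indirect":  return mem[mem[field]]         # location 800 holds a pointer
    if mode == "indexed":   return mem[reg["R1"] + field]  # datum at R1 + 800 = 1600
    raise ValueError(mode)

for mode in ("immediate", "direct", "indirect", "indexed"):
    print(mode, load(mode, 800))
# indexed resolves mem[1600] = 700, matching the slide's answer
```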
Instruction Types

Now that we have explored some of the issues in designing an instruction set, let's consider the types of instructions:
- Data movement (load, store) +
- I/O +
- Arithmetic (+, -, *, /, %) *
- Boolean (AND, OR, NOT, XOR, Compare) *
- Bit manipulation (rotate, shift) *
- Transfer of control (conditional branch, unconditional branch, branch and link, trap) *
- Special purpose (halt, interrupt, others)
Those marked with * use the ALU. (Note: branches add to, subtract from, or otherwise change the PC, so these also use the ALU.) Those marked with + use memory or I/O.
Instruction-Level Pipelining
We have already covered the fetch-execute process. It turns out that, if we are clever about designing our architecture, we can design the fetch-execute cycle so that each phase uses different hardware. Then we can overlap instruction execution in a pipeline: the CPU becomes like an assembly line, and instructions are fetched from memory and sent down the pipeline one at a time. The first instruction is at stage 2 when the second instruction is at stage 1; or, instruction j is at stage 1 when instruction j – 1 is at stage 2 and instruction j – 2 is at stage 3, etc. The length of the pipeline determines how many overlapping instructions we can have: the longer the pipeline, the greater the overlap and so the greater the potential for speedup. It turns out that long pipelines are difficult to keep running efficiently, though, so shorter pipelines are often used.
A 6-stage Pipeline

Stage 1: Fetch instruction
Stage 2: Decode op code
Stage 3: Calculate operand addresses
Stage 4: Fetch operands (from registers, usually)
Stage 5: Execute instruction (includes computing the new PC for branches, or doing loads and stores)
Stage 6: Store result (in a register, for an ALU operation or load)

We tune the clock cycle to trigger at the rate of the longest pipeline stage. This is typically a stage that performs a memory (cache) access, which above is stage 1 and sometimes stage 5 (load or store instructions). If an instruction cannot complete a stage within the 1-cycle limit, it stalls the pipeline; that is, the stages behind it must wait 1 or more extra cycles. The reason for having fixed-length instructions should now become clear. With a variable-length instruction, it might take more than 1 clock cycle to fetch the whole instruction. This means the next fetch cannot be performed at the next clock cycle, because we are still waiting for the previous instruction to be fetched. Such a delay of the pipeline is called a stall. It arises for a number of reasons and impacts the performance of the pipeline. (The slide's pipeline timing diagram shows how instructions overlap.)
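Ideal, stall-free timing for a k-stage pipeline can be sketched in a few lines: instruction i (counting from 1) finishes at cycle i + k – 1, so n instructions take n + k – 1 cycles in total:

```python
def completion_cycle(i, k):
    """Cycle in which instruction i (1-based) leaves a k-stage pipeline, no stalls."""
    return i + k - 1

for i in range(1, 5):
    print(f"instruction {i} completes at cycle {completion_cycle(i, 6)}")
# instruction 4 completes at cycle 9: four instructions finish in 9 cycles
```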
Pipeline Performance

Assume a machine has 6 steps in its fetch-execute cycle. A non-pipelined machine takes 6 * n clock cycles to execute a program of n instructions. A pipelined machine takes n + (6 – 1) clock cycles to execute the same program! If n is 1000, the pipelined machine is 6000 / 1005 ≈ 5.97 times faster, almost a speedup of 6.

In general, a pipeline's time is Time = (k + n – 1) * tp, where k = number of stages, n = number of instructions and tp = the time per stage (plus delays caused by moving instructions down the pipeline). The non-pipelined machine's time is Time = k * n * tp, so the speedup is (k * n) / (k + n – 1). However, there are problems that a pipeline faces, because it overlaps the execution of instructions, that cause it to slow down.

You might notice from the previous slide that the 4 operations could complete in just 9 clock cycles (if there were no stalls). We actually need to multiply the pipeline formula by a pipeline delay; there are several reasons for this delay, and it lengthens the pipeline time by a small amount. For instance, if the clock cycle is 1 ns, the pipeline delay may be .1 ns. For our examples, we will ignore this delay, so the time to execute n instructions on a k-stage pipeline is n + k – 1 cycles, compared to k * n for a non-pipelined machine. For instance, if our machine has 6 stages in its fetch-execute cycle (see the previous slide) and we compare a pipelined vs. non-pipelined version on a program of 1000 instructions, the pipelined machine is 6 * 1000 / (1000 + 6 – 1) = 6000 / 1005 ≈ 5.97 times faster. A pipeline will be nearly k times faster than a non-pipelined machine if both use a k-stage fetch-execute cycle, unless there are stalls…
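The speedup formula can be checked numerically (ignoring the per-stage pipeline delay, as the slide does):

```python
def speedup(n, k):
    """Pipelined vs non-pipelined speedup for n instructions, k stages."""
    return (k * n) / (n + k - 1)

print(round(speedup(1000, 6), 2))  # 5.97, approaching the ideal 6x
```

As n grows, the (k – 1) fill cycles become negligible and the speedup approaches k.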
Pipeline Problems

Pipelines are impacted by:
Resource conflicts – If it takes more than 1 cycle to perform a stage, then the next instruction cannot move into that stage. For instance, floating point operations often take 2-10 cycles to execute rather than the single cycle of most integer operations.
Data dependences – Consider:
  Load R1, X
  Add R3, R2, R1
Since we want to add in the 5th stage, but the datum in R1 is not available until the previous instruction reaches the 6th stage, the Add must be postponed by at least 1 cycle.
Branches – In a branch, the PC is changed, but in a pipeline we may have already fetched one or more instructions before we reach the stage where the PC is changed!

Each of the above causes stalls. Aside from FP operations, stalls also arise from cache misses during the instruction fetch or a load or store. But mostly, we get stalls from data dependences and branches. Notice that the sequence
  Load R1, X
  Add R3, R1, R2
will not cause a stall on an unpipelined machine, because the Load is executed in its entirety before the Add is even fetched. But looking back at the 6-stage pipeline, the Load fetches the datum X from memory (cache) in stage 5 and stores it in R1 in stage 6, while the Add would try to read R1 in stage 4 and therefore get the wrong value:

  Load: Fetch | Decode | Calc | Fetch Op | Do Load | Store
  Add:          Fetch  | Decode | Calc   | (can't fetch the operand here…)

The next slide explores the problem with branches.
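The data dependence above is a read-after-write (RAW) hazard. A minimal sketch of detecting it, using invented (op, dest, src...) tuples to represent instructions:

```python
def raw_hazard(writer, reader):
    """True if the later instruction reads a register the earlier one writes."""
    dest = writer[1]   # destination register of the earlier instruction
    srcs = reader[2:]  # source operands of the later instruction
    return dest in srcs

load = ("LOAD", "R1", "X")         # writes R1 back in stage 6
add  = ("ADD",  "R3", "R1", "R2")  # wants to read R1 in stage 4: too early
print(raw_hazard(load, add))  # True
```

A real pipeline's hazard-detection unit does essentially this comparison in hardware to decide when to stall (or forward) a result.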
Impact of Branches

In our 6-stage pipeline, we compute the new PC at the 5th stage, so we would have already fetched 4 wrong instructions (these 4 instructions are the branch penalty). Thus, every branch slows the pipeline down by 4 cycles, because 4 wrong instructions were already fetched. Consider the four-stage pipeline below:
S1 – fetch instruction
S2 – decode instruction, compute operand addresses
S3 – fetch operands
S4 – execute instruction, store result (this includes computing the PC value)
Here, every branch instruction is followed by 3 incorrectly fetched instructions, a branch penalty of 3.

In the slide's diagram, the branch instruction in row 3 computes that it is going to branch to a new location only at the end of stage 4. Unfortunately, the pipeline will have already fetched the next 3 instructions, which sit in stages 3, 2 and 1 respectively. We want to branch to line 8, so first we have to remove the 3 incorrectly fetched instructions; this is done by flushing the pipeline, which wastes 3 cycles. In general, if a pipeline computes the branch information in stage i, then this causes a branch penalty of i – 1 (the number of wasted cycles before the pipeline can resume fetching at the proper location). Again, a non-pipelined machine does not have this problem, because it completes the branch instruction in its entirety before moving on to the next instruction. So a pipeline offers great speedup but also new problems. How can we make a pipeline more efficient? RISC instruction sets are simplified in part just to ensure fewer stalls. The next slide discusses other ideas.
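The branch penalty folds into the cycle count like this (a simplified model that assumes every counted branch is taken and resolves in stage s, with no other stalls):

```python
def total_cycles(n, k, taken_branches, resolve_stage):
    """Cycles for n instructions on a k-stage pipeline with branch flushes."""
    penalty = resolve_stage - 1  # wrongly fetched instructions per taken branch
    return (n + k - 1) + taken_branches * penalty

print(total_cycles(1000, 6, 0, 5))    # 1005: the stall-free baseline
print(total_cycles(1000, 6, 100, 5))  # 1405: 100 branches cost 400 extra cycles
```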
Other Ideas

In order to improve performance, architects have come up with all kinds of interesting ideas to maintain a pipeline's performance:
- Superscalar – have multiple pipelines so that the CPU can fetch, decode and execute 2 or more instructions at a time.
- Branch prediction – try to guess in advance whether a branch is taken and, if so, where, to lower or remove the branch penalty; if you guess wrong, start over from where you guessed incorrectly.
- Compiler optimizations – let the compiler rearrange your assembly code so that data dependences are broken up and branch penalties are removed by filling the slots after a branch with neutral instructions.
- Superpipelining – divide pipeline stages into substages to obtain a greater overlap without necessarily changing the clock speed.
We cover these in CSC 462/562, so we won't go over them in any detail here.
Real ISAs

Intel – 2 operands, variable-length instructions, register-memory operations (but not memory-memory); pipelined and superscalar with speculation, but at the microcode level.
MIPS – fixed-length, 3-operand instructions if the operands are in registers, load-store otherwise; an 8-stage superpipeline; a very simple instruction set.