
1 Figure 1: Data flow, fetch cycle (memory, MAR, MBR, PC, IR, and the control unit, connected by the address, data, and control buses).

2 Figure 2: Data flow, indirect cycle (memory, MAR, MBR, and the control unit, connected by the address, data, and control buses).

3 Suppose we decide to go into the increasingly lucrative Sport Utility Vehicle (SUV) manufacturing business. After some intense research, we determine that there are five stages in the SUV building process, as follows:
Stage 1: build the chassis.
Stage 2: drop the engine into the chassis.
Stage 3: put doors, a hood, and coverings on the chassis.
Stage 4: attach the wheels.
Stage 5: paint the SUV.

4 After stage 1 builds the chassis, it passes the chassis to stage 2 for the engine to be inserted, then immediately starts building another chassis. This way, all stages are working all of the time.

5 Code: A = A + B
Comment: Add the contents of registers A and B and place the result in A, overwriting whatever is there.

The ALU in our simple computer would perform the following series of steps:
1. Read the contents of registers A and B.
2. Add the contents of A and B.
3. Write the result back to register A.

These steps correspond to the four stages in a classic RISC pipeline. Here are the four stages in their abbreviated form, the form in which you'll most often see them: Fetch, Decode, Execute, Write.
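To make the overlap concrete, here is a minimal Python sketch (an illustration, not anything from the slides) that prints which stage each instruction occupies on each clock cycle of an ideal, stall-free four-stage pipeline:

STAGES = ["Fetch", "Decode", "Execute", "Write"]

def pipeline_schedule(num_instructions):
    """Return {(instruction, cycle): stage} for an ideal, stall-free pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            # Instruction i enters the pipeline on cycle i and advances
            # one stage per cycle, so it occupies stage s on cycle i + s.
            schedule[(instr, instr + offset)] = stage
    return schedule

sched = pipeline_schedule(3)
for cycle in range(3 + len(STAGES) - 1):
    busy = [f"I{i}: {sched[(i, cycle)]}" for i in range(3) if (i, cycle) in sched]
    print(f"cycle {cycle + 1}: " + ", ".join(busy))

Running it shows the assembly-line effect: by cycle 4 all four stages are busy with different instructions.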

6 The amount of time that it takes to complete one pipeline stage is exactly one CPU clock cycle, so a faster clock means that each of the individual pipeline stages takes less time. In terms of our assembly line analogy, imagine a strictly enforced rule that each crew must take at most one hour, and not a minute more, to complete its work. If we speed up the clock by shortening each "hour" to 50 minutes, the assembly line moves faster and produces one finished SUV every 50 minutes.
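The arithmetic behind the analogy, as a throwaway Python sketch (the 50-minute figure is the slide's; the function itself is just for illustration):

def throughput_per_hour(stage_minutes):
    # Once the pipeline is full, one item completes per stage time,
    # so throughput is set by the clock, not by one item's total latency.
    return 60 / stage_minutes

print(throughput_per_hour(60))  # 1.0 SUV per hour at a 60-minute "clock"
print(throughput_per_hour(50))  # 1.2 SUVs per hour at a 50-minute "clock"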

7 Pipelining and Branching
(i) Fetch instruction (FI): Read the next expected instruction into a buffer.
(ii) Decode instruction (DI): Determine the opcode and the operand specifiers.
(iii) Calculate operands (CO): Calculate the effective address of each source operand. This may involve displacement, register indirect, indirect, or other forms of address calculation.
(iv) Fetch operands (FO): Fetch each operand from memory. Operands in registers need not be fetched.
(v) Execute instruction (EI): Perform the indicated operation and store the result, if any, in the specified destination operand location.
(vi) Write operand (WO): Store the result in memory.

8 Figure 6: Timing diagram for a six-stage instruction pipeline

Time:            1   2   3   4   5   6   7   8   9   10  11  12  13  14
Instruction 1:   FI  DI  CO  FO  EI  WO
Instruction 2:       FI  DI  CO  FO  EI  WO
Instruction 3:           FI  DI  CO  FO  EI  WO
Instruction 4:               FI  DI  CO  FO  EI  WO
Instruction 5:                   FI  DI  CO  FO  EI  WO
Instruction 6:                       FI  DI  CO  FO  EI  WO
Instruction 7:                           FI  DI  CO  FO  EI  WO
Instruction 8:                               FI  DI  CO  FO  EI  WO
Instruction 9:                                   FI  DI  CO  FO  EI  WO
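Figure 6's staircase pattern is mechanical enough to generate. A small Python sketch (illustrative only) that prints such a chart for an ideal six-stage pipeline:

SIX_STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def print_chart(num_instructions, width=4):
    cycles = num_instructions + len(SIX_STAGES) - 1
    print("Time:".ljust(17) + "".join(str(t + 1).ljust(width) for t in range(cycles)))
    for i in range(num_instructions):
        # Instruction i+1 starts fetching on cycle i+1, one stage per cycle.
        row = f"Instruction {i + 1}:".ljust(17) + " " * (width * i)
        print(row + "".join(stage.ljust(width) for stage in SIX_STAGES))

print_chart(9)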

9 Figure 7: The effect of a conditional branch on pipeline operation. Instruction 3 is a conditional branch to instruction 15; instructions 4 through 7 are fetched down the wrong path and discarded when the branch resolves, so nothing completes during cycles 9 through 12 (the branch penalty).

Time:            1   2   3   4   5   6   7   8   9   10  11  12  13  14
Instruction 1:   FI  DI  CO  FO  EI  WO
Instruction 2:       FI  DI  CO  FO  EI  WO
Instruction 3:           FI  DI  CO  FO  EI  WO
Instruction 4:               FI  DI  CO  FO  (discarded)
Instruction 5:                   FI  DI  CO  (discarded)
Instruction 6:                       FI  DI  (discarded)
Instruction 7:                           FI  (discarded)
Instruction 15:                              FI  DI  CO  FO  EI  WO
Instruction 16:                                  FI  DI  CO  FO  EI  WO
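A hedged sketch of what branches cost on average. The formula is the standard average-CPI estimate; the frequencies below are made up for illustration:

def average_cpi(base_cpi, branch_fraction, taken_fraction, penalty_cycles):
    # Each taken branch flushes the wrongly fetched instructions,
    # adding a fixed penalty on top of the ideal one-per-cycle rate.
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles

# e.g. 20% branches, 60% of them taken, 4-cycle flush as in Figure 7:
print(average_cpi(1.0, 0.20, 0.60, 4))  # 1.48 cycles per instruction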

10 Superscalar architecture
The term superscalar describes CPUs that have multiple execution units and are capable of completing more than one instruction during each clock cycle. The essence of the superscalar approach is the ability to execute instructions in different pipelines. The concept can be further exploited by allowing instructions to be executed in an order different from the program order.

11 Superscalar processing adds complexity to the processor's control unit, because it is now tasked not only with fetching and decoding instructions, but also with re-ordering the linear instruction stream so that some of its individual instructions can execute in parallel. Furthermore, once executed, the instructions must be put back in the order in which they were originally fetched, so that neither the programmer nor the rest of the system can tell that the instructions weren't executed in their proper sequence.

12 Limitations to Superscalar Execution
Three categories of limitations are:
1. Resource conflicts: These occur if two or more instructions compete for the same resource (register, memory, functional unit) at the same time. By introducing several parallel pipelined units, superscalar architectures try to reduce some of the possible resource conflicts.
2. Control (procedural) dependency: The presence of branches creates major problems in assuring optimal parallelism. Superscalar techniques work most efficiently on RISC architectures, with their fixed instruction length and format.
3. Data conflicts: Data conflicts are produced by data dependencies between instructions in the program. Because superscalar architectures provide great liberty in the order in which instructions can be issued and completed, data dependencies have to be considered with much attention.
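As an illustration of the third category, here is a minimal Python sketch (the instruction format and names are invented for the example) that flags the classic read-after-write, write-after-read, and write-after-write dependencies between two instructions:

# Each instruction is (destination register, set of source registers).
# Illustrative only: real issue logic lives in hardware, not Python.

def hazards(first, second):
    """Return the data dependencies that forbid reordering or full overlap."""
    dst1, srcs1 = first
    dst2, srcs2 = second
    found = []
    if dst1 in srcs2:
        found.append("RAW")  # second reads what first writes (true dependency)
    if dst2 in srcs1:
        found.append("WAR")  # second overwrites what first still reads
    if dst1 == dst2:
        found.append("WAW")  # both write the same register
    return found

# add r1, r2, r3 (r1 = r2 + r3) followed by sub r4, r1, r5 (r4 = r1 - r5):
print(hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r5"})))  # ['RAW']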

13 Register file

14 The register file's interface consists of a data bus and two types of ports: read ports and write ports. In order to read a value from a single register in the register file, the ALU accesses the register file's read port and requests that the data from a specific register be placed on the special internal data bus that the register file shares with the ALU. Likewise, writing to the register file is done through the file's write port.
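A toy Python model of that interface (the port count, class, and method names are invented for illustration; a real register file is wired to the ALU, not called):

class RegisterFile:
    def __init__(self, num_registers):
        self.regs = [0] * num_registers

    def read_port(self, index):
        # Drive the requested register's value onto the shared data bus.
        return self.regs[index]

    def write_port(self, index, value):
        # Latch a value from the data bus into the requested register.
        self.regs[index] = value

rf = RegisterFile(8)
rf.write_port(0, 41)
rf.write_port(1, 1)
# The ALU computes A = A + B by reading two ports and writing one back:
rf.write_port(0, rf.read_port(0) + rf.read_port(1))
print(rf.read_port(0))  # 42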

15 Register Renaming
Consider this code sequence:
mov( 0, eax );
mov( eax, i );
mov( 50, eax );
mov( eax, j );
You can't execute the first two instructions in parallel, nor the third and fourth. Out-of-order execution in a superscalar CPU would normally allow the first and third instructions to execute concurrently, and then the second and fourth instructions could also execute concurrently. However, a data hazard of sorts also exists between the first and third instructions, since they use the same register; a different register could be used instead. The CPU could support an array of EAX registers, e.g. EAX[0], EAX[1], EAX[2], etc., and likewise EBX[0]..EBX[n], ECX[0]..ECX[n], etc. The CPU can automatically choose a different register array element if doing so would not change the overall computation and could speed up the execution of the program.
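A toy Python sketch of the renaming idea (the rename map and instruction format are invented for this illustration): every write to an architectural register is given a fresh element, and every read is redirected to the latest mapping, which makes the false dependency between the first and third instructions disappear.

def rename(program, registers):
    current = {}   # architectural register -> live element index
    counter = {}   # architectural register -> next fresh element index
    out = []
    for src, dst in program:                      # HLA order: mov(src, dst)
        if src in registers and src in current:   # reads use the latest mapping
            src = f"{src}[{current[src]}]"
        if dst in registers:                      # each write gets a fresh element
            n = counter.get(dst, 0)
            counter[dst] = n + 1
            current[dst] = n
            dst = f"{dst}[{n}]"
        out.append(f"mov( {src}, {dst} );")
    return out

program = [("0", "eax"), ("eax", "i"), ("50", "eax"), ("eax", "j")]
print("\n".join(rename(program, registers={"eax"})))

The output is exactly the renamed sequence shown on the next slide.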

16 mov( 0, eax[0] );
mov( eax[0], i );
mov( 50, eax[1] );
mov( eax[1], j );
Since EAX[0] and EAX[1] are different registers, the CPU can execute the first and third instructions concurrently. Likewise, the CPU can execute the second and fourth instructions concurrently.

Very Long Instruction Word Architecture
Very long instruction word (VLIW) architecture allows the compiler or pre-processor to break program instructions down into basic operations that can be performed by the processor in parallel. These operations are put into a very long instruction word, which the processor can then take apart without further analysis, handing each operation to an appropriate functional unit.

17 VLIW chips combine two or more instructions into a single bundle or packet. The compiler prearranges the bundles so the VLIW chip can quickly execute the instructions in parallel, freeing the microprocessor from having to perform the complex and continual runtime analysis that superscalar RISC and CISC chips must do. The compilers are responsible for arranging the instructions in the most efficient manner.
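A minimal Python sketch of the bundling idea (the three-slot layout, unit names, and operations are invented; real VLIW bundles are fixed-width machine words encoded by the compiler):

UNITS = ("integer", "memory", "branch")     # one slot per functional unit

def dispatch(bundle):
    # The processor just hands each slot to its unit -- no runtime
    # dependency analysis, because the compiler built the bundle.
    for unit, operation in zip(UNITS, bundle):
        if operation is not None:           # empty slots are no-ops
            print(f"{unit:>8} unit executes: {operation}")

# One bundle: an add, a load, and an unused branch slot, issued together.
dispatch(("add r1, r2, r3", "load r4, [r5]", None))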

18 But even with the best compilers, there are limits to how much parallelism a VLIW processor can exploit. The approach works well in specialised processors such as DSPs, which perform the same operations over and over again, but a good RISC or CISC design might do just as well with the software that most users run.

19 Dealing with branches
One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to the initial stages of the pipeline, and one of the main obstacles is the conditional branch. A variety of approaches have been taken for dealing with conditional branches:
Multiple streams
Prefetch branch target
Loop buffer
Branch prediction
Delayed branch

20 Multiple Streams
A simple pipeline suffers a penalty for a branch instruction because it must choose one of two instructions to fetch next and may make the wrong choice. A brute-force approach is to replicate the initial portions of the pipeline and allow the pipeline to fetch both instructions, making use of two streams. There are two problems with this approach:
(i) With multiple pipelines there are contention delays for access to the registers and to memory.
(ii) Additional branch instructions may enter the pipeline (either stream) before the original branch decision is resolved. Each such instruction needs an additional stream.
Despite these drawbacks, this strategy can improve performance.

21 Prefetch Branch Target
When a conditional branch is recognised, the target of the branch is prefetched, in addition to the instruction following the branch. This target is then saved until the branch instruction is executed. If the branch is taken, the target has already been prefetched.

Loop Buffer
A loop buffer is a small, very-high-speed memory maintained by the instruction fetch stage of the pipeline, containing the n most recently fetched instructions, in sequence. If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer. The loop buffer has three benefits:
1. With the use of prefetching, the loop buffer will contain some instructions sequentially ahead of the current instruction fetch address. Thus, instructions fetched in sequence will be available without the usual memory access time.
2. If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target will already be in the buffer. This is useful for the rather common occurrence of IF-THEN and IF-THEN-ELSE sequences.
3. This strategy is particularly well suited to dealing with loops, or iterations; hence the name loop buffer. If the loop buffer is large enough to contain all the instructions in a loop, then those instructions need to be fetched from memory only once, for the first iteration. For subsequent iterations, all the needed instructions are already in the buffer.
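A toy Python model of the loop-buffer check (the capacity, addresses, and method names are invented; a real loop buffer is fetch-stage hardware):

from collections import deque

class LoopBuffer:
    def __init__(self, capacity):
        # Remember the addresses of the n most recently fetched
        # instructions; the oldest entries fall out automatically.
        self.addresses = deque(maxlen=capacity)

    def record_fetch(self, address):
        self.addresses.append(address)

    def hit(self, branch_target):
        # A taken branch whose target is still buffered skips memory.
        return branch_target in self.addresses

buf = LoopBuffer(capacity=8)
for pc in range(100, 132, 4):     # fetch a short loop body at 100..128
    buf.record_fetch(pc)
print(buf.hit(104))  # True  -- backward branch into the buffered loop body
print(buf.hit(64))   # False -- target was never fetched (or long evicted)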

22 Branch Prediction
Various techniques can be used to predict whether a branch will be taken. Among the more common are the following:
Predict never taken
Predict always taken
Predict by opcode
Taken/not taken switch
Branch history table
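As a concrete example of the taken/not-taken switch, here is a sketch of the classic two-bit saturating counter (the scheme is standard; this particular code is only an illustration):

class TwoBitPredictor:
    # States 0..3; predict taken when state >= 2. Two wrong guesses in a
    # row are needed to flip the prediction, so a loop branch mispredicts
    # only at loop exit, not on every iteration.
    def __init__(self):
        self.state = 2                     # start weakly "taken"

    def predict(self):
        return self.state >= 2             # True means "predict taken"

    def update(self, taken):
        # Saturate: move toward 3 if taken, toward 0 if not taken.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]            # a 10-iteration loop branch
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(f"{correct}/{len(outcomes)} predicted correctly")  # 9/10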

23 The End

24 Figure 5: Basic 4-stage pipeline

25 The back end corresponds roughly to the ALU and registers in the programming model.

26 The front end roughly corresponds to the control and I/O units.

27 A pipeline does not always have four stages; four is about the minimum for a RISC machine. Note that the number of pipeline stages is referred to as the pipeline depth.

Table 1: Simple versus advanced pipelines

            Simple               Advanced
Front end:  1 Fetch              1 Fetch1
                                 2 Fetch2
            2 Decode/Dispatch    3 Decode/Dispatch
                                 4 Issue
Back end:   3 Execute            5 Execute
            4 Complete/Write     6 Complete
                                 7 Write

