Figure 1: Data flow, fetch cycle (memory, MAR, MBR, PC, IR, control unit; address, data, and control buses)

Figure 2: Data flow, indirect cycle (memory, MAR, MBR, control unit; address, data, and control buses)

It has been decided to go into the increasingly lucrative Sport Utility Vehicle (SUV) manufacturing business. After some intense research, we determined that there are five stages in the SUV building process, as follows:
Stage 1: build the chassis.
Stage 2: drop the engine into the chassis.
Stage 3: put doors, a hood, and coverings on the chassis.
Stage 4: attach the wheels.
Stage 5: paint the SUV.

After stage 1 builds a chassis, it passes the chassis to stage 2 for the engine to be inserted, and then immediately starts building another chassis. This way, all stages are working all of the time.

Code        Comments
A = A + B   Add the contents of registers A and B and place the result in A, overwriting whatever is there.

The ALU in our simple computer would perform the following series of steps:
1. Read the contents of registers A and B.
2. Add the contents of A and B.
3. Write the result back to register A.

This work maps onto the four stages of a classic RISC pipeline. Here are the four stages in their abbreviated form, the form in which you'll most often see them:
Fetch
Decode
Execute
Write
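As a minimal sketch of those three ALU steps (Python, with a plain dictionary standing in for the register file; the register names are purely illustrative):

regs = {"A": 3, "B": 4}

def alu_add(regs):
    a = regs["A"]        # 1. read the contents of registers A and B
    b = regs["B"]
    result = a + b       # 2. add the contents of A and B
    regs["A"] = result   # 3. write the result back to register A, overwriting it

alu_add(regs)
print(regs["A"])         # 7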

The amount of time it takes to complete one pipeline stage is exactly one CPU clock cycle, so a faster clock means that each of the individual pipeline stages takes less time. In terms of our assembly line analogy, imagine a strictly enforced rule that each crew must take at most one hour, and not a minute more, to complete its work. If we speed up the clock so that each "hour" is shortened by 10 minutes, the assembly line will move faster and will produce one SUV every 50 minutes.
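The arithmetic behind that claim, as a quick sketch (Python; the numbers come straight from the analogy above):

stage_time_min = 60
suvs_per_day = 24 * 60 / stage_time_min   # 24.0 SUVs per day

stage_time_min = 50                       # each "hour" shortened by 10 minutes
suvs_per_day = 24 * 60 / stage_time_min   # 28.8 SUVs per day

Note that once the line is full, one SUV rolls off per stage time, regardless of how long any single SUV spends on the line.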

Pipelining and Branching
(i) Fetch instruction (FI): read the next expected instruction into a buffer.
(ii) Decode instruction (DI): determine the opcode and the operand specifiers.
(iii) Calculate operands (CO): calculate the effective address of each source operand. This may involve displacement, register indirect, indirect, or other forms of address calculation.
(iv) Fetch operands (FO): fetch each operand from memory. Operands in registers need not be fetched.
(v) Execute instruction (EI): perform the indicated operation and store the result, if any, in the specified destination operand location.
(vi) Write operand (WO): store the result in memory.

Figure 6: Timing diagram for the six-stage instruction pipeline

               Time (clock cycles) →
               1    2    3    4    5    6    7    8    9    10   11   12   13   14
Instruction 1  FI   DI   CO   FO   EI   WO
Instruction 2       FI   DI   CO   FO   EI   WO
Instruction 3            FI   DI   CO   FO   EI   WO
Instruction 4                 FI   DI   CO   FO   EI   WO
Instruction 5                      FI   DI   CO   FO   EI   WO
Instruction 6                           FI   DI   CO   FO   EI   WO
Instruction 7                                FI   DI   CO   FO   EI   WO
Instruction 8                                     FI   DI   CO   FO   EI   WO
Instruction 9                                          FI   DI   CO   FO   EI   WO
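Figure 6 is easy to reproduce programmatically; here is a sketch (Python) of the general rule it illustrates, that n instructions through a k-stage pipeline need k + (n - 1) cycles:

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def pipeline_cycles(n_instructions, n_stages=len(STAGES)):
    # the first instruction fills the pipe; each later one adds one cycle
    return n_stages + (n_instructions - 1)

print(pipeline_cycles(9))   # 14, matching Figure 6

for i in range(9):          # print the staircase of Figure 6
    print("Instruction %d: %s%s" % (i + 1, "     " * i, "   ".join(STAGES)))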

Branch penalty: instruction 3 is a conditional branch to instruction 15. Instructions 4-7 enter the pipeline behind it and are discarded when the branch is resolved as taken; that discarded work is the branch penalty.

               Time (clock cycles) →
               1    2    3    4    5    6    7    8    9    10   11   12   13   14
Instruction 1  FI   DI   CO   FO   EI   WO
Instruction 2       FI   DI   CO   FO   EI   WO
Instruction 3            FI   DI   CO   FO   EI   WO
Instruction 4                 FI   DI   CO   FO
Instruction 5                      FI   DI   CO
Instruction 6                           FI   DI
Instruction 7                                FI
Instruction 15                                    FI   DI   CO   FO   EI   WO
Instruction 16                                         FI   DI   CO   FO   EI   WO
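The cost of the flush can be folded into the cycle count from the previous sketch (assuming, as in the diagram above, a four-cycle penalty per taken branch in this six-stage pipeline):

def cycles_with_branches(n_useful, n_stages=6, n_taken=0, penalty=4):
    # useful instructions fill the pipe as before; each taken branch
    # adds 'penalty' cycles of discarded work
    return n_stages + (n_useful - 1) + n_taken * penalty

print(cycles_with_branches(9))              # 14: straight-line code, as in Figure 6
print(cycles_with_branches(5, n_taken=1))   # 14: instructions 1-3, 15, 16 above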

Superscalar architecture
The term superscalar describes CPUs that have multiple execution units and can complete more than one instruction during each clock cycle. The essence of the superscalar approach is the ability to execute instructions in different pipelines. The concept can be further exploited by allowing instructions to be executed in an order different from the program order.

Superscalar processing adds complexity to the processor's control unit, because it is now tasked not only with fetching and decoding instructions, but also with re-ordering the linear instruction stream so that some of its individual instructions can execute in parallel. Furthermore, once executed, the instructions must be put back in the order in which they were originally fetched, so that both the programmer and the rest of the system have no idea that the instructions weren't executed in their proper sequence. A sketch of this in-order retirement idea follows.
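Here is a toy sketch of that retirement machinery (Python; real reorder buffers track far more state, and all names here are illustrative): instructions may finish executing in any order, but results are committed strictly in fetch order.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()            # entries kept in original fetch order

    def issue(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        entry["done"] = True              # execution may finish out of order

    def retire(self):
        # commit only from the head, so results appear in program order
        while self.entries and self.entries[0]["done"]:
            print("retired", self.entries[0]["name"])
            self.entries.popleft()

rob = ReorderBuffer()
add, load = rob.issue("add"), rob.issue("load")
rob.complete(load)   # the load finishes first...
rob.retire()         # ...but nothing retires until the older add is done
rob.complete(add)
rob.retire()         # now retires "add", then "load", in program order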

Limitations to Superscalar Execution
The three categories of limitations are:
1. Resource conflicts: these occur if two or more instructions compete for the same resource (register, memory, functional unit) at the same time. By introducing several parallel pipelined units, superscalar architectures try to reduce some of the possible resource conflicts.
2. Control (procedural) dependency: the presence of branches creates major problems in assuring optimal parallelism. Superscalar techniques work most efficiently on RISC architectures, with their fixed instruction length and format.
3. Data conflicts: data conflicts are produced by data dependencies between instructions in the program. Because superscalar architectures provide great liberty in the order in which instructions can be issued and completed, data dependencies have to be considered with much attention; a minimal version of the check involved is sketched below.
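The most basic data conflict is a read-after-write dependency: a later instruction reads a location that an earlier one writes. A sketch of the check (Python; instructions are modelled as (destination, sources) pairs purely for illustration):

def raw_dependent(first, second):
    dest_first, _ = first
    _, srcs_second = second
    return dest_first in srcs_second   # second reads what first writes

i1 = ("A", ("B", "C"))         # A = B + C
i2 = ("D", ("A", "E"))         # D = A + E, which reads A
print(raw_dependent(i1, i2))   # True: these two cannot be issued together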

Register file

This interface consists of a data bus and two types of ports: the read ports and the write ports. In order to read a value from a single register in the register file, the ALU accesses the register file's read port and requests that the data from a specific register be placed on the special internal data bus that the register file shares with the ALU. Likewise, writing to the register file is done through the file's write port.
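A toy model of that interface (Python; the register count and method names are illustrative, not any particular machine's):

class RegisterFile:
    def __init__(self, n_regs=8):
        self.regs = [0] * n_regs

    def read_port(self, index):
        return self.regs[index]    # drive the register's value onto the data bus

    def write_port(self, index, value):
        self.regs[index] = value   # latch the value from the data bus

rf = RegisterFile()
rf.write_port(3, 42)
print(rf.read_port(3))   # 42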

Register Renaming
Consider the following sequence:

mov( 0, eax );
mov( eax, i );
mov( 50, eax );
mov( eax, j );

You can't execute the first two instructions in parallel, nor the third and fourth. Out-of-order execution in a superscalar CPU would normally allow the first and third instructions to execute concurrently, and then the second and fourth instructions could also execute concurrently. However, a data hazard, of sorts, also exists between the first and third instructions, since they use the same register; a different register could be used. The CPU could support an array of EAX registers, e.g. EAX[0], EAX[1], EAX[2], etc., and likewise EBX[0]..EBX[n], ECX[0]..ECX[n], etc. The CPU can automatically choose a different register array element if doing so would not change the overall computation and doing so could speed up the execution of the program, as sketched below.
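One way to picture the renaming step (a greatly simplified, purely hypothetical Python sketch): every write to an architectural register is directed to a fresh element of that register's array, and later reads use the most recent element.

RENAMEABLE = {"eax", "ebx", "ecx"}

def rename(instructions):
    current = {}                    # register -> active array element
    out = []
    for src, dest in instructions:  # mov( src, dest ), as in the example
        if src in RENAMEABLE:
            src = "%s[%d]" % (src, current.get(src, 0))
        if dest in RENAMEABLE:
            idx = current.get(dest, -1) + 1   # a write gets a fresh element
            current[dest] = idx
            dest = "%s[%d]" % (dest, idx)
        out.append("mov( %s, %s );" % (src, dest))
    return out

prog = [("0", "eax"), ("eax", "i"), ("50", "eax"), ("eax", "j")]
print("\n".join(rename(prog)))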

Running the sequence through that renaming gives:

mov( 0, eax[0] );
mov( eax[0], i );
mov( 50, eax[1] );
mov( eax[1], j );

Since EAX[0] and EAX[1] are different registers, the CPU can execute the first and third instructions concurrently. Likewise, the CPU can execute the second and fourth instructions concurrently.

Very Long Instruction Word Architecture
A very long instruction word (VLIW) allows the compiler or pre-processor to break program instructions down into basic operations that can be performed by the processor in parallel. These operations are put into a very long instruction word which the processor can then take apart without further analysis, handing each operation to an appropriate functional unit.

VLIW chips combine two or more instructions into a single bundle or packet. The compiler prearranges the bundles so the VLIW chip can quickly execute the instructions in parallel, freeing the microprocessor from having to perform the complex and continual runtime analysis that superscalar RISC and CISC chips must do. The compiler is responsible for arranging the instructions in the most efficient manner.
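To make the bundling idea concrete, here is a sketch of compile-time bundle formation (Python; the bundle width, the register syntax, and the deliberately crude textual dependence check are all illustrative assumptions):

BUNDLE_WIDTH = 3

def depends(a, b):
    # does op a read the register that op b writes (its first operand)?
    return b.split()[1].split(",")[0] in a.split()[1].split(",")[1:]

def bundle(ops):
    bundles, current = [], []
    for op in ops:
        # start a new bundle when full, or when op depends on one already in it
        if len(current) == BUNDLE_WIDTH or any(depends(op, o) for o in current):
            bundles.append(current + ["nop"] * (BUNDLE_WIDTH - len(current)))
            current = []
        current.append(op)
    if current:
        bundles.append(current + ["nop"] * (BUNDLE_WIDTH - len(current)))
    return bundles

ops = ["add r1,r2,r3", "mul r4,r5,r6", "sub r7,r1,r4"]
print(bundle(ops))
# [['add r1,r2,r3', 'mul r4,r5,r6', 'nop'], ['sub r7,r1,r4', 'nop', 'nop']]

The processor then executes each bundle's slots in parallel, one slot per functional unit, with no runtime dependence analysis.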

But even with the best compilers, there are limits to how much parallelism a VLIW processor can exploit. The principle works well in more specialized processors such as DSPs, which perform the same operations over and over again. A good RISC or CISC design might do just as well with the software that most users run.

Dealing with branches
One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to the initial stages of the pipeline, and one of the main obstacles is the conditional branch. A variety of approaches have been taken for dealing with conditional branches:
- Multiple streams
- Prefetch branch target
- Loop buffer
- Branch prediction
- Delayed branch

Multiple Streams
A simple pipeline suffers a penalty for a branch instruction because it must choose one of two instructions to fetch next and may make the wrong choice. A brute-force approach is to replicate the initial portions of the pipeline and allow the pipeline to fetch both instructions, making use of two streams. There are two problems with this approach:
(i) With multiple pipelines there are contention delays for access to the registers and to memory.
(ii) Additional branch instructions may enter the pipeline (in either stream) before the original branch decision is resolved. Each such instruction needs an additional stream.
Despite these drawbacks, this strategy can improve performance.

Prefetch Branch Target
When a conditional branch is recognised, the target of the branch is prefetched, in addition to the instruction following the branch. This target is then saved until the branch instruction is executed. If the branch is taken, the target has already been prefetched.

Loop Buffer
A loop buffer is a small, very-high-speed memory maintained by the instruction fetch stage of the pipeline and containing the n most recently fetched instructions, in sequence. If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer. The loop buffer has three benefits:
1. With the use of prefetching, the loop buffer will contain some instructions sequentially ahead of the current instruction fetch address. Thus, instructions fetched in sequence will be available without the usual memory access time.
2. If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target will already be in the buffer. This is useful for the rather common occurrence of IF-THEN and IF-THEN-ELSE sequences.
3. This strategy is particularly well suited to dealing with loops, or iterations; hence the name loop buffer. If the loop buffer is large enough to contain all the instructions in a loop, then those instructions need to be fetched from memory only once, for the first iteration. For subsequent iterations, all the needed instructions are already in the buffer.
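A toy loop buffer might look like the following (Python sketch; addresses stand in for instructions, the eviction policy is simplistic, and all names are illustrative):

class LoopBuffer:
    def __init__(self, size=8):
        self.size = size
        self.buffer = {}                   # fetch address -> instruction

    def fetched(self, address, instruction):
        self.buffer[address] = instruction
        if len(self.buffer) > self.size:   # evict the oldest entry
            self.buffer.pop(next(iter(self.buffer)))

    def lookup(self, target):
        # on a taken branch, check the buffer before going to memory
        return self.buffer.get(target)     # None means a memory fetch is needed

lb = LoopBuffer()
for addr in range(100, 108):               # a small loop body being fetched
    lb.fetched(addr, "insn@%d" % addr)
print(lb.lookup(102))   # hit: the target is served from the buffer
print(lb.lookup(50))    # miss: None, so fetch from memory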

Branch Prediction
Various techniques can be used to predict whether a branch will be taken. Among the more common are the following:
- Predict never taken
- Predict always taken
- Predict by opcode
- Taken/not taken switch
- Branch history table
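As one concrete illustration, the taken/not-taken switch is commonly built from a two-bit saturating counter per branch, so a single surprising outcome does not flip the prediction. A sketch (Python; the counter encoding and the per-address table are illustrative assumptions):

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}               # branch address -> counter in 0..3

    def predict(self, address):
        return self.counters.get(address, 1) >= 2   # True means "taken"

    def update(self, address, taken):
        c = self.counters.get(address, 1)
        self.counters[address] = min(c + 1, 3) if taken else max(c - 1, 0)

bp = TwoBitPredictor()
bp.update(0x40, True)
bp.update(0x40, True)
print(bp.predict(0x40))   # True: predicted taken after two taken outcomes
bp.update(0x40, False)
print(bp.predict(0x40))   # still True: one not-taken doesn't flip the prediction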

Figure 5: Basic 4-stage pipeline

The back end corresponds roughly to the ALU and registers in the programming model.

The front end roughly corresponds to the control and I/O units.

Pipelines do not always have four stages; four is about the minimum for a RISC machine. Note that the number of pipeline stages is referred to as the pipeline depth.

             Simple                Advanced
Front End    1. Fetch              1. Fetch1
                                   2. Fetch2
             2. Decode/Dispatch    3. Decode/Dispatch
                                   4. Issue
Back End     3. Execute            5. Execute
             4. Complete/Write     6. Complete
                                   7. Write

Table 1

The End