Principles of Pipelining
2 Principles of pipelining
The two major parametric considerations in designing a parallel computer architecture are:
– executing multiple instructions in parallel,
– increasing the efficiency of the processors.
There are various methods by which instructions can be executed in parallel:
– Pipelining is one of the classical and effective methods to increase parallelism, where different stages perform repeated functions on different operands.
– Vector processing is arithmetic or logical computation applied to vectors, whereas in scalar processing only one data item or a pair of data items is processed at a time.

3 Superscalar processing: improving the processor's speed by issuing multiple instructions per cycle is known as superscalar processing. Multithreading: a technique for increasing processor utilization, also used in parallel computer architecture.

4 OBJECTIVES
– Principles of linear pipelining
– Classification of pipeline processors
– Instruction and arithmetic pipelines
– Principles of designing pipeline processors
– Vector processing requirements

5 PARALLEL PROCESSING
Parallel processing is the execution of concurrent events in the computing process to achieve faster computational speed.
Levels of parallel processing:
– Job or program level
– Task or procedure level
– Inter-instruction level
– Intra-instruction level

6 PARALLEL COMPUTERS: Architectural Classification
Flynn's classification is based on the multiplicity of instruction streams and data streams:
– Instruction stream: the sequence of instructions read from memory.
– Data stream: the operations performed on the data in the processor.

                              Number of Data Streams
                              Single        Multiple
Instruction    Single         SISD          SIMD
Streams        Multiple       MISD          MIMD

7 COMPUTER ARCHITECTURES FOR PARALLEL PROCESSING
Von Neumann based:
– SISD: superscalar processors, superpipelined processors, VLIW
– MISD: nonexistent in practice
– SIMD: array processors, systolic arrays, associative processors
– MIMD: shared-memory multiprocessors (bus based, crossbar switch based, multistage interconnection network based); message-passing multicomputers (hypercube, mesh, reconfigurable)
Dataflow
Reduction

8 PIPELINING
Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
Example: compute Ai * Bi + Ci for i = 1, 2, 3, ..., 7
– Segment 1: R1 ← Ai, R2 ← Bi (load Ai and Bi from memory)
– Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and load Ci)
– Segment 3: R5 ← R3 + R4 (add)
The hardware consists of registers R1 and R2 feeding a multiplier whose output is latched in R3; R3 and R4 feed an adder whose output is latched in R5.

9 OPERATIONS IN EACH PIPELINE STAGE

Clock   Segment 1       Segment 2           Segment 3
Pulse   R1      R2      R3          R4      R5
1       A1      B1
2       A2      B2      A1*B1       C1
3       A3      B3      A2*B2       C2      A1*B1 + C1
4       A4      B4      A3*B3       C3      A2*B2 + C2
5       A5      B5      A4*B4       C4      A3*B3 + C3
6       A6      B6      A5*B5       C5      A4*B4 + C4
7       A7      B7      A6*B6       C6      A5*B5 + C5
8                       A7*B7       C7      A6*B6 + C6
9                                           A7*B7 + C7
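The table can be reproduced with a short simulation. The following Python sketch (not part of the original slides; the operand names are symbolic) prints which value each pipeline register holds at every clock pulse:

```python
# Minimal sketch (not from the slides): print which symbolic value each
# pipeline register holds at every clock pulse, reproducing the table above.
n, k = 7, 3                                   # 7 tasks, 3 segments
print(f"{'Pulse':>5} {'R1':>3} {'R2':>3} {'R3':>6} {'R4':>3} {'R5':>9}")
for t in range(1, n + k):                     # k + n - 1 = 9 clock pulses
    i1, i2, i3 = t, t - 1, t - 2              # task number in segments 1, 2, 3
    r1 = f"A{i1}" if 1 <= i1 <= n else ""
    r2 = f"B{i1}" if 1 <= i1 <= n else ""
    r3 = f"A{i2}*B{i2}" if 1 <= i2 <= n else ""
    r4 = f"C{i2}" if 1 <= i2 <= n else ""
    r5 = f"A{i3}*B{i3}+C{i3}" if 1 <= i3 <= n else ""
    print(f"{t:>5} {r1:>3} {r2:>3} {r3:>6} {r4:>3} {r5:>9}")
```

Note that the pipeline needs k + n - 1 = 3 + 7 - 1 = 9 clock pulses to complete all seven tasks; this count is the basis of the speedup formula derived later.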

10 GENERAL PIPELINE
General structure of a 4-segment pipeline: each segment Si is followed by a staging register Ri, and all registers are driven by a common clock.
Space-time diagram: tasks T1 through T6 flow through segments 1 to 4; task Ti occupies segment s during clock cycle i + s - 1, so the six tasks complete in 9 clock cycles.

11 PIPELINE PROCESSING
Pipelining is a method to realize overlapped parallelism in the proposed solution of a problem on a digital computer in an economical way. To introduce pipelining in a processor P, the following steps must be followed:
– Subdivide the input process into a sequence of subtasks. These subtasks will make up the stages of the pipeline, which are also known as segments.
– Each stage Si of the pipeline performs some operation on a distinct set of operands according to its subtask.
– When stage Si has completed its operation, the results are passed to the next stage Si+1 for the next operation.
– Stage Si receives a new set of inputs from the previous stage Si-1.

12 Parallelism in a pipelined processor is achieved such that m independent operations can be performed simultaneously in the m segments.

13 Pipeline processor
A pipeline processor can be defined as a processor that consists of a sequence of processing circuits, called segments, through which a stream of operands (data) is passed. In each segment, partial processing of the data stream is performed, and the final output is received when the stream has passed through the whole pipeline. An operation that can be decomposed into a sequence of well-defined subtasks is realized through the pipelining concept.

14 Classification of Pipeline Processors
– Level of processing
– Pipeline configuration
– Type of instruction and data

15 Classification according to level of processing
– Instruction pipeline
– Arithmetic pipeline

16 Instruction Pipeline
An instruction cycle may consist of many operations, such as fetching the opcode, decoding the opcode, computing operand addresses, fetching operands, and executing the instruction. These operations of the instruction execution cycle can be realized through the pipelining concept: each of them forms one stage of a pipeline. Overlapping the execution of these operations through the pipeline provides a speedup over normal sequential execution. Thus, pipelines used for instruction cycle operations are known as instruction pipelines.

17 Instruction Pipelines
The stream of instructions in the instruction execution cycle can be realized through a pipeline in which the different operations execute in an overlapped fashion. Executing an instruction involves the following major steps:
– Fetch the instruction from main memory
– Decode the instruction
– Fetch the operand
– Execute the decoded instruction


21 INSTRUCTION CYCLE
Six phases* in an instruction cycle:
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
* Some instructions skip some phases; effective address calculation can be done as part of the decoding phase; storing the operation result into a register is done automatically in the execution phase.
==> 4-stage pipeline:
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
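As a quick illustration (my own sketch, not from the slides), the overlapped schedule of the four stages can be printed programmatically:

```python
# Illustrative sketch (not from the slides): print the space-time diagram
# for n instructions flowing through the four stages FI, DA, FO, EX.
STAGES = ["FI", "DA", "FO", "EX"]
n = 3                                   # instructions i, i+1, i+2
for i in range(n):
    row = ["  "] * (len(STAGES) + n - 1)
    for s, name in enumerate(STAGES):
        row[i + s] = name               # instruction i occupies stage s in cycle i + s + 1
    print(f"i+{i}: " + " ".join(row))
```

Each instruction enters FI one cycle after its predecessor, so the three instructions finish in 4 + 3 - 1 = 6 cycles instead of the 12 a sequential machine would need.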

22 INSTRUCTION PIPELINE
Execution of three instructions in a 4-stage pipeline.
Conventional (sequential) execution: instruction i passes through all four stages (FI DA FO EX) before instruction i+1 begins.
Pipelined execution:
i   : FI DA FO EX
i+1 :    FI DA FO EX
i+2 :       FI DA FO EX

23 INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE
Space-time diagram: instructions 1 through 7 flow through the FI, DA, FO, EX stages over clock cycles 1 to 13.
Per-instruction step flowchart:
– Segment 1 (FI): fetch the instruction from memory; update the PC.
– Segment 2 (DA): decode the instruction and calculate the effective address. Branch? If yes, empty the pipe.
– Segment 3 (FO): fetch the operand from memory.
– Segment 4 (EX): execute the instruction. Interrupt? If yes, perform interrupt handling and empty the pipe; if no, continue with the next instruction.

24 Instruction buffers
To take full advantage of pipelining, the pipeline should be filled continuously, so the instruction fetch rate must match the pipeline's consumption rate. To do this, instruction buffers are used. An instruction buffer is high-speed memory in the CPU for storing instructions; instructions are prefetched into the buffer from main memory. An alternative to the instruction buffer is a cache memory placed between the CPU and main memory. The advantage of a cache is that it can be used for both instructions and data, but it requires more complex control logic than an instruction buffer.

25 Arithmetic Pipeline
Complex arithmetic operations such as multiplication and floating-point operations consume much of the ALU's time. These operations can also be pipelined by segmenting the operations of the ALU, and as a consequence high-speed performance may be achieved. Thus, pipelines used for arithmetic operations are known as arithmetic pipelines.

26 Arithmetic Pipelines
The technique of pipelining can be applied to various complex and slow arithmetic operations to speed up processing. Arithmetic pipelines are constructed for simple fixed-point and complex floating-point arithmetic operations. These operations are well suited to pipelining because they can be efficiently partitioned into subtasks for the pipeline stages. For implementing arithmetic pipelines we generally use the following two types of adders:

27
– Carry-propagate adder (CPA): adds two numbers such that carries generated in successive digits are propagated.
– Carry-save adder (CSA): adds three numbers such that the carries generated are not propagated but are instead saved in a carry vector.
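A minimal bitwise sketch of the two adder types on Python integers (the function names are my own); the CSA identity being illustrated is x + y + z == s + c:

```python
# Hedged sketch of the two adder types (names are my own, not from the slides).
def cpa(x, y):
    """Carry-propagate adder: ordinary addition; carries ripple through."""
    return x + y

def csa(x, y, z):
    """Carry-save adder: reduce three numbers to a sum vector s and a
    carry vector c without propagating any carries."""
    s = x ^ y ^ z                                  # per-bit sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1         # saved carries, shifted left
    return s, c

s, c = csa(0b1011, 0b0110, 0b0101)
assert cpa(s, c) == 0b1011 + 0b0110 + 0b0101       # 8 + 14 == 22
```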

28 Fixed-point Arithmetic Pipelines
Example: multiplication of fixed-point numbers.
– Two fixed-point numbers are multiplied by the ALU using repeated add and shift operations.
– This sequential execution makes multiplication a slow process.
– Multiplication is the process of adding multiple copies of shifted multiplicands, as described on the next slide.


30
– The first stage generates the partial products of the numbers, which form the six rows of shifted multiplicands.
– In the second stage, the six numbers are fed to two CSAs, merging them into four numbers.
– In the third stage, a single CSA merges three of the four numbers, leaving three numbers.
– In the fourth stage, a single CSA merges the three numbers into two numbers.
– In the fifth stage, the last two numbers are added through a CPA to get the final product.
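The five-stage reduction can be checked numerically. Below is a hedged sketch for a 6-bit multiply (the encoding is my own; csa() is repeated from the earlier snippet so this one runs standalone):

```python
# Hedged sketch of the five-stage reduction for a 6-bit multiply.
def csa(x, y, z):
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

a, b = 0b101101, 0b110011                        # multiplicand, multiplier
# Stage 1: six shifted multiplicands, one row per multiplier bit.
rows = [(a << i) if (b >> i) & 1 else 0 for i in range(6)]
# Stage 2: two CSAs merge the six numbers into four.
s0, c0 = csa(*rows[0:3])
s1, c1 = csa(*rows[3:6])
# Stage 3: one CSA takes three of the four numbers, leaving three.
s2, c2 = csa(s0, c0, s1)
# Stage 4: one CSA merges the remaining three numbers into two.
s3, c3 = csa(s2, c2, c1)
# Stage 5: a carry-propagate addition produces the final product.
assert s3 + c3 == a * b
print(bin(s3 + c3))
```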


32 Floating-point Arithmetic Pipelines
Floating-point computations are the best candidates for pipelining.
Example: addition of two floating-point numbers. The following stages are identified:
– The first stage compares the exponents of the two numbers.
– The second stage aligns the mantissas.
– In the third stage, the mantissas are added.
– In the last stage, the result is normalized.


34 ARITHMETIC PIPELINE
Floating-point adder for X = A x 2^a and Y = B x 2^b:
– Segment 1: compare the exponents a and b by subtraction; choose the larger exponent.
– Segment 2: align the mantissa of the operand with the smaller exponent by the exponent difference.
– Segment 3: add or subtract the mantissas.
– Segment 4: normalize the result and adjust the exponent.
Each segment's outputs are latched in registers (R) before being passed to the next segment.
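To make the four segments concrete, here is a toy decimal sketch (my own representation: a value is mant x 10^exp with a fixed number of mantissa digits; real hardware works in binary and handles rounding):

```python
# Toy decimal sketch of the four-segment floating-point adder (my own format).
def fp_add(a_mant, a_exp, b_mant, b_exp, digits=4):
    # Segment 1: compare the exponents by subtraction; choose the larger.
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)
    # Segment 2: align the mantissa of the smaller-exponent operand.
    if diff >= 0:
        b_mant //= 10 ** diff          # truncating shift right
    else:
        a_mant //= 10 ** (-diff)
    # Segment 3: add the mantissas.
    mant = a_mant + b_mant
    # Segment 4: normalize the result and adjust the exponent.
    while abs(mant) >= 10 ** digits:
        mant //= 10
        exp += 1
    while mant and abs(mant) < 10 ** (digits - 1):
        mant *= 10
        exp -= 1
    return mant, exp

print(fp_add(9504, 3, 8200, 2))        # (1032, 4): 9504e3 + 8200e2 ~= 1032e4
```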

35 Classification according to pipeline configuration
– Unifunction pipelines: when a fixed, dedicated function is performed through a pipeline, it is called a unifunction pipeline.
– Multifunction pipelines: when different functions are performed through the pipeline at different times, it is known as a multifunction pipeline. Multifunction pipelines are reconfigurable at different times according to the operation being performed.

36 Classification according to type of instruction and data
– Scalar pipelines: process scalar operands of repeated scalar instructions.
– Vector pipelines: process vector instructions over vector operands.

37 Performance and Issues in Pipelining
Speedup: how much performance improvement we get through pipelining.
– n: number of tasks to be performed
Conventional (non-pipelined) machine:
– tn: clock cycle
– t1: time required to complete the n tasks; t1 = n * tn
Pipelined machine (k stages):
– tp: clock cycle (time to complete each suboperation)
– tk: time required to complete the n tasks; tk = (k + n - 1) * tp
Speedup:
– Sk = n*tn / [(k + n - 1)*tp]
– As n -> infinity, Sk -> tn/tp (= k, if tn = k*tp)

38 PIPELINE AND MULTIPLE FUNCTION UNITS
Example:
– 4-stage pipeline; the suboperation in each stage takes tp = 20 ns
– 100 tasks to be executed
– 1 task in the non-pipelined system takes 4*20 = 80 ns
Non-pipelined system: n*k*tp = 100 * 80 = 8000 ns
Pipelined system: (k + n - 1)*tp = (4 + 99) * 20 = 2060 ns
Speedup: Sk = 8000 / 2060 = 3.88
A 4-stage pipeline is basically identical to a system with 4 identical function units working in parallel.

39 Efficiency: the efficiency of a pipeline can be measured as the ratio of the busy time span to the total time span, including idle time. Let c be the clock period of an m-stage pipeline executing n tasks; the efficiency E is
E = (n.m.c) / (m.[m.c + (n-1).c]) = n / (m + n - 1)
As n -> infinity, E approaches 1.

40 Throughput: the throughput of a pipeline is the number of results delivered per unit time. It can be written as
T = (n / [m + (n-1)]) / c = E / c
Throughput denotes the computing power of the pipeline. The maximum speedup, efficiency and throughput above are ideal-case values.
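These three formulas can be checked against the numbers on slide 38. A small helper (the function name is my own, not from the slides):

```python
# Quick check of the speedup, efficiency and throughput formulas using the
# slide-38 numbers: k = 4 stages, tp = 20 ns, n = 100 tasks.
def pipeline_metrics(n, k, tp):
    tn = k * tp                         # non-pipelined time per task (tn = k*tp)
    speedup = (n * tn) / ((k + n - 1) * tp)
    efficiency = n / (k + n - 1)        # approaches 1 as n grows
    throughput = efficiency / tp        # results per second
    return speedup, efficiency, throughput

s, e, t = pipeline_metrics(n=100, k=4, tp=20e-9)
print(f"Sk = {s:.2f}, E = {e:.3f}, T = {t/1e6:.1f} million results/s")
# Sk = 3.88, E = 0.971, T = 48.5 million results/s
```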

41 Limitations to speedup
Data dependency between successive tasks: there may be dependencies between the instructions of two tasks in the pipeline. For example:
– One instruction cannot start until the previous instruction returns its result, because the two are interdependent.
– Another instance of data dependency occurs when both instructions try to modify the same data object.
These are called data hazards.

42 Resource constraints: when resources are not available at the time of execution, delays are caused in the pipeline. For example:
1) If one common memory is used for both data and instructions, and an operand read/write and an instruction fetch are needed at the same time, only one can proceed and the other has to wait.
2) A limited resource such as the execution unit may be busy at the required time.

43 Branch instructions and interrupts in the program: a program is not a straight flow of sequential instructions. There may be branch instructions that alter the normal flow of the program, which delays pipelined execution and affects performance. Similarly, interrupts postpone the execution of the next instruction until the interrupt has been serviced. Branches and interrupts both have damaging effects on pipelining.

44 PRINCIPLES OF DESIGNING PIPELINE PROCESSORS

45 CONTENTS
– Instruction prefetch and branch handling
– Data buffering and bussing structures
– Internal forwarding and register tagging
– Hazard detection and resolution

46 INSTRUCTION PREFETCH AND BRANCH HANDLING
For designing pipelined instruction units: interrupts and branches produce damaging effects on the performance of pipelined computers. A conditional branch operation has two possible paths: 1) the yes (taken) path, and 2) the no (not-taken) path.

47 Five segments of an instruction pipeline: Fetch Instruction -> Decode -> Fetch Operands -> Execute -> Store Results

48 Overlapped execution of instructions I1 through I8 without branching (timing diagram over clock cycles 0 to 22).

49 Effect of branching on the performance of the instruction pipeline (timing diagram for instructions I1 through I8 over clock cycles 0 to 22).

50 Timing Diagram for Instruction Pipeline Operation

51 The Effect of a Conditional Branch on Instruction Pipeline Operation (instruction 3 is a conditional branch to instruction 15).

52 Alternative pipeline depiction of the same case (instruction 3 is a conditional branch to instruction 15).

53 Instruction Prefetching Strategy
Instruction words ahead of the one currently being decoded are fetched from memory before the instruction decoding unit requests them. Two prefetch buffers are used:
– Sequential prefetch buffer (size s): holds instructions fetched during a sequential run of the program. When a branch is successful, the contents of this buffer are invalidated.
– Target prefetch buffer (size r): holds instructions fetched from the target of a conditional branch. When the conditional branch is unsuccessful, the contents of this buffer are invalidated.

54
Unconditional branch (jump):
– The instruction word at the branch target is requested immediately by the decoder, and decoding ceases until the target instruction returns from memory.
Conditional branch:
– Sequential prefetching is suspended.
– Instructions are prefetched from the target memory address of the conditional branch instruction.
– If the branch is successful, the target instruction stream becomes the sequential stream.
Instruction prefetching reduces the damaging effect of branching, as the sketch below illustrates.
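An illustrative Python sketch of the two-buffer policy (the buffer sizes, names and the demo program are my own assumptions, not from the slides):

```python
# Illustrative sketch of sequential/target prefetch buffers and the
# invalidation that happens when a conditional branch resolves.
from collections import deque

sequential_buf = deque(maxlen=8)       # s-word sequential prefetch buffer
target_buf = deque(maxlen=8)           # r-word target prefetch buffer

def prefetch(memory, pc, branch_target=None):
    sequential_buf.append(memory[pc])              # sequential stream
    if branch_target is not None:                  # conditional branch seen:
        target_buf.append(memory[branch_target])   # also prefetch its target

def resolve_branch(taken):
    if taken:
        sequential_buf.clear()                     # sequential stream invalidated
        sequential_buf.extend(target_buf)          # target stream becomes sequential
    target_buf.clear()                             # target buffer emptied either way

memory = [f"inst{i}" for i in range(32)]
prefetch(memory, pc=4, branch_target=20)
resolve_branch(taken=True)
print(list(sequential_buf))                        # ['inst20']
```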

55 An instruction pipeline with both sequential and target prefetch buffers: the memory system (access time T) fills a sequential prefetch buffer (s words) and a target prefetch buffer (t words); both feed the decoder (r time units), which is followed by the execution pipeline stages 1, 2, ..., E.

56 DATA BUFFERING AND BUSING STRUCTURES
The processing speeds of pipeline segments are usually unequal. The throughput of the pipeline is limited by the slowest segment, the bottleneck; throughput is inversely proportional to the bottleneck's delay. It is therefore desirable to remove the bottleneck, which causes unnecessary congestion.

57 Example: segments S1, S2, S3 with delays T1, T2, T3, where T1 = T3 = T and T2 = 3T. Segment 2 is the bottleneck.

58 Subdivision of segment 2: subdivide the bottleneck into divisions of S2, for example into three stages of delay T each (stage delays T, T, T, T, T) or into two stages of delays T and 2T (stage delays T, T, 2T, T).

59 Replication of segment 2: if the bottleneck is not subdivisible, use duplicates of the bottleneck segment in parallel to smooth the congestion (copies of S2, each of delay 3T, operating in parallel between S1 and S3).
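A back-of-the-envelope sketch of why both fixes work (T is an arbitrary time unit; the helper name is my own):

```python
# The clock period of a linear pipeline is set by its slowest segment,
# so fixing the bottleneck restores one-result-per-T throughput.
T = 1.0                                     # arbitrary time unit

def cycle_time(stage_delays):
    return max(stage_delays)                # clock must fit the bottleneck

print(cycle_time([T, 3 * T, T]))            # 3T: one result every 3T
print(cycle_time([T, T, T, T, T]))          # S2 subdivided: one result every T
# If S2 cannot be subdivided, three parallel copies of it each accept every
# third operand, so the ensemble still averages one result per T.
```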

60 Data and Instruction Buffers
One way to smooth the traffic flow in a pipeline is to use buffers to close the speed gap between memory accesses for instructions or operands. Buffering can avoid unnecessary idling of the processing stages caused by memory-access conflicts or by unexpected branching or interrupts.

61 Busing Structures
Ideally, the subfunction being executed by one stage should be independent of the subfunctions being executed by the remaining stages; otherwise, some processes in the pipeline must be halted until the dependency is removed. Such halts cause additional time delays, so an efficient internal busing structure is desired to route results to the requesting stations with minimum delay.

62 Internal Forwarding and Register Tagging
Internal forwarding refers to a "short circuit" technique for replacing unnecessary memory accesses with register-to-register transfers in a sequence of fetch-arithmetic-store operations. Register tagging refers to the use of tagged registers, buffers, and reservation stations for exploiting concurrent activities among multiple arithmetic units.

63 Internal Forwarding Examples
a) Store-fetch forwarding:
   Mi ← R1 (store); R2 ← Mi (fetch): 2 memory accesses
   becomes: Mi ← R1 (store); R2 ← R1 (register transfer): 1 memory access
b) Fetch-fetch forwarding:
   R1 ← Mi (fetch); R2 ← Mi (fetch): 2 memory accesses
   becomes: R1 ← Mi (fetch); R2 ← R1 (register transfer): 1 memory access
c) Store-store forwarding:
   Mi ← R1 (store); Mi ← R2 (store): 2 memory accesses
   becomes: Mi ← R2 (store): 1 memory access
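The three rewrites can be expressed as a tiny peephole pass. This Python sketch (my own encoding of the register-transfer statements as (op, dst, src) tuples) applies each rule to adjacent pairs:

```python
# Tiny peephole sketch of the three forwarding rewrites above.
def forward(prog):
    out = []
    for op in prog:
        if out:
            prev = out[-1]
            # a) store-fetch: Mi <- R1 ; R2 <- Mi   =>   R2 <- R1
            if prev[0] == "store" and op[0] == "fetch" and op[2] == prev[1]:
                out.append(("move", op[1], prev[2])); continue
            # b) fetch-fetch: R1 <- Mi ; R2 <- Mi   =>   R2 <- R1
            if prev[0] == "fetch" and op[0] == "fetch" and op[2] == prev[2]:
                out.append(("move", op[1], prev[1])); continue
            # c) store-store: Mi <- R1 ; Mi <- R2   =>   keep only Mi <- R2
            if prev[0] == "store" and op[0] == "store" and op[1] == prev[1]:
                out[-1] = op; continue
        out.append(op)
    return out

print(forward([("store", "Mi", "R1"), ("fetch", "R2", "Mi")]))
# [('store', 'Mi', 'R1'), ('move', 'R2', 'R1')]  one memory access saved
```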

64 HAZARD DETECTION AND RESOLUTION
Pipeline hazards are caused by resource-usage conflicts among the various instructions in the pipeline. Such hazards are triggered by inter-instruction dependencies. There are three classes of data-dependency hazards, according to the data-update patterns:
1) write after read (WAR)
2) read after write (RAW)
3) write after write (WAW)

65 Continued
Hazard detection can be done in the instruction-fetch stage of a pipeline processor by comparing the domain (registers read) and range (registers written) of an incoming instruction with those of the instructions being processed in the pipe. A warning signal can then be generated to prevent the hazard from taking place.
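A sketch of that domain/range comparison (the set encoding and the function name are my own assumptions):

```python
# Domain/range check: domain = registers read, range = registers written.
def classify_hazards(earlier, later):
    """Data hazards between two instructions in program order."""
    hazards = []
    if earlier["writes"] & later["reads"]:
        hazards.append("RAW")          # read after write
    if earlier["reads"] & later["writes"]:
        hazards.append("WAR")          # write after read
    if earlier["writes"] & later["writes"]:
        hazards.append("WAW")          # write after write
    return hazards

add = {"reads": {"R2", "R3"}, "writes": {"R1"}}    # ADD R1, R2, R3
sub = {"reads": {"R1", "R5"}, "writes": {"R4"}}    # SUB R4, R1, R5
print(classify_hazards(add, sub))                  # ['RAW'] -> stall or forward
```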

66 MAJOR HAZARDS IN PIPELINED EXECUTION
Structural hazards (resource conflicts): the hardware resources required by instructions in simultaneous overlapped execution cannot all be met.
Data hazards (data dependency conflicts): an instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which is not yet available. Example: R1 <- B + C followed by R1 <- R1 + 1; the increment stalls (a bubble) on the data dependency on R1.
Control hazards: branches and other instructions that change the PC delay the fetch of the next instruction. Example: a JMP introduces a branch-address dependency, inserting a bubble before the next instruction's fetch.
Hazards in pipelines may make it necessary to stall the pipeline. Pipeline interlock: detect hazards and stall until the hazard is cleared.

67 STRUCTURAL HAZARDS
Structural hazards occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle, so the pipeline is stalled for a structural hazard: while one instruction's FO stage occupies the memory port, a later instruction's FI must wait one cycle. A two-port memory would serve both accesses without a stall.

68 DATA HAZARDS
Data hazards occur when the execution of an instruction depends on the results of a previous instruction:
  ADD R1, R2, R3
  SUB R4, R1, R5
Data hazards can be dealt with by either hardware or software techniques.
Hardware techniques:
– Interlock: hardware detects the data dependencies and delays the scheduling of the dependent instruction by stalling enough clock cycles.
– Forwarding (bypassing, short-circuiting): a data path routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows the value produced to be used at an earlier stage in the pipeline than would otherwise be possible.
Software technique:
– Instruction scheduling by the compiler for delayed load.

69 FORWARDING HARDWARE
Example: ADD R1, R2, R3 followed by SUB R4, R1, R5, on a 3-stage pipeline:
– I: instruction fetch
– A: decode, read registers, ALU operations
– E: write the result to the destination register
Without bypassing, SUB's A stage must wait until ADD's E stage has written R1 into the register file. With bypassing, a multiplexer selects the ALU result buffer instead of the register file output, routing ADD's result directly into SUB's ALU operation so that SUB proceeds without a stall; the result is still written back over the result write bus.

70 INSTRUCTION SCHEDULING
Compiling a = b + c; d = e - f; with a delayed load (a load requiring that the following instruction not use its result):

Unscheduled code:       Scheduled code:
  LW  Rb, b               LW  Rb, b
  LW  Rc, c               LW  Rc, c
  ADD Ra, Rb, Rc          LW  Re, e
  SW  a, Ra               ADD Ra, Rb, Rc
  LW  Re, e               LW  Rf, f
  LW  Rf, f               SW  a, Ra
  SUB Rd, Re, Rf          SUB Rd, Re, Rf
  SW  d, Rd               SW  d, Rd

In the unscheduled code, ADD uses Rc immediately after its load and SUB uses Rf immediately after its load; the scheduled code moves independent instructions into those delay slots.

71 CONTROL HAZARDS
Branch instructions: the branch target address is not known until the branch instruction completes. The branch passes through FI, DA, FO, EX, and the target address becomes available only after EX, so the fetch of the next instruction stalls and cycle times are wasted.
Dealing with control hazards:
* Prefetch target instruction
* Branch target buffer
* Loop buffer
* Branch prediction
* Delayed branch

72 CONTROL HAZARDS
Prefetch target instruction:
– Fetch instructions in both streams, branch not taken and branch taken.
– Both are saved until the branch is executed; then the right instruction stream is selected and the wrong stream is discarded.
Branch target buffer (BTB; associative memory):
– Entry: the address of a previously executed branch, with the target instruction and the next few instructions.
– When fetching an instruction, search the BTB. If found, fetch the instruction stream in the BTB; if not, fetch a new stream and update the BTB.
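A minimal BTB sketch, with a dict standing in for the associative memory (the names and the 3-word target cache are my own assumptions):

```python
# Minimal branch-target-buffer sketch.
btb = {}   # branch address -> (target address, cached target instructions)

def predict_next(pc):
    """On instruction fetch, search the BTB; a hit predicts the target."""
    return btb[pc][0] if pc in btb else pc + 1     # miss: fetch sequentially

def on_branch_taken(pc, target, memory):
    """After a taken branch executes, remember its target stream."""
    btb[pc] = (target, memory[target:target + 3])

memory = [f"inst{i}" for i in range(64)]
on_branch_taken(pc=10, target=40, memory=memory)
print(predict_next(10))   # 40: next fetch streams from the buffered target
print(predict_next(11))   # 12: BTB miss, continue sequentially
```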

73
Loop buffer (high-speed register file):
– Stores an entire loop, allowing it to execute without accessing memory.
Branch prediction:
– Guess the branch condition and fetch an instruction stream based on the guess. A correct guess eliminates the branch penalty.
Delayed branch:
– The compiler detects the branch and rearranges the instruction sequence by inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.

74 DELAYED LOAD
Three-segment pipeline (I: instruction fetch; A: decode, read registers, ALU operations; E: write the result) for the sequence:
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3

Pipeline timing with data conflict (clock cycles 1 to 6):
Load R1:   I A E
Load R2:     I A E
Add R1+R2:     I A E    (reads R2 before the load has written it)
Store R3:        I A E

Pipeline timing with delayed load (clock cycles 1 to 7):
Load R1:   I A E
Load R2:     I A E
NOP:           I A E
Add R1+R2:       I A E
Store R3:          I A E

The data dependency is taken care of by the compiler rather than the hardware.

75 DELAYED BRANCH
The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps, either by using no-operation instructions or by rearranging existing instructions.

76 Pipeline Throughput: the average number of task initiations per clock cycle.

77 Dynamic Pipelines and Reconfigurability
A dynamic pipeline may initiate tasks from different reservation tables simultaneously, allowing multiple initiations of different functions in the same pipeline.

78 It is assumed that any computation step can be delayed by inserting a non-compute stage. Pipelines with perfect initiation cycles can be utilized better than those with imperfect initiation cycles.

79 Reconfigurability: reconfigurable pipelines with different function types are more desirable. Such an approach requires extensive resource sharing among the different functions; to achieve this, a more complicated pipeline-segment structure and more elaborate interconnection control are needed. A bypass technique can be used to avoid unwanted stages, but this may cause a collision when one instruction, as a result of bypassing, attempts to use an operand fetched for a preceding instruction.

80 UNIVERSITY Question Bank
1. Discuss the key design problems of a pipeline processor.
2. Discuss various instruction prefetch and branch control strategies and their effect on the performance of a pipeline processor.
3. Explain the internal forwarding and register tagging techniques.
4. Explain the causes, detection, avoidance and resolution of pipeline hazards.
5. For a pipeline processor system, explain: i) instruction prefetching, ii) data dependency hazards.
6. What are the factors affecting the performance of pipeline computers?
7. What are the different hazards in a pipeline processor? How are they detected and resolved?
