Chapter One Introduction to Pipelined Processors.


1 Chapter One Introduction to Pipelined Processors

2 Superscalar Processors

3 Scalar processors issue one instruction per cycle. In a superscalar processor, multiple instruction pipelines are used. Purpose: to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel.

4 Superscalar Processors The fundamental structure (m=3) is as follows:

5 Superscalar Processors Here, the instruction-decoding and execution resources are increased. Example: a dual-pipeline superscalar processor.

6 Superscalar Processor - Example

7 It can issue two instructions per cycle. There are two pipelines, each with four processing stages: fetch, decode, execute and store. The two instruction streams are fetched from a single I-cache. Assume each stage requires one cycle, except the execution stage.

8 Superscalar Processor - Example The four functional units of the execution stage are shared by the two pipelines on a dynamic basis. A look-ahead window enables out-of-order instruction issue.

    Functional Unit    Number of stages
    Adder              2
    Multiplier         3
    Logic              1
    Load               1

9 Superscalar Performance The time required by the scalar base machine is T(1,1) = k + N - 1. The ideal execution time required by an m-issue superscalar machine is T(m,1) = k + (N - m)/m, where k is the time required to execute the first m instructions (through the k pipeline stages) and (N - m)/m is the time required to execute the remaining (N - m) instructions, m per cycle.

10 Superscalar Performance The ideal speedup of the superscalar machine is S(m,1) = T(1,1)/T(m,1) = ?

11 Superscalar Performance The ideal speedup of the superscalar machine is S(m,1) = T(1,1)/T(m,1) = (k + N - 1)/(k + (N - m)/m) = m(N + k - 1)/(N + m(k - 1)). As N → ∞, the speedup S(m,1) = ?

12 Superscalar Performance The ideal speedup of the superscalar machine is S(m,1) = m(N + k - 1)/(N + m(k - 1)). As N → ∞, the speedup S(m,1) → m.
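The limit can be checked numerically. A minimal sketch (plain Python; the stage count k = 4 and issue width m = 3 are illustrative values, not from the slides):

```python
def t_scalar(k, n_instr):
    # Base scalar pipeline: k cycles to fill, then one instruction per cycle
    return k + n_instr - 1

def t_superscalar(k, n_instr, m):
    # Ideal m-issue machine: first m instructions finish after k cycles,
    # the remaining (n_instr - m) issue m per cycle
    return k + (n_instr - m) / m

def speedup(k, n_instr, m):
    return t_scalar(k, n_instr) / t_superscalar(k, n_instr, m)

# Speedup approaches m = 3 as the instruction count grows
for n_instr in (10, 100, 10_000):
    print(n_instr, round(speedup(4, n_instr, 3), 3))
```

For short programs the pipeline fill time k dominates and the speedup stays well below m; the ideal factor m is only reached asymptotically.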

13 Superpipelined Processors In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.

14 Superpipelined Performance The time to execute N instructions on a superpipelined machine of degree n with k stages is T(1,n) = k + (N - 1)/n (in base cycles). The speedup over the base machine is S(1,n) = (k + N - 1)/(k + (N - 1)/n) = n(k + N - 1)/(nk + N - 1). As N → ∞, S(1,n) → n.

15 Superpipelined Superscalar Processors This machine executes m instructions every cycle, with a pipeline cycle time that is 1/n of the base cycle.

16 Superpipelined Superscalar Performance The time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m,n) is T(m,n) = k + (N - m)/(mn). The speedup over the base machine is S(m,n) = (k + N - 1)/(k + (N - m)/(mn)) = mn(k + N - 1)/(mnk + N - m). As N → ∞, S(m,n) → mn.
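Since T(m,n) = k + (N - m)/(mn) reduces to T(m,1) for n = 1 and to T(1,n) for m = 1, one function covers all three machines. A minimal sketch (plain Python; k = 4, N = 1,000,000, m = 3 and n = 2 are illustrative values, not from the slides):

```python
def exec_time(k, n_instr, m=1, n=1):
    # T(m,n) = k + (N - m)/(m*n): the first m instructions complete after
    # k base cycles; the rest issue m per cycle, n minor cycles per base cycle
    return k + (n_instr - m) / (m * n)

def speedup(k, n_instr, m, n):
    # Speedup over the scalar base machine, T(1,1) = k + N - 1
    return exec_time(k, n_instr) / exec_time(k, n_instr, m, n)

k, N = 4, 1_000_000
print(round(speedup(k, N, 3, 1), 3))  # superscalar only: approaches m
print(round(speedup(k, N, 1, 2), 3))  # superpipelined only: approaches n
print(round(speedup(k, N, 3, 2), 3))  # combined: approaches m*n
```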

17 Superscalar vs. Superpipelined Processors

Superscalar processors:
- Rely on spatial parallelism: multiple operations run concurrently on separate hardware
- Achieved by duplicating hardware resources such as execution units and register-file ports
- Require more transistors

Superpipelined processors:
- Rely on temporal parallelism: multiple operations overlap on common hardware
- Achieved through more deeply pipelined execution units with faster clock cycles
- Require faster transistors

18 Systolic Architecture

19 Conventional architectures operate through load and store operations on memory: operands are fetched from memory, processed, and results written back. This requires many memory references, which slows the system down.

20 Systolic Architecture In systolic processing, the data to be processed flows through a series of operation stages, and the final result is put back in memory.

21 Systolic Architecture The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants. Each PE may have some registers and an ALU. The PEs are interlinked in a manner dictated by the requirements of the specific algorithm, e.g. 2-D meshes, hexagonal arrays, etc.

22 Systolic Architecture PEs at the boundary of the structure are connected to memory. Data picked up from memory circulates among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic. Example: multiplication of two n x n matrices.

23 Every element of the input is picked up n times from memory, since it contributes to n elements of the output. To reduce this memory traffic, a systolic architecture ensures that each element is pulled from memory only once. Consider an example where n = 3.

24 Matrix Multiplication

    a11 a12 a13     b11 b12 b13     c11 c12 c13
    a21 a22 a23  *  b21 b22 b23  =  c21 c22 c23
    a31 a32 a33     b31 b32 b33     c31 c32 c33

Conventional method: O(n^3)

    For I = 1 to N
      For J = 1 to N
        For K = 1 to N
          C[I,J] = C[I,J] + A[I,K] * B[K,J];
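The conventional method, transcribed into runnable Python (list-of-lists matrices; purely sequential, no systolic hardware assumed):

```python
def matmul(A, B):
    # Conventional O(n^3) method: n^2 output cells, n multiply-adds each
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# The 3 x 3 matrix used in the worked example on the later slides
A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(matmul(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```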

25 Systolic Method This will run in O(n) time! To run in O(n) time we need n x n processing units; in our example n = 3, so 9 PEs are required:

    P9 P8 P7
    P6 P5 P4
    P1 P2 P3

26 For systolic processing, the input data needs to be modified: flip columns 1 & 3 of A, flip rows 1 & 3 of B, and finally stagger the data sets for input.

    a13 a12 a11        b31 b32 b33
    a23 a22 a21        b21 b22 b23
    a33 a32 a31        b11 b12 b13
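The two flips can be expressed compactly. A sketch in Python (list-of-lists matrices; `prepare_inputs` is an illustrative name, not from the slides):

```python
def prepare_inputs(A, B):
    # Flip the columns of A (reverse each row) ...
    A_flipped = [row[::-1] for row in A]
    # ... and flip the rows of B (reverse the row order)
    B_flipped = B[::-1]
    return A_flipped, B_flipped
```

For n = 3 this swaps exactly columns 1 & 3 of A and rows 1 & 3 of B, as described above; the staggering is applied afterwards, when the operands are fed into the array.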

27 At every tick of the global system clock, data is passed to each processor from two different directions; each PE multiplies the pair of operands it receives and adds the product to the result saved in its register.

    a13 a12 a11        b31 b21 b11
    a23 a22 a21        b32 b22 b12
    a33 a32 a31        b33 b23 b13

    P9 P8 P7
    P6 P5 P4
    P1 P2 P3

28 Numerical example:

    3 4 2     3 4 2     23 36 28
    2 5 3  *  2 5 3  =  25 39 34
    3 2 5     3 2 5     28 32 37

Using a systolic array, the rearranged input streams (A with columns flipped, B with rows flipped) are:

    2 4 3        3 2 5
    3 5 2        2 5 3
    5 2 3        3 4 2

    P9 P8 P7
    P6 P5 P4
    P1 P2 P3

29 Clock tick 1: P1 computes 3*3 = 9. Accumulators: P1 = 9; all other PEs are still idle.

30 Clock tick 2: P1 adds 4*2, P2 computes 3*4, P4 computes 2*3. Accumulators: P1 = 9 + 8 = 17, P2 = 12, P4 = 6.

31 Clock tick 3: P1 adds 2*3 (c11 done), P2 adds 4*5, P3 computes 3*2, P4 adds 5*2, P5 computes 2*4, P7 computes 3*3. Accumulators: P1 = 17 + 6 = 23, P2 = 12 + 20 = 32, P3 = 6, P4 = 6 + 10 = 16, P5 = 8, P7 = 9.

32 Clock tick 4: P2 adds 2*2 (c12 done), P3 adds 4*3, P4 adds 3*3 (c21 done), P5 adds 5*5, P6 computes 2*2, P7 adds 2*2, P8 computes 3*4. Accumulators: P1 = 23, P2 = 32 + 4 = 36, P3 = 6 + 12 = 18, P4 = 16 + 9 = 25, P5 = 8 + 25 = 33, P6 = 4, P7 = 9 + 4 = 13, P8 = 12.

33 Clock tick 5: P3 adds 2*5 (c13 done), P5 adds 3*2 (c22 done), P6 adds 5*3, P7 adds 5*3 (c31 done), P8 adds 2*5, P9 computes 3*2. Accumulators: P1 = 23, P2 = 36, P3 = 18 + 10 = 28, P4 = 25, P5 = 33 + 6 = 39, P6 = 4 + 15 = 19, P7 = 13 + 15 = 28, P8 = 12 + 10 = 22, P9 = 6.

34 Clock tick 6: P6 adds 3*5 (c23 done), P8 adds 5*2 (c32 done), P9 adds 2*3. Accumulators: P1 = 23, P2 = 36, P3 = 28, P4 = 25, P5 = 39, P6 = 19 + 15 = 34, P7 = 28, P8 = 22 + 10 = 32, P9 = 6 + 6 = 12.

35 Clock tick 7: P9 adds 5*5 (c33 done). Accumulators: P1 = 23, P2 = 36, P3 = 28, P4 = 25, P5 = 39, P6 = 34, P7 = 28, P8 = 32, P9 = 12 + 25 = 37.

36 End: after 7 clock ticks (3n - 2 for n = 3), every PE holds its element of the result:

    P1 = 23  P2 = 36  P3 = 28
    P4 = 25  P5 = 39  P6 = 34
    P7 = 28  P8 = 32  P9 = 37
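The whole tick-by-tick trace can be reproduced in software. A minimal simulation sketch (plain Python): instead of physically staggering the operands, it uses the equivalent schedule in which the PE for output cell (i, j) performs its k-th multiply-add at global tick i + j + k + 1.

```python
def systolic_matmul(A, B):
    # Each PE accumulates one output cell c[i][j]; the schedule
    # tick = i + j + k + 1 models the staggered operand arrival
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for tick in range(1, 3 * n - 1):  # 3n - 2 ticks in total
        for i in range(n):
            for j in range(n):
                k = tick - i - j - 1
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(systolic_matmul(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

On real hardware the inner double loop runs in parallel across the n x n PEs, so each tick costs one clock cycle and the whole product finishes in 3n - 2 = O(n) ticks.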

37 Samba: Systolic Accelerator for Molecular Biological Applications. This systolic array contains 128 processors spread over 32 full-custom VLSI chips. Each chip houses 4 processors, and each processor computes 10 million matrix cells per second.

