Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.

1 Vector Processing

2 Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations are: –processing one or more vectors to produce a scalar result, –combining two vectors to produce a third vector, –combining a scalar and a vector to produce a vector, and –a combination of the above.
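As a rough illustration of the operation classes listed above, the C sketch below shows a reduction (vectors to a scalar), an element-wise add (two vectors to a third), and a scalar-vector scale. The function names are hypothetical, and the loops are plain scalar C; a vectorizing compiler or a vector instruction set would map each loop to vector instructions.

```c
#include <stddef.h>

/* Vectors -> scalar: dot product of a and b (a reduction). */
double vec_dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Vector + vector -> vector: c[i] = a[i] + b[i]. */
void vec_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Scalar and vector -> vector: y[i] = k * x[i]. */
void vec_scale(double k, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = k * x[i];
}
```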

3 Vector Processor Models Keeping up the bandwidth of C := A + B. Problem: RAM can supply only 1 word per cycle, but the operation needs 3 memory references per cycle (two operands and one result).

4 Vector Processor Models A vector processor gains little on scalar operations such as the one on the previous slide, but vector (non-scalar) operations can take advantage of it. C_i := A_i + B_i, 1 ≤ i ≤ N. On a SISD system we would code this as for (i = 1; i <= N; i++) C[i] = A[i] + B[i];

5 Vector Processor Models On a SISD system we would code this as for (i = 1; i <= N; i++) C[i] = A[i] + B[i]; Assuming two machine instructions for loop control and four machine instructions to implement the assignment statement (Read A, Read B, Add, Write C), the execution time is 6 × N × T, where T is the average instruction cycle time.

6 Vector Processor Models If memory could be accessed directly without requiring loop control, a single instruction (add) would suffice. The figure shows a four-stage add pipeline that delivers one add per cycle.

7 Vector Processor Model The pipeline execution time is (4 + N - 1)T. Therefore the speedup is 6NT / ((4 + N - 1)T) = 6N / (N + 3), which approaches 6 for large N.
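The timing model on these slides (6·N·T for the scalar loop, (4 + N - 1)·T for the four-stage pipeline) can be sketched in C as follows. The function names are hypothetical, and times are counted in cycles so that T cancels out of the ratio.

```c
#include <assert.h>

/* Scalar loop: 2 loop-control + 4 assignment instructions per element. */
unsigned long scalar_cycles(unsigned long n)
{
    return 6UL * n;
}

/* Pipeline: first result after `stages` cycles, then one result per cycle. */
unsigned long pipeline_cycles(unsigned long stages, unsigned long n)
{
    assert(stages > 0 && n > 0);
    return stages + n - 1;
}

/* Speedup of the pipelined add over the scalar loop. */
double pipeline_speedup(unsigned long stages, unsigned long n)
{
    return (double)scalar_cycles(n) / (double)pipeline_cycles(stages, n);
}
```

For N = 64 and 4 stages this gives 384 cycles against 67, a speedup of about 5.7.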

8 Vector Processor Models We can generalize the previous vector model as follows;

9 Vector Processor Models Further Improvements can be made;

10 This figure shows the register-oriented (memory hierarchy) vector processor. Registers are used as buffers for vector operations. Loading registers from memory and pipeline operations are performed simultaneously. [Figure: block diagram. Components: Host Computer, Mass Storage, I/O (User), Main Memory (Program and Data), Scalar Processor (Scalar Control Unit, Scalar Functional Pipelines), Vector Processor (Vector Control Unit, Vector Registers, Vector Functional Pipelines). Paths: Instruction, Scalar Instructions, Scalar Data, Vector Instructions, Vector Data, Control.]

11 Vector Processors Vector processors are supercomputers optimized for fast execution (the main criterion for design and implementation) of vectorizable scientific code that operates on large data sets. Vector processors are extensively pipelined to operate on array-oriented data. The CPU is highly pipelined and has a large set of registers. Memory is also pipelined and interleaved to match CPU demands.

12 Memory Design Note that if the pipe provides a result every d cycles (i.e., w = 1/d), then memory must supply a pair of operands (a i and b i ) every d cycles. Note that we need to fetch (read) two values and write a result simultaneously (within d cycles).

13 Memory Design If d = 1, then the memory system must have at least a bandwidth 3 times that of a conventional memory. To meet memory bandwidth requirements, two approaches have been implemented in commercial machines: 1. Use of multiple independent memory modules. 2. Use of intermediate high speed memory to: –shorten the access cycle. –use data several times between cpu and intermediate memory. –provide for certain desirable patterns of data access (i.e., rows, columns, diagonals, etc.).

14 [Figure labels: MEMORY, intermediate “buffer” memory, arithmetic pipeline. Annotations: multiple use of each data item is favorable for bandwidth; a bottleneck must be avoided.]

15 Memory Design Multiple memory modules: 3-port memory modules are used with a pipelined arithmetic unit. Only one port per module is active at a time, but all 3 streams can be active simultaneously.

16 Memory Design Care must be taken when laying out data in memory modules; otherwise simultaneous access is denied, as is seen here.
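One common layout, assumed here purely for illustration, is low-order interleaving, in which word address a lives in module a mod M. Under that assumption, the sketch below checks whether the three streams of C[i] = A[i] + B[i] land in three different modules at step i. The function names and base addresses are hypothetical, and the model ignores the pipeline delay between reading the operands and writing the result.

```c
#include <assert.h>

/* Low-order interleaving (an assumption): word address -> module number. */
unsigned module_of(unsigned long addr, unsigned modules)
{
    assert(modules > 0);
    return (unsigned)(addr % modules);
}

/* Returns 1 if A[i], B[i] and C[i] fall in three different modules, else 0. */
int streams_conflict_free(unsigned long base_a, unsigned long base_b,
                          unsigned long base_c, unsigned long i,
                          unsigned modules)
{
    unsigned ma = module_of(base_a + i, modules);
    unsigned mb = module_of(base_b + i, modules);
    unsigned mc = module_of(base_c + i, modules);
    return ma != mb && ma != mc && mb != mc;
}
```

With 4 modules, base addresses 0, 1 and 2 keep the three streams in different modules for every i, whereas base addresses 0, 4 and 8 send all three streams to the same module at every step.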

17 Memory Design The following reservation table (RT) shows the effect of 2-cycle memory access timing. Note the output conflict and the resultant delays.

18 Memory Design

19 Performance Evaluation Major characteristics affecting supercomputer performance –Clock speed –Instruction issue rate –Memory Size –Number of concurrent paths to memory –Ability to fetch/store vectors efficiently –Number of duplicate arithmetic functional units –Chaining –Indirect addressing capabilities –Handling conditional blocks of code

20 Performance Evaluation High performance of vector architectures can be attributed to the following characteristics: 1. Pipelined functional units 2. Multiple functional units operating in parallel 3. Chaining of functional units 4. Large number of programmable registers 5. Block load/store capabilities with buffer registers 6. Multiprocessors operating in parallel in a coarse-grained parallel mode 7. Instruction buffers

21 Performance Evaluation Sustained computation rates (as opposed to peak computation rates obtained under ideal circumstances) depend on factors such as: 1. Level of vectorization (fraction of the code that is vectorizable) 2. Average vector length 3. Possibility of vector chaining 4. Possible overlap of scalar, vector, and memory load/store operations 5. Mechanisms to resolve memory contention

22 Performance Evaluation What is Amdahl’s Law?

23 Performance Evaluation Amdahl’s Law: Given that the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable from even an infinite number of parallel processors is only 1/s.

24 Performance Evaluation Speedup is T_serial / T_parallel. Ideally, parallel execution time on P processors is T_serial / P. Speedup is then ideally P.

25 Performance Evaluation Amdahl’s Law changes this speedup analysis to include the serial component that cannot be parallelized.

26 Performance Evaluation Let P denote an application program, T_scalar the time to execute P in scalar mode (serial execution), and s the maximum speedup (here s is a speedup factor, not the serial fraction used in the statement of the law). Ideally, the time to execute P on the vector computer is T_scalar / s.

27 Performance Evaluation The problem Amdahl pointed out is that there is always some fraction f of P that can be executed in parallel and some fraction, 1 - f, that cannot. Therefore the actual parallel execution time is T_actual = (1 - f)·T_scalar + f·T_scalar / s.

28 Performance Evaluation The speedup now becomes Speedup = T_scalar / T_actual = 1 / ((1 - f) + f / s).

29 Performance Evaluation So if f = 1 speedup is s, the ideal speedup, and for f = 0 speedup is 1.

30 Performance Evaluation For number of processors = 10
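The speedup expression implied by T_actual = (1 - f)·T_scalar + f·T_scalar / s can be evaluated with a small C function. This is a minimal sketch; the function name is hypothetical, f is the parallelizable fraction, and s is the maximum speedup (10 in the case above).

```c
#include <assert.h>

/* Speedup = T_scalar / T_actual = 1 / ((1 - f) + f / s). */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With s = 10, a program that is 50% parallelizable speeds up by only about 1.8, one that is 90% parallelizable by about 5.3, and one that is 99% parallelizable by about 9.2.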

31 Performance Evaluation Time to execute loops can be used to estimate peak and sustained performance. Let

32 Performance Evaluation Then;

33 Programming Vector Processors The hardware structure that makes vector processors powerful also makes the assembler code difficult to write.

34 Programming Vector Processors Programming tools: –Languages: to express parallelism inherent in the algorithm –Compilers: to recognize vectorizable code –Combination of the above optimizes parallelism

35 Programming Vector Processors Vector pipelining is obviously one benefit that is exploited when executing a program.

36 Programming Vector Processors Chaining is another important characteristic of some vector processors. Chaining is the ability to activate additional independent functional units as soon as intermediate results are known.
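The benefit of chaining can be estimated with a simplified timing model, offered here as an assumption rather than a description of any particular machine: a pipeline of depth d delivers n results in d + n - 1 cycles, and a chained second pipeline starts consuming results as soon as the first one produces them. The function names and parameters are hypothetical.

```c
#include <assert.h>

/* Two dependent vector operations without chaining: the second pipeline
   starts only after the first has produced all n results. */
unsigned long unchained_cycles(unsigned long d1, unsigned long d2,
                               unsigned long n)
{
    assert(d1 > 0 && d2 > 0 && n > 0);
    return (d1 + n - 1) + (d2 + n - 1);
}

/* With chaining: each result of the first pipeline is forwarded to the
   second as soon as it appears, so the two pipelines overlap. */
unsigned long chained_cycles(unsigned long d1, unsigned long d2,
                             unsigned long n)
{
    assert(d1 > 0 && d2 > 0 && n > 0);
    return d1 + d2 + n - 1;
}
```

Under this model, a 6-stage multiply feeding a 4-stage add on 64-element vectors takes 136 cycles unchained and 73 cycles chained.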

37 Chaining

38 Consider the following

39 Chaining Simultaneous

40 Scalar Renaming How might this be improved?

41 Scalar Renaming This becomes this This renaming makes the code segments independent allowing for better vectorization
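A hypothetical C example of scalar renaming (the arrays, variable names, and arithmetic are assumptions chosen for illustration, not the code from the slide): the scalar x is reused for two unrelated values, and giving the second value its own name, xx, removes the false dependence between the two halves of the loop body.

```c
#include <stddef.h>

/* Before renaming: x holds two unrelated values in each iteration. */
double before_renaming(const double *a, const double *b,
                       const double *c, const double *d,
                       double *y, size_t n)
{
    double x = 0.0, p = 0.0;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
        x = c[i] / d[i];    /* second, unrelated use of x */
        p = x + 2.0;
    }
    return p;
}

/* After renaming: the second value gets its own name, xx. */
double after_renaming(const double *a, const double *b,
                      const double *c, const double *d,
                      double *y, size_t n)
{
    double x = 0.0, xx = 0.0, p = 0.0;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
        xx = c[i] / d[i];   /* no longer overwrites x */
        p = xx + 2.0;
    }
    return p;
}
```

Both versions compute the same results; only the dependence structure seen by the vectorizer changes.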

42 Scalar Expansion How might this be improved?

43 Scalar Expansion If scalar x is expanded into a vector the two statements become independent
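A hypothetical C example of scalar expansion (names and arithmetic are assumptions, not the slide's code): the scalar x is replaced by a temporary array xv, so that each statement can be issued as a vector operation over the whole index range.

```c
#include <stddef.h>

/* Before: the scalar x ties the two statements together per iteration. */
void before_expansion(const double *a, const double *b,
                      double *y, size_t n)
{
    double x;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
    }
}

/* After: x is expanded into xv[], giving two whole-vector statements. */
void after_expansion(const double *a, const double *b,
                     double *xv, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        xv[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++)
        y[i] = 2.0 * xv[i];
}
```

The price is extra storage for xv, which is the usual trade-off with this transformation.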

44 Loop Unrolling The loop becomes this
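As a hypothetical illustration of loop unrolling (the element-wise multiply and the unroll factor of 4 are assumptions), the sketch below shows a rolled loop and an unrolled equivalent with a cleanup loop for lengths that are not a multiple of 4.

```c
#include <stddef.h>

/* Rolled loop: one element and one loop-control test per iteration. */
void mul_rolled(const double *a, const double *b, double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = a[i] * b[i];
}

/* Unrolled by 4: one loop-control test per four elements. */
void mul_unrolled4(const double *a, const double *b, double *x, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     = a[i]     * b[i];
        x[i + 1] = a[i + 1] * b[i + 1];
        x[i + 2] = a[i + 2] * b[i + 2];
        x[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)      /* remaining 0 to 3 elements */
        x[i] = a[i] * b[i];
}
```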

45 What about this? Loop fusion

46 Loop fusion Note that each loop would be equivalent to a vector instruction. X is stored back into memory by the first instruction and then retrieved by the second. If these loops are fused as follows, then memory traffic is reduced: What else might be done to improve this?
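A hypothetical pair of loops of the kind described here, together with a fused form, might look like the following in C. The array names follow the slide's X; Y, Z, P, and M are assumptions. In the fused version x[i] is still stored, but it can be reused for the second statement before it leaves a register.

```c
#include <stddef.h>

/* Two separate loops: x is written to memory, then read back. */
void two_loops(const double *y, const double *z, const double *p,
               double *x, double *m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] * z[i];
    for (size_t i = 0; i < n; i++)
        m[i] = p[i] + x[i];
}

/* Fused loop: both statements share one pass over the index range. */
void fused_loop(const double *y, const double *z, const double *p,
                double *x, double *m, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = y[i] * z[i];
        m[i] = p[i] + x[i];
    }
}
```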

47 Note that this is possible if there are enough registers available to retain X. If chaining is supported then the loop can be reduced to:
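A sketch of the reduced form, under the assumption that the intermediate vector X is not needed after the loop: the product is fed straight into the add, and nothing is stored for X. The array names Y, Z, P, and M and the function name are hypothetical.

```c
#include <stddef.h>

/* Multiply result is consumed directly by the add; X is never stored. */
void fused_chained(const double *y, const double *z, const double *p,
                   double *m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        m[i] = p[i] + y[i] * z[i];
}
```

On a machine with chaining, the multiply pipeline's output would feed the add pipeline directly, which is what this single statement expresses at the source level.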
