Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Vector Processing

Vector Processors
Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations are:
- processing one or more vectors to produce a scalar result,
- combining two vectors to produce a third vector,
- combining a scalar and a vector to produce a vector, and
- a combination of the above.

Vector Processor Models
Keeping up the bandwidth of C := A + B is the problem: RAM can supply only one word per cycle, but this statement needs three memory references per cycle (two operand reads and one result write).

Vector Processor Models
When dealing with scalar operations such as the one on the previous slide, little can be gained from a vector processor; vector (non-scalar) operations, however, can take full advantage of one: C_i := A_i + B_i, 1 <= i <= N. On a SISD system we would code this as: for (i = 1; i <= N; i++) C[i] = A[i] + B[i];

Vector Processor Models
On a SISD system we would code this as for (i = 1; i <= N; i++) C[i] = A[i] + B[i]; Assuming two machine instructions for loop control and four machine instructions to implement the assignment (Read A, Read B, Add, Write C), the execution time is 6 x N x T, where T is the average instruction cycle time.

Vector Processor Models
If memory could be accessed directly, without requiring loop control, the whole computation could be a single instruction (add). The figure shows a four-stage add pipeline, which produces one add result per cycle once the pipeline is full.

Vector Processor Model
The pipeline execution time is (4 + N - 1)T. Therefore the speedup is 6NT / ((4 + N - 1)T) = 6N / (N + 3), which approaches 6 as N grows.

Vector Processor Models
We can generalize the previous vector model: for a pipeline of depth k, the execution time is (k + N - 1)T.

Vector Processor Models
Further improvements can be made, for example by operating several such pipelines in parallel.

This figure shows the register-oriented (memory hierarchy) vector processor. Registers are used as buffers for vector operations; loading registers from memory and pipeline operations are performed simultaneously. [Figure: a host computer containing a scalar processor (scalar control unit, scalar functional pipelines) and a vector processor (vector control unit, vector registers, vector functional pipelines); instructions are split into scalar and vector streams; both sides connect to main memory (program and data), mass storage, and I/O.]

Vector Processors
Vector processors are supercomputers optimized for fast execution of vectorizable scientific code that operates on large data sets; fast execution is the main criterion for their design and implementation. They are extensively pipelined to operate on array-oriented data: the CPU is highly pipelined and has a large set of registers, and memory is likewise pipelined and interleaved to match CPU demands.

Memory Design
Note that if the pipe provides a result every d cycles (i.e., w = 1/d), then memory must supply a pair of operands (a_i and b_i) every d cycles. We therefore need to fetch (read) two values and write a result simultaneously (within d cycles).

Memory Design
If d = 1, the memory system must have at least three times the bandwidth of a conventional memory. To meet this bandwidth requirement, two approaches have been implemented in commercial machines:
1. Use of multiple independent memory modules.
2. Use of an intermediate high-speed memory to:
   - shorten the access cycle,
   - use data several times between the CPU and the intermediate memory, and
   - provide for certain desirable patterns of data access (e.g., rows, columns, diagonals).

[Figure: main memory feeding an intermediate "buffer" memory, which feeds the arithmetic pipeline. Using each datum several times in the buffer is favorable for bandwidth; the memory-to-buffer path must not become a bottleneck.]

Memory Design
Multiple memory modules: three-port memory modules are used with an arithmetic pipeline. Only one port per module is active at a time, but all three streams (two operand reads and one result write) can be active simultaneously.

Memory Design
Care must be taken when laying out data in the memory modules; otherwise simultaneous access is denied because two of the streams contend for the same module.

Memory Design
The following reservation table shows the effect of 2-cycle memory access timing. Note the output conflict and the resulting delays.

Memory Design

Performance Evaluation
Major characteristics affecting supercomputer performance:
- Clock speed
- Instruction issue rate
- Memory size
- Number of concurrent paths to memory
- Ability to fetch/store vectors efficiently
- Number of duplicate arithmetic functional units
- Chaining
- Indirect addressing capabilities
- Handling of conditional blocks of code

Performance Evaluation
The high performance of vector architectures can be attributed to the following characteristics:
1. Pipelined functional units
2. Multiple functional units operating in parallel
3. Chaining of functional units
4. A large number of programmable registers
5. Block load/store capabilities with buffer registers
6. Multiple processors operating in parallel in a coarse-grained parallel mode
7. Instruction buffers

Performance Evaluation
Sustained computation rates (as opposed to peak rates obtained under ideal circumstances) depend on factors such as:
1. Level of vectorization (the fraction of the code that is vectorizable)
2. Average vector length
3. Possibility of vector chaining
4. Possible overlap of scalar, vector, and memory load/store operations
5. Mechanisms to resolve memory contention

Performance Evaluation
What is Amdahl's Law?

Performance Evaluation
Amdahl's Law: given that the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable from even an infinite number of parallel processors is only 1/s.

Performance Evaluation
Ideally, the parallel execution time on P processors is T_parallel = T_serial / P, so the ideal speedup T_serial / T_parallel is P.

Performance Evaluation
Amdahl's Law changes this speedup analysis to include the serial component that cannot be parallelized.

Performance Evaluation
Let P denote an application program, T_scalar the time to execute P in scalar (serial) mode, and s the maximum speedup. Ideally, the time to execute P on the vector computer is T_scalar / s.

Performance Evaluation
The problem Amdahl pointed out is that only some fraction f of P can be executed in parallel, while the remaining fraction (1 - f) cannot. Therefore the actual execution time is T_actual = (1 - f) * T_scalar + f * T_scalar / s.

Performance Evaluation
The speedup now becomes Speedup = T_scalar / T_actual = 1 / ((1 - f) + f/s).

Performance Evaluation
So for f = 1 the speedup is s, the ideal speedup, and for f = 0 the speedup is 1.

Performance Evaluation
[Figure: speedup as a function of f for number of processors = 10; the curve rises slowly until f approaches 1, showing how even a small serial fraction limits speedup.]

Performance Evaluation
The time to execute loops can be used to estimate peak and sustained performance.

Programming Vector Processors
The hardware structure that makes vector processors powerful also makes their assembly code difficult to write.

Programming Vector Processors
Programming tools:
- Languages: to express the parallelism inherent in the algorithm
- Compilers: to recognize vectorizable code
A combination of the two optimizes parallelism.

Programming Vector Processors
Vector pipelining is one obvious benefit exploited when executing a program.

Programming Vector Processors
Chaining is another important characteristic of some vector processors: the ability to activate additional, independent functional units as soon as intermediate results are known.

Chaining

Consider the following sequence, in which the output of a vector add feeds a vector multiply.

Chaining
With chaining, the add and multiply pipelines operate simultaneously: each sum is forwarded to the multiplier as soon as it is produced.

Scalar Renaming
How might this be improved?

Scalar Renaming
Renaming the reused scalar turns the single variable into two independent temporaries. This renaming makes the code segments independent, allowing better vectorization.

Scalar Expansion
How might this be improved?

Scalar Expansion
If the scalar x is expanded into a vector, the two statements become independent.

Loop Unrolling
Replicating the loop body several times per iteration reduces the loop-control overhead per element.

Loop Fusion
What about two adjacent loops over the same index range?

Note that each loop would be equivalent to a vector instruction. X is stored back into memory by the first instruction and then retrieved by the second. If these loops are fused, that memory traffic is reduced. What else might be done to improve this?

Note that fusion is possible only if there are enough registers available to retain X. If chaining is supported, the fused loop reduces to a single chained pass through the two pipelines.