CPE 631: Vector Processing (Appendix F in COA4)

Electrical and Computer Engineering, University of Alabama in Huntsville
Aleksandar Milenković, milenka@ece.uah.edu, http://www.ece.uah.edu/~milenka

Outline
- Properties of Vector Processing
- Components of a Vector Processor
- Vector Execution Time
- Real-World Problems: Vector Length and Stride
- Vector Optimizations: Chaining, Conditional Execution, Sparse Matrices

Why Vector Processors?
Instruction-level parallelism (Ch 3 & 4):
- Deeper pipelines and wider superscalar machines extract more parallelism, but need more register-file ports, more registers, and more hazard-interlock logic
- In dynamically scheduled machines, the instruction window, reorder buffer, and rename register file must grow to hold enough information about in-flight instructions
- It is difficult to build machines that support a large number of in-flight instructions => issue width and pipeline depth are limited => the amount of parallelism you can extract is limited
Commercial vector machines appeared long before ILP machines.

Vector Processing Definitions
Vector: a set of scalar data items, all of the same type, stored in memory
Vector processor: an ensemble of hardware resources, including vector registers, functional pipelines, processing elements, and register counters, for performing vector operations
Vector processing occurs when arithmetic or logical operations are applied to vectors
Scalar vs. vector: add r3,r1,r2 performs 1 operation (r3 = r1 + r2), while add.vv v3,v1,v2 performs N operations (v3(i) = v1(i) + v2(i) for i = 1..N, where N is the vector length)
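For illustration, here is the loop that a single add.vv instruction replaces, as a minimal C sketch (the array names and length N are placeholders):

/* One vector instruction (add.vv v3,v1,v2) does the work of this loop. */
for (int i = 0; i < N; i++)
    v3[i] = v1[i] + v2[i];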

Properties of Vector Processors
1) A single vector instruction specifies lots of work: it is equivalent to executing an entire loop => fewer instructions to fetch and decode
2) The computation of each result in the vector is independent of the computation of other results in the same vector => deep pipeline without data hazards; high clock rate
3) Hardware checks for data hazards only between vector instructions (once per vector, not once per vector element)
4) Memory is accessed with a known pattern: elements are all adjacent in memory => highly interleaved memory banks provide high bandwidth; access is initiated for the entire vector => high memory latency is amortised (no data caches are needed)
5) Control hazards from loop branches are reduced: nonexistent within one vector instruction

Properties of Vector Processors (cont’d)
Vector operations: arithmetic (add, sub, mul, div), memory accesses, effective-address calculations
Multiple vector instructions can be in progress at the same time => more parallelism
Applications that benefit:
- Large scientific and engineering applications (car-crash simulations, weather forecasting, ...)
- Multimedia applications

Basic Vector Architectures
Vector processor: an ordinary pipelined scalar unit + a vector unit
Types of vector processors:
- Memory-memory processors: all vector operations are memory-to-memory (CDC)
- Vector-register processors: all vector operations except load and store are among the vector registers (CRAY-1, CRAY-2, X-MP, Y-MP, NEC SX/2(3), Fujitsu)
VMIPS: a vector processor defined as an extension of the 5-stage MIPS processor

Components of a vector-register processor
Vector registers:
- each vector register is a fixed-length bank holding a single vector
- has at least 2 read ports and 1 write port
- typically 8-32 vector registers, each holding 64-128 64-bit elements
- VMIPS: 8 vector registers, each holding 64 elements (16 read ports, 8 write ports)
Vector functional units (FUs):
- fully pipelined, can start a new operation every clock
- typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; may have multiple copies of the same unit
- VMIPS: 5 FUs (FP add/subtract, FP multiply, FP divide, FP integer, FP logical)

Components of a vector-register processor (cont’d)
Vector load-store units (LSUs):
- fully pipelined unit to load or store a vector; may have multiple LSUs
- VMIPS: 1 LSU, with a bandwidth of 1 word per cycle after an initial delay
Scalar registers:
- hold a single element: an FP scalar or an address
- VMIPS: 32 GPRs, 32 FPRs; they are read out and latched at one input of the FUs
Crossbar:
- connects the FUs, LSUs, and register read/write ports

VMIPS: Basic Structure
[Figure: main memory feeds a fully pipelined vector load/store unit; 8 64-element vector registers and the scalar registers connect to 5 fully pipelined FUs (FP add/subtract, FP multiply, FP divide, integer, logical), each able to start a new operation on every clock cycle]

VMIPS Vector Instructions
Instr.   Operands   Operation                   Comment
ADDV.D   V1,V2,V3   V1=V2+V3                    vector + vector
ADDSV.D  V1,F0,V2   V1=F0+V2                    scalar + vector
MULV.D   V1,V2,V3   V1=V2xV3                    vector x vector
MULSV.D  V1,F0,V2   V1=F0xV2                    scalar x vector
LV       V1,R1      V1=M[R1..R1+63]             load, stride=1
LVWS     V1,R1,R2   V1=M[R1..R1+63*R2]          load, stride=R2
LVI      V1,R1,V2   V1=M[R1+V2(i)], i=0..63     indirect ("gather")
CEQV.D   VM,V1,V2   VM(i) = (V1(i)==V2(i))?1:0  compare, set mask
MTC1     VLR,R1     Vec. Len. Reg. = R1         set vector length
MFC1     R1,VM      R1 = Vec. Mask              read vector mask
See Table F.3 for the full set of VMIPS vector instructions.

VMIPS Vector Instructions (cont’d)
Instr.   Operands   Operation   Comment
SUBV.D   V1,V2,V3   V1=V2-V3    vector - vector
SUBSV.D  V1,F0,V2   V1=F0-V2    scalar - vector
SUBVS.D  V1,V2,F0   V1=V2-F0    vector - scalar
DIVV.D   V1,V2,V3   V1=V2/V3    vector / vector
DIVSV.D  V1,F0,V2   V1=F0/V2    scalar / vector
DIVVS.D  V1,V2,F0   V1=V2/F0    vector / scalar
...
POP      R1,VM      count the 1s in the VM register
CVM                 set the vector-mask register to all 1s
See Table F.3 for the full set of VMIPS vector instructions.

DAXPY: Double aX + Y Assuming vectors X, Y are length 64 L.D F0,a ;load scalar a LV V1,Rx ;load vector X MULVS V2,V1,F0 ;vector-scalar mult. LV V3,Ry ;load vector Y ADDV.D V4,V2,V3 ;add SV Ry,V4 ;store the result Assuming vectors X, Y are length 64 Scalar vs. Vector L.D F0,a DADDIU R4,Rx,#512 ;last address to load loop: L.D F2, 0(Rx) ;load X(i) MULT.D F2,F0,F2 ;a*X(i) L.D F4, 0(Ry) ;load Y(i) ADD.D F4,F2,F4 ;a*X(i) + Y(i) S.D F4,0(Ry) ;store into Y(i) DADDIU Rx,Rx,#8 ;increment index to X DADDIU Ry,Ry,#8 ;increment index to Y DSUBU R20,R4,Rx ;compute bound BNEZ R20,loop ;check if done Operations: 578 (2+9*64) vs. 321 (1+5*64) (1.8X) Instructions: 578 (2+9*64) vs. 6 instructions (96X) Hazards: 64X fewer pipeline hazards

Vector Execution Time
Time = f(vector length, data dependences, structural hazards)
Initiation rate: the rate at which an FU consumes vector elements (= number of lanes; usually 1 or 2 on the Cray T-90)
Convoy: a set of vector instructions that can begin execution in the same clock (no structural or data hazards among them)
Chime: the approximate time to execute one convoy; m convoys take m chimes, so if each vector length is n, they take approximately m x n clock cycles (ignores overhead; a good approximation for long vectors)
Example: 4 convoys, 1 lane, VL=64 => 4 x 64 = 256 clocks (or 4 clocks per result)
1: LV      V1,Rx      ;load vector X
2: MULVS.D V2,V1,F0   ;vector-scalar multiply
   LV      V3,Ry      ;load vector Y
3: ADDV.D  V4,V2,V3   ;add
4: SV      Ry,V4      ;store the result

VMIPS Start-up Time
Start-up time: pipeline latency (depth of the FU pipeline); another source of overhead
Operation           Start-up penalty (from CRAY-1)
Vector load/store   12
Vector multiply     7
Vector add          6
Assume convoys don't overlap; vector length = n:
Convoy          Start   First result   Last result
1. LV           0       12             11+n (= 12-1+n)
2. MULVS.D, LV  12+n    24+n           23+2n   (waits for convoy 1; load start-up)
3. ADDV.D       24+2n   30+2n          29+3n   (waits for convoy 2)
4. SV           30+3n   42+3n          41+4n   (waits for convoy 3)
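As a quick check of the model, a small C sketch (values taken from the table above) compares the start-up-aware time with the chime estimate:

#include <stdio.h>

/* For n = 64, the last result of convoy 4 arrives at 41 + 4n = 297 cycles,
 * i.e., about 4.6 cycles per element vs. the 4-cycle chime estimate. */
int main(void) {
    int n = 64;
    int total = 41 + 4 * n;
    printf("%d cycles, %.2f cycles/element (chime estimate: 4)\n",
           total, (double)total / n);
    return 0;
}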

VMIPS Execution Time
[Timing diagram: each convoy occupies its start-up latency followed by n element slots: LV (12+n), MULVS.D/LV (12+n), ADDV.D (6+n), SV (12+n), issued back to back as in the table above]
1: LV      V1,Rx
2: MULVS.D V2,V1,F0
   LV      V3,Ry
3: ADDV.D  V4,V2,V3
4: SV      Ry,V4

Vector Load/Store Units & Memories
Start-up overheads are usually longer for LSUs
The memory system must sustain (# lanes x word)/clock cycle
Many vector processors use memory banks (vs. simple interleaving):
- support multiple loads/stores per cycle => multiple banks, addressed independently
- support non-sequential accesses
Note: the number of memory banks should exceed the memory latency to avoid stalls: m banks deliver m words per memory latency of l clocks; if m < l, there is a gap in the memory pipeline. Machines may have as many as 1024 banks (in SRAM).
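As a hypothetical worked example: with a memory latency of l = 12 clocks and only m = 8 banks, the banks deliver 8 words and then sit idle for 4 of every 12 clocks, so sustaining one word per clock requires at least 12 independent banks.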

Real-World Issues: Vector Length
What to do when the vector length is not exactly 64, and n may be unknown at compile time?
for(i=0; i<n; i++) {Y(i)=a*X(i)+Y(i)}
Vector-Length Register (VLR): controls the length of any vector operation, including a vector load or store (cannot exceed the length of the vector registers)
What if n > the Maximum Vector Length (MVL)? => Strip mining

Strip Mining
Strip mining: generation of code such that each vector operation is done for a size less than or equal to the MVL
The first loop iteration does the short piece (n mod MVL); all remaining iterations use VL = MVL
What is the overhead of executing a strip-mined loop?
low = 0;
VL = n % MVL;                       /* odd-size first piece */
for (j = 0; j <= n/MVL; j++) {
    for (i = low; i < low+VL; i++)  /* runs for length VL */
        Y[i] = a*X[i] + Y[i];
    low += VL;                      /* start of the next piece */
    VL = MVL;                       /* reset to the full vector length */
}
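For instance, with n = 130 and MVL = 64 the loop executes three pieces: a first piece of 130 mod 64 = 2 elements, then two full pieces of 64 elements each.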

Vector Stride
Suppose adjacent elements are not sequential in memory (e.g., matrix multiplication):
for(i=0; i<100; i++)
  for(j=0; j<100; j++) {
    A(i,j)=0.0;
    for(k=0; k<100; k++)
      A(i,j)=A(i,j)+B(i,k)*C(k,j);
  }
Accesses to matrix C are not adjacent: C is stored row by row (C(1,1), C(1,2), ..., C(1,100), C(2,1), ..., C(2,100), ...), so successive elements of a column are 100 doubles (800 bytes) apart
Stride: the distance separating elements that are to be merged into a single vector => LVWS (load vector with stride) instruction
Strides can cause memory-bank conflicts (e.g., stride = 32 with 16 banks)
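A minimal C sketch of why stride = 32 with 16 banks conflicts, assuming the simple bank mapping (element index x stride) mod (number of banks):

#include <stdio.h>

/* Every strided access maps to bank (i*32) % 16 == 0, so a single
 * bank serves the entire vector and the accesses serialize. */
int main(void) {
    int banks = 16, stride = 32;
    for (int i = 0; i < 4; i++)
        printf("element %d -> bank %d\n", i, (i * stride) % banks);
    return 0;
}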

Vector Opt #1: Chaining
Suppose:
MULV.D V1,V2,V3
ADDV.D V4,V1,V5   ; a separate convoy?
Chaining: treat the vector register (V1) not as a single entity but as a group of individual registers; pipeline forwarding can then work on individual elements of a vector
Flexible chaining: allows a vector to chain to any other active vector operation => requires more read/write ports
Given enough hardware, chaining increases the effective convoy size
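Using the start-up penalties from earlier (multiply 7, add 6) and n = 64: unchained, the two instructions take (7 + 64) + (6 + 64) = 141 clocks; chained, the add starts as soon as the first multiply result is available, for 7 + 6 + 64 = 77 clocks.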

DAXPY Chaining: CRAY-1
The CRAY-1 has one memory-access pipe, used either for a load or for a store (not both at the same time) => 3 chains:
Chain 1: LV V3
Chain 2: LV V1 + MULV V2,F0,V1 + ADDV V4,V2,V3
Chain 3: SV V4
[Timing diagram: chain 1 takes 12+n clocks, chain 2 takes 25+n (load, multiply, and add start-ups of 12+7+6), chain 3 takes 12+n]

3 Chains DAXPY for CRAY-1
[Figure: datapath for the 3-chain DAXPY on the CRAY-1; the single memory R/W port and access pipe are used in turn to load V3, to load V1 (which feeds the multiply pipe with F0 to produce V2, chained with V3 into the add pipe to produce V4), and finally to store V4]

DAXPY Chaining: CRAY X-MP
The CRAY X-MP has 3 memory-access pipes: two for vector loads and one for vector stores => 1 chain:
Chain 1: LV V3, LV V1 + MULV V2,F0,V1 + ADDV V4,V2,V3 + SV V4
[Timing diagram: the two loads (12+n each) proceed in parallel and feed the multiply and add pipes, with the store (12+n) chained behind the add]

One Chain DAXPY for CRAY X-MP
[Figure: datapath with two read ports loading V3 and V1 from memory through separate access pipes; F0 and V1 feed the multiply pipe to produce V2; V2 and V3 feed the add pipe to produce V4; V4 is stored to memory through the write port]

Vector Opt #2: Conditional Execution
Consider:
do 100 i = 1, 64
  if (A(i) .ne. 0) then
    A(i) = A(i) - B(i)
  endif
100 continue
Vector-mask control takes a Boolean vector: once the vector-mask register has been loaded from a vector test, vector instructions operate only on the vector elements whose corresponding entries in the vector-mask register are 1
This requires a clock cycle even for the elements where the mask is 0
Some vector processors use the vector mask only to disable the storing of the result, and the operation still occurs; a divide-by-zero exception is then possible for a masked-off element, so such exceptions must be masked as well

Vector Mask Control
LV      V1,Ra     ;load A into V1
LV      V2,Rb     ;load B into V2
L.D     F0,#0     ;load FP zero into F0
SNEVS.D V1,F0     ;set VM(i) to 1 if V1(i)!=F0
SUBV.D  V1,V1,V2  ;subtract under the vector mask
CVM               ;set VM to all 1s
SV      Ra,V1     ;store the result in A
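A minimal C sketch of the vector-mask semantics described above (function and parameter names are ours); it shows the variant where the operation occurs for every element but only unmasked results are written back:

/* Masked subtract: the operation "executes" for all n elements,
 * but the result is stored only where the mask bit is 1. */
void masked_sub(int n, double *a, const double *b, const unsigned char *vm) {
    for (int i = 0; i < n; i++) {
        double t = a[i] - b[i];   /* occurs regardless of the mask */
        if (vm[i]) a[i] = t;      /* write-back suppressed when mask = 0 */
    }
}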

Vector Opt #3: Sparse Matrices
Sparse matrix: the elements of a vector are usually stored in some compacted form and then accessed indirectly
Suppose:
do 100 i = 1, n
100 A(K(i)) = A(K(i)) + C(M(i))
Mechanism to support sparse matrices: scatter-gather operations
Gather (LVI): takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets in the index vector => a nonsparse (dense) vector in a vector register
After these elements are operated on in dense form, the sparse vector can be stored back in expanded form by a scatter store (SVI), using the same index vector

Sparse Matrices Example
do 100 i = 1, n
100 A(K(i)) = A(K(i)) + C(M(i))
LV     Vk,Rk        ;load K
LVI    Va,(Ra+Vk)   ;load A(K(i))
LV     Vm,Rm        ;load M
LVI    Vc,(Rc+Vm)   ;load C(M(i))
ADDV.D Va,Va,Vc     ;add them
SVI    (Ra+Vk),Va   ;store A(K(i))
This cannot be done automatically by a compiler, since the compiler cannot know that the K(i) values are distinct (and hence that no dependences exist)
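For reference, a minimal C sketch of what the gather-scatter sequence computes (function and parameter names are ours):

/* Gather A[K[i]] and C[M[i]], add in dense form, scatter the sums back. */
void sparse_update(int n, double *A, const double *C,
                   const int *K, const int *M) {
    for (int i = 0; i < n; i++)
        A[K[i]] = A[K[i]] + C[M[i]];
}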

Sparse Matrices Example (cont’d)
LV      V1,Ra       ;load A into V1
L.D     F0,#0       ;load FP zero into F0
SNEVS.D V1,F0       ;set VM(i) to 1 if V1(i)!=F0
CVI     V2,#8       ;generate indices in V2
POP     R1,VM       ;find the number of 1s
MTC1    VLR,R1      ;load the vector-length register
CVM                 ;clear the mask
LVI     V3,(Ra+V2)  ;load the nonzero A elements
LVI     V4,(Rb+V2)  ;load the corresponding B elements
SUBV.D  V3,V3,V4    ;do the subtract
SVI     (Ra+V2),V3  ;store A back
CVI creates the index sequence 0, 1*m, ..., 63*m (here m = 8 bytes) as a compressed index vector whose entries correspond to the positions with a 1 in the mask register
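As a small worked example: if the first four mask bits are 1, 0, 1, 1 and m = 8, the compressed index vector begins 0, 16, 24 (the byte offsets of elements 0, 2, and 3), and POP returns 3, which becomes the new vector length.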

Things to Remember
Properties of vector processing:
- Each result is independent of previous results
- Vector instructions access memory with a known pattern
- Reduces branches and branch problems in pipelines
- A single vector instruction implies lots of work (equivalent to an entire loop)
Components of a vector processor: vector registers, functional units, load/store units, crossbar, ...
Strip mining: the technique for handling long vectors
Optimisation techniques: chaining, conditional execution, sparse matrices