Department of Computer Science University of the West Indies
Architecture Classification
- The Flynn taxonomy (proposed in 1966!)
- Functional taxonomy based on the notion of streams of information: data and instructions
- Platforms are classified according to whether they have a single (S) or multiple (M) stream of data or instructions.
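As a rough illustration of the naming scheme, the sketch below maps a count of instruction streams and data streams to the four Flynn class names. The function name `flynn_class` is hypothetical and chosen only for this example.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name (SISD, SIMD, MISD or MIMD)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"  # single or multiple instruction
    d = "S" if data_streams == 1 else "M"         # single or multiple data
    return f"{i}I{d}D"
```

For instance, one instruction stream driving 64 data streams would be classified as SIMD.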
SISD
- Classic von Neumann machine
- Basic components: CPU (control unit, ALU) and main memory (RAM)
- Connected via a bus (aka the von Neumann bottleneck)
- Examples: standard desktop computer, laptop
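[Figure: SISD organisation, with blocks labelled CP and M linked by an instruction stream (IS) and a data stream (DS)]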
SIMD
- Pure SIMD machine:
  - single CPU devoted exclusively to control
  - collection of subordinate ALUs, each with a small amount of memory
- Instruction cycle: CPU broadcasts, ALUs execute or idle
  - lock-step progress (effectively a global clock)
- Key point: completely synchronous execution of statements
- Vector and matrix computations lend themselves to a SIMD implementation
- Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2
Data Parallel Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control, which performs sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model
  - Array of many simple, cheap processors with little memory each
  - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing
Data Parallel Programming
In this approach, we must determine how large amounts of data can be split up. In other words, we need to identify small chunks of data which require similar processing.
- These chunks of data are then assigned to different sites where they can be processed. The computations at each node may require some intermediate results from peer nodes.
- The same executable could be running on each processing site, but each processing site would have a different dataset.
- For data parallelism to work best, the volume of communicated values should be small compared with the volume of locally computed results.
Data Parallel Programming
Data-parallel decomposition can be implemented using an SPMD (single program, multiple data) programming model. One processing element is regarded as "first among equals":
- This processor starts up the program and initialises the other processors. It then works as an equal to these processors.
- Each PE does approximately the same calculation on different data.
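The SPMD idea can be sketched in a few lines. This is a sequential stand-in, not a real parallel runtime: the names `split_into_chunks`, `spmd_program` and `run` are hypothetical, the "sites" are simulated by a loop, and the sum of squares is just a placeholder computation.

```python
def split_into_chunks(data, num_sites):
    """Split data into num_sites contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), num_sites)
    chunks, start = [], 0
    for rank in range(num_sites):
        size = base + (1 if rank < extra else 0)  # first `extra` sites get one more
        chunks.append(data[start:start + size])
        start += size
    return chunks


def spmd_program(rank, local_data):
    """The single program every site runs; only local_data differs."""
    return sum(x * x for x in local_data)


def run(data, num_sites):
    chunks = split_into_chunks(data, num_sites)
    # Sequential stand-in for the sites running concurrently.
    partials = [spmd_program(rank, chunk) for rank, chunk in enumerate(chunks)]
    # Rank 0 plays "first among equals" and combines the partial results.
    return sum(partials)
```

Only one small value per site (its partial sum) is communicated, which matches the guideline that communicated volume should be small relative to local computation.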
Data Parallel Programming
Data-parallel architectures introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different from the operations allowed on a sequential array.
Unfortunately, each data parallel language had features tied to a particular manufacturer's parallel computer architecture, e.g. *LISP, C* and CM Fortran for Thinking Machines Corporation's Connection Machine series of computers.
In the 1980s and 1990s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general-purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction Multiple Data (MIMD) parallel computer architectures.
Example - ILLIAC IV
ILLIAC IV, built in 1974 at the University of Illinois, was the first large system to employ semiconductor primary memory. It was a SIMD computer for array processing. It consisted of:
- a control unit (CU) and
- 64 processing elements (PEs).
Each processing element had two thousand (2K) 64-bit words of local memory, giving 64 x 2K = 128K words in total. The CU could access all 128K words of memory through a bus, but each PE could directly access only its local memory.
Example - ILLIAC IV
An 8 by 8 grid interconnect joined each PE to 4 neighbours. The CU interpreted program instructions scattered across the memory, and broadcast them to the PEs. Neither the PEs nor the CU were general-purpose computers in the modern sense; the CU had quite limited arithmetic capabilities. Between 1975 and 1981 it was the world's fastest computer.
Example - ILLIAC IV
The ILLIAC IV had thirteen rotating fixed-head disks which comprised part of the central system memory. It was one of the first computers to use all-semiconductor main memory.
Data Parallel Languages
CFD was a data parallel language developed in the early 1970s at the Computational Fluid Dynamics Branch of Ames Research Center. CFD was a "FORTRAN-like" language, rather than a FORTRAN dialect. The language design was extremely pragmatic. No attempt was made to hide the hardware peculiarities from the user; in fact, every attempt was made to give programmers access to and control of all of the ILLIAC hardware so they could construct an efficient program.
CFD had five basic datatypes:
- CU INTEGER
- CU REAL
- CU LOGICAL
- PE REAL
- PE INTEGER
Data Parallel Languages
The type of a variable statically encoded its home:
- either on the control unit or on the processing elements.
Apart from restrictions on their home, the INTEGER and REAL types behaved like the corresponding types in ordinary FORTRAN. The CU LOGICAL type was more idiosyncratic:
- it had 64 independent bits that acted as flags controlling activity of the PEs.
Data Parallel Languages
Scalars and arrays of the five types could be declared as in FORTRAN.
- An ordinary variable or array of type CU REAL, for example, would be allocated in the (very small) control unit memory.
- An ordinary variable or array of type PE REAL would be allocated somewhere in the collective memory of the processing elements (accessible by the control unit over the data bus), e.g.
    CU REAL A, B(100)
    PE INTEGER I
    PE REAL D(25), E(1000)
The last data structure available in CFD was a new kind of array called a vector-aligned array.
Data Parallel Languages
Only the first dimension could be distributed, and the extent of that dimension had to be exactly 64. A vector-aligned array would be of PE INTEGER or PE REAL type, and the syntax for the distributed dimension involved an asterisk:
    PE INTEGER J(*)
    PE REAL X(*,4), Y(*,2,8)
These are parallel arrays. J(1) is stored on the first PE, J(2) is stored on the second PE, and so on. Similarly:
- X(1,1), X(1,2), X(1,3), X(1,4) are stored on PE 1
- X(2,1), X(2,2), X(2,3), X(2,4) are stored on PE 2, etc.
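One way to picture this layout is as a list with one local row per PE. The sketch below is only a model of the storage rule described above, not CFD itself; `NUM_PES`, `vector_aligned` and `home_pe` are hypothetical names, and Python's 0-based lists are indexed with `- 1` to mirror the 1-based FORTRAN-style subscripts.

```python
NUM_PES = 64  # the distributed extent had to be exactly 64


def vector_aligned(local_extent, fill=0.0):
    """Model `PE REAL X(*, local_extent)`: one independent local row per PE."""
    return [[fill] * local_extent for _ in range(NUM_PES)]


def home_pe(first_index):
    """1-based number of the PE that stores X(first_index, ...)."""
    if not 1 <= first_index <= NUM_PES:
        raise IndexError("the distributed dimension runs from 1 to 64")
    return first_index


x = vector_aligned(4)      # models PE REAL X(*,4)
x[2 - 1][3 - 1] = 7.5      # X(2,3) lives in the local memory of PE 2
```

All four elements X(2,1) to X(2,4) sit in the same inner list, which stands for the local memory of PE 2.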
Data Parallel Languages
A vector expression was a vector-aligned array with a (*) subscript in the first dimension. Communication between neighbouring PEs was captured by allowing the (*) to have some shift added, as in:
    DIFP(*) = P(* + 1) - P(* - 1)
All shifts were cyclic (end-around) shifts, so this parallel statement is equivalent to the sequential statements:
    DIFP(1) = P(2) - P(64)
    DIFP(2) = P(3) - P(1)
    ...
    DIFP(64) = P(1) - P(63)
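The end-around behaviour of the shifted subscript can be checked with a small model. This is a sketch in plain Python rather than CFD; `shifted` and `central_difference` are hypothetical helper names, and the test data P(i) = i^2 is chosen only so the wrapped elements are easy to verify by hand.

```python
NUM_PES = 64


def shifted(vec, offset):
    """Value each PE sees for vec(* + offset), with cyclic (end-around) wrap."""
    n = len(vec)
    return [vec[(i + offset) % n] for i in range(n)]


def central_difference(p):
    """Model of DIFP(*) = P(* + 1) - P(* - 1)."""
    plus, minus = shifted(p, 1), shifted(p, -1)
    return [a - b for a, b in zip(plus, minus)]


p = [float(i * i) for i in range(1, NUM_PES + 1)]  # P(i) = i^2, 1-based i
difp = central_difference(p)
# difp[0] is DIFP(1) = P(2) - P(64); difp[63] is DIFP(64) = P(1) - P(63)
```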
Data Parallel Languages
Essential flexibility was added by allowing vector assignments to be executed conditionally with a vector test, e.g.
    IF (A(*) .LT. 0) A(*) = -A(*)
Less structured methods of masking operations by explicitly assigning PE activity flags in CU LOGICAL variables were also available;
- there were special primitives for restricting activity to simply-specified ranges of PEs.
- PEs could concurrently access different addresses in their local memory by using vector subscripts:
    DIAG(*) = RHO(*, X(*))
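Both constructs have a simple element-wise reading, sketched below in plain Python. The helper names `masked_negate` and `gather` are hypothetical; the mask list stands in for the per-PE activity flags, and the inner index is treated as 1-based to match the FORTRAN-style subscripts.

```python
def masked_negate(a):
    """Model of IF (A(*) .LT. 0) A(*) = -A(*)."""
    mask = [value < 0 for value in a]  # one activity flag per PE
    # Active PEs perform the assignment; inactive PEs keep their old value.
    return [-value if active else value for value, active in zip(a, mask)]


def gather(rho, x):
    """Model of DIAG(*) = RHO(*, X(*)): PE i reads its own RHO(i, X(i))."""
    return [local_row[index - 1] for local_row, index in zip(rho, x)]
```

In `gather`, every PE uses a different local address, but each one still reads only from its own row, which is why no inter-PE communication is implied.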
Connection Machine (Tucker, IEEE Computer, Aug. 1988)
CM-5
- Repackaged SparcStation
  - 4 per board
- Fat-tree network
- Control network for global synchronization
Whither SIMD machines?
Trade off individual processor performance for collective performance:
- CM-1 had 64K PEs, each 1-bit!
Problems with SIMD
- Inflexible: not all problems can use this style of parallelism
- Cannot leverage off microprocessor technology => cannot be general-purpose architectures
Special-purpose SIMD architectures are still viable (array processors, DSP chips)
Vector Processors
Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
- These are specified as operations on vector registers
- A processor comes with some number of such registers
A vector register holds ~32-64 elements
- The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
The hardware performs a full vector operation in
- #elements-per-vector-register / #pipes steps
[Figure: a scalar add (r3 = r1 + r2) compared with a vector add (vr3 = vr1 + vr2). Logically the vector add performs #elts adds in parallel; the hardware actually performs #pipes adds in parallel.]
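The step count can be illustrated with a toy model of a vector add that processes one group of lanes per step. This is a sketch under simplifying assumptions (no pipelining or startup cost is modelled), and `vector_steps` and `vector_add` are hypothetical names.

```python
import math


def vector_steps(num_elements, num_pipes):
    """Steps for one vector operation: elements per register / pipes, rounded up."""
    return math.ceil(num_elements / num_pipes)


def vector_add(vr1, vr2, num_pipes):
    """Add two vector registers, handling num_pipes elements per step."""
    result, steps = [], 0
    for start in range(0, len(vr1), num_pipes):
        lanes_a = vr1[start:start + num_pipes]
        lanes_b = vr2[start:start + num_pipes]
        result.extend(a + b for a, b in zip(lanes_a, lanes_b))  # one step
        steps += 1
    return result, steps
```

With a 64-element register and 4 pipes, the model takes 16 steps even though the instruction logically describes 64 additions.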
Concept of Vector Processing
A processor that is capable of adding two vectors by streaming the two vectors through a pipelined adder
[Figure: Stream A and Stream B flow from a multiport memory system into a pipelined adder, which returns Stream C = A + B]
The Architecture of a Vector Computer
[Figure: block diagram with a host computer, mass storage and I/O (user); main memory (program and data); a scalar processor (scalar control unit, scalar functional pipelines); and a vector processor (vector control unit, vector registers, vector functional pipelines). Labelled paths: instructions, scalar instructions, vector instructions, scalar data, vector data, control.]
Vector Processors
Advantages
- quick fetch and decode of a single instruction for multiple operations
- the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
- the compiler does the work for you, of course
Memory-to-memory
- no registers
- can process very long vectors, but startup time is large
- appeared in the 70s and died in the 80s
Examples: Cray, Fujitsu, Hitachi, NEC
Vector Processors
What about:
    for (j = 0; j < 100; j++)
        A[j] = B[j] * C[j];
- Scalar code: load, operate, store for each iteration
- Both instructions and data consume memory bandwidth
The solution: a vector instruction
Vector Processors
    A[0:99] = B[0:99] * C[0:99]
- A single instruction requires memory bandwidth for data only.
- No control overhead for loops
Pitfalls
- extension to the instruction set, vector functional units (FUs), vector registers, memory subsystem changes for vectors
Vector Processors
Merits of vector processors
1. Very deep pipeline without data hazards
   - The computation of each result is independent of the computation of previous results
2. Instruction bandwidth requirement is reduced
   - A vector instruction specifies a great deal of work
3. Control hazards are nonexistent
   - A vector instruction represents an entire loop.
   - No loop branch
Vector Processors (Contd)
The high latency of initiating a main memory access is amortized
- A single access is initiated for the entire vector rather than a single word
- Known access pattern
- Interleaved memory banks
Vector operations are faster than a sequence of scalar operations on the same number of data items!
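A back-of-the-envelope model shows why amortizing the startup latency matters. Everything here is an assumption made for illustration: the function names are hypothetical, each scalar access is taken to pay the full latency, and the vector access is taken to deliver one word per cycle after the first, which presumes enough interleaved banks to hide each bank's busy time.

```python
def bank_of(word_address, num_banks):
    """Low-order interleaving: consecutive words fall in consecutive banks."""
    return word_address % num_banks


def scalar_access_cycles(num_words, latency):
    """Assumed model: every word pays the full memory latency."""
    return num_words * latency


def vector_access_cycles(num_words, latency):
    """Assumed model: latency is paid once, then one word arrives per cycle."""
    return latency + (num_words - 1)
```

Under these assumptions, fetching 64 words with a 12-cycle latency costs 768 cycles one word at a time but 75 cycles as a single vector access.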
Vector Programming Example
Y = a * X + Y
RISC machine (the loop repeats 64 times):
          LD     F0, a           ; load scalar a
          ADDI   R4, Rx, #512    ; last address to load
    Loop: LD     F2, 0(Rx)       ; load X(i)
          MULTD  F2, F0, F2      ; a x X(i)
          LD     F4, 0(Ry)       ; load Y(i)
          ADDD   F4, F2, F4      ; a x X(i) + Y(i)
          SD     F4, 0(Ry)       ; store into Y(i)
          ADDI   Rx, Rx, #8      ; increment index to X
          ADDI   Ry, Ry, #8      ; increment index to Y
          SUB    R20, R4, Rx     ; compute bound
          BNZ    R20, Loop       ; check if done
Vector Programming Example (Contd)
Y = a * X + Y
Vector machine: 6 instructions (low instruction bandwidth)
          LD     F0, a           ; load scalar
          LV     V1, Rx          ; load vector X
          MULTSV V2, F0, V1      ; vector-scalar multiply
          LV     V3, Ry          ; load vector Y
          ADDV   V4, V2, V3      ; add
          SV     Ry, V4          ; store the result
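The difference in instruction bandwidth between the two listings can be tallied with a small model. This sketch is plain Python, not a DLX simulator: the function names are hypothetical, and the counts simply follow the listings (2 setup instructions plus 9 per loop iteration for the RISC version, 6 in total for the vector version).

```python
def daxpy_scalar(a, x, y):
    """Y = a*X + Y one element at a time, counting RISC-style instructions."""
    instructions = 2  # LD F0 and the ADDI that computes the loop bound
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + out[i]
        instructions += 9  # LD, MULTD, LD, ADDD, SD, ADDI, ADDI, SUB, BNZ
    return out, instructions


def daxpy_vector(a, x, y):
    """Y = a*X + Y as whole-vector operations, one line per vector instruction."""
    v1 = list(x)                           # LV: load vector X
    v2 = [a * e for e in v1]               # MULTSV: vector-scalar multiply
    v3 = list(y)                           # LV: load vector Y
    v4 = [p + q for p, q in zip(v2, v3)]   # ADDV: vector add
    return v4, 6                           # 4 above plus LD F0 and SV


x = [float(i) for i in range(64)]
y = [1.0] * 64
scalar_result, scalar_count = daxpy_scalar(2.0, x, y)
vector_result, vector_count = daxpy_vector(2.0, x, y)
```

For 64 elements the scalar loop issues 2 + 9 x 64 = 578 instructions, against 6 for the vector version, while both produce the same result.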
A Vector-Register Architecture (DLXV)
[Figure: block diagram showing main memory, a vector load-store unit, an FP add/subtract unit, vector registers, scalar registers and a crossbar]
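Vector Machines
[Table: vector machines compared by number of vector registers, elements per register, load/store units and functional units. Machines listed: CRAY-1, CRAY-2, CRAY X-MP, CRAY Y-MP, CRAY C-90, NEC SX/2, NEC SX/4, Fujitsu VP200, Hitachi S820, Convex C. Only the CRAY Y-MP row survives legibly: 8 registers, 64 elements per register, 2 loads/1 store, 8 functional units.]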
MISD
- Multiple instruction, single data
- Doesn't really exist, unless you consider pipelining an MISD configuration