
1 SIMD Machines
By Przemysław Grzesik & Anna Poczobutt

2 Historic background
Introduced in the 60's (ILLIAC IV, BSP). Problems: not cost effective; effective only for a certain class of problems; serial fraction and Amdahl's law; I/O bottleneck. Overshadowed by vector processors. Resurrected in the 80's (MPP from Goodyear, Connection Machine from Thinking Machines Inc., MP-1 from MasPar). Didn't survive because of high cost.
The history of SIMD machines began with the ILLIAC IV project, started in 1966. The machine was the first large-scale multiprocessor, composed of 64 64-bit processors. The project itself was rather infamous for its failure; estimated costs of $8 million had ballooned to $31 million by 1972. The actual performance of 15 MFLOPS was far below the original estimate of 1000 MFLOPS, partially because only a quarter of the planned machine was ever constructed. In addition, the machine took another three years of engineering to actually work following its delivery to NASA in 1972. Needless to say, the project slowed interest in and investigation of SIMD architectures for quite a while. Eventually, Danny Hillis resurrected the SIMD architecture in 1985 with his Connection Machine. However, following a short stint in the 80's by several commercial companies such as Thinking Machines and MasPar, SIMD once again fell by the wayside in the arena of commercial general-purpose computing.
Amdahl's Law: the overall speedup of a program is strictly limited by the fraction of the execution time spent in code that must run serially, no matter how many processors are applied. If p is the percentage of the time spent (by a serial processor) on parts of the program that can be done in parallel, the theoretical speedup limit is 100 / (100 - p). If s is the percentage of time spent (by a serial processor) on serial parts of the program (thus s + p = 100%), the speedup on N processors is 100 / (s + p/N).
Amdahl's Law Reevaluated: if we use s´ and p´ to represent the percentages of serial and parallel time spent on the parallel system (s´ + p´ = 100%), then we arrive at an alternative to Amdahl's law: the scaled speedup is (s´ + N·p´) / 100.
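To make the two formulas concrete, here is a minimal C sketch (added for illustration, not part of the original slides) that evaluates the fixed-size Amdahl speedup and the scaled speedup for an assumed 95% parallel fraction on 64 processors; the values and function names are arbitrary choices:

    #include <stdio.h>

    /* Fixed-size speedup (Amdahl): s and p are percentages, s + p = 100. */
    static double amdahl_speedup(double p, double n)
    {
        double s = 100.0 - p;
        return 100.0 / (s + p / n);
    }

    /* Scaled speedup: s_ and p_ are percentages measured on the
     * parallel system itself, s_ + p_ = 100. */
    static double scaled_speedup(double p_, double n)
    {
        double s_ = 100.0 - p_;
        return (s_ + n * p_) / 100.0;
    }

    int main(void)
    {
        printf("Amdahl speedup: %.1f\n", amdahl_speedup(95.0, 64.0)); /* ~15.4 */
        printf("Scaled speedup: %.1f\n", scaled_speedup(95.0, 64.0)); /* ~60.9 */
        return 0;
    }

The contrast between the two printed numbers is exactly the point of the "reevaluated" formulation: measuring the serial fraction on the parallel machine itself gives a far more optimistic, scaled figure.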

3 Spring 1977 – Burroughs Scientific Processor
Shared memory architecture. 16 large-granularity 48-bit processing elements, with power of up to 4 MEGAFLOPS each. 8 M words of memory, partitioned into 17 modules. 2 interconnection networks. Vector computation with a clock period of 160 ns. Executes up to 50 MEGAFLOPS. Data & program file transfers at 15 Mbytes/sec. B7700 / B7800 system manager. Scientific applications. Prototype – never got out the door.
The Burroughs Scientific Processor (BSP) [KS 82] was an attempt by Burroughs Corporation to improve on the Illiac IV design. Its shared memory allows the individual PEs to share their memory without going through the control unit. It has 16 arithmetic elements and 17 (a prime number) memory modules interconnected by two alignment networks: full crossbar switches with broadcasting and conflict-resolving ability. This permits general-purpose interconnectivity between the arithmetic array and the memory-storage modules. It is the combined function of the memory-storage scheme and the alignment networks that supports the conflict-free capabilities of the parallel memory. The parallel processors perform vector computation with a clock period of 160 ns. The control processor provides the supervisory interface to the system manager in addition to controlling the parallel processor. The scalar processor processes all operating system and user-program instructions, which are stored in the control memory. It executes some serial or scalar portions of user programs with a clock rate of 12 MHz and is able to perform up to 1.5 megaflops. The BSP is capable of executing up to 50 megaflops and is used mainly for scientific applications.
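The 17 memory modules (one more than the number of PEs, and prime) are what make the conflict-free parallel memory work: as long as a vector's stride is not a multiple of 17, 16 simultaneous accesses always land in 16 different modules. A small C sketch of that property (my illustration, assuming a simple address-mod-module mapping, not BSP hardware logic):

    #include <stdio.h>

    #define MODULES 17   /* prime number of memory modules, as in the BSP */

    int main(void)
    {
        /* For several strides, list the module hit by each of 16
         * consecutive vector elements; the 16 module numbers are
         * all distinct whenever the stride is not a multiple of 17. */
        int strides[] = {1, 2, 4, 8, 16};
        for (int s = 0; s < 5; s++) {
            printf("stride %2d:", strides[s]);
            for (int i = 0; i < 16; i++)
                printf(" %2d", (i * strides[s]) % MODULES);
            printf("\n");
        }
        return 0;
    }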

4 Block diagram of BSP

5 1979 – Massively Parallel Processor
Built by Goodyear Aerospace Corporation to be a high-speed satellite image processing system. 16,896 bit-serial processing elements in a 128-row by 132-column rectangular array with nearest-neighbour connections (plane, cylinder, torus, spiral, linear string). Addition: 430 MOPS; multiplication: 216 MOPS on 32-bit floating-point data.
Like STARAN, the Massively Parallel Processor (MPP) [Bat 82] was also designed and built by Goodyear Aerospace Corporation, starting from 1979, to be a high-speed satellite image processing system. The processor has 16,896 bit-serial processing elements (PEs) arranged in a 128-row by 132-column (4 redundant columns for fault tolerance) rectangular array with strictly nearest-neighbor connections. The edge connection is programmable, so that the array may look like a plane, a cylinder, a torus, a spiral, or a linear string. On 32-bit floating-point data, addition occurs at 430 MOPS and multiplication at 216 MOPS. The staging memory in the input-output path of the array unit acts both as a buffer between the array unit and the outside world, and also reformats data so that both the array unit (bit-serial) and the outside world (word-serial) can transfer data in the optimum format. MPP is a SIMD machine and all PEs perform the same instruction on every machine clock cycle.

6 Applications of MPP: Satellite imagery processing
General image processing, weather simulation, aerodynamic studies, radar processing, reactor diffusion analysis, computer image generation. Although built for satellite imagery processing, preliminary application studies indicate that MPP can also support general image processing, weather simulation, aerodynamic studies, radar processing, reactor diffusion analysis, and computer image generation.

7 1988 - Connection Machine CM-1 & CM-2 Thinking Machines
Danny Hillis, Massachusetts Institute of Technology. Massively parallel 12-dimensional hypercube. The Connection Machine was a series of supercomputers that grew out of Danny Hillis's research in the early 1980s at MIT on alternatives to the traditional von Neumann architecture of computation. The CM-1, developed at MIT, was a "massively parallel" hypercube arrangement of thousands of very simple processors, each with their own RAM. Hillis and Sheryl Handler founded Thinking Machines in Waltham, Massachusetts and assembled a team to develop the CM-2, which depending on the configuration had as many as 64K processors. A later modification added numeric co-processors to the system, with some fixed number of the original simple processors sharing each numeric processor. With the CM-5, Thinking Machines switched from the CM-2's hypercube architecture of simple processors to a fat tree network of RISC processors (Sun SPARCs). The full list of Connection Machine models, in order of when they were introduced: CM-1, CM-2, CM-200, CM-5, CM-5E. The final design, used for both the CM-1 and its faster successor, the CM-2, was a massive, 5-foot-tall cube formed in turn of smaller cubes, representing the 12-dimensional hypercube structure of the network that connected the processors together. This hard geometric object, black, the non-color of sheer, static mass, was transparent, filled with a soft, constantly changing cloud of lights from the processor chips, red, the color of life and energy. It was the archetype of an electronic brain, a living, thinking machine.

8 CM-2 General Specifications
Processors: 65,536 single-bit PEs; 12-cube of 4×4 meshes; 8K bytes of RAM per PE. Memory: 512 Mbytes. Memory bandwidth (Gbits/sec). I/O channels: 8. Capacity per channel (Mbytes/sec). Max. transfer rate (Mbytes/sec). Floating point performance above 2500 MFlops.
Thinking Machines Corporation produced a family of high performance computer systems. The largest member of the family in the late 1980's was the 64,000-processor CM-2, with performance in excess of 2500 MIPS and floating point performance above 2500 MFlops. Its successor, the MIMD CM-5, appeared in the movie "Jurassic Park." Connection Machine systems were used in a range of research and "big science" applications including database retrieval, image processing, computer-aided design and floating-point intensive scientific calculations. Each processor has access to a private 4K memory, allowing the programmer to view the machine as a distributed memory rather than just distributed processors. The processors are all very simple, providing single-bit ADD, AND, OR, MOVE, and SWAP operations to and from memory or one of the single-bit flag registers available. This simplicity results in great speed: 32-bit integer addition has a peak rate close to 2000 MOPS. To make the machine more usable, a micro-controller front-ends the machine, accepting larger "macro-instructions" and breaking these down into "micro-instructions" that it executes to send the "nano-instructions" to the relevant processors which actually do the work.

9 The main idea Parallel computer with
a structure closer to that of the human brain. An extremely flexible and fast communication network between the processors. Problems involving separate but interrelated actions of many similar objects or units – movement of atoms, fluid flow, information retrieval, computer graphics. The graphic of a 3-D hypercube represents the "hard" electrical connections of part of the 12-D network, but inside these hard rectangular boxes are the "fuzzy" software connections that can be changed independently of the physical wires and traces.

10 CM-2 12-D Hypercube The chips are wired together in a network having the shape of a 12-D hypercube – a cube with 2^12 corners, or 4096 chips, each connected to 12 other chips. First 3 dimensions of the hypercube: The Connection Machine was composed of 65,536 bit processors. Each die consisted of 16 processors, each capable of communicating with the others via a switch. These 4,096 dies formed the nodes of a 12-dimensional hypercube network. Thus, a processor was guaranteed to be within 12 hops of any other processor in the machine. The hypercube network also facilitated communication by providing alternative routes from source processor to destination. Each node was given a 12-bit node ID, and different paths between two nodes in the network could be traversed based on how the node ID was read. The network allowed for both packet and circuit-based communication for flexibility.
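The 12-hop guarantee follows directly from the addressing: two chips are wired together exactly when their 12-bit node IDs differ in a single bit, so a message needs one hop per differing bit, and the order in which the differing bits are fixed gives the alternative routes mentioned above. A short C sketch of that arithmetic (an illustration of the idea, not CM-2 routing firmware):

    #include <stdio.h>

    /* Hops between two hypercube nodes = Hamming distance of their
     * 12-bit node IDs; in a 12-cube this is always <= 12. */
    static int hops(unsigned a, unsigned b)
    {
        unsigned diff = a ^ b;   /* bits that still have to be "fixed" */
        int count = 0;
        while (diff) {
            count += diff & 1u;
            diff >>= 1;
        }
        return count;
    }

    int main(void)
    {
        printf("hops(0x000, 0xFFF) = %d\n", hops(0x000, 0xFFF)); /* 12 */
        printf("hops(0x0A3, 0x0A1) = %d\n", hops(0x0A3, 0x0A1)); /*  1 */
        return 0;
    }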

11 4th dimension of hypercube
I drew the 4th-dimension cube to the side of the 3rd-dimensional cube and then applied a radical graphic simplification: I represented all dimensions greater than 3 as thick "hyperlines," and drew cubes as solid objects in order to visually simplify the resultant structures. This symbolic representation has the added benefit of showing at a glance that the structures always repeat themselves: a 4-D hypercube looks just like a 1-D line, except that it connects two cubes rather than two chips. A 5-D hypercube is a square of cubes, and a 6-D hypercube is a cube of cubes.

12 (figure: building hypercubes of higher dimension from cubes)

13 And finally the 12th-dimension hypercube
meaning that each computer chip would be directly wired to 12 other chips in such a way that any two chips (and thereby the 16 processors contained in each chip) could communicate with each other in 12 or fewer steps. This network would enable the rapid and flexible communication between processors that made the CM-2 so effective. Nearest-neighbour communication is provided among the processors in each chip, as well as among the chips.

14 Processing unit The basic operation of the processing element is to read two bits from the external memory and one flag, combine them according to a specified logical operation producing two bits of result, and to write the resulting bits into the external memory and an internal flag, respectively. To handle general communication inside the Connection Machine, special-purpose routers are used. Each router handles messages for 16 processing cells. The Connection Machine is hosted by a front-end computer (usually a SUN-4). The host talks to the processing cells through a microcontroller that acts as a bandwidth amplifier between the host and the processing cells. Because the processing cells are able to execute instructions at a higher rate than the host is able to specify, the host specifies higher-level macroinstructions, which are interpreted by the microcontroller to produce the nanoinstructions for the processing cells. Upon receiving an instruction, a processing unit can choose to execute it or not, depending on the current state of its flags.
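The basic PE operation just described can be pictured in a few lines of C. This is a conceptual sketch only (the truth-table encoding and names are mine, not the CM-2's actual logic): two memory bits and a flag bit go in, an 8-entry truth table selects the logical operation, and a memory-result bit and a flag-result bit come out.

    #include <stdint.h>
    #include <stdio.h>

    /* Conceptual model of one CM-2 processing-element step: combine two
     * memory bits and one flag bit through an 8-entry truth table,
     * producing one memory-result bit and one flag-result bit. */
    typedef struct {
        uint8_t mem_table;   /* bit i = memory result for input pattern i */
        uint8_t flag_table;  /* bit i = flag   result for input pattern i */
    } op_t;

    static void pe_step(const op_t *op, int a, int b, int flag_in,
                        int *mem_out, int *flag_out)
    {
        int idx = (a << 2) | (b << 1) | flag_in;   /* 3 input bits -> 0..7 */
        *mem_out  = (op->mem_table  >> idx) & 1;
        *flag_out = (op->flag_table >> idx) & 1;
    }

    int main(void)
    {
        /* A full adder expressed as truth tables: memory result = sum bit,
         * flag result = carry bit (the flag acts as carry-in/carry-out). */
        op_t add = { .mem_table = 0x96, .flag_table = 0xE8 };
        int sum, carry;
        pe_step(&add, 1, 1, 0, &sum, &carry);
        printf("1 + 1 + 0 -> sum=%d carry=%d\n", sum, carry);  /* 0, 1 */
        return 0;
    }

Programming the tables as a full adder, as in main, also shows how bit-serial arithmetic is built up: a 32-bit add is simply 32 such steps, with the flag carrying the carry from one bit position to the next.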

15 Programming CM-2 When solving large problems on the CM-2, the
programmer deals with virtual processors that are mapped onto real processors. This is achieved by dividing the storage of a physical processor into smaller portions, so that many virtual processors share one physical processor. The CM-2 can be programmed in FORTRAN, C*, LISP*, and CM assembly language. Several languages have been ported to the Connection Machine. Among these, LISP and C have been extended to LISP* and C* [41], which allow parallel constructs. C* allows data structures to be spread over a set of processors, so operations on the data structures can be done concurrently, as in the case of matrix summation. C* has proved to be so useful for parallel programming with its high-level constructs that it is now in the process of being ported to other parallel architectures. One of its most useful features is that it allows the programmer to virtualize a general matrix machine with an arbitrary number of processors.
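The virtual-processor mechanism described above amounts to each physical PE looping over the virtual PEs it hosts, with its memory split into one slice per virtual PE. Below is a rough C simulation of the concept (the sizes, names and data are arbitrary examples of mine, not CM-2 system code):

    #include <stdio.h>

    #define PHYS_PES   8     /* physical processors (tiny example)        */
    #define VP_RATIO   4     /* virtual processors per physical processor */
    #define VIRT_PES   (PHYS_PES * VP_RATIO)

    /* Each physical PE's memory is divided into VP_RATIO slices,
     * one slice per virtual processor it emulates. */
    static int mem_a[PHYS_PES][VP_RATIO];
    static int mem_b[PHYS_PES][VP_RATIO];
    static int mem_c[PHYS_PES][VP_RATIO];

    int main(void)
    {
        /* Give every virtual PE some data: a = vp, b = 2 * vp. */
        for (int vp = 0; vp < VIRT_PES; vp++) {
            mem_a[vp % PHYS_PES][vp / PHYS_PES] = vp;
            mem_b[vp % PHYS_PES][vp / PHYS_PES] = 2 * vp;
        }

        /* One "parallel" instruction c = a + b: every physical PE
         * repeats the operation once per virtual PE it hosts. */
        for (int slice = 0; slice < VP_RATIO; slice++)
            for (int pe = 0; pe < PHYS_PES; pe++)   /* conceptually simultaneous */
                mem_c[pe][slice] = mem_a[pe][slice] + mem_b[pe][slice];

        printf("virtual PE 13: c = %d\n", mem_c[13 % PHYS_PES][13 / PHYS_PES]); /* 39 */
        return 0;
    }

From the programmer's point of view there are simply 32 virtual PEs; the extra inner repetition and the memory slicing are hidden, which is exactly what the slide means by virtual processors being "mapped onto real processors".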

16 MasPar MP-1 Maximum of 16,384 PEs; PEs contain a 4-bit ALU
Each PE is a RISC processor with dedicated data memory and execution logic, and operates at 1.6 MIPS (32-bit integer add) and 73 KFLOPS (average of 32-bit floating point multiply and add). The aggregate PE array data memory ranges from 16 MB to 1 GB. Performance: up to about 26,000 integer MIPS (32-bit add) and 1,200 single-precision MFLOPS (average of add and multiply) for a full 16,384-PE configuration. 128×128 array (8-connected). Multistage router. Compilers: MPL (C with extensions), MPF (Fortran 90-like with extensions).
A Digital Equipment Corporation MasPar MP-1 computer with 4096 processors is used for software implementation of various types of computer arithmetic for integer, rational, real and complex arithmetic. The systems implemented (or, in some cases, to be implemented) include both conventional and novel number representations and arithmetic systems. The MasPar system is a SIMD array of 4096 processors configured as a square array with toroidal wraparound in both directions. The individual processors are just 4-bit processors, so all arithmetic is implemented in software. Like any SIMD architecture, at any instant all processors are either performing the same instruction or are inactive. Clearly, for example, adding two matrices is a particularly simple operation for this machine. Matrix multiplication is less straightforward but is still well-suited to the array. The MasPar Programming Language (MPL) is an extended version of ANSI C allowing for plural variables, which are variables for which there is an instance on each processor (more precisely, in each processor's individual memory). Communication between the various processors and their memories is achieved either through the Xnet (which is designed for neighboring communication in each of the North, South, East and West directions) or the router, which handles more distant communications. The bandwidth of the Xnet is 16 times that of the router. MPF (MasPar Fortran) is a version of High Performance Fortran, HPF, which again includes the appropriate array constructs and communication instructions.

17 MasPar MP-1 system block diagram
There are 16,384 PEs (Processing Elements) in a fully configured MasPar MP-1 or MP-2. Each PE has an ALU, a large register file, and either 16K or 64K bytes of memory. The system is microcoded so that each PE appears to be a RISC processor supporting operations on 1, 8, 16, 32, and 64-bit data. Communication between PEs is supported by direct connections from each PE to its 8 neighbors in the 128 by 128 array of PEs -- but there is also a global router multistage interconnection network, so that each PE can directly communicate with any other PE. In addition, the array has access through the router and a shared I/O memory buffer to a variety of external I/O devices ranging from HiPPI interfaces to RAID storage. Controlling the PE array is the ACU (Array Control Unit), a high-performance 32-bit RISC processor which also can perform scalar computations. The ACU is, in turn, controlled by the front-end machine, either an R3000-based DEC 5000 running Ultrix or a DEC Alpha running OSF. To the front-end machine, the MasPar is essentially seen as a peripheral. Hence, the front-end serves primarily as a user interface -- editing, compiling, debugging, etc., are supported by the front-end. The MP-1 supports programming in C and Fortran.
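The 8-neighbor connectivity works for all 16,384 PEs because the 128×128 grid wraps around toroidally, so even edge and corner PEs have a full set of neighbors. A tiny C sketch of the wraparound index arithmetic (an illustration of the topology, not MasPar system code):

    #include <stdio.h>

    #define ROWS 128
    #define COLS 128

    /* Toroidal wraparound: stepping off one edge of the 128 x 128 PE
     * grid re-enters on the opposite edge. */
    static int wrap(int v, int n) { return ((v % n) + n) % n; }

    static int pe_id(int row, int col)
    {
        return wrap(row, ROWS) * COLS + wrap(col, COLS);
    }

    int main(void)
    {
        int row = 0, col = 127;   /* a PE in the top-right corner */
        const int dr[8] = {-1, -1, -1,  0, 0,  1, 1, 1};
        const int dc[8] = {-1,  0,  1, -1, 1, -1, 0, 1};

        for (int k = 0; k < 8; k++)   /* its 8 neighbors, Xnet style */
            printf("neighbor %d: PE %d\n", k, pe_id(row + dr[k], col + dc[k]));
        return 0;
    }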

18 1993 – CAM-8, MIT Laboratory for Computer Science, Cambridge, Massachusetts
Cellular automata machine. Simulating physical systems using lattice-gas-like dynamics; 2-D & 3-D image processing; large logic simulations. An indefinitely scalable 3-D mesh-network multiprocessor optimized for large, inexpensive simulations. U.S. Air Force 3-D elastic solid simulation developed on CAM-8; the simulation space is 1/8 billion lattice sites. Intended uses of this CA machine include physical simulations (e.g., fluid flow, chemical reactions, polymer dynamics), two- and three-dimensional image processing (e.g., document reading, medical imaging from 3-D data), and large logic simulations (including the simulation of highly parallel CA machines).

19 System diagram The final machine of the talk is the CAM-8 cellular automata machine, a machine that mimics the basic spatial locality of physics. Thus, each processing element gets its own "piece" of the physical entity that is being modeled, such as a location in space or a particle. The processing elements are connected in a 3-D mesh, a natural topology for describing microphysical simulations. Each processing element consists of a programmable lookup table with an associated local memory. Since there are usually not enough processing elements to assign one to each cell of the system to be simulated, each PE is assigned a specific region and performs updates to each cell virtually. On the left is a single hardware module, the elementary "chunk" of the architecture. On the right is an indefinitely extendable array of modules (drawn for convenience as two-dimensional; the array is normally three-dimensional). In the diagram, the solid lines between modules indicate a local mesh interconnection. These wires are used for spatial data movement. There is also a tree network (not shown) connecting all modules to the front-end workstation that controls the CA machine. The workstation uses this tree to broadcast simulation parameters to some or all modules, and to read back data from selected modules.

20 Data movement Each PE consists of a programmable lookup table
with an associated local memory. Updating is by table lookup – data comes out of the cell array, is passed through a lookup table, and is put back exactly where it came from. Time-sharing of communication resources reduces the number of interprocessor wires. Instead of having a fixed neighbourhood, data are shifted around in space.
The time-sharing of communication resources reduces the number of interprocessor wires dramatically. The time-sharing of processors allows a highly efficient "assembly-line" processing of spatial data, in which exactly the same operations are repeated for every spatial site in a predetermined order. From the viewpoint of the programmer, this virtualization of the spatial sites is not apparent: you simply program the local dynamics in a uniform CA space. Updating is by table lookup. Data comes out of the cell array, is passed through a lookup table, and is put back exactly where it came from (Figure 1a). The lookup tables are double-buffered, so that the front-end workstation can send a new table while the CAM modules are busy using the previous table to update the space. The machine is based on the kind of data partitioning characteristic of lattice-gas models. Instead of having a fixed set of neighborhood data visible from each site, we shift data around in our space in order to communicate information from one place to another. Information fields move uniformly in various directions, each carrying corresponding bits from every spatial site along with them: in two dimensions think of uniformly shifting bit-planes, in higher dimensions bit-hyperplanes. Interactions act separately on the data that land at each lattice point, transforming this block of data into a new set of bits. If some piece of data needs to be sent in two directions, the interaction must make two copies.
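The shift-then-lookup update cycle described above can be captured in a few lines of C. Below is a toy one-dimensional illustration of the scheme (my own sketch of the idea, not CAM-8 code): two bit fields stream in opposite directions, and after each shift every site's bits are replaced through a small lookup table.

    #include <stdio.h>

    #define SITES 16

    /* Toy 1-D lattice gas in the CAM-8 style: one bit field moves right,
     * one moves left; after the shift, each site's 2-bit state is
     * replaced by a value looked up in a 4-entry table. */
    static unsigned char right_f[SITES], left_f[SITES];

    static void step(const unsigned char table[4])
    {
        unsigned char r2[SITES], l2[SITES];

        /* Phase 1: data movement - shift whole fields (periodic boundaries). */
        for (int i = 0; i < SITES; i++) {
            r2[(i + 1) % SITES] = right_f[i];
            l2[(i + SITES - 1) % SITES] = left_f[i];
        }

        /* Phase 2: interaction - table lookup applied independently per site. */
        for (int i = 0; i < SITES; i++) {
            unsigned char in  = (unsigned char)((r2[i] << 1) | l2[i]);
            unsigned char out = table[in];
            right_f[i] = (out >> 1) & 1;
            left_f[i]  = out & 1;
        }
    }

    int main(void)
    {
        /* Identity table: the two fields pass through each other unchanged. */
        const unsigned char table[4] = {0, 1, 2, 3};
        right_f[1] = 1;    /* one right-moving particle */
        left_f[12] = 1;    /* one left-moving particle  */

        for (int t = 0; t < 4; t++)
            step(table);

        for (int i = 0; i < SITES; i++)
            putchar(right_f[i] ? 'R' : left_f[i] ? 'L' : '.');
        putchar('\n');     /* the R is now at site 5, the L at site 8 */
        return 0;
    }

With a different table, the two fields could be made to interact when they land on the same site, which is exactly how lattice-gas collision rules are programmed on this kind of machine.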

21 Some more information
Can directly accumulate and format data for a real-time video display with CA rules. Uses a Sun workstation; the programming environment is based on a commercially available Forth interpreter. The X-Windows utility XCAM allows users to display VGA output from CAM-8 in a window on the host workstation.

22 Systolic Array Computers
Definition: A systolic array is a network of processors that rhythmically compute and pass data through the system. Analogy with the human circulatory system: heart => global memory, network of veins => array of processors and links.

23 Main features One host computer provides control information and access to peripherals. Performance is determined by pipelining data concurrently with the (multi)processing: data and computational pipelining. Only the processors at the beginning and the end of the chains are able to access the main memory; none of the processing elements has any local memory.

24 Chain of Processors in a systolic array
VLSI enables inexpensive special-purpose chips. Represent algorithms directly by chips connected in a regular pattern; replace a single processor with an array of regular processing elements. Orchestrate data flow for high throughput with less memory access.

25 Types of technology made up of a single type of processing element,
usually wired into two-dimensional grids, with only adjacent elements able to communicate directly

26 How does it work? The name is derived from „systole” – the medical term for the heart contraction that pumps blood. On every beat of the global system clock, each processor passes its results to the next processor in the chain and receives another data item from the previous processor in the chain. There are no complex multi-cycle instructions.

27 Synchronous; no master controller (control is effectively distributed across the network). Systolic arrays are best suited to algorithms with relatively good regularity, like some of those used in image processing, signal processing, certain matrix operations, etc.

28 system management, software development
Their use can greatly increase the overall speed of computing tasks for which they are well matched. The fixed connections among the component elements of systolic arrays limit the scope of their applications; they are not well suited to general computations such as system management or software development.

29 Systolic array programming
Each operation includes: the type of computation (addition, multiplication, division, etc.); the input data link (north, south, east, west, or internal register); the output data link; and an additional specification, the time scheduling of when an operation in the PE actually occurs (not required in a wavefront array).
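To make those four ingredients concrete, here is a small C simulation (my own sketch, not taken from the slides) of the classic systolic matrix multiplication: on every clock beat each PE takes an operand from its west and north links, multiplies and accumulates into its local result, and forwards the operands east and south; the skewed input timing plays the role of the scheduling specification.

    #include <stdio.h>

    #define N 3

    int main(void)
    {
        /* Example operands (arbitrary small matrices). */
        int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
        int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};

        int C[N][N] = {0};   /* each PE(i,j) accumulates C[i][j]        */
        int h[N][N] = {0};   /* value currently on PE(i,j)'s east link  */
        int v[N][N] = {0};   /* value currently on PE(i,j)'s south link */

        /* One iteration = one beat of the global clock.  Row i of A enters
         * from the west skewed by i beats; column j of B enters from the
         * north skewed by j beats; results stay in place in the PEs. */
        for (int t = 0; t <= 3 * N - 3; t++) {
            int nh[N][N], nv[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    nh[i][j] = (j == 0)
                        ? ((t >= i && t < i + N) ? A[i][t - i] : 0)
                        : h[i][j - 1];
                    nv[i][j] = (i == 0)
                        ? ((t >= j && t < j + N) ? B[t - j][j] : 0)
                        : v[i - 1][j];
                    C[i][j] += nh[i][j] * nv[i][j];   /* multiply-accumulate */
                }
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) { h[i][j] = nh[i][j]; v[i][j] = nv[i][j]; }
        }

        for (int i = 0; i < N; i++) {      /* prints the product A x B */
            for (int j = 0; j < N; j++)
                printf("%4d", C[i][j]);
            printf("\n");
        }
        return 0;
    }

After 3N - 2 beats every PE(i, j) holds C[i][j], and the printed matrix matches an ordinary A × B product.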

30 Programming languages
Occam
Multiplication example:
    CHAN vertical[n*(n+1)]:
    CHAN horizontal[n*(n+1)]:
    PAR i = [0 FOR n]
      PAR j = [0 FOR n]
        mult(vertical[(n*i)+j], vertical[(n*i)+j+1],
             horizontal[(n*i)+j], horizontal[(n*(i+1))+j])
MDFL - Matrix Data Flow Language

31 Examples of systolic arrays
CMU's iWarp processor, manufactured by Intel in 1986: a linear array of processors connected by data buses going in both directions

32 MATRIX-1 COMPUTER Developed by Saxpy in 1987 Linear array
composed of 32 elements; 32-bit peak performance of 1 GFLOPS

33 980-STAR COMPUTER China's first systolic computer
Developed in 1989 by China State Shipbuilding Corp's Wuhan Institute 709; a 2-dimensional programmable array composed of 16 elements (4×4)

34 Warp computers use Unix workstations as a host connected via a network (TCP/IP)
Matrix-1 used a special high-speed channel for connection to a VAX. The 980-STAR's host is an Intel System 310 running iRMX86 and is connected via Multichannel.

