1 The Case for Architectural Diversity
Burton Smith, Cray Inc.

2 We’ve had some diversity in the past
ACRI Alliant American Supercomputer Ametek Applied Dynamics Astronautics BBN CDC Convex Cray Computer Culler-Harris Culler Scientific Cydrome Dana/Ardent/Stellar/Stardent Denelcor Elxsi ETA Systems Evans and Sutherland Computer Division Floating Point Systems Galaxy Goodyear Aerospace Gould Guiltech Intel Scientific Computers International Parallel Machines Kendall Square Research Key Computer Laboratories MasPar Meiko Multiflow Myrias Numerix Saxpy Scientific Computer Systems Supertek Supercomputer Systems Inc Thinking Machines Vitesse

3 Today, there is less of it
Cray
NEC
Hitachi
Cluster suppliers
Do-it-yourself cluster builders
Do-it-yourself grid builders

4 Two basic types of supercomputers
Cluster and grid systems (Type T)
  Prices based on Transistor cost
  Performance characterized by Linpack
  Low bandwidth interconnection networks
  Off-the-shelf processors
Tightly coupled systems (Type C)
  Prices based on Connection cost
  Performance characterized by sparse MV multiply
  High bandwidth interconnection networks
  Custom processors
Each type is adapted to its ecological niche
  What are these niches?
  What are these adaptations?

5 Supercomputer niches
Type T:
  Local data access
  Well-balanced workloads
  Dense linear algebra
  Explicit methods
  Domain decomposition
  Non-adaptive meshes
  Regular meshes
  Slowly varying data bases
Type C:
  Global data access
  Poorly balanced workloads
  Sparse linear algebra
  Implicit methods
  Operator decomposition
  Adaptive meshes
  Irregular meshes
  Rapidly varying data bases
Many disciplines span both columns
  They may want to employ both types of system

6 Supercomputer adaptations
Adaptation is visible in several areas, including:
  Latency tolerance
  Cooling and packaging
  Message passing styles
But first, a few words about a few words

7 Bandwidth, overhead, and latency
In the LogP model, well-known in computer science:
  L is the network transport latency
  o is the processor overhead
  g is the reciprocal bandwidth (the "gap")
  P is the number of processors
Time(size) = size·g + 2o + L
[Timing diagram: space vs. time, showing overhead o at each end, network transport L, and transfer time size·g]
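
As a rough numeric illustration of this formula, here is a minimal C sketch; the values chosen for g, o, and L are invented placeholders, not measurements of any real system:

```c
#include <stdio.h>

/* LogP cost of one transfer of 'size' bytes:
   Time(size) = size*g + 2*o + L
   (g = reciprocal bandwidth, o = per-end overhead, L = transport latency). */
static double logp_time(double size, double g, double o, double L) {
    return size * g + 2.0 * o + L;
}

int main(void) {
    double g = 1.0e-9;   /* 1 ns/byte, i.e. roughly 1 GB/s (placeholder) */
    double o = 5.0e-6;   /* 5 us software overhead per end (placeholder) */
    double L = 2.0e-6;   /* 2 us network transport latency (placeholder) */
    printf("8 B message:  %g s\n", logp_time(8.0, g, o, L));
    printf("1 MB message: %g s\n", logp_time(1.0e6, g, o, L));
    return 0;
}
```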

8 “Latency” has several meanings
It means 2o + L for some, L for others Each is a legitimate latency, but for different subsystems Some want it to mean sizeg + 2o + L This is not so useful We should at least try to get our names straight I will use the LogP definitions

9 Latency tolerance (latency hiding)
Latency can be tolerated by using parallelism
A new transmission can start after waiting max(size·g, o)
LTTime(n, size) = (n - 1)·max(size·g, o) + size·g + 2o + L
[Timing diagram: successive transmissions overlapped in time, each with its own overhead and network transport]

10 What latency tolerance buys us
It depends on the relative magnitudes of size·g, o, and L
nTime(size) = n(size·g + 2o + L)
LTTime(n, size) = (n - 1)·max(size·g, o) + size·g + 2o + L
If size·g >> 2o + L we are "bandwidth bound"
  n-fold latency tolerance saves a mere (n - 1)(2o + L)
  This only gets significant for large n
If o >> size·g + L we are "overhead bound"
  n-fold latency tolerance saves about (n - 1)o
  This will roughly halve the time
  Unequal overheads at sender and receiver make it worse
If L >> size·g + 2o we are "latency bound"
  n-fold latency tolerance saves approximately (n - 1)L
  This is roughly an n-fold time improvement
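
A small numeric sketch of the three regimes, comparing n serialized transfers against the latency-tolerant pipeline from the previous slide. The three parameter sets are invented purely to make one term dominant in each case:

```c
#include <stdio.h>

/* n back-to-back transfers vs. n latency-tolerant (pipelined) transfers,
   using the two formulas on this slide. */
static double serial(double n, double sg, double o, double L) {
    return n * (sg + 2.0 * o + L);
}
static double tolerant(double n, double sg, double o, double L) {
    double gap = (sg > o) ? sg : o;              /* max(size*g, o) */
    return (n - 1.0) * gap + sg + 2.0 * o + L;
}

int main(void) {
    double n = 10.0;
    struct { const char *name; double sg, o, L; } cases[] = {
        { "bandwidth bound", 100.0,   1.0,   1.0 },  /* size*g >> 2o + L */
        { "overhead bound",    1.0, 100.0,   1.0 },  /* o >> size*g + L  */
        { "latency bound",     1.0,   1.0, 100.0 }   /* L >> size*g + 2o */
    };
    for (int i = 0; i < 3; i++) {
        double s = serial(n, cases[i].sg, cases[i].o, cases[i].L);
        double t = tolerant(n, cases[i].sg, cases[i].o, cases[i].L);
        printf("%-16s serial %7.0f  tolerant %7.0f  speedup %.1fx\n",
               cases[i].name, s, t, s / t);
    }
    return 0;
}
```

With n = 10 the printed speedups come out near 1x, 2x, and n-fold respectively, matching the slide's three cases.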

11 Aside: does message size vary with P?
Let's take PDEs as an example, and assume:
  We have three space dimensions and one time dimension
  We need 16 times the processors to double the resolution
  Each processor gets half as many spatial mesh points
    If the processors are also faster, maybe somewhat more
For nearest-neighbor communication, the size shrinks
  Perhaps to 0.5^(2/3) = 0.63 or 0.5^(1/3) = 0.79 of its former value
For all-to-all communication, e.g. in a spectral method, the size shrinks to 1/32 of its former value
  There are half as many points per processor and sixteen times as many processors to distribute them among
Your mileage will vary, and it will probably get worse
Supercomputer users usually spend P for time, not space
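
A small sketch of the arithmetic behind these ratios (compile with -lm). The exponents follow the slide's reasoning; which one applies depends on how the domain is decomposed:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Scenario assumed on the slide: doubled resolution in 3 space + 1 time
       dimension, 16x the processors, so each processor holds half as many
       spatial mesh points as before. */
    const double points_ratio = 0.5;   /* mesh points per processor */
    const double proc_ratio   = 16.0;  /* number of processors      */

    /* Nearest-neighbor messages scale with the subdomain surface; the
       exponent depends on how the decomposition shrinks the faces. */
    printf("nearest-neighbor: %.2f or %.2f of former size\n",
           pow(points_ratio, 2.0 / 3.0),        /* 0.63 */
           pow(points_ratio, 1.0 / 3.0));       /* 0.79 */

    /* All-to-all (e.g. a spectral transpose): half as much local data
       split among 16x as many partners. */
    printf("all-to-all: %.4f of former size (1/32)\n",
           points_ratio / proc_ratio);
    return 0;
}
```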

12 Latency tolerance in summary
It uses parallelism to reduce total transmission time
  It is basically just pipelined data transport
It is most needed when size·g is relatively small
  either because of small size or small g (high bandwidth)
  When size·g is large, it tolerates latency without help
It is not particularly effective when overhead is high
When both o and size·g are small, it works well
  Vector memory references
  Multithreaded memory references
  Long-haul, high speed ATM packet transmission
  Highway traffic (without toll booths, customs, etc.)
The bottom line: latency tolerance is a type C thing
  It doesn't matter so much for type T systems

13 Cooling and packaging
All type T supercomputers are air cooled
  This makes them voluminous
  Access for service is simple
  Unfortunately, the interconnecting cables are long
Most type C supercomputers are liquid cooled
  This lets them be compact
  Interconnecting cables are shorter
    At high bandwidth, cable volume varies as length^3
  Unfortunately, access for service is more complex
Each type is pretty well adapted
Environmental forces that might cause re-adaptation:
  Higher power in future off-the-shelf chips
  Low-cost optical interconnect

14 Message passing styles
In the usual type T system, software builds and unbuilds the messages and hardware transports them
  o is several microseconds, typically much greater than L
  A small g is futile unless size is pretty large
  The user program is involved at both ends
  The software can adapt to pretty much any old hardware
In most type C systems, hardware can build and unbuild the messages as well as transport them
  o is small, typically less than L
  A small g is therefore worthwhile
  The user program need only be involved at one end
  The hardware must be suited to the messaging interface
There are a few single-sided messaging interfaces

15 MPI-2 PUT and GET to remote WINDOWs in a process group
The WINDOW is typically atop an array or common block
  Each WINDOW instance can have a different origin and size
  Window handle, processor number, and offset are args.
WIN_FENCE is the barrier operation
Stride and gather/scatter are controlled by MPI types
  e.g. CONTIGUOUS, VECTOR, INDEXED
  The type must be set up and made known beforehand
  Types can be represented differently in heterogeneous systems and MPI will (hopefully) take care of it
There are several atomic memory accumulate functions
There are many collective communication functions
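
A minimal single-sided sketch using the standard MPI-2 window calls. The one-double window per rank and the ring neighbor pattern are assumptions made just for illustration:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double local = 0.0, mine;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Expose one double per process as a WINDOW ("atop an array"). */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    mine = (double)rank;
    MPI_Win_fence(0, win);                 /* WIN_FENCE opens the epoch     */
    /* PUT my rank into the window of my right-hand neighbor; only this
       side of the transfer appears in user code (single-sided). */
    MPI_Put(&mine, 1, MPI_DOUBLE,
            (rank + 1) % nprocs,           /* target processor number       */
            0,                             /* offset into its window        */
            1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* WIN_FENCE completes the epoch */

    printf("rank %d received %g\n", rank, local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```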

16 Shmem
Remote data must be SYMMETRIC, i.e. the virtual address must be the same in all nodes
  (TASK)COMMON and C statics are OK
  Stack variables can be forced SYMMETRIC
There are BARRIER and BARRIER_ALL operations
Types or explicit widths (8-128) specify transfer quanta
Vector transfers may be unit stride or constant stride
  Gather/scatter is only available on UNICOS/mk
There are some incompatibilities among UNICOS, UNICOS/mk, and IRIX
There are several atomic memory accumulate functions
There are a few collective communication functions
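
A comparable sketch of the same neighbor exchange in shmem. The classic Cray-style header and start_pes() initialization are assumed here; other vintages spell some of these names differently:

```c
#include <mpp/shmem.h>
#include <stdio.h>

/* A C static is SYMMETRIC: it has the same virtual address on every PE. */
static double local;

int main(void) {
    start_pes(0);
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    double mine = (double)me;
    /* Put one double into the symmetric 'local' of the right-hand neighbor;
       no receive call is needed on the other side. */
    shmem_double_put(&local, &mine, 1, (me + 1) % npes);
    shmem_barrier_all();           /* BARRIER_ALL completes the exchange */

    printf("PE %d received %g\n", me, local);
    return 0;
}
```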

17 Co-array Fortran 95 (and UPC)
Roughly, UPC is to C as co-array Fortran is to Fortran
These languages have two kinds of subscripts
  A(I)[J] roughly means A(I) on image J
  If J exceeds P, the locations are distributed cyclically
There may be any number of threads per image
There are SYNC_ operations for images
There are nameless critical sections
Fortran 95 has reductions and other collective ops
Fortran 95 already has a forall, and UPC added one

18 Single-sided implementations
Several builders of type T systems are getting on board
  IBM for shmem, several for UPC
  DOE's Office of Science is funding open-source versions
Why is this adaptive type T behavior?
  Off-the-shelf network hardware now has some support
  Reducing overhead saves time in type T systems

19 Conclusions
There are two principal types of supercomputer
  Should there be more? Will there be?
These two types are adapted to different niches
  And the niches are important
Picking the wrong type of supercomputer wastes money
  by paying for unused transistors or connectivity
The great supercomputer debate of the '90s is over
  so let's move on

