1 ©2003 Dror Feitelson Parallel Computing Systems Part I: Introduction Dror Feitelson Hebrew University

2 ©2003 Dror Feitelson Topics Overview of the field Architectures: vectors, MPPs, SMPs, and clusters Networks and routing Scheduling parallel jobs Grid computing Evaluating performance

3 ©2003 Dror Feitelson Today (and next week?) What is parallel computing Some history The Top500 list The fastest machines in the world Trends and predictions

4 ©2003 Dror Feitelson What is a Parallel System? In particular, what is the difference between parallel and distributed computing?

5 ©2003 Dror Feitelson What is a Parallel System? Chandy: it is related to concurrency. In distributed computing, concurrency is part of the problem. In parallel computing, concurrency is part of the solution.

6 ©2003 Dror Feitelson Distributed Systems Concurrency because of physical distribution –Desktops of different users –Servers across the Internet –Branches of a firm –Central bank computer and ATMs Need to coordinate among autonomous systems Need to tolerate failures and disconnections

7 ©2003 Dror Feitelson Parallel Systems High-performance computing: solve problems that are too big for a single machine –Get the solution faster (weather forecast) –Get a better solution (physical simulation) Need to parallelize the algorithm Need to control overhead Can we assume a friendly system?

8 ©2003 Dror Feitelson The Convergence Use distributed resources for parallel processing Networks of workstations – use available desktop machines within organization Grids – use available resources (servers?) across organizations Internet computing – use personal PCs across the globe (SETI@home)

9 ©2003 Dror Feitelson Some History

10 ©2003 Dror Feitelson Early HPC Parallel systems in academia/research –1974: C.mmp –1974: Illiac IV –1978: Cm* –1983: Goodyear MPP

11 ©2003 Dror Feitelson Illiac IV 1974 SIMD: all processors do the same Numerical calculations at NASA Now in Boston computer museum

12 ©2003 Dror Feitelson The Illiac IV in Numbers 64 processors arranged as an 8 × 8 grid Each processor has 10^4 ECL transistors Each processor has 2K 64-bit words (total is 8 Mbit) Arranged in 210 boards Packed in 16 cabinets 500 Mflops peak performance Cost: $31 million

13 ©2003 Dror Feitelson Sustained vs. Peak Peak performance: product of clock rate and number of functional units Sustained rate: what you actually achieve on a real application Sustained is typically much lower than peak –Application does not require all functional units –Need to wait for data to arrive from memory –Need to synchronize –Best for dense matrix operations (Linpack) A rate that the vendor guarantees will not be exceeded
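As a rough worked example, take the Cray 1 figures from a later slide (80 MHz clock, 160 Mflops peak) and assume one floating-point add pipe and one multiply pipe, each able to retire one result per clock:

  peak  =  clock rate × results per clock  =  80 MHz × 2  =  160 Mflops

A real code that cannot keep both pipes busy on every cycle, or that stalls waiting for memory, sustains only a fraction of this.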

14 ©2003 Dror Feitelson Early HPC Parallel systems in academia/research –1974: C.mmp –1974: Illiac IV –1978: Cm* –1983: Goodyear MPP Vector systems by Cray and Japanese firms –1976: Cray 1 rated at 160 Mflops peak –1982: Cray X-MP, later Y-MP, C90, … –1985: Cray 2, NEC SX-2

15 ©2003 Dror Feitelson Cray’s Achievements Architectural innovations –Vector operations on vector registers –All memory is equally close: no cache –Trade off accuracy and speed Packaging –Short and equally long wires –Liquid cooling systems Style

16 ©2003 Dror Feitelson Vector Supercomputers Vector registers store vectors of values for fast access Vector instructions operate on whole vectors of values –Overhead of instruction decode only once per vector –Pipelined execution of instruction on vector elements: one result per clock tick (at least after the pipeline is full) –Possible to chain vector operations: start feeding the second functional unit before the first one finishes
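A sketch of what this means for a simple loop. The C source is ordinary scalar code; the vector instructions in the comment are illustrative names only, not an actual Cray instruction set:

  /* DAXPY-style loop: y = a*x + y */
  void daxpy(int n, double a, const double *x, double *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }

  /* A vectorizing compiler processes the loop in strips of (say) 64
   * elements, the length of one vector register:
   *
   *   VLOAD  V1, x[i .. i+63]    ; one instruction decode covers 64 elements
   *   VMUL   V2, a, V1           ; pipelined: one product per clock tick
   *   VLOAD  V3, y[i .. i+63]
   *   VADD   V4, V2, V3          ; chained: consumes products as they emerge
   *   VSTORE y[i .. i+63], V4
   */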

17 ©2003 Dror Feitelson Cray 1 1975 80 MHz clock 160 Mflops peak Liquid cooling World’s most expensive love seat Power supply and cooling under the seat Available in red, blue, black… No operating system

18 ©2003 Dror Feitelson Cray 1 Wiring Round configuration for small and uniform distances Longest wire: 4 feet Wires connected manually by extra-small engineers

19 ©2003 Dror Feitelson Cray X-MP 1982 1 Gflop Multiprocessor with 2 or 4 Cray-1-like processors Shared memory

20 ©2003 Dror Feitelson Cray X-MP

21 ©2003 Dror Feitelson Cray 2 1985 Smaller and more compact than Cray 1 4 (or 8) processors Total immersion liquid cooling

22 ©2003 Dror Feitelson Cray Y-MP 1988 8 proc’s Achieved 1 Gflop

23 ©2003 Dror Feitelson Cray Y-MP – Opened

24 ©2003 Dror Feitelson Cray Y-MP – From Back Power supply and cooling

25 ©2003 Dror Feitelson Cray C90 1992 1 Gflop per processor 8 or more processors

26 ©2003 Dror Feitelson The MPP Boom 1985: Thinking Machines introduces the Connection Machine CM-1 –16K single-bit processors, SIMD –Followed by CM-2, CM-200 –Similar machines by MasPar mid ’80s: hypercubes become successful Also: Transputers used as building blocks Early ’90s: big companies join –IBM, Cray

27 ©2003 Dror Feitelson SIMD Array Processors ’80s favorites –Connection Machine –MasPar Very many single-bit processors with attached memory – proprietary hardware Single control unit: everything is totally synchronized (SIMD = single instruction multiple data) Massive parallelism even with “correct counting” (i.e. divide by 32)
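For example, counting the 64K one-bit processors of the CM-2 (next slide) this way:

  64K single-bit processors / 32 bits per word  ≈  2K word-wide processors

still massive parallelism by the standards of the day.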

28 ©2003 Dror Feitelson Connection Machine CM-2 Cube of 64K proc’s Acts as backend Hypercube topology Data vault for parallel I/O

29 ©2003 Dror Feitelson Hypercubes Early ’80s: Caltech 64-node Cosmic Cube Mid to late ’80s: Commercialized by several companies –Intel iPSC, iPSC/2, iPSC/860 –nCUBE, nCUBE 2 (later turned into a VoD server…) Early ’90s: replaced by mesh/torus –Intel Paragon – i860 processors –Cray T3D, T3E – Alpha processors

30 ©2003 Dror Feitelson Transputers A microprocessor with built-in support for communication Programmed using Occam Used in Meiko and other systems Synchronous communication: an assignment across processes

  PAR
    SEQ
      x := 13
      c ! x
    SEQ
      c ? y
      z := y   -- z is 13

31 ©2003 Dror Feitelson Attack of the Killer Micros Commodity microprocessors advance at a faster rate than vector processors Takeover point was around year 2000 Even before that, using many together could provide lots of power –1992: TMC uses SPARC in CM-5 –1992: Intel uses i860 in Paragon –1993: IBM SP uses RS/6000, later PowerPC –1993: Cray uses Alpha in T3D –Berkeley NoW project

32 ©2003 Dror Feitelson Connection Machine CM-5 1992 SPARC-based Fat-tree network Dominant in early ’90s Featured in Jurassic Park Support for gang scheduling!

33 ©2003 Dror Feitelson Intel Paragon 1992 2 i860 proc’s per node: –Compute –Communication Mesh interconnect with spiffy display

34 ©2003 Dror Feitelson Cray T3D/T3E 1993 – Cray T3D Uses commodity microprocessors (DEC Alpha) 3D Torus interconnect 1995 – Cray T3E

35 ©2003 Dror Feitelson IBM SP 1993 16 RS/6000 processors per rack Each runs AIX (full Unix) Multistage network Flexible configurations First large IUCC machine

36 ©2003 Dror Feitelson Berkeley NoW The building is the computer Just need some glue software…

37 ©2003 Dror Feitelson Not Everybody is Convinced… Japan’s computer industry continues to build vector machines NEC –SX series of supercomputers Hitachi –SR series of supercomputers Fujitsu –VPP series of supercomputers Albeit with less style

38 ©2003 Dror Feitelson Fujitsu VPP700

39 ©2003 Dror Feitelson NEC SX-4

40 ©2003 Dror Feitelson More Recent History 1994-1995 slump –Cold war is over –Thinking Machines files for Chapter 11 –Kendall Square Research (KSR) files for Chapter 11 Late ’90s much better –IBM, Cray retain parallel machine market –Later also SGI, Sun, especially with SMPs –ASCI program is started 21st century: clusters take over –Based on SMPs

41 ©2003 Dror Feitelson SMPs Machines with several CPUs Initially small scale: 8-16 processors Later achieved large scale of 64-128 processors Global shared memory accessed via a bus Hard to scale further due to shared memory and cache coherence

42 ©2003 Dror Feitelson SGI Challenge 1 to 16 processors Bus interconnect Dominated low end of Top500 list in mid ’90s Not only graphics…

43 ©2003 Dror Feitelson SGI Origin An Origin 2000 installed at IUCC MIPS processors Remote memory access

44 ©2003 Dror Feitelson Architectural Convergence Shared memory used to be uniform (UMA) –Based on bus or crossbar –Conventional load/store operations Distributed memory used message passing Newer machines support remote memory access –Nonuniform (NUMA): access to remote memory costs more –Put/get operations (but handled by NIC) –Cray T3D/T3E, SGI Origin 2000/3000
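A minimal sketch of the put/get style, written against the modern OpenSHMEM interface as a stand-in for the vendor-specific shmem libraries of the T3D/T3E and Origin era (the call names below are OpenSHMEM, not the original Cray ones):

  #include <shmem.h>

  static double buf[64];               /* symmetric: exists on every PE */

  int main(void)
  {
      double local[64];

      shmem_init();
      int me   = shmem_my_pe();
      int npes = shmem_n_pes();

      for (int i = 0; i < 64; i++)
          local[i] = me;

      /* One-sided put: write my data into the next PE's copy of buf.
       * The remote CPU does not post a receive; the NIC handles the transfer. */
      shmem_double_put(buf, local, 64, (me + 1) % npes);

      shmem_barrier_all();             /* wait until all puts have completed */
      shmem_finalize();
      return 0;
  }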

45 ©2003 Dror Feitelson The ASCI Program 1996: nuclear test ban leads to need for simulation of nuclear explosions Accelerated Strategic Computing Initiative: Moore’s law not fast enough… Budget of a billion dollars

46 ©2003 Dror Feitelson The Vision [Schematic chart, Performance vs. Time: ASCI requirements rise faster than market-driven progress, with PathForward and technology transfer bridging the gap]

47 ©2003 Dror Feitelson ASCI Milestones

48 ©2003 Dror Feitelson The ASCI Red Machine 9260 processors – 200 MHz Pentium Pro Arranged as 4-way SMPs in 86 cabinets 573 GB memory total 2.25 TB disk space total 2 miles of cables 850 kW peak power consumption 44 tons (+300 tons of air conditioning equipment) Cost: $55 million

49 ©2003 Dror Feitelson Clusters vs. MPPs Mix and match approach –PCs/SMPs/blades used as processing nodes –Fast switched network for interconnect –Linux on each node –MPI for software development –Something for management Lower cost to set up Non-trivial to operate effectively

50 ©2003 Dror Feitelson SMP Nodes PCs, workstations, or servers with several CPUs Small scale (4-8) used as nodes in MPPs or clusters Access to shared memory via shared L2 cache SMP support (cache coherence) built into modern microprocessors

51 ©2003 Dror Feitelson Myrinet 1995 Switched gigabit LAN –As opposed to Ethernet, which is a broadcast medium Programmable NIC –Offload communication operations from the CPU Allows clusters to achieve communication rates of MPPs Very expensive Later: gigabit Ethernet

52 ©2003 Dror Feitelson

53 Blades PCs/SMPs require resources –Floor space –Cables for interconnect –Power supplies and fans This is meaningful if you have thousands Blades provide dense packaging With vertical mounting get < 1U on average The hot new thing in 2002

54 ©2003 Dror Feitelson SunFire Servers 16 servers in a rack- mounted box Used to be called “single-board computers” in the ’80s (Makbilan)

55 ©2003 Dror Feitelson The Cray Name 1972: Cray Research founded –Cray 1, X-MP, Cray 2, Y-MP, C90… –From 1993: MPPs T3D, T3E 1989: Cray Computer founded –GaAs efforts, later closed 1996: SGI acquires Cray Research –Attempt to merge T3E and Origin 2000: sold to Tera –Use name to bolster MTA 2002: Cray resells the Japanese NEC SX-6 2002: Announces new X1 supercomputer

56 ©2003 Dror Feitelson Vectors are not Dead! 1994: Cray T90 –Continues Cray C90 line 1996: Cray J90 –Continues Cray Y-MP line 2000: Cray SV1 2002: Cray X1 –Only “Big-Iron” company left

57 ©2003 Dror Feitelson Cray J90 1996 Very popular continuation of Y-MP 8, then 16, then 32 processors One installed at IUCC

58 ©2003 Dror Feitelson Cray X1 2002 Up to 1024 nodes 4 custom vector proc’s per node 12.8 GFlops peak each Torus interconnect

59 ©2003 Dror Feitelson Confused?

60 ©2003 Dror Feitelson The Top500 List List of the 500 most powerful computer installations in the world Separates academic chic from real impact Measured using Linpack –Dense matrix operations –Might not be representative of real applications

61 ©2003 Dror Feitelson

62

63

64

65 The Competition How to achieve a rank: Few vector processors –Maximize power per processor –High efficiency Many commodity processors –Ride technology curve –Power in numbers –Low efficiency

66 ©2003 Dror Feitelson Vector Programming Conventional Fortran Automatic vectorization of loops by compiler Autotasking uses processors that happen to be available at runtime to execute chunks of loop iterations Easy for application writers Very high efficiency

67 ©2003 Dror Feitelson MPP Programming Library or directives added to a conventional programming language –MPI for distributed memory –OpenMP for shared memory Applications need to be partitioned manually Many possible efficiency losses –Fragmentation in allocating processors –Stalls waiting for memory and communication –Imbalance among threads Hard for programmers
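For flavor, a minimal MPI program in C (standard MPI calls; the work loop is only a placeholder for a real partitioned computation):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Manual partitioning: each process handles its own slice of iterations. */
      double local = 0.0;
      for (int i = rank; i < 1000000; i += size)
          local += 1.0 / (1.0 + i);          /* stand-in for real work */

      /* Explicit communication to combine the partial results. */
      double total;
      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("total = %f\n", total);

      MPI_Finalize();
      return 0;
  }

Even in this toy example the partitioning and the communication are explicit, which is exactly where the efficiency losses listed above creep in.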

68 ©2003 Dror Feitelson Also National Competition Japan Inc. is “more Cray than Cray” –Computers based on few but powerful proprietary vector processors –Numerical Wind Tunnel at rank 1 from 1993 to 1995 –CP-PACS at rank 1 in 1996 –Earth Simulator at rank 1 since 2002 US industry switched to commodity microprocessors –Even Cray did –ASCI machines at rank 1 in 1997-2001

69 ©2003 Dror Feitelson Vectors vs. MPPs – 1994 Feitelson, Int. J. High-Perf. Comput. App., 1999

70 ©2003 Dror Feitelson Vectors vs. MPPs – 1997

71 ©2003 Dror Feitelson Vectors vs. MPPs – 1997

72 ©2003 Dror Feitelson The Current Situation

73 ©2003 Dror Feitelson Real Usage Control functions for telecomm companies Reservoir modeling for oil companies Graphic rendering for Hollywood Financial modeling for Wall Street Drug design for pharmaceuticals Weather prediction Airplane design for Boeing and Airbus Hush-hush activities

74 ©2003 Dror Feitelson

75 The Earth Simulator Operational in early 2002 Top rank in Top500 list Result of 5-year design and implementation effort Equivalent power to top 15 US machines (including all ASCI machines) Really big

76 ©2003 Dror Feitelson

77

78

79

80

81 The Earth Simulator in Numbers 640 nodes 8 vector processors per node, 5120 total 8 Gflops per processor, 40 Tflops total 16 GB memory per node, 10 TB total 2800 km of cables 320 cabinets (2 nodes each) Cost: $350 million

82 ©2003 Dror Feitelson Trends

83 ©2003 Dror Feitelson

84 Exercise Look at 10 years of Top500 lists and try to say something non-trivial about trends Are there things that grow? Are there things that stay the same? Can you make predictions?

85 ©2003 Dror Feitelson Distribution of Vendors – 1994 Feitelson, Int. J. High-Perf. Comput. App., 1999

86 ©2003 Dror Feitelson Distribution of Vendors – 1997

87 ©2003 Dror Feitelson IBM in the Lists Arrows are the ANL SP1 with 128 processors Rank doubles each year

88 ©2003 Dror Feitelson Minimal Parallelism

89 ©2003 Dror Feitelson Min vs. Max

90 ©2003 Dror Feitelson Power with Time Rmax of last machine doubles each year This is 8-fold in three years Degree of parallelism doubles every three years So power of each processor increases 4-fold in three years (=doubles in 18 months) Which is Moore’s Law…
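Spelling out the arithmetic in the bullets above:

  Rmax of last machine:   doubles each year      =>  ×2^3 = ×8 over 3 years
  Degree of parallelism:  doubles every 3 years  =>  ×2 over 3 years
  Power per processor:    8 / 2 = 4              =>  ×4 over 3 years
                          i.e. doubling every 36/2 = 18 months (Moore’s Law)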

91 ©2003 Dror Feitelson Distribution of Power Rank of machines doubles each year Power of rank 500 machine doubles each year So rank 250 machine this year has double the power of rank 500 machine this year And rank 125 machine has double the power of rank 250 machine In short, power decreases polynomially with rank: it doubles whenever the rank is halved
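Restating the bullets above as a formula (no new data): if halving the rank doubles the power, then roughly

  P(rank)  ≈  P(500) × (500 / rank)

a straight line on a log-log plot of power against rank. The slope values on the “Power and Rank” slide below are presumably this log-log slope, drifting away from 1 over the years.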

92 ©2003 Dror Feitelson Power and Rank

93 ©2003 Dror Feitelson Power and Rank

94 ©2003 Dror Feitelson Power and Rank

95 ©2003 Dror Feitelson Power and Rank The slope is becoming flatter:
  1994  0.978
  1995  0.865
  1996  0.839
  1997  0.816
  1998  0.777
  1999  0.761
  2000  0.800
  2001  0.753
  2002  0.746

96 ©2003 Dror Feitelson Machine Ages in Lists

97 ©2003 Dror Feitelson New Machines

98 ©2003 Dror Feitelson Industry Share

99 ©2003 Dror Feitelson Vector Share

100 ©2003 Dror Feitelson Summary Invariants of the last few years: Power grows exponentially with time Parallelism grows exponentially with time But maximal usable parallelism is ~10000 Power drops polynomially with rank Age in the list drops exponentially About 300 new machines each year About 50% of machines in industry About 15% of power due to vector processors

