Gordon Bell Microsoft ISHPC International Symposium on High-Performance Computing 26 May 1999.

1

2 Gordon Bell http://www.research.microsoft.com/users/gbell Microsoft ISHPC International Symposium on High-Performance Computing 26 May 1999

3 What a difference spending >10X/system & 25 years makes! 150 Mflops: CDC 7600 + Cray 1, LLNL c1978. 40 Tflops: ESRDC c2002 (artist's view)

4 Supercomputers(t)
Time   $M      Structure              Example
1950   1       mainframes             many...
1960   3       instruction //sm       IBM / CDC mainframe SMP
1970   10      pipelining             7600 / Cray 1
1980   30      vectors; SCI           "Crays"
1990   250     MIMDs: mC, SMP, DSM    "Crays"/MPP
2000   1,000   ASCI, COTS MPP         Grid, Legion

5 Supercomputing: speed at any price, using parallelism
Intra processor: memory overlap & instruction lookahead; functional parallelism (2-4); pipelining (10); SIMD ala ILLIAC 2d array of 64 PEs vs vectors; wide instruction word (2-4); MTA (10-20) with parallelization of a stream
MIMDs... multiprocessors... parallelization allows programs to stay with ONE stream: SMP (4-64), Distributed Shared Memory SMPs (100)
MIMD... multicomputers force multi-streams: multicomputers aka MPP aka clusters (10K); Grid: 100K

6 Growth in Computational Resources Used for UK Weather Forecasting, 1950-2000 [chart spanning 10 to 10T ops/sec, with machines Leo, Mercury, KDF9, 195, 205, YMP]: 10^10 in 50 yrs = 1.58^50
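
The growth-rate figure on this slide is plain compound-growth arithmetic: ten orders of magnitude in 50 years is about 1.58x per year. A minimal check in Python, not from the talk; the only inputs are the 10^10 and 50-year figures above:

```python
# Growth rate implied by slide 6: 10^10 improvement over 50 years.
total_improvement = 1e10
years = 50

annual_factor = total_improvement ** (1.0 / years)
print(f"annual factor ~= {annual_factor:.3f}")   # ~1.585, i.e. the 1.58 on the slide
print(f"check: {annual_factor ** years:.2e}")    # ~1.0e+10
```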

7 Talk plan
The very beginning: "build it yourself"
Supercomputing with one computer... the Cray era 1960-1995
Supercomputing with many computers... parallel computing 1987-
SCI: what was learned?
Why I gave up on shared memory...
From the humble beginnings
Petaflops: when, ... how, how much
New ideas: NOW, Legion, Grid, Globus ...
Beowulf: "build it yourself"

8 Supercomputer: old definition(s)
In the beginning everyone built their own computer
Largest computer of the day
Scientific and engineering apps
Large government defense, weather, aero, laboratories and centers are first buyers
Price is no object: $3M ... 30M, 50M, 150 ... 250M
Worldwide market: 3-5, xx, or xxx?

9 Supercomputing: new definition
Was a single, sequential program
Has become a single, large-scale job/program composed of many programs running in parallel
Distributed within a room
Evolving to be distributed in a region and globe
Cost, effort, and time are extraordinary
Back to the future: build your own super with shrink-wrap software!

10 Manchester: the first computer. Baby, Mark I, and Atlas

11 von Neumann computers: RAND Johnniac. When laboratories built their own computers

12 Cray, 1925-1996 (see gbell home page)

13 CDC 1604 & 6600

14 CDC 7600: pipelining

15 CDC STAR… ETA10 Scalar matters

16 Cray 1 #6 from LLNL. Located at The Computer Museum History Center, Moffett Field

17 Cray 1 150 Kw. MG set & heat exchanger

18 Cray XMP/4 Proc. c1984

19 A look at the beginning of the new beginning

20 SCI (Strategic Computing Initiative) funded by DARPA and aimed at a Teraflops! Era of State computers and many efforts to build high-speed computers... led to HPCC. Thinking Machines, Intel supers, Cray T3 series

21 Minisupercomputers: a market whose time never came. Alliant, Convex, Ardent+Stellar = Stardent = 0.

22 Cydrome and Multiflow: prelude to wide-word parallelism in Merced. Minisupers with VLIW attack the market. Like the minisupers, they are repelled. It's software, software, and software. Was it a basically good idea that will now work as Merced?

23 KSR 1: first commercial DSM NUMA (non-uniform memory access) aka COMA (cache-only memory architecture)

24 Intel's iPSC 1 & Touchstone Delta

25 "In Dec. 1995 computers with 1,000 processors will do most of the scientific processing." Danny Hillis, 1990 (1 paper or 1 company)

26 The Bell-Hillis Bet: Massive Parallelism in 1995 [table comparing TMC vs. world-wide supers on applications, revenue, and petaflops/mo.]

27 Thinking Machines: CM1 & CM5 c1983-1993

28 Bell-Hillis Bet: wasn't paid off! My goal was not necessarily to just win the bet! Hennessy and Patterson were to evaluate what was really happening... Wanted to understand degree of MPP progress and programmability

29 SCI (c1980s): Strategic Computing Initiative funded ATT/Columbia (Non Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Encore, ESL, GE (like connection machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine),

30 Those who gave their lives in the search for parallelism Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC, Chen Systems, CHOPP, Cogent, Convex (now HP), Culler, Cray Computers, Cydrome, Dennelcor, Elexsi, ETA, E & S Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, KSR, MasPar, Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY, SCS, SDSA, Supertek (now Cray), Suprenum, Stardent (Ardent+Stellar), Supercomputer Systems Inc., Synapse, Thinking Machines, Vitec, Vitesse, Wavetracer.

31 What can we learn from this?
The classic flow, university research to product development, worked
SCI: ARPA-funded product development failed. No successes. Intel prospered.
ASCI: DOE-funded product purchases create competition
First efforts in startups... all failed.
– Too much competition (with each other)
– Too little time to establish themselves
– Too little market. No apps to support them
– Too little cash
Supercomputing is for the large & rich ... or is it? Beowulf, shrink-wrap clusters

32 Humble beginning: In 1981… would you have predicted this would be the basis of supers?

33 Innovation: the Virtuous Economic Cycle that drives the PC industry [cycle diagram linking volume, competition, standards, innovation, and utility/value]

34 Platform Economics [chart: price (K$), volume (K), and application price for mainframe, WS, and browser platforms, spanning 0.01 to 100000]. Traditional computers: custom or semi-custom, high-tech and high-touch. New computers: high-tech and no-touch

35 Computer ops/sec x word length / $

36 Intel's iPSC 1 & Touchstone Delta

37 GB with NT, Compaq, HP cluster

38 The Alliance LES NT Supercluster: 192 HP 300 MHz + 64 Compaq 333 MHz processors; Myrinet network, HPVM, Fast Msgs; Microsoft NT OS, MPI API. Andrew Chien, CS UIUC-->UCSD; Rob Pennington, NCSA. "Supercomputer performance at mail-order prices" -- Jim Gray, Microsoft

39 Are we at a new beginning?
"Now, this is not the end. It is not even the beginning of the end, but it is, perhaps, the end of the beginning." 1999 Salishan HPC Conference, from W. Churchill 11/10/1942
"You should not focus NSF CS Research on parallelism. I can barely write a correct sequential program." Don Knuth, 1987 (to GBell)
"I'll give $100 to anyone who can run a program on more than 100 processors." Alan Karp (198x?)
"I'll give a $2,500 prize for parallelism every year." Gordon Bell (1987)

40 Bell Prize and Future Peak Gflops(t) [chart: Bell Prize performance over time, from XMP, NCube, and CM2 toward the Petaflops study target]

41 1989 Predictions vs 1999 Observations
Predicted 1 Tflops PAP by 1995; actual 1996. Very impressive progress! (RAP < 1 TF)
More diversity => less software progress!
– Predicted: SIMD, mC (incl. W/S), scalable SMP, DSM, supers would continue as significant
– Got: SIMD disappeared, 2 mC, 1-2 SMP/DSM, 4 supers, 2 mCv with one address space
1 SMP became larger and clusters, MTA, workstation clusters, GRID
$3B (unprofitable?) industry; 10+ platforms
PCs and workstations diverted users
MPP apps market DID/could NOT materialize

42 U.S. Tax Dollars At Work. How many processors does your center have?
Intel/Sandia: 9000 Pentium Pro
LLNL/IBM: 488x8x3 PowerPC (SP2)
LANL/Cray: 6144 P in DSM clusters
Maui Supercomputer Center: 512x1 SP2

43 ASCI Blue Mountain: 3.1 Tflops SGI Origin 2000
12,000 sq. ft. of floor space
1.6 MWatts of power
530 tons of cooling
384 cabinets to house 6144 CPUs with 1536 GB (32 GB / 128 CPUs; see the check below)
48 cabinets for metarouters
96 cabinets for 76 TB of RAID disks
36 x HIPPI-800 switch cluster interconnect
9 cabinets for 36 HIPPI switches
about 348 miles of fiber cable
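
The memory figures above are internally consistent; a quick arithmetic sketch using only the numbers quoted on the slide:

```python
# Consistency check of the ASCI Blue Mountain memory figures above.
cpus = 6144
cpus_per_memory_domain = 128   # "32 GB / 128 CPUs"
gb_per_domain = 32

domains = cpus // cpus_per_memory_domain   # 48 such 128-CPU groups
total_gb = domains * gb_per_domain         # 48 * 32 = 1536 GB, as quoted
print(domains, total_gb)
```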

44 Half of LASL

45 Comments from LLNL program manager. Lessons learned with "Full-System Mode":
– It is harder than you think
– It takes longer than you think
– It requires more people than you can believe
Just as in the very beginning of computing, leading edge users are building their own computers.

46 NEC Supers

47 40 Tflops Earth Simulator R&D Center c2002

48 Fujitsu VPP5000 multicomputer (not available in the U.S.)
Computing node speed: 9.6 Gflops vector, 1.2 Gflops scalar
Primary memory: 4-16 GB
Memory bandwidth: 76 GB/s (9.6 x 64 Gb/s)
Inter-processor comm: 1.6 GB/s non-blocking with global addressing among all nodes
I/O: 3 GB/s to SCSI, HIPPI, gigabit ethernet, etc.
1-128 computers deliver 1.22 Tflops (see the check below)
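
The 1.22 Tflops figure follows directly from the per-node vector speed; a small check using only the slide's own numbers:

```python
# Peak-performance arithmetic behind the VPP5000 figures above.
vector_gflops_per_node = 9.6
max_nodes = 128

peak_tflops = vector_gflops_per_node * max_nodes / 1000.0
print(f"{peak_tflops:.2f} Tflops")   # ~1.23 Tflops, i.e. the quoted ~1.22 Tflops
```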

49 C1999 Clusters of computers. It's MPP when processors/cluster > 1000
Who              ΣP.pap   ΣP     P.pap    ΣP.pap/C   ΣP/C   ΣMp/C   ΣM.s
                 Tflops   #K     Gflops   Gflops     #      GB      TB
LLNL (IBM)       3.9      5.9    .66      5.3        8      2.5     62
LANL (SGI)       3.1      6.1    .5       64         128    32      76
Sandia (Intel)   2.7      9.1    .3       .6         2      -       -
Beowulf          0.5      2.     .04      -          -      -       -
Fujitsu          1.2      .13    9.6      9.6        1      4-16    -
NEC              4.0      .5     8        128        16     128     -
ESRDC            40       5.12   8        64         8      16      -

50 High performance architecture/program timeline, 1950-2000 [timeline chart: vacuum tubes, transistors, MSI (mini), micro, RISC micro; sequential programming throughout; SIMD; vector; parallelization; parallel programming; multicomputers; MPP era; ultracomputers (10X in size & price! 10x MPP); "in situ" resources (100x in //sm); NOW; VLSC; Grid]

51 Yes... we are at a new beginning! Single jobs, composed of 1000s of quasi-independent programs running in parallel on 1000s of processors (or computers). Processors (or computers) of all types are distributed (i.e. connected) in every fashion, from a collection using a single shared memory to globally dispersed computers.

52 Future

53 2010 component characteristics: 100x improvement @ 60% growth
Chip density: 500 Mt
Bytes/chip: 8 GB
On-chip clock: 2.5 GHz
Inter-system clock: 0.5
Disk: 1 TB
Fiber speed (1 ch): 10 Gbps
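
The "100x improvement @ 60% growth" headline is compound growth again: roughly ten years of 60%/yr improvement gives 100x. A minimal sketch, assuming the c1999 to 2010 window (the window is an assumption, not stated on the slide):

```python
# How many years of 60%/yr compound growth does a 100x improvement take?
import math

annual_growth = 1.60
years_for_100x = math.log(100) / math.log(annual_growth)
print(f"{years_for_100x:.1f} years")   # ~9.8 years, roughly 1999 -> 2010
```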

54 1999: buyers, users, ISVs?
Technical: supers dying; DSM (and SMPs) trying
– Mainline: user & ISV apps ported to PCs & workstations
– Supers (legacy code) market lives on...
– Vector apps (e.g. ISVs) ported to parallelized SMPs
– ISVs adopt MPI for a few apps at their own peril
– Leading edge: one-of-a-kind apps on clusters of 16, 256, ... 1000s built from uni, SMP, or DSM at great expense!
Commercial: SMP mainframes and minis and clusters are interchangeable (control is the issue)
– Dbase & TP: SMPs compete with mainframes if central control is an issue, else clusters
– Data warehousing: may emerge... just a Dbase
– High growth, web and stream servers: clusters have the advantage

55 Application Taxonomy
Technical: general purpose, non-parallelizable codes (PCs have it!); vectorizable & //able (supers & all SMPs); hand-tuned, one-of MPP coarse grain; MPP embarrassingly // (clusters of anythings)
Commercial: database; database/TP; web host; stream audio/video
If real rich then IBM mainframes or large SMPs else PC clusters. If real rich then SMP clusters else PC clusters (U.S. only)

56 C2000+ Architecture Taxonomy
SMP: Xpt SMPs (mainframes); Xpt-SMPvector; Xpt-multithread (Tera); "multi" as a component; Xpt-"multi" hybrid; DSM-(commodity-SCI)?; DSM (scalar); DSM (vector)
Multicomputers aka clusters (MPP when n>1000 processors): commodity "multis" (mainline); clusters of "multis"; clusters of DSMs (scalar & vector)

57 Questions that will get answered
How long will Moore's Law continue?
MPP (clusters of >1K proc.) vs SMP (incl. DSM)?
How much time and money for programming?
How much time and money for execution?
When, or will, DSM be pervasive?
Is the issue of processor architecture (scalar, MTA, VLIW/MII, vector) important?
Commodity vs proprietary chips?
Commodity, proprietary, or net interconnections?
Unix vs VendorIX vs NT?
Commodity vs proprietary systems?
Can we find a single, all-pervasive programming model for scalable parallelism to support apps?
When will computer science teach parallelism?

58 Switching from a long-term belief in SMPs (e.g. DSM, NUMA) to clusters
1963-1993: SMP => DSM inevitability, after 30 years of belief in & building mPs
1993: clusters are inevitable
2000+: commodity clusters, improved log(p) SMPs => DSM

59 SNAP Systems circa 2000: a space, time (bandwidth), generation, and reliability scalable environment
Local & global data comm world: ATM & Ethernet to PCs, workstations, & servers; wide-area global ATM network
Legacy mainframe & minicomputer servers & terminals
Centralized & departmental servers built from PCs; scalable computers built from PCs & SANs
TC=TV+PC home... (CATV or ATM or satellite)
Person servers (PCs), portables, mobile nets, telecomputers aka Internet terminals

60 Scaling dimensions include:
reliability... including always up
number of nodes
– most cost-effective system built from best nodes... PCs with NO backplane
– highest throughput distributes disks to each node versus into a single node
location within a region or continent
time-scale, i.e. machine generations

61 Why did I switch to clusters aka multicomputers aka MPP?
Economics: commodity components give a 10-100x advantage in price performance
– Backplane-connected processors (incl. DSMs) vs board-connected processors
Difficulty of making large SMPs (and DSM)
– Single system image... clearly needs more work
SMPs (and DSMs) fail ALL scalabilities!
– size and lumpiness
– reliability
– cross-generation
– spatial
We need a single programming model
Clusters are the only structure that scales!

62 Technical users have alternatives
PCs work fine for smaller problems
"Do it yourself clusters" e.g. Beowulf works!
MPI: lcd? programming model doesn't exploit shared memory (see the sketch below)
ISVs have to use lcd to survive
SMPs are expensive
Clusters required for scalabilities or apps requiring extraordinary performance... so DSM only adds to the already complex parallelization problem
Non-U.S. users continue to use vectors
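
The sketch below illustrates the lowest-common-denominator message-passing style the MPI bullet refers to: every data exchange is an explicit message, whether or not the ranks happen to share physical memory. This is a minimal, hypothetical example using mpi4py, not code from the talk:

```python
# Minimal message-passing (MPI) example: each rank computes a partial sum,
# then all ranks combine the partials with an explicit collective operation.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

partial = sum(range(rank, 1000, size))       # this rank's share of 0..999
total = comm.allreduce(partial, op=MPI.SUM)  # explicit communication step

if rank == 0:
    print(f"{size} ranks, total = {total}")  # 499500 for any number of ranks
```

Run with, e.g., `mpiexec -n 4 python sum.py`; nothing in the program changes whether the four ranks sit on one SMP or on four networked PCs, which is both the portability and the missed shared-memory opportunity the slide points at.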

63 Commercial users don't need them
Highest growth is & will be web servers delivering pages, audio, and video
Apps are inherently, embarrassingly parallel
Databases and TP parallelized and transparent
A single SMP handles traditional apps
Clusters required for reliability, scalabilities

64 2010 architecture
Not much different... I see no surprises, except at the chip level. Good surprises would drive performance more rapidly
SmP (m<16) will be the component for clusters. Most cost-effective systems are made from best nodes.
Clusters will be pervasive.
Interconnection networks log(p) continue to be the challenge

65 Computer (P-Mp) system alternatives
Node size: most cost-effective SMPs
– Now 1-2 on a single board
– Evolves based on n processors per chip
Continued use of single-bus SMP "multi"
Large SMPs provide a single system image for small systems, but are not cost- or space-efficient for use as a cluster component
SMPs evolving to weak-coherency DSMs

66 Cluster system alternatives
System in a room: SAN connected, e.g. NOW, Beowulf
System in the building: LAN connected
System across the continent or globe: inter-/intra-net connected networks

67 NCSA Cluster of 8 x 128 processors SGI Origin

68 Architects & architectures... clusters aka MPP (if p>1000)
Clusters => NUMA/DSM iff commodity interconnects supply them
U.S. vendors = 9 x scalar processors
– HP, IBM, and Sun: minicomputers aka servers to attack mainframes are the basic building blocks
– SMPs with 100+ processors per system
– surprise: 4-16 processors / chip... MTA?
Intel-based: desktop & small servers; commodity supercomputers ala Beowulf
Japanese vendors = vector processors
– NEC continues driving the NUMA approach
– Fujitsu will evolve to NUMA/DSM

69 1994 Petaflops Workshop, c2007-2014. Clusters of clusters. Something for everybody
               SMP             Clusters           Active Memory
               400 P           4-40K P            400K P
               1 Tflops*       10-100 Gflops      1 Gflops
               400 TB SRAM     400 TB DRAM        0.8 TB embed
               250 Kchips      60K-100K chips     4K chips
               1 ps/result     10-100 ps/result
*100 x 10 Gflops threads
100,000 1-Tbyte discs => 100 Petabytes. 10 failures / day (see the check below)
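
Two of the figures above are simple products; a quick arithmetic check (this reading of the footnotes is an assumption, not spelled out on the slide):

```python
# Arithmetic behind two lines of the Petaflops-workshop slide above.
smp_proc_tflops = 100 * 10 / 1000.0   # "100 x 10 Gflops threads" per SMP processor
storage_pb = 100_000 * 1 / 1000.0     # 100,000 x 1 TB discs, in petabytes
print(smp_proc_tflops, storage_pb)    # 1.0 Tflops, 100.0 PB
```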

70 HT-MT: What’s 0.5 5 ?

71 HT-MT… Mechanical: cooling and signals Chips: design tools, fabrication Chips: memory, PIM Architecture: MTA on steroids Storage material

72 HTMT & heuristics for computer builders
Mead 11-year rule: time between lab appearance and commercial use
Requires >2 breakthroughs
Team's first computer or super
It's government funded... albeit at a university

73 Global interconnection. "Our vision... is a system of millions of hosts... in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects." Grimshaw, Wulf, et al., "Legion", CACM, Jan. 1997

74 Utilize in situ workstations!
NoW (Berkeley) set sort record, decrypting
Grid, Globus, Condor and other projects
Need "standard" interface and programming model for clusters using "commodity" platforms & fast switches
Giga- and tera-bit links and switches allow geo-distributed systems
Each PC in a computational environment should have an additional 1GB/9GB!

75 Or more parallelism... and use installed machines
10,000 nodes in 1999, or a 10x increase
Assume 100K nodes of 10 Gflops / 10 GB / 100 GB, i.e. low-end c2010 PCs (see the sketch below)
Communication is the first problem... use the network
Programming is still the major barrier
Will any problems fit it?
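
The aggregate capacity of the assumed 100K-node system follows directly from the per-node figures; a minimal sketch using only those numbers:

```python
# Aggregate capacity of 100K nodes at 10 Gflops / 10 GB memory / 100 GB disk each.
nodes = 100_000

total_pflops = nodes * 10 / 1e6    # Gflops -> Pflops:  1 Pflops
total_mem_pb = nodes * 10 / 1e6    # GB -> PB:          1 PB of memory
total_disk_pb = nodes * 100 / 1e6  # GB -> PB:          10 PB of disk
print(total_pflops, total_mem_pb, total_disk_pb)
```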

76 The Grid: Blueprint for a New Computing Infrastructure, Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann. Published July 1998; ISBN 1-55860-475-8. 22 chapters by expert authors including Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others. http://www.mkp.com/grids "A source book for the history of the future" -- Vint Cerf

77 The Grid: "Dependable, consistent, pervasive access to [high-end] resources"
Dependable: can provide performance and functionality guarantees
Consistent: uniform interfaces to a wide variety of resources
Pervasive: ability to "plug in" from anywhere

78 Alliance Grid Technology Roadmap: it's not just flops or records/sec
User interface: Tango, Webflow, Habanero, workbenches, NetMeeting, H.320/323, RealNetworks
Middleware: Globus, LDAP, QoS, Java, vBNS, Abilene, ActiveX, MREN, clusters
Compute: Condor, JavaGrande, HPVM/FM, Symera (DCOM), DSM, HPF, MPI, OpenMP, clusters
Data: ODBC, Emerge (Z39.50), SRB, HDF-5, SANs, svPablo, DMF, XML
Visualization: Virtual Director, CAVERNsoft, Java3D, SCIRun, Cave5D, VRML

79 Summary
1000x increase in PAP has not always been accompanied by RAP, insight, infrastructure, and use. Much remains to be done.
"The PC World Challenge" is to provide commodity, clustered parallelism to commercial and technical communities
Only becomes true if software vendors e.g. Microsoft deliver "shrink-wrap" software
ISVs must believe that clusters are the future
Computer science has to get with the program
Grid etc. using world-wide resources, including in situ PCs, is the new idea

80 2004 Computer Food Chain??? [Dave Patterson, UC/Berkeley: portable computers, mainframe, vector super, networks of workstations/PCs, mini, massively parallel processors]

81 The end

82 When is a Petaflops possible? What price? (Gordon Bell, ACM 1997)
Moore's Law: 100x (but how fast can the clock tick? Are there any surprises?)
Increase parallelism 10K -> 100K: 10x
Spend more ($100M -> $500M): 5x
Centralize center or fast network: 3x
Commoditization (competition): 3x
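
If the factors on this slide are read as independent multiplicative levers over a c1997 baseline, they compound as sketched below; that multiplicative reading is an assumption, not stated on the slide:

```python
# Compounding the petaflops "levers" listed on slide 82.
factors = {
    "Moore's Law": 100,
    "more parallelism (10K -> 100K)": 10,
    "spend more ($100M -> $500M)": 5,
    "centralize or fast network": 3,
    "commoditization (competition)": 3,
}

combined = 1
for name, factor in factors.items():
    combined *= factor
print(f"combined headroom: {combined}x")   # 45,000x
```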

83 Processor Alternatives
Commodity aka Intel micros
– Does VLIW work better as a micro than it did as Cydrome & Multiflow minis?
Vector processor
Multiple processors per chip or multi-threading
MLIW? a.k.a. signal processing
FPGA chip-based special processors

84 Russian Elbrus E2K Micro
Who            E2K        Merced
Clock GHz      1.2        0.8
Spec i/fp      135/350    45/70
Size mm^2      126        300
Power          35         60
Pin B/W GB     15
Cache (KB)     64/256
PAP Gflops     10.2
System ship    Q4/2001

85 What Is The Processor Architecture? Vectors or...?
Comp. Sci. view: MISC >> CISC, language directed, RISC, super-scalar, MTA, extra-long instruction word
Supercomputer view: RISC, VCISC (vectors), multiple pipes

86 Observation: CMOS supers replaced ECL in Japan
10 Gflops vector units have dual use
– In traditional mPv supers
– As basis for computers in mC
Software apps are present
Vector processors out-perform n (n=10) micros for many apps. It's memory bandwidth, cache prediction, inter-communication, and overall variation

87 Weather model performance

88 Observation: MPPs 1, Users <1
MPPs with relatively low-speed micros with lower memory bandwidth ran over supers, but didn't kill 'em.
Did the U.S. industry enter an abyss?
- Is crying "unfair trade" hypocritical?
- Are U.S. users being denied tools?
- Are users not "getting with the program"?
Challenge: we must learn to program clusters...
- Cache idiosyncrasies
- Limited memory bandwidth
- Long inter-communication delays
- Very large numbers of computers
- NO two computers are alike => NO apps

89 The Law of Massive Parallelism (mine) is based on application scaling
There exists a problem that can be made sufficiently large such that any network of computers can run efficiently given enough memory, searching, & work -- but this problem may be unrelated to any other.
Any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time... but it may be completely impractical
Challenge to theoreticians and tool builders: How well will, or will, an algorithm run?
Challenge for software and programmers: Can a package be scalable & portable? Are there models?
Challenge to users: Do larger scale, faster, longer run times increase problem insight and not just total flop or flops?
Challenge to funders: Is the cost justified?
Gordon's WAG

90 GB's Estimate of Parallelism in Engineering & Scientific Applications (Gordon's WAG)
[chart: log(# apps) vs granularity & degree of coupling (comp./comm.), from dusty decks for supers to new or scaled-up apps; platforms span PCs, WSs, supers, scalable multiprocessors, and clusters aka MPPs aka multicomputers]
Scalar: 60%
Vector: 15%
Vector & //: 5%
One-of >>//: 5%
Embarrassingly & perfectly parallel: 15%

