
1

2 Gordon Bell http://www.research.microsoft.com/users/gbell Microsoft SC’99: The 14th Mannheim Supercomputing Conference June 10, 1999 “looking 10 years ahead”

3 What a difference 25 years and spending >10x makes! LLNL center c1978: 150 Mflops (7600 & Cray-1). ESRDC c2002: 40 Tflops, 5,120 processors, 640 computers.
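A quick ratio from the slide's own numbers (a back-of-the-envelope check, not part of the deck):

```python
# Ratio behind "what a difference 25 years makes", using only the slide's figures.
llnl_1978 = 150e6      # 150 Mflops, LLNL center c1978 (7600 & Cray-1)
esrdc_2002 = 40e12     # 40 Tflops, Earth Simulator R&D Center c2002
print(esrdc_2002 / llnl_1978)   # ~267,000x in roughly 25 years
```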

4 Talk plan We are at a new beginning… many views: installations, parallelism, machine intros(t), timeline, cost to get results, and scalabilities. SCI c1985, the beginning: 1K processors (MPP). ASCI c1998, new beginning: 10K processors. Why I traded places with Greg Papadopoulos re. clusters and SmPs. Questions that users & architects will resolve. New structures: Beowulf and NT equivalent, Condor, COW, Legion, Globus, Grid …

5 Comments from LLNL Program manager Lessons Learned with “Full-System Mode” – It is harder than you think – It takes longer than you think – It requires more people than you can believe Just as in the very beginning of computing, leading edge users are building their own computers.

6 Are we at a new beginning? “Now, this is not the end. It is not even the beginning of the end, but it is, perhaps, the end of the beginning.” 1999 Salishan HPC Conference, from W. Churchill 11/10/1942. “You should not focus NSF CS research on parallelism. I can barely write a correct sequential program.” Don Knuth 1987 (to GBell). “Parallel processing is impossible for people to create well, much less debug.” Ken Thompson 1987. “I’ll give $100 to anyone who can run a program on more than 100 processors.” Alan Karp (198x?). “I’ll give a $2,500 prize for parallelism every year.” Gordon Bell (1987)

7 Yes… we are at a new beginning! Based on clustered computing: single jobs, composed of 1000s of quasi-independent programs running in parallel on 1000s of processors. Processors (or computers) of all types are distributed and interconnected in every fashion, from a collection using a single shared memory to globally dispersed computers.

8

9 Intel/Sandia: 9,000 Pentium Pro. LLNL/IBM SP2: 3 × (488 × 8) PowerPC. LANL/Cray: 6,144 processors in 48 128-processor DSM clusters. U.S. tax dollars at work. How many processors does your center have?

10 Supercomputing architectures: speed at any price, parallel nodes
– Intra-processor: memory overlap & instruction lookahead; functional parallelism (2-4); pipelining (10)
– SIMD ala ILLIAC: a 2-D array of 64 PEs, leading to vectors
– Wide instruction word (2-4)
– MTA (10-100) for parallelization of a stream
– MIMD… multiprocessors: parallelization allows programs to stay with ONE stream; SMP (4-64); distributed shared memory SMPs (100)
– MIMD… clustered computing: multi-streams; MPP aka clusters aka multicomputers (10K)
– Grid: 100K

11 High performance architectures timeline, 1950-2000:
– Technology: vacuum tubes, transistors, MSI (minis), micros, RISC micros, the “IBM PC”
– Processor: overlap, lookahead; “killer micros”
– Cray era: 6600, 7600, Cray-1, X, Y, C, T; functional units, pipelining, vectors, leading to SMP
– SMP: mainframes, then “multis”
– DSM: ??, Mmax, KSR, SGI
– Clusters: Tandem, VAX, IBM, UNIX; MPP if n>1000: Ncube, Intel, IBM
– Local NOW and global networks, n>10,000: Grid

12 High performance architectures timeline, 1950-2000 (with programming models):
– Technology: vacuum tubes, transistors, MSI (minis), micros, RISC micros, the “IBM PC”
– Sequential programming (a single execution stream, e.g. Fortran) continues throughout
– Processor: overlap, lookahead; “killer micros”
– Cray era: 6600, 7600, Cray-1, X, Y, C, T; functional units, pipelining, vectors, leading to SMP
– SMP: mainframes, then “multis”
– DSM: ??, Mmax, KSR, DASH, SGI
– SIMD and vector parallelization
– THE NEW BEGINNING: parallel programs aka cluster computing, from multicomputers and the MPP era
– Clusters: Tandem, VAX, IBM, UNIX; MPP if n>1000: Ncube, Intel, IBM
– Local NOW, Beowulf, and global networks, n>10,000: Grid

13 High performance architecture/program timeline, 1950-2000:
– Technology: vacuum tubes, transistors, MSI (minis), micros, RISC micros
– Sequential programming (single execution stream) continues throughout
– SIMD and vector parallelization
– Parallel programs aka cluster computing, from multicomputers and the MPP era
– Ultracomputers: 10x in size & price! 10x MPP
– “In situ” resources, 100x in //sm: NOW, VLSCC, geographically dispersed Grid

14 Computer types (chart classifying systems by connectivity: WAN/LAN, SAN, DSM, SM; and by processor type: micros vs. vector; a “Clusters” region is marked). Systems shown: networked supers (GRID, Legion, Condor), Beowulf, NT clusters, NOW; VPP uni, T3E, SP2 (mP), NEC mP; SGI DSM clusters & SGI DSM; NEC super, Cray X…T (all mPv); mainframes, multis, WSs, PCs.

15 Technical computer types (same chart, divided into the old world of one program stream and the new world of clustered computing with multiple program streams; connectivity: WAN/LAN, SAN, DSM, SM; micros vs. vector). Systems shown: networked supers (GRID, Legion, Condor), Beowulf, NOW; VPP uni, SP2 (mP), NEC mP, T series; SGI DSM clusters & SGI DSM; NEC super, Cray X…T (all mPv); mainframes, multis, WSs, PCs.

16 Technical computer types (same chart, annotated with the programming approach for each region: vectorize; parallelize; MPI, Linda, PVM, ???; distributed computing). Systems shown: networked supers (GRID, Legion, Condor), Beowulf, NOW; VPP uni, SP2 (mP), NEC mP, T series; SGI DSM clusters & SGI DSM; NEC super, Cray X…T (all mPv); mainframes, multis, WSs, PCs; connectivity: WAN/LAN, SAN, DSM, SM; micros vs. vector.

17 Technical computer types: pick of 4 nodes and 2-3 interconnects (SAN, DSM, SMP; micros vs. vector). Candidates shown: Fujitsu, Hitachi, IBM, ?PC?, SGI cluster, Beow/NT, NEC, SGI DSM, T3, HP?, NEC super, Cray ???, Fujitsu, Hitachi, HP, IBM, Intel, SUN, plain old PCs.

18 (Chart: Bell Prize and future peak Tflops over time, with the Petaflops study target marked; systems noted include NEC, XMP, NCube, and CM2.)

19 Bell Prize: 1000x, 1987-1999
– 1987: Ncube, 1,000 computers; showed that with more memory, apps scaled
– 1987: Cray XMP, 4 proc. @ 200 Mflops/proc
– 1996: Intel, 9,000 proc. @ 200 Mflops/proc
– 1998: 600 RAP Gflops Bell prize
– 1999: >1 Tflops attainable on the ASCI computers
Parallelism gains: 10x in parallelism over Ncube; 1500x in parallelism over XMP.
Spend 5x more… cost may be 10x more!
Cost effectiveness: 5x (ECL → CMOS; SRAM → DRAM). Moore’s Law = 100x. Clock: 2-10x; CMOS-ECL speed cross-over.
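A quick arithmetic check on these figures (my own back-of-the-envelope, using only the numbers on the slide):

```python
# Back-of-the-envelope check of the 1987 -> 1999 gain, using only the slide's figures.
xmp_1987_gflops = 4 * 0.2        # Cray XMP: 4 processors @ 200 Mflops
intel_1996_gflops = 9000 * 0.2   # Intel: 9,000 processors @ 200 Mflops
print(intel_1996_gflops / xmp_1987_gflops)   # ~2250x raw peak growth over the XMP
print(1000 / xmp_1987_gflops)                # 1 Tflops vs. 0.8 Gflops: ~1250x, the "1000x" headline
```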

20 SCI c1983 (Strategic Computing Initiative): funded by DARPA in the early 80s and aimed at a Teraflops! The era of State computers and many efforts to build high-speed computers… led to HPCC. Thinking Machines, Intel supers, Cray T3 series.

21 Humble beginning: “Killer” Micro? In 1981… did you predict this would be the basis of supers?

22 Innovation: the virtuous economic cycle that drives the PC industry, linking volume, competition, standards, and utility/value.

23 “In Dec. 1995 computers with 1,000 processors will do most of the scientific processing.” Danny Hillis, 1990 (1 paper or 1 company)

24 The Bell-Hillis bet: massive parallelism in 1995. (Chart comparing TMC with world-wide supers on applications, revenue, and petaflops/month.)

25 Bell-Hillis bet: wasn’t paid off! My goal was not necessarily to just win the bet! Hennessy and Patterson were to evaluate what was really happening… Wanted to understand the degree of MPP progress and programmability.

26 SCI (c1980s): Strategic Computing Initiative funded ATT/Columbia (Non Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Cedar (U. of IL), Encore, ESL, GE (like connection machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine)

27 Those who gave their lives in the search for parallelism Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC, Chen Systems, CHOPP, Cogent, Convex (now HP), Culler, Cray Computers, Cydrome, Denelcor, Elxsi, ETA, E & S Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, KSR, MasPar, Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY, SCS, SDSA, Supertek (now Cray), Suprenum, Stardent (Ardent+Stellar), Supercomputer Systems Inc., Synapse, Thinking Machines, Vitec, Vitesse, Wavetracer.

28 What can we learn from this? SCI: ARPA-funded product development failed; no successes. Intel prospered. ASCI: DOE-funded product purchases create competition. First efforts in startups… all failed:
– Too much competition (with each other)
– Too little time to establish themselves
– Too little market; no apps to support them
– Too little cash
Supercomputing is for the large & rich… or is it? Beowulf, shrink-wrap clusters; NOW, Condor, Legion, Grid, etc.

29 2010 ground rules: The component specs

30 2010 component characteristics: 100x improvement @60% growth
Chip density: 500 Mt
Bytes/chip: 8 GB
On-chip clock: 2.5 GHz
Inter-system clock: 0.5
Disk: 1 TB
Fiber speed (1 ch): 10 Gbps
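A quick check that the two headline numbers are consistent, assuming the 60% is per-year growth compounded over roughly the decade to 2010 (my reading, not stated on the slide):

```python
# 60% per-year growth compounded over ~10 years (an assumed reading of "@60% growth").
factor = 1.6 ** 10
print(round(factor))   # ~110x, matching the slide's "100x improvement"
```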

31 Computer ops/sec x word length / $

32 Processor limit: the DRAM gap. Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks, × 4 = 432 instructions. Caches in Pentium Pro: 64% of area, 88% of transistors. *Taken from the Patterson-Keeton talk to SIGMOD, “Moore’s Law”
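The arithmetic behind the miss-cost figure, assuming a roughly 600 MHz 21264 (consistent with the slide's 1.7 ns clock, but my assumption):

```python
# DRAM-gap arithmetic from the slide; the ~600 MHz clock is an assumption.
miss_ns = 180.0
cycle_ns = 1.0 / 0.6            # ~1.67 ns, shown rounded to 1.7 ns on the slide
issue_width = 4                 # the 21264 issues up to 4 instructions per clock
clocks = miss_ns / cycle_ns
print(round(clocks), round(clocks) * issue_width)   # ~108 clocks -> ~432 issue slots lost per miss
```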

33 Gordon B. & Greg P.: trading places, or why I switched from SMPs to clusters. “Miles’ Law: where you stand depends on where you sit.” 1993: GB: SMP and DSM inevitability, after 30 years of belief in/building mPs; GP: multicomputers ala CM5. 2000+: GB: commodity clusters, improved log(p); GP: SMPs => DSM.

34 GB with NT, Compaq, HP cluster

35 AOL Server Farm

36 WHY BEOWULFS? (from Thomas Sterling)
– best price performance
– rapid response to technology trends
– no single-point vendor
– just-in-place configuration
– scalable
– leverages large software development investment
– mature, robust, accessible
– user empowerment
– meets low expectations created by MPPs

37 IT'S THE COST, STUPID: $28 per sustained MFLOPS; $11 per peak MFLOPS (from Thomas Sterling)
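The two prices imply a sustained-to-peak ratio (my inference from the slide's numbers only):

```python
# Implied efficiency: the same dollar buys 1/28 sustained Mflops and 1/11 peak Mflops.
cost_sustained = 28.0   # $ per sustained MFLOPS
cost_peak = 11.0        # $ per peak MFLOPS
print(cost_peak / cost_sustained)   # ~0.39, i.e. roughly 40% of peak sustained
```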

38 Why did I trade places, i.e. switch to clustered computing? Economics: commodity components give a 10-100x advantage in price/performance
– Backplane-connected processors (incl. DSMs) vs. board-connected processors
Difficulty of making large SMPs (and DSMs)
– Single system image… clearly needs more work
SMPs (and DSMs) are NOT scalable in:
– size: all have very lumpy memory access patterns
– reliability: redundancy and fault tolerance are required
– cross-generation: every 3-5 years, start over
– spatial: put your computers in multiple locations
Clusters are the only structure that scales!

39 Technical users have alternatives (making the market size too small). PCs work fine for smaller problems. “Do it yourself” clusters, e.g. Beowulf, work! MPI, PVM, Linda: programming models don’t exploit shared memory… are they the lcd (lowest common denominator)? ISVs have to use the lcd to survive. SMPs are expensive. Parallelization is limited. Clusters are required for scalability or for apps requiring extraordinary performance... so DSM only adds to the already complex parallelization problem. Non-U.S. users buy SMPvectors for capacity for legacy apps, until cluster-ready apps arrive.
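For concreteness, a minimal sketch of the lowest-common-denominator message-passing style the slide refers to; it assumes the mpi4py binding and NumPy, neither of which the deck mentions:

```python
# Minimal message-passing example in the MPI style named on the slide (mpi4py is an assumption).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process owns its own slice of the data; no shared memory is assumed.
local = np.arange(rank * 1000, (rank + 1) * 1000, dtype=np.float64)
local_sum = local.sum()

# All coordination is explicit message passing: here, a reduction to rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(f"{size} processes, global sum = {total}")
```

Launched with something like `mpiexec -n 8 python sum.py` (a hypothetical file name), the same code runs unchanged on an SMP, a Beowulf, or an MPP, which is exactly why it is the lcd.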

40 C1999 clusters of computers; it’s MPP when processors/cluster > 1000.
Who            | ΣP.pap (Tflops) | ΣP (#K) | P.pap (Gflops) | ΣP.pap/C (Gflops) | Σp/C (#) | ΣMp/C (GB) | ΣM.s (TB)
LLNL (IBM)     | 3.9  | 5.9  | .66 | 5.3 | 8   | 2.5  | 62
LANL (SGI)     | 3.1  | 6.1  | .5  | 64  | 128 | 32   | 76
Sandia (Intel) | 2.7  | 9.1  | .3  | .6  | 2   | -    |
Beowulf        | 0.5  | 2.   | .04 |     |     |      |
Fujitsu        | 1.2  | .13  | 9.6 | 9.6 | 1   | 4-16 |
NEC            | 4.0  | .5   | 8   | 128 | 16  | 128  |
ESRDC          | 40   | 5.12 | 8   | 64  | 8   | 16   |
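The column split above is my reconstruction of a flattened table; the check below multiplies processor count by per-processor peak and compares with the quoted totals:

```python
# Sanity check of the reconstructed table: total peak vs. (#K processors) x (Gflops/processor).
rows = {                      # name: (SigmaP.pap Tflops, SigmaP in thousands, P.pap Gflops)
    "LLNL (IBM)":     (3.9, 5.9, 0.66),
    "LANL (SGI)":     (3.1, 6.1, 0.5),
    "Sandia (Intel)": (2.7, 9.1, 0.3),
    "Fujitsu":        (1.2, 0.13, 9.6),
    "NEC":            (4.0, 0.5, 8.0),
    "ESRDC":          (40.0, 5.12, 8.0),
}
for name, (total_tflops, kprocs, gflops_per_p) in rows.items():
    implied = kprocs * gflops_per_p        # (K procs) x (Gflops) = Tflops
    print(f"{name:15s} table {total_tflops:5.1f} Tflops, implied {implied:5.1f} Tflops")
```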

41 Commercial users don’t need them Highest growth is & will be web servers delivering pages, audio, and video Apps are inherently, embarrassingly parallel Databases and TP parallelized and transparent A single SMP handles traditional apps Clusters required for reliability, scalabilities

42 Application taxonomy
Technical: general purpose, non-parallelizable codes (PCs have it!); vectorizable & //able (supers & all SmPs); hand tuned, one-of, MPP coarse grain; MPP embarrassingly // (clusters of anythings).
Commercial: database; database/TP; web host; stream audio/video.
If real rich then IBM mainframes or large SMPs, else PC clusters. If real rich then SMP clusters, else PC clusters (U.S. only).

43 Questions for builders & users
– Can we count on Moore’s Law continuation?
– Vector vs. scalar using commodity chips?
– Clustered computing vs. traditional SMPv?
– Can MPP apps be written for scalable //lism?
– Cost: how much time and money for apps?
– Benefit/need: in time & cost of execution?
– When will DSM occur or be pervasive?
– Commodity, proprietary, or net interconnections?
– VendorIX (or Linux) vs. NT? Shrink-wrap supers?
– When will computer science research & teach //ism?
– Did the Web divert follow-through efforts and funding?
– What’s the prognosis for gov’t leadership, funding?

44 The physical processor
– commodity aka Intel micros: does VLIW work better as a micro than it did as a mini at Cydrome & Multiflow?
– vector processor… abandoned or reborn?
– multiple processors per chip, or multi-threading
– FPGA chip-based special processors, or other higher-volume processors

45 What is the processor architecture? Clearly polarized as US vs. Japan: vectors or not.
– Comp. sci. view: MISC >> CISC, language directed, RISC, super-scalar, MTA, extra-long instruction word
– Supercomputer view: RISC; VCISC (vectors), multiple pipes

46 Weather model performance

47 40 Tflops Earth Simulator R&D Center c2002

48 Mercury & Sky Computers - & $
– Rugged system with 10 modules ~ $100K; $1K/#
– Scalable to several K processors; ~1-10 Gflops/ft³
– 10 9U boards × 4 PPC750s → 440 SPECfp95 in 1 ft³ (18.5 × 8 × 10.75”)
– 256 Gflops/$3M: Sky 384-signal-processor system, #20 on the ‘Top 500’, $3M
– Pictured: Mercury VME Platinum System; Sky PPC daughtercard
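Dividing the quoted price by the quoted peak (my arithmetic, not on the slide) lands in the same ballpark as Sterling's Beowulf cost figure above:

```python
# Cost per peak Gflops of the $3M, 256 Gflops Sky system quoted on the slide.
price_dollars = 3_000_000
peak_gflops = 256
per_gflops = price_dollars / peak_gflops
print(per_gflops, per_gflops / 1000)   # ~$11,700 per Gflops, i.e. ~$11.7 per peak Mflops
```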

49

50 Russian Elbrus E2K (vs. Merced)
                      E2K     | Merced
Clock (GHz)           1.2     | 0.8
SPEC int/fp           135/350 | 45/70
Die size, mm² (.18u)  126     | 300
Power (W)             35      | 60
PAP (Gflops)          10.2    | -
Pin B/W (GB/s)        8       | 1.9
Cache (KB)            64/256  | -
System ship           Q4      | 2001

51 Computer (P-Mp) system alternatives
– Node size: the most cost-effective SMPs; now 1-2 processors on a single board, evolving to 4-8; evolves based on n processors per chip
– Continued use of the single-bus SMP “multi”, with enhancements for performance & reliability
– Large, backplane-bus-based SMPs provide a single system image for small systems, but are not cost- or space-efficient for use as a cluster component
– SMPs evolving to weak-coherency DSMs

52 Cluster system alternatives
– System in a room: SAN connected, e.g. NOW, Beowulf
– System in the building: LAN connected
– System across the continent or globe: inter-/intra-net connected networks

53 Architects & architectures… clusters aka (MPP if p>1000); clusters => NUMA/DSM iff commodity interconnects supply them.
U.S. vendors = 9 x scalar processors
– HP, IBM, and SUN: minicomputers aka servers to attack mainframes are the basic building blocks
– SMPs with 100+ processors per system
– surprise: 4-16 processors/chip… MTA?
Intel-based: desktop & small servers; commodity supercomputers ala Beowulf and shrink-wrap supercomputing.
Japanese vendors = vector processors
– NEC continues driving DSM/NUMA using SMPv
– Fujitsu will evolve to NUMA/DSM

54 “Petaflops by 2010” DOE Accelerated Strategic Computing Initiative (ASCI)

55 DOE’s 1997 “PathForward” Accelerated Strategic Computing Initiative (ASCI)
1997:      1-2 Tflops, $100M
1999-2001: 10-30 Tflops, $200M??
2004:      100 Tflops; evolve & lash together and you get a petaflops
2010:      Petaflops

56 1994 Petaflops Workshop, c2007-2014: clusters of clusters, something for everyone
                SMP           | Clusters          | Active Memory
Processors      400 P         | 4-40K P           | 400K P
Per processor   1 Tflops*     | 10-100 Gflops     | 1 Gflops
Memory          400 TB SRAM   | 400 TB DRAM       | 0.8 TB embedded
Chips           250 Kchips    | 60K-100K chips    | 4K chips
Latency         1 ps/result   | 10-100 ps/result  |
*100 x 10 Gflops threads
100,000 1-Tbyte discs => 100 Petabytes; 10 failures/day

57 Petaflops disks: just compute it at the source
– 100,000 1-Tbyte discs => 100 Petabytes
– 8 Gbytes of memory per chip
– 10 Gflops of processing per chip
– NT, Linux, or whatever O/S
– 10 Gbps network interface
Result: 1.0 petaflops at the disks
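Multiplying out the per-disk figures (straightforward products of the slide's numbers):

```python
# Aggregates of the "compute at the disks" design: products of the slide's per-disk figures.
n_disks = 100_000
print(n_disks * 1 / 1e3, "PB of storage")            # 1 TB each      -> 100 PB
print(n_disks * 10 / 1e6, "Pflops at the disks")     # 10 Gflops each -> 1.0 Pflops
print(n_disks * 8 / 1e6, "PB of DRAM")               # 8 GB each      -> 0.8 PB
print(n_disks * 10 / 1e6, "Pbps aggregate network")  # 10 Gbps each   -> 1.0 Pbps
```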

58 HT-MT

59 HT-MT…
– Mechanical: cooling and signals
– Chips: design tools, fabrication
– Chips: memory, PIM
– Architecture: MTA on steroids
– Storage material

60 Global clusters… a goal, challenge, possibility? “Our vision... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects.” Grimshaw, Wulf, et al., “Legion”, CACM, Jan. 1997

61 Utilize in situ workstations!
– NOW (Berkeley): set the sort record, decrypting
– Grid, Globus, Condor and other projects
– Need a “standard” interface and programming model for clusters using “commodity” platforms & fast switches
– Giga- and tera-bit links and switches allow geo-distributed systems
– Each PC in a computational environment should have an additional 1 GB/9 GB!

62 In 2010 every organization will have its own petaflops supercomputer!
– 10,000 nodes in 1999, or 10x over 1987
– Assume 100K nodes in 2010
– 10 Gflops / 10 GBy / 1,000 GB nodes for low-end c2010 PCs
– Communication is the first problem… use the network, which will be >10 Gbps
– Programming is still the major barrier
– Will any problems or apps fit it? Will any apps exploit it?
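The headline follows directly from the assumed node count and per-node specs:

```python
# Aggregate of 100K assumed c2010 nodes at 10 Gflops / 10 GB / 1,000 GB each.
nodes = 100_000
print(nodes * 10 / 1e6, "Pflops")         # 1.0 Pflops of peak compute
print(nodes * 10 / 1e6, "PB of memory")   # 1.0 PB of DRAM
print(nodes * 1_000 / 1e6, "PB of disk")  # 100 PB of disk
```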

63 The Grid: Blueprint for a New Computing Infrastructure Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann, 1999 Published July 1998; ISBN 1-55860-475-8 22 chapters by expert authors including: – Andrew Chien, – Jack Dongarra, – Tom DeFanti, – Andrew Grimshaw, – Roch Guerin, – Ken Kennedy, – Paul Messina, – Cliff Neuman, – Jon Postel, – Larry Smarr, – Rick Stevens, – Charlie Catlett – John Toole – and many others http://www.mkp.com/grids “A source book for the history of the future” -- Vint Cerf

64 The Grid “Dependable, consistent, pervasive access to [high-end] resources” Dependable: Can provide performance and functionality guarantees Consistent: Uniform interfaces to a wide variety of resources Pervasive: Ability to “plug in” from anywhere

65 (Diagram: the 2004 computer food chain “???”: portable computers, mainframe, vector super, networks of workstations/PCs, mini, massively parallel processors. Dave Patterson, UC/Berkeley)

66 Summary
– A 1000x increase in PAP has not been accompanied by comparable RAP, insight, infrastructure, and use. We are finally at the beginning.
– “The PC World Challenge” is to provide “shrink-wrap”, clustered parallelism to commercial and technical communities.
– This only becomes true if system suppliers, e.g. Microsoft, deliver commodity control software.
– ISVs must believe that clusters are the future.
– Computer science has to get with the program.
– The Grid etc., using world-wide resources including in situ PCs, is the new idea.

67 2010 architecture evolution High end computing will continue. Advantage SMPvector clusters – Unclear that U.S. will produce one versus “stay the course” using 10x “killer micros” Shrink-wrap clusters become pervasive. – SmP (m>16) will be the cluster component, including SmP-on-a chip and board “multis”. Cost-effective systems come from best nodes. Backplanes are not cost-effective I/Cs – Interconnection nets, log(p), are the challenge. – Apps determine whether clusters become a general purpose versus niche structure

68 Technical computer types: pick of 4 nodes and 2-3 interconnects (SAN, DSM, SMP; micros vs. vector). Candidates shown: Fujitsu, Hitachi, IBM, ?PC?, SGI cluster, Beowulf, NEC, ?PC?, SGI DSM, ?HP?, NEC super, Cray ???, Fujitsu, Hitachi, HP, IBM, ?PC?, SUN, plain old PCs.

69 The end

70 When is a petaflops possible? What price? (Gordon Bell, ACM 1997)
– Moore’s Law: 100x (but how fast can the clock tick?)
– Increase parallelism 10K → 100K: 10x
– Spend more ($100M → $500M): 5x
– Centralize center or fast network: 3x
– Commoditization (competition): 3x
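Multiplying the factors together (my arithmetic on the slide's list):

```python
# Product of the slide's speedup factors.
factors = [100, 10, 5, 3, 3]   # Moore's Law, parallelism, spending, centralization, commoditization
total = 1
for f in factors:
    total *= f
print(total)   # 45,000x: ample margin over the ~1,000x needed to go from ~1 Tflops to a petaflops
```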

71 C2000+ architecture taxonomy
SMP:
– Xpt SMPs (mainframes)
– Xpt-SMPvector
– Xpt-multithread (Tera)
– “multi” as a component
– Xpt-“multi” hybrid
– DSM (commodity-SCI)?
– DSM (scalar)
– DSM (vector)
Multicomputers aka clusters (MPP when n>1000 processors), the mainline:
– Commodity “multis”
– Clusters of “multis”
– Clusters of DSMs (scalar & vector)

