Presentation on theme: "Gordon Bell Bay Area Research Center Microsoft Corporation"— Presentation transcript:
1 All the chips outside… and around the PC: what new platforms? Apps? Challenges, what's interesting, and what needs doing?
Gordon Bell
Bay Area Research Center
Microsoft Corporation
2 Architecture changes when everyone and everything is mobile!
Power, security, RF, WWW, display, data-types (e.g. video & voice)… it's the application of architecture!
3 The architecture problem
The apps
Data-types: video, voice, RF, etc.
Environment: power, speed, cost
The material: clock, transistors…
Performance… it's about parallelism
Program & programming environment
Network, e.g. WWW and Grid
Clusters
Multiprocessors
Storage, cluster, and network interconnect
Processor and special processing
Multi-threading and multiple processors per chip
Instruction-level parallelism vs. vector processors
7 PC At An Inflection Point? It needs to continue to be upward.
[chart: PCs vs. non-PC devices and Internet]
These scalable systems provide the highest technical (Flops) and commercial (TPC) performance. They drive microprocessor competition!
8 The Dawn Of The PC-Plus Era, Not The Post-PC Era… devices aggregate via PCs!!!
[diagram: consumer PCs at the center of mobile companions, TV/AV, household management, communications, automation & security]
9 PC will prevail for the next decade as a dominant platform … 2nd to smart, mobile devices
Moore's Law increases performance; alternatively, it reduces prices
PC server clusters with low-cost OS beat proprietary switches, smPs, and DSMs
Home entertainment & control …
Very large disks (1 TB by 2005) to "store everything"
Screens to enhance use
Mobile devices, etc. dominate the WWW >2003!
Voice and video become important apps!
C = Commercial; C' = Consumer
10 Where's the action? Problems?
Constraints: speech, video, mobility, RF, GPS, security…
Moore's Law, including network speed
Scalability and high-performance processing
Building them: clusters vs. DSM
Structure: where's the processing, memory, and switches (disk and TCP/IP processing)?
Micros: getting the most from the nodes
Not ISAs: change can delay the Moore's Law effect … and wipe out software investment! Please, please, just interpret my object code!
System-on-a-chip alternatives… apps drive data-types (e.g. video, voice, RF), performance, portability/power, and cost
16 Dead Supercomputer Society
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics
17 SCI Research c1985-1995: 35 university and corporate R&D projects
2 or 3 successes… all the rest failed to work or to be successful
18 How to build scalables?
To cluster or not to cluster… don't we need a single, shared memory?
19 Application Taxonomy
Technical:
General purpose, non-parallelizable codes (PCs have it!)
Vectorizable
Vectorizable & //able (supers & small DSMs)
Hand-tuned, one-of-a-kind, coarse-grain MPP
Embarrassingly // MPP (clusters of PCs...)
Commercial:
Database
Database/TP
Web host
Stream audio/video
If central control & rich, then IBM or large SMPs; else PC clusters
20 SNAP … c1995: Scalable Network And Platforms — A View of Computing in … We all missed the impact of WWW!
This talk / essay portrays our view of computer-server architecture trends. (It is silent on the client cellphones, toasters, and gameboys.)
This is an early draft. We are sending a copy to you in hopes that you'll read and comment on it. We would like to publish it in several forms: a 2-hr video lecture, a kickoff article in a ComputerWorld issue that Gordon is editing, and a monograph, enlarged, to be published within a year.
January 1, 1995
Gordon Bell
Jim Gray
21 Computing SNAP built entirely from PCs
[diagram: legacy mainframe & minicomputer servers & terminals; wide-area global network; mobile nets; wide & local area networks for terminals, PCs, workstations, & servers; person servers (PCs); scalable computers built from PCs; centralized & departmental uni- & mP servers (UNIX & NT) → centralized & departmental servers built from PCs???; TC = TV + PC home ... (CATV or ATM or satellite); a space, time (bandwidth), & generation scalable environment]
Here's a much more radical scenario, but one that seems very likely to me. There will be very little difference between servers and the person servers, or what we mostly associate with clients. This will come because economy of scale is replaced by economy of volume. The largest computer is no longer cost-effective. Scalable computing technology dictates using the highest-volume, most cost-effective nodes. This means we build everything, including mainframes and multiprocessor servers, from PCs!
22 How Will Future Computers Be Built?
Thesis — SNAP: Scalable Networks and Platforms. Upsize from desktop to world-scale computer, based on a few standard components.
Because:
Moore's law: exponential progress
Standardization & commoditization
Stratification and competition
When: sooner than you think!
Massive standardization gives massive use
Economic forces are enormous
23 Bell Prize and Future Peak Tflops (t)
[chart: Bell Prize performance over time; points include XMP, NCube, CM2, NEC, IBM; Petaflops study target]
24 Top 10 tpc-c
Top two Compaq systems are: 1.1 & 1.5X faster than IBM SPs; 1/3 the price of IBM; 1/5 the price of Sun
26 Five Scalabilities
Size scalable — designed from a few components, with no bottlenecks
Generation scaling — no rewrite/recompile or user effort to run across generations of an architecture
Reliability scaling — choose any level
Geographic scaling — compute anywhere (e.g. multiple sites or in-situ workstation sites)
Problem x machine scalability — the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer. Problem x machine space => run time: problem scale, machine scale (#p), run time; implies speedup and efficiency.
It is unclear whether scalability has any meaning for a real system. While the following are all based on some order of N, the engineering details of a system determine the missing constants!
A system is scalable if efficiency(n, x) = 1 for all algorithms, numbers of processors n, and problem sizes x. This definition fails to recognize cost, efficiency, and whether VLSCs are practical (affordable) in a reasonable time scale.
Cost < O(N^2) rules out the cross-point = O(N^2), although its latency is O(1); Omega is O(N log N); ring/bus/mesh are O(N).
Bandwidth is required to be < O(log N); supercomputer bandwidths are O(N)... no caching, no hierarchies.
SIMD didn't scale; the CM5 probably won't.
Compatibility with the future is important. No matter how much you build on standards, you want the next one to take all the programs (without recompilation) and files, and run them with no changes!
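The interconnect cost orders quoted above can be sketched numerically. This is an illustrative example, not from the talk; the function name and the choice of N are mine, and constant factors are ignored.

```python
import math

def switch_cost(topology, n):
    """Switching elements needed for n nodes, up to constant factors."""
    costs = {
        "crossbar": n * n,             # O(N^2): ruled out by cost, though latency is O(1)
        "omega":    n * math.log2(n),  # O(N log N) multistage network
        "ring":     n,                 # O(N), as are bus and mesh
    }
    return costs[topology]

# At N = 1024, the crossbar needs ~1000x the elements of a ring:
n = 1024
print(switch_cost("crossbar", n) / switch_cost("ring", n))  # 1024.0
```

The point of the slide's cost bound survives the missing constants: any topology whose element count grows as N^2 stops being affordable long before the bandwidth argument even applies.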
27 Why I gave up on large smPs & DSMs
Economics: perf/cost is lower… unless a commodity
Economics: longer design time & life. Complex. => Poorer technology tracking & end-of-life performance.
Economics: higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system.
DSMs … NUMA! Latency matters. The compiler, run-time, and O/S locate the programs anyway.
They aren't scalable. Reliability requires clusters. Start there.
They aren't needed for most apps… hence, a small market unless one can find a way to lock in a user base, as in the case of IBM Token Ring vs. Ethernet.
28 FVCORE Performance — Finite Volume Community Climate Model, joint code development by NASA, LLNL, and NCAR
[chart: performance of SX-5, SX-4, C90-16 (max), T3E]
29 Architectural Contrasts – Vector vs. Microprocessor
Vector system: CPU with vector registers (8 KBytes) fed from memory; 500 MHz; two results per clock; vector lengths arbitrary; vectors fed at low speed
Microprocessor system: CPU with 1st & 2nd level caches (8 MBytes) fed from memory; 600 MHz; two results per clock (will be 4 in next-gen SGI); vector lengths fixed; vectors fed at high speed
Cache-based systems are nothing more than "vector" processors with a highly programmable "vector" register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note that 512-CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
30 Convergence to one architecture: mPs continue to be the main line
31 "Jim, what are the architectural challenges … for clusters?"
WANs (and even LANs) faster than backplanes at 40 Gbps
End of busses (FC = 100 MBps)… except on a chip
What are the building blocks or combinations of processing, memory, & storage?
Infiniband starts at OC48, but it may not go far or fast enough, if it ever exists. OC192 is being deployed.
32 What is the basic structure of these scalable systems?
Overall
Disk connection, especially with respect to Fibre Channel
SAN, especially with fast WANs & LANs
33 Modern scalable switches … also hide a supercomputer
Scale from <1 to 120 Tbps of switch capacity
1 Gbps Ethernet switches scale to 10s of Gbps
SP2 scales from 1.2 Gbps
34 GB plumbing from the baroque: evolving from the 2 dance-hall model
Mp — S — Pc:
  |— S.fc — Ms
  |— S.Cluster
  |— S.WAN —
evolving to:
Mp Pc Ms — S.Lan/Cluster/Wan — :
35 SNAP Architecture
With this introduction about technology, computing styles, and the chaos and hype around standards and openness, we can look at the network & nodes architecture I posit.
36 ISTORE Hardware Vision
System-on-a-chip enables computer + memory without significantly increasing the size of the disk
5-7 year target:
MicroDrive: 1.7" x 1.4" x 0.2"
1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
Integrated IRAM processor (2x height)
Connected via crossbar switch, growing like Moore's law
16 MBytes; 1.6 Gflops; 6.4 Gops
10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops
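The 2006 targets above follow directly from compounding the 1999 baseline at the stated growth rates. A minimal sketch checking the arithmetic (the function name is mine):

```python
# Compound the 1999 MicroDrive baseline at the slide's growth rates
# over the 7 years to 2006.
def project(value_1999, growth_per_year, years=7):
    return value_1999 * growth_per_year ** years

capacity_2006 = project(340e6, 1.6)   # bytes: ~9.1e9, the slide's ~9 GB
bandwidth_2006 = project(5.0, 1.4)    # MB/s: ~52.7, the slide's ~50 MB/s
```

Both projections land on the slide's round numbers, which suggests the targets were derived the same way.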
37 The Disk Farm? Or a System On a Card?
The 500-GB disc card: an array of discs on a 14" card
Can be used as: 100 discs, 1 striped disc, 50 FT discs, ... etc.
LOTS of accesses/second, LOTS of bandwidth
A few disks are replaced by 10s of GBytes of RAM and a processor to run apps!!
38 Map of Gray Bell Prize results
[map: Redmond/Seattle, WA to New York / Arlington, VA / San Francisco, CA — 5626 km, 10 hops]
Single-thread, single-stream TCP/IP via 7 hops, desktop-to-desktop … Win 2K out-of-the-box performance*
39 Ubiquitous 10 GBps SANs in 5 years
1 Gbps Ethernet is a reality now. Also FibreChannel, MyriNet, GigaNet, ServerNet, ATM, …
10 Gbps x4 WDM deployed now (OC192)
3 Tbps WDM working in the lab
In 5 years, expect 10x. Wow!!
[chart: link speeds — 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps), 1 GBps]
40 The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/)
Yesterday: 10 MBps (100 Mbps Ethernet); ~20 MBps TCP/IP saturates 2 CPUs; round-trip latency ~250 µs
Now: wires are 10x faster (Myrinet, Gbps Ethernet, ServerNet, …); fast user-level communication: TCP/IP ~100 MBps at 10% CPU; round-trip latency is ~15 µs
1.6 Gbps demoed on a WAN
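Why the latency drop matters as much as the faster wires: for request/response traffic, effective throughput is message size divided by round-trip latency plus transfer time. A sketch with my own illustrative message size (the model and function name are assumptions, not from the talk):

```python
# Effective throughput of one request/response transfer:
# size / (round_trip_latency + size / wire_rate).
def effective_MBps(msg_bytes, rtt_s, wire_MBps):
    msg_mb = msg_bytes / 1e6
    return msg_mb / (rtt_s + msg_mb / wire_MBps)

# An 8 KB message, yesterday's SAN vs. VIA-style user-level comms:
old = effective_MBps(8192, 250e-6, 20)    # ~12 MB/s: latency-bound
new = effective_MBps(8192, 15e-6, 100)    # ~85 MB/s: close to wire speed
```

Small messages on the old stack are dominated by the 250 µs round trip, so cutting latency to 15 µs buys nearly as much as the 10x wire speedup.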
44 Mainframes, minis, micros, and RISC
Created originally in '78 at DEC.
Will RISC continue at 60%/yr, i.e. x2/18 mos — Moore's speed law?
What about GaAs??? When?
When do we put the mainframe out of its misery?
The speed increase from each factor is typically only 26%/yr: clock x2 per 3 years, and architecture x2 per 3 years.
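The 26%/yr figure and the 60%/yr "Moore's speed law" are the same arithmetic: doubling every 3 years is 2^(1/3) per year, and two such independent factors compound. A quick check:

```python
# Doubling every 3 years = 2**(1/3) per year, i.e. ~26%/yr.
per_year = 2 ** (1 / 3)          # ~1.26

# Clock (26%/yr) and architecture (26%/yr) compound:
combined = per_year * per_year   # ~1.59, i.e. roughly the 60%/yr speed law
```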
46 Growth of microprocessor performance
[chart: performance in Mflop/s, 1978-1996, log scale 0.01-10000; supers: Cray 1, Cray X-MP, Cray Y-MP, Cray 2, Cray C90, Cray T90; micros: 8087, 80287, 80387, i860, R2000, RS6000/540, RS6000/590, Alpha — micros closing the gap on supers]
47 Albert Yu predictions '96
[table: projected clock (MHz), transistors (M), ops, and die size (sq. in.) by year]
48 Processor Limit: DRAM Gap
[chart: "Moore's Law" — performance vs. time, 1980-2000; CPU grows 60%/yr, DRAM latency improves 7%/yr; the processor-memory performance gap grows 50%/year]
Cliché: note that x86 didn't have cache on chip until 1989.
Alpha full cache miss / instructions executed: … ns / 1.7 ns = 108 clks x 4, or 432 instructions.
Caches in Pentium Pro: 64% area, 88% transistors.
*Taken from Patterson-Keeton talk to SIGMOD.
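The "gap grows 50%/year" claim follows from the two growth rates on the chart: the ratio of CPU to DRAM improvement compounds. A minimal check of that arithmetic (the 20-year horizon is mine, matching the chart's span):

```python
# CPU performance at 60%/yr vs. DRAM at 7%/yr: the gap between them
# grows by their ratio each year.
cpu_growth, dram_growth = 1.60, 1.07
gap_per_year = cpu_growth / dram_growth   # ~1.50, the chart's 50%/yr
gap_20_years = gap_per_year ** 20         # ~3000x over the charted 1980-2000
```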
49 The "memory gap" — ways to cope:
Multiple (e.g. 4) processors/chip, in order to increase the ops/chip while waiting for the inevitable access delays
Or, alternatively, multi-threading (MTA)
Vector processors with a supporting memory system
System-on-a-chip… to reduce chip-boundary crossings
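The multi-threading option above has a simple sizing rule: to keep one pipeline busy, you need roughly (miss latency / useful work between misses) runnable threads. A sketch of that arithmetic — the function and the example workload are my illustrative assumptions, reusing the ~108-cycle miss from the previous slide:

```python
import math

# Threads needed to hide memory latency, MTA-style: while one thread
# waits on a miss, the others fill the pipeline.
def threads_to_hide(miss_latency_cycles, cycles_between_misses):
    return math.ceil(miss_latency_cycles / cycles_between_misses)

# A ~108-cycle miss every 25 cycles of useful work needs ~5 threads:
print(threads_to_hide(108, 25))  # 5
```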
50 If system-on-a-chip is the answer, what is the problem?
Small, high-volume products:
Phones, PDAs
Toys & games (to sell batteries)
Cars
Home appliances
TV & video
Communication infrastructure
Plain old computers… and portables
51 SOC Alternatives… not including C/C++ CAD tools
The blank sheet of paper: FPGA
Auto design of a basic system: Tensilica
Standardized, committee-designed components*, cells, and custom IP
Standard components including more application-specific processors*, IP add-ons, and custom
One chip does it all: SMOP
*Processors, memory, communication & memory links
54 System-on-a-chip alternatives
FPGA — sea of uncommitted gate arrays — Xilinx, Altera
Compile a system — unique processor for every app — Tensilica
Systolic | array — many pipelined or parallel processors + custom
DSP | VLIW — special-purpose processor cores + custom — TI
Pc & Mp. ASICs — general-purpose cores, specialized by I/O, etc. — IBM, Intel, Lucent
Universal micro — multiprocessor array, programmable I/O — Cradle
55 Cradle: Universal Microsystem — trading Verilog & hardware for C/C++
UMS : VLSI = microprocessor : special systems = software : hardware
Single part for all apps
App run time using FPGA & ROM
5 quad mPs at 3 Gflops/quad = 15 Gflops
Single shared memory space, caches
Programmable periphery including: 1 GB/s; 2.5 Gips; PCI, 100baseT, FireWire
$4 per Gflops; 150 mW/Gflops
56 UMS Architecture
Must allow mix and match of applications. Design reuse is important, thus scalability is a must. Resources must be balanced. Cradle is developing such an architecture, which has multiple processors (MSPs) attached to private memories, communicating with external devices through a DRAM controller and programmable I/O.
The architecture is regular and modular: processing with memory, high-speed bus.
Memory bandwidth scales with processing
Scalable processing, software, I/O
Each app runs on its own pool of processors
Enables durable, portable intellectual property
57 Recapping the challenges
Scalable systems
Latency in a distributed memory
Structure of the system and nodes
Network performance for OC192 (10 Gbps)
Processing nodes and legacy software
Mobile systems… power, RF, voice, I/O
Design time!