1 Copyright Gordon Bell & Jim Gray ISCA2000 All the chips outside… and around the PC: what new platforms? Apps? Challenges, what's interesting, and what needs doing? Gordon Bell, Bay Area Research Center, Microsoft Corporation

2 Architecture changes when everyone and everything is mobile! Power, security, RF, WWW, display, data-types e.g. video & voice… it's the application of architecture!

3 The architecture problem
The apps
– Data-types: video, voice, RF, etc.
– Environment: power, speed, cost
The material: clock, transistors…
Performance… it's about parallelism
– Program & programming environment
– Network e.g. WWW and Grid
– Clusters
– Multiprocessors
– Storage, cluster, and network interconnect
– Processor and special processing
– Multi-threading and multiple processor per chip
– Instruction Level Parallelism vs
– Vector processors

4 IP On Everything

5 poochi

6 Sony Playstation export limits

7 PC At An Inflection Point?
[Chart labels: PCs; Non-PC devices and Internet]
It needs to continue to be upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance. They drive microprocessor competition!

8 The Dawn Of The PC-Plus Era, Not The Post-PC Era… devices aggregate via PCs!!!
[Diagram labels: Consumer PCs; TV/AV; Mobile Companions; Household Management; Communications; Automation & Security]

9 PC will prevail for the next decade as a dominant platform … 2nd to smart, mobile devices
Moore's Law increases performance; and alternatively reduces prices
PC server clusters with low cost OS beat proprietary switches, smPs, and DSMs
Home entertainment & control …
– Very large disks (1TB by 2005) to store everything
– Screens to enhance use
Mobile devices, etc. dominate WWW >2003!
Voice and video become important apps!
C = Commercial; C = Consumer

10 Where's the action? Problems?
Constraints: Speech, video, mobility, RF, GPS, security…
Moore's Law, including network speed
Scalability and high performance processing
– Building them: Clusters vs DSM
– Structure: where's the processing, memory, and switches (disk and ip/tcp processing)
– Micros: getting the most from the nodes
Not ISAs: Change can delay Moore's Law effect … and wipe out software investment! Please, please, just interpret my object code!
System on a chip alternatives… apps drive
– Data-types (e.g. video, voice, RF), performance, portability/power, and cost

11 High Performance Computing: A 60+ year view

12 High performance architecture/program timeline
[Timeline labels: Vtubes; Trans.; MSI (mini); Micro; RISC; nMicr. Sequential programming ----> (single execution stream)]

13 Computer types
[Diagram labels: Networked Supers… GRID, Legion, Condor; Beowulf; NT clusters; VPPuni; T3E; SP2 (mP); NOW; NEC mP; SGI DSM clusters & SGI DSM; NEC super; Cray X…T (all mPv); Mainframes; Multis; WSs; PCs. Connectivity axis: WAN/LAN, SAN, DSM, SM. Categories: micros, vector, Clusters]

14 Technical computer types
[Diagram labels: Networked Supers… GRID, Legion, Condor; Beowulf; VPPuni; SP2 (mP); NOW; NEC mP; T series; SGI DSM clusters & SGI DSM; NEC super; Cray X…T (all mPv); Mainframes; Multis; WSs; PCs. Connectivity axis: WAN/LAN, SAN, DSM, SM. Categories: micros, vector]
Old World (one program stream)
New world: Clustered Computing (multiple program streams)

15 Dead Supercomputer Society

16 ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

17 SCI Research c university and corporate R&D projects 2 or 3 successes… All the rest failed to work or be successful

18 How to build scalables? To cluster or not to cluster… don't we need a single, shared memory?

19 Application Taxonomy
Technical: General purpose, non-parallelizable codes (PCs have it!); Vectorizable; Vectorizable & //able (Supers & small DSMs); Hand tuned, one-off; MPP coarse grain; MPP embarrassingly // (Clusters of PCs...)
Commercial: Database; Database/TP; Web Host; Stream Audio/Video
If central control & rich then IBM or large SMPs else PC Clusters

20 SNAP … c1995
Scalable Network And Platforms
A View of Computing in
We all missed the impact of WWW!
Gordon Bell, Jim Gray

21 Computing SNAP built entirely from PCs
[Diagram labels: Wide & Local Area Networks for: terminal, PC, workstation, & servers; Centralized & departmental uni- & mP servers (UNIX & NT); Legacy mainframe & minicomputer servers & terminals; Wide-area global network; Centralized & departmental servers built from PCs; scalable computers built from PCs; TC=TV+PC home... (CATV or ATM or satellite); ??? Portables; Person servers (PCs); Mobile Nets]
A space, time (bandwidth), & generation scalable environment

22 How Will Future Computers Be Built?
Thesis: SNAP: Scalable Networks and Platforms
– Upsize from desktop to world-scale computer based on a few standard components
Because:
– Moore's law: exponential progress
– Standardization & Commoditization
– Stratification and competition
When: Sooner than you think!
– Massive standardization gives massive use
– Economic forces are enormous

23 Bell Prize and Future Peak Tflops (t)
[Chart labels: Petaflops study target; NEC; XMP; NCube; CM2; *IBM]

24 Top 10 tpc-c

25 Courtesy of Dr. Thomas Sterling, Caltech

26 Five Scalabilities
Size scalable -- designed from a few components, with no bottlenecks
Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
Reliability scaling… choose any level
Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.
Problem x machine space => run time: problem scale, machine scale (#p), run time, implies speedup and efficiency
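
The last scalability ties problem scale, machine scale (#p), and run time to speedup and efficiency. A minimal sketch of the usual textbook definitions follows; the function names and the run times are hypothetical illustrations, not measurements from the talk.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup = run time on one processor / run time on p processors."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency = speedup divided by processor count (1.0 is ideal)."""
    return speedup(t1, tp) / p


# Hypothetical (processors, run time in seconds) pairs for one fixed problem size.
runs = [(1, 1000.0), (4, 260.0), (16, 80.0), (64, 40.0)]
t1 = runs[0][1]

for p, tp in runs:
    print(f"p={p:3d}  speedup={speedup(t1, tp):6.2f}  "
          f"efficiency={efficiency(t1, tp, p):5.2f}")
```

In this made-up series the efficiency falls as #p grows for a fixed problem, which is why the slide treats problem scale and machine scale together.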

27 Why I gave up on large smPs & DSMs
Economics: Perf/Cost is lower… unless a commodity
Economics: Longer design time & life. Complex. => Poorer tech tracking & end of life performance.
Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system.
DSMs … NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway.
Aren't scalable. Reliability requires clusters. Start there.
They aren't needed for most apps… hence, a small market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.

28 FVCORE Performance
Finite Volume Community Climate Model, joint code development: NASA, LLNL and NCAR
[Chart labels: Max T3E; Max C90-16; SX-4; SX-550]

29 Architectural Contrasts – Vector vs Microprocessor
Cache based systems are nothing more than vector processors with a highly programmable vector register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note 512 CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
[Diagram labels: Vector System: CPU, Vector registers 8 KBytes, Memory; Microprocessor System: CPU, 1st & 2nd Lvl Caches 8 MBytes, Memory; Vector lengths arbitrary; Vector lengths fixed; Vectors fed at high speed; Vectors fed at low speed; Two results per clock (will be 4 in next gen SGI); Two results per clock; 500 MHz; 600 MHz]
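
The two size claims on this slide can be checked from its own figures. A small sketch, assuming binary units (1 KByte = 1024 bytes), which the slide does not state:

```python
# Figures quoted on the slide; binary units are an assumption.
VECTOR_REGISTER_BYTES = 8 * 1024      # 8 KBytes of vector registers
CACHE_BYTES_PER_CPU = 8 * 1024**2     # 8 MBytes of 1st & 2nd level cache
CPUS = 512                            # 512-CPU Origin

# How much larger the caches are than the vector register set.
cache_to_register_ratio = CACHE_BYTES_PER_CPU // VECTOR_REGISTER_BYTES

# Aggregate cache across all CPUs, in GBytes.
total_cache_gbytes = CPUS * CACHE_BYTES_PER_CPU / 1024**3

print(f"cache / vector registers: {cache_to_register_ratio}x")
print(f"aggregate cache: {total_cache_gbytes} GBytes")
```

Under that assumption the ratio is 1024, which the slide rounds to 1000x, and the aggregate matches the quoted 4 GBytes.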

30 Convergence to one architecture: mPs continue to be the main line

31 Jim, what are the architectural challenges … for clusters?
WANs (and even LANs) faster than backplanes at 40 Gbps
End of busses (fc=100 MBps)… except on a chip
What are the building blocks or combinations of processing, memory, & storage?
Infiniband starts at OC48, but it may not go far or fast enough if it ever exists. OC192 is being deployed.
www.infinibandta.org

32 What is the basic structure of these scalable systems?
Overall
Disk connection, especially wrt fiber channel
SAN, especially with fast WANs & LANs

33 Modern scalable switches … also hide a supercomputer
Scale from <1 to 120 Tbps of switch capacity
1 Gbps ethernet switches scale to 10s of Gbps
SP2 scales from 1.2 Gbps

34 GB plumbing from the baroque: evolving from the 2 dance-hall model
[Diagram labels: Mp; S; Pc; S.fc; Ms; S.Cluster; S.WAN; MpPcMs; S.Lan/Cluster/Wan]

35 SNAP Architecture

36 ISTORE Hardware Vision
System-on-a-chip enables computer, memory, without significantly increasing size of disk
5-7 year target:
MicroDrive: 1.7 x 1.4 x : ?
– 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
– 2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW)
Integrated IRAM processor
– 2x height
Connected via crossbar switch growing like Moore's law
– 16 Mbytes; ; 1.6 Gflops; 6.4 Gops
10,000+ nodes in one rack!
100/board = 1 TB; 0.16 Tflops
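
The 2006 targets follow from compounding the quoted yearly growth rates over seven years, and the per-board figures from multiplying by 100 nodes. A short sketch of that arithmetic; the helper name `project` and the use of decimal units (1 TB = 1,000,000 MB) are assumptions made for illustration.

```python
def project(start_value: float, growth_per_year: float, years: int) -> float:
    """Compound a starting value by a fixed yearly growth factor."""
    return start_value * growth_per_year ** years


YEARS = 2006 - 1999                          # 7 years of growth

capacity_mb = project(340.0, 1.6, YEARS)     # 1999: 340 MB, 1.6X/yr
bandwidth_mb_s = project(5.0, 1.4, YEARS)    # 1999: 5 MB/s, 1.4X/yr

NODES_PER_BOARD = 100
GFLOPS_PER_NODE = 1.6
board_tflops = NODES_PER_BOARD * GFLOPS_PER_NODE / 1000.0
board_tb = NODES_PER_BOARD * capacity_mb / 1_000_000.0

print(f"2006 capacity : {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth: {bandwidth_mb_s:.1f} MB/s")
print(f"per board     : {board_tb:.2f} TB, {board_tflops:.2f} Tflops")
```

The projection lands near 9.1 GB and 53 MB/s per drive, and roughly 0.9 TB and 0.16 Tflops per board, consistent with the rounded numbers on the slide.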

37 The Disk Farm? or a System On a Card?
The 500GB disc card: an array of discs
Can be used as
– 100 discs
– 1 striped disc
– 50 FT discs
– ....etc
LOTS of accesses/second of bandwidth
A few disks are replaced by 10s of Gbytes of RAM and a processor to run Apps!!
[Figure label: 14"]

38 Map of Gray Bell Prize results
[Map labels: Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA; 5626 km; 10 hops]

39 Ubiquitous 10 GBps SANs in 5 years
1 Gbps Ethernet is a reality now.
– Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,…
10 Gbps x4 WDM deployed now (OC192)
– 3 Tbps WDM working in lab
In 5 years, expect 10x, wow!!
[Chart labels: GBps; 5 MBps; 20 MBps; 40 MBps; 80 MBps; 120 MBps (1 Gbps)]

40 The Promise of SAN/VIA: 10x in 2 years
Yesterday:
– 10 MBps (100 Mbps Ethernet)
– ~20 MBps tcp/ip saturates 2 cpus
– round-trip latency ~250 µs
Now:
– Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…
– Fast user-level communication
  - tcp/ip ~100 MBps 10% cpu
  - round-trip latency is 15 µs
1.6 Gbps demoed on a WAN
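
The slide mixes bits per second (bps) with bytes per second (Bps). A small sketch of the conversion and of the latency improvement it quotes; the function name is a hypothetical helper, and reading the gap between the raw wire rate and the slide's lower figures as protocol overhead is an assumption, not something the slide says.

```python
def mbps_to_mbytes_per_s(megabits_per_second: float) -> float:
    """Convert a raw link rate in Mbps to MBps (8 bits per byte, no overhead)."""
    return megabits_per_second / 8.0


links = [
    ("100 Mbps Ethernet", 100.0),
    ("1 Gbps Ethernet", 1000.0),
    ("1.6 Gbps WAN demo", 1600.0),
]
for name, mbps in links:
    print(f"{name:18s} raw ceiling {mbps_to_mbytes_per_s(mbps):6.1f} MBps")

# Round-trip latency figures quoted on the slide, in microseconds.
latency_yesterday_us = 250.0
latency_now_us = 15.0
latency_ratio = latency_yesterday_us / latency_now_us
print(f"round-trip latency improved about {latency_ratio:.1f}x")
```

The raw ceilings (12.5 MBps and 125 MBps) sit a little above the slide's 10 MBps and ~100 MBps, and the latency figures imply an improvement of roughly 17x.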

41 Processor improvements… 90% of ISCA's focus

43 We get more of everything

44 Mainframes, minis, micros, and risc

45 Computer ops/sec x word length / $

46 Growth of microprocessor performance
[Chart: Performance in Mflop/s. Micros: R2000; i860; RS6000/540; Alpha; RS6000/590; Alpha. Supers: Cray 1S; Cray X-MP; Cray 2; Cray Y-MP; Cray C90; Cray T]

47 Albert Yu predictions 96
[Table rows: When; Clock (MHz); MTransistors; Mops (2400; 20,000; 8.3x); Die (sq. in.)]

48 Processor Limit: DRAM Gap
Alpha full cache miss / instructions executed: 180 ns/1.7 ns = 108 clks x 4 or 432 instructions
Caches in Pentium Pro: 64% area, 88% transistors
*Taken from Patterson-Keeton Talk to SigMod
[Chart label: Moore's Law]
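
The cost of a full cache miss can be reworked from the slide's own figures. A minimal sketch, assuming "x 4" means four instructions issued per clock (the slide does not spell this out):

```python
MISS_LATENCY_NS = 180.0   # full cache miss, from the slide
CLOCK_PERIOD_NS = 1.7     # processor clock period, from the slide
ISSUE_WIDTH = 4           # assumed meaning of "x 4": instructions per clock

# Clock cycles that pass while one miss is outstanding.
clocks_per_miss = MISS_LATENCY_NS / CLOCK_PERIOD_NS

# Instruction issue slots lost during those cycles.
lost_instruction_slots = clocks_per_miss * ISSUE_WIDTH

print(f"clocks per miss       : {clocks_per_miss:.1f}")
print(f"lost instruction slots: {lost_instruction_slots:.0f}")
```

Plain division gives about 106 clocks and about 424 instruction slots, close to the 108 and 432 quoted on the slide, so the slide's numbers are best read as approximate.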

49 The memory gap
Multiple e.g. 4 processors/chip in order to increase the ops/chip while waiting for the inevitable access delays
Or alternatively, multi-threading (MTA)
Vector processors with a supporting memory system
System-on-a-chip… to reduce chip boundary crossings

50 If system-on-a-chip is the answer, what is the problem?
Small, high volume products
– Phones, PDAs,
– Toys & games (to sell batteries)
– Cars
– Home appliances
– TV & video
Communication infrastructure
Plain old computers… and portables

51 SOC Alternatives… not including C/C++ CAD Tools
The blank sheet of paper: FPGA
Auto design of a basic system: Tensilica
Standardized, committee designed components*, cells, and custom IP
Standard components including more application specific processors*, IP add-ons and custom
One chip does it all: SMOP
*Processors, Memory, Communication & Memory Links,

52 Xilinx 10Mg, 500Mt, .12 mic

53 Free 32 bit processor core

54 System-on-a-chip alternatives
FPGA: Sea of un-committed gate arrays (Xilinx, Altera)
Compile a system: Unique processor for every app (Tensilica)
Systolic | array: Many pipelined or parallel processors + custom
DSP | VLIW: Spec. purpose processors cores + custom (TI)
Pc & Mp. ASICS: Gen. Purpose cores. Specialized by I/O, etc. (IBM, Intel, Lucent)
Universal Micro: Multiprocessor array, programmable I/O (Cradle)

55 Cradle: Universal Microsystem trading Verilog & hardware for C/C++
Single part for all apps
App run time using FPGA & ROM
5 quad mPs at 3 Gflops/quad = 15 Gflops
Single shared memory space, caches
Programmable periphery including: 1 GB/s; 2.5 Gips PCI, 100 baseT, firewire
$4 per Gflops; 150 mW/Gflops
UMS : VLSI = microprocessor : special systems
Software : Hardware
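
The chip-level totals implied by these per-Gflops figures can be worked out directly. A sketch under one labeled assumption: the price is read as dollars per Gflops, by analogy with the power figure given per Gflops; the transcript itself is ambiguous on that unit.

```python
QUAD_PROCESSORS = 5
GFLOPS_PER_QUAD = 3.0
MILLIWATTS_PER_GFLOPS = 150.0
ASSUMED_DOLLARS_PER_GFLOPS = 4.0   # assumption: price is quoted per Gflops

# Peak throughput of the whole part.
total_gflops = QUAD_PROCESSORS * GFLOPS_PER_QUAD

# Power at peak, converted from milliwatts to watts.
power_watts = total_gflops * MILLIWATTS_PER_GFLOPS / 1000.0

# Implied part cost under the pricing assumption above.
implied_chip_cost = total_gflops * ASSUMED_DOLLARS_PER_GFLOPS

print(f"peak throughput : {total_gflops:.0f} Gflops")
print(f"power at peak   : {power_watts:.2f} W")
print(f"implied cost    : ${implied_chip_cost:.0f} (under the per-Gflops assumption)")
```

On these figures the part peaks at 15 Gflops for about 2.25 W, which fits the low-power, mobile theme of the talk.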

56 UMS Architecture
Memory bandwidth scales with processing
Scalable processing, software, I/O
Each app runs on its own pool of processors
Enables durable, portable intellectual property

57 Recapping the challenges
Scalable systems
– Latency in a distributed memory
– Structure of the system and nodes
– Network performance for OC192 (10 Gbps)
– Processing nodes and legacy software
Mobile systems… power, RF, voice, I/O
– Design time!

58 The End

