Nortel 20 April 1999 Scaleable Computing Jim Gray Microsoft Research

1 Outline
–The bandwidth revolution
–ScaleUp, ScaleOut
–TerraServer (Barclay, Slutz, Gray)

2 Gilder's Law: 3x bandwidth/year for 25 more years
Today:
–10 Gbps per channel
–4 channels per fiber: 40 Gbps
–32 fibers/bundle = 1.2 Tbps/bundle
In lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
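
The arithmetic on this slide can be checked with a short sketch. The channel, fiber, and bundle figures and the 3x/year rate are taken from the slide; the function names are illustrative, and treating the growth as smooth compounding is an assumption.

```python
import math

# Figures quoted on the slide.
GBPS_PER_CHANNEL = 10
CHANNELS_PER_FIBER = 4
FIBERS_PER_BUNDLE = 32
GROWTH_PER_YEAR = 3.0  # Gilder's Law as quoted: 3x bandwidth per year


def fiber_gbps():
    """Bandwidth of one fiber today, in Gbps."""
    return GBPS_PER_CHANNEL * CHANNELS_PER_FIBER


def bundle_gbps():
    """Bandwidth of one bundle today, in Gbps."""
    return fiber_gbps() * FIBERS_PER_BUNDLE


def years_to_reach(start_gbps, target_gbps, growth=GROWTH_PER_YEAR):
    """Years of compound growth needed to go from start to target."""
    return math.log(target_gbps / start_gbps) / math.log(growth)


if __name__ == "__main__":
    print(fiber_gbps())    # per-fiber Gbps
    print(bundle_gbps())   # per-bundle Gbps (the slide rounds this to 1.2 Tbps)
    # Years at 3x/year to move from today's fiber to the 25 Tbps theoretical limit.
    print(round(years_to_reach(fiber_gbps(), 25_000), 1))
```

Under these assumptions, the gap between today's 40 Gbps fiber and the 25 Tbps theoretical figure is a little under six years of 3x/year growth.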

3 Networking BIG!! Changes coming!
Technology
–1 GBps bus now
–1 Gbps links now
–1 Tbps links in 10 years
–Fast & cheap switches
Standard wires for interconnect
–processor-processor
–processor-device (=processor)
Deregulation WILL work someday
Software improving
–User-level Net-IO
Software Challenge
–reduce software tax on messages
–Today: 30 K ins + 10 ins/byte
–Goal: 1 K ins + .01 ins/byte
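
The "software tax" is a fixed cost per message plus a cost per byte. A minimal sketch of that cost model, using the slide's Today and Goal figures; the function names and the example message sizes are assumptions chosen for illustration.

```python
# (fixed instructions per message, instructions per byte), from the slide.
TODAY = (30_000, 10.0)
GOAL = (1_000, 0.01)


def message_instructions(size_bytes, fixed_ins, ins_per_byte):
    """Instructions spent by software to send one message of the given size."""
    return fixed_ins + ins_per_byte * size_bytes


def improvement(size_bytes):
    """How many times cheaper the goal is than today for one message size."""
    today = message_instructions(size_bytes, *TODAY)
    goal = message_instructions(size_bytes, *GOAL)
    return today / goal


if __name__ == "__main__":
    # Example sizes are hypothetical: a small request and an 8 KB page.
    for size in (100, 8192):
        print(size,
              message_instructions(size, *TODAY),
              message_instructions(size, *GOAL),
              round(improvement(size), 1))
```

For small messages the fixed cost dominates, so the gain is close to 30x; for large messages the per-byte term dominates and the gain grows toward 1,000x.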

4 Technology (hardware)
NOW
CPU: nearing 1 BIPS
–but CPI rising fast (2-10) so less than 100 mips
–1$/mips to 10$/mips
DRAM: 3 $/MB
DISK: 20 $/GB
TAPE:
–20 GB/tape, 6 MBps
–Lags disk
–2$/GB offline, 15$/GB nearline
BUS/SAN: 10/1 GBps
WAN: 0.1 Mbps
2003 Forecast (10x better)
CPU: 1 bips real (smp)
–0.1$ - 1$/mips
DRAM: 1 Gb chip
–0.1 $/MB
Disk:
–10 GB smart cards, 500 GB RAID5 packs (NTinside)
–3 $/GB
BUS/SAN: 100/10 GBps
WAN: 1 Gbps

5 Microsoft SAN Infrastructure: WinSock Direct Path
110 MBps (that's B not b)
10% cpu (not 200%)
Network faster than most IO attachments
[Figure: two protocol-stack diagrams. Labels: App, Winsock, MsAfd, U/K, AFD, TCP, IP, NDIS, MiniPort, HW, HwSPI, Switch, VIA]

6 SAN: Standard Interconnect
Gbps SAN: 110 MBps
PCI: 70 MBps
UW Scsi: 40 MBps
FW scsi: 20 MBps
scsi: 5 MBps
LAN faster than memory bus?
1 GBps links in lab.
100$ port cost soon
Port is computer
Winsock: 110 MBps (10% cpu utilization at each end)
RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?

7 Outline
–The bandwidth revolution
–ScaleUp, ScaleOut
–TerraServer (Barclay, Slutz, Gray)

8 Latency: How Far Away is the Data?
Registers: My Head (1 min)
On Chip Cache: This Room
On Board Cache: This Campus (10 min)
Memory: Sacramento (1.5 hr)
Disk: Pluto (2 Years)
Tape / Optical Robot: Andromeda (2,000 Years)

9 System On A Chip
Integrate Processing with memory on one chip
–chip is 75% memory now
–1MB cache >> 1960 supercomputers
–256 Mb memory chip is 32 MB!
–IRAM, CRAM, PIM,… projects abound
Integrate Networking with processing on one chip
–system bus is a kind of network
–ATM, FiberChannel, Ethernet,.. Logic on chip.
–Direct IO (no intermediate bus)
Functionally specialized cards shrink to a chip.

10 Scaleability: Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
–Personal System, Departmental Server, SMP Super Server
Grow Out with Cluster: Cluster has inexpensive parts
–Cluster of PCs

11 There'll be Billions Trillions Of Clients
Every device will be intelligent
Doors, rooms, cars…
Computing will be ubiquitous

12 Billions Of Clients Need Millions Of Servers
All clients networked to servers
–May be nomadic or on-demand
Fast clients want faster servers
Servers provide
–Shared Data
–Control
–Coordination
–Communication
[Figure labels: Clients (Mobile clients, Fixed clients), Servers (Server, Superserver), Trillions, Billions]

13 Thin Client Support (FAT SERVERS)
TSO comes to NT
lower per-client costs
[Figure: Windows NT Server Terminal Server serving a Dedicated Windows terminal, Existing Desktop PC, MS-DOS, UNIX, Mac clients, and Net PC]

14 FAT STORAGE SERVERS: Windows 2000 IntelliMirror
Extends CMU Coda File System ideas
Files and settings mirrored on client and server
Great for disconnected users
Facilitates roaming
Easy to replace PCs
Optimizes network performance

15 SMP -> nUMA: BIG FAT SERVERS
Directory based caching lets you build large SMPs
Every vendor building a HUGE SMP
–256 way
–3x slower remote memory
–8-level memory hierarchy: L1, L2 cache; DRAM; remote DRAM (3, 6, 9,…); Disk cache; Disk; Tape cache; Tape
Needs
–64 bit addressing
–nUMA sensitive OS (not clear who will do it)
Or Hypervisor
–like IBM LSF,
–Stanford Disco
Not certain what happens next

16 Thesis: Many little beat few big
Smoking, hairy golf ball
How to connect the many little parts?
How to program the many little parts?
Fault tolerance & Management?
[Figure labels: $1 million, $100 K, $10 K; Mainframe, Mini, Micro, Nano; 14", 9", 5.25", 3.5", 2.5", 1.8"; 1 M SPECmarks, 1TFLOP; 10^6 clocks to bulk ram; Event-horizon on chip; VM reincarnated; Multi-program cache, On-Chip SMP; Pico Processor; 10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; 1 MB, 100 MB, 10 GB, 1 TB, 1 MM TB]

17 4B PCs (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G): The Bricks of Cyberspace
Cost 1,000 $
Come with
–NT
–DBMS
–High speed Net
–System management
–GUI / OOUI
–Tools
Compatible with everyone else
CyberBricks

18 Super Server: 4T Machine
Array of 1,000 4B machines
–1 Bips processors
–1 BB DRAM
–10 BB disks
–1 Bbps comm lines
–1 TB tape robot
A few megabucks
Challenge:
–Manageability
–Programmability
–Security
–Availability
–Scaleability
–Affordability
As easy as a single system
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
[Figure: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc]

19 Scale OUT: Clusters Have Advantages
Fault tolerance:
–Spare modules mask failures
Modular growth without limits
–Grow by adding small modules
Parallel data search
–Use multiple processors and disks
Clients and servers made from the same stuff
–Inexpensive: built with commodity CyberBricks

20 IBM DB2 + CICS Mainframe: 65 tps
IBM 4391
Simulated network of 800 clients
2m$ computer
Staff of 6 to do benchmark
2 x 3725 network controllers
16 GB disk farm: 4 x 8 x .5GB
Refrigerator-sized CPU

21 Tandem: 256 tps
14 M$ computer (Tandem)
A dozen people (1.8M$/y)
False floor, 2 rooms of machines
Simulate 25,600 clients
32 node processor array
40 GB disk array (80 drives)
Staff: OS expert, Network expert, DB expert, Performance expert, Hardware experts, Admin expert, Auditor, Manager

22 Nine years later: 1 Person and 1 box = 1250 tps
1 Breadbox ~ 5x 1987 machine room
23 GB is hand-held
One person does all the work
Cost/tps is 100,000x less: 5 micro dollars per transaction
Hardware: 4x200 Mhz cpu, 1/2 GB DRAM, 12 x 4GB disk, 3 x 7 x 4GB disk arrays
One person is Hardware expert, OS expert, Net expert, DB expert, App expert

23 What Happened? Where did the 100,000x come from?
Moore's law: 100X (at most)
Software improvements: 10X (at most)
Commodity Pricing: 100X (at least)
Total: 100,000X
100x from commodity
–DBMS was 100K$ to start: now 1k$ to start
–IBM 390 MIPS is 7.5K$ today
–Intel MIPS is 10$ today
–Commodity disk is 50$/GB vs 1,500$/GB
–...
[Figure: price vs. time for mainframe, mini, micro]
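
The 100,000x total is the product of the three factors, and the commodity examples can be expressed as price ratios. A small sketch of that arithmetic; every number comes from the slide, while the dictionary keys and variable names are illustrative.

```python
import math

# Improvement factors quoted on the slide.
factors = {
    "Moore's law": 100,
    "Software improvements": 10,
    "Commodity pricing": 100,
}
total = math.prod(factors.values())

# (proprietary price, commodity price) in dollars, from the slide's examples.
commodity_examples = {
    "DBMS entry price": (100_000, 1_000),
    "Price per MIPS (IBM 390 vs Intel)": (7_500, 10),
    "Disk price per GB": (1_500, 50),
}
ratios = {name: old / new for name, (old, new) in commodity_examples.items()}

if __name__ == "__main__":
    print(total)
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.0f}x")
```

The individual commodity ratios range from 30x (disk) to 750x (MIPS), which is consistent with the slide calling 100x for commodity pricing a rough figure.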

24 Computers shrink to a point
Disks 100x in 10 years: 2 TB 3.5" drive
Shrink to 1" is 200GB
Disk is super computer!
This is already true of printers and terminals
[Figure scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

25 Tera Byte Backplane: All Device Controllers will be Cray 1s
TODAY
–Disk controller is 10 mips risc engine with 2MB DRAM
–NIC is similar power
SOON
–Will become 100 mips systems with 100 MB DRAM.
They are nodes in a federation (can run Oracle on NT in disk controller).
Advantages
–Uniform programming model
–Great tools
–Security
–economics (cyberbricks)
–Move computation to data (minimize traffic)
[Figure: Central Processor & Memory connected to device controllers]

26 It's Already True of Printers: Peripheral = CyberBrick
You buy a printer
You get
–several network interfaces
–a Postscript engine: cpu, memory, software, a spooler (soon)
–and… a print engine.

27 Functionally Specialized Cards
Storage, Network, Display
Each card: P mips processor, M MB DRAM, ASIC
Today: P = 50 mips, M = 2 MB
In a few years: P = 200 mips, M = 64 MB

28 Implications
Conventional
–Offload device handling to NIC/HBA
–higher level protocols: I2O, NASD, VIA…
–SMP and Cluster parallelism is important.
Radical
–Move app to NIC/device controller
–higher-higher level protocols: DCOM.
–Cluster parallelism is VERY important.
[Figure: Central Processor & Memory]

29 How Do They Talk to Each Other?
Each node has an OS
Each node has local resources: A federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other
–DCOM? IIOP? RMI?
–One or all of the above.
Huge leverage in high-level interfaces.
Same old distributed system story.
[Figure: two nodes joined by Wire(s); each stack: Applications; RPC?, streams, datagrams; VIAL/VIPL]

30 Disk = Node
has magnetic storage (100 GB?)
has processor & DRAM
has SAN attachment
has execution environment
[Figure: software stack: Applications; Services, DBMS; File System, RPC, ...; SAN driver, Disk driver; OS Kernel]

31 Scaleability: Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
–Personal System, Departmental Server, SMP Super Server
Grow Out with Cluster: Cluster has inexpensive parts
–Cluster of PCs

32 HotMail: ~300 Computers, FreeBSD and Solaris

33 ~150 nodes

34 Other Clusters
16-node Cluster
–64 cpus
–2 TB of disk
–Decision support
45-node Compaq Cluster
–140 cpus
–14 GB DRAM
–4 TB RAID disk
–OLTP (Debit Credit)
–1 B tpd (14 k tps)
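
The slide pairs 1 B transactions per day with 14 k tps. A quick conversion shows how the two relate; the helper names are illustrative, and reading "1 B" as a rounded figure is an interpretation, not something the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tps_from_tpd(transactions_per_day):
    """Average transactions per second implied by a daily total."""
    return transactions_per_day / SECONDS_PER_DAY


def tpd_from_tps(transactions_per_second):
    """Daily total if a per-second rate is sustained for 24 hours."""
    return transactions_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(tps_from_tpd(1_000_000_000)))  # average tps for 1 B tpd
    print(tpd_from_tps(14_000))                # tpd if 14 k tps is sustained
```

A flat 1 B per day averages about 11.6 k tps, while 14 k tps held for a full day is about 1.2 B transactions.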

35 Berkeley NOW (network of workstations) Project
105 nodes
–Sun UltraSparc 170, 128 MB, 2x2GB disk
–Myrinet interconnect (2x160MBps per node)
–SBus (30MBps) limited
GLUNIX layer above Solaris
Inktomi (HotBot search)
NAS Parallel Benchmarks
Crypto cracker
Sort 9 GB per second

36 NCSA Super Cluster
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
512 Pentium II cpus, 2,096 disks, SAN
Compaq + HP + Myricom + WindowsNT
A Super Computer for 3M$
Classic Fortran/MPI programming
DCOM programming model

37 Outline
–The bandwidth revolution
–ScaleUp, ScaleOut
–TerraServer (Barclay, Slutz, Gray): A scaleup example

38 Some Tera-Byte Databases
The Web: 1 TB of HTML
TerraServer: 1 TB of images
Several other 1 TB (file) servers
Hotmail: 7 TB of
Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
EOS/DIS (picture of planet each week)
–15 PB by 2007
Federal Clearing house: images of checks
–15 PB by 2006 (7 year history)
Nuclear Stockpile Stewardship Program
–10 Exabytes (???!!)
[Figure scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

39 Info Capture
You can record everything you see or hear or read.
What would you do with it?
How would you organize & analyze it?
Video: 8 PB per lifetime (10GBph)
Audio: 30 TB (10KBps)
Read or write: 8 GB (words)
See: / ksg.html
[Figure scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta; labels: A letter, A novel, A Movie, Library of Congress (text), LoC (image), All Disks, All Tapes]

40 Michael Lesk's Points
Soon everything can be recorded and kept
Most data will never be seen by humans
Precious Resource: Human attention
Auto-Summarization and Auto-Search will be a key enabling technology.

41 The TerraServer

42 Database & application UI
Concept: User navigates an almost seamless image of earth
Coverage: Range from 70ºN to 70ºS; today: 35% U.S., 1% outside U.S.
Source Imagery:
–4 TB 1 sq meter/pixel Aerial (USGS - 60,000 46Mb B&W- 151Mb Color IR files)
–1 TB 1.56 meter/pixel Satellite (Spin Mb B&W)
Display Imagery: 200x200 pixel images, subsample to build image pyramid
Nav Tools:
–1.5 m place names
–Click-on Coverage map
–Expedia & Virtual Globe map
–Pick of the week
[Figure: 1.6 x 1.6 km city view; .8 x .8 km 8m thumbnail; .4 x .4 km browse; 200 x 200 m tile]

43 Image Data
USGS DOQ: 4 TB, 6 TB Coming
DRG: 50,000 Topo Maps, adding now
Spin-2: 1 TB WorldWide, New Data Coming

44 Software Architecture
[Figure: architecture diagram. Web Client: IE 3…5, Netscape 3…4, HTML, Java Viewer. The Internet. Terra-Server Web Site: Internet Information Server 4.0, Terra-Server Active Server Pages, Active Data Object, ODBC, Image Delivery Application. Terra-Server DB: SQL Server 7.0, Terra-Server Stored Procedures. Image Commerce Site(s): Microsoft Site Server EE 3.0, Active Server Pages, SQL Server, SPIN-2/USGS Store. Other labels: 13, (14 Img), (8 Place)]

45 How Images are Found
Name Search: 40%
Expedia Map: 22%
Coverage Map: 19%
Famous Places: 18%
Geo Coordinate: 1%

46 TerraServer: Lots of Web Hits
Today:
–1.7 billion web hits
–1 TB, largest SQL DB on the Web
–100 qps average, 1,000 Qps peak
–1.5 B SQL queries so far
Summary         Total    Average   Max
Unique Users    17 M     69 k      150 k
Sessions        24 M     94 k      172 k
Hits            1.7 B    6.8 M     29 M
Page Views      274 M    1.1 M     6.6 M
DB Queries      1.5 B    5.8 M     18 M
Image Xfers     1.3 B    5.0 M     15 M
As of Feb 28, 1999

47 Image Data & Meta Data
Lookup by UGrid or ZGrid ID plus resolution
Lookups are fast. Indices are in DRAM (auto-magically by SQL)
SQL manages all the tiles and indices
Images are brought in on demand
Gazetteer: Index on
–image, place, type
–image, state, type
–image, state, country, type
–image, place, state, type
–image, place, country, type
–all lookups are fast
[Figure: Logical Schema. Tables: Country Name, State Name, Place Name, PlaceType, Feature Type, Where Am I, Img Meta, Tile Meta, Jump Img, Browse Img, Tile Img, Theme Meta Information, Spin Frame Meta, Thumb Img]

48 Image Load and Update
[Figure: load pipeline. Labels: DLT Tape, tar, Metadata, Load DB, Staging Disk, Cut & Load Scheduling System (Active Server Pages), Image Cutter, Merge, Dither, Image Pyramid From base, JPEG tiles, TerraLoader, ODBC Tx, TerraServer SQL DBMS]

49 TerraServer Administrator Web Site
Accessible by Microsoft, SPIN-2, and USGS
Web browser forms to:
–Edit Famous Places list
–Modify Image Status fields
–Define new TerraServer Administrators

50 Load & Backup & Recovery
Backup and Recovery
–Using Legato Networker integrated with SQL Backup/Restore Utility
–Fast, incremental, differential, online
Restore
–Fast, incremental (file oriented), not online.
SQL Server Enterprise Manager
–DBA Maintenance
–SQL Performance Monitor

51 Site Configuration
9710 TimberWolf
Enterprise Storage Array
9 HSZ70 Ultra-SCSI Dual redundant Controllers
GB Seagate Disks
Compaq x200mhz Web Servers (shown six times)
To the Web

52 The Microsoft TerraServer Hardware
Compaq AlphaServer 8400
8x400Mhz Alpha cpus
10 GB DRAM
GB StorageWorks Disks
–3 TB raw, 2.4 TB of RAID5
STK 9710 tape robot (4 TB)
WindowsNT 4 EE, SQL Server 7.0

53 File System Config
Use StorageWorks to form 28 RAID5 sets
Each raid set has 11 disks (16 spare drives)
Use NTFS to form 4 595GB NT volumes (F:, G:, H:, I:)
Each striped over 7 Raid sets on 7 controllers
Create 26 20,000MB files on F:, 27 on G:
DB is File Group of 53 files (1.011TB)
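
The configuration numbers on this slide can be cross-checked with a few lines of arithmetic. All inputs come from the slide; the variable names are illustrative, and the use of binary units (1 TB = 1024 x 1024 MB) is an assumption that happens to reproduce the quoted 1.011 TB.

```python
# Figures quoted on the slide.
RAID_SETS = 28
DISKS_PER_SET = 11
SPARE_DRIVES = 16
VOLUMES = 4
SETS_PER_VOLUME = 7
VOLUME_GB = 595
FILES = 26 + 27          # files on F: plus files on G:
FILE_MB = 20_000

total_disks = RAID_SETS * DISKS_PER_SET + SPARE_DRIVES
sets_in_volumes = VOLUMES * SETS_PER_VOLUME
total_volume_gb = VOLUMES * VOLUME_GB

# Assumption: binary units, 1 TB = 1024 * 1024 MB.
file_group_tb = FILES * FILE_MB / (1024 * 1024)

if __name__ == "__main__":
    print(total_disks)              # disks including spares
    print(sets_in_volumes)          # should equal RAID_SETS
    print(total_volume_gb)          # usable GB across the four volumes
    print(round(file_group_tb, 3))  # size of the 53-file database file group
```

The four volumes account for all 28 RAID sets, and their combined 2,380 GB lines up with the roughly 2.4 TB of RAID5 capacity quoted on the hardware slide.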

54 SQL 7 TerraServer Availability
Operating for 9 months: 6400 hrs
Unscheduled outage: 36.5 minutes: % scheduled up
Scheduled outage: 60 minutes
Availability: 99.96% overall up
No NT failures (ever)
One SQL7 Beta2 bug
No failures in July, Aug, Oct, Dec, Jan, Feb, Mar

55 Things we did right...
Use a database to store images:
–Simplify management
–Can dynamically load data into tables while viewing application is active
Simple X, Y Z-Grid navigation system
Used ImgStatus to control logical presence of the image in the app
Stitching tiles together from multiple input images to form seamless mosaic
Offering two forms of seamless -- time based (SPIN-2) and theme based (DOQ)

56 TS 3: Things are changing...
Square Tiles, power of 2 size (200x200)
Power of 2 zoom levels (2:1, 4:1, 8:1, etc.) so uniform tile size on each zoom (variable ground size per tile)
Indexing system independent of tile size
Digital Raster Graphics (Topo maps)
Layered Maps (Topo merge with DOQ)
Integrate with other applications and services
Later:
–Digital Elevation Models (DEMs)
–Other foreign data sources (EU, etc.)
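
A minimal sketch of power-of-2 zoom levels with a fixed tile size. The 200-pixel tile and 1 meter/pixel base resolution are the figures the deck gives for TerraServer imagery; the function names are hypothetical, and averaging each 2x2 block is one assumed way to subsample, not necessarily what the TerraServer image cutter does.

```python
def ground_size_m(level, tile_px=200, base_m_per_px=1.0):
    """Ground distance covered by one tile edge at a zoom level (0 = full resolution)."""
    return tile_px * base_m_per_px * (2 ** level)


def subsample_2x(image):
    """Halve resolution by averaging each 2x2 block; assumes even dimensions."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def build_pyramid(image, levels):
    """Return [base, 2:1, 4:1, ...] by repeated 2x subsampling."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(subsample_2x(pyramid[-1]))
    return pyramid


if __name__ == "__main__":
    for level in range(4):
        print(level, ground_size_m(level))  # tile edge in meters per zoom level
    tiny = [[0, 2, 4, 6], [2, 4, 6, 8], [0, 0, 8, 8], [4, 4, 0, 0]]
    for layer in build_pyramid(tiny, 2):
        print(layer)
```

With these defaults the tile edge doubles from 200 m through 400 m and 800 m to 1.6 km, while the pixel size of every tile stays the same.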

57 What TerraServer Shows
Can serve huge databases on Internet for about a penny a page view, mostly phone bill (!).
Advertising pays more than a penny a page.
Commodity tools do scale fairly far.
A few people (3 developers, 1 operator) using power tools can build an impressive web site

58 Thank You!
SPIN-2
Tom Barclay did most of this app, Slutz and Gray helped.

60 end

61 Windows NT Versus UNIX Best Results on an SMP: SemiLog plot shows 3x (2 year) lead by UNIX see

62 TPC C Improvements (MS SQL)
250%/year on Price, 100%/year performance
40% hardware, 100% software, 100% PC Technology

63 Price Breakdown (6 months old)

64 (dis) Economy Of Scale

