Scaleable Computing Jim Gray Microsoft Research

Presentation on theme: "Scaleable Computing Jim Gray Microsoft Research"— Presentation transcript:

1 Scaleable Computing Jim Gray Microsoft Research Gray@Microsoft
Scaleable Computing Jim Gray Microsoft Research Outline The bandwidth revolution ScaleUp, ScaleOut TerraServer (Barclay, Slutz, Gray) Nortel 20 April 1999

2 Gilder’s Law: 3x bandwidth/year for 25 more years
Today: 10 Gbps per channel
4 channels per fiber: 40 Gbps
32 fibers/bundle = 1.2 Tbps/bundle
In lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
1 fiber = 25 Tbps
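The arithmetic on this slide can be checked with a short sketch. The constants come from the slide; the function names are hypothetical, and treating Gilder's Law as a clean 3x compounding per year is a simplifying assumption.

```python
import math

# Figures as stated on the slide.
GBPS_PER_CHANNEL = 10      # 10 Gbps per channel
CHANNELS_PER_FIBER = 4     # 4 channels per fiber
FIBERS_PER_BUNDLE = 32     # 32 fibers per bundle
GROWTH_PER_YEAR = 3.0      # Gilder's Law: 3x bandwidth per year


def fiber_gbps():
    """Bandwidth of one fiber in Gbps."""
    return GBPS_PER_CHANNEL * CHANNELS_PER_FIBER


def bundle_tbps():
    """Bandwidth of one bundle in Tbps (1 Tbps = 1000 Gbps)."""
    return fiber_gbps() * FIBERS_PER_BUNDLE / 1000.0


def years_to_reach(target_gbps, start_gbps, growth=GROWTH_PER_YEAR):
    """Years of compounding growth needed to go from start to target."""
    return math.log(target_gbps / start_gbps) / math.log(growth)


if __name__ == "__main__":
    print(fiber_gbps())    # per-fiber bandwidth, Gbps
    print(bundle_tbps())   # per-bundle bandwidth, Tbps (the slide rounds this down to 1.2)
    # Years for one fiber to grow from today's rate to the theoretical 25 Tbps.
    print(round(years_to_reach(25_000, fiber_gbps()), 1))
```

At 3x per year, the gap between today's 40 Gbps fiber and the theoretical 25 Tbps closes in under six years, which is why the slide treats the theoretical limit as a near-term planning figure.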

3 Networking BIG!! Changes coming!
Technology
  1 GBps bus “now”
  1 Gbps links “now”
  1 Tbps links in 10 years
  Fast & cheap switches
Standard wires for interconnect
  processor-processor
  processor-device (=processor)
Deregulation WILL work someday
Software improving
  User-level Net-IO
Software Challenge: reduce software tax on messages
  Today: 30 K ins + 10 ins/byte
  Goal: 1 K ins + ... ins/byte
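The "software tax" is a linear cost model: a fixed instruction count per message plus a per-byte cost. A minimal sketch follows, using today's figures from the slide. The per-byte figure for the goal is not legible in this transcript, so it is left as a required parameter rather than guessed; the function names are hypothetical.

```python
# Today's figures from the slide: 30 K instructions + 10 instructions/byte.
TODAY_FIXED_INS = 30_000
TODAY_INS_PER_BYTE = 10
# Goal from the slide: 1 K instructions fixed cost (per-byte goal not given here).
GOAL_FIXED_INS = 1_000


def message_instructions(n_bytes, fixed_ins, ins_per_byte):
    """Instructions spent by software to send one message of n_bytes."""
    return fixed_ins + ins_per_byte * n_bytes


def today_cost(n_bytes):
    """Cost of one message under today's figures."""
    return message_instructions(n_bytes, TODAY_FIXED_INS, TODAY_INS_PER_BYTE)


def fixed_share_today(n_bytes):
    """Fraction of today's cost that is fixed overhead rather than per-byte work."""
    return TODAY_FIXED_INS / today_cost(n_bytes)


if __name__ == "__main__":
    for size in (100, 1_000, 8_192):
        print(size, today_cost(size), round(fixed_share_today(size), 2))
    # The fixed-cost goal alone is a 30x reduction for small messages.
    print(TODAY_FIXED_INS // GOAL_FIXED_INS)
```

For small messages the fixed cost dominates, so cutting it from 30 K to 1 K instructions matters more than the per-byte term.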

4 Technology (hardware)
NOW
  CPU: nearing 1 BIPS, but CPI rising fast (2-10), so less than 100 mips; 1$/mips to 10$/mips
  DRAM: 3 $/MB
  DISK: 20 $/GB
  TAPE: 20 GB/tape, 6 MBps; lags disk; 2$/GB offline, 15$/GB nearline
  BUS/SAN: 10/1 GBps
  WAN: 0.1 Mbps
2003 Forecast (10x better)
  CPU: 1 bips real (smp); 0.1$ - 1$/mips
  DRAM: 1 Gb chip; 0.1 $/MB
  Disk: 10 GB smart cards; 500 GB RAID5 packs (NTinside); 3 $/GB
  BUS/SAN: 100/10 GBps
  WAN: 1 Gbps

5 Microsoft SAN Infrastructure WinSock Direct Path
110 MBps (that’s B not b)
10% cpu (not 200%)
Network faster than most IO attachments
[Diagram: two endpoint stacks joined through the network. Each stack: App, Winsock, Switch; then MsAfd, HwSPI, VIA in user mode (U), or AFD, TCP, IP, NDIS, MiniPort in kernel mode (K); then HW.]

6 SAN: Standard Interconnect
LAN faster than memory bus?
1 GBps links in lab.
100$ port cost soon
Port is computer
Winsock: 110 MBps (10% cpu utilization at each end)
Bandwidths:
  Gbps SAN: 110 MBps
  PCI: 70 MBps
  UW Scsi: 40 MBps
  FW scsi: 20 MBps
  scsi: 5 MBps
RIP: FDDI, ATM, SCI, SCSI, FC, ?

7 Outline The bandwidth revolution ScaleUp, ScaleOut
TerraServer (Barclay, Slutz, Gray)

8 Latency: How Far Away is the Data?
Level                 Clock ticks   How far away    Time
Registers             1             My Head         1 min
On Chip Cache         2             This Room
On Board Cache        10            This Campus     10 min
Memory                100           Sacramento      1.5 hr
Disk                  10^6          Pluto           2 Years
Tape/Optical Robot    10^9          Andromeda       2,000 Years
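The analogy works by scaling one clock tick to one minute of human time. The sketch below assumes the numbers on the slide (1, 2, 10, 100, 10^6, 10^9) are access costs in clock ticks and that a year has 365 days; the names are hypothetical.

```python
# Storage levels and their access cost in clock ticks, as read from the slide.
LEVELS = [
    ("Registers", 1),
    ("On Chip Cache", 2),
    ("On Board Cache", 10),
    ("Memory", 100),
    ("Disk", 10**6),
    ("Tape/Optical Robot", 10**9),
]

MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365


def ticks_to_years(ticks, minutes_per_tick=1):
    """Convert clock ticks to years at the chosen human time scale."""
    return ticks * minutes_per_tick / MINUTES_PER_YEAR


def describe(ticks):
    """Render a tick count as human time, with 1 tick = 1 minute."""
    minutes = ticks
    if minutes < MINUTES_PER_HOUR:
        return f"{minutes} min"
    if minutes < MINUTES_PER_YEAR:
        return f"{minutes / MINUTES_PER_HOUR:.1f} hr"
    return f"{ticks_to_years(ticks):,.0f} years"


if __name__ == "__main__":
    for name, ticks in LEVELS:
        print(f"{name:20s} {ticks:>13,d} ticks  ~ {describe(ticks)}")
```

A million minutes is just under two years and a billion minutes is about nineteen centuries, which matches the "2 Years" and "2,000 Years" labels on the slide.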

9 System On A Chip Integrate Processing with memory on one chip
  chip is 75% memory now
  1MB cache >> 1960 supercomputers
  256 Mb memory chip is 32 MB!
  IRAM, CRAM, PIM,… projects abound
Integrate Networking with processing on one chip
  system bus is a kind of network
  ATM, FiberChannel, Ethernet,.. Logic on chip.
  Direct IO (no intermediate bus)
Functionally specialized cards shrink to a chip.

10 Scaleability Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: Cluster has inexpensive parts
[Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]

11 There'll be Billions Trillions Of Clients
Every device will be “intelligent”
Doors, rooms, cars…
Computing will be ubiquitous

12 Billions Of Clients Need Millions Of Servers
All clients networked to servers
  May be nomadic or on-demand
Fast clients want faster servers
Servers provide
  Shared Data
  Control
  Coordination
  Communication
[Diagram labels: Clients (Mobile clients, Fixed clients); Servers (Server, Super server)]

13 Thin Client Support (FAT SERVERS ) TSO comes to NT lower per-client costs
[Diagram: Windows NT Server Terminal Server serving a Net PC; an existing desktop PC; MS-DOS, UNIX, Mac clients; a dedicated Windows terminal]

14 Windows 2000 IntelliMirror™
Extends CMU Coda File System ideas
Files and settings mirrored on client and server
Great for disconnected users
Facilitates roaming
Easy to replace PCs
Optimizes network performance
FAT STORAGE SERVERS

15 Directory based caching lets you build large SMPs
Every vendor building a HUGE SMP
  256 way
  3x slower remote memory
  8-level memory hierarchy
    L1, L2 cache
    DRAM
    remote DRAM (3, 6, 9,…)
    Disk cache
    Disk
    Tape cache
    Tape
Needs
  64 bit addressing
  nUMA sensitive OS (not clear who will do it)
  Or Hypervisor like IBM LSF, Stanford Disco
Not certain what happens next

16 Thesis Many little beat few big
Smoking, hairy golf ball
  1 M SPECmarks, 1 TFLOP
  10^6 clocks to bulk ram
  Event-horizon on chip
  VM reincarnated
  Multi-program cache, On-Chip SMP
How to connect the many little parts?
How to program the many little parts?
Fault tolerance & Management?
[Figure labels: prices $1 million, $100 K, $10 K; processors Mainframe, Mini, Micro, Nano, Pico; 1 MM 3; storage 10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; capacities 1 MB, 10 MB, 10 GB, 1 TB, 100 TB; disk sizes 14", 9", 5.25", 3.5", 2.5", 1.8"]

17 4 B PC’s (1 Bips, .1GB dram, 10 GB disk 1 Gbps Net, B=G) The Bricks of Cyberspace
Cost 1,000 $
Come with
  NT
  DBMS
  High speed Net
  System management
  GUI / OOUI
  Tools
Compatible with everyone else
CyberBricks

18 Super Server: 4T Machine
Array of 1,000 4B machines
  1 bips processors
  1 BB DRAM
  10 BB disks
  1 Bbps comm lines
  1 TB tape robot
A few megabucks
Challenge:
  Manageability
  Programmability
  Security
  Availability
  Scaleability
  Affordability
As easy as a single system
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
[Diagram: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc]

19 Scale OUT Clusters Have Advantages
Fault tolerance: Spare modules mask failures
Modular growth without limits: Grow by adding small modules
Parallel data search: Use multiple processors and disks
Clients and servers made from the same stuff
Inexpensive: built with commodity CyberBricks

20 1988: IBM DB2 + CICS Mainframe 65 tps
Simulated network of 800 clients
2m$ computer
Staff of 6 to do benchmark
2 x 3725 network controllers
Refrigerator-sized CPU
16 GB disk farm: 4 x 8 x .5GB

21 1987: Tandem Mini @ 256 tps 14 M$ computer (Tandem)
A dozen people (1.8M$/y)
False floor, 2 rooms of machines
32 node processor array
Simulate 25,600 clients
40 GB disk array (80 drives)
Staff: Admin expert, Performance expert, Hardware experts, Network expert, Auditor, Manager, DB expert, OS expert

22 1997: 9 years later 1 Person and 1 box = 1250 tps
1 Breadbox ~ 5x 1987 machine room
23 GB is hand-held
One person does all the work
Cost/tps is 100,000x less
5 micro dollars per transaction
Hardware: 4x200 Mhz cpu, 1/2 GB DRAM, 12 x 4GB disk, 3 x 7 x 4GB disk arrays
One person is the Hardware expert, OS expert, Net expert, DB expert, App expert

23 What Happened? Where did the 100,000x come from?
Moore’s law: 100X (at most)
Software improvements: 10X (at most)
Commodity Pricing: 100X (at least)
Total: 100,000X
100x from commodity
  DBMS was 100K$ to start: now 1k$ to start
  IBM 390 MIPS is 7.5K$ today
  Intel MIPS is 10$ today
  Commodity disk is 50$/GB vs 1,500$/GB
  ...
[Chart: price vs. time for mainframe, mini, micro]

24 Computers shrink to a point
Disks 100x in 10 years: 2 TB 3.5” drive
Shrink to 1” is 200GB
Disk is super computer!
This is already true of printers and “terminals”
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

25 All Device Controllers will be Cray 1’s
TODAY
  Disk controller is 10 mips risc engine with 2MB DRAM
  NIC is similar power
SOON
  Will become 100 mips systems with 100 MB DRAM.
They are nodes in a federation (can run Oracle on NT in disk controller).
Advantages
  Uniform programming model
  Great tools
  Security
  economics (cyberbricks)
  Move computation to data (minimize traffic)
[Diagram: Central Processor & Memory, Tera Byte Backplane]

26 It’s Already True of Printers Peripheral = CyberBrick
You buy a printer
You get
  several network interfaces
  a Postscript engine: cpu, memory, software, a spooler (soon)
  and… a print engine.

27 Functionally Specialized Cards
Cards: Storage, Network, Display
Each card: P mips processor, M MB DRAM, ASIC
Today: P = 50 mips, M = 2 MB
In a few years: P = 200 mips, M = 64 MB

28 Implications
Conventional
  Offload device handling to NIC/HBA
  higher level protocols: I2O, NASD, VIA…
  SMP and Cluster parallelism is important.
Radical
  Move app to NIC/device controller
  higher-higher level protocols: DCOM.
  Cluster parallelism is VERY important.
[Diagram: Central Processor & Memory]

29 How Do They Talk to Each Other?
Each node has an OS
Each node has local resources: A federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other
  DCOM? IIOP? RMI?
  One or all of the above.
Huge leverage in high-level interfaces.
Same old distributed system story.
[Diagram: two nodes, each with Applications over RPC, streams, datagrams, and "?", over VIAL/VIPL, joined by Wire(s)]

30 Disk = Node
has magnetic storage (100 GB?)
has processor & DRAM
has SAN attachment
has execution environment
[Diagram: software stack of Applications; Services, DBMS, RPC, ...; File System, SAN driver, Disk driver; OS Kernel]

31 Scaleability Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: Cluster has inexpensive parts
[Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]

32 HotMail: ~300 Computers FreeBSD and Solaris

33 ~150 nodes

34 Other Clusters
16-node Cluster
  64 cpus
  2 TB of disk
  Decision support
45-node Compaq Cluster
  140 cpus
  14 GB DRAM
  4 TB RAID disk
  OLTP (Debit Credit)
  1 B tpd (14 k tps)

35 Berkeley NOW (network of workstations) Project
105 nodes
  Sun UltraSparc 170, 128 MB, 2x2GB disk
  Myrinet interconnect (2x160MBps per node)
  SBus (30MBps) limited
GLUNIX layer above Solaris
Inktomi (HotBot search)
NAS Parallel Benchmarks
Crypto cracker
Sort 9 GB per second

36 NCSA Super Cluster
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
512 Pentium II cpus, 2,096 disks, SAN
Compaq + HP + Myricom + WindowsNT
A Super Computer for 3M$
Classic Fortran/MPI programming
DCOM programming model

37 Outline The bandwidth revolution ScaleUp, ScaleOut
TerraServer (Barclay, Slutz, Gray) A scaleup example

38 Some Tera-Byte Databases
The Web: 1 TB of HTML
TerraServer: 1 TB of images
Several other 1 TB (file) servers
Hotmail: 7 TB
Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
EOS/DIS (picture of planet each week): 15 PB by 2007
Federal Clearing house: images of checks, 15 PB by 2006 (7 year history)
Nuclear Stockpile Stewardship Program: 10 Exabytes (???!!)
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

39 Info Capture
You can record everything you see or hear or read.
What would you do with it?
How would you organize & analyze it?
  Video: 8 PB per lifetime (10 GBph)
  Audio: 30 TB (10 KBps)
  Read or write: 8 GB (words)
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers: A letter, A novel, A Movie, Library of Congress (text), LoC (image), All Disks, All Tapes]
See:
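The lifetime figures follow from the stated data rates multiplied out over a human lifetime. The sketch below assumes a 90-year lifetime, continuous recording, and decimal units (1 GB = 10^9 bytes); none of these assumptions is stated on the slide, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600

# Assumption: lifetime length is not given on the slide.
LIFETIME_YEARS = 90


def video_bytes(years, gb_per_hour=10):
    """Bytes of video recorded continuously at the slide's 10 GB per hour."""
    return gb_per_hour * 10**9 * HOURS_PER_YEAR * years


def audio_bytes(years, kb_per_second=10):
    """Bytes of audio recorded continuously at the slide's 10 KB per second."""
    return kb_per_second * 10**3 * SECONDS_PER_YEAR * years


if __name__ == "__main__":
    print(f"video: {video_bytes(LIFETIME_YEARS) / 10**15:.1f} PB")
    print(f"audio: {audio_bytes(LIFETIME_YEARS) / 10**12:.1f} TB")
```

Under these assumptions the results land close to the slide's 8 PB of video and 30 TB of audio.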

40 Michael Lesk’s Points
Soon everything can be recorded and kept
Most data will never be seen by humans
Precious Resource: Human attention
Auto-Summarization Auto-Search will be a key enabling technology.

41 The TerraServer

42 Database & application UI
Concept: User navigates an ‘almost seamless’ image of earth
Coverage: Range from 70ºN to 70ºS; today: 35% U.S., 1% outside U.S.
Source Imagery:
  4 TB 1 sq meter/pixel Aerial (USGS - 60,000 46Mb B&W- 151Mb Color IR files)
  1 TB 1.56 meter/pixel Satellite (Spin Mb B&W)
Display Imagery: 200x200 pixel images, subsample to build image pyramid
Nav Tools:
  1.5 m place names
  “Click-on” Coverage map
  Expedia & Virtual Globe map
  Pick of the week
Views: 1.6 x 1.6 km “city view”; .8 x .8 km 8m thumbnail; .4 x .4 km browse; 200 x 200 m tile
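An image pyramid keeps the tile size in pixels fixed and halves the resolution at each level, so each tile covers twice the ground distance per side. The sketch below is a minimal illustration, not the TerraServer code: the 200-pixel tile and 1 m/pixel base come from the slide, while the 2x2 averaging used for subsampling and all function names are assumptions.

```python
def ground_size_m(tile_px, base_m_per_px, level):
    """Ground distance covered by one tile side at a pyramid level.

    Level 0 is the base imagery; each higher level doubles meters per pixel.
    """
    return tile_px * base_m_per_px * 2**level


def subsample_2to1(img):
    """Halve an image in each dimension by averaging 2x2 pixel blocks.

    img is a list of equal-length rows of integer gray values, with an
    even number of rows and columns. Averaging is an assumed filter.
    """
    out = []
    for r in range(0, len(img) - 1, 2):
        row = []
        for c in range(0, len(img[r]) - 1, 2):
            total = img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]
            row.append(total // 4)
        out.append(row)
    return out


if __name__ == "__main__":
    # 200-pixel tiles over 1 m/pixel imagery: tile, browse, thumbnail, city view.
    print([ground_size_m(200, 1, level) for level in range(4)])
    print(subsample_2to1([[0, 4, 8, 8], [8, 12, 8, 8]]))
```

With a 200-pixel tile over 1 m/pixel imagery, levels 0 to 3 cover 200 m, 400 m, 800 m and 1,600 m per side, which lines up with the tile, browse, thumbnail and city-view sizes listed on the slide.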

43 Image Data
[Image panels and labels: DRG, 50,000 Topo Maps; adding now, 4 TB; 6 TB Coming; USGS “DOQ”; Spin-2, 1 TB, WorldWide; New Data Coming]
When we started the project, we searched for data sets we could find in volume which would be interesting to novices. UC Santa Barbara introduced us to “Digital Ortho Quadrangles” (DOQ) from the USGS, shown at the top left. This is 1m / pixel aerial imagery. The color imagery in the top right is another USGS product -- Digital Raster Graphics. It is scanned topo maps. Though the entire US is available, we decided that the topo maps were too much like the Expedia maps. Also, DRGs are hard to process since they are a scan of a paper image that is susceptible to a huge amount of error.
Because 50% of Microsoft customers are outside the U.S., we wanted to obtain world-wide coverage. We first looked at Spot Image, a French company that has been in business since the mid-eighties with satellites. Spot had 2 TB of 5m and 10m resolution imagery. On the slide is a sample of 10m imagery. Physically it is the same location as the image on the right from SPIN-2. Most people have to be told that fact. Thus, we abandoned working with Spot Image in favor of the SPIN-2 folks because they have worldwide high resolution coverage.
Not many other companies have high resolution satellite data available commercially in large volumes. Turns out there was an international treaty that banned distribution of imagery better than 5m per pixel. This ban has since been lifted. Thus only former government agencies had substantial volumes of high resolution data. And that’s how we ended up having two themes -- USGS & SPIN-2 -- because they were roughly the same resolution, both in large volumes, from roughly the same period of time.

44 Software Architecture
[Diagram: Web Client (IE 3…5, Netscape 3…4; HTML, Java Viewer) connects over The Internet to the Terra-Server Web Site (Internet Information Server 4.0, Terra-Server Active Server Pages, Active Data Object, ODBC), which is backed by SQL Server 7.0 (Terra-Server DB, Terra-Server Stored Procedures). Image Commerce Site(s): Image Delivery Application, Active Server Pages, Microsoft Site Server EE 3.0, SQL Server SPIN-2/USGS Store. Counts shown on the diagram: 19, 24, 39 (14 Img) (8 Place), 13.]

45 How Images are Found
Name Search: 40%
Expedia: 22%
Coverage Map: 19%
Famous Places: 18%
Geo Coordinate: 1%

46 TerraServer: Lots of Web Hits
As of Feb 28, 1999
Summary         Total    Average   Max
Unique Users    17 M     69 k      150 k
Sessions        24 M     94 k      172 k
Hits            1.7 B    6.8 M     29 M
Page Views      274 M    1.1 M     6.6 M
DB Queries      1.5 B    5.8 M     18 M
Image Xfers     1.3 B    5.0 M     15 M
Today:
  1.7 billion web hits
  1 TB, largest SQL DB on the Web
  100 qps average, 1,000 Qps peak
  1.5 B SQL queries so far

47 Logical Schema
[Schema diagram, grouped as Gazetteer and Image Data & Meta Data. Tables: Country Name, State, Place, PlaceType, Feature Type, Where Am I, Img Meta, Tile Meta, Jump Img, Browse Img, Tile Img, Theme Meta Information, Spin Frame Meta, Thumb Img]
Index on
  • image, place, type
  • image, state, type
  • image, state, country, type
  • image, place, state, type
  • image, place, country, type
  all lookups are fast
Lookup by UGrid or ZGrid ID plus resolution
  Lookups are fast.
Indices are in DRAM (auto-magically by SQL)
SQL manages all the tiles and indices
Images are brought in on demand

48 Image Load and Update
[Diagram components: DLT Tape, “tar”, Staging Disk, Metadata Load DB, Active Server Pages, Cut & Load Scheduling System, Image Cutter, Merge, JPEG tiles, TerraLoader, Dither Image Pyramid From base, TerraServer SQL DBMS; connected by ODBC Tx]

49 TerraServer Administrator Web Site
Accessible by Microsoft, SPIN-2, and USGS
Web browser forms to:
  Edit Famous Places list
  Modify Image Status fields
  Define new TerraServer Administrators

50 Load & Backup&Recovery
Backup and Recovery
  Using Legato Networker integrated with SQL Backup/Restore Utility
  Fast, incremental, differential, online
  Restore: Fast, incremental (file oriented), not online.
SQL Server Enterprise Manager
DBA Maintenance
SQL Performance Monitor

51 Site Configuration
[Diagram: 9710 TimberWolf; Alpha 8400 (8x440), 10GB Ram; Enterprise Storage Array with 9 HSZ70 Ultra-SCSI Dual redundant Controllers, GB Seagate Disks; Compaq 5500 4x200mhz Web Servers; To the Web]
Talk about the 6 web servers front ending TerraServer.

52 The Microsoft TerraServer Hardware
Compaq AlphaServer 8400
  8x400Mhz Alpha cpus
  10 GB DRAM
GB StorageWorks Disks
  3 TB raw, 2.4 TB of RAID5
STK 9710 tape robot (4 TB)
WindowsNT 4 EE, SQL Server 7.0

53 File System Config I: F: G: H:
Use StorageWorks to form 28 RAID5 sets
  Each raid set has 11 disks (16 spare drives)
Use NTFS to form GB NT volumes
  Each striped over 7 Raid sets on 7 controllers
Create 26 20,000MB files on F:, 27 on G:
DB is File Group of 53 files (1.011TB)
[Diagram: volumes F:, G:, H:, I:]
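The layout numbers on this slide can be cross-checked with a little arithmetic. The constants are from the slide; the binary-unit conversion (1 TB = 1024 x 1024 MB) is an assumption that happens to reproduce the stated 1.011 TB, and the function names are hypothetical.

```python
# Figures from the slide.
RAID_SETS = 28
DISKS_PER_SET = 11
SPARE_DRIVES = 16
FILES_ON_F = 26
FILES_ON_G = 27
FILE_SIZE_MB = 20_000


def total_disks():
    """All drives in the array: RAID set members plus spares."""
    return RAID_SETS * DISKS_PER_SET + SPARE_DRIVES


def raid5_data_disks():
    """Drives' worth of usable capacity: RAID5 gives up one drive per set to parity."""
    return RAID_SETS * (DISKS_PER_SET - 1)


def db_file_count():
    """Number of files in the database file group."""
    return FILES_ON_F + FILES_ON_G


def db_size_tb():
    """File group size in TB, assuming binary units (1 TB = 1024 * 1024 MB)."""
    return db_file_count() * FILE_SIZE_MB / (1024 * 1024)


if __name__ == "__main__":
    print(total_disks(), raid5_data_disks(), db_file_count())
    print(round(db_size_tb(), 3))
```

The 53 files of 20,000 MB each come to about 1.011 TB in binary units, and 10 of every 11 drives in a RAID5 set hold data.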

54 SQL 7 TerraServer Availability
Operating for 9 months: 6400 hrs
Unscheduled outage: 36.5 minutes: % scheduled up
Scheduled outage: 60 minutes
Availability: % overall up
No NT failures (ever)
One SQL7 Beta2 bug
No failures in July, Aug, Oct, Dec, Jan, Feb, Mar
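Availability percentages of this kind are usually derived as one minus downtime over elapsed time. The sketch below applies that formula to the hours and outage minutes stated on the slide; whether the slide's own percentages were computed the same way is an assumption, and the function name is hypothetical.

```python
# Figures from the slide.
OPERATING_HOURS = 6400
UNSCHEDULED_OUTAGE_MIN = 36.5
SCHEDULED_OUTAGE_MIN = 60


def availability_pct(elapsed_minutes, downtime_minutes):
    """Percentage of elapsed time the service was up."""
    return 100.0 * (1.0 - downtime_minutes / elapsed_minutes)


if __name__ == "__main__":
    elapsed = OPERATING_HOURS * 60
    # Counting only unscheduled outages.
    print(f"{availability_pct(elapsed, UNSCHEDULED_OUTAGE_MIN):.4f}%")
    # Counting scheduled and unscheduled outages together.
    total_down = UNSCHEDULED_OUTAGE_MIN + SCHEDULED_OUTAGE_MIN
    print(f"{availability_pct(elapsed, total_down):.4f}%")
```

Under this formula, 96.5 minutes of total downtime over 6400 hours is roughly "three and a half nines" of overall availability.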

55 Things we did right... Use a database to store images:
  Simplify management
  Can dynamically load data into tables while viewing application is active
Simple X, Y Z-Grid navigation system
Used ImgStatus to control logical “presence” of the image in the app
“Stitching tiles together” from multiple input images to form seamless mosaic
Offering two forms of seamless -- time based (SPIN-2) and theme based (DOQ)

56 TS 3: Things are changing...
Square Tiles, power of 2 size (200x200)
Power of 2 zoom levels (2:1, 4:1, 8:1, etc.) so uniform tile size on each zoom (variable ground size per tile)
Indexing system independent of tile size
Digital Raster Graphics (Topo maps)
Layered Maps (Topo merge with DOQ)
Integrate with other applications and services
Later:
  Digital Elevation Models (DEMs)
  Other foreign data sources (EU, etc.)

57 What TerraServer Shows
Can serve huge databases on the Internet for about a penny a page view, mostly phone bill (!).
Advertising pays more than a penny a page.
Commodity tools do scale fairly far.
A few people (3 developers, 1 operator) using power tools can build an impressive web site.

58 Thank You!
Tom Barclay did most of this app, Slutz and Gray helped.
SPIN-2

59 Outline The bandwidth revolution ScaleUp, ScaleOut
TerraServer (Barclay, Slutz, Gray)

60 end

61 Windows NT Versus UNIX Best Results on an SMP: SemiLog plot shows 3x (2 year) lead by UNIX see

62 TPC C Improvements (MS SQL) 250%/year on Price, 100%/year performance
40% hardware, 100% software, 100% PC Technology

63 Price Breakdown (6 months old)

64 (dis) Economy Of Scale
