1 Scaleable WindowsNT? Jim Gray, Microsoft Research Gray@Microsoft.com http://research.Microsoft.com/~Gray

2 Outline: What is Scalability? Why does Microsoft care about ScaleUp? Current ScaleUp status: NT5 & SQL7 & Exchange

3 Scale Up and Scale Out. Grow up with SMP (4xP6 is now standard); grow out with a cluster of PCs built from inexpensive parts. [Diagram: personal system, departmental server, and SMP super server along the scale-up axis; a cluster of PCs along the scale-out axis.]

4 Billions Of Clients. Every device will be “intelligent”: doors, rooms, cars… Computing will be ubiquitous.

5 Billions Of Clients Need Millions Of Servers. All clients are networked to servers; they may be nomadic or on-demand. Fast clients want faster servers. Servers provide shared data, control, coordination, and communication. [Diagram: mobile and fixed clients connected to servers and superservers.]

6 Thesis: Many little beat few big (the “smoking, hairy golf ball”). How do we connect the many little parts? How do we program the many little parts? What about fault tolerance? [Diagram: price bands from $1 million mainframes through $100K minis and $10K micros to nano and pico processors; disk form factors shrinking from 14" to 1.8"; projections such as 1M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, multiprogrammed cache and on-chip SMP, 10 ns RAM, 10 µs RAM, 10 ms disk, 10 s tape archive, and capacities from 1 MB to 100 TB.]

7 Outline: What is Scalability? Why does Microsoft care about ScaleUp? Current ScaleUp status: NT5 & SQL7 & Exchange

8 Scalability: 1 billion transactions, 1.8 million mail messages, 4 terabytes of data, 100 million web hits. Scale up: to large SMP nodes. Scale out: to clusters of SMP nodes.

9 “Commercial” NT Clusters. 16-node Tandem Cluster » 64 cpus » 2 TB of disk » decision support. 45-node Compaq Cluster » 140 cpus » 14 GB DRAM » 4 TB RAID disk » OLTP (DebitCredit), 1 B tpd (14 k tps).

10 Tandem Oracle/NT: 27,383 tpmC, 71.50 $/tpmC, 4 x 6 cpus, 384 disks (= 2.7 TB)

11 24 cpus, 384 disks (= 2.7 TB)

12 Billion Transactions per Day Project: built a 45-node Windows NT Cluster (with help from Intel & Compaq), > 900 disks, all off-the-shelf parts, using SQL Server & DTC distributed transactions. DebitCredit transaction: each node has 1/20th of the DB and does 1/20th of the work; 15% of the transactions are “distributed”.
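
The partitioning above implies a simple routing rule: a transaction that touches rows owned by a single node commits locally, while one whose rows straddle two partitions must be coordinated by DTC. A minimal sketch of that decision, assuming modulo placement of both tellers and accounts across the 20 database partitions (the placement rule and field names are illustrative, not from the talk):

    #include <cstdint>

    const int kPartitions = 20;   // each node holds 1/20th of the DB (from the slide)

    // Which node owns a given row key; simple modulo placement is an assumption.
    int OwnerNode(uint64_t id) { return (int)(id % kPartitions); }

    // A DebitCredit request names a teller (fixing the "home" node) and an
    // account, which is usually local but sometimes lives on another node.
    struct DebitCredit {
        uint64_t tellerId;
        uint64_t accountId;
        double   amount;
    };

    // True when the transaction spans two nodes and therefore must run as a
    // DTC-coordinated distributed transaction; per the slide, about 15% do.
    bool IsDistributed(const DebitCredit &t)
    {
        return OwnerNode(t.tellerId) != OwnerNode(t.accountId);
    }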

13 Billion Transactions Per Day Hardware: 45 nodes (Compaq Proliant) clustered with 100 Mbps switched Ethernet; 140 cpus, 13 GB DRAM, 3 TB of disk.

14 How Much Is 1 Billion Tpd? 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute). AT&T » 185 million calls per peak day (worldwide). Visa ~20 million tpd » 400 million customers » 250K ATMs worldwide » 7 billion transactions (card + cheque) in 1994. New York Stock Exchange » 600,000 tpd. Bank of America » 20 million tpd checks cleared (more than any other bank) » 1.4 million tpd ATM transactions. Worldwide airline reservations: 250 Mtpd.

15 All Shipping Products! Infinite, ubiquitous scaling; redefining the rules. [Diagram: multiple SQL Server nodes with COM/ActiveX, MTS, and IIS.]

               Per Sec    Per Min         Per Day
   10K TPC         166     10,000      14,400,000
   1 BTPD       11,574    694,444   1,000,000,000
   1.4 BTPD     16,204    972,222   1,400,000,000

16 Microsoft.com: ~150 x 4-cpu nodes (3)

17 NCSA Super Cluster (National Center for Supercomputing Applications, University of Illinois @ Urbana): 512 Pentium II cpus, 2,096 disks, SAN. Compaq + HP + Myricom + WindowsNT. A supercomputer for 3 M$. Classic Fortran/MPI programming and a DCOM programming model. http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

18 TPC-C Improved Fast (250%/year!): 40% hardware, 100% software, 100% PC technology

19 Windows NT Versus UNIX

20 Economy Of Scale

21 Microsoft TerraServer: Scaleup to Big Databases. Build a 1 TB SQL Server database. Data must be » 1 TB » unencumbered » interesting to everyone everywhere » and not offensive to anyone anywhere. Loaded » 1.5 M place names from Encarta World Atlas » 3 M sq km from USGS (1 meter resolution) » 1 M sq km from the Russian Space Agency (2 m). On the web (world’s largest atlas). Sell images with a commerce server.

22 Microsoft TerraServer Background. Earth is 500 tera-meters square » the USA is 10 tm² » 100 tm² of land lies between 70ºN and 70ºS. We have pictures of 6% of it » 3 tsm from USGS » 2 tsm from the Russian Space Agency. Compress 5:1 (JPEG) to 1.5 TB. Slice into 10 KB chunks. Store chunks in the DB. Navigate with » Encarta™ Atlas globe gazetteer » StreetsPlus™ in the USA. [Image pyramid: 40x60 km² jump image, 20x30 km² browse image, 10x15 km² thumbnail, 1.8x1.2 km² tile.] Someday » multi-spectral images » of everywhere » once a day / hour.
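
A minimal sketch of the tile addressing implied by that pyramid: the imagery is cut into fixed-size tiles, each stored as a roughly 10 KB compressed chunk keyed by resolution level and grid position, so the database can fetch a tile, or its coarser browse/jump parents, by key. The key layout and scale factor below are illustrative assumptions, not the actual TerraServer schema:

    // A tile is addressed by its pyramid level and grid position; the stored
    // blob is the ~10 KB JPEG chunk kept as a row in the database.
    struct TileKey {
        int level;    // 0 = finest resolution, higher = coarser (assumed)
        int tileX;    // column in the grid at that level
        int tileY;    // row in the grid at that level
    };

    // Map a ground position (meters in some projection) to the tile that
    // covers it, given the tile's ground footprint at this level.
    TileKey TileForPosition(double xMeters, double yMeters,
                            int level, double tileMetersAtLevel)
    {
        TileKey k;
        k.level = level;
        k.tileX = (int)(xMeters / tileMetersAtLevel);
        k.tileY = (int)(yMeters / tileMetersAtLevel);
        return k;
    }

    // Coarser pyramid images (thumbnail, browse, jump) cover the same ground
    // with fewer tiles, so a parent tile is the child's coordinates divided
    // by the per-level scale factor.
    TileKey ParentTile(const TileKey &child, int scaleFactor /* e.g. 2, assumed */)
    {
        TileKey p;
        p.level = child.level + 1;
        p.tileX = child.tileX / scaleFactor;
        p.tileY = child.tileY / scaleFactor;
        return p;
    }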

23 Demo: navigate by coverage map to the White House; download an image; buy imagery from USGS; navigate by name to Venice; buy a SPIN2 image & Kodak photo; pop out to the Expedia street map of Venice. Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN2).

24 The Microsoft TerraServer Hardware: Compaq AlphaServer 8400, 8 x 400 MHz Alpha cpus, 10 GB DRAM, 324 9.2 GB StorageWorks disks » 3 TB raw, 2.4 TB of RAID5; STK 9710 tape robot (4 TB); WindowsNT 4 EE, SQL Server 7.0.

25 TerraServer Web Site Software. [Architecture diagram: web clients (browser with HTML or a Java viewer) cross the Internet to Internet Information Server 4.0 running Active Server Pages under MTS; the image delivery application and the Microsoft Automap ActiveX server call TerraServer stored procedures in SQL Server 7 (the TerraServer DB); Microsoft Site Server EE and image provider site(s) round out the system.]

26 Image Delivery and Load. Incremental load of 4 more TB in the next 18 months. Load pipeline steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place... [Diagram: imagery arrives on DLT tape (“tar” and NT Backup); ImgCutter and LoadMgr drop jobs and images into \Drop’N’ directories; two AlphaServer 4100s and an AlphaServer 8400 with an Enterprise Storage Array (3 x 108 9.1 GB drives plus 60 4.3 GB drives) and an STK DLT tape library are connected by a 100 Mbit Ethernet switch; LoadMgr loads the DB.]

27 TerraServer: A Real “World” Example. Largest DB on the Web: 1.3 TB. 99.95% uptime since July 1. No downtime, period, in August. 70% of downtime was for SQL software upgrades.

28 NT Clusters (Wolfpack). Scale DOWN to PDA: WindowsCE. Scale UP an SMP: TerraServer. Scale OUT with a cluster of machines. Single-system image » naming » protection/security » management/load balance. Fault tolerance » “Wolfpack”. Hot-pluggable hardware & software.

29 Symmetric Virtual Server Failover Example. [Diagram: a browser connects to Server 1 and Server 2; the web site and the database run as virtual servers, each with its web site files or database files; when one server fails, its virtual server fails over to the other.]

30 Windows NT 5 (scalability features). Better SMP support. Clusters: » 16x packs (fault-tolerant clusters) » 100x mobs: arrays for manageability » SAN/VIA support. 64-bit addressing for data » apps like SQL, Oracle will use it for data » 64-bit API to NT comes later (in the lab now). Remote management (scripting and DCOM). Active Directory. Veritas volume manager. Many 3rd-party HSMs. Batch support.

31 Microsoft SQL Server 7.0. Fixes the famous performance bugs » dynamic record locking » online backup, quick recovery… 64-bit addressing for the buffer pool. SMP parallelism and better SMP support. Built-in OLAP (cubes and MOLAP). Scales down to Win9x. Improved management interfaces. Data Transformation Services (for warehouses).

32 Outline: What is Scalability? Why does Microsoft care about ScaleUp? Current ScaleUp status: NT5 & SQL7

33 End. Other slides would be interesting, but...

34 Interesting “other slides”. No time for them, but... How much information is there? IO bandwidth in the Intel world. Intelligent disks. SAN/VIA. NT Cluster Sort.

35 Some Tera-Byte Databases. The Web: 1 TB of HTML. TerraServer: 1 TB of images. Several other 1 TB (file) servers. Hotmail: 7 TB of email. Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked. EOS/DIS (picture of the planet each week) » 15 PB by 2007. Federal clearing house: images of checks » 15 PB by 2006 (7-year history). Nuclear Stockpile Stewardship Program » 10 exabytes (???!!). [Chart scale: kilo, mega, giga, tera, peta, exa, zetta, yotta.]

36 Info Capture. You can record everything you see or hear or read. What would you do with it? How would you organize & analyze it? Video: 8 PB per lifetime (10 GB/h). Audio: 30 TB (10 KB/s). Read or write: 8 GB (words). See: http://www.lesk.com/mlesk/ksg97/ksg.html [Chart: scale from kilo to yotta with markers for a letter, a novel, a movie, Library of Congress text and image collections, all disks, all tapes.]

37 Michael Lesk’s Points (www.lesk.com/mlesk/ksg97/ksg.html): Soon everything can be recorded and kept. Most data will never be seen by humans. Precious resource: human attention. Auto-summarization and auto-search will be key enabling technologies.

38 PAP (Peak Advertised Performance) vs RAP (Real Application Performance). Goal: RAP = PAP / 2 (the half-power point). [Diagram: the path from disk through SCSI, PCI, the system bus, and file-system buffers to application data; advertised rates are 10-15 MBps (disk), 40 MBps (SCSI), 133 MBps (PCI), and 422 MBps (system bus), but the application sees 7.2 MB/s at every stage.]

39 PAP vs RAP. Reads are easy, writes are hard. Async write can match WCE (write cache enabled). [Diagram: measured vs advertised rates along the same path: SCSI disks 9 of 10-15 MBps, SCSI 31 of 40 MBps, PCI 72 of 133 MBps, system bus 142 of 422 MBps.]

40 Bottleneck Analysis. NTFS read/write with 12 disks, 4 SCSI adapters, 2 PCI buses (not measured; we had only one PCI bus available, the 2nd one was “internal”): ~120 MBps unbuffered read, ~80 MBps unbuffered write, ~40 MBps buffered read, ~35 MBps buffered write. [Diagram: memory read/write ~150 MBps, PCI ~70 MBps, adapter ~30 MBps, PCI adapter 120 MBps.]
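
The gap between buffered and unbuffered rates above comes from the extra memory copy through the file-system cache. A minimal sketch of an unbuffered sequential-read timing loop using the Win32 API; the file name and 64 KB request size are illustrative assumptions, not values from the measurement:

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const DWORD kRequest = 64 * 1024;   // request size (assumed; must be sector aligned)

        // FILE_FLAG_NO_BUFFERING bypasses the file-system cache, so the buffer
        // must be sector aligned; VirtualAlloc returns page-aligned memory.
        void *buf = VirtualAlloc(NULL, kRequest, MEM_COMMIT, PAGE_READWRITE);

        HANDLE h = CreateFileA("D:\\bigfile.dat", GENERIC_READ, 0, NULL,
                               OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                               NULL);
        if (h == INVALID_HANDLE_VALUE) { printf("open failed\n"); return 1; }

        DWORD start = GetTickCount();
        unsigned __int64 total = 0;
        DWORD got = 0;
        while (ReadFile(h, buf, kRequest, &got, NULL) && got > 0)
            total += got;                   // sequential, synchronous reads
        DWORD ms = GetTickCount() - start;

        printf("%I64u bytes in %lu ms = %.1f MBps\n",
               total, ms, ms ? (total / 1e6) / (ms / 1e3) : 0.0);

        CloseHandle(h);
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }

Dropping FILE_FLAG_NO_BUFFERING runs the same loop through the file-system cache, which is roughly the comparison behind the buffered vs unbuffered numbers on the slide.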

41 Year 2002 Disks. Big disk (10 $/GB) » 3” » 100 GB » 150 kaps (k accesses per second) » 20 MBps sequential. Small disk (20 $/GB) » 3” » 4 GB » 100 kaps » 10 MBps sequential. Both running Windows NT™ 7.0? (see below for why)

42 How Do They Talk to Each Other? Each node has an OS. Each node has local resources: a federation. Each node does not completely trust the others. Nodes use RPC to talk to each other » CORBA? DCOM? IIOP? RMI? » one or all of the above. Huge leverage in high-level interfaces. Same old distributed-system story. [Diagram: on each side of the wire(s), applications sit on streams, datagrams, and RPC; one stack runs over VIAL/VIPL.]

43 SAN: Standard Interconnect. LAN faster than the memory bus? 1 GBps links in the lab. 300$ port cost soon. The port is the computer. RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ? [Chart: Gbps Ethernet 110 MBps, PCI-32 70 MBps, UW SCSI 40 MBps, FW SCSI 20 MBps, SCSI 5 MBps.]

44 PennySort. Hardware » 266 MHz Intel PPro » 64 MB SDRAM (10 ns) » dual Fujitsu DMA 3.2 GB EIDE. Software » NT workstation 4.3 » NT 5 sort. Performance » sort 15 M 100-byte records (~1.5 GB) » disk to disk » elapsed time 820 sec, cpu time = 404 sec.

45 Cluster Sort Conceptual Model. Multiple data sources, multiple data destinations, multiple nodes: Disks -> Sockets -> Disk -> Disk. [Diagram: each source node holds a mix of A, B, and C records; after the exchange, one destination node holds all the A records, another all the B records, another all the C records.]
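
A minimal sketch of the partitioning step that decides which destination node each record is streamed to over its socket; the 100-byte record layout follows the standard sort-benchmark format, and the range split on the first key byte is an illustrative assumption:

    // One 100-byte sort record: 10-byte key + 90-byte payload.
    struct Record {
        unsigned char key[10];
        unsigned char payload[90];
    };

    // Range-partition on the first key byte: node i receives keys in
    // [i*256/nodes, (i+1)*256/nodes). With uniformly distributed keys,
    // every destination gets roughly 1/nodes of the data.
    int DestinationNode(const Record &r, int nodes)
    {
        return ((int)r.key[0] * nodes) / 256;
    }

Each node reads its local disk, calls DestinationNode on every record, and writes the record to that node's socket; the receiver sorts what arrives and writes the result to its local disk, giving the Disks -> Sockets -> Disk -> Disk flow above.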

46 Cluster Install & Execute. If this is to be used by others, it must be easy to install and easy to execute. Installations of distributed systems take time and can be tedious (AM2, GluGuard). Parallel remote execution is non-trivial (GLUnix, LSF). How do we keep this “simple” and “built-in” to NTClusterSort?

47 Remote Install: add a Registry entry to each remote node using RegConnectRegistry() and RegCreateKeyEx().
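
A minimal sketch of that remote registry write, assuming the caller has remote-registry access on the target node; the machine name, key path, and value are placeholders, not details from the talk:

    #include <windows.h>
    #include <string.h>

    // Add an install entry to one remote node's registry.
    bool RemoteInstall(const char *machine)   // e.g. "\\\\node01" (placeholder)
    {
        HKEY hRemote = NULL, hKey = NULL;

        // Connect to HKEY_LOCAL_MACHINE on the remote node.
        if (RegConnectRegistryA(machine, HKEY_LOCAL_MACHINE, &hRemote) != ERROR_SUCCESS)
            return false;

        // Create (or open) the application's key on that node.
        DWORD disposition = 0;
        if (RegCreateKeyExA(hRemote, "SOFTWARE\\NTClusterSort",   // key path is an assumption
                            0, NULL, REG_OPTION_NON_VOLATILE,
                            KEY_WRITE, NULL, &hKey, &disposition) != ERROR_SUCCESS) {
            RegCloseKey(hRemote);
            return false;
        }

        // Record where the binaries were copied (value name and data are placeholders).
        const char *path = "C:\\NTClusterSort";
        RegSetValueExA(hKey, "InstallDir", 0, REG_SZ,
                       (const BYTE *)path, (DWORD)strlen(path) + 1);

        RegCloseKey(hKey);
        RegCloseKey(hRemote);
        return true;
    }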

48 Cluster Execution. Setup: fill in a MULTI_QI struct and a COSERVERINFO struct, call CoCreateInstanceEx(), retrieve the remote object’s interface pointer from the MULTI_QI struct, then invoke methods as usual, e.g. HANDLE Sort().
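
A minimal sketch of that DCOM activation sequence; the class ID, requested interface, and host name are placeholders for whatever NTClusterSort actually registers, and CoInitializeEx is assumed to have been called on the calling thread:

    #include <objbase.h>

    HRESULT LaunchRemoteSort(const wchar_t *host,   // e.g. L"node01" (placeholder)
                             REFCLSID clsid, REFIID iid, IUnknown **ppServer)
    {
        // Name the node the object should be created on.
        COSERVERINFO server = {};
        server.pwszName = const_cast<wchar_t *>(host);

        // Ask for one interface on the new remote object.
        MULTI_QI mqi = {};
        mqi.pIID = &iid;

        HRESULT hr = CoCreateInstanceEx(clsid, NULL, CLSCTX_REMOTE_SERVER,
                                        &server, 1, &mqi);
        if (FAILED(hr) || FAILED(mqi.hr))
            return FAILED(hr) ? hr : mqi.hr;

        // The interface pointer comes back inside the MULTI_QI struct;
        // from here methods such as Sort() are invoked as usual.
        *ppServer = mqi.pItf;
        return S_OK;
    }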

