1 BARC: Microsoft Bay Area Research Center Tom Barclay Tyler Beam (U VA)* Gordon Bell Joe Barrera Josh Coates (UCB)* Jim Gemmell Jim Gray Steve Lucco Erik Riedel (CMU)* Eve Schooler (Cal Tech) Don Slutz Catherine Van Ingen (NTFS)*

2 Overview » Telepresence: goals, prototypes » RAGS: automating software testing » Scaleable systems: goals, prototypes » Misc.

3 [image-only slide]

4 Telepresence: The Next Killer App Space shifting: » Reduce travel Time shifting: » Retrospectives » Condensations » Just-in-time meetings Example: ACM 97 » NetShow and Web site » More web visitors than attendees

5 What We Are Doing Scalable Reliable Multicast (SRM) » used by WB (white board) on the Mbone » NACK suppression (backoff) » N² message traffic to set up Error Correcting SRM (ECSRM) » Do not resend lost packets » Send error correction in addition to the regular packets » (or) send error correction in response to a NACK » One EC packet repairs the loss of any one of a group's k packets » Improved scaleability (millions of subscribers).

6 Telepresence Prototypes PowerCast: multicast PowerPoint » Streaming - pre-sends next anticipated slide » Send slides and voice rather than talking head and voice » Uses ECSRM for reliable multicast » 1000’s of receivers can join and leave any time. » No server needed; no pre-load of slides. » Cooperating with NetShow FileCast: multicast file transfer. » Erasure encodes all packets » Receivers only need to receive as many bytes as the length of the file » Multicast IE to solve Midnight-Madness problem NT SRM: reliable IP multicast library for NT Spatialized Teleconference Station » Texture map faces onto spheres » Space map voices

7 IP Multicast » Is pruned broadcast to a multicast address » Unreliable » Reliable would require Ack/NACK: state or NACK implosion problem. [Figure: router tree with senders, receivers, and not-interested nodes; branches without receivers are pruned.]
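At the socket level, joining a group is a single setsockopt; everything after that is best-effort datagrams, which is why reliability must be layered on top and why the implosion problem arises. A minimal POSIX-sockets sketch; the group address 239.1.2.3 and port 5000 are made-up examples and error handling is elided:

```cpp
// Minimal receiver: join a multicast group and take whatever arrives.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (sockaddr*)&addr, sizeof addr);

    // Joining tells the routers to graft this host onto the pruned tree.
    ip_mreq mreq = {};
    inet_pton(AF_INET, "239.1.2.3", &mreq.imr_multiaddr);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

    char buf[1500];                           // one MTU-sized datagram
    ssize_t n = recv(s, buf, sizeof buf, 0);  // best-effort: no ACK, no retry
    printf("got %zd bytes\n", n);
    close(s);
}
```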

8 (n,k) encoding: encode the k original packets into n packets (the first k are copies of the originals); a receiver can decode the k originals from any k of the n. [Figure: packets 1…k encoded into 1…k, k+1…n; decode takes any k.]
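A minimal sketch of the n = k+1 special case using one XOR parity packet: the parity repairs any single lost packet of the group, which is the "one EC packet repairs any of k" property named earlier. A production coder (e.g. Reed-Solomon) generalizes to arbitrary n; the group size and payload length below are arbitrary assumptions:

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <cstdio>

constexpr int K = 4;     // packets per EC group (assumed small for clarity)
constexpr int LEN = 8;   // payload bytes per packet (assumed)

using Packet = std::array<uint8_t, LEN>;

// Encode: one parity packet = XOR of the k originals (the (k+1, k) case).
Packet parity(const Packet (&group)[K]) {
    Packet p{};
    for (auto& pkt : group)
        for (int i = 0; i < LEN; ++i) p[i] ^= pkt[i];
    return p;
}

// Decode: XOR of the parity with the k-1 survivors rebuilds the lost packet.
Packet repair(const Packet& par, const Packet* survivors, int count) {
    Packet lost = par;
    for (int s = 0; s < count; ++s)
        for (int i = 0; i < LEN; ++i) lost[i] ^= survivors[s][i];
    return lost;
}

int main() {
    Packet group[K];
    for (int g = 0; g < K; ++g) group[g].fill(uint8_t('A' + g));
    Packet par = parity(group);

    // Pretend packet 2 was lost; repair it from the other three plus parity.
    Packet survivors[K - 1] = {group[0], group[1], group[3]};
    Packet fixed = repair(par, survivors, K - 1);
    printf("repaired ok: %d\n", memcmp(fixed.data(), group[2].data(), LEN) == 0);
}
```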

9 Fcast File transfer protocol FEC-only Files transmitted in parallel

10 Fcast send order: one packet index per pass, cycling round-robin across the groups of every file; a receiver needs any k packets from each group (row). [Figure: rows of packet groups 1…k, k+1…n for File 1, File 2, …; a lost packet (X) is simply replaced by a later packet from the same group.] (Sketch of the loop below.)
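A toy sketch of that send order under assumed small sizes: cycling the packet index in the outer loop delivers one new packet per group of every file on each pass, so a receiver that lost a packet just waits for the next pass:

```cpp
// Hypothetical send loop; file/group/packet counts are assumptions.
#include <cstdio>

int main() {
    const int files = 2, groups = 3, n = 7;   // n encoded packets per group
    for (int idx = 0; idx < n; ++idx)          // packet index within each group
        for (int f = 0; f < files; ++f)
            for (int g = 0; g < groups; ++g)
                printf("send file %d, group %d, packet %d\n", f, g, idx);
    // A receiver keeps the first k packets it sees from each (file, group).
}
```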

11 ECSRM - Erasure Correcting SRM Combines: » suppression » erasure correction

12 Suppression Delay a NACK or repair in the hopes that someone else will do it. NACKs are multicast After NACKing, re-set timer and wait for repair If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.
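A minimal sketch of that timer logic; the 50-200 ms delay range is an assumption, not the real protocol's tuning:

```cpp
// On detecting a loss, wait a random delay before NACKing; if another
// receiver's NACK for the same loss arrives first, suppress ours and
// restart the timer as if we had sent it.
#include <chrono>
#include <random>
#include <cstdio>

struct Suppressor {
    std::mt19937 rng{std::random_device{}()};
    bool pending = false;
    std::chrono::steady_clock::time_point fire;

    void onLossDetected() {                   // schedule a randomized NACK
        std::uniform_int_distribution<int> d(50, 200);   // ms; assumed range
        fire = std::chrono::steady_clock::now() +
               std::chrono::milliseconds(d(rng));
        pending = true;
    }
    void onNackHeard() {                      // someone beat us to it
        if (pending) onLossDetected();        // back off: reschedule, don't send
    }
    bool shouldSendNow() {
        return pending && std::chrono::steady_clock::now() >= fire;
    }
};

int main() {
    Suppressor s;
    s.onLossDetected();
    s.onNackHeard();                              // heard another NACK
    printf("send now? %d\n", s.shouldSendNow());  // almost surely 0: suppressed
}
```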

13 ECSRM - adding FEC to suppression » Assign each packet to an EC group of size k » NACK: (group, # missing) » NACK of (g,c) suppresses all (g, x ≤ c) » Don't re-send originals; send EC packets using (n,k) encoding

14 ECSRM » Combine suppression & erasure correction » Assign each packet to an EC group of size k » NACK: (group, # missing) » NACK of (g,c) suppresses all (g, x ≤ c) » Don't re-send originals; send EC packets using (n,k) encoding. [Figure: several receivers each miss a different packet (X) of one group; a single NACK and one EC packet repair all the losses.]

15 Multicast PowerPoint Add-in [Figure: slides, annotations, and control information flow over ECSRM; the slide master is sent via Fcast.]

16 Multicast PowerPoint - Late Joiners Viewers joining late don't impact others: session-persistent data (the slide master) is carried by Fcast. [Figure: timeline of join/leave events against the Fcast and ECSRM sessions.]

17 Future Work Adding hierarchy (e.g. PGM by Cisco) Do we need 2 protocols?

18 Spatialized Teleconferences Map heads to “Eggs” Project voices in stereo using “nose vector”

19 RAGS: RAndom SQL test Generator Microsoft spends a LOT of money on testing. (60% of development according to one source). Idea: test SQL by » generating random correct queries » executing queries against database » compare results with SQL 6.5, DB2, Oracle, Sybase Being used in SQL 7.0 testing. » 375 unique bugs found (since 2/97) » Very productive test tool
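A toy sketch of the generation idea: walk a (tiny) SQL grammar making random choices, so every emitted statement is syntactically valid. The table and column names are invented here; the real RAGS drew them from the target schema and generated far richer statements, as the next slide shows:

```cpp
#include <random>
#include <string>
#include <vector>
#include <cstdio>

std::mt19937 rng(42);
template <class T> const T& pick(const std::vector<T>& v) {
    std::uniform_int_distribution<size_t> d(0, v.size() - 1);
    return v[d(rng)];
}

// Random arithmetic expression over some (made-up) columns.
std::string expr(int depth) {
    std::vector<std::string> cols = {"t0.price", "t1.royalty", "t0.advance"};
    if (depth == 0) return pick(cols);
    std::vector<std::string> ops = {"+", "-", "*"};
    return "(" + expr(depth - 1) + " " + pick(ops) + " " + expr(depth - 1) + ")";
}

// Random (always parseable) SELECT; depth controls statement complexity.
std::string query(int depth) {
    return "SELECT " + expr(depth) + " FROM titles t0, roysched t1 WHERE " +
           expr(depth - 1) + " >= " + expr(depth - 1);
}

int main() {
    for (int i = 0; i < 3; ++i)
        printf("%s\n", query(2).c_str());
}
```

Running the same generated statements against several DBMSs and diffing the results (or the errors) is what turns random generation into a test oracle.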

20 Sample Rags Generated Statement SELECT TOP 3 T1.royalty, T0.price, "Apr :23AM", T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11, "Apr :23AM", T0.advance, ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance, (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs", T2.ord_date, AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange, ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) This Statement yields an error: SQLState=37000, Error=8623 Internal Query Processor Error: Query processor could not produce a query plan.

21 Automation Simpler Statement with same error SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1) Control statement attributes » complexity, kind, depth,... Multi-user stress tests » tests concurrency, allocation, recovery

22 One 4-Vendor RAGS Test: 3 of them vs. us » 60 k SELECTs on MSS, DB2, Oracle, Sybase » 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements » Examined 10 suspects, filed 4 bugs (one duplicate); assume 3/10 are new » Note: this is the SQL Server Beta 2 product; quality is rising fast (and RAGS sees that)

23 RAGS Next Steps Done: » Patents, papers, talks » Tech transfer to development: SQL 7 (over 400 bugs), FoxPro, OLE DB. Next steps: » Make it even more automatic » Extend to other parts of SQL and T-SQL » "Crawl" the config space (look for new holes) » Apply the ideas to other domains (OLE DB).

Scale Up and Scale Out » Grow up with SMP: 4xP6 is now standard » Grow out with a cluster: the cluster has inexpensive parts, a cluster of PCs. [Figure: scale-up ladder from personal system to departmental server to SMP super server.]

Billions Of Clients Every device will be “intelligent” Doors, rooms, cars… Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers » All clients networked to servers » May be nomadic or on-demand » Fast clients want faster servers » Servers provide: shared data, control, coordination, communication. [Figure: mobile and fixed clients connected to servers and superservers.]

Thesis: Many little beat few big » The "smoking, hairy golf ball" » How to connect the many little parts? » How to program the many little parts? » Fault tolerance? [Figure: price ladder mainframe ($1 million), mini ($100 K), micro ($10 K), nano; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; a projected pico processor: 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event-horizon on chip, VM reincarnated, multiprogram cache, on-chip SMP; latency ladder: 10 pico-second RAM, 10 nano-second RAM, 10 microsecond RAM, 10 millisecond disc, 10 second tape archive; capacities 1 MB, 100 MB, 10 GB, 1 TB, 1 MM TB.]

28 Microsoft TerraServer: Scaleup to Big Databases Build a 1 TB SQL Server database Data must be » 1 TB » Unencumbered » Interesting to everyone everywhere » And not offensive to anyone anywhere Loaded » 1.5 M place names from Encarta World Atlas » 3 M Sq Km from USGS (1 meter resolution) » 1 M Sq Km from Russian Space agency (2 m) On the web (world’s largest atlas) Sell images with commerce server.

29 Microsoft TerraServer Background Earth is 500 tera-square-meters » USA is 10 tm² » land in 70°N to 70°S We have pictures of 6% of it » 3 tsm from USGS » 2 tsm from Russian Space Agency Compress 5:1 (JPEG) to 1.5 TB. Slice into 10 KB chunks. Store chunks in DB. Navigate with » Encarta™ Atlas globe gazetteer » StreetsPlus™ in the USA. [Figure: image pyramid: 40x60 km² jump image, 20x30 km² browse image, 10x15 km² thumbnail, 1.8x1.2 km² tile.] Someday » multi-spectral image » of everywhere » once a day / hour

30 USGS Digital Ortho Quads (DOQ) US Geological Survey 4 terabytes Most data not yet published Based on a CRADA » Microsoft TerraServer makes the data available. [Figure: USGS "DOQ": 1x1 meter, 4 TB, continental US, new data coming.]

31 Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is the worldwide distributor) » 1.5-meter geo-rectified imagery of (almost) anywhere » Almost equal-area projection » De-classified satellite photos (from 200 km), more data coming (1 m) » Selling imagery on the Internet » Putting 2 tm² onto Microsoft TerraServer.

32 Live on the Internet 6/24/98, for 18 months: one billion served. New since S-Day: » More data: 4.8 TB USGS DOQ, 0.8 TB Russian » Bigger server: Alpha procs, 8 GB RAM, 2.9 TB disk » Improved application: better UI, uses ASP » Commerce app » Load 6 TB more. Traffic: 60% US, 40% world; 30 M web hits per day peak, 8 M hits per day average (1 M page views/day); 1 billion pages served! 99.95% available: no NT failures, a 30-minute SQL restart.

33 Demo [Slide of logos: Microsoft.com, SPIN-2, Microsoft BackOffice]

34 Demo » Navigate by coverage map to the White House » Download an image » Buy imagery from USGS » Navigate by name to Venice » Buy a SPIN-2 image & Kodak photo » Pop out to the Expedia street map of Venice » Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)

35 1 TB Database Server » AlphaServer 8400, 10 GB RAM » 324 StorageWorks disks » 10-drive tape library (STC TimberWolf DLT7000). [Figure: hardware topology: Alpha Server 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM) with enterprise storage arrays of 48 x 9 GB drives; STK 9710 DLT tape library; 100 Mbps Ethernet switch; DS3; site servers, Internet map server, SPIN-2 web servers.]

36 The Microsoft TerraServer Hardware » Compaq AlphaServer 8400, 8 x 400 MHz Alpha CPUs » 10 GB DRAM » 324 x 9 GB StorageWorks disks » 3 TB raw, 2.4 TB of RAID5 » STK 9710 tape robot (4 TB) » WindowsNT 4 EE, SQL Server 7.0

37 TerraServer Web Site Software [Figure: web client (browser: HTML or Java viewer) → the Internet → TerraServer web site: Internet Information Server 4.0 Active Server Pages with the Microsoft Automap ActiveX Server, MTS, the image delivery application, and TerraServer stored procedures on SQL Server 7; Microsoft Site Server EE; image provider site(s) run IIS 4.0 image servers feeding the TerraServer DB.]

38 System Management & Maintenance Backup and Recovery » STK 9710 tape robot » Legato NetWorker™ » SQL Server 7 Backup & Restore » Clocked at 80 MBps (peak) (~200 GB/hr) SQL Server Enterprise Mgr » DBA maintenance » SQL Performance Monitor

39 Microsoft TerraServer File Group Layout » Convert 324 disks to 28 RAID5 sets plus 28 spare drives » Make 4 WinNT volumes (RAID 50), 595 GB per volume » Build 30 x 20 GB files on each volume » DB is a file group of 120 files. [Figure: volumes E:, F:, G:, H:]

40 Image Delivery and Load Incremental load of 4 more TB in the next 18 months. [Figure: DLT tape ("tar", NT Backup) feeds ImgCutter, which drops cut images into \Drop'N'\Images; LoadMgr and its DB coordinate over a 100 Mbit EtherSwitch; pipeline steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place; AlphaServer 4100s with enterprise storage arrays and the STK DLT tape library.]

41 Technical Challenge / Key Idea Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server). Solution: a geo-spatial search key: » Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y) » Z-transform X & Y into a single Z value and build a B-tree on Z » Adjacent images are stored next to each other. Search method: » Latitude and longitude ⇒ X, Y, then Z » Select on matching Z value. (Sketch below.)
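A minimal sketch of such a key as bit interleaving (a Morton code). The slide does not spell out the exact transform, so the interleaving below is an assumption; it does map X and Y into a single B-tree-friendly Z where nearby cells usually get nearby keys:

```cpp
#include <cstdint>
#include <cmath>
#include <cstdio>

// Interleave the bits of the X and Y cell numbers into one Z value.
uint64_t interleave(uint32_t x, uint32_t y) {
    uint64_t z = 0;
    for (int i = 0; i < 32; ++i) {
        z |= (uint64_t)((x >> i) & 1) << (2 * i);
        z |= (uint64_t)((y >> i) & 1) << (2 * i + 1);
    }
    return z;
}

// Cell sizes from the slide: 1/48 degree longitude, 1/96 degree latitude.
uint64_t zkey(double lon, double lat) {
    uint32_t x = (uint32_t)std::floor((lon + 180.0) * 48);  // longitude cell
    uint32_t y = (uint32_t)std::floor((lat + 90.0) * 96);   // latitude cell
    return interleave(x, y);
}

int main() {
    // Venice and a point one cell east land on nearby Z values,
    // so their tiles sit near each other in the B-tree.
    printf("%llu\n", (unsigned long long)zkey(12.34, 45.44));
    printf("%llu\n", (unsigned long long)zkey(12.36, 45.44));
}
```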

42 Some Tera-Byte Databases » The Web: 1 TB of HTML » TerraServer: 1 TB of images » Several other 1 TB (file) servers » Hotmail: 7 TB of email » Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked » EOS/DIS (picture of the planet each week): 15 PB by 2007 » Federal Clearinghouse: images of checks, 15 PB by 2006 (7-year history) » Nuclear Stockpile Stewardship Program: 10 exabytes (???!!) [Sidebar: kilo-mega-giga-tera-peta-exa-zetta-yotta scale.]

43 [Figure: the kilo-to-yotta scale annotated with data sizes: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]

44 Michael Lesk's Points » Soon everything can be recorded and kept » Most data will never be seen by humans » Precious resource: human attention » Auto-summarization and auto-search will be key enabling technologies.

45 Scalability » 1 billion transactions » 1.8 million mail messages » 4 terabytes of data » 100 million web hits Scale up: to large SMP nodes Scale out: to clusters of SMP nodes

46 1 B tpd » 1 B tpd ran for 24 hrs » Out-of-the-box software » Off-the-shelf hardware » AMAZING! » Sized for 30 days » Linear growth » 5 micro-dollars per transaction

47 How Much Is 1 Billion Tpd? 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute) » AT&T: 185 million calls per peak day (worldwide) » Visa: ~20 million tpd; 400 million customers, 250 K ATMs worldwide, 7 billion transactions (card + cheque) in 1994 » New York Stock Exchange: 600,000 tpd » Bank of America: 20 million tpd checks cleared (more than any other bank), 1.4 million tpd ATM transactions » Worldwide airline reservations: 250 M tpd

48 NCSA Super Cluster » National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign » 512 Pentium II CPUs, 2,096 disks, SAN » Compaq + HP + Myricom + WindowsNT » A super computer for 3 M$ » Classic Fortran/MPI programming » DCOM programming model

49 NT Clusters (Wolfpack) Scale DOWN to PDA: WindowsCE Scale UP an SMP: TerraServer Scale OUT with a cluster of machines Single-system image » Naming » Protection/security » Management/load balance Fault tolerance » “Wolfpack” Hot pluggable hardware & software

50 Symmetric Virtual Server Failover Example [Figure: Server 1 and Server 2 each host a web site and a database with their web-site files and database files; when Server 1 fails, the browser's requests, the web site, and the database all move to Server 2.]

51 Clusters & BackOffice Research: Instant & Transparent failover Making BackOffice PlugNPlay on Wolfpack » Automatic install & configure Virtual Server concept makes it easy » simpler management concept » simpler context/state migration » transparent to applications SQL 6.5E & 7.0 Failover MSMQ (queues), MTS (transactions).

52 Storage Latency: How Far Away is the Data? [Figure: latencies scaled to human distances: registers = my head (1 min); on-chip cache = this room; on-board cache = this campus (10 min); memory = Sacramento (1.5 hr); disk = Pluto (2 years); tape/optical robot = Andromeda (2,000 years).]

53 The Memory Hierarchy: Measuring & Modeling Sequential IO » Where is the bottleneck? » How does it scale with SMP, RAID, new interconnects? Goals: » Balanced bottlenecks » Low overhead » Scale to many processors (10s) » Scale to many disks (100s). [Figure: data path from the app address space through the file cache and memory bus, over PCI to the adapter, SCSI, controller, and disk.]

54 Sequential IO: your mileage will vary. Measuring hardware & software, looking for software fixes, aiming for "out of the box" operation at the half-power point: 50% of peak power out of the box.
40 MB/sec: advertised UW SCSI
35r-23w MB/sec: actual disk transfer
29r-17w MB/sec: 64 KB request (NTFS)
9 MB/sec: single disk media
3 MB/sec: 2 KB request (SQL Server)

55 PAP (peak advertised performance) vs RAP (real application performance) Goal: RAP = PAP / 2 (the half-power point). [Figure: along the path disk → SCSI (40 MBps) → PCI (133 MBps) → system bus (422 MBps) → file system buffers → application data, the measured rate is 7.2 MB/s at every stage.]

56 The Best Case: Temp File, NO IO » Program uses a small (in-CPU-cache) buffer » So write/read time is bus move time (3x better than copy) » Paradox: the fastest way to move data is to write it then read it » This hardware is limited to 150 MBps per processor. [Figure: temp file read/write through the file system cache.]

57 Bottleneck Analysis [Figure, drawn to linear scale: theoretical bus bandwidth 422 MBps = 66 MHz x 64 bits; memory read/write ~150 MBps; memcopy ~50 MBps; disk R/W ~9 MBps.]

58 3 Stripes and You're Out! » 3 disks can saturate the adapter » Similar story with UltraWide » CPU time goes down with request size » Ftdisk (striping) is cheap. [Figure: throughput vs. number of striped disks.]

59 Parallel SCSI Busses Help » A second SCSI bus nearly doubles read and WCE throughput » Write needs deeper buffers » Experiment is unbuffered (3-deep + WCE) » ~2x

60 File System Buffering & Stripes (UltraWide Drives) » FS buffering helps small reads » FS buffered writes peak at 12 MBps » 3-deep async helps » Write peaks at 20 MBps » Read peaks at 30 MBps

61 PAP vs RAP » Reads are easy, writes are hard » Async write can match WCE. [Figure: advertised vs measured along the path: system bus 422 → 142 MBps; PCI 133 → 72 MBps; SCSI 40 → 31 MBps; single disk 9 MBps; file system and application data at the top.]

62 Bottleneck Analysis NTFS read/write, 9 disks, 2 SCSI busses, 1 PCI bus: » ~65 MBps unbuffered read » ~43 MBps unbuffered write » ~40 MBps buffered read » ~35 MBps buffered write. [Figure: memory read/write ~150 MBps; PCI ~70 MBps; each adapter ~30 MBps.]

63 Hypothetical Bottleneck Analysis NTFS read/write, 12 disks, 4 SCSI busses, 2 PCI busses (not measured; we had only one PCI bus available, the 2nd one was "internal"): » ~120 MBps unbuffered read » ~80 MBps unbuffered write » ~40 MBps buffered read » ~35 MBps buffered write. [Figure: memory read/write ~150 MBps; each PCI ~70 MBps; each adapter ~30 MBps.]

64 Computers shrink to a point » Disks on track for 100x in 10 years: a 2 TB 3.5" drive » Shrunk to 1", that is 200 GB » Disk replaces tape? » Disk is a super computer!

65 Data Gravity: Processing Moves to Transducers » Move processing to data sources » Move to where the power (and sheet metal) is » Processor in: modem, display, microphones (speech recognition) & cameras (vision), storage (data storage and analysis)

66 It's Already True of Printers Peripheral = CyberBrick You buy a printer and you get: » several network interfaces » a PostScript engine: CPU, memory, software, a spooler (soon) » and… a print engine.

67 Remember Your Roots

68 Year 2002 Disks Big disk (10 $/GB) » 3” » 100 GB » 150 kaps (k accesses per second) » 20 MBps sequential Small disk (20 $/GB) » 3” » 4 GB » 100 kaps » 10 MBps sequential Both running Windows NT™ 7.0? (see below for why)

69 How Do They Talk to Each Other? » Each node has an OS » Each node has local resources: a federation » Each node does not completely trust the others » Nodes use RPC to talk to each other: CORBA? DCOM? IIOP? RMI? One or all of the above » Huge leverage in high-level interfaces » Same old distributed-system story. [Figure: two nodes, each running applications over RPC/streams/datagrams over VIAL/VIPL, joined by wire(s).]

70 What if Networking Was as Cheap as Disk IO? » TCP/IP: Unix/NT, 100% of a CPU to move 40 MBps » Disk: Unix/NT, 8% of a CPU to move 40 MBps. Why the difference? The host bus adapter does the SCSI packetizing, checksumming, flow control, and DMA; for TCP/IP the host itself does the packetizing, checksumming, and flow control, with small buffers.

71 Technology Drivers: The Promise of SAN/VIA: 10x in 2 years Today: » wires are 10 MBps (100 Mbps Ethernet) » ~20 MBps TCP/IP saturates 2 CPUs » round-trip latency is ~300 µs. In the lab: » wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, … » fast user-level communication: TCP/IP ~100 MBps at 10% of each processor, round-trip latency 15 µs.

72 SAN: Standard Interconnect » LAN faster than memory bus? » 1 GBps links in the lab » 100$ port cost soon » Port is computer » RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ? [Figure, bandwidth ladder: Gbps Ethernet 110 MBps; PCI 70 MBps; UW SCSI 40 MBps; FW SCSI 20 MBps; SCSI 5 MBps.]

73 Technology Drivers: Plug & Play Software RPC is standardizing (DCOM, IIOP, HTTP) » Gives huge TOOL LEVERAGE » Solves the hard problems for you: naming, security, directory service, operations, … Commoditized programming environments » FreeBSD, Linux, Solaris, … + tools » NetWare + tools » WinCE, WinNT, … + tools » JavaOS + tools Apps gravitate to data: a general-purpose OS on the controller runs apps.

74 Disk = Node » has magnetic storage (100 GB?) » has processor & DRAM » has SAN attachment » has execution environment. [Figure: the disk's software stack: applications, services (DBMS), file system and RPC, SAN driver and disk driver, OS kernel.]

75 Penny Sort Ground Rules How much can you sort for a penny? » Hardware and software cost » Depreciated over 3 years » A 1 M$ system gets about 1 second » A 1 K$ system gets about 1,000 seconds » Time (seconds) = 946,080 / SystemPrice ($); worked check below. Input and output are disk resident. Input is » 100-byte records (random data) » key is the first 10 bytes. Must create the output file and fill it with a sorted version of the input file. Daytona (product) and Indy (special) categories.
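A worked check of the budget: a penny is the fraction 0.01/price of the system, applied to 3 years = 94,608,000 seconds of depreciation, hence the 946,080 constant:

```cpp
#include <cstdio>

int main() {
    const double threeYears = 3 * 365 * 24 * 3600.0;   // 94,608,000 seconds
    double prices[] = {1e6, 1e3};                      // the slide's two examples
    for (double p : prices) {
        double budget = 0.01 / p * threeYears;         // seconds per penny
        printf("$%.0f system: %.0f second(s) per penny\n", p, budget);
    }
    // Prints ~1 second for the $1M system, ~946 (~1,000) for the $1K system.
}
```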

76 PennySort Hardware » 266 MHz Intel PPro » 64 MB SDRAM (10 ns) » Dual Fujitsu DMA 3.2 GB EIDE Software » NT workstation 4.3 » NT 5 sort Performance » sorts 15 M 100-byte records (~1.5 GB) » disk to disk » elapsed time 820 sec, CPU time = 404 sec

77 Cluster Sort Conceptual Model » Multiple data sources » Multiple data destinations » Multiple nodes » Disks → sockets → disk → disk. [Figure: three nodes each read local records with keys A, B, and C mixed; each record travels over a socket to the node owning its letter, so one node ends with AAA, another with BBB, another with CCC.] (Partition sketch below.)
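A minimal sketch of the partition step the figure implies: a record's key picks its destination node, so after the shuffle each node holds exactly one key range and can sort locally. The node count and leading-byte scheme are illustrative assumptions:

```cpp
#include <cstdint>
#include <cstdio>

const int NODES = 3;   // assumed cluster size

// Sort-benchmark records carry a 10-byte key; with uniform random keys the
// leading byte is enough to pick a destination (range partitioning).
int destination(const uint8_t* key) {
    return key[0] * NODES / 256;
}

int main() {
    uint8_t keys[3][10] = {{0x10}, {0x80}, {0xF0}};
    for (auto& k : keys)
        printf("key %02x -> node %d\n", k[0], destination(k));
    // In the real flow each record is sent over a socket to its node,
    // which sorts its partition locally and writes it to disk.
}
```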

78 Cluster Install & Execute If this is to be used by others, it must be: » easy to install » easy to execute Installations of distributed systems take time and can be tedious (AM2, GluGuard) Parallel remote execution is non-trivial (GLUnix, LSF) How do we keep this "simple" and "built-in" to NTClusterSort?

79 Remote Install Add a registry entry on each remote node: » RegConnectRegistry() » RegCreateKeyEx() (Sketch below.)
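A hedged sketch of that call sequence in Win32 (error handling mostly elided; the machine name \\node17 and the key path are made-up examples):

```cpp
#include <windows.h>
#include <stdio.h>

int main() {
    HKEY remoteRoot, newKey;

    // Open HKEY_LOCAL_MACHINE on the remote node.
    if (RegConnectRegistry(TEXT("\\\\node17"), HKEY_LOCAL_MACHINE,
                           &remoteRoot) != ERROR_SUCCESS)
        return 1;

    // Create (or open) the application's key on that node.
    DWORD disposition;
    RegCreateKeyEx(remoteRoot, TEXT("SOFTWARE\\NTClusterSort"), 0, NULL,
                   REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                   &newKey, &disposition);

    RegCloseKey(newKey);
    RegCloseKey(remoteRoot);
    printf("registry entry added on remote node\n");
    return 0;
}
```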

80 Cluster Execution Setup: » fill in a MULTI_QI struct and a COSERVERINFO struct » CoCreateInstanceEx() » retrieve the remote object's interface pointer from the MULTI_QI struct » invoke methods as usual, e.g. Sort() (Sketch below.)
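A hedged sketch of that activation sequence. CoCreateInstanceEx, COSERVERINFO, and MULTI_QI are the real COM API; CLSID_Sorter, IID_ISorter, and the ISorter interface are hypothetical stand-ins for the NTClusterSort component, whose real names the slide does not give (activation will fail at runtime without a real registered component):

```cpp
// Assumes CoInitializeEx() was already called on this thread.
#include <objbase.h>

// Hypothetical placeholder GUIDs for the sorter component.
static const CLSID CLSID_Sorter =
    {0x11111111, 0x2222, 0x3333, {0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB}};
static const IID IID_ISorter =
    {0x11111111, 0x2222, 0x3334, {0x44,0x55,0x66,0x77,0x88,0x99,0xAA,0xBB}};

struct ISorter : public IUnknown {
    virtual HRESULT STDMETHODCALLTYPE Sort() = 0;   // hypothetical method
};

HRESULT RunRemoteSort(const wchar_t* node) {
    COSERVERINFO server = {};
    server.pwszName = const_cast<wchar_t*>(node);   // target machine

    MULTI_QI qi = {};
    qi.pIID = &IID_ISorter;                         // interface we want back

    // Activate the component on the remote node; DCOM marshals the interface.
    HRESULT hr = CoCreateInstanceEx(CLSID_Sorter, NULL, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (FAILED(hr) || FAILED(qi.hr)) return FAILED(hr) ? hr : qi.hr;

    ISorter* sorter = static_cast<ISorter*>(qi.pItf);
    hr = sorter->Sort();                            // invoke as usual
    sorter->Release();
    return hr;
}
```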

81 Public Service Gordon Bell » Computer Museum » Vanguard Group » Edits column in CACM Jim Gray » National Research Council Computer Science and Telecommunications Board » Presidential Advisory Committee on NGI-IT-HPPC. Tom Barclay » USGS and Russian cooperative research

82 A Plug for CoRR CoRR = Computing Research Repository » All computer science literature in cyberspace » Endorsed by CACM » Reviewed & refereed e-journals will evolve from this archive » PLEASE submit articles » Copyright issues are still problematic

83 BARC: Microsoft Bay Area Research Center Tom Barclay Tyler Beam (U VA)* Gordon Bell Joe Barrera Josh Coates (UCB)* Jim Gemmell Jim Gray Steve Lucco Erik Riedel (CMU)* Eve Schooler (Cal Tech) Don Slutz Catherine Van Ingen (NTFS)*