
1 BARC: Microsoft Bay Area Research Center. Tom Barclay, Tyler Beam (U VA)*, Gordon Bell, Joe Barrera, Josh Coates (UCB)*, Jim Gemmell, Jim Gray, Steve Lucco, Erik Riedel (CMU)*, Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen (NTFS)*. http://www.research.Microsoft.com/barc/

2 Overview. Telepresence » goals » prototypes. RAGS: automating software testing. Scaleable systems » goals » prototypes. Misc.

3 3

4 Telepresence: The Next Killer App. Space shifting » reduce travel. Time shifting » retrospectives » condensations » just-in-time meetings. Example: ACM 97 (http://research.Microsoft.com/barc/acm97/) » NetShow and Web site » more web visitors than attendees.

5 What We Are Doing. Scalable Reliable Multicast (SRM) » used by wb (whiteboard) on the MBone » NACK suppression (backoff) » N² message traffic to set up. Error Correcting SRM (ECSRM) » do not resend lost packets » send error correction in addition to the regular data, or send error correction in response to a NACK » one EC packet repairs any of k lost packets » improved scaleability (millions of subscribers).

6 Telepresence Prototypes. PowerCast: multicast PowerPoint » streaming - pre-sends the next anticipated slide » sends slides and voice rather than a talking head and voice » uses ECSRM for reliable multicast » 1000s of receivers can join and leave at any time » no server needed; no pre-load of slides » cooperating with NetShow. FileCast: multicast file transfer » erasure-encodes all packets » receivers only need to receive as many bytes as the length of the file » multicast IE to solve the Midnight-Madness problem. NT SRM: reliable IP multicast library for NT. Spatialized Teleconference Station » texture-map faces onto spheres » space-map voices.

7 IP Multicast: a pruned broadcast to a multicast address. It is unreliable; making it reliable would require Ack/NACK, which creates a state or NACK implosion problem. [Diagram: routers forwarding from one sender to the interested receivers, pruning branches where no one is interested.]

8 (n,k) encoding: encode the k original packets into n packets (the first k are copies of the originals); decode by taking any k of the n to recover the originals. [Diagram: original packets 1..k, encoded packets 1..n, decode from any k.]
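
A minimal sketch of the simplest such code, assuming n = k+1 with a single XOR parity packet (so any one lost packet can be rebuilt); the real ECSRM/Fcast work used general (n,k) codes that tolerate up to n-k losses, so this is illustrative only:

    // Simplest (n,k) erasure code: n = k+1, the extra packet is the XOR of the
    // k originals, and any single lost packet is rebuilt from the survivors.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    using Packet = std::vector<uint8_t>;

    Packet make_parity(const std::vector<Packet>& data) {
        Packet parity(data[0].size(), 0);
        for (const Packet& p : data)
            for (size_t i = 0; i < p.size(); ++i) parity[i] ^= p[i];
        return parity;
    }

    // XOR the parity with the surviving originals to recover the missing one.
    Packet repair(const std::vector<Packet>& survivors, const Packet& parity) {
        Packet missing = parity;
        for (const Packet& p : survivors)
            for (size_t i = 0; i < p.size(); ++i) missing[i] ^= p[i];
        return missing;
    }

    int main() {
        std::vector<Packet> data = { {'a','a'}, {'b','b'}, {'c','c'} };  // k = 3
        Packet parity = make_parity(data);                               // n = 4
        Packet recovered = repair({data[0], data[2]}, parity);           // lost "bb"
        std::cout << recovered[0] << recovered[1] << "\n";               // prints bb
    }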

9 Fcast: file transfer protocol; FEC-only; files transmitted in parallel.

10 Fcast send order. [Diagram: each file (File 1, File 2, ...) is a row of packets 1..k, k+1..n; an X marks a lost packet; the receiver needs any k packets from each row.]

11 11 ECSRM - Erasure Correcting SRM Combines: » suppression » erasure correction

12 Suppression: delay a NACK or repair in the hope that someone else will do it. NACKs are multicast. After NACKing, reset the timer and wait for the repair. If you hear a NACK that you were waiting to send, reset your timer as if you had sent it yourself.
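
A minimal single-threaded sketch of that suppression timer; the Receiver structure and the uniform backoff window are assumptions for illustration, not the BARC implementation. It only shows that whichever receiver's timer fires first sends the one multicast NACK and the others reset as if they had sent it:

    // Each receiver that detects a loss schedules its NACK at a random future
    // time; hearing someone else's NACK for the same data suppresses its own.
    #include <algorithm>
    #include <cstdio>
    #include <random>

    struct Receiver {
        double nack_time = -1;                      // when we plan to multicast a NACK
        std::mt19937 rng{std::random_device{}()};

        void on_loss_detected(double now) {         // schedule NACK with random backoff
            std::uniform_real_distribution<double> backoff(0.0, 0.5);
            nack_time = now + backoff(rng);
        }
        void on_nack_heard(double now) {            // someone else NACKed first:
            on_loss_detected(now);                  // reset as if we had sent it and
        }                                           // wait again for the repair
        bool ready(double now) const { return nack_time >= 0 && now >= nack_time; }
    };

    int main() {
        Receiver a, b;
        a.on_loss_detected(0.0);
        b.on_loss_detected(0.0);
        double t = std::min(a.nack_time, b.nack_time);   // first timer to fire
        if (a.ready(t)) b.on_nack_heard(t); else a.on_nack_heard(t);
        std::printf("one NACK multicast at t=%.3f; the other receiver suppressed\n", t);
    }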

13 ECSRM - adding FEC to suppression. Assign each packet to an EC group of size k. NACK: (group, # missing). A NACK of (g,c) suppresses all (g, x ≤ c). Don't re-send originals; send EC packets using the (n,k) encoding.

14 ECSRM: combine suppression and erasure correction. Assign each packet to an EC group of size k. NACK: (group, # missing). A NACK of (g,c) suppresses all (g, x ≤ c). Don't re-send originals; send EC packets using the (n,k) encoding. In the example below, one NACK and one EC packet fix all the errors. [Diagram: seven receivers each missing a different one of packets 1..7 in the same EC group; a single NACK and a single EC packet repair them all.]

15 Multicast PowerPoint: a PowerPoint add-in. [Diagram: slides, annotations, and control information are carried over ECSRM; the slide master is carried by Fcast.]

16 Multicast PowerPoint - Late Joiners: viewers joining late don't impact others, even with session-persistent data (the slide master). [Timeline diagram: join and leave events against the Fcast and ECSRM streams.]

17 17 Future Work Adding hierarchy (e.g. PGM by Cisco) Do we need 2 protocols?

18 18 Spatialized Teleconferences Map heads to “Eggs” Project voices in stereo using “nose vector”

19 19 RAGS: RAndom SQL test Generator Microsoft spends a LOT of money on testing. (60% of development according to one source). Idea: test SQL by » generating random correct queries » executing queries against database » compare results with SQL 6.5, DB2, Oracle, Sybase Being used in SQL 7.0 testing. » 375 unique bugs found (since 2/97) » Very productive test tool
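
A toy sketch of the generate-and-compare idea described above, using an invented three-rule grammar over a pubs-style titles table; the real RAGS grammar covered far more of SQL, and each generated statement was executed on all four engines with the answers compared:

    // Generate random, syntactically valid SELECT statements of bounded depth.
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    std::mt19937 rng(42);
    int pick(int n) { return std::uniform_int_distribution<int>(0, n - 1)(rng); }

    std::string expr(int depth) {                   // random scalar expression
        std::vector<std::string> cols = {"t.price", "t.advance", "t.royalty"};
        if (depth == 0 || pick(2) == 0) return cols[pick((int)cols.size())];
        std::vector<std::string> ops = {"+", "-", "*"};
        return "(" + expr(depth - 1) + " " + ops[pick((int)ops.size())] + " " +
               expr(depth - 1) + ")";
    }

    std::string random_select(int depth) {
        return "SELECT TOP " + std::to_string(1 + pick(9)) + " " + expr(depth) +
               " FROM titles t WHERE " + expr(depth) + " > " + std::to_string(pick(100));
    }

    int main() {
        // Each statement would be run on SQL Server, DB2, Oracle, and Sybase;
        // differing answers or an engine error flag a "suspect" for a human.
        for (int i = 0; i < 3; ++i) std::cout << random_select(2) << ";\n";
    }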

20 20 Sample Rags Generated Statement SELECT TOP 3 T1.royalty, T0.price, "Apr 15 1996 10:23AM", T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11, "Apr 15 1996 10:23AM", T0.advance, ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance, (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs", T2.ord_date, AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange, ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) This Statement yields an error: SQLState=37000, Error=8623 Internal Query Processor Error: Query processor could not produce a query plan.

21 Automation. A simpler statement with the same error: SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1). Control statement attributes » complexity, kind, depth, ... Multi-user stress tests » test concurrency, allocation, recovery.

22 One 4-Vendor RAGS Test (3 of them vs. us): 60K SELECTs on MSS, DB2, Oracle, Sybase; 17 SQL Server Beta 2 suspects, i.e. 1 suspect per 3,350 statements. Examined 10 suspects, filed 4 bugs (one a duplicate), so assume 3/10 are new. Note: this is the SQL Server Beta 2; product quality is rising fast (and RAGS sees that).

23 RAGS Next Steps. Done » patents, papers, talks » tech transfer to development: SQL 7 (over 400 bugs), FoxPro, OLE DB. Next steps » make it even more automatic » extend to other parts of SQL and T-SQL » "crawl" the config space (look for new holes) » apply the ideas to other domains (OLE DB).

24 Scale Up and Scale Out: grow up with SMP (4xP6 is now standard); grow out with a cluster of PCs (the cluster has inexpensive parts). [Diagram: personal system, departmental server, SMP super server, and a cluster.]

25 Billions Of Clients Every device will be “intelligent” Doors, rooms, cars… Computing will be ubiquitous

26 Billions of Clients Need Millions of Servers. All clients are networked to servers » may be nomadic or on-demand » fast clients want faster servers. Servers provide » shared data » control » coordination » communication. [Diagram: mobile and fixed clients connected to servers and superservers.]

27 Thesis: many little beat few big (the "smoking, hairy golf ball"). How to connect the many little parts? How to program the many little parts? Fault tolerance? [Chart: the $1 million mainframe, $100K mini, and $10K micro shrink through 14", 9", 5.25", 3.5", 2.5", and 1.8" form factors toward a 1 mm³ pico processor with 1M SPECmarks / 1 TFLOP; the hierarchy runs from 10 picosecond and 10 nanosecond RAM (10⁶ clocks to bulk RAM: event horizon on chip, VM reincarnated, multiprogram cache, on-chip SMP) through 10 microsecond RAM and 10 millisecond disc to a 10 second tape archive, with capacities from 1 MB to 100 TB.]

28 28 Microsoft TerraServer: Scaleup to Big Databases Build a 1 TB SQL Server database Data must be » 1 TB » Unencumbered » Interesting to everyone everywhere » And not offensive to anyone anywhere Loaded » 1.5 M place names from Encarta World Atlas » 3 M Sq Km from USGS (1 meter resolution) » 1 M Sq Km from Russian Space agency (2 m) On the web (world’s largest atlas) Sell images with commerce server.

29 Microsoft TerraServer Background. The Earth is 500 tera-square-meters (tm²) » the USA is 10 tm² » 100 tm² of land lies between 70ºN and 70ºS. We have pictures of 6% of it » 3 tm² from USGS » 2 tm² from the Russian Space Agency. Compress 5:1 (JPEG) to 1.5 TB; slice into 10 KB chunks; store the chunks in the DB. Navigate with » the Encarta™ Atlas globe gazetteer » StreetsPlus™ in the USA. Image sizes: 40x60 km² jump image, 20x30 km² browse image, 10x15 km² thumbnail, 1.8x1.2 km² tile. Someday » multi-spectral images » of everywhere » once a day / hour.

30 USGS Digital Ortho Quads (DOQ): U.S. Geological Survey; 4 terabytes; most of the data not yet published. Based on a CRADA » Microsoft TerraServer makes the data available. USGS "DOQ": 1x1 meter resolution, 4 TB, continental US, new data coming.

31 Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is the worldwide distributor): 1.5-meter geo-rectified imagery of (almost) anywhere; almost equal-area projection; de-classified satellite photos (from 200 km); more data coming (1 m); selling imagery on the Internet; putting 2 tm² onto Microsoft TerraServer.

32 Live on the Internet 6/24/98, for 18 months: one billion served. New since S-Day » more data: 4.8 TB USGS DOQ, 0.8 TB Russian » bigger server: Alpha 8400, 8 processors, 8 GB RAM, 2.9 TB disk » improved application: better UI, uses ASP, commerce app » 6 TB more to load. 60% US, 4% world. 30 M web hits per day peak, 8 M hits/day average (1 M page views/day). 1 billion pages served! 99.95% available; no NT failures, 30-minute SQL restart.

33 http://www.TerraServer.Microsoft.com/ Demo: SPIN-2, Microsoft BackOffice.

34 Demo: navigate by coverage map to the White House; download an image; buy imagery from USGS; navigate by name to Venice; buy a SPIN-2 image & Kodak photo; pop out to the Expedia street map of Venice. Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2).

35 1 TB Database Server: AlphaServer 8400 4x400, 10 GB RAM, 324 StorageWorks disks, 10-drive tape library (STC TimberWolf DLT7000). [Hardware diagram: a DS3 link and 100 Mbps Ethernet switch connect site servers, the Internet Map Server, SPIN-2, and the web servers to the AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM) with enterprise storage arrays of 48 x 9 GB drives and an STK 9710 DLT tape library.]

36 The Microsoft TerraServer Hardware: Compaq AlphaServer 8400 » 8 x 400 MHz Alpha CPUs » 10 GB DRAM » 324 x 9.2 GB StorageWorks disks (3 TB raw, 2.4 TB of RAID5) » STK 9710 tape robot (4 TB) » Windows NT 4 EE, SQL Server 7.0.

37 TerraServer Web Site Software. [Architecture diagram: a web client (browser with HTML or the Java viewer) reaches the site over the Internet; the TerraServer web site runs Internet Information Server 4.0 with Active Server Pages and MTS, Microsoft Site Server EE, the image delivery application, the Microsoft Automap ActiveX server, TerraServer stored procedures, and SQL Server 7 holding the TerraServer DB; image provider site(s) run Internet Information Server 4.0 image servers.]

38 System Management & Maintenance. Backup and recovery » STK 9710 tape robot » Legato NetWorker™ » SQL Server 7 Backup & Restore » clocked at 80 MBps peak (~200 GB/hr). SQL Server Enterprise Manager » DBA maintenance » SQL Performance Monitor.

39 Microsoft TerraServer File Group Layout: convert the 324 disks into 28 RAID5 sets plus 28 spare drives; make 4 WinNT volumes (RAID 50) of 595 GB each (E:, F:, G:, H:); build 30 x 20 GB files on each volume; the DB is a file group of 120 files.

40 Image Delivery and Load: incremental load of 4 more TB in the next 18 months. [Pipeline diagram: DLT tapes are read ("tar", NT Backup) on AlphaServer 4100 staging machines, cut by ImgCutter into a Drop'N'DoJob queue, and loaded by LoadMgr over a 100 Mbit Ethernet switch into the AlphaServer 8400 database and its enterprise storage arrays (shelves of 108 x 9.1 GB drives, 60 x 4.3 GB drives, STK DLT tape library). Load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place ...]

41 Technical Challenge / Key Idea. Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server). Solution, a geo-spatial search key » divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y) » Z-transform X & Y into a single Z value and build a B-tree on Z » adjacent images are stored next to each other. Search method » latitude and longitude => X, Y, then Z » select on the matching Z value.
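
A hedged sketch of that Z (Morton) key: grid a latitude/longitude into integer cells and interleave the cell coordinates' bits into one value an ordinary B-tree can index. Only the 1/48-degree by 1/96-degree cell size comes from the slide; the cell origin, bit widths, and the sample query are assumptions.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    uint64_t interleave(uint32_t x, uint32_t y) {     // Z-transform (Morton order)
        uint64_t z = 0;
        for (int b = 0; b < 32; ++b) {
            z |= (uint64_t)((x >> b) & 1) << (2 * b);
            z |= (uint64_t)((y >> b) & 1) << (2 * b + 1);
        }
        return z;
    }

    uint64_t geo_key(double lonDeg, double latDeg) {
        uint32_t x = (uint32_t)std::floor((lonDeg + 180.0) * 48.0);  // 1/48-degree cells
        uint32_t y = (uint32_t)std::floor((latDeg + 90.0) * 96.0);   // 1/96-degree cells
        return interleave(x, y);
    }

    int main() {
        // Nearby tiles get nearby Z values, so a plain "SELECT ... WHERE ZKey = @z"
        // (or a small Z range) on a B-tree over ZKey finds a tile and its neighbors.
        std::printf("Washington, DC tile key: %llu\n",
                    (unsigned long long)geo_key(-77.04, 38.90));
    }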

42 Some Tera-Byte Databases. The Web: 1 TB of HTML. TerraServer: 1 TB of images. Several other 1 TB (file) servers. Hotmail: 7 TB of email. Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked. EOS/DIS (a picture of the planet each week) » 15 PB by 2007. Federal clearing house, images of checks » 15 PB by 2006 (7-year history). Nuclear Stockpile Stewardship Program » 10 exabytes (???!!).

43 [Chart on the Kilo-to-Yotta scale, placing: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, and "All Information!".]

44 Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html): soon everything can be recorded and kept; most data will never be seen by humans; the precious resource is human attention; auto-summarization and auto-search will be key enabling technologies.

45 Scalability: 1 billion transactions; 1.8 million mail messages; 4 terabytes of data; 100 million web hits. Scale up: to large SMP nodes. Scale out: to clusters of SMP nodes.

46 1.2 B tpd: 1 billion transactions per day ran for 24 hrs; out-of-the-box software; off-the-shelf hardware; AMAZING! Sized for 30 days; linear growth; 5 micro-dollars per transaction.

47 How Much Is 1 Billion Tpd? 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute). AT&T » 185 million calls per peak day (worldwide). Visa: ~20 million tpd » 400 million customers » 250K ATMs worldwide » 7 billion transactions (card + cheque) in 1994. New York Stock Exchange » 600,000 tpd. Bank of America » 20 million tpd checks cleared (more than any other bank) » 1.4 million tpd ATM transactions. Worldwide airline reservations: 250 Mtpd.

48 48 NCSA Super Cluster National Center for Supercomputing Applications University of Illinois @ Urbana 512 Pentium II cpus, 2,096 disks, SAN Compaq + HP +Myricom + WindowsNT A Super Computer for 3M$ Classic Fortran/MPI programming DCOM programming model http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

49 49 NT Clusters (Wolfpack) Scale DOWN to PDA: WindowsCE Scale UP an SMP: TerraServer Scale OUT with a cluster of machines Single-system image » Naming » Protection/security » Management/load balance Fault tolerance » “Wolfpack” Hot pluggable hardware & software

50 Symmetric Virtual Server Failover Example. [Diagram: a browser reaches a "web site" virtual server and a "database" virtual server; Server 1 and Server 2 each normally host one of them, with the web site files and database files reachable by both, and on failure the surviving server hosts both.]

51 51 Clusters & BackOffice Research: Instant & Transparent failover Making BackOffice PlugNPlay on Wolfpack » Automatic install & configure Virtual Server concept makes it easy » simpler management concept » simpler context/state migration » transparent to applications SQL 6.5E & 7.0 Failover MSMQ (queues), MTS (transactions).

52 Storage Latency: How Far Away is the Data?
Registers: 1 clock (my head, 1 min)
On-chip cache: 2 clocks (this room)
On-board cache: 10 clocks (this campus, 10 min)
Memory: 100 clocks (Sacramento, 1.5 hr)
Disk: 10^6 clocks (Pluto, 2 years)
Tape/optical robot: 10^9 clocks (Andromeda, 2,000 years)

53 The Memory Hierarchy: Measuring & Modeling Sequential IO. Where is the bottleneck? How does it scale with » SMP, RAID, new interconnects? Goals: balanced bottlenecks; low overhead; scale to many processors (10s); scale to many disks (100s). [Diagram: application address space and file cache over the memory bus, PCI, adapter, SCSI, controller, and disks.]

54 Sequential IO (your mileage will vary):
Advertised UW SCSI: 40 MB/sec
Actual disk transfer: 35 MB/sec read, 23 MB/sec write
64 KB request (NTFS): 29 MB/sec read, 17 MB/sec write
Single disk media: 9 MB/sec
2 KB request (SQL Server): 3 MB/sec
Measuring hardware & software, looking for software fixes; aiming for "out of the box". The 1/2 power point: 50% of peak power "out of the box".

55 PAP (peak advertised performance) vs RAP (real application performance). Goal: RAP = PAP / 2 (the half-power point). [Datapath diagram: the application sees 7.2 MB/s at every stage, against peak rates of 422 MBps on the system bus, 133 MBps on PCI, 40 MBps on SCSI, and 10-15 MBps at the disk.]

56 The Best Case: Temp File, No IO. Reading and writing a temp file hits the file system cache, and the program uses a small (in-CPU-cache) buffer, so write/read time is just bus move time (3x better than copy). Paradox: the fastest way to move data is to write it and then read it. This hardware is limited to 150 MBps per processor.

57 Bottleneck Analysis (drawn to linear scale): theoretical bus bandwidth 422 MBps (66 MHz x 64 bits); memory read/write ~150 MBps; memcopy ~50 MBps; disk R/W ~9 MBps.

58 3 Stripes and You're Out! 3 disks can saturate the adapter; similar story with UltraWide; CPU time goes down with request size; Ftdisk striping is cheap.

59 Parallel SCSI Busses Help: a second SCSI bus nearly doubles read and WCE (write-cache-enabled) throughput (~2x); write needs deeper buffers; the experiment is unbuffered (3-deep + WCE).

60 60 File System Buffering & Stripes (UltraWide Drives) FS buffering helps small reads FS buffered writes peak at 12MBps 3-deep async helps Write peaks at 20 MBps Read peaks at 30 MBps
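
A hedged Win32 sketch of the kind of measurement behind these numbers: read a file sequentially with the file system cache bypassed (FILE_FLAG_NO_BUFFERING) and three overlapped requests outstanding ("3-deep async"). The file name is invented, and the request size and depth are experiment parameters rather than values from the slides.

    #include <windows.h>
    #include <stdio.h>

    int main() {
        const DWORD kRequest = 64 * 1024;             // 64 KB requests
        const int   kDepth   = 3;                     // 3-deep async
        HANDLE h = CreateFileA("E:\\bigfile.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        BYTE* buf[kDepth];
        OVERLAPPED ov[kDepth] = {};
        ULONGLONG offset = 0, done = 0, total = 0;
        DWORD t0 = GetTickCount();

        for (int i = 0; i < kDepth; ++i) {            // prime the pipeline
            buf[i] = (BYTE*)VirtualAlloc(NULL, kRequest, MEM_COMMIT, PAGE_READWRITE);
            ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            ov[i].Offset = (DWORD)offset; ov[i].OffsetHigh = (DWORD)(offset >> 32);
            ReadFile(h, buf[i], kRequest, NULL, &ov[i]);
            offset += kRequest;
        }
        for (;;) {                                    // wait for the oldest, then reissue it
            int i = (int)(done % kDepth);
            DWORD got = 0;
            if (!GetOverlappedResult(h, &ov[i], &got, TRUE) || got == 0) break;
            total += got; ++done;
            ov[i].Offset = (DWORD)offset; ov[i].OffsetHigh = (DWORD)(offset >> 32);
            if (!ReadFile(h, buf[i], kRequest, NULL, &ov[i]) && GetLastError() != ERROR_IO_PENDING)
                break;
            offset += kRequest;
        }
        DWORD ms = GetTickCount() - t0;
        printf("read %.1f MB at %.1f MB/s\n", total / 1e6,
               ms ? (total / 1e6) / (ms / 1e3) : 0.0);
        CloseHandle(h);
        return 0;
    }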

61 PAP vs RAP: reads are easy, writes are hard; async write can match WCE. [Datapath diagram, advertised vs measured: system bus 422 MBps vs 142 MBps; PCI 133 MBps vs 72 MBps; SCSI 40 MBps vs 31 MBps; disks 10-15 MBps vs 9 MBps.]

62 Bottleneck Analysis: NTFS read/write with 9 disks, 2 SCSI buses, 1 PCI: ~65 MBps unbuffered read; ~43 MBps unbuffered write; ~40 MBps buffered read; ~35 MBps buffered write. [Diagram: memory read/write ~150 MBps, PCI ~70 MBps, adapters ~30 MBps each, 70 MBps total through the adapters.]

63 Hypothetical Bottleneck Analysis: NTFS read/write with 12 disks, 4 SCSI buses, 2 PCI (not measured; we had only one PCI bus available, the 2nd one was "internal"): ~120 MBps unbuffered read; ~80 MBps unbuffered write; ~40 MBps buffered read; ~35 MBps buffered write. [Diagram: memory read/write ~150 MBps, each PCI ~70 MBps, each adapter ~30 MBps, 120 MBps total.]

64 Computers shrink to a point. Disks are on track for 100x in 10 years: a 2 TB 3.5" drive; shrunk to 1" that is 200 GB. Disk replaces tape? The disk is a super computer!

65 65 Data Gravity Processing Moves to Transducers Move Processing to data sources Move to where the power (and sheet metal) is Processor in » Modem » Display » Microphones (speech recognition) & cameras (vision) » Storage: Data storage and analysis

66 It's Already True of Printers: peripheral = CyberBrick. You buy a printer and you get » several network interfaces » a PostScript engine (cpu, memory, software, a spooler soon) » and… a print engine.

67 67 Remember Your Roots

68 68 Year 2002 Disks Big disk (10 $/GB) » 3” » 100 GB » 150 kaps (k accesses per second) » 20 MBps sequential Small disk (20 $/GB) » 3” » 4 GB » 100 kaps » 10 MBps sequential Both running Windows NT™ 7.0? (see below for why)

69 How Do They Talk to Each Other? Each node has an OS and local resources: a federation. Each node does not completely trust the others. Nodes use RPC to talk to each other » CORBA? DCOM? IIOP? RMI? » one or all of the above. Huge leverage in high-level interfaces. Same old distributed-system story. [Diagram: two stacks over the wire(s): applications over RPC?, streams, and datagrams on the host, and the same stack over VIAL/VIPL.]

70 70 What if Networking Was as Cheap As Disk IO? TCP/IP » Unix/NT 100% cpu @ 40MBps Disk » Unix/NT 8% cpu @ 40MBps Why the Difference? Host Bus Adapter does SCSI packetizing, checksum,… flow control DMA Host does TCP/IP packetizing, checksum,… flow control small buffers

71 71 Technology Drivers: The Promise of SAN/VIA:10x in 2 years http://www.ViArch.org/ Today: » wires are 10 MBps (100 Mbps Ethernet) » ~20 MBps tcp/ip saturates 2 cpus » round-trip latency is ~300 us In the lab » Wires are 10x faster Myrinet, Gbps Ethernet, ServerNet,… » Fast user-level communication tcp/ip ~ 100 MBps 10% of each processor round-trip latency is 15 us

72 SAN: Standard Interconnect. Gbps Ethernet: 110 MBps; PCI: 70 MBps; UW SCSI: 40 MBps; FW SCSI: 20 MBps; SCSI: 5 MBps. LAN faster than the memory bus? 1 GBps links in the lab; $100 port cost soon; the port is a computer. [RIP: FDDI, ATM, SCI, SCSI, FC, ...?]

73 Technology Drivers: Plug & Play Software. RPC is standardizing (DCOM, IIOP, HTTP) » gives huge TOOL LEVERAGE » solves the hard problems for you: naming, security, directory service, operations, ... Commoditized programming environments » FreeBSD, Linux, Solaris, … + tools » NetWare + tools » WinCE, WinNT, … + tools » JavaOS + tools. Apps gravitate to data. A general-purpose OS on the controller runs apps.

74 Disk = Node: has magnetic storage (100 GB?); has a processor & DRAM; has a SAN attachment; has an execution environment. [Software stack: applications, DBMS and services, file system and RPC..., SAN driver and disk driver, OS kernel.]

75 Penny Sort Ground Rules (http://research.microsoft.com/barc/SortBenchmark): how much can you sort for a penny? » Hardware and software cost » depreciated over 3 years » a $1M system gets about 1 second, a $1K system gets about 1,000 seconds » Time (seconds) = 946,080 / SystemPrice ($). Input and output are disk resident. Input is » 100-byte records (random data) » the key is the first 10 bytes. Must create the output file and fill it with a sorted version of the input file. Daytona (product) and Indy (special) categories.
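
A one-line worked check of that budget, as a sketch: 3 years of seconds, divided by 100 pennies per dollar, divided by the system price.

    // Penny-sort time budget: seconds of the machine's life that one penny buys.
    #include <cstdio>

    double penny_seconds(double systemPriceDollars) {
        const double threeYears = 3.0 * 365.0 * 24 * 3600;   // 94,608,000 seconds
        return threeYears / 100.0 / systemPriceDollars;      // = 946,080 / price
    }

    int main() {
        std::printf("$1,000,000 system: %.2f s per penny\n", penny_seconds(1e6)); // ~0.95 s
        std::printf("$1,000 system:     %.0f s per penny\n", penny_seconds(1e3)); // ~946 s
    }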

76 76 PennySort Hardware » 266 Mhz Intel PPro » 64 MB SDRAM (10ns) » Dual Fujitsu DMA 3.2GB EIDE Software » NT workstation 4.3 » NT 5 sort Performance » sort 15 M 100-byte records (~1.5 GB) » Disk to disk » elapsed time 820 sec cpu time = 404 sec

77 Cluster Sort Conceptual Model: multiple data sources, multiple data destinations, multiple nodes; disks -> sockets -> disk. [Diagram: each node scans its local AAA/BBB/CCC records and ships them over sockets so that all A records land on node A, all B records on node B, and all C records on node C; each node then writes its partition to disk.]
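
A minimal sketch of the record-routing step the picture implies, assuming records are range-partitioned to nodes by the first byte of their key; a production sort would pick the range boundaries by sampling, and all names here are illustrative.

    // Decide which node owns each record so every node ends up with one key range.
    #include <cstdio>

    int destination(const char* record) {           // route by the first key byte
        unsigned char first = (unsigned char)record[0];
        if (first < 'B') return 0;                  // 'A'-range records -> node 0
        if (first < 'C') return 1;                  // 'B'-range records -> node 1
        return 2;                                   // the rest           -> node 2
    }

    int main() {
        const char* records[] = {"AAA...", "CCC...", "BBB..."};
        for (const char* r : records)
            std::printf("%.3s -> node %d\n", r, destination(r));
        // Each node scans its local disk, sends record r over a socket to node
        // destination(r), and sorts and writes the records it receives.
    }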

78 78 Cluster Install & Execute If this is to be used by others, it must be: Easy to install Easy to execute Installations of distributed systems take time and can be tedious. (AM2, GluGuard) Parallel Remote execution is non-trivial. (GLUnix, LSF) How do we keep this “simple” and “built-in” to NTClusterSort ?

79 Remote Install: add a Registry entry to each remote node using RegConnectRegistry() and RegCreateKeyEx().
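
A hedged Win32 sketch of that step: connect to a node's registry and create a key. The machine name and key path are invented for illustration; only the two API calls come from the slide.

    #include <windows.h>
    #include <stdio.h>

    // Add one registry key on a remote node (e.g. \\node17) under HKEY_LOCAL_MACHINE.
    bool AddRemoteKey(const char* machine, const char* subKey) {
        HKEY hRemote = NULL, hKey = NULL;
        if (RegConnectRegistryA(machine, HKEY_LOCAL_MACHINE, &hRemote) != ERROR_SUCCESS)
            return false;
        DWORD disposition = 0;
        LONG rc = RegCreateKeyExA(hRemote, subKey, 0, NULL, REG_OPTION_NON_VOLATILE,
                                  KEY_WRITE, NULL, &hKey, &disposition);
        if (hKey) RegCloseKey(hKey);
        RegCloseKey(hRemote);
        return rc == ERROR_SUCCESS;
    }

    int main() {
        // The real installer would loop over every node in the cluster.
        if (AddRemoteKey("\\\\node17", "SOFTWARE\\NTClusterSort"))
            printf("installed on node17\n");
        return 0;
    }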

80 Cluster Execution: set up a COSERVERINFO struct (the remote machine) and a MULTI_QI struct (the requested interfaces); call CoCreateInstanceEx(); retrieve the remote object handle from the MULTI_QI struct; invoke methods as usual, e.g. HANDLE Sort().
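
A hedged DCOM sketch of that sequence: activate the sort object on a remote node with CoCreateInstanceEx and call it through the returned interface pointer. CLSID_NTClusterSort, ISort, and its Sort() method are invented stand-ins; only the COSERVERINFO / MULTI_QI / CoCreateInstanceEx sequence comes from the slide.

    #define _WIN32_DCOM
    #include <windows.h>
    #include <objbase.h>

    // Hypothetical interface the remote sort server would expose.
    struct ISort : public IUnknown {
        virtual HRESULT STDMETHODCALLTYPE Sort(const wchar_t* inFile,
                                               const wchar_t* outFile) = 0;
    };
    // Hypothetical GUIDs; a real component registers its own on every node.
    static const CLSID CLSID_NTClusterSort = {};
    static const IID   IID_ISort           = {};

    HRESULT RunRemoteSort(const wchar_t* node) {
        COSERVERINFO server = {};                  // which machine to activate on
        server.pwszName = const_cast<wchar_t*>(node);

        MULTI_QI qi = {};                          // which interface(s) we want back
        qi.pIID = &IID_ISort;

        HRESULT hr = CoCreateInstanceEx(CLSID_NTClusterSort, NULL, CLSCTX_REMOTE_SERVER,
                                        &server, 1, &qi);
        if (FAILED(hr) || FAILED(qi.hr)) return FAILED(hr) ? hr : qi.hr;

        ISort* sort = static_cast<ISort*>(qi.pItf);   // the remote object handle
        hr = sort->Sort(L"\\\\node17\\in.dat", L"\\\\node17\\out.dat");
        sort->Release();                              // invoke methods as usual, then release
        return hr;
    }

    int main() {
        CoInitializeEx(NULL, COINIT_MULTITHREADED);
        RunRemoteSort(L"node17");                     // hypothetical node name
        CoUninitialize();
        return 0;
    }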

81 81 Public Service Gordon Bell » Computer Museum » Vanguard Group » Edits column in CACM Jim Gray » National Research Council Computer Science and Telecommunications Board » Presidential Advisory Committee on NGI-IT-HPPC. Tom Barclay » USGS and Russian cooperative research

82 82 A Plug for CoRR CoRR = Computer Science Research Repository All computer science literature in cyberspace http://xxx.lanl.gov/archive/cs Endorsed by CACM Reviewed & Refereed EJournals will evolve from this archive PLEASE submit articles Copyright issues are still problematic

83 BARC: Microsoft Bay Area Research Center. Tom Barclay, Tyler Beam (U VA)*, Gordon Bell, Joe Barrera, Josh Coates (UCB)*, Jim Gemmell, Jim Gray, Steve Lucco, Erik Riedel (CMU)*, Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen (NTFS)*. http://www.research.Microsoft.com/barc/

