Slide 1: BARC — Microsoft Bay Area Research Center
6/20/97 (PUBLIC VERSION)
Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen
http://www.research.Microsoft.com/barc/

Slide 2: Telepresence — the next killer app
Space shifting:
» Reduce travel
Time shifting:
» Retrospective
» Offer condensations
» Just-in-time meetings
Example: ACM 97
» NetShow and Web site
» More web visitors than attendees
People-to-people communication

Slide 3: Scalable Reliable Multicast (Jim Gemmell, telepresent)
Outline:
» What reliable multicast is & why it is hard to scale
» Fcast file transfer
» ECSRM
» Layered telepresentations

Slide 4: Multiple Unicast
» Sender must repeat
» Link sees repeats

Slide 5: IP Multicast
» Pruned broadcast
» Unreliable

Slide 6: Reliable Multicast
Difficult to scale:
» Sender state explosion
» Message implosion
State: receiver 1, receiver 2, … receiver n

Slide 7: Receiver-Reliable
Receiver's job to NACK
State: receiver 1, receiver 2, … receiver n

Slide 8: SRM Approaches
» Hierarchy / local recovery
» Forward Error Correction (FEC)
» Suppression
» *HYBRID*
Fcast is FEC only; ECSRM is suppression + FEC

Slide 9: (n,k) linear block encoding
[Diagram: k original packets (1 … k) are encoded into n packets, where the first k are copies of the originals and packets k+1 … n carry redundancy; the decoder can take any k of the n to recover the originals.]
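The encode/decode picture above can be sketched in its simplest instance, a (k+1, k) XOR parity code: one redundant packet lets any k of the k+1 transmitted packets reconstruct the originals. This is an illustrative sketch, not the Fcast implementation; real deployments use larger (n,k) codes.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets):
    """(k+1, k) encoding: the first k packets are the originals,
    packet k+1 is the XOR parity of all of them."""
    return packets + [reduce(xor, packets)]

def decode(received, lost_index, k):
    """Recover the one missing original from any k of the k+1 packets.
    `received` is the encoded list with the lost packet set to None."""
    present = [p for p in received if p is not None]
    assert len(present) >= k, "need any k packets to decode"
    if lost_index >= k:              # only the parity was lost: originals intact
        return received[:k]
    missing = reduce(xor, present)   # XOR of survivors = the missing original
    originals = received[:k]
    originals[lost_index] = missing
    return originals

# Example: k = 3 originals, lose packet 1, recover from the other 3.
data = [b"abcd", b"efgh", b"ijkl"]
coded = encode(data)
coded[1] = None
assert decode(coded, 1, 3) == data
```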

Slide 10: Fcast
» File transfer protocol
» FEC-only
» Files transmitted in parallel

Slide 11: Fcast send order
[Diagram: each file is a row of n encoded packets (1 … k, k+1 … n); packets are sent column by column across the files, and a receiver needs any k packets from each row.]
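The column-by-column schedule above can be sketched as a generator (illustrative only; the real Fcast schedule may differ): because transmission rotates across files, any k distinct packets per file suffice to decode.

```python
def fcast_send_order(num_files, n):
    """Yield (file, packet_index) round-robin across files, one
    'column' at a time: packet 1 of every file, then packet 2, ...
    A receiver that hears any k distinct packets of a file can decode it,
    so lossy receivers and late joiners simply listen longer."""
    for col in range(1, n + 1):          # columns 1..n of the (n,k) code
        for f in range(1, num_files + 1):
            yield (f, col)

order = list(fcast_send_order(2, 4))
# packet 1 of file 1 and file 2 go out before packet 2 of either file
assert order[:4] == [(1, 1), (2, 1), (1, 2), (2, 2)]
```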

Slide 12: Fcast reception time
[Diagram: the sender continuously cycles through files + FEC packets; a low-loss receiver joins and leaves quickly, while a high-loss receiver stays joined longer to collect enough packets.]

Slide 13: Fcast demo

Slide 14: ECSRM — Erasure Correcting SRM
Combines:
» suppression
» erasure correction

Slide 15: Suppression
» Delay a NACK or repair in the hope that someone else will do it
» NACKs are multicast
» After NACKing, reset the timer and wait for the repair
» If you hear a NACK that you were waiting to send, reset your timer as if you had sent it
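The rules on this slide amount to a small timer state machine per missing packet; a hedged sketch (class name and delays are illustrative, not the ECSRM code):

```python
import random

class NackSuppressor:
    """Sketch of SRM-style NACK suppression for one missing packet.
    Each receiver waits a random delay before NACKing; hearing another
    receiver's NACK for the same packet resets the timer as if we had
    sent the NACK ourselves."""
    def __init__(self, max_delay=1.0):
        self.max_delay = max_delay
        self.deadline = None

    def packet_lost(self, now):
        # schedule our NACK at a random time to desynchronize receivers
        self.deadline = now + random.uniform(0, self.max_delay)

    def heard_nack(self, now):
        # someone else NACKed first: suppress ours, wait for the repair
        self.deadline = now + self.max_delay

    def should_nack(self, now):
        return self.deadline is not None and now >= self.deadline

r = NackSuppressor()
r.packet_lost(now=0.0)
r.heard_nack(now=0.1)              # another receiver's NACK suppresses ours
assert not r.should_nack(now=0.5)  # our timer was pushed back
```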

Slide 16: ECSRM — adding FEC to suppression
» Assign each packet to an EC group of size k
» NACK: (group, # missing)
» NACK of (g, c) suppresses all (g, x ≤ c)
» Don't re-send originals; send EC packets using (n,k) encoding
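The group-NACK suppression rule can be stated as a predicate (an illustrative sketch; the function and tuple layout are assumptions, not the protocol's wire format):

```python
def suppresses(heard, pending):
    """A heard NACK (group, count) suppresses a pending NACK for the
    same group reporting the same or fewer missing packets: the repair
    it triggers (count erasure-correcting packets) also covers us."""
    g1, c1 = heard
    g2, c2 = pending
    return g1 == g2 and c2 <= c1

assert suppresses((3, 2), (3, 1))      # same group, fewer losses: suppressed
assert not suppresses((3, 2), (3, 5))  # we lost more: must still NACK
assert not suppresses((3, 2), (4, 1))  # different group: not suppressed
```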

Slide 17: Example: EC group size (k) = 7
[Diagram: rows of packets 1–7 per group; seven receivers each lose a different packet, marked X.]

Slide 18: …example
NACK: Group 1, 1 lost
NACKs suppressed

Slide 19: …example
Erasure-correcting packet

Slide 20: …example: summary
Normal suppression needs:
» 7 NACKs, 7 repairs
ECSRM requires:
» 1 NACK, 1 repair
Large group: each packet lost by someone
» Without FEC, 1/2 of traffic is repairs
» With ECSRM, only 1/8 of traffic is repairs
» NACK traffic reduced by a factor of 7
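The fractions on this slide can be checked arithmetically (assuming group size k = 7 and, as stated above, that every packet is lost by at least one receiver):

```python
k = 7                        # EC group size; every packet lost by someone

# Without FEC: each of the k originals is retransmitted once,
# so repairs make up half of all traffic.
no_fec = k / (k + k)
assert no_fec == 0.5

# With ECSRM: one erasure-correcting packet repairs the whole group.
ecsrm = 1 / (k + 1)
assert ecsrm == 0.125        # 1/8 of traffic is repairs

# NACKs drop from k per group to 1 per group: a factor of k = 7.
assert k / 1 == 7
```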

Slide 21: Simulation: 112 receivers

Slide 22: Simulation: 112 receivers

Slide 23: Multicast PowerPoint Add-in
[Diagram: slides, annotations, and control information travel over ECSRM; the slide master travels over Fcast.]

Slide 24: Multicast PowerPoint — Late Joiners
Viewers joining late don't impact others with session-persistent data (slide master)
[Timeline: the Fcast and ECSRM sessions run in parallel; a viewer joining late still picks up the slide master from the ongoing Fcast.]

Slide 25: Future Work
» Adding hierarchy (e.g. PGM by Cisco)
» Do we need 2 protocols?

Slide 26: RAGS: RAndom SQL test Generator
Microsoft spends a LOT of money on testing (60% of development, according to one source).
Idea: test SQL by
» generating random correct queries
» executing queries against the database
» comparing results with SQL 6.5, DB2, Oracle, Sybase
Being used in SQL 7.0 testing:
» 375 unique bugs found (since 2/97)
» Very productive test tool

28 27 Sample Rags Generated Statement SELECT TOP 3 T1.royalty, T0.price, "Apr 15 1996 10:23AM", T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11, "Apr 15 1996 10:23AM", T0.advance, ( "= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange, ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) This Statement yields an error: SQLState=37000, Error=8623 Internal Query Processor Error: Query processor could not produce a query plan.

Slide 28: Automation
Simpler statement with the same error:
SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1)
Control statement attributes:
» complexity, kind, depth, ...
Multi-user stress tests:
» test concurrency, allocation, recovery

Slide 29: One 4-Vendor Rags Test — 3 of them vs. us
» 60 K SELECTs on MSS, DB2, Oracle, Sybase
» 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements
» Examined 10 suspects, filed 4 bugs (one duplicate); assume 3/10 are new
» Note: this is the SQL Server Beta 2 product
» Quality rising fast (and RAGS sees that)

Slide 30: RAGS Next Steps
Done:
» Patents
» Papers
» Talks
» Tech transfer to development
Next steps:
» Extend to other parts of SQL and T-SQL
» Crawl the config space (look for new holes)
» Apply ideas to other domains (OLE DB)

Slide 31: Scaleup — Big Database
Build a 1 TB SQL Server database:
» Show off Windows NT and SQL Server scalability
» Stress test the product
Data must be:
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
Loaded:
» 1.1 M place names from Encarta World Atlas
» 1 M sq km from USGS (1-meter resolution)
» 2 M sq km from the Russian Space Agency (2 m)
Will be on the web (world's largest atlas)
Sell images with Commerce Server
USGS CRDA: 3 TB more coming

Slide 32: The System
» DEC AlphaServer 8400
» 324 StorageWorks drives (2.9 TB)
» SQL Server 7.0
» USGS 1-meter data (30% of US)
» Russian space data (SPIN-2): two-meter-resolution images

Slide 33: World's Largest PC!
» 324 disks (2.9 terabytes)
» 8 x 440 MHz Alpha CPUs
» 10 GB DRAM

Slide 34: 1 TB Database Server
» AlphaServer 8400 4x400
» 10 GB RAM
» 324 StorageWorks disks
» 10-drive tape library (STC TimberWolf DLT7000)
» SPIN-2 hardware

Slide 35: Terra-Server Web Site Software
[Architecture diagram: a web client (browser with HTML, or Java viewer) reaches over the Internet an image delivery application (Microsoft Automap ActiveX Server on Internet Information Server 4.0); image provider site(s) run Microsoft Site Server EE and IIS 4.0; the Terra-Server DB holds the Automap Server, Sphinx (SQL Server), and Terra-Server stored procedures; the Terra-Server web site runs Active Server Pages, MTS, and the IIS 4.0 image server.]

Slide 36: System Management & Maintenance
Backup and recovery:
» STC 9717 tape robot
» Legato NetWorker
» Sphinx backup/restore utility
» Clocked at 80 MBps (peak) (~200 GB/hr)
SQL Server Enterprise Manager:
» DBA maintenance
» SQL Performance Monitor

Slide 37: TerraServer File Group Layout
» Convert 324 disks to 28 RAID5 sets plus 28 spare drives
» Make 4 NT volumes (RAID 50), 595 GB per volume (E:, F:, G:, H:)
» Build 30 20-GB files on each volume
» DB is a file group of 120 files

Slide 38: Demo
http://TerraWeb2

Slide 39: Technical Challenge
Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server).
Key idea / solution — a geo-spatial search key:
» Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
» Z-transform X & Y into a single Z value; build a B-tree on Z
» Adjacent images stored next to each other
Search method:
» Latitude and longitude => X, Y, then Z
» Select on matching Z value
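The Z-transform described here is bit interleaving (a Morton code); a sketch under that assumption, with illustrative grid math (the exact TerraServer key layout may differ):

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of grid coordinates x and y into a single
    Morton (Z-order) key. Nearby (x, y) cells tend to get nearby Z
    values, so a plain B-tree on Z clusters adjacent images together."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits in even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits in odd positions
    return z

def grid_cell(lon: float, lat: float):
    """Map longitude/latitude to the grid described on the slide:
    1/48th degree in X (longitude), 1/96th degree in Y (latitude)."""
    return int((lon + 180) * 48), int((lat + 90) * 96)

# Lookup: lat/lon -> (X, Y) -> Z, then select on the matching Z value.
x, y = grid_cell(-122.4, 37.8)               # San Francisco area
assert z_value(x, y) == z_value(*grid_cell(-122.4, 37.8))
assert z_value(1, 0) == 1 and z_value(0, 1) == 2 and z_value(1, 1) == 3
```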

Slide 40: Now What?
New since S-Day (Scalability Day):
» More data: 4.8 TB USGS DOQ, .5 TB Russian
» Bigger server: Alpha 8400, 8 proc, 8 GB RAM, 2.5 TB disk
» Improved application: better UI, uses ASP, commerce app
» Cut images and load (Sept & Feb)
» Built commerce app for USGS & SPIN-2
» Launch at Fed Scalability Day
» SQL 7 Beta 3 (6/24/98)
Next:
» Operate on the Internet for 18 months
» Add more data (double)
» Working with Sloan Digital Sky Survey: 40 TB of images, 3 TB of objects

Slide 41: NT Clusters (Wolfpack)
» Scale DOWN to PDA: Windows CE
» Scale UP an SMP: TerraServer
» Scale OUT with a cluster of machines
Single-system image:
» Naming
» Protection/security
» Management/load balance
Fault tolerance:
» Wolfpack
» Hot-pluggable hardware & software

Slide 42: Symmetric Virtual Server Failover Example
[Diagram: Server 1 and Server 2 each run a virtual server (web site + database, with their files); when Server 1 fails, its web site and database fail over to Server 2, which then hosts both.]

Slide 43: Clusters & BackOffice
Research: instant & transparent failover
Making BackOffice PlugNPlay on Wolfpack:
» Automatic install & configure
Virtual Server concept makes it easy:
» simpler management concept
» simpler context/state migration
» transparent to applications
SQL 6.5E & 7.0 failover; MSMQ (queues), MTS (transactions)

Slide 44: 1.2 B tpd
» 1 B tpd ran for 24 hrs
» Out-of-the-box software
» Off-the-shelf hardware
» AMAZING!
» Sized for 30 days
» Linear growth
» 5 micro-dollars per transaction

Slide 45: The Memory Hierarchy — Measured & Modeled Sequential IO
Where is the bottleneck? How does it scale with:
» SMP, RAID, new interconnects
Goals: balanced bottlenecks
» Low overhead
» Scale to many processors (10s)
» Scale to many disks (100s)
[Diagram: the IO path from the app address space through the file cache, memory bus, PCI, adapter, SCSI, and disk controller.]

Slide 46: Sequential IO — your mileage will vary
Measured hardware & software, out of the box:
» Advertised UW SCSI: 40 MB/sec
» Actual disk transfer: 35 MB/sec read, 23 MB/sec write
» 64 KB request (NTFS): 29 MB/sec read, 17 MB/sec write
» Single disk media: 9 MB/sec
» 2 KB request (SQL Server): 3 MB/sec
Find software fixes.
1/2-power point: 50% of peak power out of the box

Slide 47: PAP vs RAP
PAP (peak advertised performance) vs. RAP (real application performance)
Goal: RAP = PAP / 2 (the half-power point)
http://research.Microsoft.com/BARC/Sequential_IO/

Slide 48: Disk Bottleneck Analysis
NTFS read/write with 12 disks, 4 SCSI adapters, 2 PCI buses (not fully measured: we had only one PCI bus available; the 2nd one was internal)
» ~120 MBps unbuffered read
» ~80 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write
[Diagram labels: memory read/write ~150 MBps; PCI ~70 MBps; adapter ~30 MBps; PCI adapter 120 MBps.]
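The bottleneck analysis amounts to taking the minimum aggregate bandwidth along the IO path; a sketch using this slide's numbers (the stage model below is an illustrative reading of the diagram, not a measurement):

```python
# Aggregate bandwidths (MB/s) along the unbuffered-read path, taken
# from the slide's diagram labels. The pipeline runs no faster than
# its slowest stage, so the bottleneck is the minimum over the stages.
stages = {
    "memory":        150,      # memory read/write
    "pci_buses":     2 * 70,   # 2 PCI buses at ~70 MB/s each
    "scsi_adapters": 4 * 30,   # 4 SCSI adapters at ~30 MB/s each
}
bottleneck = min(stages, key=stages.get)
# The 4 adapters cap the path at ~120 MB/s, matching the measured
# ~120 MBps unbuffered read on the slide.
assert bottleneck == "scsi_adapters" and stages[bottleneck] == 120
```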

Slide 49: Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
How much can you sort for a penny?
» Hardware and software cost, depreciated over 3 years
» A $1M system gets about 1 second; a $1K system gets about 1,000 seconds
» Time (seconds) = 946,080 / SystemPrice ($)
Input and output are disk-resident
Input is:
» 100-byte records (random data)
» key is the first 10 bytes
Must create an output file and fill it with the sorted version of the input file
Daytona (product) and Indy (special) categories
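The time budget follows from the ground rules: a penny's share of the system price, applied to 3 years of depreciated uptime, works out to 946,080 / price seconds, so a cheaper system gets more time (a sketch of the arithmetic, not the official benchmark harness):

```python
SECONDS_3_YEARS = 3 * 365 * 24 * 3600      # 94,608,000 s depreciation period
PENNY = 0.01

def penny_seconds(system_price: float) -> float:
    """Time a system may spend sorting for one penny: the penny's
    fraction of the price, applied to 3 years of uptime."""
    return SECONDS_3_YEARS * (PENNY / system_price)   # = 946,080 / price

assert round(SECONDS_3_YEARS * PENNY) == 946_080
assert round(penny_seconds(1_000_000), 2) == 0.95    # $1M: about 1 second
assert round(penny_seconds(1_000)) == 946            # $1K: about 1,000 seconds
```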

Slide 50: PennySort
Hardware:
» 266 MHz Intel P2
» 64 MB SDRAM (10 ns)
» Dual UDMA 3.2 GB EIDE disks
Software:
» NT Workstation 4.3
» NT 5 sort
Performance:
» sort 15 M 100-byte records (~1.5 GB)
» disk to disk
» elapsed time 820 sec, CPU time = 404 sec or 100 sec

Slide 51: Networking — BIG!! Changes coming!
Technology:
» 10 GBps buses now
» 1 Gbps links now
» 1 Tbps links in 10 years
» Fast & cheap switches
Standard interconnects:
» processor-processor
» processor-device (= processor)
Deregulation WILL work someday
CHALLENGE:
» reduce the software tax on messages
» today: 30 K instructions + 10 instructions/byte
» goal: 1 K instructions + .01 instructions/byte
Best bet:
» SAN/VIA
» Smart NICs
» Special protocol
» User-level net IO (like disk)

Slide 52: What if Networking Was as Cheap as Disk IO?
TCP/IP: Unix/NT, 100% CPU @ 40 MBps
Disk: Unix/NT, 8% CPU @ 40 MBps
Why the difference?
» The host bus adapter does the SCSI packetizing, checksums, … flow control, DMA
» The host does the TCP/IP packetizing, checksums, … flow control, with small buffers

Slide 53: The Promise of SAN/VIA — 10x better in 2 years
http://www.viarch.org/
Today:
» wires are 10 MBps (100 Mbps Ethernet)
» ~20 MBps TCP/IP saturates 2 CPUs
» round-trip latency is ~300 us
In university & NT labs:
» wires are 1 Gbps (Ethernet, ServerNet, …)
» fast user-level communication: TCP/IP ~100 MBps at 10% of each processor
» round-trip latency is 15 us

Slide 54: Public Service
Gordon Bell:
» Computer Museum
» Vanguard Group
» Edits a column in CACM
Jim Gray:
» National Research Council Computer Science and Telecommunications Board
» Presidential Advisory Committee on NGI-IT-HPPC
» Edits journals & conferences
Tom Barclay:
» USGS and Russian cooperative research

Slide 55: BARC — Microsoft Bay Area Research Center
Tom Barclay, Gordon Bell, Joe Barrera, Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen
http://www.research.Microsoft.com/barc/

