1 1 Distributed Computing Economics. Jim Gray, Microsoft Research, gray@microsoft.com. Slides at: http://research.microsoft.com/~gray/talks Talk at IEEE Computer Society: 11 December 2003, Palo Alto, CA.

2 2 Two (?) Talks: Distributed Computing Economics; What I'm doing –Online Science – World Wide Telescope (with Alex Szalay, JHU) –TerraServer Brick Design/Deploy/Operate (with Tom Barclay) –Paxos Commit (with Leslie Lamport) –Spatial Data done relationally (with Alex Szalay, JHU)

3 3 Distributed Computing Economics Why is Seti@Home a great idea? Why is Napster a great deal? Why is the Computational Grid uneconomic? When does Computing on Demand work? What is the right level of abstraction? Is the Access Grid the real killer app? Based on: Distributed Computing Economics, Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24 http://research.microsoft.com/research/pubs/view.aspx?tr_id=655

4 4 Computing is Free Computers cost 1k$ (if you shop) (yes, there are 1μ$ to 1M$ computers, but..). So 1 cpu day == 1$ (computers last 3 years). If you pay the phone bill, Internet bandwidth costs 50 … 500$/Mbps/month (not including routers and management). So 1 GB costs 1$ to send and 1$ to receive. Caveat: All numbers rounded to nearest factor of 3.

5 5 Why is Seti@Home a Good Deal? Send 300 KB: costs 3e-4$. User computes for ½ day: benefit 5e-1$. ROI: 1500:1. Finance guys will tell you that is a good Return On Investment (ROI).
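The arithmetic behind this slide can be checked with a short sketch. It uses only the rounded prices from the "Computing is Free" slide (a 1k$ computer lasting 3 years, 1$ per GB sent); the function names are made up for illustration, and the benefit is valued at the rounded 1$ per cpu day.

```python
# Rule-of-thumb prices from the talk (all rounded to the nearest factor of 3).
COMPUTER_PRICE_USD = 1_000.0   # "Computers cost 1k$"
LIFETIME_DAYS = 3 * 365        # "computers last 3 years", about 1k days
USD_PER_GB_SENT = 1.0          # "1GB costs 1$ to send"


def cpu_day_cost() -> float:
    """Capital cost of one day of cpu time, depreciated over the lifetime."""
    return COMPUTER_PRICE_USD / LIFETIME_DAYS


def send_cost(num_bytes: float) -> float:
    """Cost in dollars of sending num_bytes over the Internet."""
    return num_bytes / 1e9 * USD_PER_GB_SENT


def roi(benefit_usd: float, cost_usd: float) -> float:
    """Return on investment expressed as a benefit:cost ratio."""
    return benefit_usd / cost_usd


work_unit_cost = send_cost(300e3)   # 300 KB work unit
benefit = 0.5 * 1.0                 # half a cpu day at the rounded 1 $/day
print(f"cpu day costs about {cpu_day_cost():.2f} $")
print(f"sending the work unit costs {work_unit_cost:.0e} $")
print(f"ROI is roughly {roi(benefit, work_unit_cost):.0f}:1")
```

The ratio comes out a little above the slide's 1500:1, which is within the talk's stated rounding.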

6 6 Seti@Home: The world's most powerful computer. 61 TF is the sum of the top 4 of the Top 500. 61 TF is 9x the number 2 system. 61 TF is more than the sum of systems 2..10. Seti@Home totals, http://setiathome.ssl.berkeley.edu/totals.html, 20 May 2003:
                              Total                         Last 24 Hours
  Users                       4,493,731                     1,900
  Results received            886 M                         1.4 M
  Total CPU time              1.5 M years                   1,514 years
  Floating Point Operations   3 E+21 ops (3 zetta ops)      5 E+18 FLOPS/day (61.3 TeraFLOPs)

7 7 Why was Napster a Good Deal? Send 5 MB costs 5e-3$ ½ a penny per song Both sender and receiver can afford it. Same logic powers web sites (Yahoo!...): –1e-3$/page view advertising revenue –1e-5$/page view cost of serving web page –100:1 ROI

8 8 Computing is Free!!! This is not a Surprise Everywhere I go I see Beowulfs Clusters of PCs (or high-slice-price micros) True: I have not visited Earth Simulator, but… Google, MSN, Hotmail, Yahoo, NCBI, FNAL, Los Alamos, Cal Tech, MIT, Berkeley, NARO, Smithsonian, Wisconsin, eBay, Amazon.com, Schwab, Citicorp, Beijing, Cern, BaBar, NCSA, Cornell, UCSD, and of course NASA and Cal Tech

9 9 The Cost of Computing: Computers are NOT free! IBM, HP, Dell make billions. Capital Cost of a TpcC system is mostly storage and storage software (database). IBM 32 cpu, 512 GB ram, 2,500 disks, 43 TB (680,613 tpmC @ 11.13 $/tpmC, available 11/08/03) http://www.tpc.org/results/individual_results/IBM/IBMp690es_05092003.pdf A 7.5M$ super-computer. Total Data Center Cost: 40% capital & facilities, 60% staff (includes app development)

10 10 Computing Equivalents 1 $ buys: –1 day of cpu time –4 GB (fast) ram for a day –1 GB of network bandwidth –1 GB of disk storage for 3 years –10 M database accesses –10 TB of disk access (sequential) –10 TB of LAN bandwidth (bulk) –10 KWhrs == 4 days of computer time. Depreciating over 3 years, and there are about 1k days in 3 years.

11 11 Some consequences Beowulf networking is 10,000x cheaper than WAN networking: factors of 10^5 matter. The cheapest and fastest way to move Terabytes cross country is sneakernet: 24 hours ~ 92 Mbps ~ 12 MB/s, 50$ shipping vs 1,000$ wan cost. Sending 10PB CERN data via network is silly: buy disk bricks in Geneva, fill them, ship them. TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange, Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan vandenBerg, Microsoft Technical Report, May 2002, MSR-TR-2002-54 http://research.microsoft.com/research/pubs/view.aspx?tr_id=569

12 12 How Do You Move A Terabyte?
  Context      Speed Mbps   Rent $/month   $/Mbps   $/TB Sent   Time/TB
  Home phone   0.04         40             1,000    3,086       6 years
  Home DSL     0.6          50             117      360         5 months
  T1           1.5          1,200          800      2,469       2 months
  T3           43           28,000         651      2,010       2 days
  OC3          155          49,000         316      976         14 hours
  100 Mbps     100                                              1 day
  Gbps         1000                                             2.2 hours
  OC 192       9600         1,920,000      200      617         14 minutes
Source: TeraScale Sneakernet, Microsoft Research, Gray et al.
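The Time/TB figures follow from the link speed alone. A minimal sketch, assuming 1 TB = 1e12 bytes and a fully utilized link; the helper names are illustrative and not from the paper.

```python
def seconds_per_terabyte(mbps: float) -> float:
    """Seconds needed to push 1 TB (1e12 bytes) through a link of mbps megabits/s."""
    bits = 1e12 * 8
    return bits / (mbps * 1e6)


def effective_mbps(num_bytes: float, seconds: float) -> float:
    """Bandwidth achieved when num_bytes arrive after the given number of seconds."""
    return num_bytes * 8 / seconds / 1e6


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that fits."""
    units = (("years", 365 * 86400), ("months", 30 * 86400),
             ("days", 86400), ("hours", 3600), ("minutes", 60))
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.0f} seconds"


for context, mbps in [("Home phone", 0.04), ("Home DSL", 0.6), ("T1", 1.5),
                      ("T3", 43), ("OC3", 155), ("100 Mbps", 100),
                      ("Gbps", 1000), ("OC 192", 9600)]:
    print(f"{context:<11} {humanize(seconds_per_terabyte(mbps))}")

# Sneakernet: a terabyte of disks delivered overnight.
print(f"1 TB in 24 hours is about {effective_mbps(1e12, 86400):.0f} Mbps")
```

The last line reproduces the "24 hours ~ 92 Mbps" sneakernet figure from the previous slide.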

13 13 Computational Grid Economics To the extent that the computational grid is like Seti@Home or ZetaNet or Folding@home or… it is a great thing. To the extent that the computational grid is MPI or data analysis, it fails on economic grounds: move the programs to the data, not the data to the programs. The Internet is NOT the cpu backplane. An alternate reality: Nearly free networking –Telcos go bankrupt and price=cost=0 –Taxpayers pay your phone bill so price=0 and telcos get BIG government subsidy

14 14 When to Export a Task IF instruction density > 100,000 instructions/byte AND remote computer is free (costs you nothing) THEN ROI > 0 ELSE ROI < 0 Finance guys will tell you negative ROI is bad
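The export rule is simple enough to write down directly. The sketch below is an illustration only: the function name is invented, and the 1e9 instructions per second used to size the Seti@Home work unit is an assumption, not a number from the talk.

```python
def should_export(instructions: float, bytes_moved: float,
                  remote_is_free: bool, threshold: float = 100_000) -> bool:
    """Apply the slide's rule: export only if the task does more than
    `threshold` instructions per byte shipped AND the remote cpu costs nothing."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    density = instructions / bytes_moved
    return remote_is_free and density > threshold


# Seti@Home work unit: 300 KB shipped, half a day of computing.
ASSUMED_INSTRUCTIONS_PER_SECOND = 1e9          # assumption for illustration
seti_instructions = 0.5 * 86400 * ASSUMED_INSTRUCTIONS_PER_SECOND

print(should_export(seti_instructions, 300e3, remote_is_free=True))   # cpu-heavy, tiny input
print(should_export(1e6, 1e6, remote_is_free=True))                   # one instruction per byte
```

Under that assumption the Seti@Home task runs at over 1e8 instructions per byte, far above the threshold, while a data-scanning task at about one instruction per byte is far below it.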

15 15 Computing on Demand Was called outsourcing or service bureaus in my youth. CSC and IBM did it. It is not a new way of doing things: think payroll. Payroll is standard outsource. Now Hotmail, Salesforce.com, Oracle.com,…. Works for standard apps. COD works for commoditized services. Airlines outsource reservations. Banks outsource ATMs. But Amazon, Amex, Wal-Mart, eTrade, eBay... can't outsource their core competence.

16 16 What's the right abstraction level for Internet Scale Distributed Computing? Disk block? No, too low (Ø). File? No, too low (XDrive). Database? No, too low (SkyServer). RPC? Yes: –TerraService, MapQuest,…. –Blast search –Google search Application? Yes, even better: –Send/Get eMail –Expedia –Amazon –Portals that federate astronomy archives (http://skyQuery.Net/) –Web Services (.NET, EJB, OGSA) give plumbing for rpc/App abstraction level.

17 17 Access Grid Q: What comes after the telephone? A: eMail? A: Instant messaging? Both seem retro: text & emoticons. Access Grid could revolutionize human communication. But, it needs a new idea.

18 18 Distributed Computing Economics Why is Seti@Home a great idea? Why is Napster a great deal? Why is the Computational Grid uneconomic? When does computing on demand work? What is the right level of abstraction? Is the Access Grid the real killer app? Based on: Distributed Computing Economics, Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24 http://research.microsoft.com/research/pubs/view.aspx?tr_id=655

19 19 Two (?) Talks: Distributed Computing Economics; What I'm doing –Online Science – World Wide Telescope (with Alex Szalay, JHU) –TerraServer Brick Design/Deploy/Operate (with Tom Barclay) –Paxos Commit (with Leslie Lamport) –Spatial Data done relationally (with Alex Szalay, JHU)

20 20 Online Science The World Wide Telescope I have been looking for a distributed DB for most of my career. I think I found one! (sort of).

21 21 The Evolution of Science Observational Science –Scientist gathers data by direct observation –Scientist analyzes Information Analytical Science –Scientist builds analytical model –Makes predictions. Computational Science –Simulate analytical model –Validate model and makes predictions

22 22 Computational Science Evolves Historically, Computational Science = simulation. Science - Informatics Information Exploration Science Information captured by instruments Or Information generated by simulator –Processed by software –Placed in a database / files –Scientist analyzes database / files New emphasis on informatics: –Capturing, –Organizing, –Summarizing, –Analyzing, –Visualizing Largely driven by observational science, but also needed by simulations. Too soon to say if comp-X and X-info will unify or compete. BaBar, Stanford Space Telescope P&E Gene Sequencer From http://www.genome.uci.edu/

23 23 Both comp-X and X-info Generating Petabytes Comp-Science generating an Information avalanche comp-chem, comp-physics, comp-bio, comp-astro, comp-linguistics, comp-music, comp-entertainment, comp-warfare Science-Info dealing with Information avalanche bio-info, astro-info, text-info,

24 24 Information Avalanche Stories Turbulence: 100 TB simulation then mine the Information BaBar: Grows 1TB/day 2/3 simulation Information 1/3 observational Information CERN: LHC will generate 1GB/s 10 PB/y VLBA (NRAO) generates 1GB/s today NCBI: only ½ TB but doubling each year very rich dataset. Pixar: 100 TB/Movie

25 25 Astro-Info World Wide Telescope http://www.astro.caltech.edu/nvoconf/ http://www.voforum.org/ Premise: Most data is (or could be) online. Internet is the world's best telescope: –It has data on every part of the sky –In every measured spectral band: optical, x-ray, radio.. –As deep as the best instruments (2 years ago). –It is up when you are up. The seeing is always great (no working at night, no clouds no moons no..). –It's a smart telescope: links objects and data to literature on them.

26 26 Why Astronomy Data? It has no commercial value –No privacy concerns –Can freely share results with others –Great for experimenting with algorithms It is real and well documented –High-dimensional data (with confidence intervals) –Spatial data –Temporal data Many different instruments from many different places and many different times. But, it's the same universe, so comparisons make sense & are interesting. Federation is a goal. There is a lot of it (petabytes). Great sandbox for data mining algorithms –Can share cross company –University researchers Great way to teach both Astronomy and Computational Science. [Figure: the same sky region seen by different surveys; panel labels: IRAS 100, ROSAT ~keV, DSS Optical, 2MASS 2, IRAS 25, NVSS 20cm, WENSS 92cm, GB 6cm.]

27 27 What X-info Needs from us (cs) (not drawn to scale) Science Data & Questions Scientists Database To store data Execute Queries Plumbers Data Mining Algorithms Miners Question & Answer Visualization Tools

28 28 Show Maria's 5-minute PPT: SDSS Image Cutout slide show by Maria A. Nieto-Santisteban of JHU http://www.research.microsoft.com/~Gray/talks/FDIS_ImgCutoutPresentation.ppt

29 29 Data Access is hitting a wall FTP and GREP are not adequate You can GREP 1 MB in a second You can GREP 1 GB in a minute You can GREP 1 TB in 2 days You can GREP 1 PB in 3 years. Oh!, and 1PB ~5,000 disks At some point you need indices to limit search parallel data search and analysis This is where databases can help You can FTP 1 MB in 1 sec You can FTP 1 GB / min (= 1 $/GB) … 2 days and 1K$ … 3 years and 1M$

30 30 Next-Generation Data Analysis Looking for –Needles in haystacks – the Higgs particle –Haystacks: Dark matter, Dark energy Needles are easier than haystacks. Global statistics have poor scaling –Correlation functions are N^2, likelihood techniques N^3 As data and processing grow at same rate, we can only keep up with N log N. A way out? –Discard notion of optimal (data is fuzzy, answers are approximate) –Don't assume infinite computational resources or memory Requires combination of statistics & computer science

31 31 Analysis and Databases Statistical analysis deals with –Creating uniform samples –data filtering & censoring bad data –Assembling subsets –Estimating completeness –Counting and building histograms –Generating Monte-Carlo subsets –Likelihood calculations –Hypothesis testing Traditionally these are performed on files Most of these tasks are much better done inside a database close to the data. Move Mohamed to the mountain, not the mountain to Mohamed.

32 32 Goal: Easy Data Publication & Access Augment FTP with data query: Return intelligent data subsets Make it easy to –Publish: Record structured data –Find: Find data anywhere in the network Get the subset you need –Explore datasets interactively Realistic goal: –Make it as easy as publishing/reading web sites today.

33 33 Federation Data Federations of Web Services Massive datasets live near their owners: –Near the instruments software pipeline –Near the applications –Near data knowledge and curation –Super Computer centers become Super Data Centers Each Archive publishes a web service –Schema: documents the data –Methods on objects (queries) Scientists get personalized extracts Uniform access to multiple Archives –A common global schema

34 34 Web Services: The Key? Web SERVER: –Given a url + parameters –Returns a web page (often dynamic) Web SERVICE: –Given a XML document (soap msg) –Returns an XML document –Tools make this look like an RPC. F(x,y,z) returns (u, v, w) –Distributed objects for the web. –+ naming, discovery, security,.. Internet-scale distributed computing Your program Data In your address space Web Service soap object in xml Your program Web Server http Web page

35 35 The Challenge This has failed several times before– understand why. Develop –Common data models (schemas), –Common interfaces (class/method) Build useful prototypes (nodes and portals) Create a community that uses the prototypes and evolves the prototypes.

36 36 Grid and Web Services Synergy I believe the Grid will be many web services IETF standards Provide –Naming –Authorization / Security / Privacy –Distributed Objects Discovery, Definition, Invocation, Object Model –Higher level services: workflow, transactions, DB,.. Synergy: commercial Internet & Grid tools

37 37 Some Interesting Things We are Doing in SDSS (what's new) SkyServer is done. Now it is 99% perspiration to load 25 TB (many times) and manage it. I'm using it as a research vehicle to explore new DB ideas. Others are cloning it for other surveys. Some doing DB2 & Oracle variants.

38 38 SkyServer Overview (10 min) 10 minute SkyServer tour –Pixel space http://skyserver.sdss.org/en/ –Record space: http://skyserver.sdss.org/en/tools/explore/obj.asp?id=2255030989160697 –Doc space: Ned –Set space: –Web & Query Logs: –Dr1 WebService You can download (thanks to Cathan Cook ) –Data + Database code: –Website: Data Mining the SDSS SkyServer Database MSR-TR-2002-01Data Mining the SDSS SkyServer Database select top 10 * from weblog..weblog where yy = 2003 and mm=7 and dd =25 order by seq desc select top 10 * from weblog..sqlLog order by theTime Desc http://skyserver.pha.jhu.edu/dr1/en/tools/chart/navi.asp http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm

39 39 Cutout Service (10 min) A typical web service Show it Show WSDL Show fixing a bug Rush through code. You can download it. Maria A. Nieto-Santisteban did most of this (Alex and I started it) http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm

40 40 SkyQuery: http://skyquery.net/ Distributed Query tool using a set of web services. Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England). Feasibility study, built in 6 weeks –Tanu Malik (JHU CS grad student) –Tamas Budavari (JHU astro postdoc) –With help from Szalay, Thakar, Gray Implemented in C# and .NET Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2

41 41 2MASS INT SDSS FIRST SkyQuery Portal Image Cutout SkyQuery Structure Each SkyNode publishes –Schema Web Service –Database Web Service Portal is –Plans Query (2 phase) –Integrates answers –Is itself a web service

42 42 2MASS INT SDSS FIRST SkyQuery Portal Image Cutout SkyQuery and The Grid This is a DataGrid It works today It is challenging for OGSA-DAIS (hello world in OGSI-DAI is complex) SkyQuery is being used as a vehicle to explore OGSA and DAIS requirements.

43 43 2MASS INT SDSS FIRST SkyQuery Portal Image Cutout MyDB added to SkyQuery Let users add personal DB 1GB for now. Use it as a workbook. Online and batch queries. Moves analysis to the data Users can cooperate (share MyDB) Still exploring this MyDB

44 44 Two (?) Talks: Distributed Computing Economics; What I'm doing –Online Science – World Wide Telescope (with Alex Szalay, JHU) –TerraServer Brick Design/Deploy/Operate (with Tom Barclay) –Paxos Commit (with Leslie Lamport) –Spatial Data done relationally (with Alex Szalay, JHU)

45 45 SQL x4 SAN SAN TerraServer V4 8 web front end 4x8cpu+4GB DB 18TB triplicate disks Classic SAN (tape not shown) ~2M$ capital expense Works GREAT! 2000…2004 Now replaced by.. WEB WEBx8

46 46 KVM / IP TerraServer V5 Storage Bricks –White-box commodity servers –4tb raw / 2TB Raid1 SATA storage –Dual Hyper-threaded Xeon 2.4ghz, 4GB RAM Partitioned Databases (PACS – partitioned array) –3 Storage Bricks = 1 TerraServer data –Data partitioned across 20 databases –More data & partitions coming Low Cost Availability –4 copies of the data RAID1 SATA Mirroring 2 redundant Bunches –Spare brick to repair failed brick 2N+1 design –Web Application bunch aware Load balances between redundant databases Fails over to surviving database on failure ~100K$ capital expense.

47 47 Two (?) Talks: Distributed Computing Economics; What I'm doing –Online Science – World Wide Telescope (with Alex Szalay, JHU) –TerraServer Brick Design/Deploy/Operate (with Tom Barclay) –Paxos Commit (with Leslie Lamport) –Spatial Data done relationally (with Alex Szalay, JHU)

48 48 Two Phase Commit N Resource Managers (RMs). Want all RMs to commit or all abort. Coordinated by Transaction Manager (TM). TM sends Prepare, Commit-Abort. RM responds Prepared, Aborted. Cost: 3N+1 messages, N+1 stable writes. Delay: –3 messages –2 stable writes. Blocking: if TM fails, Commit-Abort stalls. [Figure: state diagrams. Transaction Manager: working, committed, aborted. Resource Manager: working, prepared, committed, aborted. Messages: RequestCommit, Prepare, Prepared, Commit.]

49 49 Two Phase Commit: 2PC Atomicity – all or nothing Consistency/Reliability – does right thing Isolation – no concurrency anomalies Durability – state survives failures Availability: always up ACID-A

50 50 I can do better Those 2PC wimps are –Stupid – they do not understand my app –Fascists – they force me to send messages I can do better –I can write async code –I can keep logs –I can deal with failures and complexities –Indeed, this is my destiny: a full employment act

51 51 Commit KISS Simple fault / failure model It is hard to get these optimizations right. But you want availability… OK… No 2PC just C

52 52 2PC Commit Availability: always up Atomicity – all or nothing Consistency/Reliability – does right thing Isolation – no concurrency anomalies Durability – state survives failures => 2PC++ = 3PC = Non Blocking Commit Solves the availability problem AACID

53 53 Consensus N processes want to agree on a value. Want to tolerate F faults –Tolerate F processes stopping –Tolerate F Messages delayed or lost If there are no more than F faults in a window, then consensus is achieved. Byzantine faults need 3F+1 acceptors. Benign faults need 2F+1 acceptors; stalls but safe if more than F faults.

54 54 Paxos Consensus Group has a leader known to all –leader election is a subroutine Process proposes a value v to leader. Leader sends proposal (phase 2) (ballot, value) to all acceptors Acceptors respond with: max(ballot, value) they have seen If leader gets no higher ballot, and gets at least F+1 responses then leader can announce (ballot, value) Protocol is 3-phase Phase 1: –Leader starts new ballot Phase 2 –Leader proposes value Phase 3 –If value accepted by F+1 then value is accepted. –If not, leader tries to get majority value accepted. 6F+4 messages, F+1 stable writes 4 message delays and 2 stable writes

55 55 Paxos Commit Obvious idea: Have TM use Paxos consensus of RMs prepared. More efficient idea: 2F+1 acceptors (~2F+1 TMs). Each RM leads a Paxos on: "I'm Prepared". If F+1 acceptors see all RMs prepared, then transaction committed. 2F(N+1) + 3N + 1 messages, 5 message delays (one extra delay), 2 stable write delays. == 2PC when F=0. [Figure: message flow among RM0, Commit Leader, RM0…N and Acceptors 0…2F: request commit, prepare, prepared, all prepared, commit.]
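The message counts quoted on the Two Phase Commit and Paxos Commit slides can be compared side by side. This sketch only evaluates the formulas as they appear on the slides; the function names are illustrative.

```python
def two_pc_messages(n: int) -> int:
    """Messages for Two Phase Commit with n resource managers (slide: 3N+1)."""
    return 3 * n + 1


def two_pc_stable_writes(n: int) -> int:
    """Stable writes for Two Phase Commit (slide: N+1)."""
    return n + 1


def paxos_commit_messages(n: int, f: int) -> int:
    """Messages for Paxos Commit tolerating f faults (slide: 2F(N+1) + 3N + 1)."""
    return 2 * f * (n + 1) + 3 * n + 1


def acceptors_needed(f: int) -> int:
    """Acceptors needed to tolerate f benign faults (slide: 2F+1)."""
    return 2 * f + 1


n = 5
for f in range(3):
    extra = paxos_commit_messages(n, f) - two_pc_messages(n)
    print(f"F={f}: {acceptors_needed(f)} acceptors, "
          f"{paxos_commit_messages(n, f)} messages ({extra} more than 2PC)")
```

With F=0 the two counts coincide, which is the slide's "== 2PC when F=0"; each additional tolerated fault adds 2(N+1) messages.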

56 56 Paxos Commit (success case) Acceptors working prepared committedaborted Resource Managers working AllPreparedaborted Commit Leader working committedaborted Request Commit Prepare Prepared Commit All Prepared

57 57 Two (?) Talks: Distributed Computing Economics; What I'm doing –Online Science – World Wide Telescope (with Alex Szalay, JHU) –TerraServer Brick Design/Deploy/Operate (with Tom Barclay) –Paxos Commit (with Leslie Lamport) –Spatial Data done relationally (with Alex Szalay, JHU)

58 58 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

60 60 Background I have been working with Astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB A great Web Services app The work here joint with Alex Szalay SkyServer.Sdss.Org is first installment, SkyQuery.Net is second installment (federated web services)

61 61 Outline How to do spatial lookup: –The old way: HTM –The new way: zoned lookup

62 62 Spatial Data Access – SQL extension (Szalay, Kunszt, Brunner) http://www.sdss.jhu.edu/htm Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins. Every object has a 20-deep Mesh ID. Given a spatial definition, routine returns up to 10 covering triangles. Spatial query is then up to 10 range queries. Fast: 1,000 triangles / second / GHz. [Figure: recursive subdivision of a triangle into numbered sub-triangles: 2,0 2,1 2,2 2,3, then 2,3,0 2,3,1 2,3,2 2,3,3.]

63 63 A typical call -- find objects within 1 arcminute of (60,20)
select objID, ra, dec
from PhotoObj as p, fHtmCover(60,20,1) as triangle
where p.htmID between triangle.startHtmID and triangle.endHtmID   -- coarse filter
  and ...   -- careful distance test rejects false positives (correct filter, elided on the slide)
-- or better yet
select objID, ra, dec, distance
from dbo.fGetNearbyObjEq(60,20,1)

64 64 Integration with CLR Makes it Nicer Peter Kukol converted 500 lines of external stored procedure glue code to 50 lines of C# code. Now we are converting library to C# Also, Cross Apply is VERY useful select objID, count(*) from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)

65 65 Object Relational Has Arrived VMs are moving inside the DB. Yukon includes Common Language Runtime (Oracle & DB2 have similar mechanisms). So, C++, VB, C# and Java are co-equal with TransactSQL. You can define classes and methods; SQL will store the instances; access them via methods. You can put your analysis code INSIDE the database. Minimizes data movement: you can't move petabytes to the client, but we will soon have petabyte databases. [Figure: code moving from outside the data to inside the database (data + code).]

66 66 The HTM code body Spatial Data Search The Pre CLR design Transact SQL sp_HTM (20 lines) 469 lines of glue looking like: // Get Coordinates param datatype, and param length information of if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL) ErrorExit("srv_paraminfo failed..."); // Is Coordinate param a character string if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR && bType1 != SRVVARCHAR && bType1 != SRVCHAR) ErrorExit("Coordinate param should be a string."); // Is Coordinate param non-null if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1) ErrorExit("Coordinate param is null."); // Get pointer to Coordinate param pzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1); if (pzCoordinateSpec == NULL) ErrorExit("Coordinate param is null."); pzCoordinateSpec[cbActualLen1] = 0; // Get OutputVector datatype, and param length information if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL) ErrorExit("Failed to get type info on HTM Vector param...");

67 67 The glue CLR design: Discard 450 lines of UGLY code. The HTM code body; C# SQL sp_HTM (50 lines). Thanks!!! To Peter Kukol (who wrote this)
using System;
using System.Data;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace HTM {
  public class HTM_wrapper {
    [DllImport("SQL_HTM.dll")]
    static extern unsafe void * xp_HTM_Cover_get (byte *str);
    public static unsafe void HTM_cover_RS(string input) {
      // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown)
      byte * inputBytes;
      byte * output;
      // invoke the HTM routine
      output = (byte *)xp_HTM_Cover_get(inputBytes);
      // Convert the array to a table
      SqlResultSet outputTable = SqlContext.GetReturnResultSet();
      if (output[0] == 'O') {                    // if Output is OK
        uint c = *(UInt32 *)(output + 4);        // cast results as dataset
        Int64 * r = (Int64 *)(output + 8);       // Int64 r[c-1,2]
        for (int i = 0; i < c; ++i) {
          SqlDataRecord newRecord = outputTable.CreateRecord();
          newRecord.SetSqlInt64(0, r[0]);
          newRecord.SetSqlInt64(1, r[1]);
          r++; r++;                              // advance to the next (start, end) pair
          outputTable.Insert(newRecord);
        }
      }
      // return outputTable;
    }
  }
}

68 68 The Clean CLR design Discard all glue code return array cast as table CREATE ASSEMBLY HTM_A FROM '\\localhost\HTM\HTM.dll' CREATE FUNCTION HTM_cover( @input NVARCHAR(100) ) RETURNS @t TABLE ( HTM_ID_START BIGINT NOT NULL PRIMARY KEY, HTM_ID_END BIGINT NOT NULL ) AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover using System; using System.Data; using System.Data.Sql; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM_NS { public class HTM_C { public static Int64[,2] HTM_cover(string input) { // invoke the HTM routine return (Int64[,2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here. } } } Your/My code goes here

69 69 Performance (Beta1) On a 2.2 GHz Xeon: Call a Transact SQL function: 33 μs. Call a C# function: 50 μs. Table valued function: not good in β1. Array (== table) valued function: 200 μs + 27 μs per row.

70 70 CREATE ASSEMBLY ReturnOneA FROM '\\localhost\C:\ReturnOne.dll' GO CREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INT AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int GO --------------------------------------------- -- time echo an integer declare @i int, @j int, @cpu_seconds float, @null_loop float declare @start datetime, @end datetime set @j = 0 set @i = 10000 set @start = current_Timestamp while(@i > 0) begin set @j = @j + 1 set @i = @i -1 end set @end = current_Timestamp set @null_loop = datediff(ms, @start,@end) / 10.0 set @i = 10000 set @start = current_Timestamp while(@i > 0) begin select @j = dbo.ReturnOne_Int(@i) set @j = @j + 1 set @i = @i -1 end set @end = current_Timestamp set @cpu_seconds = datediff(ms, @start,@end) / 10.0 - @null_loop print 'average cpu time for 1,000 calls to ReturnOne_Int was ' + str(@cpu_seconds,8,2)+ ' micro seconds' The Code using System; using System.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace ReturnOneNS { public class ReturnOneC { public static int ReturnOne_Int(int input) { return input; } Function written in C# inside the DB Program in DB in different language (Tsql) calling function

71 71 What Is the Significance? No more inside/outside DB dichotomy. You can put your code near the data. Indeed, we are letting users put personal databases near the data archive. This avoids moving large datasets. Just move questions and answers.

72 72 Meta-Message Trying to fit science data into databases When it does not fit, something is wrong. Look for solutions –Many solutions come from OR extensions –Some are fundamental engine changes More structure in DB Richer operator sets Better statistics

73 73 But… Wanted a faster way to do this: some computations were taking toooooo long (see below). Wanted to define areas in relational form. Wanted a portable way that works on any relational system. So, developed a constraint database approach – see below.

74 74 The Idea: Equations Define Subspaces For (x,y) above the line: ax + by > c. Reverse the space by: -ax + -by > -c. Intersect 3 half-spaces: a_1 x + b_1 y > c_1, a_2 x + b_2 y > c_2, a_3 x + b_3 y > c_3. [Figure: the line ax + by = c in the (x,y) plane, with intercepts x=c/a and y=c/b.]

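75 75 The Idea: Equations Define Subspaces a_1 x + b_1 y > c_1, a_2 x + b_2 y > c_2, a_3 x + b_3 y > c_3. Two counting queries: select count(*) from convex where a*@x + b*@y < c and select count(*) from convex where a*@x + b*@y > c. [Figure: two copies of the (x,y) plane cut by the three lines into seven regions, each region labeled with a query's count; the label sets are 3 2 2 2 1 1 1 and 0 1 1 1 2 2 2.]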

76 76 Domain is Union of Convex Hulls Simple volumes are unions of convex hulls. Higher order curves also work Complex volumes have holes and their holes have holes. (that is harder). Not a convex hull +

77 77 Now in Relational Terms
create table HalfSpace (
    domainID    int not null              -- domain name
                foreign key references Domain(domainID),
    convexID    int not null,             -- grouping a set of ½ spaces
    halfSpaceID int identity,             -- a particular ½ space
    a           float not null,           -- the (a,b,..) parameters
    b           float not null,           -- defining the ½ space
    c           float not null,           -- the constraint (c above)
    primary key (domainID, convexID, halfSpaceID)
)
(x,y) is inside a convex if it is inside all lines of the convex; equivalently, (x,y) is inside a convex if it is NOT OUTSIDE ANY line of the convex.
Convexes containing point (@x,@y):
select convexID                   -- return the convex hulls
from HalfSpace                    -- from the constraints
where (@x * a + @y * b) < c       -- point outside the line?
group by all convexID             -- insist no line of convex
having count(*) = 0               -- is outside (count outside == 0)
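The "not outside any line" test can also be tried outside the database. The sketch below mirrors the query's logic in Python on a handful of made-up rows; the class and function names and the two example convexes are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, List, NamedTuple, Tuple


class HalfSpace(NamedTuple):
    """One row of the HalfSpace table: a point is inside when a*x + b*y > c."""
    domain_id: int
    convex_id: int
    a: float
    b: float
    c: float


def convexes_containing(rows: Iterable[HalfSpace], x: float, y: float) -> List[Tuple[int, int]]:
    """Return (domain_id, convex_id) for every convex with zero violated half-spaces."""
    outside = {}
    for row in rows:
        key = (row.domain_id, row.convex_id)
        outside.setdefault(key, 0)          # like GROUP BY ALL: keep groups with no hits
        if row.a * x + row.b * y < row.c:   # point is outside this line
            outside[key] += 1
    return sorted(key for key, count in outside.items() if count == 0)


rows = [
    # convex 1: the triangle x > 0, y > 0, x + y < 1
    HalfSpace(1, 1, 1.0, 0.0, 0.0),
    HalfSpace(1, 1, 0.0, 1.0, 0.0),
    HalfSpace(1, 1, -1.0, -1.0, -1.0),
    # convex 2: the square 2 < x < 3, 0 < y < 1
    HalfSpace(1, 2, 1.0, 0.0, 2.0),
    HalfSpace(1, 2, -1.0, 0.0, -3.0),
    HalfSpace(1, 2, 0.0, 1.0, 0.0),
    HalfSpace(1, 2, 0.0, -1.0, -1.0),
]

print(convexes_containing(rows, 0.25, 0.25))   # inside the triangle only
print(convexes_containing(rows, 2.5, 0.5))     # inside the square only
print(convexes_containing(rows, 1.0, 1.0))     # inside neither
```

As in the SQL, a point exactly on a line is not counted as outside, so boundaries belong to the convex.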

78 78 All Domains Containing this Point The group by is supported by the domain/convex index, so it's a sequential scan (pre-sorted!).
select distinct domainID          -- return domains
from HalfSpace                    -- from constraints
where (@x * a + @y * b) < c       -- point outside
group by all domainID, convexID   -- never happens
having count(*) = 0               -- count outside == 0

79 79 The Algebra is Simple (Boolean) @domainID = spDomainNew (@type varchar(16), @comment varchar(8000)) @convexID = spDomainNewConvex (@domainID int) @halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @a float, @b float, @c float) @returnCode = spDomainDrop(@domainID) select * from fDomainsContainPoint(@x float, @y float) Once constructed they can be manipulated with the Boolean operations. @domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000)) @domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000)) @domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))

80 80 What! No Bounding Box? Bounding box limits search. A subset of the convex hulls. If query runs at 3M half-space/sec then no need for bounding box, unless you have more than 10,000 lines. But, if you have a lot of half-spaces then bounding box is good.

81 81 OK: solved Areas Contain Point? What about: Points near point? Table-valued function finds points near a point –Select * from fGetNearbyEq(ra,dec,r) Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/ –Space filling curve, bounding triangles… –Standard approach 13 ms/call… So 70 objects/second. Too slow, so pre-compute neighbors: Materialized view. At 70 objects/sec: takes 6 months to compute materialized view on billion objects.

82 82 Zone Based Spatial Join Divide space into zones. Key points by (Zone, offset); on the sphere this needs a wrap-around margin. Point search: look in a few zones at a limited offset, ra ± r, a bounding box that has 1-π/4 false positives. All inside the relational engine. Avoids impedance mismatch. Can batch all-all comparisons. 33x faster and parallel: 6 days, not 6 months! [Figure: circle of radius r crossing a zone boundary at zoneMax; labels: r, ra-zoneMax, (r^2 + (ra-zoneMax)^2), cos(radians(zoneMax)), x, Ra ± x.]
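The "6 months" and "6 days" figures are back-of-envelope numbers that can be reproduced from the quoted rates. A minimal sketch, using only the 70 objects/second, one billion objects and 33x speedup stated on these slides; the function name is illustrative.

```python
def days_to_process(n_objects: float, objects_per_second: float) -> float:
    """Days needed to process n_objects at a steady rate."""
    return n_objects / objects_per_second / 86400


htm_days = days_to_process(1e9, 70)     # 13 ms per call, about 70 objects/second
zone_days = htm_days / 33               # the slide's 33x zone speedup

print(f"HTM function calls: {htm_days:.0f} days (about {htm_days / 30:.1f} months)")
print(f"Zone join:          {zone_days:.1f} days")
```

Both results land within the talk's usual rounding of the slide's figures.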

83 83 In SQL: points near point
select o1.objID                                    -- find objects
from zone o1                                       -- in the zoned table
where o1.zoneID between                            -- where zone #
          floor((@dec-@r)/@zoneHeight) and         -- overlaps the circle
          ceiling((@dec+@r)/@zoneHeight)
  and o1.ra between @ra - @r and @ra + @r          -- quick filter on ra
  and o1.dec between @dec-@r and @dec+@r           -- quick filter on dec
  and sqrt(power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2)) < @r   -- careful filter on distance
The ra and dec filters form the bounding box; the careful distance filter eliminates the ~21% = 1-π/4 false positives.
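The same coarse-then-careful structure can be sketched in plain Python. This is an illustration under stated assumptions rather than the SkyServer code: zones live in a dictionary instead of an indexed table, the ra window is widened by the standard spherical-cap bound asin(sin r / cos dec) and wraps at 0/360 (details the slide leaves to the paper), and all names and sample points are invented.

```python
import math
from collections import defaultdict


def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector of a sky position (the cx, cy, cz columns)."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def build_zones(points, zone_height):
    """Bucket (obj_id, ra, dec) tuples by zone number floor(dec / zone_height)."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[math.floor(dec / zone_height)].append((obj_id, ra, dec))
    return zones


def ra_half_width(dec, r):
    """Half-width in ra (degrees) of the box around a circle of radius r at dec."""
    if abs(dec) + r >= 90.0:
        return 180.0                    # circle touches a pole: every ra qualifies
    return math.degrees(math.asin(math.sin(math.radians(r)) / math.cos(math.radians(dec))))


def nearby(zones, zone_height, ra, dec, r):
    """Ids of objects within r degrees of (ra, dec)."""
    center = unit_vector(ra, dec)
    chord = 2.0 * math.sin(math.radians(r) / 2.0)   # straight-line length for angle r
    half_width = ra_half_width(dec, r)
    first = math.floor((dec - r) / zone_height)
    last = math.floor((dec + r) / zone_height)
    hits = []
    for zone_id in range(first, last + 1):          # only zones overlapping the circle
        for obj_id, o_ra, o_dec in zones.get(zone_id, ()):
            d_ra = abs((o_ra - ra + 180.0) % 360.0 - 180.0)
            if d_ra > half_width or abs(o_dec - dec) > r:
                continue                            # quick bounding-box filter
            if math.dist(unit_vector(o_ra, o_dec), center) < chord:
                hits.append(obj_id)                 # careful distance filter
    return sorted(hits)


points = [(1, 10.0, 0.0), (2, 10.05, 0.02), (3, 10.5, 0.0),
          (4, 10.0, 0.09), (6, 10.08, 0.08)]
zones = build_zones(points, zone_height=0.05)
print(nearby(zones, 0.05, ra=10.0, dec=0.0, r=0.1))
```

Object 6 sits in a corner of the bounding box but outside the circle, so it is one of the false positives that only the careful filter removes.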

84 84 Quantitative Evaluation: 7x faster than external stored proc: (linkage is expensive) time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate. time vs. best time @ various radius. A zoneHeight of 4 is near-optimal

85 85 All Neighbors of All points (can Batch Process the Joins) A 5x additional speedup (35x in total)
    for @deltaZone in {-1, 0, 1}                     -- example ignores some spherical geometry details in paper
    insert neighbors                                 -- insert one zone's neighbors
    select o1.objID as objID,                        -- object pairs
           o2.objID as NeighborObjID,
           .. other fields elided
    from zone o1 join zone o2                        -- join 2 zones
      on o1.zoneID-@deltaZone = o2.zoneID            -- using zone number and ra
     and o2.ra between o1.ra - @r and o1.ra + @r     -- points near ra
    where                                            -- elided margin logic, see paper.
      and o2.dec between o1.dec-@r and o1.dec+@r     -- quick filter on dec
      and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2)) < @r  -- careful filter on distance
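The batch join can be sketched in Python as a loop over the three zone offsets. As with the slide, this ignores spherical geometry and the margin logic. It additionally assumes the zone height equals the search radius (so every neighbor lies within one zone of its partner) and excludes self-pairs; both choices are assumptions of this sketch, and `all_neighbors` is a hypothetical name.

```python
import math
from collections import defaultdict

def all_neighbors(points, r):
    """Return the set of ordered (id1, id2) pairs closer than r, flat-plane approximation."""
    zone_height = r  # assumption: with zones as tall as r, neighbors are at most one zone apart
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[math.floor(dec / zone_height)].append((obj_id, ra, dec))

    pairs = set()
    for delta in (-1, 0, 1):                      # one batch join per zone offset
        for zone_id, rows in zones.items():
            for id1, ra1, dec1 in rows:
                # join condition: o1.zoneID - delta = o2.zoneID
                for id2, ra2, dec2 in zones.get(zone_id - delta, ()):
                    if id1 == id2:
                        continue                  # assumption: skip self-pairs
                    if abs(ra2 - ra1) > r or abs(dec2 - dec1) > r:
                        continue                  # quick filters on ra and dec
                    if math.hypot(ra2 - ra1, dec2 - dec1) < r:
                        pairs.add((id1, id2))     # careful filter on distance
    return pairs

points = [(1, 10.0, 10.0), (2, 10.5, 10.2), (3, 10.9, 11.0), (4, 30.0, 10.0)]
neighbor_pairs = all_neighbors(points, r=1.0)
print(sorted(neighbor_pairs))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```

Each offset is one set-oriented join rather than one function call per object, which is the property the slide credits for the extra speedup.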

86 86 Spatial Stuff Summary Easy –Point in polygon –Polygons containing points – (instance and batch) Works in higher dimensions Side note: Spherical polygons are –hard in 2-space –Easy in 3-space

87 87 Spatial Stuff Summary Constraint databases are in –Streams (data is query, query is in DB) –Notification: subscription in DB, data is query –Spatial: constraints in DB, data is query You can express constraints as rows Then You –Can evaluate LOTS of predicates per second –Can do set algebra on the predicates. Benefits from SQL parallelism SQL == Prolog // DataLog?

88 88 References
Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database, http://research.microsoft.com/~Gray/papers/Polygon.doc
A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc

89 89 Some Database Topics Sparse tables: column vs row store, tag and index tables, pivot. Maplist (cross apply). Dealing with bad statistics.

90 90 Column Store Pyramid Users see fat base tables (universal relation) Define popular columns index tag table 10% ~ 100 columns Make many skinny indices 1% ~ 10 columns Query optimizer picks right plan Automate definition & use Fast read, slow insert/update Data warehouse Note: prior to Yukon, index had 16 column limit. A bane of my existence. (Figure labels: Simple, Typical, Semi-join, Fat query, Obese query; BASE, INDICES, TAG.)

91 91 Examples
    create table base ( id bigint, f1 int primary key, f2 int, …,f1000 int)
    create index tag on base (id) include (f1, …, f100)
    create index skinny on base(f2,…f17)
(Figure labels: Simple, Typical, Semi-join, Fat query, Obese query; BASE, INDICES, TAG.)

92 92 A Semi-Join Example
    create table fat(a int primary key, b int, c int, fat char (988))
    declare @i int, @j int; set @i = 0
    again: insert fat values(@i, cast(100*rand() as int), cast (100*rand() as int), ' ')
    set @i = @i + 1; if (@i < 1000000) goto again
    create index ab on fat(a,b)
    create index ac on fat(a,c)
    dbcc dropcleanbuffers with no_infomsgs
    select count(*) from fat with(index (0)) where c = b
    -- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s.
    dbcc dropcleanbuffers with no_infomsgs
    select count(*) from fat where b=c
    -- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s, elapsed: 1.4 s.
(Figure labels: 1GB table, 8MB indices ab and ac; b=c via the indices: 3.4K IO, 1.4 sec; b=c via the table scan: 137K IO, 31 sec.)

93 93 Moving From Rows to Columns: Pivot & UnPivot
What if the table is sparse? LDAP has 7 mandatory and 1,000 optional attributes. Store row, col, value.
    create table Features (object varchar, attribute varchar, value varchar, primary key (object, attribute))
    select * from (features pivot value on attribute in (year, color) ) as T where object = '4PNC450'
Features
    object    attribute  value
    4PNC450   year       2000
    4PNC450   color      white
    4PNC450   make       Ford
    4PNC450   model      Taurus
T
    object    year   color
    4PNC450   2000   white
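What the pivot does to the (object, attribute, value) rows can be shown in a few lines of Python. This is only a sketch of the transformation, using the rows from the slide; the `pivot` function is a hypothetical helper, not the SQL operator.

```python
features = [
    ("4PNC450", "year", "2000"),
    ("4PNC450", "color", "white"),
    ("4PNC450", "make", "Ford"),
    ("4PNC450", "model", "Taurus"),
]

def pivot(rows, columns):
    """Turn (object, attribute, value) rows into one record per object,
    keeping only the requested attributes as columns."""
    out = {}
    for obj, attr, value in rows:
        if attr in columns:
            # Missing attributes stay None, which is how sparseness shows up.
            out.setdefault(obj, dict.fromkeys(columns))[attr] = value
    return out

t = pivot(features, ("year", "color"))
print(t)  # {'4PNC450': {'year': '2000', 'color': 'white'}}
```

In this sketch an object with none of the requested attributes does not appear in the result at all.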

94 94 Maplist Meets SQL – cross apply Your table-valued function F(a,b,c) returns all objects related to a,b,c: spatial neighbors, sub-assemblies, members of a group, items in a folder,… Apply this function to each row. Classic drill-down. Use outer apply if f() may be null.
    select p.*, q.*
    from parent as p cross apply f(p.a, p.b, p.c) as q
    where p.type = 1
(Figure: p1 → f(p1), p2 → f(p2), …, pn → f(pn).)
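The difference between cross apply and outer apply can be sketched with Python generators. The names and the folder data below are illustrative assumptions, not part of the talk.

```python
def cross_apply(parents, f):
    """Pair each parent with every row f(parent) returns; parents with no rows vanish."""
    for p in parents:
        for q in f(p):
            yield p, q

def outer_apply(parents, f):
    """Like cross_apply, but a parent with no rows is kept once, paired with None."""
    for p in parents:
        matched = False
        for q in f(p):
            matched = True
            yield p, q
        if not matched:
            yield p, None

# Hypothetical data: folders and the items they contain.
items_in_folder = {"docs": ["a.txt", "b.txt"], "empty": []}

def folder_items(folder):
    return items_in_folder.get(folder, [])

crossed = list(cross_apply(["docs", "empty"], folder_items))
outered = list(outer_apply(["docs", "empty"], folder_items))
print(crossed)  # [('docs', 'a.txt'), ('docs', 'b.txt')]
print(outered)  # [('docs', 'a.txt'), ('docs', 'b.txt'), ('empty', None)]
```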

95 95 When SQL Optimizer Guesses Wrong, Life is DREADFUL SQL is a non-procedural language. The compiler/optimizer picks the procedure based on statistics. If the stats are wrong or missing…. Bad things happen. Queries can run VERY slowly. Strategy 1: allow users to specify plan. Strategy 2: make the optimizer smarter (and accept hints from the user.)

96 96 An Example of the Problem A query selects some fields of an index and of a huge table. Bookmark plan: –look in the index for a subset. –Look up the subset in the fat table. This is –great if subset << table. –terrible if subset ~ table. If statistics are wrong, or if predicates are not independent, you get the wrong plan. How to fix the statistics? (Figure: Index, Huge table.)

97 97 A Fix: Let user ask for stats Create Statistics on View(f1,..,fn) Then the optimizer has the right data and picks the right plan. Statistics on Views, C. Galindo-Legaria, M. Joshi, F. Waas, M. Wu, VLDB 2003. Q3: Select count(*) from Galaxy where r 0.120 Bookmark: 34 M random IO, 520 minutes. Create Statistics on Galaxy(objID ) Scan: 5 M sequential IO, 18 minutes. Ultimately this should be automated, but for now,… it's a step in the right direction.
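The gap between the two plans is easier to see as IO rates. A back-of-the-envelope check in Python, using only the numbers on the slide:

```python
bookmark_ios, bookmark_minutes = 34_000_000, 520   # random IO, bookmark plan
scan_ios, scan_minutes = 5_000_000, 18             # sequential IO, scan plan

bookmark_rate = bookmark_ios / (bookmark_minutes * 60)   # IOs per second
scan_rate = scan_ios / (scan_minutes * 60)
speedup = bookmark_minutes / scan_minutes

print(f"bookmark: {bookmark_rate:.0f} IO/s, scan: {scan_rate:.0f} IO/s, "
      f"scan plan is {speedup:.0f}x faster")
```

The scan plan issues about 7x fewer IOs and completes each one about 4x faster, which together give the roughly 29x difference in elapsed time.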

98 98 Two (?) Talks Distributed Computing Economics Online Science (what I have been doing).

