Presentation on theme: "Building Petabyte Databases SQL+.Net"— Presentation transcript:
1 Building Petabyte Databases SQL+.Net
Jim Gray
Microsoft research
VSlive! SQL To The Max
15 February, San Francisco
Objects are closer than they appear in the mirror
PhotoServer:
Tom Barclay
Ya Feng Sung
TerraServer
USGS
SkyServer
Alex Szalay
Ani Thakar
Peter Kunszt
Tanu Malik
Jordan Raddick
Don Slutz
Jan vandenBerg
Some Slides
Robert Brunner
2 SQLserver™: Past and Future
History
  XML
  Replication x, y, z,…
  Auto Admin
  Data Transformation
  OLAP
  Data Mining
  Text Indexing
  English Query
  Partitioning
  Clusters
.Net
  XML schema support
  updategrams
  More xPath support
  SPs and templates as web services
SQL 200x
  Beta late this year
  Trustworthy: Availability Privacy Security
  CLR (objects)
  XML (xQuery,….)
  Unify Files & Records
  Manageability, Scalability
Calling a stored procedure or template as a web service:
  WebReference.soap proxy = new WebReference.soap();
  object results1 = proxy.StoredProcedure(inParam, ref inoutParam, out returnValue);
  object results2 = proxy.Template(inParam);
3 Outline
We will be able to store everything,
  How do we represent it? (objects)
  How will we find it (aka: who cares?)
PhotoServer: Objects vs records vs files,
  XML++ gives us portable objects.
  Similarity search: better than nothing!
Scalability: a solved problem,
  but… Trustworthy & Manageable is not.
TerraServer and TerraService
  Why put everything in the database?
  A prototypical Web Service.
SkyServer and the World Wide Telescope
  Data Mining science data
  Serving Windows/Macintosh/Unix clients with .Net
  Federating Archives with .Net
4 Record Everything? What’s that?
Scale (figure axis): Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo
Figure labels: Everything! Recorded; All Books MultiMedia; All LoC books (words); Movie; A Photo; A Book
Disks will get 100x to 1,000x more capacity, 10x to 30x more bandwidth.
Other technologies in the wings: mram, mems, …
The 20TB … 200TB disk drive!
  Library of Congress (books)
  A billion photos
  2…20 years of video (continuous)
See Mike Lesk: How much information is there:
See Lyman & Varian: How much information
5 Why Put Everything in Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automate processing: knowbots
Point-to-Point OR Broadcast
Immediate OR Time Delayed
Locate, Process, Analyze, Summarize
6 Most storage is personal
90% of disks are IDE/ATA
85% of bytes are
Gordon Bell’s shoebox:
  Scans:  20 k “pages” 300 dpi   1 GB
  Music:  2 k “tracks”           7 GB
  Photos: 13 k images            2 GB
  Video:  10 hrs                 3 GB
  Docs:   3 k ppt, word,..       2 GB
  Mail:   50 k messages          2 GB
  16 GB
7 How will we find it? Put everything in the DB (and index it)
More than a file system
Unifies data and meta-data
Simpler to manage
Easier to subset and reorganize
Set-oriented access
Allows online updates
Automatic indexing
Automatic replication
8 How do we represent it to the outside world?
File metaphor too primitive: just a blob
Table metaphor too primitive: just records
Need Metadata describing data context
  Format
  Provenance (author/publisher/citations/…)
  Rights
  History
  Related documents
In a standard format
  XML and XML schema
  DataSet is great example of this
  World is now defining standard schemas
Example DataSet (callouts: schema; data or diffgram):
  <?xml version="1.0" encoding="utf-8" ?>
  <DataSet xmlns="…">
    <xs:schema id="radec" xmlns="" xmlns:xs="…" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
      <xs:element name="radec" msdata:IsDataSet="true">
        <xs:element name="Table">
          <xs:element name="ra" type="xs:double" minOccurs="0" />
          <xs:element name="dec" type="xs:double" minOccurs="0" />
    …
    <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
      <radec xmlns="">
        <Table diffgr:id="Table1" msdata:rowOrder="0"> <ra> </ra> <dec> </dec> </Table>
        <Table diffgr:id="Table10" msdata:rowOrder="9"> <ra> </ra> <dec> </dec> </Table>
      </radec>
    </diffgr:diffgram>
  </DataSet>
9 There is a problem: Need Standard Data AND Methods
Niklaus Wirth: Algorithms + Data Structures = Programs
XML data is GREAT!!!!
  XML documents are portable objects
  XML documents are complex objects
  WSDL defines the methods on objects (the class)
But will all the implementations match?
  Think of UNIX or SQL or C or…
We need conformance tests.
That’s why Web Services Interoperability is so important.
11 PhotoServer: Managing Photos
Load all photos into the database
Annotate the photos
View by various attributes
Do similarity Search
Use XML for interchange
Use dbObject, Template for access
Diagram labels: SQL, Templates, XML data; IIS; jScript; XML datasets & mime data; Templates; Schema; DOM; SQL (for xml)
12 How Similarity Search Works
For each picture, Loader
  Inserts thumbnails
  Extracts 270 Features into a blob
When looking for similar picture
  Scan all photos comparing features (dot product of vectors)
  Sort by similarity
Feature blob is an array
  Today I fake the array with functions and cast: cast(substring(feature,72,8) as float)
  When SQL Server gets C#, we won’t have to fake it.
  And… it will run 100x faster (compiled managed code).
Idea pioneered by IBM Research, we use a variant by MS Beijing Research.
Figure labels: many black squares, 10% orange, …etc; no black squares, 20% orange, …etc; 72% match; 27% match
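The scan-and-sort step on this slide can be sketched outside the database. This is a minimal illustration, not the PhotoServer code: it assumes the feature blob is a packed array of 8-byte little-endian floats (the slide only shows 8-byte slices being cast to float), and the function names and the tiny 3-feature vectors are hypothetical.

```python
import struct


def unpack_features(blob: bytes) -> list[float]:
    """Decode a blob as consecutive 8-byte floats (byte order is an assumption)."""
    count = len(blob) // 8
    return list(struct.unpack(f"<{count}d", blob[: count * 8]))


def similarity(a: list[float], b: list[float]) -> float:
    """Dot product of two feature vectors, as named on the slide."""
    return sum(x * y for x, y in zip(a, b))


def most_similar(query_blob: bytes, photos: dict[str, bytes]) -> list[tuple[str, float]]:
    """Scan every photo, score it against the query, and sort best match first."""
    query = unpack_features(query_blob)
    scored = [
        (name, similarity(query, unpack_features(blob)))
        for name, blob in photos.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Every photo is visited once, which is why the later slides stress sequential scan speed: there is no index to consult, only a fast pass over all feature blobs.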
13 Things I Learned from PhotoServer
Data:
  XML data sets are a universal way to represent answers
  XML data sets minimize round trips: 1 request/response
Search:
  It is BEST to index
  You can put objects and attributes in a row (SQL puts big blobs off-page)
  If you can’t index, you can extract attributes and quickly compare
  SQL can scan at 2M records/cpu/second
  Sequential scans are embarrassingly parallel.
15 Big! Servers
ScaleUP: a BIG box
  SMP (32 cpus)
  64 bit
ScaleOut: computing by the slice
6 years ago: 8ktpmC, today 750ktpmC
SQL Server is #1, #2, #3 (Windows is best DB2 platform too)
VLDB Management
Availability: Clusters, remote logging, replication
16 TPC measures peak performance and Price/Performance
SQL Server always had best price Performance
Now best of both (using scaleout)
SMP performance also impressive
Callouts: 32x8 900Mhz Xeon, 256GB ram, 59 TB disk; 32 900Mhz Xeon, 64GB ram, 15TB disk
Rank | Company | System | tpmC | price/tpmC | Database | OS | TP Mon | Date
1 | | ProLiant DL P | 709,220 | 14.96 US$ | Microsoft SQL Server 2000 Enterprise | Microsoft Windows 2000 Advanced | COM+ | 09/19/01
2 | IBM | eSeries 370 c/s | 688,220 | 22.58 US$ | Microsoft SQL Server 2000 Datacenter | | | 04/10/01
3 | | ProLiant DL P | 567,882 | 14.04 US$ | Microsoft SQL Server 2000 Enterprise | | |
7 | HP | HP 9000 Superdome | 389,435 | 21.24 US$ | Oracle 9i Enterprise | HP UX 11.i 64-bit | BEA Tuxedo 6.4 | 12/21/01
14 | Unisys | Enterprise Server ES7000 | 165,219 | 21.33 US$ | | Datacenter LE | COM+ |
17 Scale Out: Buy Computing by the Slice
709,220 tpmC! == 1 Billion transactions/day
Slice: 8cpu, 8GB, 100 disks (=1.8TB)
20ktpmC per slice, ~300k$/slice
clients and 4 DTC nodes not shown
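The arithmetic behind this slide is easy to check. The sketch below uses the 709,220 tpmC figure from the TPC table on the previous slide and the rounded per-slice figures from this one; the function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def transactions_per_day(tpm: int) -> int:
    """Convert a transactions-per-minute rate into transactions per day."""
    return tpm * MINUTES_PER_DAY


def price_per_tpm(slice_cost_dollars: float, slice_tpm: float) -> float:
    """Dollars per tpmC for one slice, from its cost and its throughput."""
    return slice_cost_dollars / slice_tpm


daily = transactions_per_day(709_220)          # a little over one billion
slice_price = price_per_tpm(300_000, 20_000)   # dollars per tpmC for one slice
```

With the slide's rounded inputs (about 300k$ and 20 ktpmC per slice) the price works out to about 15 $/tpmC, in line with the 14.96 US$ listed in the table.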
18 ScaleUp: A Very Big System!
UNISYS Windows 2000 Data Center Limited Edition
32 cpus on 32 GB of RAM and 1,061 disks (15.5 TB)
Will be helped by 64bit addressing
24 fiberchannel
20 TerraServer – A SQL poster child
http://TerraServer. HomeAdvisor
3 x 2 TB databases
18TB disk tri-plexed (=6TB)
3 + 1 Cluster
99.96% uptime
1B page views, 5B DB queries
Now a .NET web service
22 TerraServer Traffic & Database Growth
             | Sessions   | Page Views  | Image Tiles   | Db Queries    | Bytes Xfered
Average Day  | 44,320     | 879,720     | 3,786,551     | 4,566,024     | 59 GB
Peak Day     | 277,292    | 2,401,209   | 12,388,104    | 10,475,674    | 163 GB
Total        | 44,851,547 | 890,277,087 | 3,831,989,887 | 4,620,815,913 | 59 TB
[Chart: database growth to Jan 2002, in steps of 173 m, 217 m, 231 m, 298 m, 678 m and 900 m rows; labels include SQL Server .8 TB Db and SQL Server 1.5 TB Db; platforms: 1 Server / Win NT 4.0 EE, 2nd Server / Win 2k DataCenter, 4 Node / Win2k Datacenter Failover Cluster]
23 Hardware
8 Compaq DL360 “Photon” Web Servers
4 Compaq ProLiant 8500 Db Servers
Fiber SAN Switches
One SQL database per rack
Each rack contains 4.5 tb
261 total drives / 13.7 TB total
Meta Data: stored on 101 GB “Fast, Small Disks” (18 x 18.2 GB)
Imagery Data: stored on GB “Slow, Big Disks” (15 x 73.8 GB)
To Add GB Disks in Feb 2001 to create 18 TB SAN
[Diagram: SQL\Inst1, SQL\Inst2, SQL\Inst3 and Spare, with disk volumes labelled E through S]
24 TerraServer Lessons Learned
Hardware is 5 9’s (with clustering)
Software is 5 9’s (with clustering)
Admin is 4 9’s (offline maintenance)
Network is 3 9’s (mistakes, environment)
Simple designs are best
10 TB DB is management limit
  1 PB = 100 x 10 TB DB
  this is 100x better than 5 years ago.
Minimize use of tape
  Backup to disk (snapshots)
  Portable disk TBs
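To make the "nines" concrete, the sketch below converts an availability level into expected downtime per year. It is plain arithmetic over a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Expected yearly downtime for an availability of N nines (e.g. 3 -> 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


def downtime_from_availability(availability: float) -> float:
    """Expected yearly downtime in minutes for a fractional availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(5)   # roughly 5 minutes per year
four_nines = downtime_minutes_per_year(4)   # roughly 53 minutes per year
three_nines = downtime_minutes_per_year(3)  # roughly 526 minutes per year
```

By the same arithmetic, the 99.96% uptime quoted for TerraServer corresponds to about 210 minutes of downtime per year.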
25 TerraService
http://TerraService.Net/
Added .NET web services to TerraServer
A great way to learn what Web Services are
And what .Net is.
Image server
  Gives arbitrary rectangle/zoom of US
  Overlays features (hospitals, schools,..)
Census service
You can use it in your app.
USDA is using it today.
Demo
  Tour API
  Demo map maker
  Mention location and census services
27 Computational Science: The Third Science Branch is Evolving
In the beginning science was empirical.
Then theoretical branches evolved.
Now, we have computational branches.
  Has primarily been simulation
  Growth area: data analysis/visualization of peta-scale instrument data.
Computational Science
  Data captured by instruments, or data generated by simulator
  Processed by software
  Placed in a database / files
  Scientist analyzes database / files
28 Exploring Parameter Space: Manual or Automatic Data Mining
There is LOTS of data
  people cannot examine most of it.
  Need computers to do analysis.
Manual or Automatic Exploration
  Manual: person suggests hypothesis, computer checks hypothesis
  Automatic: computer suggests hypothesis, person evaluates significance
Given an arbitrary parameter space:
  Data Clusters
  Points between Data Clusters
  Isolated Data Clusters
  Isolated Data Groups
  Holes in Data Clusters
  Isolated Points
Nichol et al. 2001
Slide courtesy of and adapted from Robert CalTech.
29 What’s needed? (not drawn to scale)
Diagram labels: Scientists; Miners; Data Mining Algorithms; Science Data & Questions; Plumbers; Tools; Databases to Store Data and Execute Queries; Question & Answer; Visualization
30 Some science is hitting a wall: FTP and GREP are not adequate
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years.
Oh!, and 1PB ~10,000 disks
You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (= 1 $/GB)
… days and 1K$
… 3 years and 1M$
At some point you need
  indices to limit search
  parallel data search and analysis
This is where databases can help
Goal: make it easy to
  Publish: Record structured data
  Find: Find data anywhere in the network
    Get the subset you need
  Explore datasets interactively
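The case for parallel search can be sketched with a simple model: scan time grows linearly with data size and shrinks with the number of disks read in parallel. This is a back-of-envelope illustration only. The slide's GREP figures are round, order-of-magnitude numbers, so the model is not expected to reproduce them exactly, and the assumption of perfectly even striping across disks is mine.

```python
def scan_seconds(total_bytes: float, bytes_per_second: float, parallel_units: int = 1) -> float:
    """Time to read total_bytes once, split evenly over parallel_units readers."""
    if bytes_per_second <= 0 or parallel_units < 1:
        raise ValueError("rate must be positive and at least one reader is required")
    return total_bytes / (bytes_per_second * parallel_units)


GB = 1e9
PB = 1e15
RATE = GB / 60  # the slide's "1 GB in a minute" per reader

one_reader_days = scan_seconds(PB, RATE) / 86_400          # hundreds of days
ten_k_disks_minutes = scan_seconds(PB, RATE, 10_000) / 60  # on the order of 100 minutes
```

Under these assumptions a petabyte spread over the ~10,000 disks mentioned on the slide could be scanned in hours rather than years, provided all disks are read at once.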
31 Web Services are The Key
Web SERVER:
  Given a url + parameters
  Returns a web page (often dynamic)
Web SERVICE:
  Given a XML document (soap msg)
  Returns an XML document
  Tools make this look like an RPC.
    F(x,y,z) returns (u, v, w)
  Distributed objects for the web.
  + naming, discovery, security,..
Internet-scale distributed computing
Diagram labels: Your program, http, Web page, Web Server; Your program, soap, object in xml, Web Service; Data in your address space
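The "XML document in, XML document out" exchange can be sketched with the standard library alone. This is a hedged illustration, not TerraService or SkyServer code: the envelope namespace is the SOAP 1.1 one, while the service namespace `urn:example:service` and the method name `F` with parameters x, y, z (borrowed from the slide's RPC notation) are hypothetical. No network call is made.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:service"  # hypothetical service namespace


def build_request(method: str, params: dict[str, str]) -> bytes:
    """Wrap a method call and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_body(xml_bytes: bytes) -> dict[str, str]:
    """Return the child elements of the first element in the SOAP body as a dict."""
    envelope = ET.fromstring(xml_bytes)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    payload = body[0]
    return {child.tag.split("}")[-1]: child.text for child in payload}


message = build_request("F", {"x": "1", "y": "2", "z": "3"})
```

Generated proxies (such as the WebReference class shown earlier in the deck) hide exactly this packing and unpacking, which is what makes the call look like an ordinary RPC.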
32 Data Federations of Web Services
Massive datasets live near their owners:
  Near the instrument’s software pipeline
  Near the applications
  Near data knowledge and curation
  Super Computer centers become Super Data Centers
Each Archive publishes a web service
  Schema: documents the data
  Methods on objects (queries)
Scientists get “personalized” extracts
Uniform access to multiple Archives
  A common global schema
Federation
33 Why Astronomy Data?
It has no commercial value
  No privacy concerns
  Can freely share results with others
  Great for experimenting with algorithms
It is real and well documented
  High-dimensional data (with confidence intervals)
  Spatial data
  Temporal data
Many different instruments from many different places and many different times
Federation is a goal
The questions are interesting
  How did the universe form?
There is a lot of it (petabytes)
Image labels: IRAS 25μ; 2MASS 2μ; DSS Optical; IRAS 100μ; WENSS 92cm; NVSS 20cm; GB 6cm; ROSAT ~keV
34 Web Services & Grid Enable Virtual Observatory
The Internet will be the world’s best telescope:
  It has data on every part of the sky
  In every measured spectral band: optical, x-ray, radio..
  As deep as the best instruments (2 years ago).
  It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no..).
  It’s a smart telescope: links objects and data to literature on them.
W3C & IETF standards provide
  Naming
  Authorization / Security / Privacy
  Distributed Objects: Discovery, Definition, Invocation, Object Model
  Higher level services: workflow, transactions, DB,..
35 Steps to Virtual Observatory Prototype
Define a set of Astronomy Objects and methods.
  Based on UDDI, WSDL, XSL, SOAP, dataSet
Use them locally to debug ideas
  Schema, Units,…
  Dataset problems
  Typical use scenarios.
Federate different archives
  Each archive is a web service
  Global query tool accesses them
Working on this plan with Sloan Digital Sky Survey and CalTech/Palomar. Especially Alex Szalay et al. at JHU
36 Sloan Digital Sky Survey
http://www.sdss.org/
For the last 12 years astronomers have been building a telescope (with funding from Sloan Foundation, NSF, and a dozen universities). 90M$.
Y2000: engineer, calibrate, commission: now public data.
  5% of the survey, 600 sq degrees, 15 M objects, 60GB, ½ TB raw.
  This data includes most of the known high z quasars.
  It has a lot of science left in it but….
Now the data is arriving:
  250GB/nite (20 nights per year) = 5TB/y.
  100 M stars, 100 M galaxies, 1 M spectra.
37 Demo of Sky Server
http://skyserver.sdss.org/
Demo sky server
Demo Explorer
Explain need for Unix/Mac clients
Demo Java SQLQA?
Talk about federation plan.
Work is product of Alex Johns Hopkins
Tanu Malik did SQLQA.
38 Two kinds of SDSS data in an SQL DB (objects and images all in DB)
15M Photo Objects ~ 400 attributes
50K Spectra with ~30 lines/spectrum
39 Spatial Data Access – SQL extension (Szalay, Kunszt, Brunner)
http://www.sdss.jhu.edu/htm
Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins.
Every object has a 20-deep Mesh ID.
Given a spatial definition: routine returns up to ~10 covering triangles.
Spatial query is then up to ~10 range queries.
Very fast: 10,000 triangles / second / cpu.
Based on SQL Server Extended Stored Procedure
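Why a covering triangle turns into a range query can be sketched as follows. The ID layout here is an assumption not stated on the slide: the eight root triangles are numbered 8 to 15 and each subdivision level appends two bits, so all depth-20 descendants of a triangle share its ID as a bit prefix and form one contiguous interval. The function names are illustrative.

```python
LEAF_DEPTH = 20  # the slide's "20-deep Mesh ID"


def trixel_depth(htm_id: int) -> int:
    """Depth of a triangle ID: 4 bits for a root triangle, 2 more bits per level."""
    return (htm_id.bit_length() - 4) // 2


def leaf_range(htm_id: int, leaf_depth: int = LEAF_DEPTH) -> tuple[int, int]:
    """Inclusive range of leaf-depth IDs lying inside the given covering triangle."""
    shift = 2 * (leaf_depth - trixel_depth(htm_id))
    if shift < 0:
        raise ValueError("triangle is deeper than the leaf depth")
    low = htm_id << shift
    high = ((htm_id + 1) << shift) - 1
    return low, high


def in_cover(object_id: int, cover: list[int]) -> bool:
    """True if a leaf-depth object ID falls inside any covering triangle."""
    return any(low <= object_id <= high for low, high in map(leaf_range, cover))
```

Each `(low, high)` pair maps directly onto a `BETWEEN low AND high` predicate over an indexed ID column, so a cover of about ten triangles becomes about ten index range scans.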
41 Scenario Design
Astronomers proposed 20 questions
Typical of things they want to do
Each would require a week of programming in tcl / C++ / FTP
Goal: make it easy to answer questions
DB and tools design motivated by this goal
  Implemented utility procedures
  JHU built GUI for Linux clients
Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec=21.023
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10<super galactic latitude (sgb)<10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30” and 60” arc seconds.
Q5: Find all galaxies with a deVaucouleours profile (r¼ falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα >40Å (Hα is the main hydrogen spectral line.)
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2’, and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization.
Q14: Find stars with multiple measurements and have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35, count of galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
42 An easy one
Q7: Provide a list of rare star-like objects.
Found 14,681 buckets, first 140 buckets have 99%
time 62 seconds
CPU bound 226 k records/second (2 cpu) KB/s.
  Select cast((u-g) as int) as ug,
         cast((g-r) as int) as gr,
         cast((r-i) as int) as ri,
         cast((i-z) as int) as iz,
         count(*) as Population
  from stars
  group by cast((u-g) as int), cast((g-r) as int),
           cast((r-i) as int), cast((i-z) as int)
  order by count(*)
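The same binning idea can be sketched outside SQL: truncate each of the four color differences to an integer, count the stars per bucket, and list the sparsest buckets first. This is an illustrative equivalent of the query, not the SkyServer implementation; the tuple layout `(u, g, r, i, z)` and the sample magnitudes are assumptions.

```python
from collections import Counter

Star = tuple[float, float, float, float, float]  # magnitudes u, g, r, i, z


def color_buckets(stars: list[Star]) -> list[tuple[tuple[int, int, int, int], int]]:
    """Count stars per integer color bucket, rarest buckets first."""
    counts: Counter = Counter()
    for u, g, r, i, z in stars:
        # int() truncates toward zero, mirroring the query's cast(... as int)
        key = (int(u - g), int(g - r), int(r - i), int(i - z))
        counts[key] += 1
    return sorted(counts.items(), key=lambda item: item[1])


sample = [
    (20.0, 18.5, 18.0, 17.8, 17.7),
    (20.0, 18.5, 18.0, 17.8, 17.7),
    (22.0, 18.5, 18.0, 17.8, 17.7),  # unusually red in u-g, so it lands alone
]
rarest_bucket, population = color_buckets(sample)[0]
```

The query is one pass with a small hash table of buckets, which is why the slide reports it as CPU bound rather than disk bound.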
43 An Easy One
Q15: Find asteroids
Sounds hard but there are 5 pictures of the object at 5 different times (color filters) and so can “see” velocity.
Image pipeline computes velocity.
Computing it from the 5 color x,y would also be fast
Finds 1,303 objects in 3 minutes, 140MBps. (could go 2x faster with more disks)
  select objId,
         dbo.fGetUrlEq(ra,dec) as url,                     -- return object ID & url
         sqrt(power(rowv,2)+power(colv,2)) as velocity
  from photoObj                                            -- check each object.
  where (power(rowv,2) + power(colv, 2))                   -- square of velocity
        between 50 and …                                   -- huge values = error
44 Q15: Fast Moving Objects
Find near earth asteroids:
Finds 3 objects in 11 minutes (or 52 seconds with an index)
Ugly, but consider the alternatives (c programs and files and…)
  SELECT r.objID as rId, g.objId as gId,
         dbo.fGetUrlEq(g.ra, g.dec) as url
  FROM PhotoObj r, PhotoObj g
  WHERE r.run = g.run and r.camcol=g.camcol
    and abs(g.field-r.field)<2                             -- nearby
    -- the red selection criteria
    and ((power(r.q_r,2) + power(r.u_r,2)) > )
    and r.fiberMag_r between 6 and 22
    and r.fiberMag_r < r.fiberMag_g and r.fiberMag_r < r.fiberMag_i
    and r.parentID=0
    and r.fiberMag_r < r.fiberMag_u and r.fiberMag_r < r.fiberMag_z
    and r.isoA_r/r.isoB_r > 1.5 and r.isoA_r>2.0
    -- the green selection criteria
    and ((power(g.q_g,2) + power(g.u_g,2)) > )
    and g.fiberMag_g between 6 and 22
    and g.fiberMag_g < g.fiberMag_r and g.fiberMag_g < g.fiberMag_i
    and g.fiberMag_g < g.fiberMag_u and g.fiberMag_g < g.fiberMag_z
    and g.parentID=0
    and g.isoA_g/g.isoB_g > 1.5 and g.isoA_g > 2.0
    -- the matchup of the pair
    and sqrt(power(r.cx -g.cx,2)+ power(r.cy-g.cy,2)+power(r.cz-g.cz,2))*(10800/PI())< 4.0
    and abs(r.fiberMag_r-g.fiberMag_g)< 2.0
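The pair-matching term in this query can be unpacked with a short sketch. It assumes, as the expression suggests, that cx, cy, cz are the components of a unit vector pointing at the object; for small angles the straight-line distance between two such vectors is very nearly the angle in radians, and multiplying by 10800/π (that is, 180 × 60 / π) converts radians to arcminutes. The helper names are illustrative.

```python
import math


def unit_vector(ra_deg: float, dec_deg: float) -> tuple[float, float, float]:
    """Unit vector on the celestial sphere for a right ascension and declination."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def separation_arcmin(a: tuple[float, float, float], b: tuple[float, float, float]) -> float:
    """Chord length between two unit vectors, expressed in arcminutes."""
    chord = math.dist(a, b)  # same form as the sqrt(power(...)) term in the query
    return chord * (10800 / math.pi)


def is_close_pair(a, b, limit_arcmin: float = 4.0) -> bool:
    """The query's matchup test: the two detections lie within 4 arcminutes."""
    return separation_arcmin(a, b) < limit_arcmin
```

Working with unit vectors avoids trigonometry inside the join predicate and has no wrap-around problem at ra = 0/360, at the cost of storing three extra columns per object.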
48 Performance (on current SDSS data)
Run times: on 15k$ COMPAQ Server (2 cpu, 1 GB, 8 disk)
  Some take 10 minutes
  Some take 1 minute
  Median ~ 22 sec.
Ghz processors are fast!
  (10 mips/IO, 200 ins/byte)
  2.5 m rec/s/cpu
~1,000 IO/cpu sec
~ 64 MB IO/cpu sec
49 Sequential Scan Speed is Important
In high-dimension data, best way is to search.
Sequential scan covering index is 10x faster
  Seconds vs minutes
SQL scans at 2M records/s/cpu (!)
50 What we learned from the 20 Queries
All have fairly short SQL programs -- a substantial advance over (tcl, C++)
Many are sequential one-pass and two-pass over data
Covering indices make scans run fast
Table valued functions are wonderful but limitations are painful.
Counting, Binning, Histograms VERY common
Spatial indices helpful,
Materialized view (Neighbors) helpful.
51 Cosmo: Computing the Cosmological Constant
Compares simulated galaxy distribution to observed distribution
Measure distance between each pair of galaxies
  A lot of work (10^8 x 10^8 = 10^16 steps)
  Good algorithms make this ~Nlog2N
Needs LARGE main memory
Using Itanium donated by Compaq and SQL server for data store
(this is Alex Szalay, Adrian Pope,… of JHU).
[Chart time axis: decade, year, month, week, day]
52 Summary
We will be able to store everything,
  The challenge is organizing and finding answers.
PhotoServer: Objects vs records vs files,
  XML++ gives us portable objects.
  Similarity search: better than nothing!
Scalability: a solved problem,
  but… Trustworthy & Manageable is not.
TerraServer and TerraService
  Why put everything in the database?
  A prototypical Web Service.
SkyServer and the World Wide Telescope
  Data Mining science data
  Serving Windows/Macintosh/Unix clients with .Net
  Federating Archives with .Net
53 References
These Slides: http://research.Microsoft.com/~Gray/talks/
TerraServer & TerraService
Virtual Observatory (aka World Wide Telescope)
SkyServer
  See documents at
  Download “personal SkyServer” (1GB)