GigaByte TeraByte PetaByte ExaByte In Search of PetaByte Databases Jim Gray Tony Hey.


1 GigaByte TeraByte PetaByte ExaByte In Search of PetaByte Databases Jim Gray Tony Hey

2 The Cost of Storage (heading for 1K$/TB soon)
Price snapshots: 12/1/1999, 9/1/2000, 9/1/2001

3 Summary
DBs own the sweet-spot:
  – 1GB to 100TB
Big data is not in databases
HPTS does not own high performance storage (BIG DATA)
We should
Cost of storage is people:
  – Performance goal: 1 Admin per PB

4 State is Expensive
Stateless clones are easy to manage
  – App servers are middle tier
  – Cost goes to zero with Moore's law.
  – One admin per 1,000 clones.
  – Good story about scaleout.
Stateful servers are expensive to manage
  – 1TB to 100TB per admin
  – Storage cost is going to zero (2k$ to 200k$).
Cost of storage is management cost

5 The Personal Petabyte (someday)
Personal 100 GB today
It's coming (2M$ today… 2K$ in 10 years)
Today the pack rats have ~10-100GB
  – 1-10 GB in text (eMail, PDF, PPT, OCR…)
  – 10GB – 50GB tiff, mpeg, jpeg,…
  – Some have 1TB (voice + video).
Video can drive it to 1PB.
Online PB affordable in 10 years.
Get ready: tools to capture, manage, organize, search, display will be big app.
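The "2M$ today… 2K$ in 10 years" projection is a 1000x price drop over ten years, which is close to the price halving every year (2^10 = 1024). A minimal sketch of that arithmetic, assuming a constant yearly decline (an assumption made here for illustration; the slide gives only the two endpoints, and the function name is hypothetical):

```python
import math

def annual_price_factor(price_now: float, price_later: float, years: float) -> float:
    """Constant yearly multiplier that takes price_now to price_later."""
    return (price_later / price_now) ** (1.0 / years)

# Endpoints from the slide: a petabyte costs ~2M$ today and ~2K$ in 10 years.
factor = annual_price_factor(2_000_000, 2_000, 10)

# Time for the price to halve under that constant multiplier.
halving_time_years = math.log(0.5) / math.log(factor)

print(f"yearly price multiplier: {factor:.3f}")
print(f"price halves about every {halving_time_years:.2f} years")
```

Under this assumption the multiplier comes out at roughly one half per year, which is the same pace as the "size doubling yearly" trend quoted for large databases.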

6 An Image Database: TerraServer
10 TB
Snapshot of the USA (1 meter granularity)
  – 10,000,000,000,000 (=10^13) sq meters
  – == 15TB raw (some duplicates)
  – == 5 TB cooked
    5x compression + Image pyramid + gazetteer
Interesting things:
  – It's all in the Database
  – Clustered (allows flaky hardware, online upgrade)
  – Triplexed – snapshot each night

7 Databases (== SQL)
VLDB survey (Winter Corp).
10 TB to 100TB DBs.
  – Size doubling yearly
  – Riding disk Moore's law
  – 10,000 disks at 18GB is 100TB cooked.
Mostly DSS and data warehouses.
Some media managers
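The disk figures on this slide can be checked with a few lines of arithmetic: 10,000 disks at 18GB is 180TB raw, so "100TB cooked" means a little over half the raw capacity ends up as usable database space. The sketch below also estimates how long "doubling yearly" takes to carry a 100TB database to a petabyte. It assumes decimal units (1 TB = 10^12 bytes) and reads "cooked" as usable capacity after overheads; the slide does not itemise those overheads, and the names are hypothetical.

```python
import math

GB = 10**9
TB = 10**12
PB = 10**15

# Raw capacity of the disk farm quoted on the slide.
raw_bytes = 10_000 * 18 * GB          # 180 TB
cooked_bytes = 100 * TB               # usable ("cooked") capacity from the slide
usable_fraction = cooked_bytes / raw_bytes

def years_to_reach(start_bytes: float, target_bytes: float,
                   growth_per_year: float = 2.0) -> float:
    """Years of compound growth needed to go from start_bytes to target_bytes."""
    return math.log(target_bytes / start_bytes) / math.log(growth_per_year)

print(f"raw capacity: {raw_bytes / TB:.0f} TB")
print(f"usable fraction: {usable_fraction:.0%}")
print(f"100 TB -> 1 PB at 2x/year: {years_to_reach(100 * TB, PB):.1f} years")
```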

8 DB iFS
DB2: leave the files where they live
  – Referential integrity between DBMS and FS.
Oracle: put the files in the DBMS
  – One security model
  – One storage management model
  – One space manager
  – One recovery manager
  – One replication system
  – One thing to tune.
  – Features: transactions,….

9 Interesting facts
No DBMSs beyond 100TB.
Most bytes are in files.
The web is file centric
eMail is file centric.
Science (and batch) is file centric.
But….
SQL performance is better than CIFS/NFS..
  – CISC vs RISC

10 BaBar: the biggest DB
350 TB
Uses Objectivity
SLAC events
Linux cluster scans DB looking for patterns

11 Hotmail / Yahoo
300 TB (cooked)
Clone front ends ~10,000 @ hotmail.
Application servers
  – ~100 @ hotmail
  – Get mail box
  – Get/put mail
  – Disk bound
~30,000 disks
~20 admins

12 AOL (msn) (1PB?)
10 B transactions per day (10% of that)
Huge storage
Huge traffic
Lots of eye candy
DB used for security/accounting.
GUESS AOL is a petabyte
  – (40M x 10MB = 400 x 10^12)

13 Google
1.5PB as of last spring
8,000 no-name PCs
  – Each 1/3U, 2 x 80 GB disk, 2 cpu 256MB ram
1.4 PB online.
2 TB ram online
8 TeraOps
Slice-price is 1K$ so 8M$.
15 admins (!) (== 1/100TB).
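The totals on this slide follow from the per-PC figures, and the admin ratio can be set against the Hotmail numbers and the "1 Admin per PB" goal from the summary. A back-of-envelope sketch, assuming decimal units and taking the slide's rounded figures at face value (the function name is hypothetical):

```python
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

PCS = 8_000

# Per-PC figures from the slide: 2 x 80 GB disks, 256 MB RAM, 1K$ per slice.
raw_disk_bytes = PCS * 2 * 80 * GB     # 1.28 PB of raw disk
ram_bytes = PCS * 256 * MB             # about 2 TB of RAM
hardware_cost_usd = PCS * 1_000        # 8M$

def tb_per_admin(total_bytes: int, admins: int) -> float:
    """Storage managed per administrator, in TB."""
    return total_bytes / admins / TB

google = tb_per_admin(1500 * TB, 15)   # slide: 1.5 PB, 15 admins
hotmail = tb_per_admin(300 * TB, 20)   # Hotmail slide: 300 TB, ~20 admins
goal = tb_per_admin(1 * PB, 1)         # summary slide: 1 admin per PB

print(f"raw disk: {raw_disk_bytes / PB:.2f} PB, RAM: {ram_bytes / TB:.2f} TB")
print(f"hardware: {hardware_cost_usd / 1e6:.0f}M$")
print(f"TB per admin: hotmail {hotmail:.0f}, google {google:.0f}, goal {goal:.0f}")
```

The raw-disk product comes out slightly below the 1.4 to 1.5 PB quoted on the slide, so the slide's counts are presumably rounded. The admin ratio is the point of the comparison: Google's 100 TB per admin is still a factor of ten short of the stated goal.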

14 Computational Science
Traditional Empirical Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
Computational Science
  – Data captured by instruments
    Or data generated by simulator
  – Processed by software
  – Placed in a database
  – Scientist analyzes database
  – tcl scripts on C programs
  – on ASCII files

15 Astronomy
I've been trying to apply DB to astronomy
Today they are at 10TB per data set
Heading for Petabytes
Using Objectivity
Trying SQL (talk to me offline)

16 Fast Moving Objects
Find near earth asteroids:

    SELECT r.objID as rId, g.objId as gId, r.run, r.camcol,
           r.field as field, g.field as gField,
           r.ra as ra_r, r.dec as dec_r,
           g.ra as ra_g, g.dec as dec_g,
           sqrt( power(r.cx-g.cx,2) + power(r.cy-g.cy,2) + power(r.cz-g.cz,2) )*(10800/PI()) as distance
    FROM PhotoObj r, PhotoObj g
    WHERE r.run = g.run and r.camcol = g.camcol
      and abs(g.field-r.field) < 2            -- the match criteria
      -- the red selection criteria
      and ((power(r.q_r,2) + power(r.u_r,2)) > 0.111111)
      and r.fiberMag_r between 6 and 22
      and r.fiberMag_r < r.fiberMag_g
      and r.fiberMag_r < r.fiberMag_i
      and r.parentID = 0
      and r.fiberMag_r < r.fiberMag_u
      and r.fiberMag_r < r.fiberMag_z
      and r.isoA_r/r.isoB_r > 1.5
      and r.isoA_r > 2.0
      -- the green selection criteria
      and ((power(g.q_g,2) + power(g.u_g,2)) > 0.111111)
      and g.fiberMag_g between 6 and 22
      and g.fiberMag_g < g.fiberMag_r
      and g.fiberMag_g < g.fiberMag_i
      and g.fiberMag_g < g.fiberMag_u
      and g.fiberMag_g < g.fiberMag_z
      and g.parentID = 0
      and g.isoA_g/g.isoB_g > 1.5
      and g.isoA_g > 2.0
      -- the matchup of the pair
      and sqrt( power(r.cx-g.cx,2) + power(r.cy-g.cy,2) + power(r.cz-g.cz,2) )*(10800/PI()) < 4.0
      and abs(r.fiberMag_r-g.fiberMag_g) < 2.0

Finds 3 objects in 11 minutes
Ugly, but consider the alternatives (C programs and files and…)
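The `distance` expression in the query is the straight-line (chord) distance between two unit vectors, multiplied by 10800/PI. That factor is 180 x 60 / PI, so it converts radians to arcminutes; for small separations the chord length is nearly equal to the angle, which makes the `< 4.0` test read as "within about 4 arcminutes". A minimal Python sketch of the same calculation, assuming `cx`, `cy`, `cz` are the usual unit-vector components derived from `ra` and `dec` in degrees (that convention and the helper names are assumptions made for this illustration):

```python
import math

def unit_vector(ra_deg: float, dec_deg: float) -> tuple:
    """Unit vector (cx, cy, cz) for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def chord_distance_arcmin(a: tuple, b: tuple) -> float:
    """Same formula as the query: chord length scaled by 10800/pi."""
    chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return chord * (10800 / math.pi)

# Two positions one arcminute apart in declination.
r = unit_vector(180.0, 0.0)
g = unit_vector(180.0, 1.0 / 60.0)

print(f"{chord_distance_arcmin(r, g):.6f} arcmin")
```

Working on unit vectors avoids trigonometric calls inside the join predicate, which may be why the query is written this way rather than with a spherical-distance formula.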

17 Particle Physics – Hunting the Higgs and Dark Matter
April 2006: First pp collisions at TeV energies at the Large Hadron Collider in Geneva
ATLAS/CMS Experiments involve 2000 physicists from 200 organizations in US, EU, Asia
Need to store, access, process, analyse 10 PB/yr with 200 TFlop/s distributed computation
Building hierarchical Grid infrastructure to distribute data and computation
Many 10s of million $ funding – GriPhyN, PPDataGrid, iVDGL, DataGrid, DataTag, GridPP
ExaBytes and PetaFlop/s by 2015
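To get a feel for 10 PB/yr, it helps to convert the yearly volume into a sustained data rate. A rough sketch, assuming decimal units and data arriving evenly over a 365-day year (real instruments run in bursts, so peak rates would be higher; the function name is hypothetical):

```python
PB = 10**15
MB = 10**6
SECONDS_PER_YEAR = 365 * 24 * 3600

def average_rate_mb_per_s(bytes_per_year: float) -> float:
    """Average sustained rate, in MB/s, for a given yearly data volume."""
    return bytes_per_year / SECONDS_PER_YEAR / MB

lhc_rate = average_rate_mb_per_s(10 * PB)
print(f"10 PB/yr is about {lhc_rate:.0f} MB/s sustained")
```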

18 Astronomy: Past and Future of the Universe
Virtual Observatories – NVO, AVO, AstroGrid
  – Store all wavelengths, need distributed joins
  – NVO 500 TB/yr from 2004
Laser Interferometer Gravitational-Wave Observatory
  – Search for direct evidence for gravitational waves
  – LIGO 250 TB/yr, random streaming from 2002
VISTA Visible and IR Survey Telescope in 2004
  – 250 GB/night, 100 TB/yr, Petabytes in 10 yrs
New phase of astronomy, storing, searching and analysing Petabytes of data

19 Engineering, Environment and Medical Applications
Real-Time Health Monitoring
  – UK DAME project for Rolls Royce Aero Engines
  – 1 GB sensor data/flight, 100,000 engine hours/day
Earth Observation
  – ESA satellites generate 100 GB/day
  – NASA 15 PB by 2007
Medical Images to Information
  – UK IRC Project on mammograms and MRIs
  – 100 MB/mammogram, UK 3M/yr, US 26M/yr
  – 200 MB/patient, Oxford 500 women/yr
Many Petabytes of data of real commercial interest
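The mammogram figures show how quickly medical imaging reaches petabyte scale. A short sketch of the yearly volumes, assuming decimal units and reading "UK 3M/yr, US 26M/yr" as mammograms per year (the slide is terse, so that reading is an assumption, and the names are hypothetical):

```python
MB, TB, PB = 10**6, 10**12, 10**15

MAMMOGRAM_BYTES = 100 * MB   # per-image size quoted on the slide

def yearly_volume_bytes(images_per_year: int) -> int:
    """Total bytes produced per year at the quoted image size."""
    return images_per_year * MAMMOGRAM_BYTES

uk_bytes = yearly_volume_bytes(3_000_000)
us_bytes = yearly_volume_bytes(26_000_000)

print(f"UK: {uk_bytes / TB:.0f} TB/yr")
print(f"US: {us_bytes / PB:.1f} PB/yr")
```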

20 Grids, Databases and Cool Tools
Scientists:
  – will build Grids based on Globus Open Source middleware
  – will have instruments generating Petabytes of data
  – will annotate their data with XML-based metadata
Realize a version of Licklider and Taylor's original vision of resource sharing and the ARPANET
TP and DB community:
  – Should assist in developing Grid Interfaces to DBMS
  – Should develop Cool Tools for Grid Services
There will be commercial Grid applications and viable business opportunities

21 Summary
DBs own the sweet-spot:
  – 1GB to 100TB
Big data is not in databases
HPTS crowd is not really high performance storage (BIG DATA)
Cost of storage is people:
  – Performance goal: 1 Admin per PB

