Building Peta-Byte Data Stores. Jim Gray @ Claus Shira Anniversary, European Media Lab, 12 February 2001.


1 Building Peta-Byte Data Stores Jim Gray @ Claus Shira Anniversary European Media Lab 12 February 2001

2 How Much Information Is There?
Soon everything can be recorded and indexed.
Most data will never be seen by humans.
Precious resource: human attention.
Auto-summarization and auto-search are key technologies.
www.lesk.com/mlesk/ksg97/ksg.html
[Figure: scale of information. Axis: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Labeled points: A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

3 ops/s/$ Had Three Growth Phases
Now doubling every year.
1890-1945: Mechanical, relay: 7-year doubling
1945-1985: Tube, transistor, ...: 2.3-year doubling
1985-2000: Microprocessor: 1.0-year doubling
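To get a feel for what these doubling periods imply, the short sketch below converts each phase into a cumulative growth factor. It assumes smooth exponential growth over each date range on the slide; the function name `growth_factor` is illustrative and not from the talk.

```python
# Cumulative growth implied by a constant doubling period.
# Assumption: growth is a smooth exponential within each phase.

def growth_factor(years, doubling_years):
    """Return how many times larger a quantity gets after `years`."""
    return 2.0 ** (years / doubling_years)

# (label, span in years, doubling period in years), taken from the slide
phases = [
    ("1890-1945 mechanical, relay", 1945 - 1890, 7.0),
    ("1945-1985 tube, transistor", 1985 - 1945, 2.3),
    ("1985-2000 microprocessor", 2000 - 1985, 1.0),
]

for label, years, doubling in phases:
    print(f"{label}: about x{growth_factor(years, doubling):,.0f}")
```

The microprocessor phase alone, 15 doublings in 15 years, is a factor of 2^15.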

4 Gilder’s Law: 3x bandwidth/year for 25 more years
Today:
–10 Gbps per channel (per lambda)
–4 channels per fiber: 40 Gbps
–32 fibers/bundle = 1.2 Tbps/bundle
In lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
Aggregate bandwidth doubles every 8 months!
[Figure label: 1 fiber = 25 Tbps]
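The arithmetic behind these figures can be checked in a few lines. This is a sketch using only the numbers on the slide; the constant and function names are illustrative. Note that 32 fibers at 40 Gbps works out to 1.28 Tbps, which the slide rounds to 1.2, and that tripling every year corresponds to a doubling time of a little under 8 months.

```python
import math

# Figures from the slide
GBPS_PER_CHANNEL = 10
CHANNELS_PER_FIBER = 4
FIBERS_PER_BUNDLE = 32

fiber_gbps = GBPS_PER_CHANNEL * CHANNELS_PER_FIBER   # per-fiber capacity
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000  # per-bundle capacity

def doubling_months(annual_factor):
    """Months needed to double, given a constant yearly growth factor."""
    return 12 * math.log(2) / math.log(annual_factor)

print(f"fiber: {fiber_gbps} Gbps, bundle: {bundle_tbps} Tbps")
print(f"3x per year doubles every {doubling_months(3):.1f} months")
```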

5 [Figure: network map. Labels: Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA. 5626 km, 10 hops.]
Participants: Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA

6 Storage capacity beating Moore’s law
3 k$/TB today (raw disk)
3 M$/PB

7 Microsoft TerraServer: http://TerraServer.Microsoft.com/
Build a multi-TB SQL Server database.
Data must be:
–1 TB
–Unencumbered
–Interesting to everyone everywhere
–And not offensive to anyone anywhere
Loaded:
–1.5 M place names from Encarta World Atlas
–7 M sq km USGS DOQ (1 meter resolution)
–10 M sq km USGS topos (2 m)
–1 M sq km from Russian Space Agency (2 m)
On the web (world’s largest atlas).
Sell images with commerce server.

8 TerraServer 4.0 Configuration
3 active database servers (Compaq 8500), plus a passive server (Compaq 8500):
–SQL\Inst1: Topo & Relief Data
–SQL\Inst2: Aerial Imagery
–SQL\Inst3: Aerial Imagery
Web servers: 8 2-proc “Photon” DL360
Logical volume structure:
–MetaData: 101 GB
–Image1-10: 3.4 TB cooked, 10 x 339 GB volumes
–Spread across 3 servers: 2x4 to photo servers, 1x2 for topo/relief server
Storage:
–One rack per database
–All volumes triple mirrored (3x)
–MetaData on 15k rpm 18.2 GB drives
–Image Data on 10k rpm 72.8 GB drives
[Diagram labels: three Compaq storage racks, each with three controllers; volumes E F G H I, L M N O P, S T U V U]

9 TerraServer Activity
Usage Summary, July 1998 to Oct 2000
              Totals          Monthly       Daily
Users         38,285,034      1,367,323     46,127
Page Views    729,063,781     26,037,992    878,390
Image Tiles   3,154,632,827   112,665,458   3,800,762
Db Queries    3,791,078,522   135,395,662   4,567,564
Hits          4,153,678,577   148,345,663   5,004,432
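The Monthly column appears to be the total divided by the 28 calendar months from July 1998 through October 2000. The sketch below checks that reading against the slide's own numbers and also reports the number of days implied by the Daily column. The 28-month divisor is an inference from the date range, not something the slide states.

```python
# Totals, monthly and daily averages copied from the usage summary.
usage = {
    "Users":       (38_285_034,    1_367_323,   46_127),
    "Page Views":  (729_063_781,   26_037_992,  878_390),
    "Image Tiles": (3_154_632_827, 112_665_458, 3_800_762),
    "Db Queries":  (3_791_078_522, 135_395_662, 4_567_564),
    "Hits":        (4_153_678_577, 148_345_663, 5_004_432),
}

MONTHS = 28  # assumption: Jul 1998 through Oct 2000, inclusive

def rounded_div(total, divisor):
    """Integer division rounded half up, avoiding float rounding quirks."""
    return (2 * total + divisor) // (2 * divisor)

for name, (total, monthly, daily) in usage.items():
    derived = rounded_div(total, MONTHS)
    implied_days = total / daily
    print(f"{name}: derived monthly {derived:,} (slide {monthly:,}), "
          f"about {implied_days:.0f} days implied by the daily figure")
```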

10 TerraServer.Microsoft.NET: A Web Service
[Diagram] Before .NET: Web Browser -> Internet -> TerraServer Web Site (HtmlPage, ImageTile) -> TerraServer SQL Db
[Diagram] With .NET: Application Program -> Internet -> TerraServer Web Service -> TerraServer SQL Db
Web service methods: GetAreaByPoint, GetAreaByRect, GetPlaceListByName, GetPlaceListByRect, GetTileMetaByLonLatPt, GetTileMetaByTileId, GetTile, ConvertLonLatToNearestPlace, ConvertPlaceToLonLatPt, ...

11 TerraServer Recent/Current Effort
Added USGS Topographic maps (4 TB)
High availability (4 node cluster with failover)
Integrated with Encarta Online
The other 25% of the US DOQs (photos)
Adding digital elevation maps
Open architecture: publish SOAP interfaces.
Adding multi-layer maps (with UC Berkeley)
Geo-Spatial extension to SQL Server

12 Astronomy is Changing (and so are other sciences)
The World Virtual Observatory
Doubles every 2 years.
Astronomers have a few PB.
Data is public after 2 years.
So: everyone has ½ the data.
Some people have 5% more “private data”.
So, it’s a nearly level playing field:
–Most accessible data is public.
Cyberspace is the new telescope:
–Multi-spectral, very deep, …
Computer Science challenge:
–Organize these datasets
–Provide easy access to them.
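The "everyone has ½ the data" claim follows from the two rates on the slide: if the archive doubles every 2 years and data is released after 2 years, the public part is whatever existed one doubling period ago. A minimal sketch, assuming smooth exponential growth (the function name is illustrative):

```python
def public_fraction(doubling_years, embargo_years):
    """Fraction of all data that is public, if the archive grows
    exponentially and data is released after a fixed embargo."""
    # Data older than the embargo existed `embargo_years` ago, when the
    # archive was smaller by a factor of 2 ** (embargo / doubling).
    return 2.0 ** (-embargo_years / doubling_years)

print(public_fraction(doubling_years=2, embargo_years=2))
```

With both periods equal to 2 years, the fraction is exactly one half; a shorter embargo would push it higher.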

13 The Sloan Digital Sky Survey
Goal: Create a detailed multicolor map of the Northern Sky over 5 years
Special 2.5m telescope
Two surveys in one:
–Photometric survey in 5 bands.
–Spectroscopic redshift survey.
Huge CCD Mosaic:
–30 CCDs 2K x 2K (imaging)
–22 CCDs 2K x 400 (astrometry)
Two high resolution spectrographs:
–2 x 320 fibers, with 3 arcsec diameter.
–R=2000 resolution with 4096 pixels.
–Spectral coverage from 3900Å to 9200Å.
Automated data reduction:
–Over 70 man-years of development effort. (Fermilab + collaboration scientists)
Very high data volume:
–40 TB of raw, 3 TB cooked data (all public).
The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study
SLOAN Foundation, NSF, DOE, NASA

14 The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail than any other measurement before.
[Figure panels: Gregory and Thompson 1978; de Lapparent, Geller and Huchra 1986; da Costa et al. 1995; SDSS Collaboration 2002]

15 Area and Size of Redshift Surveys

16 Experiment with Relational DBMS
See if SQL’s good indexing and scanning compensates for poor object support.
Leverage fast/big/cheap commodity hardware.
Ported 40 GB sample database (from SDSS Sample Scan) to SQL Server 2000.
Building public web site and data server.

17 20 Astronomy Queries
Implemented spatial access extension to SQL (HTM).
Implemented 20 Astronomy Queries in SQL (see paper for details).
15M rows, 378 cols, 30 GB. Can scan it in 8 minutes (disk IO limited).
Many queries run in seconds.
Create Covering Indexes on queried columns.
Create ‘Neighbors’ Table listing objects within 1 arc-minute (5 neighbors on the average) for spatial joins.
Install some more disks!
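The idea of a precomputed neighbors table can be sketched as follows. This is not the talk's implementation: the slides use an HTM spatial index on SQL Server, whereas this toy version uses SQLite and a brute-force pairwise comparison that would only be practical for tiny tables. The table and column names (`objects`, `ra_deg`, `dec_deg`, `neighbors`) and the three sample rows are made up for illustration.

```python
import math
import sqlite3

ARCMIN_DEG = 1.0 / 60.0  # one arc-minute, in degrees

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine form, stable for small angles)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))

def build_neighbors(conn, radius_deg=ARCMIN_DEG):
    """Brute-force O(n^2) neighbor search; stores each pair in both directions."""
    rows = conn.execute("SELECT obj_id, ra_deg, dec_deg FROM objects").fetchall()
    pairs = [
        (a_id, b_id)
        for a_id, a_ra, a_dec in rows
        for b_id, b_ra, b_dec in rows
        if a_id != b_id
        and angular_sep_deg(a_ra, a_dec, b_ra, b_dec) <= radius_deg
    ]
    conn.execute("CREATE TABLE neighbors (obj_id INTEGER, neighbor_obj_id INTEGER)")
    conn.executemany("INSERT INTO neighbors VALUES (?, ?)", pairs)
    return len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL)"
)
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", [
    (1, 180.000, 0.000),
    (2, 180.010, 0.005),  # roughly 0.67 arc-minute from object 1
    (3, 181.000, 0.000),  # a full degree away from object 1
])
n_pairs = build_neighbors(conn)
print(n_pairs, "neighbor rows")
```

Once such a table exists, a spatial join becomes an ordinary equi-join on object ids, which is what lets the relational engine's indexes do the work.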

18 Query to Find Gravitational Lenses
Find all objects within 1 arc-minute of each other that have very similar colors (the color ratios u-g, g-r, r-i differ by less than 0.05m).
[Figure label: 1 arc-minute]

19 SQL Query to Find Gravitational Lenses
Find nearby objects with similar color ratios.
    select count(*)
    from Objects L, Objects O, neighbors N
    where L.Obj_id = N.Obj_id
      and O.Obj_id = N.neighbor_Obj_id
      and L.Obj_id < O.Obj_id                  -- no dups
      and ABS((L.u-L.g)-(O.u-O.g)) < 0.05      -- similar color
      and ABS((L.g-L.r)-(O.g-O.r)) < 0.05      -- ratios
      and ABS((L.r-L.i)-(O.r-O.i)) < 0.05      -- (= dif of log)
      and ABS((L.z-L.r)-(O.z-O.r)) < 0.05
Finds 5223 objects, executes in 6 minutes.
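To see the query's logic in action without the SDSS data, the sketch below runs an equivalent query in SQLite on three invented objects. The schema and magnitudes are hypothetical; only the join and filter conditions mirror the slide. Objects 1 and 2 differ in brightness but share the same colors, so they are the only pair that should be counted.

```python
import sqlite3

# Same conditions as the slide's query, written with explicit joins.
LENS_QUERY = """
SELECT COUNT(*)
FROM objects L
JOIN neighbors N ON L.obj_id = N.obj_id
JOIN objects O ON O.obj_id = N.neighbor_obj_id
WHERE L.obj_id < O.obj_id                       -- count each pair once
  AND ABS((L.u - L.g) - (O.u - O.g)) < 0.05
  AND ABS((L.g - L.r) - (O.g - O.r)) < 0.05
  AND ABS((L.r - L.i) - (O.r - O.i)) < 0.05
  AND ABS((L.z - L.r) - (O.z - O.r)) < 0.05
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects "
    "(obj_id INTEGER PRIMARY KEY, u REAL, g REAL, r REAL, i REAL, z REAL)"
)
conn.execute("CREATE TABLE neighbors (obj_id INTEGER, neighbor_obj_id INTEGER)")

# Invented magnitudes in the five bands u, g, r, i, z.
conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 20.0, 19.5, 19.0, 18.8, 18.7),
    (2, 21.0, 20.5, 20.0, 19.8, 19.7),  # one magnitude fainter, same colors as 1
    (3, 20.0, 19.0, 18.9, 18.8, 18.7),  # different u-g and g-r colors
])
# Neighbor pairs stored in both directions, as a precomputed table would be.
conn.executemany("INSERT INTO neighbors VALUES (?, ?)",
                 [(1, 2), (2, 1), (1, 3), (3, 1)])

lens_count = conn.execute(LENS_QUERY).fetchone()[0]
print(lens_count, "candidate pair(s)")
```

Because magnitudes are logarithmic, a difference of magnitudes is the log of a flux ratio, which is why the slide's comment calls these color ratios.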

20 SQL Results so far
Have run 17 of 20 queries so far.
Working on spectra load and queries now.
Most queries are IO bound (80 MB/sec on 4 disks, in 6 minutes).
Covering indexes reduce execution to < 30 secs.
Common to get grid distributions:
    select convert(int,ra*30)/30.0 as ra_bucket,
           convert(int,dec*30)/30.0 as dec_bucket,
           count(*) as bucket_count
    from Galaxies
    where (u-g) > 1 and r < 21.5
    group by convert(int,ra*30)/30.0, convert(int,dec*30)/30.0
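The grid-distribution pattern bins each galaxy into a cell 1/30 of a degree (2 arc-minutes) on a side and counts the members of each cell. Here is a small SQLite sketch of the same idea with invented rows; the table layout and the `dec_deg` column name are assumptions, and SQLite's `CAST(... AS INTEGER)` stands in for SQL Server's `convert(int, ...)`. Both truncate toward zero, so negative declinations would need extra care.

```python
import sqlite3

CELLS_PER_DEGREE = 30  # 1/30 degree cells, as in the slide

GRID_QUERY = """
SELECT CAST(ra * :n AS INTEGER) / (:n * 1.0)      AS ra_bucket,
       CAST(dec_deg * :n AS INTEGER) / (:n * 1.0) AS dec_bucket,
       COUNT(*)                                   AS bucket_count
FROM galaxies
WHERE (u - g) > 1 AND r < 21.5
GROUP BY ra_bucket, dec_bucket
ORDER BY ra_bucket, dec_bucket
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE galaxies (ra REAL, dec_deg REAL, u REAL, g REAL, r REAL)"
)
conn.executemany("INSERT INTO galaxies VALUES (?, ?, ?, ?, ?)", [
    (10.01, 5.01, 22.0, 20.5, 20.0),  # passes both cuts
    (10.02, 5.02, 22.5, 21.0, 21.0),  # passes, same cell as the row above
    (10.05, 5.01, 22.0, 20.5, 20.0),  # passes, next cell over in ra
    (10.01, 5.01, 21.0, 20.5, 20.0),  # fails the u-g > 1 color cut
    (10.01, 5.01, 23.0, 20.5, 22.0),  # fails the r < 21.5 brightness cut
])

rows = conn.execute(GRID_QUERY, {"n": CELLS_PER_DEGREE}).fetchall()
for ra_bucket, dec_bucket, bucket_count in rows:
    print(f"cell ra={ra_bucket:.4f} dec={dec_bucket:.4f}: {bucket_count} galaxies")
```

SQLite accepts the column aliases in `GROUP BY`; on engines that do not, the bucket expressions have to be repeated there.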

21 Summary
Technology:
–1M$/PB: store everything online (twice!)
–Gigabit to the desktop: store it anywhere
So: you can store everything, anywhere in the world, online everywhere.
Research driven by apps:
–TerraServer
–National Virtual Astronomy Observatory.

