Where The Rubber Meets the Sky: Giving Access to Science Data. Talk at National Institute of Informatics, Tokyo, Japan, October 2005. Jim Gray, Microsoft.

Presentation transcript:

1 Where The Rubber Meets the Sky: Giving Access to Science Data
Talk at National Institute of Informatics, Tokyo, Japan, October 2005
Jim Gray, Microsoft Research
Alex Szalay, Johns Hopkins University

2 Abstract
I have been working with some astronomers for the last 6 years, trying to apply DB technology to science problems. These are some lessons I learned.
Paper: “Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science,” Jim Gray; Alexander S. Szalay; MSR-TR, October 2004

3 New Science Paradigms
A thousand years ago: science was empirical, describing natural phenomena
Last few hundred years: a theoretical branch, using models and generalizations
Last few decades: a computational branch, simulating complex phenomena
Today: data exploration (eScience), unifying theory, experiment, and simulation using data management and statistics
  – Data captured by instruments or generated by simulator
  – Processed by software
  – Scientist analyzes database / files

4 The Big Picture
[Figure: Experiments & Instruments, Simulations, Literature, and Other Archives supply facts; questions go in and answers come out.]
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist with others?
Data query and visualization tools
Support/training
Performance
  – Execute queries in a minute
  – Batch (big) query scheduling

5 Experiment Budgets: ¼…½ Software
Software for
  – Instrument scheduling
  – Instrument control
  – Data gathering
  – Data reduction
  – Database
  – Analysis
  – Visualization
Millions of lines of code
Repeated for experiment after experiment
Not much sharing or learning
Let’s work to change this
Identify generic tools
  – Workflow schedulers
  – Databases and libraries
  – Analysis packages
  – Visualizers
  – …

6 Data Lifecycle
Raw data → primary data → derived data
Data has bugs:
  – Instrument bugs
  – Pipeline bugs
Data comes in versions
  – Later versions fix known bugs
  – Just like software (indeed, data is software)
Can’t “un-publish” bad data.
[Figure: instrument or simulator, pipelines, and other data producing Level 0 (raw), Level 1 (calibrated), and Level 2 (derived) data.]

7 Data Inflation – Data Pyramid
Level 1A grows X TB/year, ~0.4X TB/year compressed (level 1A in NASA terms)
Level 2 derived data products are ~10x smaller, but there are many: L2 ≈ L1
Publish a new edition each year
  – Fixes bugs in data
  – Must preserve old editions
  – Creates data pyramid
Store each edition
  – 1, 2, 3, 4, … N ~ N² bytes
Net: data inflation, L2 ≥ L1
[Figure: 4 editions (E1 to E4) of level 1A source data and 4 editions of 4 level 2 derived data products, plotted over time. Each derived product is small, but they are numerous. This proliferation, combined with the data pyramid, implies that level 2 data more than doubles the total storage volume.]
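
The data pyramid arithmetic can be sketched in a few lines. This is an illustration only: it assumes each yearly edition republishes everything collected so far, so edition k is k times the yearly growth, and the function names are made up for the example.

```python
def edition_sizes(n_years, yearly_growth=1.0):
    """Size of each edition if edition k republishes k years of data (assumption)."""
    return [k * yearly_growth for k in range(1, n_years + 1)]


def pyramid_total(n_years, yearly_growth=1.0):
    """Total storage when every old edition is preserved: 1 + 2 + ... + N."""
    return sum(edition_sizes(n_years, yearly_growth))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        # The total equals N(N+1)/2, which grows like N squared.
        print(n, pyramid_total(n), n * (n + 1) / 2)
```

Doubling the number of years roughly quadruples the stored volume, which is the N² growth the slide refers to.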

8 The Year 5 Problem
Data arrives at R bytes/year
New storage & processing
  – Need to buy R units in year N
Data inflation means ~N²R
  – Need to buy NR units
Depreciate over 3 years
  – After year 3, need to buy N²R + (N-3)²R
Moore’s law: 60%/year price decline
Capital expense peaks at year 5
See the 6x Over-Power slide next
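
A rough check of the year-5 peak, taking the slide's purchase formula N²R + (N-3)²R at face value. Two things here are assumptions, not statements from the talk: the replacement term only applies once gear is more than 3 years old, and "60%/year" is read as 60% more capacity per dollar each year (unit price divided by 1.6 annually). Function names are illustrative.

```python
def purchase_units(year, rate=1.0, lifetime=3):
    """Units to buy in a given year: new capacity plus replacement of retired gear."""
    new_capacity = year ** 2 * rate
    # Gear bought `lifetime` years ago is retired and must be replaced (assumption).
    replacement = (year - lifetime) ** 2 * rate if year > lifetime else 0.0
    return new_capacity + replacement


def capital_expense(year, improvement=0.60):
    """Relative spend: units bought times a unit price that falls every year."""
    unit_price = 1.0 / (1.0 + improvement) ** year  # assumed reading of "60%/year"
    return purchase_units(year) * unit_price


def peak_year(horizon=10):
    """Year with the largest capital expense within the horizon."""
    return max(range(1, horizon + 1), key=capital_expense)


if __name__ == "__main__":
    for year in range(1, 9):
        print(year, round(capital_expense(year), 3))
    print("peak year:", peak_year())
```

Under these assumptions the spend rises until year 5 and falls afterwards. Reading "60%/year" as the price dropping to 40% each year moves the peak much earlier, so the result depends on that interpretation.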

9 6x Over-Power Ratio
If you think you need X raw capacity, then you probably need 6X
  – Reprocessing
  – Backup copies
  – Versions
  – …
Hardware is cheap; your time is precious.
[Figure: database copies and sizes: PubDB 3.6 TB, DR3C 2.4 TB, DR2C 1.8 TB, DR2M 1.8 TB, DR2P 1.8 TB, DR3M 2.4 TB, DR3P 2.4 TB]

10 Data Loading
Data from outside
  – Is full of bugs
  – Is not in your format
Advice
  – Get it in a “Universal Format” (e.g. Unicode CSV)
  – Create a blood-brain barrier: quarantine in a “load database”
  – Scrub the data
      Cross-check everything you can
      Check data statistics for sanity
      Reject or repair bad data
      Generate detailed bug reports (needed to send rejection upstream)
  – Expect to reload many times
      Automate everything!
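
The scrub step might look like the sketch below. It is a minimal illustration, not the SDSS loader: the column names (obj_id, ra, dec) and the specific range checks are assumptions chosen for the example, and the "load database" is just an in-memory CSV string.

```python
import csv
import io


def scrub_rows(csv_text):
    """Split quarantined CSV rows into clean rows and per-line bug reports."""
    clean, bug_reports, seen_ids = [], [], set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        problems = []
        try:
            ra, dec = float(row["ra"]), float(row["dec"])
            if not 0.0 <= ra < 360.0:
                problems.append(f"ra out of range: {ra}")
            if not -90.0 <= dec <= 90.0:
                problems.append(f"dec out of range: {dec}")
        except (KeyError, TypeError, ValueError):
            problems.append("ra/dec missing or not numeric")
        obj_id = row.get("obj_id")
        if obj_id in seen_ids:
            problems.append(f"duplicate obj_id: {obj_id}")
        seen_ids.add(obj_id)
        if problems:
            # Detailed reports are what gets sent back upstream with the rejection.
            bug_reports.append((line_no, problems))
        else:
            clean.append(row)
    return clean, bug_reports


if __name__ == "__main__":
    sample = "obj_id,ra,dec\n1,10.5,-5.0\n2,400.0,12.0\n1,20.0,30.0\n3,abc,1.0\n"
    good, bad = scrub_rows(sample)
    print(len(good), "clean rows")
    for line_no, problems in bad:
        print("line", line_no, problems)
```

Only rows that pass every check would move from the load database into the published one; because the function is pure, the same load can be rerun as often as needed.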

11 Performance Prediction & Regression
Database grows exponentially
Set up response-time requirements
  – For load
  – For access
Define a workload to measure each
Run it regularly to detect anomalies
SDSS uses
  – One week to reload
  – 20 queries with response times of 10 sec to 10 min
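
A regression workload of this kind can be sketched as a list of named queries, each with a response-time limit. The harness below is illustrative: the workload format, the function name, and the injectable clock are assumptions, and the queries are stand-in callables rather than real SQL.

```python
import time


def run_workload(workload, clock=time.perf_counter):
    """Run each (name, query_fn, limit_seconds) and return the ones that were too slow."""
    anomalies = []
    for name, query_fn, limit_seconds in workload:
        start = clock()
        query_fn()
        elapsed = clock() - start
        if elapsed > limit_seconds:
            anomalies.append((name, elapsed, limit_seconds))
    return anomalies


if __name__ == "__main__":
    workload = [
        ("fast_lookup", lambda: sum(range(1_000)), 10.0),
        ("big_scan", lambda: sum(range(1_000_000)), 600.0),
    ]
    slow = run_workload(workload)
    print("anomalies:", slow)
```

Running the same workload on a schedule and keeping the timings would show drift as the database grows, before a query actually breaks its limit.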

12 Data Subsets for Science and Development
Offer 1GB, 10GB, …, full subsets
Wonderful tool for you: design & debug
Good tool for scientists
  – Experiment on a subset
  – Not for needle in a haystack, but good for global stats
Challenge: how to make statistically valid subsets?
  – Seems domain specific
  – Seems problem specific
  – But there must be some general concepts.
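
One simple way to cut nested subsets is to hash each object identifier and keep the objects whose hash falls below a threshold. This is only a sketch of a generic technique, not something the talk proposes, and it does not answer the statistical-validity question: a uniform sample by ID ignores spatial clustering and rare objects. The function names and bucket count are assumptions.

```python
import hashlib


def in_subset(object_id, fraction, buckets=10_000):
    """Deterministically decide whether an object belongs to a subset of the given fraction."""
    digest = hashlib.sha256(str(object_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return bucket < fraction * buckets


def make_subset(object_ids, fraction):
    """Keep the objects selected by in_subset, preserving input order."""
    return [oid for oid in object_ids if in_subset(oid, fraction)]


if __name__ == "__main__":
    ids = list(range(100_000))
    small = make_subset(ids, 0.01)
    medium = make_subset(ids, 0.10)
    print(len(small), len(medium), set(small) <= set(medium))
```

Because the decision depends only on the identifier, the subsets are reproducible across reloads, and every smaller subset is contained in the larger ones.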

13 Data Curation Problem Statement
Once published, scientific data needs to be available forever, so that the science can be reproduced/extended.
What does that mean?
  – Data can be characterized as
      Primary data: could not be reproduced
      Derived data: could be derived from primary data
  – Meta-data: how the data was collected/derived is primary
      Must be preserved
      Includes design docs, software, pubs, personal notes, teleconferences
NASA “level 0”

14 Schema (aka metadata)
Everyone starts with the same schema
Then they start arguing about semantics.
Virtual Observatory:
  – Metadata based on Dublin Core
  – Universal Content Descriptors (UCD): capture quantitative concepts and their units; reduced from ~100,000 tables in literature to ~1,000 terms
  – VOTable: a schema for answers to questions
  – Common queries: Cone Search and Simple Image Access Protocol, SQL
  – Registry: still a work in progress
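
The idea behind a cone search is to return every object within a given angular radius of a sky position. The sketch below illustrates only the geometry on an in-memory list; it is not the Virtual Observatory protocol itself, and the record layout (dicts with ra and dec in decimal degrees) and function names are assumptions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form stays accurate for the small separations typical of cone searches.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog entries within radius_deg of the position (ra, dec)."""
    return [obj for obj in catalog
            if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg]


if __name__ == "__main__":
    catalog = [
        {"id": "a", "ra": 10.0, "dec": 0.0},
        {"id": "b", "ra": 10.5, "dec": 0.0},
        {"id": "c", "ra": 190.0, "dec": 0.0},
    ]
    print([obj["id"] for obj in cone_search(catalog, 10.0, 0.0, 1.0)])
```

A real archive would answer this from a spatial index rather than a linear scan; the scan just makes the definition concrete.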

15 Archive Challenges
Cost of administering storage:
  – Presently 10x to 100x the hardware cost
Resist attack: geographic diversity
  – At 1 GBps it takes 12 days to move a PB
  – Store it in two (or more) places online (on disk): a geo-plex
  – Scrub it continuously (look for errors)
  – On failure, use the other copy until the failure is repaired, then refresh the lost copy from the safe copy
  – Can organize the copies differently (e.g. one by time, one by space)
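
Both the transfer-time figure and the scrub-and-refresh loop can be sketched briefly. This is an illustration under stated assumptions: a petabyte is taken as 10^15 bytes and 1 GBps as 10^9 bytes per second, the two copies are modeled as in-memory dicts of blocks, and the function names and the use of SHA-256 checksums are choices made for the example.

```python
import hashlib


def transfer_days(num_bytes, bytes_per_second):
    """Days needed to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second / 86_400


def _is_good(copy, block_id, expected_hex):
    """True if the copy holds the block and its SHA-256 matches the recorded checksum."""
    data = copy.get(block_id)
    return data is not None and hashlib.sha256(data).hexdigest() == expected_hex


def scrub_geoplex(copy_a, copy_b, checksums):
    """Compare both copies against recorded checksums and refresh a bad block from the good copy.

    Returns a list of (copy_name, block_id) pairs that were repaired.
    """
    repaired = []
    for block_id, expected_hex in checksums.items():
        ok_a = _is_good(copy_a, block_id, expected_hex)
        ok_b = _is_good(copy_b, block_id, expected_hex)
        if ok_a and not ok_b:
            copy_b[block_id] = copy_a[block_id]
            repaired.append(("b", block_id))
        elif ok_b and not ok_a:
            copy_a[block_id] = copy_b[block_id]
            repaired.append(("a", block_id))
    return repaired


if __name__ == "__main__":
    print(round(transfer_days(1e15, 1e9), 1), "days to move 1 PB at 1 GB/s")
    site_a = {"blk1": b"sky", "blk2": b"data"}
    site_b = {"blk1": b"sky", "blk2": b"dXta"}  # simulated corruption
    sums = {k: hashlib.sha256(v).hexdigest() for k, v in site_a.items()}
    print(scrub_geoplex(site_a, site_b, sums))
```

The arithmetic gives about 11.6 days, in line with the 12 days quoted on the slide, which is why a lost copy is refreshed block by block rather than by re-shipping the archive.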
