The World Wide Telescope, an Archetype for Online-Science. Jim Gray (Microsoft), Alex Szalay (Johns Hopkins University). Microsoft Academic Days in Silicon Valley.

Presentation transcript:

1 The World Wide Telescope an Archetype for Online-Science Jim Gray (Microsoft) Alex Szalay (Johns Hopkins University) Microsoft Academic Days in Silicon Valley

2 First, an aside: 2 other projects
TerraServer – joint with USGS
Giga Byte File Transfers – joint with Caltech and CERN

3 TerraServer
Seamless mosaic of US
~20 TB of imagery
30 M web hits/day
A scalability laboratory
TerraServer Bricks – A High Availability Cluster Alternative (2004)
TerraServer Cluster and SAN Experience (2004)
TerraService.NET: An Introduction to Web Services (2002)
Microsoft TerraServer: A Spatial Data Warehouse (1999)
The Microsoft TerraServer™ (1998)
(Diagram label: KVM / IP)

4 Giga Byte Per Second File Mover
CERN to Pasadena
  – Windows TCP/IP, NTFS
  – Quantifying performance
  – Working on better algorithms
  – Opteron
  – Disk-to-disk at 550 MBps now (~2 TB/hour)
GOAL: 1 GBps disk-to-disk
Gigabyte Bandwidth Enables Global Co-Laboratories
Sequential Disk IO Tests for GBps Land Speed Record
(Chart labels: OC192 = 9.9 Gbps, PCI-X limit, TCP limit)

5 The World Wide Telescope an Archetype for Online-Science Jim Gray (Microsoft) Alex Szalay (Johns Hopkins University) Microsoft Academic Days in Silicon Valley

6 The Evolution of Science
Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
Analytical Science
  – Scientist builds analytical model
  – Makes predictions
Computational Science
  – Simulate analytical model
  – Validate model and make predictions
Data Exploration Science
  – Data captured by instruments or generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files

7 Information Avalanche
In science, industry, government, ...
  – better observational instruments, and
  – better simulations
  are producing a data avalanche
Examples
  – BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
  – CERN: LHC will generate 1 GB/s, ~10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar: 100 TB/movie
New emphasis on informatics:
  – Capturing, Organizing, Summarizing, Analyzing, Visualizing
(Image labels: courtesy C. Meneveau & A. JHU; BaBar, Stanford; Space Telescope; P&E Gene Sequencer)

8 The Big Picture
(Diagram labels: Experiments & Instruments, Simulations, Literature, Other Archives; facts; questions; answers)
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist with others?
Query and Vis tools
Support/training
Performance
  – Execute queries in a minute
  – Batch query scheduling

9 FTP - GREP
Download (FTP and GREP) is not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years
Oh, and 1 PB is ~3,000 disks
At some point we need
  – indices to limit search
  – parallel data search and analysis
This is where databases can help
Next generation technique: Data Exploration
  – Bring the analysis to the data!
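
The remark that indices limit the search can be illustrated with a toy example. This is only a sketch, not SkyServer code: the catalog, its right-ascension key, and the function names are all invented for illustration. A GREP-style scan examines every row, while a sorted index answers the same range query with two binary searches.

```python
import bisect

def scan_range(rows, lo, hi):
    """GREP-style full scan: examine every row."""
    return [r for r in rows if lo <= r[0] <= hi]

def build_index(rows):
    """Sort once up front (N log N); return (keys, sorted_rows)."""
    sorted_rows = sorted(rows, key=lambda r: r[0])
    keys = [r[0] for r in sorted_rows]
    return keys, sorted_rows

def index_range(index, lo, hi):
    """Range query via two binary searches on the sorted keys."""
    keys, sorted_rows = index
    start = bisect.bisect_left(keys, lo)
    stop = bisect.bisect_right(keys, hi)
    return sorted_rows[start:stop]

# Hypothetical catalog rows: (right ascension in degrees, object id)
catalog = [((i * 37) % 360 + 0.5, i) for i in range(1000)]
index = build_index(catalog)

hits_scan = scan_range(catalog, 10.0, 12.0)
hits_index = index_range(index, 10.0, 12.0)
```

Both calls return the same rows; only the amount of data touched differs, which is the whole point at terabyte scale.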

10 The Speed Problem
Many users want to search the whole DB: ad hoc queries, often combinatorial
Want ~1 minute response
Brute force (parallel search):
  – 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
Indices (limit search, do column store)
  – 1,000x less equipment: 1M$/PB
Pre-compute answer
  – No one knows how to do it for all questions
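
The brute-force estimate can be checked with a little arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), perfectly parallel disks, and the slide's 50 MBps per disk and one-minute target; the function name is made up for the example. Under those assumptions the count comes out at a few hundred thousand disks, within a factor of three of the slide's rounded "~1M disks/PB".

```python
def disks_for_full_scan(data_bytes, disk_bytes_per_s, target_s):
    """Disks needed to read data_bytes within target_s, scanning in parallel."""
    per_disk = disk_bytes_per_s * target_s   # bytes one disk can read in time
    return -(-data_bytes // per_disk)        # ceiling division on integers

PB = 10**15   # assumption: decimal petabyte
MB = 10**6    # assumption: decimal megabyte

disks = disks_for_full_scan(PB, 50 * MB, 60)
```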

11 Next-Generation Data Analysis
Looking for
  – Needles in haystacks – the Higgs particle
  – Haystacks: dark matter, dark energy
Needles are easier than haystacks
Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
As data and computers grow at the same rate, we can only keep up with N log N
A way out?
  – Relax notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
Combination of statistics & computer science
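
To make the scaling concrete, here is a deliberately simplified one-dimensional pair count, the basic ingredient of a correlation function. It is a toy under stated assumptions (integer positions on a line, invented function names), not the tree codes used on real sky data. The brute-force version checks all N(N-1)/2 pairs; the sorted version gets the same count with one binary search per point.

```python
import bisect

def pairs_brute(xs, r):
    """Count pairs with |xi - xj| <= r by checking every pair: O(N^2)."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)

def pairs_sorted(xs, r):
    """Same count after sorting: one binary search per point, O(N log N)."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Indices i+1 .. j-1 hold the points within r to the right of x.
        j = bisect.bisect_right(s, x + r)
        total += j - (i + 1)
    return total

# Hypothetical integer positions, scattered deterministically.
points = [(i * 7919) % 1009 for i in range(500)]
count_brute = pairs_brute(points, 5)
count_sorted = pairs_sorted(points, 5)
```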

12 Analysis and Databases
Much statistical analysis deals with
  – Creating uniform samples
  – Data filtering
  – Assembling relevant subsets
  – Estimating completeness
  – Censoring bad data
  – Counting and building histograms
  – Generating Monte-Carlo subsets
  – Likelihood calculations
  – Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a database
Move Mohamed to the mountain, not the mountain to Mohamed.
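
A small sketch of doing two of these tasks (censoring bad data and building a histogram) inside the database rather than on exported files. It uses Python's built-in sqlite3 module as a stand-in for a real archive server; the table name, columns, and synthetic rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")

# Synthetic rows: magnitudes 14.0 .. 18.9, every tenth row flagged as bad.
rows = [(i, (140 + i % 50) / 10.0, 1 if i % 10 == 0 else 0) for i in range(1000)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Censor bad data (flag = 1) and histogram magnitudes in 1-mag bins.
# The work happens in the database; only the bin counts reach the client.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obj
    WHERE flag = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()
```

Five small rows come back instead of a thousand, which is the "bring the analysis to the data" idea in miniature.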

13 Organization & Algorithms
Use of clever data structures (trees, cubes):
  – Up-front creation cost, but only N log N access cost
  – Large speedup during the analysis
  – Tree-codes for correlations (A. Moore et al 2001)
  – Data Cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
  – No need to be more accurate than cosmic variance
  – Fast CMB analysis by Szapudi et al (2001): N log N instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
  – Controlled level of accuracy
  – Best result in a given time, given our computing resources

14 World Wide Telescope Virtual Observatory
Premise: most data is (or could be) online
The Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio, ...
  – As deep as the best instruments (2 years ago)
  – It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no ...)
  – It’s a smart telescope: links objects and data to literature on them

15 Why Astronomy?
Community has lots of data
Data is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial, temporal
Diverse and distributed
  – Many different instruments from many different places and many different times
Community wants to share/cross compare
  – Can freely share data and algorithms
  – “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
They are well organized
Community is small and homogeneous
No commercial or privacy concerns
  – All the problems are technical or social

16 The WWT Components
Data Sources
  – Literature
  – Archives
Unified Definitions
  – Units
  – Semantics/Concepts/Metrics, Representations
  – Provenance
Object model
Classes and methods
Portals

17 Data Sources
Literature online and cross indexed
  – Simbad, ADS, NED
Many curated archives online
  – FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, …
  – Typically files with English meta-data and some programs
Groups, Researchers, Amateurs Publish
  – Datasets online in various formats
  – Data publications are ephemeral (may disappear)
  – Many have unknown provenance
Documentation varies; some good and some none.

18 Unified Definitions
Universal Content Definitions
  – Collated all table heads from all the literature
  – 100,000 terms reduced to ~1,500
  – Rough consensus that this is the right thing
  – Refinement in progress as people use UCDs
Defines
  – Units: gram, radian, second, jansky, ...
  – Semantic Concepts / Metrics: std error, Chi^2 fit, magnitude, passband, velocity, ...

19 Provenance
Most data will be derived.
To do science, need to trace derived data back to source.
So programs and inputs must be registered.
Must be able to re-run them.
Example: Space Telescope Calibrated Data
  – Run on demand
  – Can specify software version (to get old answers)
Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).
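
One way to picture "programs and inputs must be registered" is a record that pins down which program version ran on which inputs. The sketch below is an illustration only: the record layout, field names, and example values are assumptions, not a scheme described in the talk. It fingerprints the inputs with SHA-256 so that the same derivation always yields the same record id and a different software version yields a different one.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """Content hash used to identify an input or a record."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(program, version, inputs, params):
    """Describe one derivation step: which code, which inputs, which settings."""
    record = {
        "program": program,
        "version": version,
        "inputs": {name: fingerprint(blob) for name, blob in sorted(inputs.items())},
        "params": params,
    }
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_id"] = fingerprint(canonical)
    return record

# Hypothetical calibration step, registered twice and then with a newer version.
rec_a = provenance_record("calibrate", "2.1", {"raw.fits": b"pixels"}, {"flat": "f17"})
rec_b = provenance_record("calibrate", "2.1", {"raw.fits": b"pixels"}, {"flat": "f17"})
rec_c = provenance_record("calibrate", "2.2", {"raw.fits": b"pixels"}, {"flat": "f17"})
```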

20 Object Model
General acceptance of XML
Recent acceptance of XML Schema (XSD over DTD)
Wait-and-See about SOAP/WSDL/…
  – “Web Services are just Corba with angle brackets.”
  – “FTP is good enough for me.”
Personal opinion:
  – Web Services are much more than “Corba + <>”
  – Huge focus on interop
  – Huge focus on integrated tools
But the community says “Show me!”
  – Many technologists convinced, but not yet the astronomers
(Diagram labels: Your program; Data in your address space; Web Service; SOAP; object in XML; Web Server; HTTP; Web page)

21 Classes and Methods
First class: VOTable
  – Represents an answer set in XML
  – Defined by an XML Schema (XSD)
  – Metadata (in terms of UCDs)
  – Data representation (numbers and text)
First method: Cone Search
  – Get objects in this cone
(Diagram labels: Your program; Data in your address space; Web Service; SOAP; object in XML)
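
The geometry behind "get objects in this cone" can be sketched locally. The real Cone Search is a web service that returns a VOTable; the code below only shows the selection step on an in-memory list, and the catalog rows and function names are invented for the example. Separations use the haversine form of the great-circle distance.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form; inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the rows whose position lies within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]

# Hypothetical catalog rows (positions in degrees).
catalog = [
    {"id": "A", "ra": 180.00, "dec": 0.00},
    {"id": "B", "ra": 180.05, "dec": 0.02},
    {"id": "C", "ra": 181.00, "dec": 0.00},
    {"id": "D", "ra": 0.01, "dec": 89.99},
]
hits = cone_search(catalog, ra=180.0, dec=0.0, radius_deg=0.1)
```

Here `hits` contains objects A and B; C is a full degree away and D is near the pole.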

22 Other Classes
Space-Time class
Image class (returns pixels)
  – SdssCutout
  – Simple Image Access Protocol
  – HyperAtlas
Spectral
  – Simple Spectral Access Protocol
  – 500K spectra available
Query Services
  – ADQL and SkyNode
And Registry:
  – see below
(Diagram labels: Your program; Data in your address space; Web Service; SOAP; object in XML)

23 The Registry
UDDI seemed inappropriate
  – Complex
  – Irrelevant questions
  – Relevant questions missing
Evolved Dublin Core
  – Represent Datasets, Services, Portals
  – Needs to be machine readable
  – Federation (DNS model)
  – Push & Pull: register then harvest

24 Demo
SkyServer:
  – Navigator showing cutout web service
  – List: showing many calls and variant use
SkyQuery:
  – Show integration of various archives
  – Explain spatial join xMatch operator

25 SkyServer.SDSS.org
A modern astronomy archive
  – Raw pixel data lives in file servers
  – Catalog data (derived objects) lives in database
  – Online query to any and all
Also used for education
  – 150 hours of online astronomy
  – Implicitly teaches data analysis
Interesting things
  – Spatial data search
  – Client query interface via Java applet
  – Query interface via Emacs
  – Popular
  – Cloned by other surveys (a template design)
  – Web services are core of it

26 SkyQuery.Net
A Prototype WWT
Started with SDSS data and schema
Imported 12 other datasets into that spine schema (a day per dataset plus load time)
Unified them with a portal
Implicit spatial join among the datasets
All built on Web Services
  – Pure XML
  – Pure SOAP
  – Used .NET toolkit

27 Federation: SkyQuery.Net
Combine 4 archives initially
Added 9 more
Send query to portal, portal joins data from archives.
Problem: want to do multi-step data analysis (not just single query).
Solution: allow personal databases on portal
Problem: some queries are monsters
Solution: “batch schedule” on portal server, deposits answer in personal database.
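
The portal's join across archives is spatial: rows from different catalogs are paired when they sit at nearly the same place on the sky. The sketch below substitutes a plain fixed-radius positional match with a nested loop, which is simpler than whatever the SkyQuery xMatch operator actually does; the catalogs, ids, and 2-arcsecond tolerance are invented for illustration.

```python
import math

def sep_arcsec(a, b):
    """Angular separation in arcseconds between two (ra, dec) pairs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (*a, *b))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def cross_match(left, right, tol_arcsec):
    """Return (left_id, right_id) pairs closer than tol_arcsec (nested loop)."""
    return [(lid, rid)
            for lid, lpos in left.items()
            for rid, rpos in right.items()
            if sep_arcsec(lpos, rpos) <= tol_arcsec]

# Hypothetical extracts that two archives might return to the portal.
optical = {"o1": (150.0000, 2.0000), "o2": (150.0100, 2.0100)}
radio = {"r1": (150.0002, 2.0001), "r2": (151.0000, 2.0000)}

matches = cross_match(optical, radio, tol_arcsec=2.0)
```

Only o1 and r1 pair up, since they lie under an arcsecond apart. A production join would use a spatial index rather than comparing every pair.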

28 SkyQuery Structure
Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
Portal
  – Plans query (2 phase)
  – Integrates answers
  – Is a web service
(Diagram labels: 2MASS, INT, SDSS, FIRST, SkyQuery Portal, Image Cutout)

29 MyDB
Portal allows federation of data but…
Intermediate results may be large.
Intermediate results feed into next analysis step.
Sending them back-and-forth to client is costly and sometimes infeasible.
Solution: create a working DB for client at Portal: MyDB

30 MyDB
Anyone can create a personal DB at SkyServer portal.
  – It is about 100 MB
  – It is private
Simple queries done immediately
Complex queries done by batch scheduler
All queries can create/read/write MyDB tables
Very popular with “serious” users.
MyDB will be sharable by a group.
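
The MyDB idea, where queued queries deposit their answers in a server-side personal database that later steps can read, can be mimicked with sqlite3. Everything here is a stand-in: an attached in-memory database plays the role of MyDB, the `photo` table and job list are made up, and the real batch scheduler is certainly more involved than a loop.

```python
import sqlite3

portal = sqlite3.connect(":memory:")                   # stands in for the archive
portal.execute("ATTACH DATABASE ':memory:' AS mydb")   # the user's private DB
portal.execute("CREATE TABLE photo (id INTEGER, mag REAL)")
portal.executemany("INSERT INTO photo VALUES (?, ?)",
                   [(i, (1500 + i) / 100.0) for i in range(1000)])

def run_batch(conn, jobs):
    """Run queued (target_table, select_sql) jobs; each result lands in mydb.

    Table names are trusted here because this is a toy; real code must not
    build SQL from untrusted strings.
    """
    for target, select_sql in jobs:
        conn.execute(f"CREATE TABLE mydb.{target} AS {select_sql}")

jobs = [
    ("bright", "SELECT id, mag FROM photo WHERE mag < 16.0"),
    # The second step reads the first step's output without leaving the server.
    ("brightest", "SELECT id, mag FROM mydb.bright ORDER BY mag LIMIT 10"),
]
run_batch(portal, jobs)

n_bright = portal.execute("SELECT COUNT(*) FROM mydb.bright").fetchone()[0]
n_brightest = portal.execute("SELECT COUNT(*) FROM mydb.brightest").fetchone()[0]
```

The intermediate table never travels to the client; only the final counts or rows do.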

31 Open SkyQuery
SkyQuery being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).
SkyNode: basic archive object
SkyQuery Language (VoQL) is evolving.

32 Outline: The WWT Components
Data Sources
  – Literature
  – Archives
Unified Definitions
  – Units
  – Semantics/Concepts/Metrics, Representations
  – Provenance
Object model
Classes and methods
Portals
WWT is a poster child for the Data Grid.
What we learned
Astro is a community of 10,000
Homogeneous & cooperative
If you can’t do it for Astro, do not bother with 3M bio-info.
Agreement
  – Takes time
  – Takes endless meetings
Big problems are non-technical
  – Legacy is a big problem
Plumbing and tools are there
But…
  – What is the object model?
  – What do you want to save?
  – How to document provenance?

33 References (all are MSR TRs)
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science
When Database Systems Meet the Grid
There Goes the Neighborhood: Relational Algebra for Spatial Data Search
Extending the SDSS Batch Query System to the National Virtual Observatory Grid
The World-Wide Telescope, an Archetype for Online Science
Data Mining the SDSS SkyServer Database
The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data
Web Services for the Virtual Observatory
Online Scientific Data Curation, Publication, and Archiving
Petabyte Scale Data Mining: Dream or Reality?
Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey