Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.

Presentation transcript:

Patterns of Scientific Progress
- Observational Science
  - Scientist gathers data by direct observation
  - Scientist analyzes data
- Analytical Science
  - Scientist builds analytical model
  - Makes predictions
- Computational Science
  - Simulate analytical model
  - Validate model and make predictions
- Data Exploration Science
  - Data captured by instruments, or generated by simulator
  - Processed by software
  - Placed in a database / files
  - Scientist analyzes database / files
Gray and Szalay, Communications of the ACM (2002)

Living in an Exponential World
- Astronomers have a few hundred TB now
  - 1 pixel (byte) / sq arc second ~ 4 TB
  - Multi-spectral, temporal, … → 1 PB
- They mine it looking for
  - new (kinds of) objects, or more of the interesting ones (quasars)
  - density variations in 400-D space
  - correlations in 400-D space
- Data doubles every year, public after 1 year
  - So, 50% of the data is public
  - Same trend appears in all sciences
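The "50% of the data is public" figure follows directly from the doubling claim. A minimal sketch of the arithmetic, assuming holdings double exactly once a year and every observation is released exactly one year after capture (the function name and the idealized model are illustrative, not from the slides):

```python
def public_fraction(years: int, release_delay: int = 1) -> float:
    """Fraction of all data collected so far that is already public.

    Assumption: total holdings after t years are 2**t units, and data
    becomes public `release_delay` years after it is captured.
    """
    total = 2 ** years
    # Everything that existed `release_delay` years ago is public today.
    public = 2 ** (years - release_delay)
    return public / total


for t in (1, 5, 10):
    print(t, public_fraction(t))
```

Under this model the fraction does not depend on how long the archive has been running: with a one-year delay it is always one half.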

The Challenges
- Stages: Data Collection; Discovery and Analysis; Publishing
- Exponential data growth: distributed collections, soon Petabytes
- New analysis paradigm: data federations, move analysis to data
- New publishing paradigm: scientists are publishers and curators

Making Discoveries
- Where are discoveries made?
  - At the edges and boundaries
  - Going deeper, collecting more data, using more colors…
- Metcalfe's law
  - Utility of computer networks grows as the number of possible connections: O(N^2)
- Szalay's data federation law
  - Federation of N archives has utility O(N^2)
  - Possibilities for new discoveries grow as O(N^2)
- Current sky surveys have proven this
  - Very early discoveries from SDSS, 2MASS, DPOSS
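The O(N^2) in both laws is the number of distinct pairs, N(N-1)/2. A small sketch that enumerates the possible pairwise cross-matches between archives; the helper name is hypothetical, and only the three survey names come from the slide:

```python
from itertools import combinations


def federation_links(archives):
    """Return every distinct pair of archives that could be cross-matched."""
    return list(combinations(archives, 2))


surveys = ["SDSS", "2MASS", "DPOSS"]
for a, b in federation_links(surveys):
    print(f"{a} x {b}")

# The pair count grows quadratically with the number of archives.
for n in (3, 10, 100):
    print(n, len(federation_links(range(n))))
```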

Data Analysis Today
- Download (FTP and GREP) are not adequate
  - You can GREP 1 MB in a second
  - You can GREP 1 GB in a minute
  - You can GREP 1 TB in 2 days
  - You can GREP 1 PB in 3 years
  - Oh!, and 1 PB ~ 10,000 disks
- At some point we need
  - indices to limit search
  - parallel data search and analysis
- This is where databases can help
- Next generation technique: Data Exploration
  - Bring the analysis to the data!
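To illustrate why an index limits the search, here is a minimal sketch that compares a full scan with a lookup on a sorted copy of the data. The sorted list stands in for a database index; the magnitude values and function names are made up for the example:

```python
import bisect
import random


def scan_range(values, lo, hi):
    """Full scan: inspects every value, O(N), like GREP over a file."""
    return sorted(v for v in values if lo <= v < hi)


def indexed_range(sorted_values, lo, hi):
    """Index lookup: two binary searches, O(log N) plus the result size."""
    start = bisect.bisect_left(sorted_values, lo)
    stop = bisect.bisect_left(sorted_values, hi)
    return sorted_values[start:stop]


random.seed(0)
magnitudes = [random.uniform(10.0, 25.0) for _ in range(100_000)]
index = sorted(magnitudes)  # up-front cost, paid once

# Both approaches return the same objects; only the work done differs.
assert scan_range(magnitudes, 17.0, 17.01) == indexed_range(index, 17.0, 17.01)
```

The scan touches all 100,000 values for every query, while the indexed lookup needs a few dozen comparisons before it reads the matching slice.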

Next-Generation Data Analysis
- Looking for
  - Needles in haystacks – the Higgs particle
  - Haystacks: Dark matter, Dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
  - Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at the same rate, we can only keep up with N log N
- A way out?
  - Discard notion of optimal (data is fuzzy, answers are approximate)
  - Don't assume infinite computational resources or memory
- Requires combination of statistics & computer science

Why Is Astronomy Special?
- It has no commercial value
  - No privacy concerns, freely share results with others
  - Great for experimenting with algorithms
- It is real and well documented
  - High-dimensional (with confidence intervals)
  - Spatial, temporal
- Diverse and distributed
  - Many different instruments from many different places and many different times
- The questions are interesting
- There is a lot of it (soon Petabytes)

The Virtual Observatory
- Many new surveys are coming
  - SDSS is a dry run for the next ones
  - LSST will be 5 TB/night
- All the data will be on the Internet
  - ftp, web services…
- Data and applications will be associated with the projects
  - Distributed world wide, cross-indexed
  - Federation is a must
- Will be the best telescope in the world
  - World Wide Telescope
- Finds the "needle in the haystack"
- Successful demonstrations in Jan '03

Boundary Conditions
- Dealing with the astronomy legacy
  - FITS data format
  - Software analysis systems
- Standards driven by evolving new technologies
  - Exchange of rich and structured data (XML…)
  - DB connectivity, Web Services, Grid computing
- Application to astronomy domain
  - Data dictionaries (UCDs)
  - Data models
  - Protocols
  - Registries and resource/service discovery
  - Provenance, data quality

Short History of the VO
- Driven by exponential data growth
- In the US it started with SDSS + GriPhyN
- In Europe it started at CDS (Strasbourg)
- Continued with NVO + AVO
- Now: International Virtual Observatory Alliance
  - Now in 14 countries
  - Total data holdings >200 TB
  - Core services and standards adopted
  - Getting ready for first deployment (mid-2004)

Data Analysis - Optimal Statistics
- Brute-force examples for optimal statistics have poor scaling
  - Correlation functions N^2, likelihood techniques N^3
- As data sizes grow at Moore's law, computers can only keep up with at most N log N algorithms
- What goes?
  - Notion of optimal is in the sense of statistical errors
  - Assumes infinite computational resources
  - Assumes that only source of error is statistical
  - 'Cosmic Variance': we can only observe the Universe from one location (finite sample size)
- Solutions require combination of Statistics and CS
- New algorithms: not worse than N log N
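The "can only keep up with N log N" claim can be checked with a few lines of arithmetic. The sketch below assumes, as the slide does, that data volume and computer speed both double every year; the starting size of 10^6 objects and the function name are arbitrary choices for illustration:

```python
import math


def relative_runtime(cost, n0: float, years: int) -> float:
    """Runtime after `years`, relative to today's runtime.

    Assumption: data size and computer speed both grow by 2**years.
    `cost` maps a data size N to an operation count.
    """
    growth = 2 ** years
    return cost(n0 * growth) / (cost(n0) * growth)


costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in costs.items():
    ratios = [round(relative_runtime(cost, 1e6, y), 2) for y in (1, 5, 10)]
    print(f"{name:8s} runtime ratio after 1, 5, 10 years: {ratios}")
```

For N^2 the ratio equals the growth factor itself, so the same analysis takes twice as long every year despite faster machines; for N log N the ratio grows only through the slowly changing logarithm.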

Organization & Algorithms
- Use of clever data structures (trees, cubes):
  - Up-front creation cost, but only N log N access cost
  - Large speedup during the analysis
  - Tree-codes for correlations (A. Moore et al 2001)
  - Data Cubes for OLAP (all vendors)
- Fast, approximate heuristic algorithms
  - No need to be more accurate than cosmic variance
  - Fast CMB analysis by Szapudi et al (2001)
    - N log N instead of N^3 => 1 day instead of 10 million years
- Take cost of computation into account
  - Controlled level of accuracy
  - Best result in a given time, given our computing resources
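As a rough illustration of how a spatial data structure cuts the cost of pair counting (the core of a correlation function), the sketch below compares a brute-force count with one that first buckets points into a uniform grid. This is not the tree code cited on the slide: a simple grid is swapped in to keep the example short, and all names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations, product


def is_close(p, q, r2):
    """True if 2-D points p and q are closer than sqrt(r2)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < r2


def pairs_brute(points, r):
    """Count close pairs by testing all N(N-1)/2 combinations."""
    r2 = r * r
    return sum(1 for p, q in combinations(points, 2) if is_close(p, q, r2))


def pairs_grid(points, r):
    """Count the same pairs after bucketing points into square cells of side r.

    Two points closer than r always fall in the same cell or in adjacent
    cells, so only neighbouring cells need to be compared.
    """
    r2 = r * r
    cells = defaultdict(list)
    for p in points:  # up-front creation cost
        cells[(int(p[0] // r), int(p[1] // r))].append(p)

    count = 0
    for (cx, cy), members in cells.items():
        # Pairs inside one cell.
        count += sum(1 for p, q in combinations(members, 2) if is_close(p, q, r2))
        # Four "forward" neighbours, so each adjacent cell pair is visited once.
        for dx, dy in ((1, -1), (1, 0), (1, 1), (0, 1)):
            other = cells.get((cx + dx, cy + dy), [])
            count += sum(1 for p, q in product(members, other) if is_close(p, q, r2))
    return count


random.seed(42)
points = [(random.random(), random.random()) for _ in range(1000)]
assert pairs_grid(points, 0.02) == pairs_brute(points, 0.02)
```

For roughly uniform data and a small radius, the grid version compares each point only with its few neighbours, which is where the speedup during analysis comes from.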

Analysis and Databases
- Much statistical analysis deals with
  - Creating uniform samples
  - Data filtering
  - Assembling relevant subsets
  - Estimating completeness
  - Censoring bad data
  - Counting and building histograms
  - Generating Monte-Carlo subsets
  - Likelihood calculations
  - Hypothesis testing
- Traditionally these are performed on files
- Most of these tasks are much better done inside a database
- Move Mohamed to the mountain, not the mountain to Mohamed
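Two of the tasks listed above, censoring bad data and building a histogram, can be sketched as a single query executed inside the database. The example uses an in-memory SQLite database with a made-up `objects` table (columns `mag` and `flagged` are assumptions for illustration, not a real survey schema):

```python
import random
import sqlite3

random.seed(1)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id INTEGER PRIMARY KEY, mag REAL, flagged INTEGER)"
)
# Synthetic catalog: random magnitudes, about 10% of rows flagged as bad.
conn.executemany(
    "INSERT INTO objects (mag, flagged) VALUES (?, ?)",
    [(random.uniform(14.0, 22.0), int(random.random() < 0.1)) for _ in range(10_000)],
)

# Filter out bad rows and build a 1-magnitude histogram in one query,
# so only the handful of bin counts leaves the database.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS mag_bin, COUNT(*) AS n
    FROM objects
    WHERE flagged = 0
    GROUP BY mag_bin
    ORDER BY mag_bin
    """
).fetchall()

for mag_bin, n in histogram:
    print(mag_bin, n)
```

The file-based alternative would read all 10,000 rows into the analysis program first; here the filtering and counting happen where the data lives.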

Cosmic Microwave Background Szapudi et al 2002

Data Exploration: A New Way of Doing Science
- Primary access to data is through databases
- Exponential data growth – distributed data
- Publication before analysis
- Large data: move analysis to where data is
- Distributed computing – data federation
- New algorithms are needed
- The Virtual Observatory is a good example
- Unavoidable, emerging in all sciences!