Analyzing Large Datasets in Astrophysics. Alexander Szalay, The Johns Hopkins University. Towards an International Virtual Observatory, Garching, 2002 (Living in an exponential world...)

Presentation transcript:

Analyzing Large Datasets in Astrophysics
Alexander Szalay, The Johns Hopkins University
Towards an International Virtual Observatory, Garching, 2002
(Living in an exponential world...)

Alex Szalay, Garching
Outline
- Collecting Data
- Exponential Growth
- Making Discoveries
- Publishing Data
- VO: How will it work?
- Web Services
- Atomic vs Composite services
- Distributed queries with SkyQuery
- Cross-Matching Algorithm
- SkyNode
- Web Services + Portal
- Statistical Analysis of large data sets

The World is Exponential
- Astrophysical data is growing exponentially
  - Doubling every year (Moore's law+): both data sizes and number of data sets
- Computational resources scale the same way
  - Constant $$$ will keep up with the data
- Main problem is the software component
  - Currently components are not reused
  - Software costs are an increasingly large fraction
  - Aggregate costs are growing exponentially

Making Discoveries
- When and where are discoveries made?
  - Always at the edges and boundaries
  - Going deeper, using more colors...
- Metcalfe's law
  - Utility of computer networks grows as the number of possible connections: O(N^2)
- VO: Federation of N archives
  - Possibilities for new discoveries grow as O(N^2)
- Current sky surveys have proven this
  - Very early discoveries from SDSS, 2MASS, DPOSS
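The O(N^2) claim can be made concrete by counting the distinct pairs among N federated archives. The following is a minimal sketch; the function name `possible_pairs` is hypothetical and not part of the talk.

```python
def possible_pairs(n_archives: int) -> int:
    """Number of distinct archive pairs that can be cross-compared."""
    # Every archive can be paired with every other one, and order does not matter.
    return n_archives * (n_archives - 1) // 2


for n in (2, 5, 10, 20):
    print(n, possible_pairs(n))
```

Doubling the number of archives roughly quadruples the number of pairwise combinations, which is the sense in which the possibilities grow as O(N^2).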

Publishing Data

Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists

Changing Roles
- Exponential growth:
  - Projects last at least 3-5 years
  - Data sent upwards only at the end of the project
  - Data will never be centralized
- More responsibility on projects
  - Becoming Publishers and Curators
  - Larger fraction of budget spent on software
  - A lot of development duplicated, wasted
- More standards are needed
  - Easier data interchange, fewer tools
- More templates are needed
  - Develop less software on your own

Emerging New Concepts
- Standardizing distributed data
  - Web Services, supported on all platforms
  - Custom configure remote data dynamically
  - XML: Extensible Markup Language
  - SOAP: Simple Object Access Protocol
  - WSDL: Web Services Description Language
- Standardizing distributed computing
  - Grid Services
  - Custom configure remote computing dynamically
  - Build your own remote computer, and discard
  - Virtual Data: new data sets on demand

NVO: How Will It Work?
- Define commonly used 'atomic' services
- Build higher level toolboxes/portals on top
- We do not build 'everything for everybody'
- Use the 90-10 rule:
  - Define the standards and interfaces
  - Build the framework
  - Build the 10% of services that are used by 90%
  - Let the users build the rest from the components

Atomic Services
- Metadata information about resources
  - Waveband
  - Sky coverage
  - Translation of names to universal dictionary (UCD)
- Simple search patterns on the resources
  - Cone Search
  - Image mosaic
  - Unit conversions
- Simple filtering, counting, histogramming
- On-the-fly recalibrations
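A cone search, one of the atomic services listed above, returns every object within a given angular radius of a sky position. The sketch below illustrates the idea on an in-memory list; the row layout (`id`, `ra`, `dec` in degrees) and the function names are assumptions for illustration, not the interface of any real service.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp against rounding so asin never sees a value above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog rows within radius_deg of the position (ra, dec)."""
    return [
        row for row in catalog
        if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg
    ]


# Hypothetical toy catalog, positions in degrees.
catalog = [
    {"id": 1, "ra": 181.30, "dec": -0.76},
    {"id": 2, "ra": 181.35, "dec": -0.76},
    {"id": 3, "ra": 182.00, "dec": 0.00},
]
hits = cone_search(catalog, ra=181.3, dec=-0.76, radius_deg=0.1)
print([row["id"] for row in hits])
```

This linear scan is only meant to show what the service computes; a production archive would use a spatial index rather than testing every row.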

Higher Level Services
- Built on Atomic Services
- Perform more complex tasks
- Examples
  - Automated resource discovery
  - Cross-identifications
  - Photometric redshifts
  - Outlier detections
  - Visualization facilities
- Expectation: build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)

SkyQuery
- Distributed Query tool using a set of services
- Feasibility study, built in 6 weeks from scratch
  - Tanu Malik (JHU CS grad student)
  - Tamas Budavari (JHU astro postdoc)
- Implemented in C# and .NET
- Won 2nd prize of Microsoft XML Contest
- Allows queries like:

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3 AND (o.i - t.m_j)>2

Architecture
[Diagram with components: Web Page, SkyQuery, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST, Image cutout]

Cross-id Steps
- Parse query
- Get counts
- Sort by counts
- Make plan
- Cross-match
  - Recursively, from small to large
- Select necessary attributes only
- Return output
- Insert cutout image

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND (o.i - t.m_j) > 2 AND o.type=3
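The planning and cross-match steps above can be sketched on in-memory catalogs: count the objects in each archive, order the archives from smallest to largest, then extend the partial matches recursively one archive at a time. This is an illustration under stated assumptions, not the SkyQuery implementation: the function names and toy rows are hypothetical, and a plain angular tolerance in arcseconds stands in for whatever XMATCH actually computes.

```python
import math


def separation_arcsec(a, b):
    """Small-angle separation in arcseconds between two rows (ra, dec in degrees)."""
    mean_dec = math.radians((a["dec"] + b["dec"]) / 2)
    d_ra = (a["ra"] - b["ra"]) * math.cos(mean_dec)
    d_dec = a["dec"] - b["dec"]
    return math.hypot(d_ra, d_dec) * 3600.0


def make_plan(archives):
    """'Get counts' and 'sort by counts': archive names, smallest first."""
    return sorted(archives, key=lambda name: len(archives[name]))


def extend_matches(partial, plan, archives, tol_arcsec):
    """Recursively extend partial matches with the next archive in the plan."""
    if not plan:
        return partial
    name, rest = plan[0], plan[1:]
    extended = [
        {**match, name: row}
        for match in partial
        for row in archives[name]
        # Assumption: a candidate must lie within tolerance of every member so far.
        if all(separation_arcsec(row, other) <= tol_arcsec
               for other in match.values())
    ]
    return extend_matches(extended, rest, archives, tol_arcsec)


def cross_match(archives, tol_arcsec):
    plan = make_plan(archives)
    seed = [{plan[0]: row} for row in archives[plan[0]]]
    return extend_matches(seed, plan[1:], archives, tol_arcsec)


# Hypothetical toy archives; positions in degrees.
archives = {
    "SDSS": [
        {"objId": 100, "ra": 181.3000, "dec": -0.7601},
        {"objId": 101, "ra": 181.4000, "dec": -0.7000},
        {"objId": 102, "ra": 181.5000, "dec": -0.5000},
    ],
    "TWOMASS": [
        {"objId": 10, "ra": 181.3001, "dec": -0.7600},
        {"objId": 11, "ra": 181.4000, "dec": -0.7000},
    ],
    "FIRST": [
        {"objId": 1, "ra": 181.3000, "dec": -0.7600},
    ],
}
plan = make_plan(archives)
matches = cross_match(archives, tol_arcsec=2.0)
print(plan, len(matches))
```

Starting from the smallest archive keeps the list of partial matches short, so less data has to be carried forward to the larger archives.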

Monte-Carlo Simulation
- Comparing different algorithms for 3-way xid
  - Transmit all the data
  - Transmit after filtering
  - Recursive cross-match
- Surveys
  - SDSS
  - 2MASS
  - FIRST
- Random variables:
  - Sky area (0..10 sq deg)
  - Selectivity of each subselect (0..1)
  - Efficiency of join (0.5..2)
  - Selectivity of common select (0..1)

SkyNode
- Metadata functions (SOAP)
  - Info, Tables, Columns, Schema, Functions, Keysearch
- Query functions (SOAP)
  - Dataset Query(String sqlCmd)
  - Dataset Xmatch(Dataset input, String sqlCmd, float eps)
- Database
  - MS SQL Server
  - Upload dataset
  - Very fast spatial search engine (HTM-based): crossmatch takes <3 ms/object over 15M in SDSS
  - User defined functions and stored procedures
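To make the shape of the SkyNode interface concrete, here is a toy stand-in that mirrors three of the functions listed above (Tables, Columns, Query) plus a simplified Xmatch. Everything in it is an assumption for illustration: the class name `ToySkyNode`, the column names, the use of an in-memory SQLite database instead of MS SQL Server, plain Python calls instead of SOAP, and a brute-force angular tolerance in arcseconds instead of the HTM-based search.

```python
import math
import sqlite3


def _sep_arcsec(a, b):
    """Small-angle separation in arcseconds between two rows (degrees in)."""
    mean_dec = math.radians((a["dec_deg"] + b["dec_deg"]) / 2)
    d_ra = (a["ra_deg"] - b["ra_deg"]) * math.cos(mean_dec)
    return math.hypot(d_ra, a["dec_deg"] - b["dec_deg"]) * 3600.0


class ToySkyNode:
    """Hypothetical in-memory node exposing metadata and query functions."""

    def __init__(self, rows):
        self.conn = sqlite3.connect(":memory:")
        self.conn.row_factory = sqlite3.Row
        self.conn.execute(
            "CREATE TABLE PhotoPrimary "
            "(objId INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, r REAL)"
        )
        self.conn.executemany("INSERT INTO PhotoPrimary VALUES (?, ?, ?, ?)", rows)

    def tables(self):
        cur = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [row["name"] for row in cur]

    def columns(self, table):
        if table not in self.tables():
            raise ValueError(f"unknown table: {table}")
        return [row["name"] for row in self.conn.execute(f"PRAGMA table_info({table})")]

    def query(self, sql_cmd):
        """Run a SQL command and return the result as a list of dicts."""
        return [dict(row) for row in self.conn.execute(sql_cmd)]

    def xmatch(self, input_rows, sql_cmd, eps_arcsec):
        """Pair each input row with local rows from sql_cmd within eps_arcsec."""
        local = self.query(sql_cmd)
        return [(a, b) for a in input_rows for b in local
                if _sep_arcsec(a, b) <= eps_arcsec]


node = ToySkyNode([
    (1, 181.30, -0.76, 18.2),
    (2, 181.35, -0.76, 21.0),
    (3, 182.00, 0.00, 19.5),
])
bright = node.query("SELECT objId FROM PhotoPrimary WHERE r < 20 ORDER BY objId")
pairs = node.xmatch(
    [{"ra_deg": 181.3001, "dec_deg": -0.76}],
    "SELECT objId, ra_deg, dec_deg FROM PhotoPrimary",
    eps_arcsec=3.0,
)
print(node.tables(), bright, len(pairs))
```

The point of the sketch is the division of labor: a portal only needs the metadata calls to discover what a node holds, and the two query calls to push work to where the data lives.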

Data Flow
[Diagram with components: SkyQuery, SkyNode 1, SkyNode 2, SkyNode 3; label: query]

Optimal Statistics
- The examples for optimal statistics have poor scaling
  - Correlation functions N^2, likelihood techniques N^3
- As data sizes grow at Moore's law, computers can only keep up with at most N log N algorithms
- What goes?
  - Notion of optimal is in the sense of statistical errors
  - Assumes infinite computational resources
  - Assumes that only source of error is statistical
  - 'Cosmic Variance': we can only observe the Universe from one location (finite sample size)
- Solutions require combination of Statistics and CS
- New algorithms: not worse than N log N

Clever Data Structures
- Heavy use of tree structures:
  - Up-front cost, but only N log N
  - Large speedup later
  - Tree-codes for correlations (A. Moore et al. 2001)
- Fast, approximate heuristic algorithms
  - No need to be more accurate than cosmic variance
  - Fast CMB analysis by Szapudi et al. (2001)
  - N log N instead of N^3 => 1 day instead of 10 million years
- Take cost of computation into account
  - Controlled level of accuracy
  - Best result in a given time, given our computing resources
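The payoff of paying an up-front cost for a spatial data structure can be shown with pair counting, the core operation of a two-point correlation estimate. The sketch below is not the tree-code cited above: it swaps in a simpler uniform grid on a flat 2D plane, which is an assumption made to keep the example short. Function names and the random test points are hypothetical.

```python
import math
import random
from collections import defaultdict


def count_pairs_brute(points, r_max):
    """Count pairs closer than r_max by testing every pair: O(N^2)."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r_max:
                count += 1
    return count


def count_pairs_grid(points, r_max):
    """Count the same pairs, but only compare points in neighbouring cells."""
    # Up-front cost: bin every point into a square cell of side r_max.
    cells = defaultdict(list)
    for x, y in points:
        cells[(math.floor(x / r_max), math.floor(y / r_max))].append((x, y))

    # Half of the 8 neighbours, so each pair of adjacent cells is visited once.
    forward = [(1, -1), (1, 0), (1, 1), (0, 1)]
    count = 0
    for (cx, cy), members in cells.items():
        # Pairs inside the same cell.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if math.dist(members[i], members[j]) <= r_max:
                    count += 1
        # Pairs that straddle two adjacent cells.
        for dx, dy in forward:
            for p in members:
                for q in cells.get((cx + dx, cy + dy), ()):
                    if math.dist(p, q) <= r_max:
                        count += 1
    return count


random.seed(0)
points = [(random.random(), random.random()) for _ in range(300)]
print(count_pairs_brute(points, 0.05), count_pairs_grid(points, 0.05))
```

Both functions return the same count, but the grid version skips every pair of points that cannot possibly be within `r_max` of each other, which is where the speedup on large catalogs comes from.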

Angular Clustering with Photo-z
- w(θ) by Peebles and Groth:
  - The first example of publishing and analyzing large data
- Samples based on rest-frame quantities
- Strictly volume limited samples
- Largest angular correlation study to date
- Very clear detection of luminosity and color dependence
- Results consistent with 3D clustering

T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston and the SDSS Collaboration

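The Samples
[Figure: tree of sample cuts with galaxy counts; only the labels below are recoverable]
- All: 50M
- m_r < 21: 15M
- 10 stripes: 10M
- 0.1 < z < ..., ... > M_r: 3.1M
- 0.1 < z < ..., ... > M_r: 2.2M
- -21 > M_r > -23: 931k
- -21 > M_r > -22: 662k
- -22 > M_r > -23: 269k
- -20 > M_r > ...: ...k
- Subsample counts without recoverable labels: 343k, 254k, 185k, 316k, 280k, 326k, 185k, 127k
- 2800 square degrees in 10 stripes, data in custom DB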

The Stripes
- 10 stripes over the SDSS area, covering about 2800 square degrees
- About 20% lost due to bad seeing
- Masks: seeing, bright stars, etc.
- Images generated from query by web service

The Masks
- Stripe 11 + masks
- Masks are derived from the database
  - Search and intersect extended objects with boundaries

The Analysis
- eSpICE: I. Szapudi, S. Colombi and S. Prunet
- Integrated with the database by T. Budavari
- Extremely fast processing (N log N)
  - 1 stripe with about 1 million galaxies is processed in 3 mins
  - Usual figure was 10 min for 10,000 galaxies => 70 days
- Each stripe processed separately for each cut
- 2D angular correlation function computed
- w(θ): average with rejection of pixels along the scan
  - Flat field vector causes mock correlations
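The "70 days" figure follows from scaling the usual 10 minutes for 10,000 galaxies up to 1 million galaxies, assuming the usual method scales as N^2 (the scaling quoted for correlation functions earlier in the talk). A quick check, with hypothetical names:

```python
MINUTES_PER_DAY = 24 * 60


def scaled_minutes(base_minutes, base_n, target_n, exponent):
    """Extrapolate a run time assuming cost grows as N**exponent."""
    return base_minutes * (target_n / base_n) ** exponent


# 10 min for 10,000 galaxies, extrapolated to 1 million galaxies at N^2.
quadratic_minutes = scaled_minutes(10, 10_000, 1_000_000, exponent=2)
quadratic_days = quadratic_minutes / MINUTES_PER_DAY
print(round(quadratic_days, 1))  # 69.4, i.e. about 70 days

# Ratio to the quoted 3 minutes per stripe.
print(round(quadratic_minutes / 3))  # 33333
```

A hundredfold increase in N costs a factor of 10,000 at N^2, so 10 minutes becomes 100,000 minutes, against the 3 minutes quoted for the N log N method.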

Angular Correlations I.
- Luminosity dependence: 3 cuts
  - -20 > M > -21
  - -21 > M > -22
  - -22 > M > -23

Angular Correlations II.
- Color dependence: 4 bins by rest-frame SED type

Summary
- Exponential data growth: distributed data
- Web Services: hierarchical architecture
- Use the 90-10 rule (maybe 80-20)
- There are clever ways to federate datasets!
- Statistical analyses do not follow Moore's law
  - Need to revisit optimal statistics
- Give interesting new tools into the hands of smart young people...
  - They will quickly turn them into cutting edge science

Virtual Observatory
Astronomy with an attitude...