Alex Szalay, Jim Gray: Analyzing Large Data Sets in Astronomy


1 Alex Szalay, Jim Gray: Analyzing Large Data Sets in Astronomy

2 Patterns of Scientific Progress
- Observational science: the scientist gathers data by direct observation, then analyzes the data.
- Analytical science: the scientist builds an analytical model and makes predictions.
- Computational science: simulate the analytical model; validate it and make predictions.
- Data exploration science: data is captured by instruments or generated by a simulator, processed by software, and placed in a database or files; the scientist analyzes the database/files.
(Gray and Szalay, Communications of the ACM, 2002)

3 Living in an Exponential World
- Astronomers have a few hundred terabytes now: 1 pixel (byte) per square arcsecond ~ 4 TB; multi-spectral, temporal, ... → 1 PB.
- They mine it looking for new (kinds of) objects, more of the interesting ones (quasars), and density variations and correlations in 400-dimensional space.
- Data doubles every year and becomes public after 1 year, so 50% of the data is public (see the sketch below).
- The same trend appears in all sciences.
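A tiny arithmetic sketch of the 50% claim (the starting size is an illustrative assumption, not a survey figure): if holdings double every year and each year's data becomes public one year later, the public portion is always last year's total, i.e. half of today's.

```python
# Data doubles every year; each year's data becomes public 1 year later,
# so the public portion equals last year's total: always 50%.
total_tb = 100.0  # assumed starting size in TB (illustrative)
for year in range(5):
    public_tb = total_tb / 2.0
    print(f"year {year}: total {total_tb:>7.0f} TB, "
          f"public {public_tb:>7.0f} TB ({public_tb / total_tb:.0%})")
    total_tb *= 2.0
```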

4 The Challenges
- Data collection, discovery and analysis, publishing.
- Exponential data growth: distributed collections; soon petabytes.
- New analysis paradigm: data federations; move the analysis to the data.
- New publishing paradigm: scientists are publishers and curators.

5 Making Discoveries
- Where are discoveries made? At the edges and boundaries: going deeper, collecting more data, using more colors...
- Metcalfe's law: the utility of a computer network grows as the number of possible connections, O(N²).
- Szalay's data federation law: a federation of N archives has utility O(N²); possibilities for new discoveries grow as O(N²) (see the sketch below).
- Current sky surveys have proven this: very early discoveries from SDSS, 2MASS, DPOSS.
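A one-loop illustration of the federation law (the archive counts are arbitrary): the number of possible pairwise cross-matches among N archives is N(N-1)/2, which grows as O(N²).

```python
# Utility of federating N archives ~ number of possible pairwise
# cross-matches, N(N-1)/2, i.e. O(N^2) (Szalay's data federation law).
for n_archives in (2, 4, 8, 16, 32):
    pairs = n_archives * (n_archives - 1) // 2  # "N choose 2"
    print(f"{n_archives:>2} archives -> {pairs:>3} possible pairwise federations")
```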

6 Data Analysis Today
- Download (FTP and GREP) is not adequate:
  - You can GREP 1 MB in a second.
  - You can GREP 1 GB in a minute.
  - You can GREP 1 TB in 2 days.
  - You can GREP 1 PB in 3 years. Oh, and 1 PB is ~10,000 disks.
- At some point we need indices to limit the search, and parallel data search and analysis. This is where databases can help (see the sketch below).
- Next-generation technique: data exploration. Bring the analysis to the data!
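A back-of-the-envelope sketch of why sequential scanning stops working. The ~17 MB/s throughput is an assumption chosen to roughly match the slide's "1 GB in a minute"; the slide's other figures are order-of-magnitude estimates, so the outputs agree only roughly.

```python
# Time to GREP (sequentially scan) a dataset at a fixed throughput.
THROUGHPUT_MB_S = 17.0  # assumed sustained scan rate (~1 GB/minute)

for label, size_mb in [("1 MB", 1.0), ("1 GB", 1e3), ("1 TB", 1e6), ("1 PB", 1e9)]:
    seconds = size_mb / THROUGHPUT_MB_S
    days = seconds / 86_400
    print(f"{label}: {seconds:>13,.0f} s  ({days:>9,.2f} days)")
```

An index changes the picture entirely: a B-tree lookup touches O(log N) pages instead of all of them, which is why the slide points to databases.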

7 Next-Generation Data Analysis
- Looking for needles in haystacks: the Higgs particle. Haystacks: dark matter, dark energy. Needles are easier than haystacks.
- Global statistics have poor scaling: correlation functions are O(N²), likelihood techniques O(N³) (a naive pair count is sketched below).
- As data and computers grow at the same rate, we can only keep up with O(N log N).
- A way out? Discard the notion of optimal (data is fuzzy, answers are approximate); don't assume infinite computational resources or memory.
- Requires a combination of statistics and computer science.
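For contrast with the tree-based approach on a later slide, here is why correlation functions are O(N²): a naive two-point pair count examines every pair of points. The toy catalog below is random data, not a real survey.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0.0, 1.0, size=(2_000, 2))  # small toy catalog
r_max = 0.05                                     # separation threshold

# Brute-force pair counting: every unordered pair is examined once,
# so the work grows as N(N-1)/2 ~ O(N^2) -- untenable at survey scale.
pairs = 0
for i in range(len(points)):
    d = np.linalg.norm(points[i + 1:] - points[i], axis=1)
    pairs += int((d < r_max).sum())
print(f"{pairs} pairs closer than {r_max}")
```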

8 Why Is Astronomy Special?
- It has no commercial value: no privacy concerns, and results are freely shared with others. Great for experimenting with algorithms.
- It is real and well documented: high-dimensional (with confidence intervals), spatial, temporal.
- Diverse and distributed: many different instruments from many different places and many different times.
- The questions are interesting, and there is a lot of data (soon petabytes).

9 The Virtual Observatory
- Many new surveys are coming; SDSS is a dry run for the next ones. LSST will be 5 TB/night.
- All the data will be on the Internet: FTP, web services, ...
- Data and applications will be associated with the projects, distributed worldwide and cross-indexed. Federation is a must.
- The World Wide Telescope will be the best telescope in the world: it finds the "needle in the haystack".
- Successful demonstrations in Jan '03.

10 Dealing with the Astronomy Legacy
Boundary conditions:
- FITS data format
- Software analysis systems
Standards driven by evolving new technologies:
- Exchange of rich and structured data (XML, ...)
- DB connectivity, web services, grid computing
Application to the astronomy domain:
- Data dictionaries (UCDs)
- Data models
- Protocols
- Registries and resource/service discovery
- Provenance, data quality

11 Short History of the VO
- Driven by exponential data growth.
- In the US it started with SDSS + GriPhyN; in Europe it started at CDS (Strasbourg). Continued with NVO + AVO.
- Now: the International Virtual Observatory Alliance, in 14 countries, with total data holdings >200 TB.
- Core services and standards adopted; getting ready for first deployment (mid-2004).

12 Data Analysis: Optimal Statistics
- Brute-force implementations of optimal statistics have poor scaling: correlation functions O(N²), likelihood techniques O(N³).
- As data sizes grow at Moore's-law rates, computers can only keep up with at most O(N log N) algorithms (see the scaling argument below).
- What goes? The notion of optimal is in the sense of statistical errors; it assumes infinite computational resources and that the only source of error is statistical.
- 'Cosmic variance': we can only observe the Universe from one location (finite sample size).
- Solutions require a combination of statistics and CS: new algorithms no worse than O(N log N).
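The scaling claim can be written out as a small worked derivation (a sketch; the yearly doubling rates are the slide's own assumptions). Let data size and compute capacity both double yearly, N(t) = N_0 2^t and C(t) = C_0 2^t; the time to run an algorithm of cost f(N) then scales as f(N(t))/C(t):

```latex
\frac{N^2}{C} \;\sim\; \frac{N_0^2\, 2^{2t}}{C_0\, 2^t} \;\propto\; 2^t
\qquad \text{(quadratic algorithms fall behind exponentially),}

\frac{N \log N}{C} \;\sim\; \frac{N_0\, 2^t \,(\log N_0 + t)}{C_0\, 2^t} \;\propto\; \log N_0 + t
\qquad \text{(near-linear algorithms fall behind only linearly).}
```

Hence only O(N log N) (or cheaper) algorithms essentially keep pace with exponentially growing data.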

13 Organization & Algorithms
- Use clever data structures (trees, cubes): an up-front creation cost, but only O(N log N) access cost, giving a large speedup during the analysis. Tree codes for correlations (A. Moore et al. 2001); data cubes for OLAP (all vendors). A tree-based pair count is sketched below.
- Fast, approximate heuristic algorithms: no need to be more accurate than cosmic variance. Fast CMB analysis by Szapudi et al. (2001): O(N log N) instead of O(N³), i.e., 1 day instead of 10 million years.
- Take the cost of computation into account: a controlled level of accuracy; the best result in a given time, given our computing resources.
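A minimal sketch of tree-based pair counting, the core idea behind O(N log N) correlation estimates. It uses SciPy's kd-tree on a random toy catalog standing in for a galaxy sample; this illustrates the technique rather than reproducing the cited tree codes.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(100_000, 3))  # toy 3-D catalog

tree = cKDTree(points)           # O(N log N) build cost, paid once
radii = np.logspace(-3, -1, 10)  # separation bins

# Dual-tree counting: pairs within each radius are counted by pruning
# whole subtrees, avoiding the naive O(N^2) loop over all point pairs.
pair_counts = tree.count_neighbors(tree, radii)
for r, n in zip(radii, pair_counts):
    print(f"r <= {r:.4f}: {n:>12,} pairs")
```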

14 Analysis and Databases
- Much statistical analysis deals with: creating uniform samples (data filtering), assembling relevant subsets, estimating completeness, censoring bad data, counting and building histograms, generating Monte Carlo subsets, likelihood calculations, hypothesis testing.
- Traditionally these are performed on files; most of these tasks are much better done inside a database (see the sketch below).
- Move Mohamed to the mountain, not the mountain to Mohamed.
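A minimal sketch of pushing the filtering and counting into the database instead of grepping files. The table and column names (objects, ra, dec, r_mag) are hypothetical stand-ins for a real survey schema, and SQLite stands in for a real database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (ra REAL, dec REAL, r_mag REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [(10.1, -1.2, 17.5), (10.3, -1.1, 21.9), (10.8, -0.9, 19.2)],
)

# Filtering and histogramming expressed as a query: the database scans
# (or uses an index on) r_mag, and only the binned counts leave the
# server -- the analysis moves to the data, not the data to the analyst.
for mag_bin, n in conn.execute(
    "SELECT CAST(r_mag AS INTEGER) AS mag_bin, COUNT(*) "
    "FROM objects WHERE r_mag < 20 GROUP BY mag_bin"
):
    print(f"magnitude bin {mag_bin}: {n} objects")
```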

15 Cosmic Microwave Background (figure slide; Szapudi et al. 2002)

16 Data Exploration: A New Way of Doing Science
- Primary access to data is through databases.
- Exponential data growth → distributed data; publication before analysis.
- Large data: move the analysis to where the data is. Distributed computing, data federation.
- New algorithms are needed; the Virtual Observatory is a good example.
- Unavoidable, and emerging in all sciences!

