1
Astrophysics with Terabytes of Data
Alex Szalay
The Johns Hopkins University
2
Living in an Exponential World
Astronomers have a few hundred TB now
–1 pixel (byte) / sq arc second ~ 4TB
–Multi-spectral, temporal, … → 1PB
They mine it looking for
–new (kinds of) objects or more of interesting ones (quasars)
–density variations in 400-D space
–correlations in 400-D space
Data doubles every year
Data is public after 1 year
Same access for everyone
But: how long can this continue?
3
Evolving Science
A thousand years ago: science was empirical, describing natural phenomena
Last few hundred years: a theoretical branch, using models and generalizations
Last few decades: a computational branch, simulating complex phenomena
Today: data exploration (eScience), synthesizing theory, experiment and computation with advanced data management and statistics
4
The Challenges
Data Collection / Discovery and Analysis / Publishing
Exponential data growth: distributed collections, soon Petabytes
New analysis paradigm: data federations, move analysis to data
New publishing paradigm: scientists are publishers and curators
5
Publishing Data
Exponential growth:
–Projects last at least 3-5 years
–Data sent upwards only at the end of the project
–Data will never be centralized
More responsibility on projects
–Becoming Publishers and Curators
Data will reside with projects
–Analyses must be close to the data

Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists
6
Accessing Data
If there is too much data to move around, take the analysis to the data!
Do all data manipulations at the database
–Build custom procedures and functions in the database
Automatic parallelism guaranteed
Easy to build in custom functionality
–Databases & Procedures being unified
–Example: temporal and spatial indexing
–Pixel processing
Easy to reorganize the data
–Multiple views, each optimal for certain analyses
–Building hierarchical summaries is trivial
Scalable to Petabyte datasets
Active databases!
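A minimal sketch of "take the analysis to the data": a custom function is registered inside the database and a summary is computed there, so only the small result leaves the server. SQLite is used here only because it ships with Python; it is not the database the talk describes, and the table layout, the `dec_band` function and the sample rows are all hypothetical.

```python
import math
import sqlite3


def build_demo_db():
    """Create an in-memory catalog with a custom function registered in the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
    )
    # Hypothetical sample rows: (id, right ascension, declination, magnitude).
    rows = [
        (1, 10.0, 0.2, 18.0),
        (2, 10.1, 0.7, 19.0),
        (3, 10.2, 1.4, 20.0),
        (4, 10.3, 1.6, 22.0),
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)
    # Custom functionality lives next to the data: a 1-degree declination band.
    conn.create_function("dec_band", 1, lambda dec: math.floor(dec))
    return conn


def summarize(conn):
    """Build a per-band summary inside the database; only the summary is returned."""
    query = (
        "SELECT dec_band(dec) AS band, COUNT(*), AVG(mag) "
        "FROM objects GROUP BY band ORDER BY band"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(summarize(build_demo_db()))
```

The same pattern scales conceptually: the query ships to the archive, and the client receives two summary rows instead of the full table.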
7
Making Discoveries
Where are discoveries made?
–At the edges and boundaries
–Going deeper, collecting more data, using more colors….
Metcalfe’s law
–Utility of computer networks grows as the number of possible connections: O(N^2)
Federating data
–Federation of N archives has utility O(N^2)
–Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
–Very early discoveries from SDSS, 2MASS, DPOSS
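The O(N^2) claim can be made concrete by counting pairwise links: N archives admit N(N-1)/2 distinct pairs that could be cross-matched. The sketch below just enumerates them; the function names are illustrative, and the archive names are the three surveys mentioned on the slide.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct pairs among n archives: n(n-1)/2, which grows as O(n^2)."""
    return n * (n - 1) // 2


def federation_pairs(archives):
    """List every pair of archives that could be cross-matched."""
    return list(combinations(archives, 2))


if __name__ == "__main__":
    surveys = ["SDSS", "2MASS", "DPOSS"]
    print(federation_pairs(surveys))
    print(pair_count(10))
```

Doubling the number of archives roughly quadruples the number of possible pairings, which is the sense in which federation pays off faster than linearly.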
8
Data Federations
Massive datasets live near their owners:
–Near the instrument’s software pipeline
–Near the applications
–Near data knowledge and curation
–Super Computer centers become Super Data Centers
Each Archive publishes (web) services
–Schema: documents the data
–Methods on objects (queries)
Scientists get “personalized” extracts
Uniform access to multiple Archives
–A common “global” schema
9
The Virtual Observatory
Premise: most data is (or could be) online
Federating the different surveys will provide opportunities for new science
It’s a smart telescope: links objects and data to literature on them
Software became the capital expense
–Share, standardize, reuse..
It has to be SIMPLE
You can form your own small collaborations
10
Strong International Collaboration
Similar efforts now in 15 countries:
–USA, UK, Canada, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO, Spain
Total awarded funding world-wide is over $60M
Active collaboration among projects
–Standards, common demos
–International VO roadmap being developed
–Regular telecons over 10 timezones
Formal collaboration: International Virtual Observatory Alliance (IVOA)
11
Boundary Conditions
Dealing with the astronomy legacy
–FITS data format
–Software systems
Standards driven by evolving new technologies
–Exchange of rich and structured data (XML…)
–DB connectivity, Web Services, Grid computing
External funding climate
Application to astronomy domain
–Data dictionaries (UCDs)
–Data models
–Protocols
–Registries and resource/service discovery
–Provenance, data quality, DATA CURATION!!!!
12
Current VO Challenges
How to avoid trying to be everything for everybody?
Database connectivity is essential
–Bring the analysis to the data
Core web services, higher level applications on top
Use the 90-10 rule:
–Define the standards and interfaces
–Build the framework
–Build the 10% of services that are used by 90%
–Let the users build the rest from the components
Rapidly changing “outside world”
Make it simple!!!
13
Where are we going?
Relatively easy to predict until 2010
–Exponential growth continues
–Most ground based observatories join the VO
–More and more sky surveys in different wavebands
–Simulations will have VO interfaces: can be ‘observed’
Much harder beyond 2010
–PetaSurveys are coming on line (Pan-STARRS, VISTA, LSST)
–Technological predictions much harder
–Changing funding climate
–Changing sociology
14
Similarities to HEP

HEP                    Optical Astronomy
Van de Graaff          2.5m telescopes
Cyclotrons             4m telescopes
National Labs          8m class telescopes
International (CERN)   Surveys/Time Domain
SSC vs LHC             30-100m telescopes

Similar trends with a 20 year delay:
–fewer and ever bigger projects…
–increasing fraction of cost is in software…
–more conservative engineering…
Can the exponential continue, or will it be logistic?
What can astronomy learn from High Energy Physics?
15
Why Is Astronomy Different?
Especially attractive for the wide public
It has no commercial value
–No privacy concerns, freely share results with others
–Great for experimenting with algorithms
Data has more dimensions
–Spatial, temporal, cross-correlations
Diverse and distributed
–Many different instruments from many different places and many different times
Many different interesting questions
16
Trends

CMB Surveys
1990  COBE        1000
2000  Boomerang   10,000
2002  CBI         50,000
2003  WMAP        1 Million
2008  Planck      10 Million

Galaxy Redshift Surveys
1986  CfA         3500
1996  LCRS        23000
2003  2dF         250000
2005  SDSS        750000

Angular Galaxy Surveys
1970  Lick        1M
1990  APM         2M
2005  SDSS        200M
2008  VISTA       1000M
2012  LSST        3000M

Time Domain
QUEST
SDSS Extension survey
Dark Energy Camera
Pan-STARRS
SNAP…
LSST…

Petabytes/year by the end of the decade…
17
Challenges
Real-Time Detection for 3B objects
Pixels (exponential growth slowing down)
Size projection: 100PB by 2020
Data Transfer (grows slower than data)
Data Access (hierarchical usage)
Fault Tolerance and Data Protection
[Figure: access tiers Tier0, Tier1, Tier2, with labels Fast, 1%, 10%, 100%]
18
SkyServer
Sloan Digital Sky Survey: Pixels + Objects
About 500 attributes per “object”, 300M objects
Spectra for 1M objects
Currently 2TB fully public
Prototype eScience lab
–Moving analysis to the data
–Fast searches: color, spatial
Visual tools
–Join pixels with objects
Prototype in data publishing
–70 million web hits in 3.5 years
http://skyserver.sdss.org/
19
Public Data Release: Versions!
June 2001: EDR
–Early Data Release
July 2003: DR1
–Contains 30% of final data
–150 million photo objects
July 2005: DR4 at 3.5TB
–60% of data
4 versions of the data
–Target, best, runs, spectro
Total catalog volume 5TB
–See Terascale sneakernet paper…
Published releases served forever
–EDR, DR1, DR2, ….
–Soon to include email archives, annotations
O(N^2) – only possible because of Moore’s Law!
[Figure: releases EDR, DR1, DR2, DR3]
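One reading of the O(N^2) remark is that if each release is larger than the last and every release stays online forever, the total storage is the running sum of all release sizes. The sketch below illustrates that reading only; the release sizes are hypothetical and are not the actual SDSS volumes.

```python
def storage_online(release_sizes_tb):
    """Total storage needed after each release when all earlier releases stay online."""
    totals = []
    running = 0.0
    for size in release_sizes_tb:
        running += size
        totals.append(running)
    return totals


if __name__ == "__main__":
    # Hypothetical sizes growing linearly: release k holds k TB.
    sizes = [1.0, 2.0, 3.0, 4.0]
    # The final total is N(N+1)/2 for N releases, i.e. quadratic in N.
    print(storage_online(sizes))
```

With linearly growing releases, the cumulative total after N releases is N(N+1)/2, so the serving cost grows quadratically even though each individual release grows only linearly.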
20
Spatial Features
Precomputed Neighbors
–All objects within 30”
Boundaries, Masks and Outlines
–27,000 spatial objects
–Stored as spatial polygons
Time Domain: Precomputed Match
–All objects within 1”, observed at different times
–Found duplicates due to telescope tracking errors
–Manual fix, recorded in the database
MatchHead
–The first observation of the linked list used as unique id to chain of observations of the same object
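The MatchHead idea can be sketched as follows: observations are processed in time order, and each one inherits the identifier of the first observation in its chain if an earlier observation lies within the match radius. This is a brute-force illustration under stated assumptions, not the SkyServer implementation; the tuple layout, function names and sample coordinates are hypothetical, and the small-angle distance formula is only adequate for tiny separations away from the poles.

```python
import math


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Small-angle separation in arcseconds for positions given in degrees."""
    mean_dec = math.radians((dec1 + dec2) / 2.0)
    dra = (ra1 - ra2) * math.cos(mean_dec)
    ddec = dec1 - dec2
    return math.hypot(dra, ddec) * 3600.0


def assign_match_heads(observations, radius_arcsec=1.0):
    """Map each observation id to the id of the first observation in its chain.

    observations: list of (obs_id, ra_deg, dec_deg, time).
    """
    heads = {}
    seen = []
    for obs_id, ra, dec, t in sorted(observations, key=lambda o: o[3]):
        head = obs_id  # starts a new chain unless an earlier match is found
        for prev_id, pra, pdec, pt in seen:
            # Only observations taken at a different time count as a match.
            if pt != t and separation_arcsec(ra, dec, pra, pdec) <= radius_arcsec:
                head = heads[prev_id]
                break
        heads[obs_id] = head
        seen.append((obs_id, ra, dec, t))
    return heads


if __name__ == "__main__":
    obs = [
        (101, 10.0, 0.0, 1),
        (205, 10.0001, 0.0001, 2),
        (310, 10.0, 0.0002, 3),
        (400, 11.0, 0.0, 1),
    ]
    print(assign_match_heads(obs))
```

The inner loop compares every observation with all earlier ones, so this version is quadratic; a precomputed match over a large catalog would need a spatial index to restrict the candidates.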
21
Things Can Get Complex
22
3 Ways To Do Spatial
Hierarchical Triangular Mesh (extension to SQL)
–Uses table valued stored procedures
–Acts as a new “spatial access method”
–Ported to Yukon CLR for a 17x speedup.
Zones: fits SQL well
–Surprisingly simple & good on a fixed scale
Constraints: a novel idea
–Lets us do algebra on regions, implemented in pure SQL
Paper: There Goes the Neighborhood: Relational Algebra for Spatial Data Search
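A minimal sketch of the zone idea, written in Python rather than SQL: the sky is cut into fixed-height declination bands, objects are bucketed by band and sorted by right ascension, and a neighbor search only inspects the few bands and the narrow RA window that can contain a match. The zone height, function names and sample objects are assumptions made for illustration, and RA wraparound at 0/360 degrees is deliberately ignored.

```python
import math
from bisect import bisect_left
from collections import defaultdict

ZONE_HEIGHT_DEG = 30.0 / 3600.0  # assumed zone height: 30 arcseconds


def zone_id(dec_deg, height=ZONE_HEIGHT_DEG):
    """Index of the declination band containing dec_deg."""
    return int(math.floor((dec_deg + 90.0) / height))


def build_zones(objects, height=ZONE_HEIGHT_DEG):
    """Bucket (obj_id, ra, dec) tuples by zone; each bucket is sorted by RA."""
    zones = defaultdict(list)
    for obj_id, ra, dec in objects:
        zones[zone_id(dec, height)].append((ra, dec, obj_id))
    for bucket in zones.values():
        bucket.sort()
    return zones


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula), inputs in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = math.sin((d2 - d1) / 2) ** 2 + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0


def neighbors(zones, ra, dec, radius_arcsec, height=ZONE_HEIGHT_DEG):
    """Ids of objects within radius_arcsec of (ra, dec); no RA wraparound handling."""
    radius_deg = radius_arcsec / 3600.0
    # Widen the RA window away from the equator, where RA circles shrink.
    cos_dec = max(math.cos(math.radians(abs(dec) + radius_deg)), 1e-9)
    ra_window = radius_deg / cos_dec
    found = []
    for z in range(zone_id(dec - radius_deg, height), zone_id(dec + radius_deg, height) + 1):
        bucket = zones.get(z, [])
        start = bisect_left(bucket, (ra - ra_window,))
        for ora, odec, oid in bucket[start:]:
            if ora > ra + ra_window:
                break
            if separation_arcsec(ra, dec, ora, odec) <= radius_arcsec:
                found.append(oid)
    return sorted(found)


if __name__ == "__main__":
    catalog = [(1, 10.0, 0.0), (2, 10.002, 0.001), (3, 10.5, 0.0), (4, 10.0, 0.01)]
    print(neighbors(build_zones(catalog), 10.0, 0.0, 30.0))
```

The coarse filter (zone range plus RA window) maps naturally onto an indexed range query, which is presumably why the slide says the approach fits SQL well; the exact distance test is applied only to the survivors.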
23
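[Figure: zone-based cross-match pipeline. Source tables: 2MASS (471 Mrec, 140 GB) and USNOB (1.1 Brec, 233 GB). Both are split into zones; the 2MASS:USNOB zone:zone comparison covers neighboring zones 0:-1, 0:0 and 0:+1 before moving to the next zone. Intermediate sizes shown: 64 Mrec 2 GB, 260 Mrec 9 GB, 26 Mrec 1 GB, 350 Mrec 12 GB, 350 Mrec 12 GB. The 2MASS→USNOB and USNOB→2MASS results are merged into the answer. Timings shown: Build Index 2 hours, .5 hour. Pipeline parallelism: 2.5 hours, or as fast as we can read USNOB + .5 hours.]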
24
Next-Generation Data Analysis
Looking for
–Needles in haystacks – the Higgs particle
–Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
‘Optimal’ statistics have poor scaling
–Correlation functions are N^2, likelihood techniques N^3
–For large data sets main errors are not statistical
As data and computers grow with Moore’s Law, we can only keep up with N logN
A way out?
–Discard notion of optimal (data is fuzzy, answers are approximate)
–Don’t assume infinite computational resources or memory
Requires combination of statistics & computer science
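The "we can only keep up with N logN" argument can be checked with a little arithmetic. Assume, as a simplification, that the data volume and the available computing power both double once per period. The runtime of an algorithm with cost f(N) then changes by f(2^k N)/f(N) divided by 2^k after k periods. The starting size and the function name below are illustrative assumptions.

```python
import math


def relative_runtime(cost, n0, periods):
    """Runtime after `periods` doublings of both data and compute, relative to today."""
    n = n0 * 2 ** periods
    speedup = 2 ** periods
    return cost(n) / cost(n0) / speedup


if __name__ == "__main__":
    n0 = 10 ** 6  # hypothetical starting data set size
    for name, cost in [
        ("N log N", lambda n: n * math.log(n)),
        ("N^2", lambda n: n ** 2),
        ("N^3", lambda n: n ** 3),
    ]:
        print(name, relative_runtime(cost, n0, 3))
```

Under these assumptions an N^2 method takes 8 times longer after three doublings and an N^3 method 64 times longer, while an N log N method slows down only by the ratio of the logarithms.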
25
Organization & Algorithms
Use of clever data structures (trees, cubes):
–Up-front creation cost, but only N logN access cost
–Large speedup during the analysis
–Tree-codes for correlations (A. Moore et al 2001)
–Data Cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
–No need to be more accurate than cosmic variance
–Fast CMB analysis by Szapudi et al (2001): N logN instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
–Controlled level of accuracy
–Best result in a given time, given our computing resources
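As a one-dimensional stand-in for the "up-front creation cost, then cheap access" pattern, the sketch below counts pairs of points closer than a given distance in two ways: a brute-force N^2 loop, and a version that sorts once and then uses binary search per point. It is not the tree code cited on the slide, only a simple illustration of the same trade-off; the function names and sample points are hypothetical.

```python
from bisect import bisect_right


def count_pairs_brute(points, r):
    """Count pairs with |a - b| <= r by checking every pair: O(N^2)."""
    total = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= r:
                total += 1
    return total


def count_pairs_sorted(points, r):
    """Same count after a one-time sort: O(N log N) overall."""
    xs = sorted(points)  # up-front structure-building cost
    total = 0
    for i, x in enumerate(xs):
        # Index just past the last point within distance r to the right of x.
        j = bisect_right(xs, x + r)
        total += j - i - 1
    return total


if __name__ == "__main__":
    sample = [3.0, 0.0, 1.2, 0.5, 3.1]
    print(count_pairs_brute(sample, 1.0), count_pairs_sorted(sample, 1.0))
```

Both functions return the same count on the sample data; the difference shows up only in how the work grows with the number of points.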
26
Today’s Questions
Discoveries
–Need fast outlier detection
Spatial statistics
–Fast correlation and power spectrum codes (CMB + galaxies)
–Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on sphere)
Time-domain:
–Transients, supernovae, periodic variables
–Moving objects, ‘killer’ asteroids, Kuiper-belt objects….
27
Other Challenges
Statistical noise is smaller and smaller
–Error matrix larger and larger (Planck…)
Systematic errors becoming dominant
–De-sensitize against known systematic errors
–Optimal subspace filtering (…SDSS stripes…)
Comparisons of spectra to models
–10^6 spectra vs 10^8 models (Charlot…)
Detection of faint sources in multi-spectral images
–How to use all information optimally (QUEST…)
Efficient visualization of ensembles of 100M+ data points
28
Systematic Errors
SDSS P(k), main issue:
–Effects of zero points, flat field vectors result in large scale, correlated patterns
Two tasks:
–Estimate how large is the effect
–De-sensitize statistics
Monte-Carlo simulations:
–100 million random points, assigned to stripes, runs, camcols, fields, x,y positions and redshifts => database
–Build MC error matrix due to zeropoint errors
Include error matrix in the KL basis
–Some modes sensitive to zero points (# of free pmts)
–Eliminate those modes from the analysis => projection
Statistics insensitive to zero points afterwards
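The "eliminate those modes => projection" step can be illustrated with plain vectors. Assuming the contaminated modes are orthonormal, subtracting the component of the data along each of them leaves a vector that no longer changes when a zero-point error shifts the data along those modes. How the modes are found (the KL basis and the Monte-Carlo error matrix) is not shown here, and the vectors below are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(data, modes):
    """Remove the components of `data` along each mode (modes assumed orthonormal)."""
    cleaned = list(data)
    for mode in modes:
        coeff = dot(cleaned, mode)
        cleaned = [x - coeff * m for x, m in zip(cleaned, mode)]
    return cleaned


if __name__ == "__main__":
    zero_point_mode = [0.6, 0.8, 0.0]  # hypothetical unit-length contaminated mode
    data = [1.0, 2.0, 3.0]
    # The same data with an unknown zero-point error added along the mode.
    shifted = [d + 5.0 * m for d, m in zip(data, zero_point_mode)]
    print(project_out(data, [zero_point_mode]))
    print(project_out(shifted, [zero_point_mode]))
```

Both calls return the same vector up to rounding, which is the sense in which the statistics become insensitive to zero points after the projection; the price is that any real signal along the removed modes is discarded too.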
29
Simulations
Cosmological simulations have 10^9 particles and produce over 30TB of data (Millennium)
Build up dark matter halos
Track merging history of halos
Use it to assign star formation history
Combination with spectral synthesis
Too few realizations
Hard to analyze the data afterwards
What is the best way to compare to the real universe?
30
Summary
Databases became an essential part of astronomy: most data access will soon be via digital archives
Data at separate locations, distributed worldwide, evolving in time: move analysis, not data!
Good scaling of statistical algorithms essential
Many outstanding problems in astronomy are statistical, current techniques inadequate, we need help!
The Virtual Observatory is a new paradigm for doing science: the science of Data Exploration!