Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford),


1 Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)

2 The Age of Surveys
CMB Surveys (pixels)
–1990 COBE 1000
–2000 Boomerang 10,000
–2002 CBI 50,000
–2003 WMAP 1 Million
–2008 Planck 10 Million
Galaxy Redshift Surveys (obj)
–1986 CfA 3500
–1996 LCRS 23000
–2003 2dF 250000
–2008 SDSS 1000000
–2012 BOSS 2000000
–2012 LAMOST 2500000
Angular Galaxy Surveys (obj)
–1970 Lick 1M
–1990 APM 2M
–2005 SDSS 200M
–2011 PS1 1000M
–2020 LSST 30000M
Time Domain
–QUEST
–SDSS Extension survey
–Dark Energy Camera
–Pan-STARRS
–LSST… Petabytes/year …

3 Sloan Digital Sky Survey
“The Cosmic Genome Project”
Two surveys in one
–Photometric survey in 5 bands
–Spectroscopic redshift survey
Data is public
–2.5 Terapixels of images => 5 Tpx
–10 TB of raw data => 120 TB processed
–0.5 TB catalogs => 35 TB in the end
Started in 1992, finished in 2008
Extra data volume enabled by
–Moore’s Law
–Kryder’s Law

4 Analysis of Galaxy Spectra
Sparse signal in large dimensions
Much noise, and very rare events
4K x 1M SVD problem, perfect for randomized algorithms
Motivated our work on robust incremental PCA

5 Galaxy Properties from Galaxy Spectra
Continuum Emissions
Spectral Lines

6 Galaxy Diversity from PCA
1st PC: [Average Spectrum]
2nd PC: [Stellar Continuum]
3rd PC: [Finer Continuum Features + Age]
4th PC: [Age] Balmer series hydrogen lines
5th PC: [Metallicity] Mg b, Na D, Ca II Triplet

7 Streaming PCA
Initialization
–Eigensystem of a small, random subset
–Truncate at p largest eigenvalues
Incremental updates
–Mean and the low-rank A matrix
–SVD of A yields new eigensystem
Randomized algorithm!
T. Budavari, D. Mishin 2011
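The initialization and update steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the weight `gamma = n / (n + 1)` (equal weighting of all spectra seen so far), and the exact form of the low-rank matrix `A` are assumptions, and the robust weighting from the next slide is left out.

```python
import numpy as np

def init_eigensystem(batch, p):
    """Eigensystem of a small subset, truncated at the p largest eigenvalues."""
    mean = batch.mean(axis=0)
    _, s, vt = np.linalg.svd(batch - mean, full_matrices=False)
    eigvals = s[:p] ** 2 / (len(batch) - 1)
    return mean, vt[:p].T, eigvals          # eigenvectors stored as columns

def update(mean, E, lam, y, n):
    """Fold one new vector y into the eigensystem; n = vectors seen before y."""
    gamma = n / (n + 1.0)                   # assumed weighting: no forgetting
    mean = gamma * mean + (1.0 - gamma) * y
    r = y - mean
    # Low-rank matrix A with A @ A.T ~ gamma * E diag(lam) E.T + (1 - gamma) r r.T
    A = np.column_stack([np.sqrt(gamma * lam) * E, np.sqrt(1.0 - gamma) * r])
    U, w, _ = np.linalg.svd(A, full_matrices=False)
    p = E.shape[1]
    return mean, U[:, :p], w[:p] ** 2       # SVD of A yields the new eigensystem
```

Each update costs one SVD of a matrix with only p + 1 columns, which is what makes a single pass over a large catalog feasible.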

8 Robust PCA
PCA minimizes $\sigma_{\mathrm{RMS}}$ of the residuals $r = y - Py$
–Quadratic formula: $r^2$ extremely sensitive to outliers
We optimize a robust M-scale $\sigma^2$ (Maronna 2005)
–Implicitly given by [equation missing from transcript]
Fits in with the iterative method!
Outliers can be processed separately
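The defining equation did not survive in this transcript. In the robust-statistics literature an M-scale is usually defined implicitly by $\frac{1}{N}\sum_i \rho(r_i^2/\sigma^2) = \delta$ with a bounded $\rho$. The sketch below assumes that form, a bisquare-type $\rho$, and $\delta = 0.5$, and solves for $\sigma^2$ by fixed-point iteration. The function names and these choices are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rho(t):
    """Bounded bisquare-type rho applied to t = r^2 / sigma^2 (assumed choice)."""
    t = np.minimum(t, 1.0)
    return 1.0 - (1.0 - t) ** 3

def m_scale_sq(r2, delta=0.5, tol=1e-12, max_iter=1000):
    """Solve mean(rho(r2 / s2)) = delta for s2 by fixed-point iteration."""
    r2 = np.asarray(r2, dtype=float)
    s2 = np.median(r2)                       # starting guess
    if s2 <= 0:
        s2 = r2.mean()
    for _ in range(max_iter):
        new = s2 * rho(r2 / s2).mean() / delta
        if abs(new - s2) <= tol * s2:
            return new
        s2 = new
    return s2
```

Because $\rho$ saturates at 1, a gross outlier contributes no more than any other large residual, whereas in the mean of $r^2$ it can dominate the sum.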

9 Eigenvalues in Streaming PCA
[Figure panels: Classic, Robust]

10 Examples with SDSS Spectra Built on top of the Incremental Robust PCA Principal Component Pursuit (I. Csabai et al) Importance sampling (C-W Yip et al)

11 Principal component pursuit
Low rank approximation of data matrix: X
Standard PCA:
–works well if the noise distribution is Gaussian
–outliers can cause bias
Principal component pursuit
–“sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low
–NP-hard problem
The L1 trick:
–numerically feasible convex problem (Augmented Lagrange Multiplier)
* E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009.
Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)
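A minimal sketch of the convex relaxation, assuming the usual formulation: minimize $\|L\|_* + \lambda\|S\|_1$ subject to $L + S = X$, solved with an augmented Lagrange multiplier loop. The default $\lambda = 1/\sqrt{\max(n_1, n_2)}$ and the fixed penalty `mu` follow common practice for this method and are assumptions here. The function names are illustrative, and this is not the code used for the SDSS results.

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: the proximal step for the L1 term."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: the proximal step for the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(M, lam=None, tol=1e-7, max_iter=1000):
    """Split M into low-rank L and sparse S with an ALM iteration."""
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))     # assumed default weight
    mu = n1 * n2 / (4.0 * np.abs(M).sum())   # assumed fixed penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                     # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S                        # constraint violation
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S
```

Passing a different `lam` changes the trade-off between rank and sparsity. A smaller value lets more of the data go into the sparse term.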

12 Testing on Galaxy Spectra
Slowly varying continuum + absorption lines
Highly variable “sparse” emission lines
This is the simple version of PCP: the positions of the lines are known, but there are many of them, so automatic detection can be useful
Spiky noise can bias standard PCA
DATA: Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.)
–SDSS 1M galaxy spectra
–Morphological subclasses
–Robust averages + first few PCA directions

13 PCA
[Figure panels: PCA reconstruction, Residual]

14 Principal component pursuit
[Figure panels: Low rank, Sparse, Residual; λ = 0.6/sqrt(n), ε = 0.03]

15 Not Every Data Direction is Equal
A = C X, where A is Galaxy ID × Wavelength, C is Galaxy ID × Selected Wavelengths, and X is Selected Wavelengths × Wavelength
Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = K
3. Calculate Leverage Score = $\sum_{i=1}^{K} \|V^T_{ij}\|^2 / K$
Mahoney and Drineas 2009
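The three-step procedure can be sketched directly with NumPy. The function names and the sampling rule `min(1, c * p_j)` are assumptions for illustration, and the slide does not say how columns are drawn once the scores are known.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized leverage score of each column (wavelength) of A."""
    _, _, vt = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^T
    return (vt[:k] ** 2).sum(axis=0) / k               # sums to 1 over columns

def sample_columns(A, k, c, rng):
    """Keep column j with probability min(1, c * p_j) (assumed sampling rule)."""
    p = leverage_scores(A, k)
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
    return np.flatnonzero(keep)
```

With rows as galaxies and columns as wavelength pixels, `leverage_scores(A, k)` gives one probability per wavelength, and `A[:, sample_columns(A, k, c, rng)]` plays the role of C.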

16 Wavelength Sampling Probability
[Figure panels: k = 2, c = 7; k = 4, c = 16; k = 6, c = 25; k = 8, c = 29]

17 Ranking Astronomical Line Indices
(Yip et al. 2012 in prep.) (Worthey et al. 94; Trager et al. 98)
Subspace Analysis of Spectra Cutouts:
–Orthogonality
–Divergence
–Commonality

18 Identify Informative Regions
“NewMethod”
1. Pick the λ with largest $P_\lambda$
2. Define its region of influence using $\sum_\lambda P_\lambda$ convergence. Mask λ’s from future selection.
3. Go back to Step 1, or quit.
“MahoneySecond”
1. Over-select λ’s relative to the targeted number.
2. Merge selected λ’s if two pixels lie within a certain distance.
3. Quit.
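A minimal sketch of the “MahoneySecond” steps, under stated assumptions: pixels are over-selected by taking the highest-scoring ones (the slide does not say whether they are ranked or randomly sampled), the wavelength grid is sorted in increasing order, and `merge_selected`, `n_target`, `overselect`, and `max_gap` are hypothetical names.

```python
import numpy as np

def merge_selected(wavelengths, scores, n_target, overselect=10, max_gap=30.0):
    """Over-select pixels by score, then merge neighbors closer than max_gap."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    n_pick = min(len(scores), n_target * overselect)
    idx = np.sort(np.argsort(scores)[::-1][:n_pick])   # top pixels, in grid order
    picked = wavelengths[idx]
    regions = []
    start = prev = picked[0]
    for w in picked[1:]:
        if w - prev < max_gap:          # close enough: extend the current region
            prev = w
        else:                           # gap too large: close it, start a new one
            regions.append((float(start), float(prev)))
            start = prev = w
    regions.append((float(start), float(prev)))
    return regions
```

Each returned tuple is the (first, last) wavelength of one merged region, and an isolated pixel gives a region of zero width.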

19 Identifying New Line Indices, Objectively (Yip et al. 2012 in prep.)

20 New Spectral Regions
(MahoneySecond; k = 5; overselecting 10×; combining if < 30 Å)

21 NewMethod vs MahoneySecond
[Figure panels: NM, M2]

22 (Gunawan & Neswan 2000)

23 Angle between Subspaces
[Figure panels: JHU, Lick]
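One common way to quantify the angle between two subspaces is through principal angles, computed from the singular values of $Q_1^T Q_2$ for orthonormal bases $Q_1$, $Q_2$. The sketch below assumes that definition. Whether it matches the one used for this figure, or the one in the Gunawan & Neswan reference, is not stated in the transcript.

```python
import numpy as np

def principal_angles(B1, B2):
    """Principal angles (radians, ascending) between the column spaces of B1 and B2."""
    Q1, _ = np.linalg.qr(B1)            # orthonormal basis for span(B1)
    Q2, _ = np.linalg.qr(B2)            # orthonormal basis for span(B2)
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))
```

Angles near zero mean that two index regions span nearly the same directions, and angles near π/2 mean they are close to orthogonal.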

24 $\sum_\lambda P_\lambda$
[Figure panels: JHU, Lick]

25 Line Indices for Galaxy Parameter Estimations

26 Importance Sampling and Galaxies
Lick indices are ad hoc
The new indices are objective
–Recover atomic lines
–Recover molecular bands
–Recover Lick indices
–Informative regions are orthogonal to each other, in contrast to Lick
Future
–Emission line indices
–More accurate parameter estimation of galaxies

27 Summary
Non-incremental changes on the way
Science is moving increasingly from hypothesis-driven to data-driven discoveries
Need randomized, incremental algorithms
–Best result in 1 min, 1 hour, 1 day, 1 week
New computational tools and strategies
… not just statistics, not just computer science, not just astronomy, not just genomics…
Astronomy has always been data-driven… now becoming more generally accepted

