Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford),

Presentation transcript:

Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)

The Age of Surveys
CMB Surveys (pixels): 1990 COBE Boomerang 10, CBI 50, WMAP 1 Million 2008 Planck 10 Million
Galaxy Redshift Surveys (obj): 1986 CfA, LCRS, 2dF, SDSS, BOSS, LAMOST
Angular Galaxy Surveys (obj): 1970 Lick 1M, 1990 APM 2M, 2005 SDSS 200M, 2011 PS1 1000M, 2020 LSST 30000M
Time Domain: QUEST, SDSS Extension survey, Dark Energy Camera, Pan-STARRS, LSST…
Petabytes/year …

Sloan Digital Sky Survey
“The Cosmic Genome Project”
Two surveys in one
- Photometric survey in 5 bands
- Spectroscopic redshift survey
Data is public
- 2.5 Terapixels of images => 5 Tpx
- 10 TB of raw data => 120 TB processed
- 0.5 TB catalogs => 35 TB in the end
Started in 1992, finished in 2008
Extra data volume enabled by
- Moore’s Law
- Kryder’s Law

Analysis of Galaxy Spectra
- Sparse signal in large dimensions
- Much noise, and very rare events
- 4K x 1M SVD problem, perfect for randomized algorithms
- Motivated our work on robust incremental PCA

Galaxy Properties from Galaxy Spectra
Continuum Emissions; Spectral Lines

Galaxy Diversity from PCA
- 1st PC: [Average Spectrum]
- 2nd PC: [Stellar Continuum]
- 3rd PC: [Finer Continuum Features + Age]
- 4th PC: [Age] Balmer series hydrogen lines
- 5th PC: [Metallicity] Mg b, Na D, Ca II Triplet

Streaming PCA
Initialization
- Eigensystem of a small, random subset
- Truncate at p largest eigenvalues
Incremental updates
- Mean and the low-rank A matrix
- SVD of A yields new eigensystem
Randomized algorithm!
T. Budavari, D. Mishin 2011
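The update cycle on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the forgetting factor `gamma`, and the exact way the old eigensystem and the new observation are packed into the low-rank matrix A are all assumptions made for the example.

```python
import numpy as np

def init_eigensystem(X0, p):
    """Eigensystem of a small subset X0 (rows = spectra), truncated to p components."""
    mean = X0.mean(axis=0)
    # Right singular vectors of the centered subset are the covariance eigenvectors.
    _, s, Vt = np.linalg.svd(X0 - mean, full_matrices=False)
    E = Vt[:p].T                       # (d, p), eigenvectors as columns
    lam = s[:p] ** 2 / X0.shape[0]     # eigenvalues of the sample covariance
    return mean, E, lam

def update(mean, E, lam, x, gamma=0.99):
    """Fold one new observation x into the running mean and eigensystem.

    gamma is an assumed exponential forgetting factor (hypothetical parameter).
    """
    mean = gamma * mean + (1.0 - gamma) * x
    y = x - mean
    # Low-rank matrix A with A @ A.T = gamma * E diag(lam) E.T + (1 - gamma) * y y.T
    A = np.hstack([E * np.sqrt(gamma * lam), np.sqrt(1.0 - gamma) * y[:, None]])
    U, w, _ = np.linalg.svd(A, full_matrices=False)
    p = E.shape[1]
    # Left singular vectors of A are the new eigenvectors; truncate back to p.
    return mean, U[:, :p], w[:p] ** 2
```

Each update costs one SVD of a d x (p+1) matrix, so the full data set never has to be held in memory; spectra can be fed to `update` one at a time after `init_eigensystem` has been run on a small random subset.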

Robust PCA
PCA minimizes $\sigma_{RMS}$ of the residuals $r = y - Py$
- Quadratic formula: $r^2$ extremely sensitive to outliers
We optimize a robust M-scale $\sigma^2$ (Maronna 2005)
- Implicitly given by [equation]
- Fits in with the iterative method!
Outliers can be processed separately
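To make the idea of an implicitly defined robust scale concrete, here is a small sketch. It assumes the textbook M-scale form, in which $\sigma^2$ solves $\frac{1}{N}\sum_n \rho(r_n^2/\sigma^2) = \delta$ for a bounded function $\rho$. The bisquare $\rho$, the tuning constant `c = 1.547`, and `delta = 0.5` are conventional choices picked for illustration and are not taken from the slides.

```python
import numpy as np

def rho_bisquare(t, c=1.547):
    """Bounded rho applied to squared scaled residuals t = r^2 / sigma^2."""
    u2 = np.minimum(t / c**2, 1.0)     # saturates at 1, so outliers have bounded influence
    return 1.0 - (1.0 - u2) ** 3

def m_scale(r, delta=0.5, tol=1e-10, max_iter=500):
    """Solve mean(rho(r^2 / sigma^2)) = delta for sigma by fixed-point iteration.

    Assumes the residuals are not mostly zero, so the starting value is positive.
    """
    r2 = np.asarray(r, dtype=float) ** 2
    sigma2 = np.median(r2)             # rough robust starting value
    for _ in range(max_iter):
        new = sigma2 * rho_bisquare(r2 / sigma2).mean() / delta
        if abs(new - sigma2) <= tol * sigma2:
            return float(np.sqrt(new))
        sigma2 = new
    return float(np.sqrt(sigma2))
```

Because $\rho$ saturates, a gross outlier contributes at most 1 to the average regardless of its size, which is why this scale stays near the bulk of the residuals where the RMS does not.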

Eigenvalues in Streaming PCA [Figure panels: Classic; Robust]

Examples with SDSS Spectra
Built on top of the Incremental Robust PCA
- Principal Component Pursuit (I. Csabai et al)
- Importance sampling (C-W Yip et al)

Principal component pursuit
Low rank approximation of data matrix: X
Standard PCA:
- works well if the noise distribution is Gaussian
- outliers can cause bias
Principal component pursuit
- “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low
- NP-hard problem
The L1 trick:
- numerically feasible convex problem (Augmented Lagrange Multiplier)
* E. Candes, et al. “Robust Principal Component Analysis”. preprint, Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)
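A minimal sketch of the convex relaxation follows, assuming the common formulation: minimize $\|L\|_* + \lambda \|S\|_1$ subject to $X = L + S$, solved with a basic augmented Lagrange multiplier loop. The default values for `lam` and `mu` are widely used heuristics and are assumptions here; the slides do not spell out the solver settings.

```python
import numpy as np

def shrink(M, tau):
    """Elementwise soft-thresholding (proximal operator of the L1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Soft-threshold the singular values (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(X, lam=None, mu=None, tol=1e-6, max_iter=2000):
    """Split X into a low-rank part L and a sparse part S with X ~ L + S."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam       # assumed default
    mu = m * n / (4.0 * np.abs(X).sum()) if mu is None else mu   # assumed default
    S = np.zeros_like(X)
    Y = np.zeros_like(X)               # Lagrange multiplier for the constraint
    norm_X = np.linalg.norm(X)
    for _ in range(max_iter):
        L = svd_threshold(X - S + Y / mu, 1.0 / mu)
        S = shrink(X - L + Y / mu, lam / mu)
        R = X - L - S                  # constraint violation
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_X:
            break
    return L, S
```

The two proximal steps mirror the two penalties: singular-value thresholding keeps the rank of L low, and elementwise thresholding keeps S sparse, so spiky outliers end up in S instead of biasing the low-rank component.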

Testing on Galaxy Spectra
Slowly varying continuum + absorption lines
Highly variable “sparse” emission lines
This is the simple version of PCP: the positions of the lines are known, but there are many of them, so automatic detection can be useful
Spiky noise can bias standard PCA
DATA:
- Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.)
- SDSS 1M galaxy spectra
- Morphological subclasses
- Robust averages + first few PCA directions

PCA [Figure panels: PCA reconstruction; Residual]

Principal component pursuit [Figure panels: Low rank; Sparse; Residual] λ = 0.6/sqrt(n), ε = 0.03

Not Every Data Direction is Equal
A = C X [Figure: A is Galaxy ID × Wavelength; C is Galaxy ID × Selected Wavelengths; X is Selected Wavelengths × Wavelength]
Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = K
3. Calculate Leverage Score $= \sum_i \|V^T_{ij}\|^2 / K$
Mahoney and Drineas 2009
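The three-step procedure can be sketched in a few lines of NumPy. The layout (rows are galaxies, columns are wavelengths) follows the slide; the function names and the column-sampling helper are hypothetical additions for illustration.

```python
import numpy as np

def leverage_scores(A, k):
    """Leverage score of each column (wavelength) from the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Each row of Vt has unit norm, so the scores sum to 1 and act as probabilities.
    return (Vt[:k] ** 2).sum(axis=0) / k

def sample_columns(scores, c, rng=None):
    """Draw c column indices with probability proportional to leverage (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    picks = rng.choice(len(scores), size=c, replace=True, p=scores)
    return np.unique(picks)            # sorted, duplicates removed
```

Given the scores `p = leverage_scores(A, k)`, the selected-wavelength matrix C from the slide would be `A[:, sample_columns(p, c)]`. Sampling with replacement and then removing duplicates is only one possible convention.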

Wavelength Sampling Probability [Figure panels: k = 2, c = 7; k = 4, c = 16; k = 6, c = 25; k = 8, c = 29]

Ranking Astronomical Line Indices (Yip et al in prep.) (Worthey et al. 94; Trager et al. 98)
Subspace Analysis of Spectra Cutouts:
- Orthogonality
- Divergence
- Commonality

Identify Informative Regions
“NewMethod”
1. Pick the λ with largest $P_\lambda$
2. Define its region of influence using $\sum_\lambda P_\lambda$ convergence. Mask λ’s from future selection.
3. Go back to Step 1, or quit.
“MahoneySecond”
1. Over-select λ’s from the targeted number.
2. Merge selected λ if two pixels lie within a certain distance.
3. Quit.
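The “MahoneySecond” steps might look roughly like the sketch below. It assumes that "over-select" means taking the highest-scoring wavelengths deterministically (the slides may instead sample them at random) and that merging joins neighbours closer than a fixed gap. The function name, `oversample`, and `max_gap` are hypothetical.

```python
def mahoney_second(wavelengths, scores, n_target, oversample=10, max_gap=30.0):
    """Over-select high-score wavelengths, then merge close neighbours into regions."""
    # Step 1: over-select, taking oversample * n_target wavelengths by score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    picked = sorted(wavelengths[i] for i in order[: n_target * oversample])

    # Step 2: merge selected wavelengths lying within max_gap of the previous one.
    regions = []
    for w in picked:
        if regions and w - regions[-1][1] < max_gap:
            regions[-1][1] = w         # extend the current region
        else:
            regions.append([w, w])     # start a new region
    return [tuple(r) for r in regions]
```

Each returned tuple is a (start, end) wavelength interval. A single isolated pixel yields a zero-width region, which a real pipeline would probably pad or discard.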

Identifying New Line Indices, Objectively (Yip et al in prep.)

New Spectral Regions (MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)

NewMethod vs MahoneySecond NM M2

(Gunawan & Neswan 2000)

Angle between Subspaces [Figure panels: JHU; Lick]

$\sum_\lambda P_\lambda$ [Figure panels: JHU; Lick]

Line Indices for Galaxy Parameter Estimations

Importance Sampling and Galaxies
Lick indices are ad hoc
The new indices are objective
- Recover atomic lines
- Recover molecular bands
- Recover Lick indices
- Informative regions are orthogonal to each other, in contrast to Lick
Future
- Emission line indices
- More accurate parameter estimation of galaxies

Summary
Non-incremental changes on the way
Science is moving increasingly from hypothesis-driven to data-driven discoveries
Need randomized, incremental algorithms
- Best result in 1 min, 1 hour, 1 day, 1 week
New computational tools and strategies
… not just statistics, not just computer science, not just astronomy, not just genomics…
Astronomy has always been data-driven… now becoming more generally accepted