Computational AstroStatistics. Bob Nichol (Carnegie Mellon). Motivation & Goals; Multi-Resolutional kd-trees (examples); N-point functions (application); Mixture models (applications); Bayes network anomaly detection (application); Very high dimensional data; NVO Problems

Collaborators: Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro); Larry Wasserman, Chris Genovese, Woncheol Jang, Pierpaolo Brutti (Statistics); Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS); Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS). The Pittsburgh Computational AstroStatistics (PiCA) Group (see the PiCA group website).

First Motivation. Cosmology is moving from a "discovery" science to a "statistical" science. There is a drive for "high precision" measurements: cosmological parameters to a few percent; an accurate description of the complex structure in the universe; control of observational and sampling biases. The new statistical tools, e.g. non-parametric analyses, are often computationally intensive. Also, we often want to re-sample or Monte Carlo the data.

Second Motivation. The last decade was dedicated to building more telescopes and instruments, and more are coming this decade (SDSS, Planck, LSST, 2MASS, DPOSS, MAP), along with larger simulations. We have a "Data Flood": SDSS delivers terabytes of data a night, while LSST is an SDSS every 5 nights! Petabytes by the end of the 00's. The datasets are highly correlated and high dimensional, and existing statistics and algorithms do not scale into these regimes. It is a new paradigm, in which we must build new tools before we can analyze & visualize the data.

SDSS

SDSS Data: a factor of 12,000,000 improvement overall (the product of the factors below).
Area: 10,000 sq deg (factor of 3)
Objects: 2.5 billion (factor of 200)
Spectra: 1.5 million (factor of 200)
Depth: R = 23 (factor of 10)
Attributes: 144 presently (factor of 10)
SDSS Science: the most distant object! 100,000 spectra!

Goal: to build new, fast & efficient statistical algorithms. Start with tree data structures: multi-resolutional kd-trees. These scale to n dimensions (although for very high dimensions, use newer tree structures). Use a cached representation: store at each node summary sufficient statistics, and compute counts from those statistics. Prune the tree, which is stored in memory! See Moore et al (astro-ph/ ). Many applications; a suite of algorithms!

Range Searches. Fast range searches and catalog matching. Prune cells entirely outside the range; also prune cells entirely inside, counting them at once from the cached statistics. That second prune gives the greater saving in time (see the sketch below).
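To make the cached representation and the two-way pruning concrete, here is a minimal Python sketch (not the PiCA/Auton Lab code; the leaf size and the demo data are arbitrary choices):

```python
import numpy as np

class KDNode:
    """kd-tree node with a cached representation: each node stores a
    bounding box and a point count, so whole subtrees can be counted
    without visiting their points."""
    def __init__(self, points, leaf_size=32):
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        self.count = len(points)                                   # cached statistic
        if self.count <= leaf_size:
            self.points, self.left, self.right = points, None, None
        else:
            d = np.argmax(self.hi - self.lo)        # split the widest dimension
            order = np.argsort(points[:, d])
            mid = self.count // 2
            self.points = None
            self.left = KDNode(points[order[:mid]], leaf_size)
            self.right = KDNode(points[order[mid:]], leaf_size)

def range_count(node, q, r):
    """Count points within distance r of query q, pruning both ways."""
    # min/max distance from q to the node's bounding box
    dmin = np.linalg.norm(np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0))
    dmax = np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))
    if dmin > r:
        return 0             # prune: node entirely outside the range
    if dmax <= r:
        return node.count    # prune: node entirely inside, use cached count
    if node.points is not None:   # leaf that cuts the boundary: test points
        return int((np.linalg.norm(node.points - q, axis=1) <= r).sum())
    return range_count(node.left, q, r) + range_count(node.right, q, r)

pts = np.random.rand(100_000, 3)
tree = KDNode(pts)
print(range_count(tree, np.array([0.5, 0.5, 0.5]), 0.1))
```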

N-point correlation functions. The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process:

dP_{12} = \bar{n}^2 \left[ 1 + \xi(r) \right] dV_1 \, dV_2

There is also a long history (as point processes) in statistics. Similarly, the three-point function \zeta is defined via

dP_{123} = \bar{n}^3 \left[ 1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31}) + \zeta(r_{12}, r_{23}, r_{31}) \right] dV_1 \, dV_2 \, dV_3

and so on for higher orders.

Same 2-point function, very different 3-point function. Naively, the N-point function over n points is an O(n^N) computation, but all it really is, is a set of range searches.

Dual-Tree Approach. Pairs are usually binned into annuli [r_min, r_max]. If the minimum and maximum distances between two nodes satisfy d_min > r_min and d_max < r_max, then all pairs between these nodes lie within the annulus and can be counted at once from the cached statistics. Therefore, we only need to explicitly calculate pairs for node pairs that cut the annulus boundaries. Extra speed-ups are possible by doing multiple r's together and by controlled approximations (see the sketch below).
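scipy's cKDTree exposes a dual-tree pair count of exactly this kind (count_neighbors walks two trees at once, pruning node pairs whose distance bounds fall entirely inside or outside a radius). A hedged sketch of 2-point counts and a Landy-Szalay estimate, with uniform random boxes standing in for real survey catalogs:

```python
import numpy as np
from scipy.spatial import cKDTree

# Placeholder data (D) and random (R) catalogs; real analyses would use
# survey positions and matching randoms.
rng = np.random.default_rng(0)
data = rng.random((50_000, 3))
rand = rng.random((50_000, 3))

dtree, rtree = cKDTree(data), cKDTree(rand)

# Radial bin edges; differencing the cumulative dual-tree counts gives
# pair counts per annulus [r_i, r_{i+1}].
edges = np.logspace(-2, -0.7, 11)
dd = np.diff(dtree.count_neighbors(dtree, edges))   # DD pairs per annulus
dr = np.diff(dtree.count_neighbors(rtree, edges))   # DR pairs per annulus
rr = np.diff(rtree.count_neighbors(rtree, edges))   # RR pairs per annulus

# Normalize the ordered-pair counts, then form the Landy-Szalay estimator.
nd, nr = len(data), len(rand)
DD = dd / (nd * (nd - 1))
DR = dr / (nd * nr)
RR = rr / (nr * (nr - 1))
xi = (DD - 2 * DR + RR) / RR
print(xi)   # scatters around zero for a uniform box, as expected
```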

[Timing figure: run time depends on the density of points, the bin size and the scale; reference curves show N*N, N log N and N*N*N scalings.]

Fast Mixture Models. Describe the data in N dimensions as a mixture of, say, Gaussians (the kernel shape matters less than the bandwidth!). The parameters of the model are then a set of Gaussians, each with a mean and a covariance. Iterate, testing with BIC and AIC at each iteration. Fast because of kd-trees (20 minutes for 100,000 points on a PC!). A heuristic splitting algorithm is also employed. Details in Connolly et al (astro-ph/ ).
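The kd-tree-accelerated EM itself is not reproduced here, but the iterate-and-test-with-BIC/AIC loop can be sketched with scikit-learn; the data and the range of component counts are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for an N-dimensional catalog (e.g. colors per object)
X = np.random.rand(10_000, 4)

# Grow the number of Gaussians, keeping the model with the best
# (lowest) BIC, mirroring the slide's iterate-and-test scheme.
best, best_bic = None, np.inf
for k in range(1, 21):
    gm = GaussianMixture(n_components=k, covariance_type='full',
                         random_state=0).fit(X)
    bic = gm.bic(X)
    print(f"k={k:2d}  BIC={bic:12.1f}  AIC={gm.aic(X):12.1f}")
    if bic < best_bic:
        best, best_bic = gm, bic

print("chosen number of components:", best.n_components)
```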

EM-Based Gaussian Mixture Clustering: 1

EM-Based Gaussian Mixture Clustering: 2

EM-Based Gaussian Mixture Clustering: 4

EM-Based Gaussian Mixture Clustering: 20

Applications. Used in SDSS quasar selection, to map the multi-color stellar locus (Gordon Richards, PSU). An anomaly detector (look for low-probability points in N dimensions). Optimal smoothing of large-scale structure.

SDSS QSO target selection in 4D color-space. Cluster 9,999 spectroscopically confirmed stars and 8,833 spectroscopically confirmed QSOs (33 Gaussians). Classification accuracy: 99% for stars, 96% for QSOs (see the sketch below).
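A hedged sketch of how such mixtures can drive target selection: fit one density model per class and classify by posterior probability. The training sets, component counts and prior below are illustrative placeholders, not the actual SDSS pipeline:

```python
import numpy as np
from scipy.special import expit
from sklearn.mixture import GaussianMixture

# Placeholder 4D color-space training sets standing in for the
# spectroscopically confirmed stars and QSOs on the slide.
rng = np.random.default_rng(0)
stars = rng.standard_normal((9_999, 4))
qsos = rng.standard_normal((8_833, 4)) + 2.0

gm_star = GaussianMixture(n_components=33, random_state=0).fit(stars)
gm_qso = GaussianMixture(n_components=33, random_state=0).fit(qsos)

def p_qso(x, prior_qso=0.5):
    """Posterior probability that each source is a QSO, combining the
    two class-conditional densities with a class prior."""
    lq = gm_qso.score_samples(x) + np.log(prior_qso)      # log p(x|QSO)p(QSO)
    ls = gm_star.score_samples(x) + np.log(1 - prior_qso)  # log p(x|star)p(star)
    return expit(lq - ls)

candidates = rng.standard_normal((1_000, 4)) + 1.0
targets = candidates[p_qso(candidates) > 0.5]   # select QSO targets
print(len(targets), "QSO candidates selected")
```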

Bayes Net Anomaly Detector. Instead of using a single joint probability function (fitted to the data), factorize it into a smaller set of conditional probabilities. The graph is directed and acyclic. If we know the graph and the conditional probabilities, we have a valid probability function for the whole model.

Use 1.5 million SDSS sources to learn the model (25 variables each). Then evaluate the likelihood of each datum being drawn from the model. The lowest 1,000 are anomalous; look at them and follow them up at Keck (a toy sketch of the scoring step follows).
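A toy illustration of the scoring step, assuming a hand-built three-variable chain network in place of the learned 25-variable model; variables, tables and counts are all placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder discretized attributes with a chain-structured net:
# P(a, b, c) = P(a) P(b | a) P(c | b)
a = rng.integers(0, 4, 100_000)
b = (a + rng.integers(0, 2, a.size)) % 4
c = (b + rng.integers(0, 2, b.size)) % 4

def cpt(child, parent, k=4, alpha=1.0):
    """Conditional probability table P(child | parent), with smoothing."""
    counts = np.zeros((k, k))
    np.add.at(counts, (parent, child), 1)   # tally (parent, child) pairs
    counts += alpha                          # Laplace smoothing
    return counts / counts.sum(axis=1, keepdims=True)

p_a = np.bincount(a, minlength=4) / a.size
p_b_a, p_c_b = cpt(b, a), cpt(c, b)

# Log-likelihood of each record under the factorized model; the
# detector on the slide flags the lowest-scoring 1,000 as anomalies.
loglik = np.log(p_a[a]) + np.log(p_b_a[a, b]) + np.log(p_c_b[b, c])
anomalies = np.argsort(loglik)[:1000]
print(loglik[anomalies[:5]])
```

Because the score is a sum of per-variable log conditionals, the same decomposition tells you which conditional was most unusual, which is the "why" discussed on the next slide.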

Unfortunately, there is a lot of error. The advantage of a Bayes net is that it tells you why a source was anomalous: the most unusual conditional probabilities. Therefore, iterate the loop and get a scientist to highlight the obvious errors, then suppress those errors so they do not return again. This is an issue of productivity!

Will Only Get Worse. LSST will do an SDSS every 5 nights looking for transient objects, producing petabytes of data (2007). VISTA will collect 300 terabytes of data (2005). Archival science is upon us! The HST database has 20 GBytes per day downloaded (10 times more than goes in!).

Will Only Get Worse II. Surveys spanning the electromagnetic spectrum. Combining these surveys is hard: different sensitivities, resolutions and physics. A mixture of imaging, catalogs and spectra; the difference between continuum and point processes. Thousands of attributes per source.

What is the VO? The "Virtual Observatory" must: federate multi-wavelength data sources (interoperability); empower everyone (democratise); be fast, distributed and easy; and allow both input and output.

Computer Science + Statistics! Scientists will need help through autonomous scientific discovery in large, multi-dimensional, correlated datasets. Scientists will need fast databases, distributed computing and fast networks, and new visualization tools. CS and Statistics are looking for new challenges, and astronomy has no data-rights & privacy issues. A new breed of students with IT skills is needed. A symbiotic relationship.

VO Prototype. Ideally, we would like all parts of the VO to be web services. [Architecture diagram: the database and the EM mixture-model code wrapped as C#/.NET web services communicating over HTTP.]

Lessons We Learnt. It is tough to marry research C code developed under Linux to Microsoft tools (pointers to memory); .NET has "unsafe" memory, and a .NET server is hard to set up! We migrated to using VOTables to perform all I/O, and we have the server running at CMU so we can control the code.

Very High Dimensions. Using LLE and Isomap to look for lower-dimensional manifolds in higher-dimensional spaces, e.g. a 500 x 2000 data matrix from SDSS spectra (see the sketch below).
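A minimal scikit-learn sketch of both embeddings; random data stands in for the 500 x 2000 matrix of SDSS spectra, and the neighbor count and output dimension are arbitrary choices:

```python
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# Placeholder: 500 "spectra", each with 2000 wavelength bins.
spectra = np.random.rand(500, 2000)

# Both methods build a k-nearest-neighbor graph in the 2000-dimensional
# space and unroll it into a low-dimensional manifold coordinate system.
iso = Isomap(n_neighbors=10, n_components=2).fit_transform(spectra)
lle = LocallyLinearEmbedding(n_neighbors=10,
                             n_components=2).fit_transform(spectra)
print(iso.shape, lle.shape)   # (500, 2) each: recovered manifold coords
```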

Summary. The era of the New Cosmology: massive data sources and the search for subtle features & high-precision measurements. We need new methods that scale into these new regimes; "a virtual universe" (students will need different skills). A perfect synergy between Statistics, CS and Physics. Good algorithms are worth as much as faster and more computers! The "glue" to make a "virtual observatory" is hard and complex; don't under-estimate the job.

Are the Features Real? (FDR!) This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth P(k)?

Let us first look at a simulated example: consider a 1000x1000 image with sources. [Figure: detections under FDR, 2-sigma and Bonferroni thresholds.] FDR makes 15 times fewer mistakes for the same power as a traditional 2-sigma cut. Why? Because it controls a scientifically meaningful quantity: FDR = (number of false discoveries) / (total number of discoveries). A sketch of the procedure follows.
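A short sketch of the Benjamini-Hochberg step-up procedure that controls the FDR; the p-values and the q = 0.25 level below are illustrative placeholders, not the talk's actual data:

```python
import numpy as np

def bh_threshold(pvals, q=0.25):
    """Benjamini-Hochberg step-up: the largest p-value threshold such
    that the expected false discovery rate is at most q."""
    p = np.sort(pvals)
    m = p.size
    below = p <= q * np.arange(1, m + 1) / m   # BH step-up condition
    if not below.any():
        return 0.0                             # nothing is discovered
    return p[np.nonzero(below)[0].max()]

# Placeholder example: 10,000 null pixels plus 100 genuine sources.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=10_000),
                        rng.uniform(0.0, 1e-4, size=100)])
t = bh_threshold(pvals, q=0.25)
print(f"threshold={t:.2e}, {(pvals <= t).sum()} discoveries")
```

Because the threshold is set by the sorted p-values themselves rather than by a fixed sigma cut, the procedure automatically adapts to the size and signal content of the dataset.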

And it is adaptive to the size of the dataset

We used an FDR of 0.25, i.e. at most 25% of the circled points are in error. Therefore, we can say with statistical rigor that most of these points are rejected by the smooth model and are thus "features", even though no single point is a 3-sigma deviation. New statistics has enabled an astronomical discovery.