
1 Computational AstroStatistics
Bob Nichol (Carnegie Mellon)
- Motivation & Goals
- Multi-Resolutional KD-trees (examples)
- Npt functions (application)
- Mixture models (applications)
- Bayes network anomaly detection (application)
- Very high dimensional data
- NVO Problems

2 Collaborators
- Astro: Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto
- Statistics: Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti
- CS: Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg
- SDSS: Alex Szalay, Gordon Richards, Istvan Szapudi & others
Pittsburgh Computational AstroStatistics (PiCA) Group (see http://www.picagroup.org)

3 First Motivation
Cosmology is moving from a "discovery" science into a "statistical" science. Drive for "high precision" measurements:
- Cosmological parameters to a few percent
- Accurate description of the complex structure in the universe
- Control of observational and sampling biases
New statistical tools (e.g. non-parametric analyses) are often computationally intensive. Also, we often want to re-sample or Monte Carlo the data.

4 Second Motivation
The last decade was dedicated to building more telescopes and instruments, with more coming this decade (SDSS, Planck, LSST, 2MASS, DPOSS, MAP), plus ever larger simulations. We have a "Data Flood": SDSS produces terabytes of data a night, while LSST is an SDSS every 5 nights! Petabytes by the end of the 00's. These are highly correlated datasets of high dimensionality, and existing statistics and algorithms do not scale into these regimes. This is a new paradigm where we must build new tools before we can analyze & visualize the data.

5 SDSS

6 SDSS

7 SDSS Data: improvement factors over previous surveys (an overall FACTOR OF 12,000,000 = 3 x 200 x 200 x 10 x 10)
- Area: 10,000 sq deg (factor of 3)
- Objects: 2.5 billion (factor of 200)
- Spectra: 1.5 million (factor of 200)
- Depth: R = 23 (factor of 10)
- Attributes: 144 presently (factor of 10)
SDSS Science: Most Distant Object! 100,000 spectra!

8 Start with tree data structures: multi-resolutional kd-trees.
- Scale to n dimensions (although for very high dimensions use new tree structures)
- Use a cached representation: store summary sufficient statistics at each node and compute counts from these statistics
- Prune the tree, which is stored in memory!
- See Moore et al. 2001 (astro-ph/0012333)
- Many applications; a suite of algorithms!
Goal: to build new, fast & efficient statistical algorithms. (A sketch of such a tree node follows.)
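
A minimal sketch (in Python, rather than the group's C code) of a kd-tree node that caches sufficient statistics. The cached quantities (count, sum of points, bounding box) follow the description above; the class name, leaf size and API are assumptions for illustration.

```python
import numpy as np

class KDNode:
    """kd-tree node caching sufficient statistics for all points below it."""
    def __init__(self, points, leaf_size=32):
        self.count = len(points)          # number of points in this node
        self.sum = points.sum(axis=0)     # running sum -> centroid = sum / count
        self.lo = points.min(axis=0)      # bounding-box lower corner
        self.hi = points.max(axis=0)      # bounding-box upper corner
        self.left = self.right = None
        self.points = None
        if self.count <= leaf_size:       # leaf: keep the raw points
            self.points = points
        else:                             # split on the widest dimension at the median
            dim = int(np.argmax(self.hi - self.lo))
            order = np.argsort(points[:, dim])
            half = self.count // 2
            self.left = KDNode(points[order[:half]], leaf_size)
            self.right = KDNode(points[order[half:]], leaf_size)

# Example: build a tree over 100,000 random points in 3-D
tree = KDNode(np.random.rand(100_000, 3))
print(tree.count, tree.sum / tree.count)  # cached count and centroid of the root
```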

9

10 Range Searches
Fast range searches and catalog matching:
- Prune cells entirely outside the range
- Also prune cells entirely inside the range (their cached counts give the answer), an even greater saving in time
A sketch of both prunes follows.
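
A sketch of those two prunes as a range count, written against the hypothetical KDNode class from the kd-tree sketch above (an illustration of the idea, not the actual PiCA code):

```python
import numpy as np

def range_count(node, centre, r):
    """Count points within distance r of `centre`, pruning on the node's bounding box."""
    # Closest and farthest possible distances from `centre` to the node's box.
    nearest = np.maximum(node.lo - centre, np.maximum(centre - node.hi, 0.0))
    d_min = np.linalg.norm(nearest)
    farthest = np.maximum(np.abs(centre - node.lo), np.abs(centre - node.hi))
    d_max = np.linalg.norm(farthest)

    if d_min > r:                  # prune: the box lies entirely outside the range
        return 0
    if d_max <= r:                 # prune: the box lies entirely inside -- use the cached count
        return node.count
    if node.points is not None:    # a leaf cutting the boundary: test its points directly
        return int(np.sum(np.linalg.norm(node.points - centre, axis=1) <= r))
    return range_count(node.left, centre, r) + range_count(node.right, centre, r)

# e.g. range_count(tree, np.array([0.5, 0.5, 0.5]), 0.1)
```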

11 N-point correlation functions
The 2-point function has a long history in cosmology (Peebles 1980), and also a long history (as point processes) in statistics. It is the excess joint probability of finding a pair of points over that expected from a Poisson process; for mean density n,
dP = n^2 [1 + xi(r)] dV1 dV2.
Similarly, the three-point function zeta is defined through the excess probability of a triplet (and so on for higher orders):
dP = n^3 [1 + xi(r12) + xi(r23) + xi(r31) + zeta(r12, r23, r31)] dV1 dV2 dV3.

12 Same 2pt, very different 3pt.
Naively, measuring the N-point function over n data points is an O(n^N) process, but really it is just a set of range searches.
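
As a concrete (if simplified) illustration, here is a 2-point estimate built from pair counts, using scipy's cKDTree.count_neighbors for the range searches. The Landy-Szalay estimator and the toy catalogues are choices made for this sketch; the slides do not specify which estimator or code was used.

```python
import numpy as np
from scipy.spatial import cKDTree

def two_point(data, rand, r_edges):
    """Landy-Szalay estimate of xi(r) from a data and a random catalogue."""
    dtree, rtree = cKDTree(data), cKDTree(rand)
    nd, nr = len(data), len(rand)
    # count_neighbors returns cumulative pair counts within each radius;
    # differencing converts them into counts per annulus.
    dd = np.diff(dtree.count_neighbors(dtree, r_edges)) / (nd * (nd - 1))
    rr = np.diff(rtree.count_neighbors(rtree, r_edges)) / (nr * (nr - 1))
    dr = np.diff(dtree.count_neighbors(rtree, r_edges)) / (nd * nr)
    return (dd - 2 * dr + rr) / rr     # Landy-Szalay: (DD - 2 DR + RR) / RR

r_edges = np.logspace(-2, -0.5, 11)    # annulus edges
data = np.random.rand(5000, 3)         # toy "galaxy" catalogue in a unit box
rand = np.random.rand(20000, 3)        # toy random catalogue
print(two_point(data, rand, r_edges))  # roughly zero for an unclustered catalogue
```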

13 Dual Tree Approach
Pair counts are usually binned into annuli [r_min, r_max]. For a pair of nodes, if d_min > r_min & d_max < r_max, then all pairs between these nodes lie within the annulus and can be counted at once from the cached counts (and if the node pair lies entirely outside the annulus it is dropped). Therefore we only need to explicitly calculate pairs cutting the boundaries. Extra speed-ups are possible by doing multiple r's together and by controlled approximations.
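
A sketch of that dual-tree recursion for a single annulus, again written against the hypothetical KDNode class from the earlier sketch (the node-splitting heuristic here is one simple choice, not necessarily the one used by the PiCA code):

```python
import numpy as np

def node_distance_bounds(a, b):
    """Minimum and maximum possible distances between points in boxes a and b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)

def dual_tree_count(a, b, r_min, r_max):
    """Count (ordered) pairs, one point from a and one from b, with separation in [r_min, r_max]."""
    d_min, d_max = node_distance_bounds(a, b)
    if d_min > r_max or d_max < r_min:
        return 0                               # prune: every pair falls outside the annulus
    if d_min >= r_min and d_max <= r_max:
        return a.count * b.count               # prune: every pair falls inside -- cached counts only
    if a.points is not None and b.points is not None:
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=-1)
        return int(np.sum((d >= r_min) & (d <= r_max)))
    # Otherwise open up the larger (non-leaf) node and recurse on its children.
    if b.points is not None or (a.points is None and a.count >= b.count):
        return (dual_tree_count(a.left, b, r_min, r_max)
                + dual_tree_count(a.right, b, r_min, r_max))
    return (dual_tree_count(a, b.left, r_min, r_max)
            + dual_tree_count(a, b.right, r_min, r_max))

# e.g. dual_tree_count(tree, tree, 0.05, 0.10)  # self-count: each unordered pair is counted twice
```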

14 [Timing plots, with reference curves labelled N*N, N log N and N*N*N] Run time depends on the density of points, the bin size & the scale.

15 Fast Mixture Models
- Describe the data in N dimensions as a mixture of, say, Gaussians (the kernel shape is less important than the bandwidth!)
- The parameters of the model are then N Gaussians, each with a mean and covariance
- Iterate, testing with BIC and AIC at each iteration
- Fast because of kd-trees (20 mins for 100,000 points on a PC!)
- A heuristic splitting algorithm is also employed
- Details in Connolly et al. 2000 (astro-ph/0008187)
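
The fast implementation is the kd-tree-accelerated EM of Connolly et al. 2000; as a stand-in, here is a short sketch of the same model-selection loop using scikit-learn's GaussianMixture and its BIC score (the data array and the range of component counts are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(10_000, 4)        # placeholder for an N-dimensional catalogue

best, best_bic = None, np.inf
for k in range(1, 21):               # iterate over the number of Gaussians
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=0).fit(X)
    bic = gmm.bic(X)                 # gmm.aic(X) can be tracked in the same way
    if bic < best_bic:               # keep the model favoured by BIC
        best, best_bic = gmm, bic

print(best.n_components)                             # chosen number of components
print(best.means_.shape, best.covariances_.shape)    # one mean and covariance per Gaussian
```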

16 EM-Based Gaussian Mixture Clustering: 1

17 EM-Based Gaussian Mixture Clustering: 2

18 EM-Based Gaussian Mixture Clustering: 4

19 EM-Based Gaussian Mixture Clustering: 20

20

21 Applications
- SDSS quasar selection (used to map the multi-color stellar locus): Gordon Richards @ PSU
- Anomaly detector (look for low-probability points in N dimensions)
- Optimal smoothing of large-scale structure

22 SDSS QSO target selection in 4D color-space
- Cluster 9999 spectroscopically confirmed stars
- Cluster 8833 spectroscopically confirmed QSOs (33 Gaussians)
- 99% for stars, 96% for QSOs
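
A hedged sketch of how such mixtures could drive the selection: fit one mixture to the confirmed stars and one to the confirmed QSOs, then target objects whose colours are more likely under the QSO mixture. The colour arrays, the number of stellar components and the decision rule are illustrative assumptions, not the actual SDSS targeting code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder 4-D colour arrays (e.g. u-g, g-r, r-i, i-z) for the training sets.
star_colours = np.random.rand(9999, 4)
qso_colours = np.random.rand(8833, 4)

stars = GaussianMixture(n_components=25, random_state=0).fit(star_colours)  # component count assumed
qsos = GaussianMixture(n_components=33, random_state=0).fit(qso_colours)    # 33 Gaussians, as on the slide

def select_qsos(colours):
    """Boolean mask of objects more likely under the QSO mixture than the stellar one."""
    return qsos.score_samples(colours) > stars.score_samples(colours)

candidates = np.random.rand(1000, 4)   # new photometric objects to classify
print(select_qsos(candidates).sum(), "QSO candidates")
```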

23

24 Bayes Net Anomaly Detector
- Instead of using a single joint probability function (fitted to the data), factorize it into a smaller set of conditional probabilities
- The graph is directed and acyclic
- If we know the graph and the conditional probabilities, we have a valid probability function for the whole model

25 Use 1.5 million SDSS sources to learn the model (25 variables each). Then evaluate the likelihood of each object being drawn from the model. The lowest 1000 are flagged as anomalous: look at 'em and follow 'em up at Keck.
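
A toy sketch of the idea: factorize the joint probability into conditional probability tables along a fixed directed acyclic graph, score every object by its log-likelihood under the factorized model, and flag the lowest scorers. The discretized variables, the graph and the table-learning code below are all invented for illustration; the real detector used 25 SDSS attributes and a learned network.

```python
import numpy as np
import pandas as pd

# Toy catalogue of discretized attributes (hypothetical stand-ins for SDSS variables).
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    'colour_bin': rng.integers(0, 4, n),
    'size_bin':   rng.integers(0, 3, n),
    'mag_bin':    rng.integers(0, 5, n),
})

# Assumed DAG for this sketch: colour_bin -> size_bin -> mag_bin,
# so P(c, s, m) = P(c) * P(s | c) * P(m | s).
def cond_table(child, parent=None, alpha=1.0):
    """Laplace-smoothed (conditional) probability table learned by counting."""
    if parent is None:
        counts = df[child].value_counts().sort_index() + alpha
        return (counts / counts.sum()).to_dict()
    counts = df.groupby([parent, child]).size().unstack(fill_value=0) + alpha
    return counts.div(counts.sum(axis=1), axis=0)   # rows: parent value, columns: child value

p_c = cond_table('colour_bin')
p_s_given_c = cond_table('size_bin', 'colour_bin')
p_m_given_s = cond_table('mag_bin', 'size_bin')

def log_likelihood(row):
    return (np.log(p_c[row.colour_bin])
            + np.log(p_s_given_c.loc[row.colour_bin, row.size_bin])
            + np.log(p_m_given_s.loc[row.size_bin, row.mag_bin]))

df['loglike'] = df.apply(log_likelihood, axis=1)
print(df.nsmallest(10, 'loglike'))   # the lowest-likelihood objects -> follow-up candidates
```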

26 Unfortunately, many of the anomalies turn out to be data errors. The advantage of the Bayes net is that it tells you why an object was anomalous: which conditional probabilities were the most unusual. Therefore, iterate the loop and get a scientist to highlight the obvious errors, then suppress those errors so they do not return again. This is an issue of productivity!

27 Will Only Get Worse
- LSST will do an SDSS every 5 nights looking for transient objects, producing petabytes of data (2007)
- VISTA will collect 300 terabytes of data (2005)
- Archival science is upon us! The HST database has 20 GBytes per day downloaded (10 times more than goes in!)

28 Will Only Get Worse II
- Surveys spanning the electromagnetic spectrum
- Combining these surveys is hard: different sensitivities, resolutions and physics
- A mixture of imaging, catalogs and spectra
- Differences between continuum and point processes
- Thousands of attributes per source

29 What is the VO? The "Virtual Observatory" must:
- Federate multi-wavelength data sources (interoperability)
- Empower everyone (democratise)
- Be fast, distributed and easy
- Allow input and output

30 Computer Science + Statistics!
- Scientists will need help through autonomous scientific discovery in large, multi-dimensional, correlated datasets
- Scientists will need fast databases
- Scientists will need distributed computing and fast networks
- Scientists will need new visualization tools
CS and Statistics are looking for new challenges, and astronomy has no data-rights & privacy issues. A new breed of students with IT skills is needed. A symbiotic relationship.

31 VO Prototype
Ideally we would like all parts of the VO to be web services. [Architecture diagram: database, C# / .NET components and the EM code communicating over HTTP]

32 Lessons We Learnt
- Tough to marry research C code developed under Linux to MS (pointers to memory); .NET has "unsafe" memory
- A .NET server is hard to set up!
- Migrate to using VOTables to perform all I/O
- Have the server running at CMU so we can control the code

33 Very High Dimensions
Using LLE and Isomap to look for lower-dimensional manifolds embedded in higher-dimensional spaces, e.g. a 500x2000 space from SDSS spectra.
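
A sketch of this step using scikit-learn's Isomap and LocallyLinearEmbedding; the random array stands in for the spectra (read as roughly 500 spectra by 2000 wavelength bins), and the neighbourhood size and target dimension are assumptions:

```python
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

spectra = np.random.rand(500, 2000)    # placeholder: 500 spectra x 2000 wavelength bins

iso = Isomap(n_neighbors=10, n_components=3)                  # look for a ~3-D manifold
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)

embedded_iso = iso.fit_transform(spectra)   # (500, 3) embedding coordinates
embedded_lle = lle.fit_transform(spectra)
print(embedded_iso.shape, embedded_lle.shape)
```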

34 Summary
- Era of New Cosmology: massive data sources and the search for subtle features & high-precision measurements
- We need new methods that scale into these new regimes ("a virtual universe"); students will need different skills. Perfect synergy between Stats, CS and Physics
- Good algorithms are worth as much as faster and more numerous computers!
- The "glue" needed to make a "virtual observatory" is hard and complex. Don't under-estimate the job

35 Are the Features Real? (FDR)
This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth P(k)?

36 Let us first look at a simulated example: consider a 1000x1000 image containing 40,000 sources.
Method: true detections / false detections / missed sources / true non-detections
- FDR: 30,389 / 1,505 / 9,611 / 958,495
- 2 sigma: 31,497 / 22,728 / 8,503 / 937,272
- Bonferroni: 27,137 / 0 / 12,863 / 960,000
FDR makes ~15 times fewer false discoveries than the traditional 2-sigma threshold for essentially the same power. Why? It controls a scientifically meaningful quantity: FDR = no. of false discoveries / total no. of discoveries.
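
The FDR threshold can be implemented with the standard Benjamini-Hochberg step-up procedure (the slides do not show the code actually used): sort the p-values, find the largest rank k with p_(k) <= k * alpha / N, and reject everything up to that rank. A sketch with a toy version of the simulation above:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.25):
    """Boolean mask of rejected (i.e. 'discovered') hypotheses at FDR level alpha."""
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, n + 1) / n      # k * alpha / N for each rank k
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest rank satisfying the bound
        reject[order[:k + 1]] = True                  # reject every p-value up to that rank
    return reject

# Toy version of the simulation: 1,000,000 pixels, 40,000 of which hold real sources.
rng = np.random.default_rng(1)
p_null = rng.uniform(size=960_000)                    # noise-only p-values
p_signal = rng.uniform(size=40_000) * 1e-4            # sources give very small p-values
discoveries = benjamini_hochberg(np.concatenate([p_null, p_signal]), alpha=0.25)
print(discoveries.sum(), "discoveries")
```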

37 And it is adaptive to the size of the dataset

38 We used an FDR of 0.25, i.e. at most ~25% of the circled points are expected to be in error. Therefore, we can say with statistical rigor that most of these points are rejected and are thus "features", even though no single point is a 3-sigma deviation. New statistics has enabled an astronomical discovery.

