How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery
Alexander Gray, Georgia Institute of Technology, College of Computing
Joint work with Gordon Richards (Princeton), Robert Nichol (Portsmouth ICG), Robert Brunner (UIUC/NCSA), Andrew Moore (CMU)

What I do
Often the most general and powerful statistical (or “machine learning”) methods are computationally infeasible. I design machine learning methods and fast algorithms to make such statistical methods possible on massive datasets (without sacrificing accuracy).

Quasar detection
• Science motivation: use quasars to trace the distant/old mass in the universe
• Thus we want lots of sky → SDSS DR1, 2099 square degrees, to g = 21
• Biggest quasar catalog to date: tens of thousands
• Should be ~1.6M z<3 quasars to g=21

Classification
• Traditional approach: look at 2-d color-color plot (UVX method)
  – doesn’t use all available information
  – not particularly accurate (~60% for relatively bright magnitudes)
• Statistical approach: pose as classification.
  1. Training: train a classifier on a large set of known stars and quasars (‘training set’)
  2. Prediction: the classifier will label an unknown set of objects (‘test set’)

Which classifier?
1. Statistical question: Must handle arbitrary nonlinear decision boundaries, noise/overlap
2. Computational question: We have 16,713 quasars from [Schneider et al. 2003] (.08

Which classifier?
• Popular answers:
  – logistic regression: fast but linear only
  – naïve Bayes classifier: fast but quadratic only
  – decision tree: fast but not the most accurate
  – support vector machine: accurate but O(N^3)
  – boosting: accurate but requires thousands of classifiers
  – neural net: reasonable compromise but awkward/human-intensive to train
• The good nonparametric methods are also black boxes – hard/impossible to interpret

Main points of this talk
1. nonparametric Bayes classifier
2. can be made fast (algorithm design)
3. accurate and tractable → science

Optimal decision theory
[Figure: star density and quasar density f(x) plotted against x, with the optimal decision boundary marked]

Bayes’ rule, for Classification
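
The equation on this slide did not survive extraction. As a reference point, the standard textbook form of Bayes' rule for a class C and an observation x is sketched below. The notation f(x|C) and P(C) follows the later slides; the summation index C' is an assumed symbol.

```latex
P(C \mid x) = \frac{f(x \mid C)\, P(C)}{\sum_{C'} f(x \mid C')\, P(C')}
```

Because the denominator is the same for every class, choosing the most probable class only requires comparing P(C) f(x|C) across classes.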

So how do you estimate an arbitrary density?

Kernel Density Estimation (KDE) for example (Gaussian kernel):
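
The formula on this slide is also missing from the transcript. A common way to write a Gaussian-kernel density estimate is given below as a sketch. The symbols N (number of training points), D (dimension), h (smoothing parameter) and x_i (training points) are assumptions, as is the choice of a single isotropic bandwidth.

```latex
\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i),
\qquad
K_h(u) = \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left(-\frac{\lVert u \rVert^2}{2h^2}\right)
```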

Kernel Density Estimation (KDE)
• There is a principled way to choose the optimal smoothing parameter h
• Guaranteed to converge to the true underlying density (consistency)
• Nonparametric – distribution need not be known

Nonparametric Bayes Classifier (NBC) [1951]
• Nonparametric – distribution can be arbitrary
• This is Bayes-optimal, given the right densities
• Very clear interpretation
• Parameter choices are easy to understand, automatable
• There’s a way to enter prior information
Main obstacle:
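
To make the classifier concrete, here is a minimal brute-force sketch in Python: one Gaussian KDE per class, multiplied by the class prior, with the largest product winning. The function names, the toy data and the isotropic Gaussian kernel are illustrative assumptions, not the talk's software.

```python
import math

def gaussian_kde(x, points, h):
    """Brute-force Gaussian kernel density estimate at one point x."""
    dims = len(x)
    norm = (2.0 * math.pi * h * h) ** (dims / 2.0)
    total = 0.0
    for p in points:
        sq = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq / (2.0 * h * h))
    return total / (len(points) * norm)

def nbc_predict(x, classes):
    """classes maps label -> (prior, training points, bandwidth).

    Returns the label with the largest prior * density at x.
    """
    scores = {
        label: prior * gaussian_kde(x, pts, h)
        for label, (prior, pts, h) in classes.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy data standing in for star and quasar colors.
stars = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]
quasars = [(2.0, 2.1), (1.9, 1.8), (2.2, 2.0)]
classes = {"star": (0.5, stars, 0.5), "quasar": (0.5, quasars, 0.5)}

label_near_stars = nbc_predict((0.1, 0.0), classes)
label_near_quasars = nbc_predict((2.0, 2.0), classes)
```

In this naive form every prediction evaluates the kernel against every training point, which is the cost the tree-based algorithms later in the talk aim to avoid.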

Main points of this talk
1. nonparametric Bayes classifier
2. can be made fast (algorithm design)
3. accurate and tractable → science

kd-trees: most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977]
• Univariate axis-aligned splits
• Split on widest dimension
• O(N log N) to build, O(N) space
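
The construction can be sketched in a few lines of Python. This is an illustrative sketch, not the talk's implementation: splitting at the median, the `leaf_size` parameter and the dictionary layout are all assumptions. Sorting at every level also makes this version O(N log² N) rather than the O(N log N) quoted on the slide, which needs a linear-time median selection.

```python
def build_kdtree(points, leaf_size=2):
    """Build a kd-tree with axis-aligned splits on the widest dimension."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi, "count": len(points),
            "points": None, "left": None, "right": None}
    widest = max(dims, key=lambda d: hi[d] - lo[d])
    # Stop at small nodes, or when all points coincide.
    if len(points) <= leaf_size or hi[widest] == lo[widest]:
        node["points"] = list(points)
        return node
    ordered = sorted(points, key=lambda p: p[widest])
    mid = len(ordered) // 2  # median split (an assumption)
    node["left"] = build_kdtree(ordered[:mid], leaf_size)
    node["right"] = build_kdtree(ordered[mid:], leaf_size)
    return node

def count_leaf_points(node):
    """Total number of points stored in the leaves below node."""
    if node["points"] is not None:
        return len(node["points"])
    return count_leaf_points(node["left"]) + count_leaf_points(node["right"])

pts = [(0, 0), (1, 5), (2, 1), (3, 7), (4, 2), (5, 9), (6, 3), (7, 8)]
tree = build_kdtree(pts)
```

Each node keeps its bounding box (`lo`, `hi`) and point count, which is what later bound computations need.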

A kd-tree: levels 1 through 6
[Figures: the partition at each successive level of the tree]

For higher dimensions: ball-trees (computational geometry)

We have a fast algorithm for Kernel Density Estimation (KDE)
• Generalization of N-body algorithms (multipole expansions optional)
• Dual kd-tree traversal: O(N)
• Works in arbitrary dimension
• The fastest method to date [Gray & Moore 2003]

We could just use the KDE algorithm for each class. But:
• for the Gaussian kernel this is approximate
• choosing the smoothing parameter to minimize (cross-validated) classification error is more accurate
But we need a fast algorithm for the Nonparametric Bayes Classifier (NBC)

Leave-one-out cross-validation
Observations:
1. Doing bandwidth selection requires only prediction.
2. To predict class label, we don’t need to compute the full densities. Just which one is higher.
→ We can make a fast exact algorithm for prediction
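
The first observation can be illustrated with a brute-force sketch: scoring a candidate bandwidth only requires predicting the label of each held-out training point. The helper names, the single bandwidth shared by all classes, the priors taken from class fractions and the toy data are all assumptions made for illustration.

```python
import math

def kernel_sum(x, points, h, skip=None):
    """Sum of Gaussian kernels between x and points, skipping one index."""
    total = 0.0
    for i, p in enumerate(points):
        if i == skip:
            continue
        sq = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq / (2.0 * h * h))
    return total

def loo_error(data, h):
    """Leave-one-out classification error for bandwidth h.

    data maps label -> list of points (at least two per class).
    """
    n_total = sum(len(pts) for pts in data.values())
    errors = 0
    for true_label, pts in data.items():
        for i, x in enumerate(pts):
            scores = {}
            for label, train in data.items():
                own = label == true_label
                n = len(train) - (1 if own else 0)
                # prior * density; the Gaussian normaliser is shared by all
                # classes (same h), so it is dropped from the comparison.
                prior = len(train) / n_total
                scores[label] = prior * kernel_sum(x, train, h, i if own else None) / n
            if max(scores, key=scores.get) != true_label:
                errors += 1
    return errors / n_total

def pick_bandwidth(data, grid):
    """Return the grid value with the lowest leave-one-out error."""
    return min(grid, key=lambda h: loo_error(data, h))

# Hypothetical 1-d toy data: many stars, few quasars.
data = {
    "star": [(0.0,), (0.2,), (0.4,), (0.6,), (0.8,), (1.0,)],
    "quasar": [(3.0,), (3.2,)],
}
best_h = pick_bandwidth(data, [100.0, 0.5])
```

With a very large bandwidth both densities flatten out and the prior decides, so the minority class is misclassified; the smaller bandwidth separates the two groups.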

Fast NBC prediction algorithm
1. Build a tree for each class
2. Obtain bounds on P(C) f(x_q | C) for each class
   [Figure: bounds on P(C_1) f(x_q | C_1) and P(C_2) f(x_q | C_2) at the query point x_q]
3. Choose the next node-pair with priority = bound difference
Result: 50-100x speedup, exact
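
The three steps can be sketched for a single query point as follows. This is a simplified illustration, not the talk's algorithm: it handles one query at a time rather than node-pairs, uses the width of a node's bound as its priority, and assumes one bandwidth `h` shared by all classes so the Gaussian normaliser cancels. All function names are hypothetical.

```python
import heapq
import itertools
import math

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def build(points, leaf_size=2):
    """kd-tree node: bounding box, its points, and zero or two children."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"lo": lo, "hi": hi, "points": points, "kids": []}
    widest = max(dims, key=lambda d: hi[d] - lo[d])
    if len(points) > leaf_size and hi[widest] > lo[widest]:
        ordered = sorted(points, key=lambda p: p[widest])
        mid = len(ordered) // 2
        node["kids"] = [build(ordered[:mid], leaf_size), build(ordered[mid:], leaf_size)]
    return node

def node_bounds(x, node, h):
    """Lower and upper bounds on the node's kernel sum at x (exact for leaves)."""
    pts = node["points"]
    if not node["kids"]:
        exact = sum(math.exp(-sq_dist(x, p) / (2 * h * h)) for p in pts)
        return exact, exact
    box = list(zip(x, node["lo"], node["hi"]))
    near = sum(max(l - xi, 0.0, xi - u) ** 2 for xi, l, u in box)
    far = sum(max(abs(xi - l), abs(xi - u)) ** 2 for xi, l, u in box)
    return len(pts) * math.exp(-far / (2 * h * h)), len(pts) * math.exp(-near / (2 * h * h))

def fast_nbc_predict(x, trees, priors, h):
    """Refine per-class bounds until one class provably wins.

    Returns (label, number of nodes expanded).
    """
    weight = {c: priors[c] / len(trees[c]["points"]) for c in trees}
    lower = {c: 0.0 for c in trees}
    upper = {c: 0.0 for c in trees}
    heap, tick = [], itertools.count()

    def push(c, node):
        lo, hi = node_bounds(x, node, h)
        lower[c] += weight[c] * lo
        upper[c] += weight[c] * hi
        if node["kids"]:  # only internal nodes can be refined further
            heapq.heappush(heap, (-weight[c] * (hi - lo), next(tick), c, node, lo, hi))

    for c, root in trees.items():
        push(c, root)
    expanded = 0
    while True:
        best = max(lower, key=lower.get)
        if not heap or all(lower[best] > upper[c] for c in trees if c != best):
            return best, expanded
        _, _, c, node, lo, hi = heapq.heappop(heap)  # loosest bound first
        lower[c] -= weight[c] * lo
        upper[c] -= weight[c] * hi
        for kid in node["kids"]:
            push(c, kid)
        expanded += 1

def naive_nbc_predict(x, data, priors, h):
    """Brute-force reference: full kernel sums for every class."""
    def score(c):
        return priors[c] / len(data[c]) * sum(
            math.exp(-sq_dist(x, p) / (2 * h * h)) for p in data[c])
    return max(data, key=score)

# Hypothetical toy data.
data = {
    "star": [(i * 0.1, (i * 7 % 10) * 0.1) for i in range(20)],
    "quasar": [(5 + i * 0.1, 5 + (i * 3 % 10) * 0.1) for i in range(20)],
}
priors = {"star": 0.5, "quasar": 0.5}
trees = {c: build(pts) for c, pts in data.items()}
label, expanded = fast_nbc_predict((0.5, 0.5), trees, priors, 0.5)
```

For a query deep inside one class, the root bounds alone already separate the classes, so no node is expanded. Queries near the boundary force more refinement but should still end with the same label as the brute-force computation, barring floating-point near-ties.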

Main points of this talk
1. nonparametric Bayes classifier
2. can be made fast (algorithm design)
3. accurate and tractable → science

Resulting quasar catalog
• 100,563 UVX quasar candidates
• Of 22,737 objects w/ spectra, 97.6% are quasars. We estimate 95.0% efficiency overall. (aka “purity”: good/all)
• 94.7% completeness w.r.t. g<19.5 UVX quasars from DR1 (good/all true)
• Largest mag. range ever: 14.2
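
The two quality measures quoted above are simple ratios, sketched below. The function names are assumptions, and the confirmed count is only the approximate figure implied by the slide's 97.6%, not a number taken from the catalog.

```python
def efficiency(confirmed, selected):
    """Fraction of selected candidates that are true quasars (purity: good/all)."""
    return confirmed / selected

def completeness(recovered, all_true):
    """Fraction of known true quasars that were recovered (good/all true)."""
    return recovered / all_true

with_spectra = 22737
# Approximate count implied by "97.6% are quasars"; an inferred value.
confirmed = round(0.976 * with_spectra)
spectro_efficiency = efficiency(confirmed, with_spectra)
```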

Cosmic magnification [Scranton et al. 2005]
13.5M galaxies, 195,000 quasars
Most accurate measurement of cosmic magnification to date [Nature, April 2005]
[Figure labels: more flux, more area]

Next steps (in progress)
• better accuracy via coordinate-dependent priors
• 5 magnitudes
• use simulated quasars to push to higher redshift
• use DR4 higher-quality data
• faster bandwidth search
• 500k quasars easily, then 1M

Bigger picture
• nearest neighbor (1-, k-, all-, approx, clsf) [Gray & Moore 2000], [Miller et al. 2003], etc.
• n-point correlation functions [Gray & Moore 2000], [Moore et al. 2000], [Scranton et al. 2003], [Gray & Moore 2004], [Nichol et al. 2005 in prep.]
• density estimation (nonparametric) [Gray & Moore 2000], [Gray & Moore 2003], [Balogh et al. 2003]
• Bayes classification (nonparametric) [Richards et al. 2004], [Gray et al. 2005 PhyStat]
• nonparametric regression
• clustering: k-means and mixture models, others
• support vector machines, maybe; we’ll see…
[Slide annotations across builds: “fastest algs” / “fastest alg”]

Take-home messages
• Estimating a density? Use kernel density estimation (KDE).
• Classification problem? Consider the nonparametric Bayes classifier (NBC).
• Want to do these on huge datasets? Talk to us, use our software.
• Different computational/statistical problem? Grab me after the talk! agray@cc.gatech.edu
