1 Random Forest Photometric Redshift Estimation
Samuel Carliles (1), Tamás Budavári (2), Sébastien Heinis (2), Carey Priebe (3), Alex Szalay (2)
Johns Hopkins University: (1) Dept. of Computer Science, (2) Dept. of Physics & Astronomy, (3) Dept. of Applied Mathematics & Statistics

2 Photometric Redshifts
- You know what they are
- I did it on SDSS DR6 colors
- z_spec = f(u-g, g-r, r-i, i-z)
- ẑ_phot = f(u-g, g-r, r-i, i-z)
- ε = ẑ_phot - z_spec
- I did it with Random Forests
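Not from the slides: a minimal R sketch of the setup above. The data frame dr6 and its columns (model magnitudes u, g, r, i, z plus spectroscopic redshift zspec) are hypothetical names.

# hypothetical SDSS DR6 frame with magnitudes u, g, r, i, z and zspec
colors <- data.frame(ug = dr6$u - dr6$g,
                     gr = dr6$g - dr6$r,
                     ri = dr6$r - dr6$i,
                     iz = dr6$i - dr6$z,
                     zspec = dr6$zspec)
# once an estimate zphot exists, the per-object error is
# eps <- zphot - colors$zspec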

3 Regression Trees
- A binary tree
- It partitions input training data into clusters of similar objects
- Each new test object is matched with the cluster to which it is “closest” in the input space
- The output value is the mean of the output values of the training objects in its cluster
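A sketch of the prediction step in R: walk a fitted tree to the test object's leaf and return that leaf's stored mean. The node fields (leaf, value, j, t, left, right) are the ones used in the build sketch after slide 4 below.

predict_tree <- function(node, xnew) {
  # leaf: return the mean output of the training objects that landed here
  if (node$leaf) return(node$value)
  # internal node: follow the same split rule that was used in training
  if (xnew[node$j] < node$t) predict_tree(node$left, xnew)
  else predict_tree(node$right, xnew)
}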

4 Building a Regression Tree
- Starting at the root node, choose a dimension on which to split
- Choose the point which “best” distinguishes clusters in that dimension
- Points to the left go in the left child, points to the right in the right child
- Repeat the process in each child node until every object is in its own leaf node
[Diagram: successive splits on x1, x2, x3]
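One way the recursion could look in R (a sketch, not the authors' code). best_split() is the helper sketched after slide 6; minsize is an added stopping parameter (the slides grow to single-object leaves, i.e. minsize = 1).

build_tree <- function(X, y, minsize = 1) {
  # small enough: make a leaf predicting the mean of the outputs
  if (nrow(X) <= minsize) return(list(leaf = TRUE, value = mean(y)))
  s <- best_split(X, y)                       # dimension s$j, split point s$t
  if (is.na(s$j)) return(list(leaf = TRUE, value = mean(y)))  # nothing to split on
  left <- X[, s$j] < s$t
  list(leaf = FALSE, j = s$j, t = s$t,
       left  = build_tree(X[left, , drop = FALSE],  y[left],  minsize),
       right = build_tree(X[!left, , drop = FALSE], y[!left], minsize))
}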

5 How Do You Choose the Dimension and Split Point?
- The best split point in a dimension is the one which minimizes the resubstitution error in that dimension
- The best dimension is the one whose best split point gives the lowest resubstitution error

6 What’s Resubstitution Error?
- For a candidate split point, there are points left and points right
- ε = Σ_L (x − x̄_L)² / N_L + Σ_R (x − x̄_R)² / N_R
- That’s the resubstitution error
- Minimize it
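A direct transcription in R (a sketch; the slide writes the sums in terms of x, read here as the output values whose leaf means the tree predicts). Candidate split points are taken as midpoints between consecutive sorted values, an assumption the slides don't spell out.

resub_err <- function(xj, y, t) {
  left <- xj < t
  yl <- y[left]; yr <- y[!left]
  sum((yl - mean(yl))^2) / length(yl) + sum((yr - mean(yr))^2) / length(yr)
}

best_split <- function(X, y, dims = seq_len(ncol(X))) {
  best <- list(err = Inf, j = NA, t = NA)
  for (j in dims) {
    v <- sort(unique(X[, j]))
    if (length(v) < 2) next                       # constant dimension: skip
    for (t in (head(v, -1) + tail(v, -1)) / 2) {  # midpoints as candidates
      e <- resub_err(X[, j], y, t)
      if (e < best$err) best <- list(err = e, j = j, t = t)
    }
  }
  best
}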

7 Randomizing a Regression Tree
- Train it on a bootstrap sample
- This is a sample of N objects chosen uniformly at random with replacement from the complete training set
- Instead of choosing the best dimension to split on, choose the best from among a random subset of the input dimensions
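Sketched on top of the tree code above. mtry, the number of dimensions tried per split, is a conventional name and default, not from the slides.

N <- nrow(X)
boot <- sample(N, N, replace = TRUE)          # uniform, with replacement
Xb <- X[boot, , drop = FALSE]
yb <- y[boot]
# at each node, search only a random subset of the input dimensions, e.g.
mtry <- max(1, floor(ncol(X) / 3))
s <- best_split(Xb, yb, dims = sample(ncol(X), mtry))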

8 Random Forest
- An ensemble of “randomized” Regression Trees
- The ensemble estimate is the mean of the individual tree estimates
- This gives a distribution of iid estimation errors
- The Central Limit Theorem gives the distribution of their mean
- Their mean is exactly z_phot - z_spec
- That means we have the error distribution for that object!
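A sketch of both points, assuming a list of fitted trees (forest, a hypothetical name) and the predict_tree() traversal above; the error scale for the mean follows from the slide's iid assumption via the CLT.

est <- sapply(forest, function(tr) predict_tree(tr, xnew))
zphot <- mean(est)                            # ensemble estimate for this object
sigma_hat <- sd(est) / sqrt(length(est))      # CLT scale of the mean error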

9 Implemented in R
- More training data -> better estimates
- Forests converge pretty quickly in forest size
- Training set size and input space are constrained by memory in the R implementation
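For reference, the standard R route: the randomForest package, which wraps Breiman and Cutler's original Fortran code (the implementation the memory remarks refer to). train and test are hypothetical splits of the colors frame from the first sketch.

library(randomForest)
rf <- randomForest(zspec ~ ug + gr + ri + iz, data = train, ntree = 500)
zphot <- predict(rf, newdata = test)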

10 Results
- RMS error = 0.023
- Training set size = 80,000
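The quoted figure is the root mean square of the errors ε; as a one-liner on the hypothetical test split above:

rms <- sqrt(mean((zphot - test$zspec)^2))     # slide reports 0.023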

11 Error Distribution
[Plots: “Error Distribution” and “Standardized Error Distribution”]
- Since we know the error distribution for each object, we can standardize the errors, and the results should be standard normal over all test objects. Like in this plot! :)
- If the standardized errors are standard normal, then we can predict how many of the errors fall between the tails of the distribution for different tail sizes. Like in this plot! (mostly)
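A sketch of that check: divide each object's error by its own predicted scale (sigma_hat from the CLT sketch, here one value per test object) and compare the tail coverage with the standard normal; 1.96 is the two-sided 95% point.

zstd <- (zphot - test$zspec) / sigma_hat      # standardized errors
mean(abs(zstd) < 1.96)                        # should be close to 0.95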

12 Summary
- Random Forest estimates come with Gaussian error distributions
- 0.023 RMS error is competitive with other methodologies
- This makes Random Forests good

13 Future Work
- The CRLB (Cramér-Rao lower bound) says bigger N gives better estimates from the same estimator
- 80,000 objects is good, but we have far more than that available
- Random Forests in R are extremely memory-inefficient (and therefore time-inefficient), I believe due to the FORTRAN implementation
- So I’m writing a C# implementation

