Mining the MACHO dataset Markus Hegland, Mathematical Sciences Institute, ANU Margaret Kahn, ANU Supercomputer Facility
The MACHO project; Woods data set; Data exploration and data properties; Data preprocessing; Feature sets; Classification using additive models; Training process; Web site
The MACHO Project: To find evidence of dark matter from the gravitational lensing effect. Observations at Mt Stromlo. ^7 observed stars. CCD images.
Woods Data Set: 792 stars identified as long-period variables, chosen from the full MACHO data set. Original data processed by SODOPHOT to give red and blue light curves. The light curves have missing data, large errors and unequal sampling.
Stars from the Woods data set
Two typical long-period stars
Data Preprocessing: The data sampling is not uniform, so standard Fourier transforms cannot be applied directly. Periodic stars satisfy f(t+p) = f(t) for some period p. Long-period variable stars are not exactly periodic, e.g. f(t) = f(t+p) + g(t), where g is small compared with f. Periodic smoothing is used to estimate missing data.
Periodic Smoothing: An estimate for f can be determined by minimizing a function with two penalty terms (the formula is missing from this transcript). The function f is modeled as a piecewise linear function. In practice p is not known, but it can be estimated by a method such as Pisarenko's method. For now the multiplier of the second penalty function is much smaller than that of the first.
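Since the functional itself is not in the transcript, the following is only a minimal sketch of one way periodic smoothing of this kind could be set up. It assumes a least-squares data term plus a periodicity penalty on f(t+p) - f(t) and a much smaller smoothness penalty on second differences; that choice of penalties, their order, and the names `periodic_smooth`, `alpha`, `beta` and `n_nodes` are all assumptions for illustration, not the authors' formulation. The period is rounded to a whole number of grid steps.

```python
import numpy as np

def periodic_smooth(t, y, period, n_nodes=200, alpha=10.0, beta=0.1):
    """Fit node values of a piecewise linear f to irregular samples (t, y).

    Assumed objective (not taken from the slides):
        sum_i (f(t_i) - y_i)^2 + alpha * sum_j (f(x_j + p) - f(x_j))^2
                               + beta  * sum_j (second difference of f at x_j)^2
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    nodes = np.linspace(t.min(), t.max(), n_nodes)
    h = nodes[1] - nodes[0]
    k = int(round(period / h))          # period expressed in whole grid steps
    if not 1 <= k < n_nodes:
        raise ValueError("period must span between 1 and n_nodes - 1 grid steps")

    # Interpolation matrix: each observation touches its two neighbouring nodes.
    A = np.zeros((t.size, n_nodes))
    j = np.clip(((t - nodes[0]) / h).astype(int), 0, n_nodes - 2)
    w = (t - nodes[j]) / h
    rows = np.arange(t.size)
    A[rows, j] = 1.0 - w
    A[rows, j + 1] = w

    # Periodicity penalty rows: f(x_j + p) - f(x_j) at the nodes.
    P = np.zeros((n_nodes - k, n_nodes))
    idx = np.arange(n_nodes - k)
    P[idx, idx] = -1.0
    P[idx, idx + k] = 1.0

    # Smoothness penalty rows: second differences of the node values.
    D = np.zeros((n_nodes - 2, n_nodes))
    idx = np.arange(n_nodes - 2)
    D[idx, idx] = 1.0
    D[idx, idx + 1] = -2.0
    D[idx, idx + 2] = 1.0

    # Stack everything into one linear least-squares problem.
    M = np.vstack([A, np.sqrt(alpha) * P, np.sqrt(beta) * D])
    rhs = np.concatenate([y, np.zeros(P.shape[0] + D.shape[0])])
    u, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return nodes, u
```

With a gap in the observations, the periodicity term carries information from one period earlier and later into the gap, which is how missing data would be estimated under these assumptions.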
Feature Sets: Features are calculated to characterize the light curves. Magnitudes are observed in both a red and a blue band. The difference between the two magnitudes is proportional to the logarithm of the ratio of the intensities of blue and red light, and is called the colour index. Summary features of the light curves are obtained from the colour and the magnitudes by forming the average (or median) over time, the amplitude of the fluctuations, the average frequency or time scale, and a measure of the time scale fluctuations.
Features contd.: Correlation between red and blue magnitudes. Nine features are calculated and stored for each light curve. These features, NOT the original light curve data, are used as predictor variables for the classifier.
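As an illustration of the kind of summary features described, here is a small sketch that computes a handful of them from red and blue magnitude series. It is not the actual set of nine features: the function name `light_curve_features`, the use of the 5th to 95th percentile range as "amplitude", and the dictionary keys are assumptions, and the time scale features are left out.

```python
import numpy as np

def light_curve_features(red, blue):
    """Summarize one star from its red and blue magnitudes (same epochs)."""
    red = np.asarray(red, dtype=float)
    blue = np.asarray(blue, dtype=float)
    good = np.isfinite(red) & np.isfinite(blue)   # drop epochs with missing data
    red, blue = red[good], blue[good]
    colour = blue - red                           # colour index (magnitude difference)

    def amplitude(x):
        # Robust spread: range between the 5th and 95th percentiles (an assumption).
        lo, hi = np.percentile(x, [5, 95])
        return hi - lo

    return {
        "median_red": float(np.median(red)),
        "median_colour": float(np.median(colour)),
        "amplitude_red": float(amplitude(red)),
        "amplitude_colour": float(amplitude(colour)),
        "corr_red_blue": float(np.corrcoef(red, blue)[0, 1]),
    }
```

Stacking one such feature vector per star gives the predictor matrix that a classifier would see instead of the raw light curves.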
Classification using additive models: ANOVA decomposition of a function of several variables into a constant, one-dimensional terms, two-dimensional terms and so on; Friedman (MARS, 1991), Hastie-Tibshirani (GAM, 1990), Wahba (1990). For example, such a function could approximate a classification function that decides to which of two classes (0 or 1) a particular star belongs.
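The formula from this slide is not in the transcript. For reference, the standard form of the ANOVA decomposition of a function of d variables, and its truncation to an additive model, can be written as follows; the symbols f_0, f_i and f_{ij} are the usual textbook notation rather than notation confirmed by the slides.

```latex
f(x_1,\dots,x_d) = f_0 + \sum_{i=1}^{d} f_i(x_i)
    + \sum_{1 \le i < j \le d} f_{ij}(x_i, x_j) + \cdots
    + f_{1 2 \cdots d}(x_1,\dots,x_d)

% Additive model: keep only the constant and the one-dimensional terms
f(x_1,\dots,x_d) \approx f_0 + \sum_{i=1}^{d} f_i(x_i)
```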
Additive Models: In general an additive model is expressed as a sum of unknown smooth functions that have to be estimated from the data. The model is fitted by using a local scoring algorithm which iteratively fits weighted additive models by a back-fitting algorithm. This is a Gauss-Seidel method which iteratively smooths partial residuals by weighted linear least squares.
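The Gauss-Seidel character of back-fitting can be seen in a short sketch. This is a simplified, unweighted version for plain regression: it omits the local scoring outer loop and the weights, and it uses a bin-average smoother in place of whatever smoother the actual code uses. The names `bin_smoother` and `backfit` are hypothetical.

```python
import numpy as np

def bin_smoother(x, r, n_bins=10):
    """Smooth residuals r against x by averaging within equal-width bins."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    which = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    sums = np.bincount(which, weights=r, minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    means = sums / np.maximum(counts, 1)      # empty bins get the value 0
    return means[which]

def backfit(X, y, n_iter=20, n_bins=10):
    """Fit y ~ f0 + sum_j f_j(X[:, j]) by cycling over the components."""
    n, d = X.shape
    f0 = y.mean()
    F = np.zeros((n, d))                      # F[:, j] holds f_j at the data points
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual: remove every component except the j-th.
            partial = y - f0 - F.sum(axis=1) + F[:, j]
            F[:, j] = bin_smoother(X[:, j], partial, n_bins)
            F[:, j] -= F[:, j].mean()         # keep each component centred
    return f0, F
```

Each inner step updates one component using the latest values of all the others, which is what makes the iteration a Gauss-Seidel sweep.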
Possible basis functions for the approximation space in 1D: indicator functions, hat functions, hierarchical hat functions. ADDFIT uses 1D basis functions.
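To make the first two basis choices concrete, here is a small sketch that evaluates indicator and hat functions on a uniform grid over [0, 1]. The function names and the restriction to [0, 1] are assumptions for illustration; this is not ADDFIT code, and the hierarchical variant is not shown.

```python
import numpy as np

def indicator_basis(x, n):
    """n piecewise-constant functions on [0, 1]: column j is 1 on the j-th cell."""
    x = np.asarray(x, dtype=float)
    cell = np.clip((x * n).astype(int), 0, n - 1)
    return np.eye(n)[cell]

def hat_basis(x, n):
    """n + 1 piecewise-linear hat functions centred on the nodes j / n."""
    x = np.asarray(x, dtype=float)
    nodes = np.arange(n + 1) / n
    # Each hat falls linearly from 1 at its node to 0 at the neighbouring nodes.
    return np.clip(1.0 - np.abs(x[:, None] - nodes[None, :]) * n, 0.0, None)
```

Both families sum to one at every point of [0, 1]; a function in the approximation space is a linear combination of the columns, piecewise constant in the first case and piecewise linear in the second.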
Boosting: Boosting is a machine learning procedure which can improve the accuracy of a weak learning algorithm. The AdaBoost procedure used in this code calls a weak learning procedure several times and maintains a distribution of weights over the training set. Initially all weights are equal, but then the weights of incorrectly classified examples are increased so that the weak learner concentrates more on them.
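The weight update can be sketched with a textbook version of discrete AdaBoost. Here the weak learner is a one-feature threshold rule (a decision stump) and the labels are coded as -1 and +1; both are assumptions made for brevity, since the code described above uses additive models as the learner and 0/1 classes. All function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Predict sign where feature j is <= thr, and -sign elsewhere."""
    return np.where(X[:, j] <= thr, sign, -sign)

def fit_stump(X, y, w):
    """Weak learner: the threshold rule with the smallest weighted error."""
    best = (np.inf, 0, 0.0, 1)                 # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from uniform weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote of this weak learner
        pred = stump_predict(X, j, thr, sign)
        w *= np.exp(-alpha * y * pred)         # raise weights of the mistakes
        w /= w.sum()                           # keep w a distribution
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(score)
```

On a one-dimensional example where the positive class sits in the middle of the range, no single stump separates the classes, but three boosted stumps do.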
Training Program: Start with an initial training set of accepted stars, that is, stars of the type of interest. It is helpful to also have a set of unacceptable stars to guide the trainer. Additive models are used to form a classification function using the feature set data from the initial training set. This function is then applied to the full data set and the stars are ordered by their function values.
The light curves are displayed in decreasing order of function value. Ideally the training set stars should appear first. Further acceptable and unacceptable stars can be chosen by clicking on the relevant button, and then a new classification is carried out. Continue the process until satisfied with the star sorting.
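One round of this train, rank and relabel cycle can be sketched as below. The sketch assumes the features are held in a NumPy array with one row per star, that accepted and rejected stars are kept as sets of row indices, and that the classifier is supplied as a `fit` callable; `linear_fit` is only a stand-in for the additive model, and all names here are hypothetical.

```python
import numpy as np

def linear_fit(X, y):
    """Placeholder learner: least-squares linear score (not an additive model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def training_round(features, accepted, rejected, fit=linear_fit, page_size=60):
    """Fit on the labelled stars, score every star, return the top-ranked page."""
    labelled = sorted(accepted | rejected)
    X = features[labelled]
    y = np.array([1.0 if i in accepted else 0.0 for i in labelled])
    score = fit(X, y)                              # classification function
    values = score(features)                       # applied to the full data set
    order = np.argsort(-values, kind="stable")     # decreasing function value
    return order[:page_size]
```

After the user accepts or rejects further stars on the displayed page, the two index sets grow and `training_round` would be called again, until the ordering looks right.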
Web-based data mining tool: Software link to MACHO demo. This software contains Python code to read ASCII star data files, process them by removing any stars with insufficient good data, and then calculate several features for each star. These features are then used by the training program to select groups of like stars. The programs incorporate a method of caching data so that it is kept in binary form for quicker access. The caching software was written by Ole Nielsen and can be downloaded from the ANU datamining web page.
Procedure to run: Determine the initial training set. When prompted, enter the star numbers for acceptable stars. Stars 1 and 2 are already entered as a default. When the web browser appears with the 60 top-ranked stars, those that have already been deemed acceptable will have the accept button disabled and those that have been rejected will have the reject button disabled.
The user can then choose more acceptable or unacceptable stars by clicking on the relevant button. Previous decisions can be changed. After choosing a few stars, click on the continue button to see the next 60 top-ranked stars, or go down to further pages to make more choices. Continue until satisfied with the initial ranked stars. Stop by clicking quit or restart.