September 2005 NVO Summer School. Object Classification in the Virtual Observatory: A VO Status Report. Tom McGlynn, NASA/GSFC. The US National Virtual Observatory.
How do we know what we want in the VO? Pretend the VO exists. What is the science we are doing with it? Now try to do that science and see what gets in the way.
Can we classify ROSAT X-ray sources? All RASS sources (124,730). Classified RASS sources: ~7,000. Total RASS sources: ~130,000.
What do we want to do? Find counterparts to ROSAT X-ray sources in the optical, IR, and radio. Train a classifier to use multiwavelength information to determine the type of each object. Classify all of the objects seen by ROSAT.
What is classification? Translation from observables to distinct physical processes. Each element is classified independently of the others. Is classification different from measurement? Classification versus cataloging. We usually classify objects, but also: events (GRBs, solar flares, …); simulated data; pixels/regions in an image (Earth and planetary studies, shocked regions, …).
It's not just us. http://aria.arizona.edu/courses/tutorials/class/html/class.html A typical plot of objects to be classified? There is a lot of information and discussion about classification outside astronomy.
Examples. Moving versus fixed stars. Classes of stellar spectra (ordered by strength of Balmer lines): a substitute for a measurement; cf. dwarf versus giant. Osterbrock diagram: AGN versus star-forming emission-line galaxies. Bautz-Morgan types of clusters of galaxies: dominance of the cluster by its central galaxy. Types of X-ray sources: AGN, SNR, pulsars, XRBs, …
Galaxy Classification
Why do we classify? Understand a given field. Generate statistical samples. Compare different regions/observations. Find rare objects. Remove unwanted backgrounds. Plan subsequent observations. …
Do we know what we are looking for? Yes: we have a good idea of the kinds of objects that are in the field. This is supervised classification: find out which regions of observable phase space belong to which classes and use that knowledge to classify new sources. No: we don't really know what we're looking at. This is unsupervised classification: is there any structure in the phase space distribution?
Supervised versus unsupervised classification. Supervised and Unsupervised Land Use Classification, Chris Banman, http://www.emporia.edu/earthsci/student/banman5/perry3.html
Supervised classification. Often has a training phase where a priori knowledge is used to tune the classifier algorithm. Training takes most of the time. (But the Osterbrock diagram is based on theoretical modeling.) We specify a list of output classes. The classifier may give a list of probabilities of membership in more than one class. Algorithms: neural networks, nearest neighbor, decision trees.
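As a minimal sketch of the supervised case, the block below implements a plain k-nearest-neighbor classifier that returns membership probabilities for more than one class. It is an illustration only, not the ClassX method (which used oblique decision trees); the function name, the two features, and the training values are all hypothetical.

```python
import math
from collections import Counter

def knn_class_probabilities(training, query, k=3):
    """Return {class_label: fraction of the nearest neighbours with that label}."""
    # training is a list of (feature_vector, class_label) pairs.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    neighbours = by_distance[:k]
    votes = Counter(label for _, label in neighbours)
    return {label: count / len(neighbours) for label, count in votes.items()}

# Hypothetical training set: (log X-ray/optical flux ratio, optical colour) -> class.
training_set = [
    ((-2.0, 0.5), "star"),
    ((-2.2, 0.7), "star"),
    ((-1.8, 0.6), "star"),
    ((0.1, 0.2), "AGN"),
    ((0.3, 0.1), "AGN"),
    ((0.0, 0.3), "AGN"),
]

# Classify one new source and pick the most probable class.
probabilities = knn_class_probabilities(training_set, (-1.9, 0.55), k=3)
best_class = max(probabilities, key=probabilities.get)
```

Here the "training phase" is just storing the labeled points; with neural networks or decision trees the tuning step is where most of the time goes.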
Supervised classifier training. Neural networks. Oblique decision trees.
Unsupervised classification. Tries to find natural groupings of data. The user often specifies the number of classes to find. Classes found are anonymous: it is up to the user to define their physical meaning. Algorithms: self-organizing maps, K-means, C-means, hierarchical clustering, Gaussian mixtures.
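A small sketch of K-means (Lloyd's algorithm) shows what "anonymous classes" means in practice: the output is only a cluster index per point. The function name, the starting centres, and the sample points are hypothetical, and a real analysis would normally use a tested library routine.

```python
import math

def k_means(points, initial_centres, iterations=20):
    """Plain K-means (Lloyd's algorithm); returns (centres, labels)."""
    centres = [tuple(c) for c in initial_centres]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        new_centres = []
        for j, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centres.append(old)  # keep an empty cluster where it was
        if new_centres == centres:
            break  # converged: the centres stopped moving
        centres = new_centres
    return centres, labels

# Hypothetical two-feature data with two obvious groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
centres, labels = k_means(points, initial_centres=[points[0], points[3]])
```

The labels come back as 0 and 1; deciding that cluster 0 is, say, "stars" is left to the user, as the slide notes.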
Self-organizing maps. Catalogs in VizieR. K-means. Fuzzy C-means.
Some key questions.
1. (S) What output classes are we interested in, and what degree of resolution do we want? Star versus galaxy, or A0V versus SBa? (U) How many classes might we expect?
2. What input data sets are we going to use?
3. How are we going to get them?
4. How do we combine them?
5. What observables are available? Which are useful?
6. (S) What training sets are available? (U) How do we understand the output classes?
7. What algorithm are we going to use in classification?
8. How can we test the results so that we believe them?
Specification/Count of Output Classes. We weren't sure how detailed our classifications could be, and had to play with the classifiers to see what might be feasible. Does the VO help? Not directly. This will often be implicit in the problem. By making other aspects of classification easier, the VO makes playing around with this choice easier.
What input data sets are we going to use? We knew which datasets we were going to use, but we added one along the way. Does the VO help? Maybe. VO registries can help find resources, but these will often be implicit in the problem.
How are we going to get the data? We used custom interfaces to get data from different resources, but VOTables were developed early enough for us to use (Perl VOTable parser from the ClassX effort). This took a fair bit of work. Does the VO help? A lot. There are just a few standard ways to get the data and nice standard ways of defining them. Limits on some services are still annoying. New libraries can make this part really easy. Large XML files are cumbersome to process in many tools.
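One way around cumbersome large XML files is to stream the VOTable row by row instead of loading the whole document. The sketch below uses only the Python standard library and assumes a single table serialized as TABLEDATA; it is not the Perl parser mentioned above, and the function names and sample rows are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}TR'."""
    return tag.rsplit("}", 1)[-1]

def iter_votable_rows(source):
    """Yield one dict per TABLEDATA row, keyed by FIELD name.

    Assumes one TABLE per file; cell values are returned as strings.
    """
    field_names = []
    for _, element in ET.iterparse(source, events=("end",)):
        name = local_name(element.tag)
        if name == "FIELD":
            field_names.append(element.get("name"))
        elif name == "TR":
            cells = [td.text for td in element if local_name(td.tag) == "TD"]
            yield dict(zip(field_names, cells))
            element.clear()  # drop the row's cells so memory use stays low

# Hypothetical two-row VOTable used only to exercise the parser.
sample = b"""<?xml version="1.0"?>
<VOTABLE version="1.1">
 <RESOURCE>
  <TABLE>
   <FIELD name="name" datatype="char" arraysize="*"/>
   <FIELD name="ra" datatype="double" unit="deg"/>
   <FIELD name="dec" datatype="double" unit="deg"/>
   <DATA>
    <TABLEDATA>
     <TR><TD>SRC_A</TD><TD>10.5</TD><TD>-3.25</TD></TR>
     <TR><TD>SRC_B</TD><TD>200.0</TD><TD>45.0</TD></TR>
    </TABLEDATA>
   </DATA>
  </TABLE>
 </RESOURCE>
</VOTABLE>"""

rows = list(iter_votable_rows(io.BytesIO(sample)))
```

The same generator accepts a file name or an open file, so a large download can be processed one row at a time.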
How do we combine them? We used custom software. This took a lot of work, but we had to deal with the issue of multiple counterparts to each X-ray source. Does the VO help? A lot. XMatch does a lot of what we want, though not everything. Note that the spatial matching capabilities in TOPCAT allow merging of data from ConeSearch too.
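The multiple-counterpart issue can be sketched with a brute-force positional match that keeps every candidate inside the search radius rather than only the nearest one. This is an illustration under stated assumptions, not the ClassX software or XMatch: the function names, source names, positions, and the 30 arcsec radius are all hypothetical.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    half = (math.sin((dec2 - dec1) / 2) ** 2
            + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, half))))

def cross_match(xray_sources, candidates, radius_arcsec):
    """Map each X-ray source name to all candidates within the radius, nearest first."""
    matches = {}
    for name, ra, dec in xray_sources:
        found = []
        for cand_name, cand_ra, cand_dec in candidates:
            sep = angular_separation_deg(ra, dec, cand_ra, cand_dec) * 3600.0
            if sep <= radius_arcsec:
                found.append((sep, cand_name))
        matches[name] = sorted(found)  # sorted by separation in arcsec
    return matches

# Hypothetical positions in degrees: (name, RA, Dec).
xray = [("X1", 150.0, 2.0), ("X2", 10.0, -30.0)]
optical = [("O1", 150.001, 2.0), ("O2", 150.0, 2.005), ("O3", 180.0, 0.0)]
matches = cross_match(xray, optical, radius_arcsec=30.0)
```

In this example X1 picks up two counterparts and X2 none, so any later step has to decide how to treat both cases. The double loop is quadratic in catalog size; real catalogs need a spatial index.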
What observables are available? Which are useful? This took a lot of work. Understanding what variables were available and getting full descriptions was difficult. Does the VO help? A little. Visualization tools like Mirage are nice for getting a feel for the data, but non-VO tools (e.g., IDL itself) may do this just as well. Documentation in the VO is probably not better than before but a common framework for getting information to users is available if providers ever get around to providing adequate documentation.
Classification needs the right information, not all information. Hughes effect. Classification of Multi-Spectral Data by Joint Supervised-Unsupervised Learning (Shahshahani & Landgrebe)
Training set/ground truth data. We knew most of the training data in advance. Does the VO help? The VO registry may point out some possibilities, but training or truth data may be implicit in the problem.
What algorithm are we going to use in classification? We had experience with oblique decision trees. Does the VO help? A little. VOStat provides a few capabilities for unsupervised classification, but the Web interface is a little flaky. Web service interfaces to a few standard classifiers might be nice. The VO could do a lot more here.
VOStat. See www.vostat.org. Statistics routines online with a VO interface. Downloadable library. Fairly minimal Web interface. Includes K-means and hierarchical clustering tools.
How can we test the results so that we believe them? We found a number of independently classified sets of objects and checked for consistency. Does the VO help? Yes. This is probably where we can most effectively use VO resources we discover in the registry. However, a couple of the samples we used were not yet published.
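A consistency check of this kind can be sketched as an agreement fraction plus a table of which classes get confused with which. The function name and the label lists below are hypothetical; they only stand in for a classifier's output and an independently classified sample covering the same objects.

```python
from collections import Counter

def compare_classifications(predicted, reference):
    """Return (agreement fraction, Counter of (reference, predicted) label pairs)."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("both label lists must be non-empty and cover the same objects")
    confusion = Counter(zip(reference, predicted))
    agreed = sum(n for (ref, pred), n in confusion.items() if ref == pred)
    return agreed / len(reference), confusion

# Hypothetical labels for five objects, in the same order in both lists.
reference = ["AGN", "AGN", "star", "star", "galaxy"]
predicted = ["AGN", "star", "star", "star", "galaxy"]
agreement, confusion = compare_classifications(predicted, reference)
```

The off-diagonal entries, such as `confusion[("AGN", "star")]`, show where the classifier and the independent sample disagree, which is usually more informative than the single agreement number.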
Testing the results. Classify independently classified datasets. Check faint sources?
Overall… A lot of progress since we started ClassX, but plenty of issues still remain.
A ClassX phase space slice
Science. Probabilistic classifications of all ROSAT X-ray sources: McGlynn et al., 2004ApJ...616.1284M. New HMXRBs: Suchkov and Hanisch, 2004ApJ...612..437S.