
Slide 1: Current Trends in Machine Learning and Data Mining
Dr Marcus Gallagher
School of Information Technology and Electrical Engineering
University of Queensland 4072, Australia
marcusg@itee.uq.edu.au

Slide 2: Talk Overview (Australian Virtual Observatory Workshop 2003)
- Machine learning
- Data mining, and where machine learning fits into data mining
- Scientific data mining: machine learning and the scientific process
- Current topics & future directions
- Summary

Slide 3: Machine Learning
…the study of computer algorithms capable of learning to improve their performance on a task on the basis of their own experience. Often this is learning from data.
A sub-discipline of artificial intelligence, with large overlaps into statistics, pattern recognition, visualization, robotics, control, …

Slide 4: Data Mining
…the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Slide 5: Data Mining
Modern science is driven by data analysis like never before. Our ability to collect and process data is increasing exponentially!

Slide 6: Data mining
[Chart: increase in CPU performance (1/execution time)]

Slide 7: Data mining
[Chart: trends in memory capacity and cost]

Slide 8: Data mining
General trends:
- Increase in the size of datasets, in terms of both observations and dimensionality. What to do when #dimensions > #observations?
- Interest in data mining for non-numerical data.
"We are drowning in information and starving for knowledge." – Rutherford D. Roger

Slide 9: Data Mining
The data mining process: data collection, define problem, data preparation, data modelling, interpretation/evaluation, implement/deploy model.

Slide 10: Data Mining
The same process stages (data collection, define problem, data preparation, data modelling, interpretation/evaluation, implement/deploy model), with machine learning positioned at the data modelling stage.

Slide 11: Data modelling and the scientific process
Data modelling plays an important role at several stages in the scientific process:
1. Observe and explore interesting phenomena.
2. Generate hypotheses.
3. Formulate a model to explain the phenomena.
4. Test predictions made by the theory.
5. Modify the theory and repeat (at 2 or 3).
The explosion of data suggests that we need to (partially) automate numerous aspects of the scientific process.

Slide 12: 1. Observe and explore interesting phenomena
Problems here are typically referred to as unsupervised learning in the ML community:
Given a set of d-dimensional data vectors (X_1, …, X_n), X_i = (x_1, …, x_d), build a model of the data to infer properties of the underlying distribution (process) that generated the data.
Key problems:
- Dimensionality reduction: developing algorithms that can reduce a dataset of hundreds or thousands of dimensions to just a few for visualization, while retaining as much as possible of the information in the original dataset.
- Clustering and outlier detection: identifying trends and/or anomalies in datasets.
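The dimensionality-reduction problem above can be illustrated with principal component analysis (PCA), the simplest linear approach. This is a minimal sketch on synthetic data; the dataset, random seed, and target dimension are assumptions for illustration only.

```python
import numpy as np

# Project synthetic 5-D data down to 2 dimensions with PCA.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # 200 observations, d = 5
X = X - X.mean(axis=0)                 # centre the data

# The SVD of the centred data matrix gives the principal directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = X @ Vt[:k].T                       # n x k low-dimensional projection

# Fraction of the total variance retained by the first k components.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
```

For visualization, the rows of Z can be plotted directly; `explained` quantifies how much information (variance) the 2-D view retains.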

Slide 13: 1. Observe and explore interesting phenomena
Current ML approaches:
- Independent Component Analysis – decomposes multivariate data with the aim of producing components that are as statistically independent as possible. Related to PCA and factor analysis.
- Gaussian mixture models for clustering – use a semi-parametric probability density estimator that is trained iteratively on the data. Implement a soft version of k-means.
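The "soft k-means" behaviour of a Gaussian mixture can be sketched with a minimal EM loop (not the speaker's code): each point receives a responsibility (soft assignment) for each component rather than a hard cluster label. The 1-D two-component data here are synthetic assumptions for illustration.

```python
import numpy as np

# Synthetic data from two well-separated 1-D Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(3, 0.5, 150)])

mu = np.array([-1.0, 1.0])       # initial means
sigma = np.array([1.0, 1.0])     # initial standard deviations
pi = np.array([0.5, 0.5])        # mixing weights

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i).
    dens = (pi / (sigma * np.sqrt(2 * np.pi))
            * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
```

Replacing the soft responsibilities `r` with hard 0/1 assignments would recover ordinary k-means, which is the sense in which the mixture model is its "soft" version.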

Slide 14: 1. Observe and explore interesting phenomena
- Self-organizing maps and topographic mapping – similar to clustering, but the cluster centres are constrained to lie in a low-dimensional manifold (and so have a spatial relationship).
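The constraint described above can be seen in a toy self-organizing map: a 1-D chain of units embedded in a 2-D data space. Each update moves the best-matching unit and its chain neighbours toward the input, which is what gives the map its spatial (topographic) ordering. All data and schedule parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.uniform(0, 1, size=(500, 2))          # 2-D inputs
W = rng.uniform(0, 1, size=(10, 2))              # 10 units on a 1-D grid
grid = np.arange(10)

lr, width = 0.5, 3.0
for x in data:
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # best-matching unit
    # Neighbourhood function: units near the BMU on the chain also move.
    h = np.exp(-((grid - bmu) ** 2) / (2 * width ** 2))
    W += lr * h[:, None] * (x - W)
    lr *= 0.995                                  # decay learning rate
    width = max(0.5, width * 0.995)              # shrink neighbourhood
```

After training, adjacent units on the chain sit near each other in data space, so the unit index acts as a 1-D coordinate for the 2-D data.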

Slide 15: 3. Formulate model to explain phenomena
Problems here are typically referred to as supervised learning in the ML community:
Given a training set of input–output pairs {(X_1, Y_1), …, (X_n, Y_n)}, where the input values are thought to have some influence on the corresponding output values, build a model of the data that can predict the outputs for unseen (test) inputs.
Key problems: regression, classification, forecasting.
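The train-then-predict pattern described above can be sketched with the simplest regression model, a linear fit by ordinary least squares. The synthetic data and true weight vector are assumptions for illustration only.

```python
import numpy as np

# Synthetic training data: y = X @ w_true + noise.
rng = np.random.default_rng(3)
w_true = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(100, 3))
y_train = X_train @ w_true + rng.normal(scale=0.1, size=100)

# Closed-form least-squares fit (lstsq for numerical stability).
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Predict outputs for unseen (test) inputs.
X_test = rng.normal(size=(20, 3))
y_pred = X_test @ w_hat
```

Classification and forecasting follow the same pattern; only the form of the model and the loss being minimized change.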

Slide 16: 3. Formulate model to explain phenomena
Established ML approaches:
- Neural networks
- Decision trees and rule-based classifiers
Example applications in astronomy:
- Star/galaxy classification on the basis of optical data.
- Photometric redshift estimation.
- Noise identification and removal in gravitational-wave detectors.
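As a minimal neural-network-style example, here is a single perceptron trained on synthetic 2-D features, a toy stand-in for a task like star/galaxy classification (the data, cluster centres, and feature meaning are all hypothetical).

```python
import numpy as np

# Two synthetic, linearly separable classes of "optical features".
rng = np.random.default_rng(7)
stars = rng.normal([1, 1], 0.3, size=(100, 2))
galaxies = rng.normal([-1, -1], 0.3, size=(100, 2))
X = np.vstack([stars, galaxies])
y = np.array([1] * 100 + [-1] * 100)

# Perceptron learning rule: update only on misclassified examples.
w = np.zeros(2)
b = 0.0
for _ in range(10):                       # training epochs
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:        # misclassified -> update
            w += yi * xi
            b += yi

accuracy = (np.sign(X @ w + b) == y).mean()
```

Real astronomical classifiers of this era typically used multi-layer networks trained by backpropagation, but the input/output framing is the same.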

Slide 17: Current/New Machine Learning Models
Current ML approaches:
- Support vector machines – use insights from computational learning theory and geometry to produce predictors with powerful discrimination and good generalization.
- Ensemble methods – improve the accuracy of predictions by combining multiple models, often with bootstrap sampling. Example: boosting incrementally constructs an ensemble of weak models, where each model is forced to concentrate on the mistakes made by the previous models.
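The boosting idea can be sketched with a small AdaBoost loop over decision stumps on a 1-D toy problem (synthetic data and round count are assumptions, not the speaker's setup). Note how the weight update forces each new stump to concentrate on previously misclassified points.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = np.where(np.abs(x) > 0.5, 1, -1)      # labels in {-1, +1}

def stump_predict(x, thresh, sign):
    return sign * np.where(x > thresh, 1, -1)

def fit_stump(x, y, w):
    # Pick the (threshold, sign) pair with the smallest weighted error.
    best = (0.0, 1, np.inf)
    for thresh in np.linspace(-1, 1, 41):
        for sign in (1, -1):
            err = w[stump_predict(x, thresh, sign) != y].sum()
            if err < best[2]:
                best = (thresh, sign, err)
    return best

w = np.full(len(x), 1 / len(x))
ensemble = []
for _ in range(20):                        # boosting rounds
    thresh, sign, err = fit_stump(x, y, w)
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak model
    ensemble.append((alpha, thresh, sign))
    # Up-weight misclassified points, down-weight correct ones.
    w *= np.exp(-alpha * y * stump_predict(x, thresh, sign))
    w /= w.sum()

F = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
accuracy = (np.sign(F) == y).mean()
```

No single stump can represent the interval concept |x| > 0.5, but the weighted vote of stumps can, which is the point of the ensemble.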

Slide 18: Current/New Machine Learning Models
- Gaussian processes – use the machinery of Bayesian inference to model data with stochastic processes. All information is represented as a probability distribution, incorporating the uncertainty associated with both prior information and the predictions made.
- Probabilistic graphical models – the model is a probability distribution in which dependencies between variables are explicitly encoded. These are generative models.
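The "prediction as a distribution" point can be made concrete with a minimal Gaussian-process regression sketch: with an RBF kernel, the posterior at each test point has both a mean and a variance. The data, kernel length-scale, and noise level are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, length=1.0):
    # Squared-exponential kernel between two sets of 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

rng = np.random.default_rng(5)
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=20)

noise = 0.1 ** 2
K = rbf(x_train, x_train) + noise * np.eye(20)
K_inv = np.linalg.inv(K)

x_test = np.array([0.0, 2.0])
K_star = rbf(x_test, x_train)
mean = K_star @ K_inv @ y_train                       # posterior mean
var = rbf(x_test, x_test).diagonal() - np.einsum(
    'ij,jk,ik->i', K_star, K_inv, K_star)             # posterior variance
```

The variance shrinks near observed data and grows away from it, so the model reports its own uncertainty rather than a bare point prediction.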

Slide 19: Current/New Problems in Machine Learning
Active learning:
- Assume that data can be generated (measured or labelled) on demand – build a learning algorithm that learns on the basis of self-selected data points (queries).
- Aims to reduce the amount of data required, the training time of the model, or the amount of data that must be manually labelled.
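A toy example of the query-selection idea: for a 1-D threshold concept, always querying the most uncertain point reduces active learning to binary search, so O(log n) labels locate the boundary instead of labelling every point. The boundary value and the oracle are hypothetical stand-ins for an expensive labelling process.

```python
# Hypothetical labelling oracle: returns the class of a queried point.
true_boundary = 0.37
def oracle(x):                      # expensive label, generated on demand
    return 1 if x > true_boundary else 0

lo, hi = 0.0, 1.0
queries = 0
while hi - lo > 1e-3:
    mid = (lo + hi) / 2             # the most uncertain point
    if oracle(mid):                 # query its label
        hi = mid
    else:
        lo = mid
    queries += 1

estimate = (lo + hi) / 2
```

Ten queries pin the boundary to within 0.001, where passive learning would need on the order of a thousand labelled points for the same resolution.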

Slide 20: Current/New Problems in Machine Learning
Semi-supervised learning:
- Learning from both labelled (expensive, scarce) and unlabelled (abundant, cheap) data.
- Aims are similar to those of active learning.
Transductive learning:
- All unlabelled data points belong to the test set.
- The algorithm can take advantage of the spatial distribution of the data that the real world generates (and that it will be tested on).
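A sketch of the semi-supervised idea, assuming synthetic two-cluster data: with only one labelled point per cluster, labels spread to the unlabelled points through local neighbourhoods. This crude nearest-labelled-neighbour loop is a stand-in for proper graph-based label propagation, not a standard algorithm.

```python
import numpy as np

# Two synthetic clusters; only one labelled example in each.
rng = np.random.default_rng(6)
c0 = rng.normal([-2, 0], 0.4, size=(50, 2))
c1 = rng.normal([2, 0], 0.4, size=(50, 2))
X = np.vstack([c0, c1])
labels = np.full(100, -1)                      # -1 = unlabelled
labels[0], labels[50] = 0, 1

# Repeatedly give each unlabelled point the label of its nearest
# labelled neighbour, but only when that neighbour is nearby.
for _ in range(100):
    known = labels != -1
    if known.all():
        break
    for i in np.where(~known)[0]:
        d = np.linalg.norm(X[known] - X[i], axis=1)
        j = np.argmin(d)
        if d[j] < 0.5:                         # propagate locally only
            labels[i] = labels[known][j]
```

The unlabelled points contribute no labels of their own, but their spatial distribution is exactly what lets two labels cover both clusters.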

Slide 21: Current/New Problems in Machine Learning
Non-numerical ("unreal"!) data. Examples:
- Strings – text, DNA, computer programs.
- Trees – parse trees, XML, phylogenetic trees.
- Graphs – organic molecules, Go board positions.
Embedding unreal data in continuous space and applying standard ML techniques has limitations. We need to define appropriate kernels and distance metrics over these data spaces.
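One concrete kernel over string data is the p-spectrum kernel, which counts shared substrings of length p; it gives a similarity measure directly on strings, with no continuous embedding step. The toy DNA strings below are illustrative.

```python
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """p-spectrum kernel: inner product of substring-count vectors."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[k] * ct[k] for k in cs)

k_self = spectrum_kernel("GATTACA", "GATTACA")   # 5 trigrams, all shared
k_near = spectrum_kernel("GATTACA", "GATTACC")   # one trigram differs
k_far = spectrum_kernel("GATTACA", "CCCGGG")     # no trigrams shared
```

Any kernel-based method (e.g. an SVM) can then operate on strings, trees, or graphs directly, provided a suitable kernel is defined for that data space.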

Slide 22: Summary
- Automated data mining plays an increasingly important role in the scientific process.
- Machine learning techniques are driven by the problems in data mining and provide some effective solutions.
- Machine learning is a highly active area of research, with new and evolving algorithms.
- The scope of learning problems is widening to deal with important real-world data.

Slide 23: References
1. E. Mjolsness and D. DeCoste. Machine Learning for Science: State of the Art and Future Prospects. Science (293):2051-2055, 2001.
2. D. Hand et al. Principles of Data Mining. MIT Press, 2001.
3. T. Hastie et al. The Elements of Statistical Learning. Springer, 2001.
4. R. Tagliaferri et al. Neural networks in astronomy. Neural Networks (16):297-319, 2003.
5. S. Harmeling et al. Kernel-based nonlinear blind source separation. Neural Computation 15(5):1089-1124, 2003.
6. T. Graepel. Getting Real with Unreal Data: Lessons Learned and the Way Ahead. NIPS Workshop on Unreal Data, 2002.
7. S. J. Hong and S. Weiss. Advances in predictive models for data mining. Pattern Recognition Letters 22(1):55-61, 2001.
