Dr. Gari D. Clifford, University Lecturer & Associate Director, Centre for Doctoral Training in Healthcare Innovation, Institute of Biomedical Engineering, University of Oxford. Information Driven Healthcare: Data Visualization & Classification. Lecture 1: Introduction & preprocessing.
The course: A practical overview of (a subset of) classifiers and visualization tools: data preparation, PCA, K-means clustering, KNN; statistics, regression, LDA, logistic regression; neural networks; Gaussian mixture models, EM; support vector machines. Labs – try to 1) classify flowers (classic dataset), … then 2) predict mortality in the ICU! (... & publish if you do well!)
Workload Two lectures each morning Five 4-hour labs (each afternoon) Read one article each eve (optional)
Assessment/assignments: Class interaction. Lab diary – write up notes as you perform investigations, and submit lab code (m-file) and a Word/OO doc answering the questions at 5pm each day … No paper please! Absolutely no homework! ... but you can write a paper afterwards if your results are good!
Course texts
- Ian Nabney, Netlab: Algorithms for Pattern Recognition, in the series Advances in Pattern Recognition, Springer (2001), ISBN 1-85233-440-1. http://www.ncrg.aston.ac.uk/netlab/book.php
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer (2006), ISBN 0-38-731073-8. http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm
- Press, Teukolsky, Vetterling & Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, Cambridge University Press, 1992. [Ch. 2.6, 10.5 (p414-417), 11.0 (p465-460), 15.4 (p671-688), 15.5 (p681-688), 15.6 & 15.7 (p689-700)] Online at http://www.nrbook.com/a/bookcpdf.php
- L. Tarassenko, A Guide to Neural Computation, John Wiley & Sons (February 1998), Ch. 7 (p77-101)
- Ian Nabney, Netlab2? - when available!
Syllabus – Week 1

Monday: Data exploration [GDC]
- Lecture 1 (9.30-10.30am): Introduction, probabilities, entropy, preprocessing, normalization, segmenting data (PCA, ICA)
- Lecture 2 (11am-12pm): Feature extraction, visualization (K-means, SOM, GTM, Neuroscale)
- Lab 1 (1-5pm): Preprocessing of data & visualization – segmentation (train, test, evaluation), PCA & K-means with 2 classes
- Reading for tomorrow: Bishop PRML, Ch4.1 p179-196, Ch4.3.2 p205-206, Ch2.3.7 p102-103,691; Netlab Ch3.5-3.6 p101-107

Tuesday: Clinical Statistics & Classifiers [IS]
- Lecture 3 (9.30-10.30am): Clinical statistics: t-test, χ² test, Wilcoxon rank sum test, linear regression, bootstrap, jackknife
- Lecture 4 (11am-12pm): Clinical classifiers: LDA, KNN, logistic regression
- Lab 2 (1-5pm): P-values, statistical testing, LDA, KNN and logistic regression
- Reading for tomorrow: Netlab Ch5.1-5.6 p165-167, Ch6 p191-221

Wednesday: Optimization and Neural Networks [GDC]
- Lecture 5 (9.30-10.30am): ANNs – RBFs and MLPs – choosing an architecture, balancing the data
- Lecture 6 (11am-12pm): Training & optimization, N-fold validation
- Lab 3 (1-5pm): Training an MLP to classify flower types and then mortality – partitioning and balancing data
- Reading for tomorrow: Netlab Ch3.1-3.4 p79-100

Thursday: Probabilistic Methods [DAC]
- Lecture 7 (9.30-10.30am): GMM, MCMC, density estimation
- Lecture 8 (11am-12pm): EM, Variational Bayes, missing data
- Lab 4 (1-5pm): GMM and EM
- Reading for tomorrow: Bishop Ch7 p325-345 (SVM)

Friday: Support Vector Machines [CO/GDC]
- Lecture 9 (9.30-10.30am): SVMs and constrained optimization
- Lecture 10 (11am-12pm): Wrap-up
- Lab 5 (1-5pm): Use the SVM toolbox and vary 2 parameters for regression & classification (1-class: death, and then alive; then 2-class)
Overview of data for lab: You will be given two datasets: 1. A simple dataset for learning – Fisher's Iris dataset. 2. A complex ICU database (if this works – publish!!!). In each lab you will use dataset 1 to understand the problem, then dataset 2 to see how you can apply this to more challenging data.
So let's start … what are we doing? Trying to learn classes from data so that when we see new data, we can make a good guess concerning its class membership (e.g. is this patient part of the set of people likely to die and, if so, can we change his/her treatment?). How do we do this? Supervised – use labelled data to train an algorithm (e.g. KNN). Unsupervised – use heuristics or metrics to look for clusters in data (K-means clustering, SOMs, GMM, …).
Data preprocessing/manipulation: Filter data to remove outliers (reject obviously too large/small values). Transform data to zero mean and unit variance if parameters are not in the same units! Compress data into lower dimensions to reduce workload or to visualize data relationships. Rotate data, or expand into higher dimensions, to improve the separation between classes.
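The first two steps on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than the MATLAB used in the labs; the heart-rate values, the plausible range of 20-250 and the function names are all hypothetical choices made for the example.

```python
import statistics

def reject_outliers(values, low, high):
    """Keep only values inside the plausible range [low, high]."""
    return [v for v in values if low <= v <= high]

def zscore(values):
    """Rescale values to zero mean and unit (sample) variance."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

# Hypothetical heart-rate readings (beats per minute); 0 and 900 are
# assumed to be sensor artefacts rather than real measurements.
heart_rate = [72.0, 80.0, 0.0, 65.0, 900.0, 90.0, 77.0]

clean = reject_outliers(heart_rate, low=20.0, high=250.0)
scaled = zscore(clean)

print(clean)
print([round(s, 2) for s in scaled])
```

After scaling, a feature measured in beats per minute and one measured in, say, mmHg contribute on a comparable footing to any distance-based method.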
The curse of dimensionality: Richard Bellman (1953) coined the term The Curse of Dimensionality (or Hughes effect). It is the problem caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space. Bellman gives the following example: 100 evenly spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points; an equivalent sampling of a 10D unit hypercube with a lattice with a spacing of 0.01 between adjacent points would require $10^{20}$ sample points. Therefore, at this spatial sampling resolution, the 10-dimensional hypercube is a factor of $10^{18}$ larger than the unit interval.
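Bellman's arithmetic is easy to reproduce. The short Python sketch below counts lattice points for a given spacing and dimension; the function name is hypothetical.

```python
def lattice_points(spacing, dims):
    """Lattice points needed to cover a unit hypercube at the given spacing."""
    per_axis = round(1 / spacing)  # 0.01 spacing -> 100 points along each axis
    return per_axis ** dims

points_1d = lattice_points(0.01, 1)    # unit interval
points_10d = lattice_points(0.01, 10)  # 10D unit hypercube

print(points_1d)
print(points_10d == 10**20)
print(points_10d // points_1d == 10**18)
```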
So what does that mean for us? We need to think about how much data we have and how many parameters we use. Rule of thumb: have at least 10 training samples of each class per input feature dimension (although this depends on the separability of the data and can be up to 30 for complex problems and as low as 2-5 for simple problems [*]). So for the Iris dataset we have 4 measured features on 50 examples of each of the three classes … so we have enough! For the ICU data we have 1400 patients, of whom 970 survived and 430 died … so taking the minimum of these we could use up to 43 of the 112 features. Generally, though, you need more data … or you compress the data into a smaller number of dimensions. [*] Thomas G. Van Niel, Tim R. McVicar and Bisun Datt, On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification, Remote Sensing of Environment, Volume 98, Issue 4, 30 October 2005, Pages 468-480, doi:10.1016/j.rse.2005.08.011
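The rule of thumb amounts to dividing the size of the smallest class by the required number of samples per feature. A small Python sketch using the class counts quoted on the slide (the function name and dictionary layout are illustrative only):

```python
def max_features(class_counts, samples_per_feature=10):
    """Largest feature count supported by the smallest class under the rule of thumb."""
    smallest_class = min(class_counts.values())
    return smallest_class // samples_per_feature

iris = {"setosa": 50, "versicolor": 50, "virginica": 50}
icu = {"survived": 970, "died": 430}

print(max_features(iris))      # 5, so the 4 Iris features are fine
print(max_features(icu))       # 43 of the 112 ICU features
print(max_features(icu, 30))   # 14 if the problem is complex
```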
Principal Component Analysis (PCA): Standard signal/noise separation method. Compress data into lower dimensions to reduce workload or to visualize data relationships. Rotate data to improve the separation between classes. Also known as the Karhunen-Loève (KL) transform, the Hotelling transform, or Singular Value Decomposition (SVD) – although SVD is strictly a mathematical method for performing PCA.
Principal Component Analysis (PCA): A form of Blind Source Separation – an observation, $X$, can be broken down into a mixing matrix, $A$, and a set of basis functions, $Z$: $X = AZ$. Independence is taken to mean second-order decorrelation. Find a set of orthogonal axes in the data (independence metric = variance). Project the data onto these axes to decorrelate. Independence is forced onto the data through the orthogonality of the axes.
Two dimensional example Where are the principal components? Hint: axes of maximum variation, and orthogonal
Two dimensional example: Gives the best axis to project onto for minimum RMS error. Data becomes sphered or whitened / decorrelated.
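For two dimensions the principal axes can be found in closed form, which makes the idea easy to check by hand. The Python sketch below uses synthetic data (an assumption, not the data on the slide): it rotates the centred points by the angle that maximises variance along the first axis, then divides by the standard deviation to whiten.

```python
import math
import random

def covariance(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Synthetic correlated 2D data (hypothetical example)
random.seed(0)
t = [random.gauss(0, 1) for _ in range(500)]
x = [ti + random.gauss(0, 0.2) for ti in t]
y = [0.5 * ti + random.gauss(0, 0.2) for ti in t]

cxx, cyy, cxy = covariance(x, x), covariance(y, y), covariance(x, y)

# Angle of the axis of maximum variance for a 2x2 covariance matrix
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
c, s = math.cos(theta), math.sin(theta)

# Centre the data, then project onto the two orthogonal principal axes
mx, my = sum(x) / len(x), sum(y) / len(y)
pc1 = [c * (xi - mx) + s * (yi - my) for xi, yi in zip(x, y)]
pc2 = [-s * (xi - mx) + c * (yi - my) for xi, yi in zip(x, y)]

var1, var2 = covariance(pc1, pc1), covariance(pc2, pc2)

# Whitening (sphering): scale each projection to unit variance
w1 = [p / math.sqrt(var1) for p in pc1]
w2 = [p / math.sqrt(var2) for p in pc2]

print(round(math.degrees(theta), 1), round(var1, 3), round(var2, 3))
print(abs(covariance(pc1, pc2)) < 1e-9)  # projections are decorrelated
```

The rotation alone decorrelates the data; the final scaling step is what makes it sphered.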
Singular Value Decomposition (SVD): Decompose the observation $X = AZ$ into $X = USV^T$. $S$ is a diagonal matrix of singular values with elements arranged in descending order of magnitude (the singular spectrum). The columns of $V$ are the eigenvectors of $C = X^T X$ (the orthogonal subspace … $\mathrm{dot}(v_i, v_j) = 0$ for $i \neq j$) … they demix or rotate the data. $U$ is the matrix of projections of $X$ onto the eigenvectors of $C$ … the source estimates.
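The properties listed on this slide can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd` on synthetic data; the data and variable names are assumptions made for illustration. Note that the projections of $X$ onto the eigenvectors are $XV = US$, so $U$ holds those projections normalised by the singular values.

```python
import numpy as np

# Synthetic observations: 200 samples of 3 correlated channels (hypothetical)
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mix
X -= X.mean(axis=0)  # remove the mean of each channel

# Economy-size SVD: X = U S V^T, singular values returned in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)
V = Vt.T

C = X.T @ X
reconstruction_ok = np.allclose(X, U @ S @ Vt)  # X = U S V^T
orthogonal_ok = np.allclose(V.T @ V, np.eye(3))  # dot(v_i, v_j) = 0 for i != j
eigvec_ok = np.allclose(C @ V, V @ S**2)         # columns of V are eigenvectors of C
projection_ok = np.allclose(X @ V, U @ S)        # projections of X onto V equal U S

print(reconstruction_ok, orthogonal_ok, eigvec_ok, projection_ok)
```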
Eigenspectrum of decomposition: $S$ = matrix of singular values … zeros except on the leading diagonal. $S_{ij}$ ($i = j$) are the square roots of the eigenvalues (eigenvalues$^{1/2}$), placed in order of descending magnitude. They correspond to the magnitude of the projected data along each eigenvector. Eigenvectors are the axes of maximal variation in the data. Variance = power (analogous to Fourier components in power spectra) [stem(diag(S).^2)]. Eigenspectrum = plot of eigenvalues.
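The relationship between singular values and eigenvalues can be confirmed directly. This NumPy sketch (synthetic data, hypothetical variable names) computes the quantity that the MATLAB call `stem(diag(S).^2)` on the slide would plot, without doing the plotting.

```python
import numpy as np

# Synthetic data with four channels of very different variance (hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
X -= X.mean(axis=0)

# Singular values only, in descending order
s = np.linalg.svd(X, compute_uv=False)

# Eigenvalues of C = X^T X; eigvalsh returns ascending order, so reverse
eigenvalues = np.linalg.eigvalsh(X.T @ X)[::-1]

spectrum = s**2                        # the eigenspectrum (power per component)
explained = spectrum / spectrum.sum()  # fraction of total variance per component

print(np.allclose(spectrum, eigenvalues))
print(np.round(explained, 4))
```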
SVD: Method for PCA See BSS notes and example at end of presentation
SVD for noise/signal separation: To perform SVD filtering of a signal, use a truncated SVD decomposition (using the first $p$ eigenvectors): $Y = U S_p V^T$. [Reduce the dimensionality of the data by discarding the noise projections, $S_{noise} = 0$, then reconstruct the data with just the signal subspace.] Most of the signal is contained in the first few principal components. Retaining these, discarding the rest, and projecting back into the original observation space effects a noise filtering or a noise/signal separation.
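A minimal sketch of truncated-SVD filtering, assuming NumPy and a synthetic rank-2 signal observed on 8 noisy channels. The waveforms, noise level and the helper name `svd_filter` are illustrative choices, not the ECG data used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 200)

# Two underlying waveforms mixed onto 8 channels, plus white noise (hypothetical)
sources = np.column_stack([np.sin(2 * np.pi * 5 * t), np.cos(2 * np.pi * 3 * t)])
mixing = rng.normal(size=(2, 8))
clean = sources @ mixing
noisy = clean + 0.1 * rng.normal(size=clean.shape)

def svd_filter(X, p):
    """Reconstruct X from its first p singular components: Y = U S_p V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_p = s.copy()
    s_p[p:] = 0.0  # set the noise singular values to zero
    return U @ np.diag(s_p) @ Vt

filtered = svd_filter(noisy, p=2)

err_before = np.linalg.norm(noisy - clean)
err_after = np.linalg.norm(filtered - clean)
print(err_before, err_after)
```

In practice $p$ is usually chosen by inspecting the eigenspectrum for the point where it flattens out; here it is set to 2 because the synthetic signal is known to have two sources.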
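e.g. Imagine a spectral decomposition of the matrix: $X = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} s_1 & 0 \\ 0 & s_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix}^T$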
SVD – Dimensionality reduction: How exactly is dimension reduction performed? A: Set the smallest singular values to zero.
SVD – Dimensionality reduction: … and the resultant matrix is an approximation using only 3 eigenvectors.
Real ECG data example [Figure: original data $X$; eigenspectrum $S^2$; truncated reconstructions $X_p = U S_p V^T$ for $p = 2$ and $p = 4$]
Recap - PCA Second order decorrelation = independence Find a set of orthogonal axes in the data (independence metric = variance) Project data onto these axes to decorrelate Independence is forced onto the data through the orthogonality of axes Conventional noise / signal separation technique Often used as a method of initializing weights for neural network and other learning algorithms (see Wed lectures).
Appendix: Worked example (see lecture notes) http://www.robots.ox.ac.uk/~gari/cdt/IDH/docs/ch14_ICASVDnotes_2009.pdf