Applying Statistical Machine Learning to Retinal Electrophysiology
Matt Boardman
January 2006
Faculty of Computer Science

Discussions
Axotomy ERG Data Sets
Classification using Support Vector Machines (SVM)
Assessing Waveform Significance
Probability Density Estimation
Confidence Measures

Axotomy ERG Data Sets (from F. Tremblay, Retinal Electrophysiology)
Data Set A:
  19 axotomy subjects, 19 control subjects (total 38)
  time between control & axotomy? (unknown)
  Multifocal ERG: 145 data points (mean of all locations)
  1000 Hz (?) sample rate
Data Set B:
  6 axotomy subjects, 8 control subjects (total 14)
  measurements approximately six weeks after axotomy
  Multifocal ERG: 14,935 data points (103 locations x 145 ms)
  Corneal and Optic Nerve readings (control subjects only)
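For concreteness, a minimal NumPy sketch (mine, not from the talk) of how the two data sets can be laid out as sample-by-feature matrices for the classifiers that follow; the arrays and labels are placeholders rather than the real recordings:

    import numpy as np

    # Data Set A: 38 subjects x 145 time points (mean of all mfERG locations).
    # In practice X_a would be loaded from the recordings; a zero array stands in here.
    X_a = np.zeros((38, 145))
    y_a = np.array([1] * 19 + [-1] * 19)      # +1 = axotomy, -1 = control

    # Data Set B: 14 subjects x (103 locations x 145 ms) = 14,935 features per subject.
    X_b_raw = np.zeros((14, 103, 145))        # hypothetical per-location traces
    X_b = X_b_raw.reshape(14, -1)             # flatten to a 14 x 14,935 matrix
    y_b = np.array([1] * 6 + [-1] * 8)        # 6 axotomy, 8 control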

Classification using Support Vector Machines
SVMs use statistical machine learning
Training is a constrained optimization problem:
  Objective: find a hyperplane which maximizes the margin
  Higher-dimensional mappings provide flexibility
  Non-separable data: a “cost” parameter controls the tradeoff between outlier detection and generalization performance
Non-linear SVM (Polynomial, Sigmoid, Gaussian kernels)
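The talk gives no code; as an illustrative sketch only (the references slide points to LIBSVM and SVMlight, so this is not the original implementation), a soft-margin Gaussian-kernel SVM of this kind can be trained with scikit-learn. The C and gamma values below are placeholders:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy stand-in for an ERG feature matrix (samples x features) and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 145))
    y = np.array([1] * 19 + [-1] * 19)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Soft-margin SVM with a Gaussian (RBF) kernel: minimize 1/2 ||w||^2 + C * sum(slack),
    # subject to each sample lying on the correct side of the margin up to its slack.
    # C (the "cost" parameter) and gamma (kernel width) are placeholder values here.
    clf = SVC(kernel="rbf", C=1.0, gamma=0.01)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))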

Data Normalization
Balanced training data
  Number of positive samples = number of negative samples
  Data set A is already balanced
  Keep data set B balanced through combinations, i.e. C(8,6) = 28 control subsets
Independently and identically distributed (iid) data
  Independence does not hold: e.g. the value of point x17 most likely depends on x16
  Not identically distributed: e.g. x26 is always positive (P1 wave), but x40 is always negative (N2 wave)
Approximate iid data by subtracting the mean from each dimension, then dividing each dimension by its maximum magnitude
  Results in zero mean for all dimensions, with all values between -1 and +1
  No zero-setting necessary: e.g. subtracting the mean tail value does not affect classification accuracy
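A minimal sketch of this normalization (NumPy; the array shape is a placeholder): subtract each dimension's mean, then divide each dimension by its maximum magnitude.

    import numpy as np

    def normalize(X):
        """Zero-mean each dimension, then scale each dimension to [-1, +1]."""
        X = X - X.mean(axis=0)                # subtract the mean of each dimension
        max_mag = np.abs(X).max(axis=0)       # maximum magnitude per dimension
        max_mag[max_mag == 0] = 1.0           # guard against constant dimensions
        return X / max_mag

    X_norm = normalize(np.random.default_rng(0).normal(size=(14, 14935)))   # toy input

Keeping data set B balanced over the C(8,6) = 28 control subsets could be done by enumerating them with itertools.combinations(range(8), 6).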

Parameter Selection for Classification
Selection of the best gamma (γ) and cost (C) values obtained by exhaustive search of log_e-space
  try all parameter values on the grid, choose the best points (red circles)
  accuracy-weighted centre of mass gives the optimal point (green circle)
Training / Testing:
  75% / 25% split
  “Leave one out”
Better searches:
  “3 strikes”
  Simulated annealing (?)
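A sketch of the exhaustive log-space search described above; the grid ranges, cross-validation scheme, and the tolerance used to pick the “best points” are assumptions, not the values from the original experiments:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 145))                 # toy stand-in for normalized ERG data
    y = np.array([1] * 19 + [-1] * 19)

    # Exhaustive search over a natural-log-spaced (gamma, cost) grid.
    gammas = np.exp(np.arange(-10.0, 3.0))         # assumed range, not the original one
    costs = np.exp(np.arange(-2.0, 8.0))
    results = []
    for g in gammas:
        for c in costs:
            acc = cross_val_score(SVC(kernel="rbf", gamma=g, C=c), X, y, cv=4).mean()
            results.append((g, c, acc))
    g_arr, c_arr, acc_arr = map(np.array, zip(*results))

    # "Best points": within a small tolerance of the top accuracy (tolerance is an assumption).
    best = acc_arr >= acc_arr.max() - 0.02
    w = acc_arr[best]
    gamma_opt = np.exp(np.average(np.log(g_arr[best]), weights=w))   # accuracy-weighted
    cost_opt = np.exp(np.average(np.log(c_arr[best]), weights=w))    # centre of mass in log-space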

Classification Results
Data set A (38 samples x 145 data points): 94.7%
Data set B (14 samples x 145 data points): 99.4%
Data set B (14 samples x 14,935 data points): 90.8%

Classification Benchmarks
How does this method perform on industry-standard classification benchmark data sets?
Wisconsin Breast Cancer Database
  O.L. Mangasarian, W.H. Wolberg, “Cancer diagnosis via linear programming,” SIAM News, 23(5):1-18, 1990.
Iris Plants Database
  R.A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, 7(2):179-88, 1936.
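For reference, scikit-learn ships related versions of both benchmarks (its breast-cancer set is the Wisconsin Diagnostic variant rather than the original database cited above), so the same kind of classifier can be exercised on them, e.g.:

    from sklearn.datasets import load_breast_cancer, load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    for name, data in [("Wisconsin (diagnostic)", load_breast_cancer()),
                       ("Iris, class 1 or not", load_iris())]:
        X, y = data.data, data.target
        if name.startswith("Iris"):
            y = (y == 0).astype(int)               # one-vs-rest, as in the slides
        acc = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"), X, y, cv=5)
        print(name, round(acc.mean(), 3))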

Classification Benchmarks
Wisconsin: 96.9%, σ = 0.18
Iris (Class 1 or not): 100.0%
Iris (Class 2 or not): 96.9%, σ = 0.55
Iris (Class 3 or not): 97.1%, σ = 0.77

Assessing Waveform Significance
Which are the most important parts of the waveform, with respect to classification accuracy?
Fisher Ratio: distance between means over sum of variances (linear)
Pearson Correlation Coefficients: strength of association between variables (linear)
Kolmogorov-Smirnov: distance between cumulative distributions (non-linear)
Linear SVM: classification on one dimension only (linear)
Cross-Entropy: mutual information measure (non-linear)
SVM Sensitivity: Monte Carlo simulation using the SVM (non-linear)
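As a sketch of the two simplest per-dimension measures in this list, here are the Fisher ratio and Pearson correlation computed over a sample-by-feature matrix X with labels y in {+1, -1} (my formulation of “distance between means over sum of variance”, not code from the talk):

    import numpy as np

    def fisher_ratio(X, y):
        """Squared distance between class means over the summed class variances, per dimension."""
        a, b = X[y == 1], X[y == -1]
        return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))

    def pearson(X, y):
        """Pearson correlation between each dimension and the class label."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        num = (Xc * yc[:, None]).sum(axis=0)
        den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        return num / den

    # Dimensions with the largest scores are the most discriminative under each measure.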

Comparison of All Measures (Dataset B)

Probability Density Estimation
Goal: define a measure to show how “sure” the classifier is of its result
Density estimation is known to be a “hard” problem
  Generally needs a large number of samples for accuracy
  Small deviations in sample points have a magnified effect
How do we estimate a probability distribution?
  Best-Fit Gaussian: assume a Gaussian distribution, find the sigmoid that fits best
  Kernel Smoothing: part of MATLAB’s Statistics Toolbox
  SVM Density Estimation (RSDE method): a special case of SVM regression
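A sketch of the kernel-smoothing option using SciPy (the talk used MATLAB's Statistics Toolbox and an RSDE implementation; this is only a rough analogue, and the input is a toy stand-in):

    import numpy as np
    from scipy.stats import gaussian_kde

    # One-dimensional example: estimate a density from a small sample
    # (e.g. the SVM decision values of the training subjects).
    samples = np.random.default_rng(0).normal(size=38)       # toy stand-in
    kde = gaussian_kde(samples)                               # Gaussian kernel smoothing
    grid = np.linspace(samples.min() - 1, samples.max() + 1, 200)
    p_x = kde(grid)                                           # estimated p(x) on the grid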

Comparison of Estimation Techniques

Confidence Measures
“Support” is the overall distribution of the sample
  Denote p(x)
  Density: ∫ p(x) dx = 1
“Confidence” is defined as the posterior probability
  Probability that sample x is of class C
  Denote p(C|x)
Can we combine these measures somehow?
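One way to combine them, shown below as an illustrative sketch rather than the scheme actually used in the talk, is Bayes' rule: estimate class-conditional densities, take the posterior p(C|x) as the confidence, and keep the mixture density p(x) as the support, so that confident-looking predictions in regions of negligible support can be flagged.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    scores_c = rng.normal(+1.0, 0.7, size=19)       # toy decision values for class C
    scores_not = rng.normal(-1.0, 0.7, size=19)     # toy decision values for not-C

    p_given_c = gaussian_kde(scores_c)              # p(x | C)
    p_given_not = gaussian_kde(scores_not)          # p(x | not C)
    prior_c = 0.5                                   # balanced classes

    def confidence_and_support(x):
        support = prior_c * p_given_c(x) + (1 - prior_c) * p_given_not(x)   # p(x)
        confidence = prior_c * p_given_c(x) / support                       # p(C | x)
        return confidence, support

    conf, supp = confidence_and_support(np.array([0.3]))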

Confidence Measures

References
SVM tutorial (mathematical but practical):
  C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, 2(2):121-67, 1998.
SVM density estimation (RSDE algorithm):
  M. Girolami, C. He, “Probability Density Estimation from Optimally Condensed Data Samples,” IEEE Trans. Pattern Analysis and Machine Intelligence, 25(10), 2003.
MATLAB versions:
  LIBSVM
  SVMlight
An excellent online SVM demo (Java applet)

Data Representation
We can represent the input data in many ways:
  Unprocessed vector (145 dimensions as is)
  Second-order information (first time derivative)
  Third-order information (second time derivative)
  Frequency information (Power Spectral Density)
  Wavelet transforms (Daubechies, Symlet)
Result: only small differences in accuracy!
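A sketch of how these representations can be derived from a single 145-point trace; the PyWavelets names "db4"/"sym4" are assumed stand-ins for the Daubechies and Symlet families, and the 1000 Hz rate is taken from the data sets slide:

    import numpy as np
    import pywt
    from scipy.signal import welch

    x = np.random.default_rng(0).normal(size=145)        # toy stand-in for one 145-point trace
    fs = 1000.0                                           # sample rate in Hz, per the data sets slide

    d1 = np.gradient(x) * fs                              # first time derivative
    d2 = np.gradient(d1) * fs                             # second time derivative
    freqs, psd = welch(x, fs=fs, nperseg=64)              # power spectral density
    daub = np.concatenate(pywt.wavedec(x, "db4"))         # Daubechies wavelet coefficients
    sym = np.concatenate(pywt.wavedec(x, "sym4"))         # Symlet wavelet coefficients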

Data Representation
Example: wavelet representations
i.e. some indications, but nothing statistically significant (±5%)

Cross Entropy

SVM Sensitivity Analysis

SVM Sensitivity Analysis (Windowed)

Comparison of Estimation Techniques