Machine Learning: A Brief Introduction
Fu Chang
Institute of Information Science, Academia Sinica
Machine Learning as a Tool for Classifying Patterns
What is the difference between you and me?
- Tentative answer 1: you are pretty, and I am ugly. A vague answer, and not very useful.
- Tentative answer 2: you have a tiny mouth, and I have a big one. Much more useful, but what if we are viewed from the side?
- In general, can we use a single feature difference to distinguish one pattern from another?
Old Philosophical Debates
What makes a cup a cup? Philosophical views:
- Plato: the ideal type
- Aristotle: the collection of all cups
- Wittgenstein: family resemblance
Machine Learning Viewpoint
Represent each object with a set of features: mouth, nose, eyes, etc., viewed from the front, the right side, the left side, and so on.
Each pattern is taken as a conglomeration of sample points, or feature vectors.
[Figure] Patterns as conglomerations of sample points: two types of sample points, A and B.
ML Viewpoint (Cont'd)
Training phase:
- We want to learn pattern differences among conglomerations of labeled samples.
- We have to describe the differences by means of a model: a probability distribution, a prototype, a neural network, etc.
- We have to estimate the parameters involved in the model.
Testing phase:
- We have to classify unseen samples at acceptable accuracy rates.
A minimal sketch of these two phases follows.
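As a purely illustrative sketch of the training and testing phases, assuming scikit-learn is available; the synthetic data set and nearest-neighbor classifier are placeholder choices, not methods prescribed by these slides.

# Minimal training/testing sketch (illustrative assumptions only)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled samples: feature vectors X with class labels y
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier()      # the "model" whose parameters are learned
clf.fit(X_train, y_train)         # training phase: estimate the model from labeled samples
print(clf.score(X_test, y_test))  # testing phase: accuracy on held-out samples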
Models
- Neural networks
- Support vector machines
- Classification and regression tree
- AdaBoost
- Statistical models
- Prototype classifiers
Neural Networks
Back-Propagation Neural Networks
Layers:
- Input: number of nodes = dimension of the feature vector
- Output: number of nodes = number of class types
- Hidden: number of nodes typically larger than the dimension of the feature vector
Direction of data flow:
- Training: backward propagation of errors
- Testing: forward propagation
Training problems: overfitting, convergence
A NumPy sketch of forward and backward propagation follows the illustration below.
[Illustration: back-propagation neural network]
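A compact sketch of forward and backward propagation for a one-hidden-layer network, written in plain NumPy; the toy data, layer sizes, and learning rate are assumptions for illustration only.

# One-hidden-layer network trained by back-propagation (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # 100 samples, 2 input features
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # a toy binary target

W1 = rng.normal(scale=0.5, size=(2, 8))              # hidden layer wider than the input
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(2000):
    # Forward propagation (used in both training and testing)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward propagation of the squared-error gradient (training only)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(0)

print("training accuracy:", ((out > 0.5) == (y > 0.5)).mean())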
Support Vector Machines (SVM)
SVM
Finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types, yielding an optimal solution to the binary classification problem in the maximum-margin sense.
Things to tune:
- Kernel function: defines the similarity measure between two sample vectors
- Tolerance for misclassification
- Parameters associated with the kernel function
A scikit-learn sketch follows the illustration below.
[Illustration: SVM separating hyperplane and margin]
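A hedged example of training an SVM with scikit-learn (assumed available); the RBF kernel and the particular C and gamma values are illustrative choices that would normally be tuned.

# SVM training sketch (illustrative parameter choices)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf",   # kernel function: similarity measure between sample vectors
          C=1.0,          # tolerance for misclassification (smaller C is more tolerant)
          gamma="scale")  # parameter associated with the RBF kernel
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))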
Classification and Regression Tree (CART)
[Illustration: a classification tree]
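A brief illustrative sketch using scikit-learn's CART-style decision tree; the data set and depth limit are arbitrary assumptions.

# CART-style classification tree sketch
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # the learned sequence of feature-threshold splits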
AdaBoost
Can be thought of as a linear combination of the same base classifier c(·, ·) applied with varying weights.
The idea:
- Iteratively apply the same classifier c to a set of samples.
- At iteration m, the samples erroneously classified at the (m-1)-st iteration are duplicated (reweighted) at a rate γ_m.
- The combination weight β_m is related to γ_m in a certain way.
A small sketch of this reweighting scheme follows.
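A sketch of one common variant (discrete AdaBoost with decision stumps), assuming scikit-learn and NumPy; the weight-update formula shown is the standard exponential-loss one and may differ in detail from the γ_m / β_m notation used above.

# Discrete AdaBoost with decision stumps (illustrative sketch)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, random_state=0)
y = 2 * y01 - 1                      # labels in {-1, +1}
w = np.full(len(X), 1.0 / len(X))    # start with uniform sample weights
stumps, betas = [], []

for m in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum() / w.sum()               # weighted error of this round
    beta = 0.5 * np.log((1 - err) / (err + 1e-12))   # combination weight
    w *= np.exp(-beta * y * pred)                    # misclassified samples get heavier
    w /= w.sum()
    stumps.append(stump); betas.append(beta)

# Final classifier: sign of the beta-weighted vote of the stumps
F = sum(b * s.predict(X) for b, s in zip(stumps, betas))
print("training accuracy:", (np.sign(F) == y).mean())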
Statistical Models
Bayesian Approach
Given:
- Training samples X = {x_1, x_2, …, x_n}
- Probability density p(t | Θ), where t is an arbitrary vector (a test sample) and Θ is the set of parameters
Θ is taken as a set of random variables.
Bayesian Approach (Cont'd)
Posterior density: different class types give rise to different posteriors.
Use the posteriors to evaluate the class type of a given test sample t, as in the standard form sketched below.
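The formulas on this slide did not survive text extraction; the following is a standard statement of the quantities referred to, offered as a reconstruction rather than the author's exact notation.

\[
p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{\int p(X \mid \Theta')\, p(\Theta')\, d\Theta'},
\qquad
p(t \mid X) = \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta .
\]

A test sample t is then assigned to the class whose posterior predictive density is largest.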
A Bayesian Model with Hidden Variables
In addition to the observed data X, there exist some hidden data H.
H is taken as a set of random variables.
We want to optimize with both Θ and H unknown.
An iterative procedure (the EM algorithm) is required to do this; its two steps are sketched below.
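A compact statement of the EM iteration, given here as the standard formulation rather than the author's own notation.

\[
\text{E-step: } Q(\Theta \mid \Theta^{(k)}) = \mathbb{E}_{H \mid X, \Theta^{(k)}}\big[\log p(X, H \mid \Theta)\big],
\qquad
\text{M-step: } \Theta^{(k+1)} = \arg\max_{\Theta}\, Q(\Theta \mid \Theta^{(k)}).
\]

Each iteration does not decrease the likelihood p(X | Θ).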
Hidden Markov Model (HMM)
An HMM is a Bayesian model with hidden variables.
The observed data consist of sequences of samples.
The hidden variables are sequences of consecutive states.
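For concreteness, the factorization an HMM assumes over an observation sequence x_1, …, x_T and state sequence s_1, …, s_T (a standard form, not quoted from the slide):

\[
p(x_1, \dots, x_T, s_1, \dots, s_T) = p(s_1)\, p(x_1 \mid s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t).
\]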
Boltzmann-Gibbs Distribution
Given:
- States s_1, s_2, …, s_n
- Density p(s) = p_s
Maximum entropy principle: without any other information, one chooses the density p_s that maximizes the entropy subject to the constraints.
Boltzmann-Gibbs (Cont'd)
Consider the Lagrangian L of the entropy and the constraints.
Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density function, where Z is the normalizing factor.
A reconstruction of the derivation is given below.
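The equations on this slide were lost in extraction; the following reconstruction assumes the constraints are normalization and matching the empirical expectations F_i of the features f_i.

\[
L = -\sum_s p_s \log p_s + \lambda_0 \Big( \sum_s p_s - 1 \Big) + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - F_i \Big)
\]
\[
\frac{\partial L}{\partial p_s} = 0
\;\Longrightarrow\;
p_s = \frac{1}{Z} \exp\!\Big( \sum_i \lambda_i f_i(s) \Big),
\qquad
Z = \sum_s \exp\!\Big( \sum_i \lambda_i f_i(s) \Big).
\]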
Boltzmann-Gibbs (Cont'd)
Maximum entropy (ME):
- Use Boltzmann-Gibbs as the prior distribution
- Compute the posterior for the given observed data and features f_i
- Use the optimal posterior to classify
Boltzmann-Gibbs (Cont'd)
Maximum entropy Markov model (MEMM): the posterior consists of transition probability densities p(s | s′, X).
Conditional random field (CRF): the posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X).
A standard form of the linear-chain CRF posterior is sketched below.
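For reference, the usual linear-chain CRF posterior, again of Boltzmann-Gibbs form (a standard formulation, not taken from the slides):

\[
p(s_1, \dots, s_T \mid X) = \frac{1}{Z(X)} \exp\!\Big( \sum_{t=1}^{T} \sum_i \lambda_i\, f_i(s_{t-1}, s_t, X, t) \Big),
\]

with Z(X) normalizing over all state sequences.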
References
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag.
P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.