
1 Machine Learning: A Brief Introduction
Fu Chang
Institute of Information Science, Academia Sinica
2788-3799 ext. 1819
fchang@iis.sinica.edu.tw

2 Machine Learning as a Tool for Classifying Patterns
What is the difference between you and me?
  - Tentative answer 1: you are pretty, and I am ugly. A vague answer, not very useful.
  - Tentative answer 2: you have a tiny mouth, and I have a big one. A lot more useful, but what if we are viewed from the side?
In general, can we use a single feature difference to distinguish one pattern from another?

3 Old Philosophical Debates
What makes a cup a cup? Philosophical views:
  - Plato: the ideal type
  - Aristotle: the collection of all cups
  - Wittgenstein: family resemblance

4 Machine Learning Viewpoint
  - Represent each object with a set of features: mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc.
  - Each pattern is taken as a conglomeration of sample points, or feature vectors
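A minimal sketch of this representation, with hypothetical feature names and made-up numbers (none of them come from the slides): each object becomes one row of a feature matrix, and a pattern is the set of rows sharing a class label.

```python
# A tiny, hypothetical feature matrix: rows are sample points, columns are
# features (e.g., mouth width, nose length, eye distance); the numbers are
# made up for illustration only.
import numpy as np

X = np.array([
    [2.1, 4.0, 3.2],   # a sample point of pattern A
    [2.3, 3.8, 3.0],   # another sample point of pattern A
    [4.8, 3.9, 3.1],   # a sample point of pattern B
    [5.0, 4.1, 2.9],   # another sample point of pattern B
])
y = np.array(["A", "A", "B", "B"])   # class labels

print(X.shape)   # (4, 3): four sample points in a 3-dimensional feature space
```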

5 Patterns as Conglomerations of Sample Points
[Figure: two types of sample points, labeled A and B]

6 ML Viewpoint (Cont'd)
Training phase:
  - Want to learn pattern differences among conglomerations of labeled samples
  - Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc.
  - Have to estimate the parameters involved in the model
Testing phase:
  - Have to classify test samples at acceptable accuracy rates
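A minimal sketch of the two phases, assuming scikit-learn and its bundled Iris data purely for illustration (the slides name neither a library nor a dataset); the nearest-neighbor model stands in for the prototype-style classifiers listed on the next slide.

```python
# Training phase: estimate the model from labeled samples (fit).
# Testing phase: classify held-out samples and measure accuracy (score).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)   # a simple prototype-style model
clf.fit(X_train, y_train)                   # training phase
print("test accuracy:", clf.score(X_test, y_test))   # testing phase
```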

7 Models
  - Neural networks
  - Support vector machines
  - Classification and regression tree
  - AdaBoost
  - Statistical models
  - Prototype classifiers

8 Neural Networks

9 Back-Propagation Neural Networks
Layers:
  - Input: number of nodes = dimension of the feature vector
  - Output: number of nodes = number of class types
  - Hidden: number of nodes > dimension of the feature vector
Direction of data flow:
  - Training: backward propagation
  - Testing: forward propagation
Training problems:
  - Overfitting
  - Convergence
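A minimal NumPy sketch of training by backward propagation and predicting by forward propagation on a toy XOR problem; the layer sizes, learning rate, and iteration count are illustrative assumptions, not taken from the slides.

```python
# Toy XOR problem: 2 input nodes (feature dimension), 8 hidden nodes
# (more than the input dimension), 1 output node (binary problem).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)       # input -> hidden
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)       # hidden -> output
lr = 1.0

for _ in range(10000):
    # forward propagation (the testing phase uses only these two lines)
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # backward propagation of the squared error (the training phase)
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

print(np.round(Y, 2))   # outputs after training, typically close to 0, 1, 1, 0
```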

10 Illustration

11 Support Vector Machines (SVM)

12 SVM
  - Gives rise to the optimal (maximum-margin) solution to the binary classification problem
  - Finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types
Things to tune:
  - Kernel function: defines the similarity measure between two sample vectors
  - Tolerance for misclassification
  - Parameters associated with the kernel function
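A minimal scikit-learn sketch (the library and dataset are assumptions, not from the slides) showing the three tuning knobs named above: the kernel function, the misclassification tolerance, and a parameter of the kernel.

```python
# The three knobs: kernel (similarity measure), C (misclassification
# tolerance), and gamma (a parameter of the chosen kernel).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # a binary classification problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf",    # kernel function: similarity of two sample vectors
          C=1.0,           # tolerance for misclassification (soft margin)
          gamma="scale")   # parameter associated with the RBF kernel
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```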

13 Illustration

14 Classification and Regression Tree (CART)

15 Illustration
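The slides give only the model's name and an illustration; as a rough sketch of what a classification tree does, here is a minimal scikit-learn example (library, dataset, and depth limit are all assumptions, not from the slides).

```python
# A depth-limited classification tree: each internal node tests one feature
# against a threshold; leaves carry class labels.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(export_text(tree))                    # the learned feature/threshold splits
print("test accuracy:", tree.score(X_test, y_test))
```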

16 AdaBoost

17 AdaBoost can be thought of as a linear combination of the same classifier c(·, ·) with varying weights.
The idea:
  - Iteratively apply the same classifier c to a set of samples
  - At iteration m, the samples erroneously classified at the (m−1)-st iteration are duplicated at a rate γ_m
  - The weight β_m is related to γ_m in a certain way (see the sketch below)
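A minimal sketch of the boosting loop in the discrete AdaBoost form given by Hastie et al. (one of the references on slide 27): up-weighting the misclassified samples plays the role of the duplication rate γ_m, and the per-round weight (alpha below) corresponds to β_m. The dataset and the choice of decision stumps as the repeated classifier c are assumptions.

```python
# Discrete AdaBoost with decision stumps as the repeated classifier c.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 0, -1, 1)                  # labels in {-1, +1}
w = np.full(len(y), 1.0 / len(y))            # start with uniform sample weights
stumps, alphas = [], []

for m in range(20):                          # M = 20 rounds (an arbitrary choice)
    c = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = c.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    alpha = np.log((1 - err) / err)          # this round's weight (beta_m on the slide)
    w *= np.exp(alpha * (pred != y))         # up-weight the misclassified samples
    stumps.append(c)
    alphas.append(alpha)

F = sum(a * c.predict(X) for a, c in zip(alphas, stumps))   # weighted combination
print("training accuracy:", np.mean(np.sign(F) == y))
```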

18 Statistical Models

19 Bayesian Approach
Given:
  - Training samples X = {x_1, x_2, …, x_n}
  - A probability density p(t|Θ), where t is an arbitrary vector (a test sample) and Θ is the set of parameters
  - Θ is taken as a set of random variables

20 Bayesian Approach (Cont'd)
Posterior density (reconstructed below):
  - Different class types give rise to different posteriors
  - Use the posteriors to evaluate the class type of a given test sample t
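The posterior formula on this slide was an image in the original transcript; a standard reconstruction in the slide's notation (an assumption) is the Bayes posterior over Θ together with the predictive density used to score a test sample t:

```latex
% Bayes posterior over the parameters Theta given the training samples X,
% and the predictive density for a test sample t (standard forms).
p(\Theta \mid X) \;=\; \frac{p(X \mid \Theta)\, p(\Theta)}{\int p(X \mid \Theta')\, p(\Theta')\, d\Theta'},
\qquad
p(t \mid X) \;=\; \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta
```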

21 A Bayesian Model with Hidden Variables
  - In addition to the observed data X, there exist some hidden data H
  - H is taken as a set of random variables
  - We want to optimize with both Θ and H unknown
  - An iterative procedure (the EM algorithm) is required to do this
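The iterative procedure the slide refers to is the EM algorithm; in the slide's notation (observed data X, hidden data H, parameters Θ), its two steps are:

```latex
% One EM iteration: average the complete-data log-likelihood over the
% hidden data H, then maximize that average over Theta.
\text{E-step:}\quad Q\bigl(\Theta \mid \Theta^{(k)}\bigr)
  = \mathbb{E}_{H \sim p(H \mid X,\, \Theta^{(k)})}\bigl[\log p(X, H \mid \Theta)\bigr]
\qquad
\text{M-step:}\quad \Theta^{(k+1)} = \arg\max_{\Theta}\, Q\bigl(\Theta \mid \Theta^{(k)}\bigr)
```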

22 Hidden Markov Model (HMM)
  - An HMM is a Bayesian model with hidden variables
  - The observed data consist of sequences of samples
  - The hidden variables are sequences of consecutive states
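A standard statement of the factorization implied by the slide, with an observed sequence x_1, …, x_T and a hidden state sequence s_1, …, s_T (the notation is assumed, not taken from the slides):

```latex
% Joint density of an observed sequence and its hidden state sequence:
% an initial state, a Markov chain over states, and one emission per state.
p(x_1, \dots, x_T,\; s_1, \dots, s_T)
  = p(s_1)\, p(x_1 \mid s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)
```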

23 Boltzmann-Gibbs Distribution
Given:
  - States s_1, s_2, …, s_n
  - A density p(s) = p_s
Maximum entropy principle: without any further information, choose the density p_s that maximizes the entropy subject to the constraints (reconstructed below).
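The entropy and the constraints were shown as images in the original; a standard reconstruction, with feature functions f_i (as on slide 25) and assumed target values F_i for their expectations:

```latex
% Maximum entropy problem: choose the density p_s with the largest entropy
% among all densities that normalize and match the feature constraints.
\max_{p}\; -\sum_{s} p_s \log p_s
\quad \text{subject to} \quad
\sum_{s} p_s = 1, \qquad
\sum_{s} p_s f_i(s) = F_i \quad (i = 1, \dots, k)
```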

24 Boltzmann-Gibbs (Cont'd)
Consider the Lagrangian L of the constrained problem (reconstructed below). Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density functions, where Z is the normalizing factor.
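A reconstruction of the missing formulas, consistent with the constrained problem above (the multiplier names are assumptions):

```latex
% Lagrangian for the constrained problem above; lambda_0 enforces
% normalization and the lambda_i enforce the feature constraints.
L = -\sum_{s} p_s \log p_s
    + \lambda_0 \Bigl(\sum_{s} p_s - 1\Bigr)
    + \sum_{i} \lambda_i \Bigl(\sum_{s} p_s f_i(s) - F_i\Bigr)

% Setting dL/dp_s = 0 for every s yields the Boltzmann-Gibbs density,
% with Z the normalizing factor.
p_s = \frac{1}{Z} \exp\Bigl(\sum_{i} \lambda_i f_i(s)\Bigr),
\qquad
Z = \sum_{s} \exp\Bigl(\sum_{i} \lambda_i f_i(s)\Bigr)
```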

25 Boltzmann-Gibbs (Cont'd)
Maximum entropy (ME):
  - Use the Boltzmann-Gibbs distribution as the prior distribution
  - Compute the posterior for the given observed data and features f_i
  - Use the optimal posterior to classify

26 Boltzmann-Gibbs (Cont'd)
Maximum entropy Markov model (MEMM):
  - The posterior consists of transition probability densities p(s | s′, X)
Conditional random field (CRF):
  - The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
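A sketch of the two posteriors in standard form (not copied from the slides): the MEMM chains locally normalized transition densities, while a linear-chain CRF normalizes once over the whole state sequence.

```latex
% MEMM: a product of locally normalized transition densities
% (s_0 is a designated start state).
p(s_1, \dots, s_T \mid X) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, X)

% Linear-chain CRF: one global normalization Z(X) over all state sequences,
% with feature functions f_i of adjacent states and the observations.
p(s_1, \dots, s_T \mid X)
  = \frac{1}{Z(X)} \exp\Bigl(\sum_{t=1}^{T} \sum_{i} \lambda_i\, f_i(s_{t-1}, s_t, X, t)\Bigr)
```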

27 References
  - R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.
  - T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
  - P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.

