Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819

Machine Learning as a Tool for Classifying Patterns
What is the difference between you and me?
Tentative answer 1: you are pretty, and I am ugly. A vague answer, and not very useful.
Tentative answer 2: you have a tiny mouth, and I have a big one. A lot more useful, but what if we are viewed from the side?
In general, can we use a single feature difference to distinguish one pattern from another?

Old Philosophical Debates
What makes a cup a cup? Philosophical views:
Plato: the ideal type
Aristotle: the collection of all cups
Wittgenstein: family resemblance

Machine Learning Viewpoint
Represent each object with a set of features: mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc.
Each pattern is taken as a conglomeration of sample points, or feature vectors (see the sketch below).
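As a minimal sketch of this representation (the feature names and numbers below are invented for illustration; NumPy is assumed), each object becomes a fixed-length feature vector, and a pattern is a collection of such vectors:

import numpy as np

# Hypothetical measurements per object: [mouth_width, nose_length, eye_distance].
# Each row is one sample point; each pattern is a conglomeration of rows.
pattern_A = np.array([[4.8, 5.1, 6.2],
                      [5.0, 4.9, 6.0]])
pattern_B = np.array([[3.1, 5.0, 6.1],
                      [2.9, 5.2, 6.3]])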

[Figure] Patterns as conglomerations of sample points: two types of sample points, labeled A and B.

ML Viewpoint (Cont'd)
Training phase:
Want to learn pattern differences among conglomerations of labeled samples
Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc.
Have to estimate the parameters involved in the model
Testing phase:
Have to classify test samples at acceptable accuracy rates (see the sketch below)
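As a hedged sketch of the two phases, using scikit-learn purely as an assumed convenience (the slides name no particular toolkit), a model is fitted to labeled training samples and then scored on held-out test samples:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy labeled data: feature vectors X and class labels y (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(2.0, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

# Training phase: estimate the model's parameters from labeled samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Testing phase: classify unseen samples and check the accuracy rate.
print("test accuracy:", model.score(X_test, y_test))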

Models
Neural networks
Support vector machines
Classification and regression trees (CART)
AdaBoost
Statistical models
Prototype classifiers

Neural Networks

Back-Propagation Neural Networks
Layers:
Input: number of nodes = dimension of the feature vector
Output: number of nodes = number of class types
Hidden: number of nodes > dimension of the feature vector
Direction of data flow (see the sketch below):
Training: backward propagation of errors
Testing: forward propagation
Training problems: overfitting, convergence
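A minimal NumPy sketch of one such network, assuming a single hidden layer, sigmoid activations, and a squared-error loss (illustrative choices, not taken from the slides): testing uses the forward pass only, while training also runs the backward pass to update the weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, 2 one-hot class labels (XOR-style, made up for illustration).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (2, 8))   # input -> hidden (8 hidden nodes > 2 input nodes)
W2 = rng.normal(0, 0.5, (8, 2))   # hidden -> output (2 class types)

lr = 0.5
for epoch in range(5000):
    # Forward propagation (this is all that testing needs).
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)

    # Backward propagation of the squared-error gradient.
    dO = (O - Y) * O * (1 - O)
    dH = (dO @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dO
    W1 -= lr * X.T @ dH

print("predicted classes after training:", O.argmax(axis=1))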

Illustration

Support Vector Machines (SVM)

SVM
Gives rise to an optimal solution to the binary classification problem
Finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types
Things to tune (see the sketch below):
Kernel function: defines the similarity measure between two sample vectors
Tolerance for misclassification
Parameters associated with the kernel function
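For instance, with scikit-learn (an assumed library, not named in the slides), those three knobs correspond to the kernel choice, the misclassification tolerance C, and kernel parameters such as gamma:

import numpy as np
from sklearn.svm import SVC

# Toy two-class data, invented for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# kernel: the similarity measure between two sample vectors (Gaussian RBF here);
# C: tolerance for misclassification (smaller C allows more training errors);
# gamma: the parameter of the RBF kernel controlling its width.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print("training accuracy:", clf.score(X, y))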

Illustration

Classification and Regression Tree (CART)

Illustration

AdaBoost

AdaBoost can be thought of as a linear combination of the same base classifier c(·, ·) applied with varying weights.
The idea:
Iteratively apply the same classifier c to a set of samples
At iteration m, the samples erroneously classified at the (m-1)-st iteration are duplicated at a rate γ_m
The weight β_m is related to γ_m in a certain way
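In the standard (discrete) AdaBoost formulation — one common way to make the relationship between γ_m and β_m concrete, though the slide does not spell it out — the combined classifier and weights are

F(x) = \operatorname{sign}\!\left( \sum_{m=1}^{M} \beta_m \, c_m(x) \right),
\qquad
\beta_m = \tfrac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m},

where \varepsilon_m is the weighted error rate at iteration m (playing the role of the slide's γ_m), and misclassified samples are up-weighted by a factor of e^{2\beta_m} relative to correctly classified ones before the next iteration (the "duplication rate").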

Statistical Models

Bayesian Approach
Given:
Training samples X = {x_1, x_2, …, x_n}
Probability density p(t|Θ), where t is an arbitrary vector (a test sample) and Θ is the set of parameters
Θ is taken as a set of random variables

Bayesian Approach (Cont'd)
Posterior density: different class types give rise to different posteriors
Use the posteriors to evaluate the class type of a given test sample t
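The posterior referred to here is presumably the standard Bayes form; writing it out (a reconstruction, since the transcript omits the slide's formula):

p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{\int p(X \mid \Theta')\, p(\Theta')\, d\Theta'},
\qquad
p(t \mid X) = \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta,

with a separate posterior computed from each class's training samples; the test sample t is then assigned to the class whose predictive density is largest.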

A Bayesian Model with Hidden Variables
In addition to the observed data X, there exist some hidden data H
H is taken as a set of random variables
We want to optimize with both Θ and H unknown
An iterative procedure (the EM algorithm) is required to do this
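A sketch of the EM iteration in its usual form (not spelled out on the slide): each round alternates an expectation over the hidden variables with a maximization over the parameters:

\text{E-step:}\quad Q(\Theta \mid \Theta^{(k)}) = \mathbb{E}_{H \sim p(H \mid X, \Theta^{(k)})}\big[ \log p(X, H \mid \Theta) \big]

\text{M-step:}\quad \Theta^{(k+1)} = \arg\max_{\Theta} \; Q(\Theta \mid \Theta^{(k)})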

Hidden Markov Model (HMM)
HMM is a Bayesian model with hidden variables
The observed data consist of sequences of samples
The hidden variables are sequences of consecutive states
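Concretely, for an observation sequence x_1, …, x_T with hidden state sequence s_1, …, s_T, the joint density factorizes in the standard HMM way (a reconstruction of the usual formula, which the transcript does not show):

p(x_1, \dots, x_T, s_1, \dots, s_T) = p(s_1)\, p(x_1 \mid s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)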

Boltzmann-Gibbs Distribution
Given:
States s_1, s_2, …, s_n
Density p(s) = p_s
Maximum entropy principle: without any further information, one chooses the density p_s that maximizes the entropy subject to the constraints (written out below)
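The entropy and constraints, which the transcript omits, presumably take the standard maximum-entropy form: normalization plus matching the expectations of a set of feature functions f_i to their observed values F_i:

\max_{p} \; H(p) = -\sum_{s} p_s \log p_s
\quad \text{subject to} \quad
\sum_{s} p_s = 1, \qquad \sum_{s} p_s f_i(s) = F_i \;\; (i = 1, \dots, k).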

Boltzmann-Gibbs (Cont'd)
Consider the Lagrangian L of this constrained problem
Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density functions, where Z is the normalizing factor
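Written out under the same assumptions as above (a reconstruction of the omitted formulas), the Lagrangian and the resulting density are:

L = -\sum_{s} p_s \log p_s + \lambda_0 \Big( \sum_{s} p_s - 1 \Big) + \sum_{i} \lambda_i \Big( \sum_{s} p_s f_i(s) - F_i \Big)

\frac{\partial L}{\partial p_s} = 0
\;\;\Longrightarrow\;\;
p_s = \frac{1}{Z} \exp\!\Big( \sum_{i} \lambda_i f_i(s) \Big),
\qquad
Z = \sum_{s} \exp\!\Big( \sum_{i} \lambda_i f_i(s) \Big).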

Boltzmann-Gibbs (Cont'd)
Maximum entropy (ME):
Use the Boltzmann-Gibbs density as the prior distribution
Compute the posterior for the given observed data and features f_i
Use the optimal posterior to classify
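In classification terms, this posterior is usually written as a conditional Boltzmann-Gibbs (log-linear) model over labels y given an observation x (a standard form, hedged because the slide does not show it):

p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{i} \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{i} \lambda_i f_i(x, y') \Big).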

Boltzmann-Gibbs (Cont'd)
Maximum entropy Markov model (MEMM): the posterior consists of transition probability densities p(s | s′, X)
Conditional random field (CRF): the posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
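In their usual sequence-labeling forms (a standard rendering, assumed rather than copied from the slide), the MEMM factorizes the posterior into locally normalized transition terms, while the CRF normalizes globally over whole state sequences:

\text{MEMM:}\quad p(s_1, \dots, s_T \mid X) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, X)

\text{CRF:}\quad p(s_1, \dots, s_T \mid X) = \frac{1}{Z(X)} \exp\!\Big( \sum_{t=1}^{T} \sum_{i} \lambda_i f_i(s_{t-1}, s_t, X, t) \Big)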

References
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag.
P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.