8. Machine Learning, Support Vector Machines


8. Machine Learning, Support Vector Machines. Artificial Intelligence in Medicine, HCA 590 (Topics in Health Sciences). Rohit Kate. Many of the slides have been adapted from Ray Mooney’s Machine Learning course at UT Austin.

Reading Chapter 3, Computational Intelligence in Biomedical Engineering by Rezaul Begg, Daniel T.H. Lai, Marimuthu Palaniswami, CRC Press 2007.  

What is Learning? Herbert Simon: “Learning is any process by which a system improves performance from experience.” What is the task? Classification Problem solving / planning / control

Classification Assign object/event to one of a given finite set of categories. Medical diagnosis Radiology images Credit card applications or transactions Fraud detection in e-commerce Worm detection in network packets Spam filtering in email Recommended articles in a newspaper Recommended books, movies, music, or jokes Financial investments DNA sequences Spoken words Handwritten letters Astronomical images

Problem Solving / Planning / Control Performing actions in an environment in order to achieve a goal. Solving calculus problems Playing checkers, chess, or backgammon Balancing a pole Driving a car or a jeep Flying a plane, helicopter, or rocket Controlling an elevator Controlling a character in a video game Controlling a mobile robot

Measuring Performance Classification Accuracy Solution correctness Solution quality (length, efficiency) Speed of performance

Why Study Machine Learning? Engineering Better Computing Systems Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (knowledge engineering bottleneck). Develop systems that can automatically adapt and customize themselves to individual users. Personalized news or mail filter Personalized tutoring Discover new knowledge from large databases (data mining). Market basket analysis (e.g. diapers and beer) Medical text mining (e.g. drugs and their adverse effects)

Why Study Machine Learning? Cognitive Science Computational studies of learning may help us understand learning in humans and other biological organisms. Hebbian neural learning: “Neurons that fire together, wire together.” Humans’ relative difficulty of learning disjunctive concepts vs. conjunctive ones. Power law of practice: log(perf. time) decreases roughly linearly with log(# training trials).

Why Study Machine Learning? The Time is Ripe Many basic effective and efficient algorithms available. Large amounts of on-line data available. Large amounts of computational resources available.

Related Disciplines Artificial Intelligence Data Mining Probability and Statistics Information theory Numerical optimization Computational complexity theory Control theory (adaptive) Psychology (developmental, cognitive) Neurobiology Linguistics Philosophy

Defining the Learning Task Improve on task, T, with respect to performance metric, P, based on experience, E. T: Recognizing hand-written words P: Percentage of words correctly classified E: Database of human-labeled images of handwritten words T: Categorize email messages as spam or legitimate. P: Percentage of email messages correctly classified. E: Database of emails, some with human-given labels

Designing a Learning System Choose the training experience Training examples Features Choose exactly what is to be learned, i.e. the target function. Choose how to represent the target function. Choose a learning algorithm to infer the target function from the experience.

An Example of Learning Task Task: Predict the class of Iris plant (Iris Setosa, Iris Versicolor, Iris Virginica) from the dimensions of its sepals and petals http://archive.ics.uci.edu/ml/datasets/Iris Features: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm Manually (expert) label some examples which become the training examples

An Example of Learning Task: Training Examples
The columns are the features, the cell entries are the feature values, and the Class column holds the expert-given labels.

Example   Sepal length   Sepal width   Petal length   Petal width   Class
1.        5.1            3.5           1.4            0.2           Setosa
2.        4.9            3.0           1.4            0.2           Setosa
3.        7.0            3.2           4.7            1.4           Versicolor
4.        6.3            3.3           6.0            2.5           Virginica
5.        5.8            2.7           5.1            1.9           Virginica
…

An Example of Learning Task: Test Examples
The training examples above carry expert labels; the labels of the test examples are unknown and must be predicted.

Example   Sepal length   Sepal width   Petal length   Petal width   Class
1.        6.4            2.7           5.3            1.9           ??
2.        6.9            3.2           5.7            2.3           ??
3.        5.0            3.3           1.4            0.2           ??
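
For concreteness, here is a minimal sketch (assuming scikit-learn, which ships a copy of the UCI Iris data cited above) of obtaining labeled training examples and held-out test examples; the split ratio, random seed, and variable names are illustrative choices, not part of the original slides.

```python
# Minimal sketch: load the Iris data and hold out a test set.
# The 80/20 split and the random seed are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target          # features: sepal/petal length and width (cm)

# Training examples keep their expert labels; test labels are hidden from the learner.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)     # e.g. (120, 4) (30, 4)
```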

An Example of Learning Task How do we relate the features to the class (the target)? Decide on a representation for the target function, f(SL, SW, PL, PW) → {Setosa, Versicolor, Virginica}. For example, with two linear scores
x = w0 + w1*SL + w2*SW + w3*PL + w4*PW
y = v0 + v1*SL + v2*SW + v3*PL + v4*PW
define f(sl, sw, pl, pw) = Setosa if x < 0; Versicolor if x >= 0 and y < 0; Virginica if x >= 0 and y >= 0. Find the values of the parameters w0, w1, w2, w3, w4, v0, v1, v2, v3, v4 that fit the training data using a machine learning method. If the test examples come from the same distribution as the training examples (i.e. are similar), then the learned function should predict their classes with good accuracy.
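
The representation above translates directly into code. A small sketch follows; the weight values w0..w4 and v0..v4 are made-up placeholders for illustration, not parameters learned from the training data.

```python
# Sketch of the target-function representation described above.
# The weights w and v are hypothetical placeholder values, not learned ones.
w = [0.5, -0.2, -0.3, 0.9, 0.8]   # w0..w4 for the score x
v = [-1.0, -0.1, -0.2, 0.4, 0.9]  # v0..v4 for the score y

def predict(sl, sw, pl, pw):
    x = w[0] + w[1]*sl + w[2]*sw + w[3]*pl + w[4]*pw
    y = v[0] + v[1]*sl + v[2]*sw + v[3]*pl + v[4]*pw
    if x < 0:
        return "Setosa"
    elif y < 0:
        return "Versicolor"
    else:
        return "Virginica"

print(predict(5.1, 3.5, 1.4, 0.2))   # output depends entirely on the placeholder weights
```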

Classification and Regression Most learning tasks fall under two categories Classification: The value to be predicted is a nominal value, for example, class of the plant, positive or negative diagnosis Regression: The value to be predicted is a numerical value, for example, stock prices, energy expenditure Most machine learning methods have both classification and regression versions

Feature Engineering Besides the machine learning method employed, performance depends largely on the features used. Coming up with the best features is a skill, called feature engineering. If the relevant features are not used, the machine learning method will never be able to learn to predict the correct class. Extraneous features may confuse machine learning methods, although most methods have some robustness to them. Feature selection methods exist that automatically search the space of possible features and select the best ones (see the sketch below).
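
As one hedged example of a feature selection method, the sketch below uses scikit-learn's univariate SelectKBest on the Iris features; the scoring function and the value of k are illustrative choices.

```python
# Sketch: univariate feature selection on the Iris features.
# SelectKBest with f_classif is one of many possible feature-selection methods;
# k=2 is an arbitrary illustrative choice.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print(selector.scores_)                     # per-feature relevance scores
print(selector.get_support(indices=True))   # indices of the k selected features
```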

Lessons Learned about Learning Learning can be viewed as using experience to approximate a chosen target function. Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data. Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.

Various Function Representations Numerical functions Linear regression Neural networks Support vector machines Symbolic functions Decision trees Rules in propositional logic Rules in first-order predicate logic Instance-based functions Nearest-neighbor Case-based Probabilistic Graphical Models Naïve Bayes Bayesian networks Hidden Markov Models (HMMs) Probabilistic Context Free Grammars (PCFGs) Markov networks

Various Search Algorithms Gradient descent Perceptron Backpropagation Dynamic Programming HMM Learning PCFG Learning Divide and Conquer Decision tree induction Rule learning Evolutionary Computation Genetic Algorithms (GAs) Genetic Programming (GP) Neuro-evolution

Evaluation of Learning Systems Experimental Conduct controlled cross-validation experiments to compare various methods on a variety of benchmark datasets. Gather data on their performance, e.g. test accuracy, training-time, testing-time. Analyze differences for statistical significance. Theoretical Analyze algorithms mathematically and prove theorems about their: Computational complexity (how fast the algorithm runs) Ability to fit training data Sample complexity (number of training examples needed to learn an accurate function)
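
A minimal sketch of such a controlled cross-validation experiment, assuming scikit-learn; the two methods compared, the benchmark dataset, and the number of folds are illustrative choices.

```python
# Sketch: cross-validation experiment comparing two methods on one benchmark dataset.
# The models and the 5 folds are illustrative choices; a full study would also
# measure training/testing time and test differences for statistical significance.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for name, model in [("SVM", SVC()), ("Decision tree", DecisionTreeClassifier())]:
    scores = cross_val_score(model, X, y, cv=5)   # test accuracy per fold
    print(name, scores.mean(), scores.std())
```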

History of Machine Learning 1950s: Samuel’s checker player Selfridge’s Pandemonium 1960s: Neural networks: Perceptron Pattern recognition Learning in the limit theory Minsky and Papert prove limitations of Perceptron 1970s: Symbolic concept induction Winston’s arch learner Expert systems and the knowledge acquisition bottleneck Quinlan’s ID3 Michalski’s AQ and soybean diagnosis Scientific discovery with BACON Mathematical discovery with AM

History of Machine Learning (cont.) 1980s: Advanced decision tree and rule learning Explanation-based Learning (EBL) Learning and planning and problem solving Utility problem Analogy Cognitive architectures Resurgence of neural networks (connectionism, backpropagation) Valiant’s PAC Learning Theory Focus on experimental methodology 1990s: Data mining Adaptive software agents and web applications Text learning Reinforcement learning (RL) Inductive Logic Programming (ILP) Ensembles: Bagging, Boosting, and Stacking Bayes Net learning

History of Machine Learning (cont.) 2000s: Support vector machines Kernel methods Graphical models Statistical relational learning Transfer learning Sequence labeling Collective classification and structured outputs Computer Systems Applications Compilers Debugging Graphics Security (intrusion, virus, and worm detection) Email management Personalized assistants that learn Learning in robotics and vision

Support Vector Machine (SVM)

Linear Separators Binary classification can be viewed as the task of separating classes in feature space. The separating hyperplane is w^T x + b = 0, with w^T x + b > 0 on one side and w^T x + b < 0 on the other, so the classifier is f(x) = sign(w^T x + b).

Linear Separators Which of the linear separators is optimal?

Classification Margin The distance from an example x_i to the separator is r = y_i(w^T x_i + b) / ||w||. Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the distance between the support vectors of the two classes.
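
The distance and margin formulas can be checked numerically; in the sketch below the weight vector, bias, and example are made-up values for illustration only, not a trained classifier.

```python
# Sketch: distance of an example to the hyperplane and the resulting margin.
# w, b, and the labeled example are hypothetical numbers, not a trained SVM.
import numpy as np

w = np.array([2.0, 1.0])              # hyperplane normal
b = -1.0                              # bias
x_i, y_i = np.array([1.0, 1.0]), +1   # one labeled example

r = y_i * (w @ x_i + b) / np.linalg.norm(w)   # signed distance to the separator
margin = 2.0 / np.linalg.norm(w)              # margin when support vectors satisfy y(w^T x + b) = 1
print(r, margin)
```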

Maximum Margin Classification Maximizing the margin is good according to intuition and PAC theory. Implies that only support vectors matter; other training examples are ignorable.

Linear SVMs Mathematically Formulate the optimization problem: Find w and b such that the margin ρ = 2/||w|| is maximized and for all (x_i, y_i), i = 1..n: y_i(w^T x_i + b) ≥ 1. This can be reformulated as: Find w and b such that Φ(w) = ||w||² = w^T w is minimized and for all (x_i, y_i), i = 1..n: y_i(w^T x_i + b) ≥ 1.

Solving the Optimization Problem We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist. The solution involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every inequality constraint of the primal (original) problem. Primal: Find w and b such that Φ(w) = w^T w is minimized and for all (x_i, y_i), i = 1..n: y_i(w^T x_i + b) ≥ 1. Dual: Find α_1…α_n such that Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized subject to (1) Σ_i α_i y_i = 0 and (2) α_i ≥ 0 for all i.
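
In practice the dual quadratic program is usually solved by a library. A hedged sketch using scikit-learn's SVC, which solves the dual internally and exposes y_i α_i for the support vectors as dual_coef_, follows; the toy data and the large C value (used to approximate a hard margin) are illustrative.

```python
# Sketch: fit a linear SVM and inspect the support vectors and dual coefficients.
# scikit-learn's SVC solves the dual QP internally; dual_coef_ holds y_i * alpha_i
# for the support vectors. The toy points below are illustrative.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [4.0, 1.0], [5.0, 2.0], [6.0, 1.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin

print(clf.support_vectors_)        # the x_i with non-zero alpha_i
print(clf.dual_coef_)              # y_i * alpha_i for each support vector
print(clf.coef_, clf.intercept_)   # recovered w and b
```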

Soft Margin Classification What if the training set is not linearly separable? Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

Soft Margin Classification Mathematically The old formulation: Find w and b such that Φ(w) = w^T w is minimized and for all (x_i, y_i), i = 1..n: y_i(w^T x_i + b) ≥ 1. The modified formulation incorporates slack variables: Find w and b such that Φ(w) = w^T w + C Σ_i ξ_i is minimized and for all (x_i, y_i), i = 1..n: y_i(w^T x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0. The parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.
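
A small sketch of the role of C, assuming scikit-learn; the synthetic dataset and the particular C values are illustrative choices.

```python
# Sketch: C trades off margin size against the slack (training-error) penalty.
# Small C tolerates more slack (wider margin); large C fits the training data harder.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):   # illustrative values
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))   # support-vector counts and training accuracy
```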

Linear SVMs: Overview The classifier is a separating hyperplane. The most “important” training points are the support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrange multipliers α_i. Both in the dual formulation of the problem and in the solution, training points appear only inside inner products. Dual: Find α_1…α_n such that Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized subject to (1) Σ_i α_i y_i = 0 and (2) 0 ≤ α_i ≤ C for all i. Solution: f(x) = Σ_i α_i y_i x_i^T x + b.

Non-linear SVMs Datasets that are linearly separable (perhaps with some noise) work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. mapping each 1-D point x to the 2-D point (x, x²)?

Non-linear SVMs: Feature spaces General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

The “Kernel Trick” The linear classifier relies on inner products between vectors: K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j). A kernel function is a function that is equivalent to an inner product in some feature space. Example: for 2-dimensional vectors x = [x_1 x_2], let K(x_i, x_j) = (1 + x_i^T x_j)². We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j) for some mapping φ:
K(x_i, x_j) = (1 + x_i^T x_j)² = 1 + x_i1²x_j1² + 2 x_i1x_j1x_i2x_j2 + x_i2²x_j2² + 2 x_i1x_j1 + 2 x_i2x_j2
= [1  x_i1²  √2 x_i1x_i2  x_i2²  √2 x_i1  √2 x_i2]^T [1  x_j1²  √2 x_j1x_j2  x_j2²  √2 x_j1  √2 x_j2]
= φ(x_i)^T φ(x_j), where φ(x) = [1  x_1²  √2 x_1x_2  x_2²  √2 x_1  √2 x_2].
Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
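
The algebra above can be verified numerically. The sketch below (NumPy assumed) compares the kernel value computed in the original 2-D space with the inner product of the explicit 6-D feature maps; the two test vectors are arbitrary.

```python
# Sketch: verify that K(xi, xj) = (1 + xi^T xj)^2 equals phi(xi)^T phi(xj)
# for the explicit feature map phi derived above. The two vectors are arbitrary.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([0.7, -1.3])
xj = np.array([2.0, 0.5])

k_implicit = (1.0 + xi @ xj) ** 2   # kernel computed in the original 2-D space
k_explicit = phi(xi) @ phi(xj)      # inner product in the 6-D feature space
print(k_implicit, k_explicit)       # the two values agree (up to rounding)
```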

Examples of Kernel Functions Linear: K(x_i, x_j) = x_i^T x_j. Mapping Φ: x → φ(x), where φ(x) is x itself. Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p. Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions (d is the original dimensionality). Gaussian (radial-basis function): K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)). Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of functions for the support vectors is the separator. The higher-dimensional space still has intrinsic dimensionality d, but linear separators in it correspond to non-linear separators in the original space.

Non-linear SVMs Mathematically Dual problem formulation: Find α_1…α_n such that Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) is maximized subject to (1) Σ_i α_i y_i = 0 and (2) α_i ≥ 0 for all i. The solution is f(x) = Σ_i α_i y_i K(x_i, x) + b. The optimization techniques for finding the α_i’s remain the same!
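
This can be made concrete by rebuilding f(x) from a fitted kernel SVM. The sketch below assumes scikit-learn, whose dual_coef_ stores α_i y_i for the support vectors; the RBF kernel, the gamma value, and the toy dataset are illustrative choices.

```python
# Sketch: rebuild f(x) = sum_i alpha_i y_i K(x_i, x) + b from a fitted kernel SVM
# and compare it with the library's own decision_function. gamma and the toy data
# are illustrative choices.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([[0.0, 0.5]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)   # K(x_i, x) for each support vector
f_manual = clf.dual_coef_ @ K + clf.intercept_             # sum_i (alpha_i y_i) K(x_i, x) + b
print(f_manual.ravel()[0], clf.decision_function(x_new)[0])  # the two values agree
```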

SVM applications SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data. SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data. SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc. The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the α_i’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99]. Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner (see the sketch below).
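
Because kernel and parameter choice is try-and-see, it is commonly automated with a cross-validated grid search. A hedged scikit-learn sketch follows; the grid of kernels and parameter values is arbitrary and purely illustrative.

```python
# Sketch: "try-and-see" kernel/parameter selection automated with cross-validation.
# The grid of kernels, C, and gamma values below is an arbitrary illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```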