Overview of different methods – Supervised Learning


Overview of different methods – Supervised Learning. (Diagram placing this lecture's topic among the family of supervised learning methods, "and many more"; the marker "You are here!" points to the current topic, artificial neural networks.)

Some more basics: Threshold Logic Unit (TLU)
(Diagram: inputs u1, …, un with weights w1, …, wn feed a summation node with threshold θ, producing the output v.)
Activation: a = Σi=1..n wi ui
Output: v = 1 if a ≥ θ, v = 0 if a < θ
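A minimal sketch of this forward pass in Python (NumPy assumed; the function name tlu_output and the example values are illustrative, not from the slides):

import numpy as np

def tlu_output(w, u, theta):
    """Threshold Logic Unit: output 1 iff the weighted input sum reaches the threshold."""
    a = np.dot(w, u)              # activation a = sum_i w_i * u_i
    return 1 if a >= theta else 0

# Example: two inputs with unit weights and threshold 1.5 implement logical AND
print(tlu_output(np.array([1.0, 1.0]), np.array([1.0, 1.0]), 1.5))  # -> 1
print(tlu_output(np.array([1.0, 1.0]), np.array([1.0, 0.0]), 1.5))  # -> 0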

Activation Functions: threshold, linear, piece-wise linear, and sigmoid (plots of the output v as a function of the activation a).

Decision Surface of a TLU: in the (u1, u2) plane the decision line w1 u1 + w2 u2 = θ separates the inputs; points with activation > θ are classified as 1, points with activation < θ as 0.

Scalar Products & Projections: the scalar product w • u = |w| |u| cos φ, where φ is the angle between the two vectors; it is positive, zero, or negative according to whether φ is smaller than, equal to, or larger than 90°.

Geometric Interpretation: the relation w • u = θ implicitly defines the decision line w1 u1 + w2 u2 = θ. The weight vector w is perpendicular to this line, and every point u on the line has the same projection uw onto w, with length |uw| = θ/|w|. Points on the side of the line in the direction of w are classified v = 1, points on the other side v = 0.

Geometric Interpretation: in n dimensions the relation w • u = θ defines an (n−1)-dimensional hyper-plane, which is perpendicular to the weight vector w. On one side of the hyper-plane (w • u > θ) all patterns are classified by the TLU as "1", while those that get classified as "0" lie on the other side. If the patterns cannot be separated by a hyper-plane, then they cannot be correctly classified with a TLU.

Linear Separability: logical AND is linearly separable, e.g. with w1 = 1, w2 = 1, θ = 1.5, whereas logical XOR is not: no choice of w1, w2, θ gives a single line that separates its two classes in the (u1, u2) plane. (Truth tables and decision-plane plots for AND and XOR.)

Threshold as Weight: treat the threshold as an additional weight wn+1 = θ connected to a constant input un+1 = −1. Then the activation is a = Σi=1..n+1 wi ui, and the output is v = 1 if a ≥ 0, v = 0 if a < 0.
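A small sketch of this "bias trick" in Python (NumPy assumed; the helper name augment is illustrative): once the input is extended with a constant −1 component, the comparison against θ becomes a comparison against 0.

import numpy as np

def augment(u):
    """Append the constant input u_{n+1} = -1 so the threshold can live inside the weight vector."""
    return np.append(u, -1.0)

def tlu_output_aug(w_aug, u):
    """TLU with the threshold stored as the last weight w_{n+1} = theta."""
    a = np.dot(w_aug, augment(u))   # a = sum_{i=1..n+1} w_i u_i
    return 1 if a >= 0 else 0

# Equivalent to weights [1, 1] with threshold 1.5 (logical AND):
w_aug = np.array([1.0, 1.0, 1.5])
print(tlu_output_aug(w_aug, np.array([1.0, 1.0])))  # -> 1
print(tlu_output_aug(w_aug, np.array([1.0, 0.0])))  # -> 0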

Geometric Interpretation: with the threshold absorbed into the weights, the relation w • u = 0 defines the decision line in the augmented input space; the weight vector w is perpendicular to it, with v = 1 on one side and v = 0 on the other.

Training ANNs
Training set S of examples {u, vt}, where u is an input vector and vt the desired target output. Example (logical AND): S = {((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)}.
Iterative process: present a training example u, compute the network output v, compare the output v with the target vt, and adjust the weights and thresholds.
Learning rule: specifies how to change the weights w and thresholds θ of the network as a function of the inputs u, output v and target vt.

Adjusting the Weight Vector: if the target is vt = 1 but the output is v = 0 (the angle between w and u is > 90°), move w towards u: w' = w + μu. If the target is vt = 0 but the output is v = 1 (angle < 90°), move w away from u: w' = w − μu.

Perceptron Learning Rule
w' = w + μ (vt − v) u, or in components w'i = wi + Δwi = wi + μ (vt − v) ui for i = 1..n+1, with wn+1 = θ and un+1 = −1.
The parameter μ is called the learning rate; it determines the magnitude of the weight updates Δwi. If the output is correct (vt = v) the weights are not changed (Δwi = 0). If the output is incorrect (vt ≠ v) the weights wi are changed such that the new weight vector w' moves towards (or away from) the input u, bringing the output closer to the target.

Perceptron Training Algorithm
Repeat
  for each training vector pair (u, vt)
    evaluate the output v when u is the input
    if v ≠ vt then
      form a new weight vector w' according to w' = w + μ (vt − v) u
    else
      do nothing
    end if
  end for
Until v = vt for all training vector pairs
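A runnable sketch of this training loop in Python (NumPy assumed; the function name train_perceptron, the learning-rate value and the epoch cap are illustrative choices, not taken from the slides):

import numpy as np

def train_perceptron(inputs, targets, mu=0.1, max_epochs=100):
    """Perceptron training with the threshold folded in as the last weight (input -1 appended)."""
    X = np.hstack([inputs, -np.ones((len(inputs), 1))])   # augment each u with u_{n+1} = -1
    w = np.zeros(X.shape[1])                               # initial weights (and threshold)
    for _ in range(max_epochs):
        errors = 0
        for u, vt in zip(X, targets):
            v = 1 if np.dot(w, u) >= 0 else 0              # TLU output
            if v != vt:
                w += mu * (vt - v) * u                     # perceptron learning rule
                errors += 1
        if errors == 0:                                    # all pairs classified correctly
            break
    return w

# Logical AND training set from the earlier slide
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1])
print(train_perceptron(inputs, targets))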

Perceptron Learning Rule – worked example (plot): starting from the weight vector w = [0.25, −0.1, 0.5], training pairs such as (u, vt) = ([2,1], −1), ([−1,−1], 1) and ([1,1], 1) are presented in turn. Whenever the output v = sgn(w • u) disagrees with the target vt (e.g. v = sgn(0.45 − 0.6 + 0.3) = 1 for target −1), the weights are updated and the decision line (e.g. u2 = 0.2 u1 − 0.5) shifts, until all points with vt = 1 and vt = −1 lie on the correct sides.

Perceptron Convergence Theorem: the algorithm converges to the correct classification if the training data are linearly separable and μ is sufficiently small. If two classes of vectors {u1} and {u2} are linearly separable, the perceptron training algorithm will eventually produce a weight vector w0 that defines a TLU whose decision hyper-plane separates {u1} and {u2} (Rosenblatt 1962). The solution w0 is not unique, since if w0 • u = 0 defines a hyper-plane, so does w'0 = k w0 for any k > 0.

Multiple TLUs: for handwritten alphabetic character recognition there are 26 classes (A, B, C, …, Z), so one TLU is used per class: the first TLU distinguishes "A"s from "non-A"s, the second "B"s from "non-B"s, and so on. The weight wji connects input ui with output vj, and each weight is trained with w'ji = wji + μ (vtj − vj) ui. Essentially this makes the output and the target vectors, too.
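A brief sketch of this per-output update as a matrix operation in Python (NumPy assumed; the shapes, function name and toy usage are illustrative): with a weight matrix W whose entry W[j, i] connects input ui to output vj, all 26 TLUs can be updated at once.

import numpy as np

def multi_tlu_step(W, u, vt, mu=0.1):
    """One training step for a layer of TLUs; u is the (augmented) input vector,
    vt the target output vector (one entry per class)."""
    v = (W @ u >= 0).astype(float)        # output of every TLU at once
    W += mu * np.outer(vt - v, u)         # w'_ji = w_ji + mu * (vt_j - v_j) * u_i
    return W, v

# 26 output units, e.g. for characters A..Z, on an n-dimensional (augmented) input
n = 10
W = np.zeros((26, n))
u = np.random.rand(n)
vt = np.zeros(26); vt[0] = 1.0            # target: the pattern is an "A"
W, v = multi_tlu_step(W, u, vt)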

Generalized Perceptron Learning Rule: if we do not include the threshold as an input, we can use the following description of the perceptron with symmetrical outputs (this does not matter much, though): v = +1 if w • u ≥ θ, and v = −1 otherwise. The learning rule then becomes w' = w + μ (vt − v) u together with θ' = θ − μ (vt − v). This implies that the threshold argument changes by Δ(w • u − θ) = μ (vt − v)(u • u + 1). Hence, if vt = 1 and v = −1, the weight change increases the term w • u − θ, and vice versa. This is exactly what is needed to compensate for the error!

Linear Unit – no Threshold!
(Diagram: inputs u1, …, un with weights w1, …, wn feed a summation node.)
The output equals the activation: v = a = Σi=1..n wi ui.
Let us abbreviate the target output (vectors) by t in the next slides.

Gradient Descent Learning Rule: consider a linear unit without threshold and with continuous output v (not just −1, 1): v = w0 + w1 u1 + … + wn un. Train the wi's such that they minimize the squared error E[w1,…,wn] = ½ Σd∈D (td − vd)², where D is the set of training examples and td the target output for example d.
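A one-line sketch of this error measure in Python (NumPy assumed; the function name and the example weights are illustrative):

import numpy as np

def squared_error(w, examples):
    """E[w] = 1/2 * sum over training examples of (t_d - v_d)^2 for the linear unit v = w . u."""
    return 0.5 * sum((t - np.dot(w, u)) ** 2 for u, t in examples)

# With the training set D of the next slide, w = [0, 0] gives E = 2
D = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
print(squared_error(np.zeros(2), D))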

Gradient Descent
Example training set: D = {⟨(1,1),1⟩, ⟨(−1,−1),1⟩, ⟨(1,−1),−1⟩, ⟨(−1,1),−1⟩}.
Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]; the update is Δw = −μ ∇E[w], moving the weights from (w1, w2) to (w1 + Δw1, w2 + Δw2). (Contour plot of E over the weight space (w1, w2) with one negative-gradient step.)
Component-wise: Δwi = −μ ∂E/∂wi, with
∂E/∂wi = ∂/∂wi ½ Σd (td − vd)² = ∂/∂wi ½ Σd (td − Σi wi ui,d)² = Σd (td − vd)(−ui,d).

Gradient Descent
Gradient-Descent(training_examples, μ)
Each training example is a pair of the form ⟨(u1,…,un), t⟩, where (u1,…,un) is the vector of input values and t is the target output value; μ is the learning rate (e.g. 0.1).
  Initialize each wi to some small random value
  Until the termination condition is met, Do
    Initialize each Δwi to zero
    For each ⟨(u1,…,un), t⟩ in training_examples Do
      Input the instance (u1,…,un) to the linear unit and compute the output v
      For each linear unit weight wi Do
        Δwi = Δwi + μ (t − v) ui
    For each linear unit weight wi Do
      wi = wi + Δwi
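A minimal Python sketch of this batch procedure for the linear unit (NumPy assumed; the function name, the random initialization scale and the fixed number of epochs as termination condition are illustrative assumptions):

import numpy as np

def gradient_descent(training_examples, mu=0.1, epochs=50, rng=np.random.default_rng(0)):
    """Batch gradient descent for a linear unit v = w . u (no threshold)."""
    n = len(training_examples[0][0])
    w = rng.normal(scale=0.01, size=n)          # small random initial weights
    for _ in range(epochs):                     # termination condition: fixed number of epochs
        dw = np.zeros(n)                        # accumulate Delta w_i over the whole batch
        for u, t in training_examples:
            v = np.dot(w, u)                    # linear unit output
            dw += mu * (t - v) * np.asarray(u)  # Delta w_i += mu * (t - v) * u_i
        w += dw                                 # apply the accumulated update once per epoch
    return w

# Training set from the earlier slide
D = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
print(gradient_descent([(np.array(u, dtype=float), t) for u, t in D]))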

Incremental Stochastic Gradient Descent
Batch mode: gradient descent w = w − μ ∇ED[w] over the entire data set D, with ED[w] = ½ Σd (td − vd)².
Incremental (stochastic) mode: gradient descent w = w − μ ∇Ed[w] over individual training examples d, with Ed[w] = ½ (td − vd)².
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if μ is small enough. This is the δ-rule.
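For contrast with the batch sketch above, a hedged sketch of the incremental (stochastic) variant, which updates the weights after every single example (names and defaults again illustrative):

import numpy as np

def stochastic_gradient_descent(training_examples, mu=0.05, epochs=50):
    """Incremental (delta-rule) training of a linear unit: update after each example."""
    n = len(training_examples[0][0])
    w = np.zeros(n)
    for _ in range(epochs):
        for u, t in training_examples:
            v = np.dot(w, u)                     # current output for this example
            w += mu * (t - v) * np.asarray(u)    # immediate update: w = w - mu * grad E_d[w]
    return w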

Perceptron vs. Gradient Descent Rule
Perceptron rule w'i = wi + μ (tp − vp) ui,p: derived from manipulation of the decision surface.
Gradient descent rule: derived from minimization of the error function E[w1,…,wn] = ½ Σp (tp − vp)² by means of gradient descent.

Perceptron vs. Gradient Descent Rule
The perceptron learning rule is guaranteed to succeed if the training examples are linearly separable and the learning rate μ is sufficiently small.
Linear unit training rules using gradient descent are guaranteed to converge to the hypothesis with minimum squared error, given a sufficiently small learning rate μ, even when the training data contain noise and even when the training data are not separable by a hyperplane.

Presentation of Training Examples: presenting all training examples once to the ANN is called an epoch. In incremental stochastic gradient descent the training examples can be presented in fixed order (1, 2, 3, …, M), in randomly permuted order (5, 2, 7, …, 3), or completely at random with replacement (4, 1, 7, 1, 5, 4, …).
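A small sketch of the randomly permuted presentation order in Python (NumPy assumed; the generator name is illustrative): each epoch visits every example exactly once, in a fresh random order.

import numpy as np

def epochs_in_random_order(examples, n_epochs, rng=np.random.default_rng(0)):
    """Yield (epoch, example) pairs, reshuffling the presentation order every epoch."""
    for epoch in range(n_epochs):
        for idx in rng.permutation(len(examples)):
            yield epoch, examples[idx]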

Neuron with Sigmoid Function
(Diagram: inputs x1, …, xn with weights w1, …, wn feed a summation node.)
Activation: a = Σi=1..n wi xi
Output: y = σ(a) = 1/(1 + e^(−a))

Sigmoid Unit
(Diagram: inputs u1, …, un plus the constant input x0 = −1, with weights w0, w1, …, wn.)
Activation: a = Σi=0..n wi ui; output: v = σ(a) = 1/(1 + e^(−a)).
σ(x) is the sigmoid function 1/(1 + e^(−x)), with derivative dσ(x)/dx = σ(x)(1 − σ(x)).
Derive gradient descent rules to train: a single sigmoid unit, with ∂E/∂wi = −Σp (tp − vp) vp (1 − vp) ui,p, and multilayer networks of sigmoid units (backpropagation).

Gradient Descent Rule for Sigmoid Output Function
Ep[w1,…,wn] = ½ (tp − vp)²
∂Ep/∂wi = ∂/∂wi ½ (tp − vp)² = ∂/∂wi ½ (tp − σ(Σi wi ui,p))² = (tp − vp) σ'(Σi wi ui,p) (−ui,p)
For v = σ(a) = 1/(1 + e^(−a)): σ'(a) = e^(−a)/(1 + e^(−a))² = σ(a)(1 − σ(a)).
Hence w'i = wi + Δwi = wi + μ vp(1 − vp)(tp − vp) ui,p.
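A short Python sketch of this update for a single sigmoid unit (NumPy assumed; function and variable names, learning rate and the toy input are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_delta_step(w, u, t, mu=0.5):
    """One incremental gradient step for a sigmoid unit on pattern (u, t)."""
    v = sigmoid(np.dot(w, u))            # forward pass: v = sigma(sum_i w_i u_i)
    grad_factor = v * (1.0 - v)          # sigma'(a) = sigma(a) (1 - sigma(a))
    w += mu * grad_factor * (t - v) * u  # w'_i = w_i + mu * v(1-v)(t - v) u_i
    return w, v

# Toy usage: push the unit's output toward t = 1 for a fixed input
w = np.zeros(3)
u = np.array([0.5, -0.3, 1.0])
for _ in range(100):
    w, v = sigmoid_delta_step(w, u, 1.0)
print(v)   # v approaches 1 after repeated updates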

Gradient Descent Learning Rule (annotated): Δwji = μ · vj,p(1 − vj,p) · (tj,p − vj,p) · ui,p, where μ is the learning rate, vj,p(1 − vj,p) is the derivative of the activation function, (tj,p − vj,p) is the error δj of the post-synaptic neuron, and ui,p is the activation of the pre-synaptic neuron; wji connects input ui with output vj.

ALVINN: automated driving at 70 mph on a public highway. The network takes a 30x32 pixel camera image as input, has 4 hidden units (30x32 weights into each hidden unit), and 30 output units for steering.