UNIT V: LEARNING.

Slides:



Advertisements
Similar presentations
Artificial Neural Networks
Advertisements

Reinforcement Learning
Learning from Observations Chapter 18 Section 1 – 3.
Slides from: Doug Gray, David Poole
Learning in Neural and Belief Networks - Feed Forward Neural Network 2001 년 3 월 28 일 안순길.
Ch. Eick: More on Machine Learning & Neural Networks Different Forms of Learning: –Learning agent receives feedback with respect to its actions (e.g. using.
Artificial Intelligence
© Franz Kurfess Learning 1 CSC 480: Artificial Intelligence Dr. Franz J. Kurfess Computer Science Department Cal Poly.
Artificial Intelligence 13. Multi-Layer ANNs Course V231 Department of Computing Imperial College © Simon Colton.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
For Wednesday Read chapter 19, sections 1-3 No homework.
Data Mining Classification: Alternative Techniques
Ai in game programming it university of copenhagen Reinforcement Learning [Outro] Marco Loog.
Supervised Learning Recap
Tuomas Sandholm Carnegie Mellon University Computer Science Department
Kostas Kontogiannis E&CE
Statistical Learning Methods Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 20 (20.1, 20.2, 20.3, 20.4) Fall 2005.
Ai in game programming it university of copenhagen Statistical Learning Methods Marco Loog.
Machine Learning Neural Networks
Machine Learning CPSC 315 – Programming Studio Spring 2009 Project 2, Lecture 5.
1 Chapter 10 Introduction to Machine Learning. 2 Chapter 10 Contents (1) l Training l Rote Learning l Concept Learning l Hypotheses l General to Specific.
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Fall 2005.
LEARNING FROM OBSERVATIONS Yılmaz KILIÇASLAN. Definition Learning takes place as the agent observes its interactions with the world and its own decision-making.
Learning From Observations
Neural Networks Marco Loog.
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Fall 2004.
Three kinds of learning
LEARNING FROM OBSERVATIONS Yılmaz KILIÇASLAN. Definition Learning takes place as the agent observes its interactions with the world and its own decision-making.
LEARNING DECISION TREES
Learning: Introduction and Overview
Pattern Recognition. Introduction. Definitions.. Recognition process. Recognition process relates input signal to the stored concepts about the object.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.
Radial Basis Function Networks
CS Reinforcement Learning1 Reinforcement Learning Variation on Supervised Learning Exact target outputs are not given Some variation of reward is.
More Machine Learning Linear Regression Squared Error L1 and L2 Regularization Gradient Descent.
INTRODUCTION TO MACHINE LEARNING. $1,000,000 Machine Learning  Learn models from data  Three main types of learning :  Supervised learning  Unsupervised.
Inductive learning Simplest form: learn a function from examples
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
Chapter 9 Neural Network.
Neural Networks AI – Week 23 Sub-symbolic AI Multi-Layer Neural Networks Lee McCluskey, room 3/10
LEARNING DECISION TREES Yılmaz KILIÇASLAN. Definition - I Decision tree induction is one of the simplest, and yet most successful forms of learning algorithm.
Learning from observations
Learning from Observations Chapter 18 Through
CHAPTER 18 SECTION 1 – 3 Learning from Observations.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
1 Chapter 10 Introduction to Machine Learning. 2 Chapter 10 Contents (1) l Training l Rote Learning l Concept Learning l Hypotheses l General to Specific.
Decision Trees Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no.
Data Mining and Decision Support
Chapter 18 Section 1 – 3 Learning from Observations.
Learning From Observations Inductive Learning Decision Trees Ensembles.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
Anifuddin Azis LEARNING. Why is learning important? So far we have assumed we know how the world works Rules of queens puzzle Rules of chess Knowledge.
Statistical Learning Methods
Machine Learning Supervised Learning Classification and Regression
Learning from Observations
Learning from Observations
Deep Feedforward Networks
Introduce to machine learning
Learning in Neural Networks
Presented By S.Yamuna AP/CSE
Data Mining Lecture 11.
Announcements Homework 3 due today (grace period through Friday)
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Learning from Observations
Learning from Observations
Memory-Based Learning Instance-Based Learning K-Nearest Neighbor
Presentation transcript:

UNIT V: LEARNING

LEARNING Learning from Observation Inductive Learning Decision Trees Explanation based Learning Statistical Learning methods Reinforcement Learning

Learning An agent tries to improve its behavior through observation, reasoning, or reflection learning from experience memorization of past percepts, states, and actions generalizations, identification of similar experiences forecasting prediction of changes in the environment theories generation of complex models based on observations and reasoning © 2000-2012 Franz Kurfess Learning

Learning from Observation Learning Agents Inductive Learning Learning Decision Trees © 2000-2012 Franz Kurfess Learning

Learning Agents based on previous agent designs, such as reflexive, model-based, goal-based agents those aspects of agents are encapsulated into the performance element of a learning agent a learning agent has an additional learning element usually used in combination with a critic and a problem generator for better learning most agents learn from examples inductive learning © 2000-2012 Franz Kurfess Learning

Learning Agent Model Agent Environment Sensors Effectors Critic Performance Standard Critic Feedback Changes Learning Element Performance Element Knowledge Learning Goals Problem Generator Agent Effectors Environment © 2000-2012 Franz Kurfess Learning

Components Learning Agent learning element performance element critic problem generator © 2000-2012 Franz Kurfess Learning

Learning Element responsible for making improvements uses knowledge about the agent and feedback on its actions to improve performance © 2000-2012 Franz Kurfess Learning

Performance Element selects external actions collects percepts, decides on actions incorporated most aspects of our previous agent design © 2000-2012 Franz Kurfess Learning

Critic informs the learning element about the performance of the action must use a fixed standard of performance should be from the outside an internal standard could be modified to improve performance sometimes used by humans to justify or disguise low performance © 2000-2012 Franz Kurfess Learning

Problem Generator suggests actions that might lead to new experiences may lead to some sub-optimal decisions in the short run in the long run, hopefully better actions may be discovered otherwise no exploration would occur © 2000-2012 Franz Kurfess Learning

Learning Element Design Issues selections of the components of the performance elements that are to be improved representation mechanisms used in those components availability of feedback availability of prior information © 2000-2012 Franz Kurfess Learning

Performance Element Components multitude of different designs of the performance element corresponding to the various agent types discussed earlier candidate components for learning mapping from conditions to actions methods of inferring world properties from percept sequences changes in the world exploration of possible actions utility information about the desirability of world states goals to achieve high utility values © 2000-2012 Franz Kurfess Learning

Component Representation many possible representation schemes weighted polynomials (e.g. in utility functions for games) propositional logic predicate logic probabilistic methods (e.g. belief networks) learning methods have been explored and developed for many representation schemes © 2000-2012 Franz Kurfess Learning

Forms of Learning supervised learning unsupervised learning an agent tries to find a function that matches examples from a sample set each example provides an input together with the correct output a teacher provides feedback on the outcome the teacher can be an outside entity, or part of the environment unsupervised learning the agent tries to learn from patterns without corresponding output values reinforcement learning the agent does not know the exact output for an input, but it receives feedback on the desirability of its behavior the feedback can come from an outside entity, the environment, or the agent itself the feedback may be delayed, and not follow the respective action immediately © 2000-2012 Franz Kurfess Learning

Feedback provides information about the actual outcome of actions supervised learning both the input and the output of a component can be perceived by the agent directly the output may be provided by a teacher reinforcement learning feedback concerning the desirability of the agent’s behavior is available not in the form of the correct output may not be directly attributable to a particular action feedback may occur only after a sequence of actions the agent or component knows that it did something right (or wrong), but not what action caused it © 2000-2012 Franz Kurfess Learning

Prior Knowledge background knowledge available before a task is tackled can increase performance or decrease learning time considerably many learning schemes assume that no prior knowledge is available in reality, some prior knowledge is almost always available but often in a form that is not immediately usable by the agent © 2000-2012 Franz Kurfess Learning

Inductive Learning tries to find a function h (the hypothesis) that approximates a set of samples defining a function f the samples are usually provided as input-output pairs (x, f(x)) supervised learning method relies on inductive inference, or induction conclusions are drawn from specific instances to more general statements © 2000-2012 Franz Kurfess Learning

Hypotheses finding a suitable hypothesis can be difficult since the function f is unknown, it is hard to tell if the hypothesis h is a good approximation the hypothesis space describes the set of hypotheses under consideration e.g. polynomials, sinusoidal functions, propositional logic, predicate logic, ... the choice of the hypothesis space can strongly influence the task of finding a suitable function while a very general hypothesis space (e.g. Turing machines) may be guaranteed to contain a suitable function, it can be difficult to find it Ockham’s razor: if multiple hypotheses are consistent with the data, choose the simplest one © 2000-2012 Franz Kurfess Learning

Example Inductive Learning 1 f(x) input-output pairs displayed as points in a plane the task is to find a hypothesis (functions) that connects the points either all of them, or most of them various performance measures number of points connected minimal surface lowest tension © 2000-2012 Franz Kurfess Learning

Example Inductive Learning 2 f(x) hypothesis is a function consisting of linear segments fully incorporates all sample pairs goes through all points very easy to calculate has discontinuities at the joints of the segments moderate predictive performance © 2000-2012 Franz Kurfess Learning

Example Inductive Learning 3 f(x) hypothesis expressed as a polynomial function incorporates all samples more complicated to calculate than linear segments no discontinuities better predictive power © 2000-2012 Franz Kurfess Learning

Example Inductive Learning 4 f(x) hypothesis is a linear functions does not incorporate all samples extremely easy to compute low predictive power © 2000-2012 Franz Kurfess Learning

Learning and Decision Trees based on a set of attributes as input, predicted output value, the decision is learned it is called classification learning for discrete values regression for continuous values Boolean or binary classification output values are true or false conceptually the simplest case, but still quite powerful making decisions a sequence of test is performed, testing the value of one of the attributes in each step when a leaf node is reached, its value is returned good correspondence to human decision-making © 2000-2012 Franz Kurfess Learning

Boolean Decision Trees compute yes/no decisions based on sets of desirable or undesirable properties of an object or a situation each node in the tree reflects one yes/no decision based on a test of the value of one property of the object the root node is the starting point leaf nodes represent the possible final decisions branches are labeled with possible values the learning aspect is to predict the value of a goal predicate (also called goal concept) a hypothesis is formulated as a function that defines the goal predicate © 2000-2012 Franz Kurfess Learning

Terminology example or sample sample set training validation describes the values of the attributes and the goal a positive sample has the value true for the goal predicate, a negative sample has false sample set collection of samples used for training and validation training the training set consists of samples used for constructing the decision tree validation the test set is used to determine if the decision tree performs correctly ideally, the test set is different from the training set © 2000-2012 Franz Kurfess Learning

Restaurant Sample Set This sample set has a few non-binary attributes, such as “Patrons”, “Price”, “Type”, and “Estimated Wait”.Decision trees can handle n-ary attributes, and it is straightforward to convert them into a corresponding set of binary attributes. © 2000-2012 Franz Kurfess Learning

Decision Tree Example Patrons? No Yes EstWait? Yes Hungry? Bar? No Yes None Full Some No Yes EstWait? > 60 0-10 30-60 10-30 Yes Hungry? Bar? No No Yes No Yes Yes Alternative? No Alternative? No This decision tree is human- generated, and has a few inconsistencies with the data set. For example, “Alternative”, “Walkable”, and “Driveable” are variations of the same attribute. No Yes Yes Yes Walkable? Yes Driveable? No No Yes Yes Yes No Yes No To wait, or not to wait? © 2000-2012 Franz Kurfess Learning

Learning Decision Trees Problem: find a decision tree that agrees with the training set trivial solution: construct a tree with one branch for each sample of the training set works perfectly for the samples in the training set may not work well for new samples (generalization) results in relatively large trees better solution: find a concise tree that still agrees with all samples corresponds to the simplest hypothesis that is consistent with the training set © 2000-2012 Franz Kurfess Learning

Ockham’s Razor The most likely hypothesis is the simplest one that is consistent with all observations. general principle for inductive learning a simple hypothesis that is consistent with all observations is more likely to be correct than a complex one © 2000-2012 Franz Kurfess Learning

Constructing Decision Trees in general, constructing the smallest possible decision tree is an intractable problem algorithms exist for constructing reasonably small trees basic idea: test the most important attribute first attribute that makes the most difference for the classification of an example can be determined through information theory hopefully will yield the correct classification with few tests © 2000-2012 Franz Kurfess Learning

Decision Tree Algorithm recursive formulation select the best attribute to split positive and negative examples if only positive or only negative examples are left, we are done if no examples are left, no such examples were observed return a default value calculated from the majority classification at the node’s parent if we have positive and negative examples left, but no attributes to split them, we are in trouble samples have the same description, but different classifications may be caused by incorrect data (noise), or by a lack of information, or by a truly non-deterministic domain © 2000-2012 Franz Kurfess Learning

Performance of Decision Tree Learning quality of predictions predictions for the classification of unknown examples that agree with the correct result are obviously better can be measured easily after the fact it can be assessed in advance by splitting the available examples into a training set and a test set learn the training set, and assess the performance via the test set size of the tree a smaller tree (especially depth-wise) is a more concise representation © 2000-2012 Franz Kurfess Learning

Noise and Over-fitting the presence of irrelevant attributes (“noise”) may lead to more degrees of freedom in the decision tree the hypothesis space is unnecessarily large overfitting makes use of irrelevant attributes to distinguish between samples that have no meaningful differences e.g. using the day of the week when rolling dice over-fitting is a general problem for all learning algorithms decision tree pruning identifies attributes that are likely to be irrelevant very low information gain cross-validation splits the sample data in different training and test sets results are averaged © 2000-2012 Franz Kurfess Learning

Ensemble Learning Multiple hypotheses (an ensemble) are generated, and their predictions combined by using multiple hypotheses, the likelihood for misclassification is hopefully lower also enlarges the hypothesis space Boosting is a frequently used ensemble method each example in the training set has a weight associated the weights of incorrectly classified examples are increased, and a new hypothesis is generated from this new weighted training set the final hypothesis is a weighted-majority combination of all the generated hypotheses © 2000-2012 Franz Kurfess Learning

Explanation-Based Learning Learning complex concepts using Induction procedures typically requires a substantial number of training instances. But people seem to be able to learn quite a bit from single examples. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and then system’s performance is improved through the availability of this knowledge.

EBL EBL programs as accepting the following as input: A training example A goal concept: A high level description of what the program is supposed to learn An operational criterion- A description of which concepts are usable. A domain theory: A set of rules that describe relationships between objects and actions in a domain. From this EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion. Explanation-based generalization (EBG) is an algorithm for EBL and has two steps: (1) explain, (2) generalize During the explanation step- prune away all the unimportant aspects of the training example with respect to the goal concept – gives explanation The next step is to generalize the explanation as far as possible while still describing the goal concept.

Statistical Learning Methods

Statistical Learning Data – instantiations of some or all of the random variables describing the domain; they are evidence Hypotheses – probabilistic theories of how the domain works The Surprise candy example: two flavors in very large bags of 5 kinds, indistinguishable from outside h1: 100% cherry – P(c|h1) = 1, P(l|h1) = 0 h2: 75% cherry + 25% lime h3: 50% cherry + 50% lime h4: 25% cherry + 75% lime h5: 100% lime

Problem formulation Bayesian learning Given a new bag, random variable H denotes the bag type (h1 – h5); Di is a random variable (cherry or lime); after seeing D1, D2, …, DN, predict the flavor (value) of DN-1. Bayesian learning Calculates the probability of each hypothesis, given the data and makes predictions on that basis P(hi|d) = αP(d|hi)P(hi), where d are observed values of D Predictions use a likelihood-weighted average over hypotheses hi are intermediaries between raw data and predictions No need to pick one best-guess hypothesis

Learning with Complete Data Parameter learning - to find the numerical parameters for a probability model whose structure is fixed Data are complete when each data point contains values for every variable in the model Maximum-likelihood parameter learning: discrete model With complete data, ML parameter learning problem for a Bayesian network decomposes into separate learning problems, one for each parameter

Naïve Bayes models The most common Bayesian network model used in machine learning It assumes that the attributes are conditionally independent of each other, given class A deterministic prediction can be obtained by choosing the most likely class P(C|x1,x2,…,xn) = αP(C) Πi P(xi|C) NBC has no difficulty with noisy data

Learning with Hidden Variables Many real-world problems have hidden variables which are not observable in the data available for learning. Question: If a variable (disease) is not observed, why not construct a model without it? Answer: Hidden variables can dramatically reduce the number of parameters required to specify a Bayesian network. This results in the reduction of needed amount of data for learning.

EM: Learning mixtures of Gaussians The unsupervised clustering problem If we knew which component generated each xj, we can get , If we knew the parameters of each component, we know which ci should xj belong to. However, we do not know either, … EM – expectation and maximization Pretend we know the parameters of the model and then to infer the probability that each xj belongs to each component; iterate until convergence.

E-step computes the expected value pij of the hidden indicator variables Zij, where Zij is 1 if xj was generated by i-th component, 0 otherwise M-step finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of Zij

Instance-based Learning Parametric vs. nonparametric learning Learning focuses on fitting the parameters of a restricted family of probability models to an unrestricted data set Parametric learning methods are often simple and effective, but can oversimplify what’s really happening Nonparametric learning allows the hypothesis complexity to grow with the data IBL is nonparametric as it constructs hypotheses directly from the training data.

Nearest-neighbor models The key idea: Neighbors are similar Density estimation example: estimate x’s probability density by the density of its neighbors Connecting with table lookup, NBC, decision trees, … How define neighborhood N If too small, no any data points If too big, density is the same everywhere A solution is to define N to contain k points, where k is large enough to ensure a meaningful estimate For a fixed k, the size of N varies The effect of size of k For most low-dimensional data, k is usually between 5-10

K-NN for a given query x Which data point is nearest to x? We need a distance metric, D(x1, x2) Euclidean distance DE is a popular one When each dimension measures something different, it is inappropriate to use DE (Why?) Important to standardize the scale for each dimension Mahalanobis distance is one solution Discrete features should be dealt with differently Hamming distance Use k-NN to predict High dimensionality poses another problem

Summary Bayesian learning formulates learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. Maximum a posteriori (MAP) selects a single most likely hypothesis given the data. Maximum likelihood simply selects the hypothesis that maximizes the likelihood of the data (= MAP with a uniform prior). EM can find local maximum likelihood solutions for hidden variables. Instance-based models use the collection of data to represent a distribution. Nearest-neighbor method

Reinforcement Learning In which we examine how an agent can learn from success and failure, reward and punishment.

Introduction Learning to ride a bicycle: The goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over Begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right Photo:http://www.roanoke.com/outdoors/bikepages/bikerattler.html

Learning to ride a bicycle: Introduction Learning to ride a bicycle: RL system turns the handle bars to the LEFT Result: CRASH!!! Receives negative reinforcement RL system turns the handle bars to the RIGHT

Learning to ride a bicycle: Introduction Learning to ride a bicycle: RL system has learned that the “state” of being titled 45 degrees to the right is bad Repeat trial using 40 degree to the right By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over

Passive Learning in a Known Environment Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.

Passive Learning in a Known Environment In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

Passive Learning in a Known Environment Agent can move {North, East, South, West} Terminate on reading [4,2] or [4,3]

Passive Learning in a Known Environment the object is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i Utilities can be learned using 3 approaches 1) LMS (least mean squares) 2) ADP (adaptive dynamic programming) 3) TD (temporal difference learning)

Active Learning in an Unknown Environment An active agent must consider : what actions to take what their outcomes may be how they will affect the rewards received

Active Learning in an Unknown Environment Minor changes to passive learning agent : environment model now incorporates the probabilities of transitions to other states given a particular action maximize its expected utility agent needs a performance element to choose an action at each step

Learning in Neural Networks Neurons and the Brain Neural Networks Perceptrons Multi-layer Networks Applications © 2000-2012 Franz Kurfess Learning

Neural Networks complex networks of simple computing elements capable of learning from examples with appropriate learning methods collection of simple elements performs high-level operations thought reasoning consciousness © 2000-2012 Franz Kurfess Learning

Neural Networks and the Brain set of interconnected modules performs information processing operations at various levels sensory input analysis memory storage and retrieval reasoning feelings consciousness neurons basic computational elements heavily interconnected with other neurons [Russell & Norvig, 1995] © 2000-2012 Franz Kurfess Learning

Artificial Neuron Diagram [Russell & Norvig, 1995] weighted inputs are summed up by the input function the (nonlinear) activation function calculates the activation value, which determines the output © 2000-2012 Franz Kurfess Learning

Common Activation Functions [Russell & Norvig, 1995] Stept(x) = 1 if x >= t, else 0 Sign(x) = +1 if x >= 0, else –1 Sigmoid(x) = 1/(1+e-x) © 2000-2012 Franz Kurfess Learning

Network Structures in principle, networks can be arbitrarily connected occasionally done to represent specific structures semantic networks logical sentences makes learning rather difficult layered structures networks are arranged into layers interconnections mostly between two layers some networks have feedback connections © 2000-2012 Franz Kurfess Learning

Perceptrons single layer, feed-forward network [Russell & Norvig, 1995] single layer, feed-forward network historically one of the first types of neural networks late 1950s the output is calculated as a step function applied to the weighted sum of inputs capable of learning simple functions linearly separable © 2000-2012 Franz Kurfess Learning

Perceptrons and Learning perceptrons can learn from examples through a simple learning rule calculate the error of a unit Erri as the difference between the correct output Ti and the calculated output Oi Erri = Ti - Oi adjust the weight Wj of the input Ij such that the error decreases Wij := Wij + α *Iij * Errij α is the learning rate this is a gradient descent search through the weight space lead to great enthusiasm in the late 50s and early 60s until Minsky & Papert in 69 analyzed the class of representable functions and found the linear separability problem © 2000-2012 Franz Kurfess Learning

Multi-Layer Networks research in the more complex networks with more than one layer was very limited until the 1980s learning in such networks is much more complicated the problem is to assign the blame for an error to the respective units and their weights in a constructive way the back-propagation learning algorithm can be used to facilitate learning in multi-layer networks © 2000-2012 Franz Kurfess Learning

Diagram Multi-Layer Network two-layer network input units Ik usually not counted as a separate layer hidden units aj output units Oi usually all nodes of one layer have weighted connections to all nodes of the next layer Oi Wji aj Wkj Ik © 2000-2012 Franz Kurfess Learning

Back-Propagation Algorithm assigns blame to individual units in the respective layers essentially based on the connection strength proceeds from the output layer to the hidden layer(s) updates the weights of the units leading to the layer essentially performs gradient-descent search on the error surface relatively simple since it relies only on local information from directly connected units has convergence and efficiency problems © 2000-2012 Franz Kurfess Learning

Capabilities of Multi-Layer Neural Networks expressiveness weaker than predicate logic good for continuous inputs and outputs computational efficiency training time can be exponential in the number of inputs depends critically on parameters like the learning rate local minima are problematic can be overcome by simulated annealing, at additional cost generalization works reasonably well for some functions (classes of problems) no formal characterization of these functions © 2000-2012 Franz Kurfess Learning

Capabilities of Multi-Layer Neural Networks (cont.) sensitivity to noise very tolerant they perform nonlinear regression transparency neural networks are essentially black boxes there is no explanation or trace for a particular answer tools for the analysis of networks are very limited some limited methods to extract rules from networks prior knowledge very difficult to integrate since the internal representation of the networks is not easily accessible © 2000-2012 Franz Kurfess Learning

Applications domains and tasks where neural networks are successfully used handwriting recognition control problems juggling, truck backup problem series prediction weather, financial forecasting categorization sorting of items (fruit, characters, phonemes, …) © 2000-2012 Franz Kurfess Learning