Computing & Information Sciences, Kansas State University. CIS 490 / 730: Artificial Intelligence. Lecture 37 of 42, Monday, 20 Nov 2006.

Presentation transcript:

CIS 490 / 730: Artificial Intelligence
Lecture 37 of 42, Monday, 20 November 2006
William H. Hsu, Department of Computing and Information Sciences, KSU
Reading for Next Class: Sections 4.3 and 20.5, Russell & Norvig 2nd edition
More Artificial Neural Networks
Discussion: Problem Set 7

Lecture Outline
Today's Reading: Section 20.5, R&N 2e
Next Monday's Reading: Sections 4.3 and 20.5, R&N 2e
Decision Trees
- Induction
- Greedy learning
- Entropy
Perceptrons
- Definitions, representation
- Limitations
Multi-Layer Perceptrons
- Definitions, representation
- Limitations

Artificial Neural Networks
Reference: Sec. …, Mitchell; Chapter 4, Bishop; Rumelhart et al.
Multi-Layer Networks
- Nonlinear transfer functions
- Multi-layer networks of nonlinear units (sigmoid, hyperbolic tangent)
Backpropagation of Error
- The backpropagation algorithm
  - Relation to error gradient function for nonlinear units
  - Derivation of training rule for feedforward multi-layer networks
- Training issues
  - Local optima
  - Overfitting in ANNs
Hidden-Layer Representations
Examples: Face Recognition and Text-to-Speech
Advanced Topics (Brief Survey)
Next Week: Chapter 5 and Sections …, Mitchell; Quinlan paper

Connectionist (Neural Network) Models
Human Brains
- Neuron switching time: ~$10^{-3}$ second
- Number of neurons: ~ billions ($10^{10}$ and up)
- Connections per neuron: ~10 to 100 thousand ($10^4$ to $10^5$)
- Scene recognition time: ~0.1 second
- 100 inference steps don't seem sufficient! This suggests highly parallel computation
Definitions of Artificial Neural Networks (ANNs)
- "… a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes." - DARPA (1988)
- NN FAQ List
Properties of ANNs
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically

When to Consider Neural Networks
Input: High-Dimensional and Discrete or Real-Valued
- e.g., raw sensor input
- Conversion of symbolic data to quantitative (numerical) representations possible
Output: Discrete or Real Vector-Valued
- e.g., low-level control policy for a robot actuator
- Similar qualitative/quantitative (symbolic/numerical) conversions may apply
Data: Possibly Noisy
Target Function: Unknown Form
Result: Human Readability Less Important Than Performance
- Performance measured purely in terms of accuracy and efficiency
- Readability: ability to explain inferences made using model; similar criteria
Examples
- Speech phoneme recognition [Waibel, Lee]
- Image classification [Kanade, Baluja, Rowley, Frey]
- Financial prediction

Autonomous Land Vehicle In a Neural Network (ALVINN)
Pomerleau et al.
- Drives 70 mph on highways
[Figures: Hidden-to-Output Unit Weight Map (for one hidden unit); Input-to-Hidden Unit Weight Map (for one hidden unit)]

The Perceptron
[Figure: a single unit with inputs $x_1, x_2, …, x_n$ weighted by $w_1, w_2, …, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feeding a summation node]
Perceptron: Single Neuron Model
- aka Linear Threshold Unit (LTU) or Linear Threshold Gate (LTG)
- Net input to unit: defined as linear combination, $net = \sum_{i=0}^{n} w_i x_i$
- Output of unit: threshold (activation) function on net input (threshold $\theta = w_0$)
Perceptron Networks
- Neuron is modeled using a unit connected by weighted links $w_i$ to other units
- Multi-Layer Perceptron (MLP): next lecture

Decision Surface of a Perceptron
Perceptron: Can Represent Some Useful Functions
- LTU emulation of logic gates (McCulloch and Pitts, 1943)
- e.g., What weights represent $g(x_1, x_2) = AND(x_1, x_2)$? $OR(x_1, x_2)$? $NOT(x)$?
Some Functions Not Representable
- e.g., not linearly separable
- Solution: use networks of perceptrons (LTUs)
[Figures: Example A and Example B, sets of + and - points plotted on axes $x_1$ and $x_2$]
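The question about gate weights can be answered concretely. The following is a minimal sketch, assuming an LTU that outputs 1 when its net input is strictly positive and 0 otherwise; the function name `ltu` and the specific weight values are illustrative choices, not taken from the slides, and many other weight settings work equally well.

```python
def ltu(weights, inputs):
    """Linear threshold unit. weights[0] is the bias weight w_0 (paired with x_0 = 1)."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else 0

# One possible choice of weights per gate (assumed for illustration).
AND_W = [-1.5, 1.0, 1.0]   # fires only when both inputs are 1
OR_W = [-0.5, 1.0, 1.0]    # fires when at least one input is 1
NOT_W = [0.5, -1.0]        # fires only when the input is 0

# Truth tables over inputs (0,0), (0,1), (1,0), (1,1) and over 0, 1.
and_table = [ltu(AND_W, [a, b]) for a in (0, 1) for b in (0, 1)]
or_table = [ltu(OR_W, [a, b]) for a in (0, 1) for b in (0, 1)]
not_table = [ltu(NOT_W, [a]) for a in (0, 1)]
print(and_table, or_table, not_table)
```

No single choice of three weights reproduces the XOR truth table in the same way, which is the representational limit the slide refers to.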

Learning Rules for Perceptrons
Learning Rule / Training Rule
- Not specific to supervised learning
- Context: updating a model
Hebbian Learning Rule (Hebb, 1949)
- Idea: if two units are both active ("firing"), weights between them should increase
- $w_{ij} = w_{ij} + r\,o_i o_j$, where r is a learning rate constant
- Supported by neuropsychological evidence
Perceptron Learning Rule (Rosenblatt, 1959)
- Idea: when a target output value is provided for a single neuron with fixed input, it can incrementally update weights to learn to produce the output
- Assume binary (boolean-valued) input/output units; single LTU
- $w_i \leftarrow w_i + r(t - o)x_i$, where t = c(x) is target output value, o is perceptron output, r is small learning rate constant (e.g., 0.1)
- Can prove convergence if D linearly separable and r small enough

Perceptron Learning Algorithm
Simple Gradient Descent Algorithm
- Applicable to concept learning, symbolic learning (with proper representation)
Algorithm Train-Perceptron (D = {⟨x, t(x)⟩})
- Initialize all weights $w_i$ to random values
- WHILE not all examples correctly predicted DO
    FOR each training example x ∈ D
      Compute current output o(x)
      FOR i = 1 to n
        $w_i \leftarrow w_i + r(t - o)x_i$  // perceptron learning rule
Perceptron Learnability
- Recall: can only learn h ∈ H, i.e., linearly separable (LS) functions
- Minsky and Papert, 1969: demonstrated representational limitations
  - e.g., parity (n-attribute XOR: $x_1 \oplus x_2 \oplus … \oplus x_n$)
  - e.g., symmetry, connectedness in visual pattern recognition
  - Influential book Perceptrons discouraged ANN research for ~10 years
- NB: $64K question: "Can we transform learning problems into LS ones?"
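The Train-Perceptron procedure can be sketched in a few lines. This is a hedged illustration rather than the course's implementation: it assumes 0/1 outputs with a strict threshold at zero, starts from zero weights instead of random ones so the run is reproducible, and adds a `max_epochs` cap (an assumption) so the loop also terminates on data that is not linearly separable.

```python
def predict(w, x):
    """LTU output: w[0] is the bias weight for the constant input x_0 = 1."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else 0

def train_perceptron(data, r=0.1, max_epochs=100):
    """data is a list of (x, t) pairs; returns the learned weight list."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)  # zeros instead of random values, for reproducibility
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            o = predict(w, x)
            if o != t:
                errors += 1
                w[0] += r * (t - o)  # bias update, since x_0 = 1
                for i in range(n):
                    w[i + 1] += r * (t - o) * x[i]  # perceptron learning rule
        if errors == 0:  # all examples correctly predicted
            break
    return w

# OR is linearly separable, so training is expected to converge.
OR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_perceptron(OR_DATA)
print(w, [predict(w, x) for x, _ in OR_DATA])
```

Swapping in XOR targets makes the loop run until `max_epochs` without ever reaching zero errors, which illustrates the learnability limit noted above.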

Linear Separators
[Figure: Linearly Separable (LS) Data Set, plotted on axes $x_1$ and $x_2$]
Functional Definition
- $f(x) = 1$ if $w_1 x_1 + w_2 x_2 + … + w_n x_n \geq \theta$, 0 otherwise
- $\theta$: threshold value
Linearly Separable Functions
- NB: D is LS does not necessarily imply c(x) = f(x) is LS!
- Disjunctions: $c(x) = x_1' \lor x_2' \lor … \lor x_m'$ (LS)
- m of n: c(x) = at least 3 of $(x_1', x_2', …, x_m')$ (LS)
- Exclusive OR (XOR): $c(x) = x_1 \oplus x_2$ (not LS)
- General DNF: $c(x) = T_1 \lor T_2 \lor … \lor T_m$; $T_i = l_1 \land l_2 \land … \land l_k$ (not LS in general)
Change of Representation Problem
- Can we transform non-LS problems into LS ones?
- Is this meaningful? Practical?
- Does it represent a significant fraction of real-world problems?

Perceptron Convergence
Perceptron Convergence Theorem
- Claim: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge
- Proof: well-founded ordering on search region ("wedge width" is strictly decreasing); see Minsky and Papert
- Caveat 1: How long will this take?
- Caveat 2: What happens if the data is not LS?
Perceptron Cycling Theorem
- Claim: If the training data is not LS, the perceptron learning algorithm will eventually repeat the same set of weights and thereby enter an infinite loop
- Proof: bound on number of weight changes until repetition; induction on n, the dimension of the training example vector; see Minsky and Papert
How to Provide More Robustness, Expressivity?
- Objective 1: develop algorithm that will find closest approximation (today)
- Objective 2: develop architecture to overcome representational limitation (next lecture)

Gradient Descent: Principle
Understanding Gradient Descent for Linear Units
- Consider simpler, unthresholded linear unit: $o(x) = net(x) = \sum_{i=0}^{n} w_i x_i$
- Objective: find "best fit" to D
Approximation Algorithm
- Quantitative objective: minimize error over training data set D
- Error function: sum squared error (SSE), $E(\mathbf{w}) = \frac{1}{2} \sum_{x \in D} (t(x) - o(x))^2$
How to Minimize?
- Simple optimization
- Move in direction of steepest descent (the negative gradient) in weight-error space
  - Computed by finding tangent, i.e., partial derivatives (of E) with respect to weights ($w_i$)

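Gradient Descent: Derivation of Delta/LMS (Widrow-Hoff) Rule
Definition: Gradient
- $\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, …, \frac{\partial E}{\partial w_n} \right]$
Modified Gradient Descent Training Rule
- $\Delta \mathbf{w} = -r \nabla E(\mathbf{w})$, i.e., $\Delta w_i = -r \frac{\partial E}{\partial w_i}$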
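The slide's equations did not survive in this transcript, so the derivation is sketched here under the standard assumptions: an unthresholded linear unit $o(x) = \sum_i w_i x_i$, sum squared error over D, and learning rate r. The symbols follow the surrounding slides; the intermediate steps are a reconstruction, not a copy of the original slide.

```latex
\begin{aligned}
E(\mathbf{w}) &= \frac{1}{2} \sum_{x \in D} \bigl(t(x) - o(x)\bigr)^2,
  \qquad o(x) = \sum_{i=0}^{n} w_i x_i \\
\frac{\partial E}{\partial w_i}
  &= \frac{1}{2} \sum_{x \in D} 2 \bigl(t(x) - o(x)\bigr)
     \frac{\partial}{\partial w_i} \bigl(t(x) - o(x)\bigr) \\
  &= \sum_{x \in D} \bigl(t(x) - o(x)\bigr) (-x_i) \\
\Delta w_i &= -r \frac{\partial E}{\partial w_i}
  = r \sum_{x \in D} \bigl(t(x) - o(x)\bigr)\, x_i
\end{aligned}
```

The last line is the batch form of the delta rule; applying the summand after each example instead of summing over D gives the incremental (stochastic) form.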

Gradient Descent: Algorithm using Delta/LMS Rule
Algorithm Gradient-Descent (D, r)
- Each training example is a pair of the form ⟨x, t(x)⟩, where x is the vector of input values and t(x) is the output value. r is the learning rate (e.g., 0.05)
- Initialize all weights $w_i$ to (small) random values
- UNTIL the termination condition is met, DO
    Initialize each $\Delta w_i$ to zero
    FOR each ⟨x, t(x)⟩ in D, DO
      Input the instance x to the unit and compute the output o
      FOR each linear unit weight $w_i$, DO
        $\Delta w_i \leftarrow \Delta w_i + r(t - o)x_i$
    $w_i \leftarrow w_i + \Delta w_i$
- RETURN final w
Mechanics of Delta Rule
- Gradient is based on a derivative
- Significance: later, will use nonlinear activation functions (aka transfer functions, squashing functions)
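The batch procedure can be sketched as follows. This is an illustrative version under stated assumptions: a fixed number of epochs stands in for the unspecified termination condition, weights start at zero rather than small random values so the result is reproducible, and the toy data set (points on the line t = 1 + 2x) is invented for the example.

```python
def gradient_descent(data, r=0.05, epochs=500):
    """Batch delta/LMS rule for one unthresholded linear unit.

    data is a list of (x, t) pairs; returns weights with w[0] as the bias weight.
    """
    n = len(data[0][0])
    w = [0.0] * (n + 1)  # zeros instead of small random values (assumption)
    for _ in range(epochs):  # fixed epoch count stands in for the termination test
        delta = [0.0] * (n + 1)  # initialize each delta-w_i to zero
        for x, t in data:
            xs = [1.0] + list(x)  # prepend the constant input x_0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i in range(n + 1):
                delta[i] += r * (t - o) * xs[i]  # accumulate over all of D
        w = [wi + di for wi, di in zip(w, delta)]  # one update per epoch
    return w

# Hypothetical data lying exactly on t = 1 + 2x.
DATA = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w = gradient_descent(DATA)
print(w)
```

With this data and learning rate the weights approach bias 1 and slope 2. A much larger learning rate makes the same loop diverge, which is why the slide suggests small values such as 0.05.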

Gradient Descent: Perceptron Rule versus Delta/LMS Rule
LS Concepts: Can Achieve Perfect Classification
- Example A: perceptron training rule converges
Non-LS Concepts: Can Only Approximate
- Example B: not LS; delta rule converges, but can't do better than 3 correct
- Example C: not LS; better results from delta rule
Weight Vector w = Sum of Misclassified x ∈ D
- Perceptron: minimize w
- Delta Rule: minimize error (distance from separator; i.e., maximize …)
[Figures: Examples A, B, and C, sets of + and - points plotted on axes $x_1$ and $x_2$]

Multi-Layer Networks of Nonlinear Units
Nonlinear Units
- Recall: activation function sgn(w · x)
- Nonlinear activation function: generalization of sgn
Multi-Layer Networks
- A specific type: Multi-Layer Perceptrons (MLPs)
- Definition: a multi-layer feedforward network is composed of an input layer, one or more hidden layers, and an output layer
- "Layers": counted in weight layers (e.g., 1 hidden layer = 2-layer network)
- Only hidden and output layers contain perceptrons (threshold or nonlinear units)
MLPs in Theory
- Network (of 2 or more layers) can represent any function (arbitrarily small error)
- Training even 3-unit multi-layer ANNs is NP-hard (Blum and Rivest, 1992)
MLPs in Practice
- Finding or designing effective networks for arbitrary functions is difficult
- Training is very computation-intensive even when structure is "known"
[Figure: feedforward network with an input layer ($x_1$, $x_2$, $x_3$), a hidden layer ($h_1$ to $h_4$), and an output layer ($o_1$, $o_2$); example weights $u_{11}$ (input-to-hidden) and $v_{42}$ (hidden-to-output)]

Nonlinear Activation Functions
[Figure: a single unit with inputs $x_1, x_2, …, x_n$ weighted by $w_1, w_2, …, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feeding a summation node]
Sigmoid Activation Function
- Linear threshold gate activation function: sgn(w · x)
- Nonlinear activation (aka transfer, squashing) function: generalization of sgn
- $\sigma$ is the sigmoid function: $\sigma(net) = \frac{1}{1 + e^{-net}}$
- Can derive gradient rules to train
  - One sigmoid unit
  - Multi-layer, feedforward networks of sigmoid units (using backpropagation)
Hyperbolic Tangent Activation Function
- $\tanh(net) = \frac{e^{net} - e^{-net}}{e^{net} + e^{-net}}$
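Both activation functions have derivatives that can be written in terms of the function's own output, which is what makes them convenient for gradient-based training. The sketch below uses the standard definitions; the function names are illustrative, and the naive `sigmoid` shown here is assumed to be called with moderate arguments (very large negative inputs would overflow `math.exp`).

```python
import math

def sigmoid(net):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    """Derivative of the sigmoid, expressed through its output: s * (1 - s)."""
    s = sigmoid(net)
    return s * (1.0 - s)

def tanh_prime(net):
    """Derivative of tanh, expressed through its output: 1 - tanh(net)^2."""
    return 1.0 - math.tanh(net) ** 2

# tanh is a rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1.
z = 0.7
print(sigmoid(z), sigmoid_prime(z), math.tanh(z), tanh_prime(z))
print(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)))
```

The second printed line is the gap between `tanh` and the rescaled sigmoid, which should be at the level of floating-point rounding.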

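Error Gradient for a Sigmoid Unit
Recall: Gradient of Error Function
- $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{x \in D} (t(x) - o(x))^2 = -\sum_{x \in D} (t(x) - o(x)) \frac{\partial o(x)}{\partial net(x)} \frac{\partial net(x)}{\partial w_i}$
Gradient of Sigmoid Activation Function
- $\frac{\partial o(x)}{\partial net(x)} = \frac{\partial \sigma(net(x))}{\partial net(x)}$
But We Know:
- $\frac{\partial \sigma(net(x))}{\partial net(x)} = o(x)(1 - o(x))$ and $\frac{\partial net(x)}{\partial w_i} = x_i$
So:
- $\frac{\partial E}{\partial w_i} = -\sum_{x \in D} (t(x) - o(x))\, o(x)(1 - o(x))\, x_i$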
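The standard result for a single sigmoid unit with sum squared error, $\partial E / \partial w_i = -\sum_{x \in D} (t - o)\,o(1 - o)\,x_i$, can be checked numerically against a finite-difference estimate. The sketch below does that; the data values, weights, and function names are made up for the check, and each input vector already includes the constant $x_0 = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x):
    """Sigmoid unit output o = sigma(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, data):
    """Sum squared error E = 1/2 * sum (t - o)^2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

def analytic_gradient(w, data):
    """dE/dw_i = -sum (t - o) * o * (1 - o) * x_i."""
    grad = [0.0] * len(w)
    for x, t in data:
        o = output(w, x)
        for i, xi in enumerate(x):
            grad[i] -= (t - o) * o * (1.0 - o) * xi
    return grad

def numeric_gradient(w, data, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((error(up, data) - error(down, data)) / (2.0 * eps))
    return grad

# Hypothetical examples; the leading 1.0 in each x is the bias input x_0.
DATA = [([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 0.0], 0.0), ([1.0, 1.0, 1.0], 1.0)]
W = [0.1, -0.2, 0.3]
max_diff = max(abs(a - n) for a, n in
               zip(analytic_gradient(W, DATA), numeric_gradient(W, DATA)))
print(max_diff)
```

A check of this kind is a common way to catch sign or index mistakes before using a hand-derived gradient in a training loop.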

Backpropagation Algorithm
Intuitive Idea: Distribute Blame for Error to Previous Layers
Algorithm Train-by-Backprop (D, r)
- Each training example is a pair of the form ⟨x, t(x)⟩, where x is the vector of input values and t(x) is the output value. r is the learning rate (e.g., 0.05)
- Initialize all weights $w_i$ to (small) random values
- UNTIL the termination condition is met, DO
    FOR each ⟨x, t(x)⟩ in D, DO
      Input the instance x to the network and compute the output $o(x) = \sigma(net(x))$
      FOR each output unit k, DO
        $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
      FOR each hidden unit j, DO
        $\delta_j \leftarrow h_j (1 - h_j) \sum_k v_{j,k}\, \delta_k$
      Update each $w = u_{i,j}$ (input-to-hidden) or $w = v_{j,k}$ (hidden-to-output)
        $\Delta w_{start,end} \leftarrow r\, \delta_{end}\, a_{start}$, where $a_{start}$ is the activation entering the weight ($x_i$ for $u_{i,j}$, $h_j$ for $v_{j,k}$)
        $w_{start,end} \leftarrow w_{start,end} + \Delta w_{start,end}$
- RETURN final u, v
[Figure: feedforward network with an input layer ($x_1$, $x_2$, $x_3$), a hidden layer ($h_1$ to $h_4$), and an output layer ($o_1$, $o_2$); example weights $u_{11}$ and $v_{42}$]
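A compact sketch of backpropagation for a network with one hidden layer of sigmoid units follows. It is an illustration under several assumptions, not the course's code: weights are updated after every example (the incremental form), a fixed epoch count replaces the termination condition, bias weights are stored at index 0 of each row, and the weight matrices are indexed as `u[hidden][input]` and `v[output][hidden]`, which is the transpose of the slide's $u_{i,j}$, $v_{j,k}$ subscript order. The XOR data set and all function names are chosen for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, u, v):
    """Return inputs with bias, hidden activations with bias, and outputs."""
    xs = [1.0] + list(x)
    hs = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, xs))) for row in u]
    o = [sigmoid(sum(w * a for w, a in zip(row, hs))) for row in v]
    return xs, hs, o

def sse(data, u, v):
    """Sum squared error over a data set."""
    total = 0.0
    for x, t in data:
        o = forward(x, u, v)[2]
        total += sum((tk - ok) ** 2 for tk, ok in zip(t, o))
    return 0.5 * total

def backprop_step(x, t, u, v, r):
    """One incremental update of u and v (in place) for a single example."""
    xs, hs, o = forward(x, u, v)
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden deltas use the output-layer weights from before this update.
    delta_h = [hs[j + 1] * (1 - hs[j + 1]) *
               sum(v[k][j + 1] * delta_o[k] for k in range(len(v)))
               for j in range(len(u))]
    for k in range(len(v)):
        for j in range(len(hs)):
            v[k][j] += r * delta_o[k] * hs[j]
    for j in range(len(u)):
        for i in range(len(xs)):
            u[j][i] += r * delta_h[j] * xs[i]

def train_by_backprop(data, n_hidden, r=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)  # seeded so runs are repeatable
    n_in, n_out = len(data[0][0]), len(data[0][1])
    u = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in data:
            backprop_step(x, t, u, v, r)
    return u, v

XOR = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
u, v = train_by_backprop(XOR, n_hidden=3)
print(sse(XOR, u, v))
```

Because backprop only finds a local minimum, the final error depends on the seed, the learning rate, and the number of hidden units; rerunning with a different seed is the simplest remedy when a run stalls.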

Backpropagation and Local Optima
Gradient Descent in Backprop
- Performed over entire network weight vector
- Easily generalized to arbitrary directed graphs
- Property: Backprop on feedforward ANNs will find a local (not necessarily global) error minimum
Backprop in Practice
- Local optimization often works well (can run multiple times)
- Often include weight momentum $\alpha$
- Minimizes error over training examples; generalization to subsequent instances?
- Training often very slow: thousands of iterations over D (epochs)
- Inference (applying network after training) typically very fast
  - Classification
  - Control
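The momentum formula itself is missing from this transcript. A common form, assumed here, adds a fraction $\alpha$ of the previous weight change to the current gradient step: $\Delta w(n) = -r\,\partial E / \partial w + \alpha\,\Delta w(n-1)$. The sketch applies it to a one-dimensional quadratic error surface chosen only for illustration; `momentum_descent` is a hypothetical helper name.

```python
def momentum_descent(grad, w0, r=0.1, alpha=0.9, steps=200):
    """Gradient descent with a momentum term on a single weight.

    grad is a function returning dE/dw at w; alpha is the momentum constant.
    """
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -r * grad(w) + alpha * dw  # blend new step with previous change
        w += dw
    return w

# Hypothetical error surface E(w) = (w - 3)^2, so dE/dw = 2 (w - 3).
w_final = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

On this surface the weight overshoots and oscillates around the minimum at 3 before settling. That carry-over of velocity is what helps momentum roll through small local minima and flat regions in a real network's error surface.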

Feedforward ANNs: Representational Power and Bias
Representational (i.e., Expressive) Power
- Backprop presented for feedforward ANNs with single hidden layer (2-layer)
- 2-layer feedforward ANN
  - Any Boolean function (simulate a 2-layer AND-OR network)
  - Any bounded continuous function (approximate with arbitrarily small error) [Cybenko, 1989; Hornik et al, 1989]
- Sigmoid functions: set of basis functions; used to compose arbitrary functions
- 3-layer feedforward ANN: any function (approximate with arbitrarily small error) [Cybenko, 1988]
- Functions that ANNs are good at acquiring: Network Efficiently Representable Functions (NERFs); how to characterize? [Russell and Norvig, 1995]
Inductive Bias of ANNs
- n-dimensional Euclidean space (weight space)
- Continuous (error function smooth with respect to weight parameters)
- Preference bias: "smooth interpolation" among positive examples
- Not well understood yet (known to be computationally hard)

Learning Hidden Layer Representations
Hidden Units and Feature Extraction
- Training procedure: hidden unit representations that minimize error E
- Sometimes backprop will define new hidden features that are not explicit in the input representation x, but which capture properties of the input instances that are most relevant to learning the target function t(x)
- Hidden units express newly constructed features
- Change of representation to linearly separable D'
A Target Function (Sparse, aka 1-of-C, Coding)
- Can this be learned? (Why or why not?)

Training: Evolution of Error and Hidden Unit Encoding
[Figures: $error_D(o_k)$ over training; hidden unit encoding $h_j$, $1 \leq j \leq 3$, over training]

Training: Weight Evolution
Input-to-Hidden Unit Weights and Feature Extraction
- Changes in first weight layer values correspond to changes in hidden layer encoding and consequent output squared errors
- $w_0$ (bias weight, analogue of threshold in LTU) converges to a value near 0
- Several changes in first 1000 epochs (different encodings)
[Figure: weights $u_{i1}$, $1 \leq i \leq 8$, over training]

Convergence of Backpropagation
No Guarantee of Convergence to Global Optimum Solution
- Compare: perceptron convergence (to best h ∈ H, provided h ∈ H; i.e., LS)
- Gradient descent to some local error minimum (perhaps not global minimum…)
- Possible improvements on backprop (BP)
  - Momentum term (BP variant with slightly different weight update rule)
  - Stochastic gradient descent (BP algorithm variant)
  - Train multiple nets with different initial weights; find a good mixture
- Improvements on feedforward networks
  - Bayesian learning for ANNs (e.g., simulated annealing); later
  - Other global optimization methods that integrate over multiple networks
Nature of Convergence
- Initialize weights near zero
- Therefore, initial network near-linear
- Increasingly non-linear functions possible as training progresses

Overtraining in ANNs
Recall: Definition of Overfitting
- h' worse than h on $D_{train}$, better on $D_{test}$
Overtraining: A Type of Overfitting
- Due to excessive iterations
- Avoidance: stopping criterion (cross-validation: holdout, k-fold)
- Avoidance: weight decay
[Figures: Error versus epochs (Example 1); Error versus epochs (Example 2)]

Overfitting in ANNs
Other Causes of Overfitting Possible
- Number of hidden units sometimes set in advance
- Too few hidden units ("underfitting")
  - ANNs with no growth
  - Analogy: overdetermined linear system of equations (more equations than unknowns)
- Too many hidden units
  - ANNs with no pruning
  - Analogy: fitting a quadratic polynomial with an approximator of degree >> 2
Solution Approaches
- Prevention: attribute subset selection (using pre-filter or wrapper)
- Avoidance
  - Hold out cross-validation (CV) set or split k ways (when to stop?)
  - Weight decay: decrease each weight by some factor on each epoch
- Detection/recovery: random restarts, addition and deletion of weights, units
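The two avoidance techniques can be sketched independently of any particular network. The helper names, the `patience` parameter, and the validation-error values below are all assumptions made for illustration: `best_epoch` picks the stopping point from a sequence of held-out errors, and `decay` shrinks every weight by a constant factor, as would be done once per epoch.

```python
def best_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest held-out error seen before stopping.

    Training is considered stopped once `patience` epochs pass with no
    improvement over the best validation error so far.
    """
    best, best_err = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_err:
            best, best_err = epoch, err
        elif epoch - best >= patience:
            break  # no improvement for `patience` epochs: stop
    return best

def decay(weights, factor=0.99):
    """Weight decay: shrink each weight toward zero by a constant factor."""
    return [w * factor for w in weights]

# Hypothetical validation curve: improves, then rises as overtraining sets in.
val = [0.9, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5, 0.3]
stop = best_epoch(val)
print(stop, decay([1.0, -2.0], factor=0.5))
```

Note the trade-off in `patience`: with the value used here, the later dip at the end of the curve is never reached, so a small patience saves training time at the risk of stopping before a late improvement.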

Example: Neural Nets for Face Recognition
- 90% accurate learning head pose, recognizing 1-of-20 faces
- 30 x 32 inputs
- Outputs: Left, Straight, Right, Up
[Figures: Hidden Layer Weights after 1 Epoch; Hidden Layer Weights after 25 Epochs; Output Layer Weights (including $w_0 = \theta$) after 1 Epoch]

Example: NetTalk
Sejnowski and Rosenberg, 1987
Early Large-Scale Application of Backprop
- Learning to convert text to speech
  - Acquired model: a mapping from letters to phonemes and stress marks
  - Output passed to a speech synthesizer
- Good performance after training on a vocabulary of ~1000 words
Very Sophisticated Input-Output Encoding
- Input: 7-letter window; determines the phoneme for the center letter and context on each side; distributed (i.e., sparse) representation: 200 bits
- Output: units for articulatory modifiers (e.g., "voiced"), stress, closest phoneme; distributed representation
- 40 hidden units; … weights total
Experimental Results
- Vocabulary: trained on 1024 of 1463 (informal) and 1000 of … (dictionary)
- 78% on informal, ~60% on dictionary

Recurrent Networks
Representing Time Series with ANNs
- Feedforward ANN: y(t + 1) = net(x(t))
- Need to capture temporal relationships
Solution Approaches
- Directed cycles
- Feedback
  - Output-to-input [Jordan]
  - Hidden-to-input [Elman]
  - Input-to-input
- Captures time-lagged relationships
  - Among x(t' ≤ t) and y(t + 1)
  - Among y(t' ≤ t) and y(t + 1)
- Learning with recurrent ANNs
  - Elman, 1990; Jordan, 1987
  - Principe and deVries, 1992
  - Mozer, 1994; Hsu and Ray, 1998
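Hidden-to-input feedback of the Elman kind can be illustrated with a forward pass only. In this sketch the hidden activations from the previous time step are fed back alongside the current input, so the output depends on the input history and not just on x(t). The weight values, matrix names, and sequences are invented for the example, bias terms are omitted for brevity, and no training is shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elman_step(x, h_prev, w_xh, w_hh, w_hy):
    """One time step: new hidden state from x(t) and h(t-1), then the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(w_xh[j], x)) +
                 sum(w * hp for w, hp in zip(w_hh[j], h_prev)))
         for j in range(len(w_hh))]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_hy]
    return h, y

def run(sequence, w_xh, w_hh, w_hy):
    """Feed a whole input sequence and return the final output."""
    h = [0.0] * len(w_hh)  # context units start at zero
    y = []
    for x in sequence:
        h, y = elman_step(x, h, w_xh, w_hh, w_hy)
    return y

# Hypothetical fixed weights: 1 input, 2 hidden units, 1 output.
W_XH = [[0.5], [-0.3]]
W_HH = [[0.8, -0.2], [0.1, 0.4]]
W_HY = [[1.0, -1.0]]

# Two sequences ending in the same input but with different histories.
y_a = run([[1.0], [1.0]], W_XH, W_HH, W_HY)
y_b = run([[0.0], [1.0]], W_XH, W_HH, W_HY)
print(y_a, y_b)
```

A purely feedforward network would give identical outputs for the two sequences because their final inputs match; the difference here comes entirely from the fed-back hidden state.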

Some Current Issues and Open Problems in ANN Research
Hybrid Approaches
- Incorporating knowledge and analytical learning into ANNs
  - Knowledge-based neural networks [Flann and Dietterich, 1989]
  - Explanation-based neural networks [Towell et al, 1990; Thrun, 1996]
- Combining uncertain reasoning and ANN learning and inference
  - Probabilistic ANNs
  - Bayesian networks [Pearl, 1988; Heckerman, 1996; Hinton et al, 1997]; later
Global Optimization with ANNs
- Markov chain Monte Carlo (MCMC) [Neal, 1996], e.g., simulated annealing
- Relationship to genetic algorithms; later
Understanding ANN Output
- Knowledge extraction from ANNs
  - Rule extraction
  - Other decision surfaces
- Decision support and KDD applications [Fayyad et al, 1996]
Many, Many More Issues (Robust Reasoning, Representations, etc.)

Some ANN Applications
Diagnosis
- Closest to pure concept learning and classification
- Some ANNs can be post-processed to produce probabilistic diagnoses
Prediction and Monitoring
- aka prognosis (sometimes forecasting)
- Predict a continuation of (typically numerical) data
Decision Support Systems
- aka recommender systems
- Provide assistance to human "subject matter" experts in making decisions
  - Design (manufacturing, engineering)
  - Therapy (medicine)
  - Crisis management (medical, economic, military, computer security)
Control Automation
- Mobile robots
- Autonomic sensors and actuators
Many, Many More (ANNs for Automated Reasoning, etc.)

NeuroSolutions Demo

Terminology
Multi-Layer ANNs
- Focused on one species: (feedforward) multi-layer perceptrons (MLPs)
- Input layer: an implicit layer containing $x_i$
- Hidden layer: a layer containing input-to-hidden unit weights and producing $h_j$
- Output layer: a layer containing hidden-to-output unit weights and producing $o_k$
- n-layer ANN: an ANN containing n - 1 hidden layers
- Epoch: one training iteration
- Basis function: set of functions that span H
Overfitting
- Overfitting: h does better than h' on training data and worse on test data
- Overtraining: overfitting due to training for too many epochs
- Prevention, avoidance, and recovery techniques
  - Prevention: attribute subset selection
  - Avoidance: stopping (termination) criteria (CV-based), weight decay
Recurrent ANNs: Temporal ANNs with Directed Cycles

Summary Points
Multi-Layer ANNs
- Focused on feedforward MLPs
- Backpropagation of error: distributes penalty (loss) function throughout network
- Gradient learning: takes derivative of error surface with respect to weights
  - Error is based on difference between desired output (t) and actual output (o)
  - Actual output (o) is based on activation function
  - Must take partial derivative of $\sigma$, so choose one that is easy to differentiate
  - Two $\sigma$ definitions: sigmoid (aka logistic) and hyperbolic tangent (tanh)
Overfitting in ANNs
- Prevention: attribute subset selection
- Avoidance: cross-validation, weight decay
ANN Applications: Face Recognition, Text-to-Speech
Open Problems
Recurrent ANNs: Can Express Temporal Depth (Non-Markovity)
Next: Statistical Foundations and Evaluation, Bayesian Learning Intro