Statistical learning and optimal control: A framework for biological learning and motor control Lecture 1: Iterative learning and the Kalman filter Reza Shadmehr Johns Hopkins School of Medicine

[Figure: block diagram of the sensorimotor loop. A goal selector and stochastic optimal control generate motor commands; the body and environment change state; a forward model predicts the sensory consequences, while the sensory system (proprioception, vision, audition) measures them; the Kalman filter integrates predicted and measured sensory consequences, and parameter estimation updates the belief about the state of body and world.]

Results from classical conditioning

Effect of time on memory: spontaneous recovery

Effect of time on memory: inter-trial interval and retention
[Figure: performance during training with ITI = 2, 14, or 98, with testing at 1 day or 1 week (averaged together) and a test at 1 week.]

Integration of predicted state with sensory feedback

Choice of motor commands: optimality in saccades and reaching movements
[Figure: eye velocity (deg/sec) profiles over time for saccades of size 5 to 50 deg.]

Helpful reading:
Mathematical background:
Raul Rojas, The Kalman Filter. Freie Universität Berlin.
N.A. Thacker and A.J. Lacey, Tutorial: The Kalman Filter. University of Manchester.
Application to animal learning:
Peter Dayan and Angela J. Yu (2003) Uncertainty and learning. IETE Journal of Research 49:171-182.
Application to sensorimotor control:
D.M. Wolpert, Z. Ghahramani, M.I. Jordan (1995) An internal model for sensorimotor integration. Science 269:1880-1882.

Linear regression, maximum likelihood, and parameter uncertainty
A noisy process produces n data points, and we form an ML estimate of w. If we run the noisy process again with the same sequence of x's and re-estimate w, the distribution of the resulting estimates will have a var-cov matrix that depends only on the sequence of inputs, the bases that encode those inputs, and the noise sigma.

Bias of the parameter estimates for a given X
How does the ML estimate behave in the presence of noise in y? The "true" underlying process is y = X w* + e, where e is an n x 1 vector of noise; what we measured is y, and our model of the process is y_hat = X w. The ML estimate is w_hat = (X^T X)^{-1} X^T y. Because e is normally distributed with zero mean, E[w_hat] = w*. In other words, for a given X the ML estimate is unbiased.

Variance of the parameter estimates for a given X
The estimate w_hat = (X^T X)^{-1} X^T y is a matrix of constants times a vector of random variables. Assume e ~ N(0, sigma^2 I). Then for a given X, the ML (or least-squares) estimate of our parameter has the normal distribution w_hat ~ N(w*, sigma^2 (X^T X)^{-1}), whose var-cov matrix is m x m.
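The claim above can be checked numerically. The sketch below (with a made-up input matrix, weights, and noise level) reruns the noisy process many times with the same X and compares the empirical var-cov of the ML estimates against sigma^2 (X^T X)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 200, 2, 0.5
X = rng.normal(size=(n, m))        # fixed input sequence, reused on every run
w_true = np.array([1.0, -2.0])     # illustrative "true" weights

# Rerun the noisy process, re-estimating w each time
estimates = []
for _ in range(5000):
    y = X @ w_true + sigma * rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(w_hat)
estimates = np.array(estimates)

analytic_cov = sigma**2 * np.linalg.inv(X.T @ X)
empirical_cov = np.cov(estimates.T)
```

The mean of the estimates comes out near w_true (unbiasedness), and the empirical var-cov matches the analytic formula, which indeed depends only on X and sigma.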

The Gaussian distribution and its var-cov matrix
A 1-D Gaussian distribution is defined as p(x) = (2 pi sigma^2)^{-1/2} exp(-(x - mu)^2 / (2 sigma^2)). In n dimensions it generalizes to p(x) = (2 pi)^{-n/2} |C|^{-1/2} exp(-(1/2) (x - mu)^T C^{-1} (x - mu)). When x is a vector, the variance is expressed in terms of a covariance matrix C, with elements C_ij = rho_ij sigma_i sigma_j, where rho_ij corresponds to the degree of correlation between variables x_i and x_j.
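As a small sketch (with arbitrary example values), a 2-D covariance matrix can be built from the standard deviations and a correlation coefficient, and the correlation recovered empirically by sampling:

```python
import numpy as np

# C_ij = rho_ij * sigma_i * sigma_j  (example values are arbitrary)
sigma1, sigma2, rho = 1.0, 2.0, 0.8
C = np.array([[sigma1**2,             rho * sigma1 * sigma2],
              [rho * sigma1 * sigma2, sigma2**2]])

rng = np.random.default_rng(1)
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=20000)
emp_rho = np.corrcoef(x.T)[0, 1]   # close to 0.8
```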

[Figure: samples from three 2-D Gaussians in which x1 and x2 are positively correlated, not correlated, and negatively correlated.]

Parameter uncertainty: Example 1
[Figure: input history over trials.]
x1 was “on” most of the time. I’m pretty certain about w1. However, x2 was “on” only once, so I’m uncertain about w2.

Parameter uncertainty: Example 2
[Figure: input history over trials.]
x1 and x2 were “on” mostly together. The weight var-cov matrix shows that the errors in w1 and w2 are strongly anti-correlated: I learned something about their combined effect, but I do not know the individual values of w1 and w2 with much certainty. x1 appeared slightly more often than x2, so I’m a little more certain about the value of w1.

Parameter uncertainty: Example 3
[Figure: input history over trials.]
x2 was mostly “on”. I’m pretty certain about w2, but I am very uncertain about w1. Occasionally x1 and x2 were on together, so the off-diagonal term of the var-cov matrix is nonzero: I have some reason to believe that the errors in w1 and w2 are correlated.
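The three examples all follow from computing sigma^2 (X^T X)^{-1} for the given input history. A minimal sketch with invented histories (rows are trials, columns are x1 and x2):

```python
import numpy as np

def weight_cov(X, sigma2=1.0):
    """Var-cov of the ML weight estimate for a given input history X."""
    return sigma2 * np.linalg.inv(X.T @ X)

# Example-1 style history: x1 on every trial, x2 on only once
X1 = np.array([[1, 0], [1, 0], [1, 0], [1, 1]], float)
# Example-2 style history: x1 and x2 mostly on together
X2 = np.array([[1, 1], [1, 1], [1, 1], [1, 0]], float)

P1 = weight_cov(X1)   # small variance for w1, large variance for w2
P2 = weight_cov(X2)   # negative off-diagonal: errors in w1, w2 anti-correlated
```

With X1, the variance of the w1 estimate is much smaller than that of w2; with X2, the off-diagonal term is negative, and w1 (seen slightly more often) is slightly better known than w2.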

Effect of uncertainty on learning rate
When you observe an error in trial n, the amount that you should change w should depend on how certain you are about w. The more certain you are, the less you should be influenced by the error. The less certain you are, the more you should “pay attention” to the error. The update is w(n) = w(n-1) + k(n) (y(n) - x(n)^T w(n-1)), where w and k are m x 1 vectors and k(n) is the Kalman gain on the scalar error.
Rudolf E. Kalman (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering 82 (Series D): 35-45. Research Institute for Advanced Study, Baltimore, MD.

Example of the Kalman gain: running estimate of average
w(n) is the online estimate of the mean of y: w(n) = w(n-1) + (1/n) (y(n) - w(n-1)), a weighted combination of the past estimate and the new measurement. As n increases, we trust our past estimate w(n-1) a lot more than the new observation y(n). Kalman gain k(n) = 1/n: the learning rate decreases as the number of samples increases.
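A minimal sketch of this recursion, with the shrinking gain k(n) = 1/n:

```python
def running_mean(ys):
    """Online estimate of the mean of y using the update
    w(n) = w(n-1) + k(n) * (y(n) - w(n-1)), with k(n) = 1/n."""
    w = 0.0
    for n, y in enumerate(ys, start=1):
        k = 1.0 / n              # gain shrinks: trust the past more as n grows
        w = w + k * (y - w)
    return w
```

For example, running_mean([2.0, 4.0, 6.0, 8.0]) gives 5.0, identical to the batch mean.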

Example of the Kalman gain: running estimate of variance
sigma_hat is the online estimate of the variance of y, updated with the same shrinking-gain recursion.

Objective: adjust the learning gain in order to minimize model uncertainty
Hypothesis about the data: y(n) = x(n)^T w* + e(n), the observation in trial n.
w_hat(n|n-1): my estimate of w* before I see y in trial n, given that I have seen y up to n-1.
Error in trial n: y(n) - x(n)^T w_hat(n|n-1).
w_hat(n|n): my estimate after I see y in trial n.
w* - w_hat(n|n-1): parameter error before I saw the data (a priori error).
w* - w_hat(n|n): parameter error after I saw the data point (a posteriori error).
P(n|n-1): a priori var-cov of the parameter error. P(n|n): a posteriori var-cov of the parameter error.

Some observations about model uncertainty
We note that P(n) is simply the var-cov matrix of our model weights. It represents the uncertainty in our model. We want to update the weights so as to minimize a measure of this uncertainty.

Trace of the parameter var-cov matrix is the sum of squared parameter errors
Our objective is to find the learning rate k (the Kalman gain) such that we minimize the sum of the squared errors in our parameter estimates. This sum is the trace of the P matrix. Therefore, given observation y(n), we want to find k such that we minimize the variance of our estimate w.

Find K to minimize trace of uncertainty

Find K to minimize trace of uncertainty (continued)
Setting the derivative of the trace with respect to k to zero gives k(n) = P(n|n-1) x(n) / (x(n)^T P(n|n-1) x(n) + sigma^2); note that the denominator is a scalar.

The Kalman gain If I have a lot of uncertainty about my model, P is large compared to sigma. I will learn a lot from the current error. If I am pretty certain about my model, P is small compared to sigma. I will tend to ignore the current error.

Update of model uncertainty
After the update, P(n|n) = (I - k(n) x(n)^T) P(n|n-1). Model uncertainty decreases with every data point that you observe.
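One trial of this gain-and-uncertainty update can be sketched as follows (the prior P and input x below are illustrative values, not from the slides):

```python
import numpy as np

def kalman_trial_update(P, x, sigma2):
    """One trial of the scalar-output filter:
    k = P x / (x' P x + sigma2);  P_new = (I - k x') P."""
    Px = P @ x
    k = Px / (x @ Px + sigma2)
    P_new = P - np.outer(k, Px)   # since P is symmetric, x' P = (P x)'
    return k, P_new

P0 = np.eye(2)                    # large prior uncertainty -> large gain
x = np.array([1.0, 0.0])          # only x1 is active this trial
k, P1 = kalman_trial_update(P0, x, sigma2=0.1)
```

The gain is concentrated on w1, the weight of the active input, and the uncertainty about w1 shrinks while the uncertainty about w2 is untouched.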

Hidden variable: in this model, we hypothesize that the hidden variables, i.e., the “true” weights, do not change from trial to trial: w(n+1) = w(n). Observed variables: y(n) = x(n)^T w(n) + e(n). We begin with an a priori estimate of the mean and variance of the hidden variable before observing the first data point, update the estimate of the hidden variable after observing each data point, and forward-project the estimate to the next trial (here simply w_hat(n+1|n) = w_hat(n|n), P(n+1|n) = P(n|n)).

In this model, we hypothesize that the hidden variables change from trial to trial: w(n+1) = A w(n) + eps(n), with state noise eps(n) of var-cov Q. As before, we begin with an a priori estimate of the mean and variance of the hidden variable before observing the first data point and update the estimate after observing each data point; the forward projection to the next trial becomes w_hat(n+1|n) = A w_hat(n|n), P(n+1|n) = A P(n|n) A^T + Q.
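Putting the update and the forward projection together gives the trial-by-trial filter. A sketch (the prior mean and variance, and the test data, are assumptions for illustration):

```python
import numpy as np

def kalman_filter(X, y, sigma2, A, Q):
    """Trial-by-trial filter for y(n) = x(n)' w(n) + e(n), with hidden
    weights assumed to evolve as w(n+1) = A w(n) + state noise (cov Q)."""
    m = X.shape[1]
    w = np.zeros(m)                 # a priori estimate before the first data point
    P = np.eye(m)
    for x, yn in zip(X, y):
        Px = P @ x
        k = Px / (x @ Px + sigma2)  # Kalman gain
        w = w + k * (yn - x @ w)    # update after observing y(n)
        P = P - np.outer(k, Px)
        w = A @ w                   # forward projection to the next trial
        P = A @ P @ A.T + Q         # uncertainty grows by Q
    return w, P

rng = np.random.default_rng(2)
n, m = 50, 2
X = rng.normal(size=(n, m))
w_true = np.array([1.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=n)
w_hat, P = kalman_filter(X, y, sigma2=0.01, A=np.eye(m), Q=np.zeros((m, m)))
```

With A = I and Q = 0 this reduces to the stationary-weight model of the previous slide; the estimate converges near the true weights while the trace of P shrinks.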

Learning rate is proportional to the ratio between two uncertainties: uncertainty about my model parameters (P) versus uncertainty about my measurement (sigma^2). After we observe an input x, the uncertainty associated with the weight of that input decreases. Because of state-update noise Q, uncertainty increases as we form the prior for the next trial.

Comparison of Kalman gain to LMS (see derivation in homework)
In the Kalman gain approach, the P matrix depends on the history of all previous and current inputs, and our estimate converges in a single pass over the data set. In LMS, the learning rate is simply a constant that does not depend on past history; we don’t estimate the var-cov matrix P on each trial, but we will need multiple passes over the data before our estimate converges.
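The one-pass difference can be seen in a small sketch (invented data; y is noiseless so the comparison isolates the two learning rules, and the LMS rate and Kalman prior are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 2
X = rng.normal(size=(n, m))
w_true = np.array([2.0, -1.0])
y = X @ w_true                              # noiseless observations

def lms_one_pass(eta=0.05):
    w = np.zeros(m)
    for x, yn in zip(X, y):
        w = w + eta * (yn - x @ w) * x      # constant learning rate
    return w

def kalman_one_pass(sigma2=1e-6):
    w, P = np.zeros(m), 10.0 * np.eye(m)    # assumed broad prior
    for x, yn in zip(X, y):
        Px = P @ x
        k = Px / (x @ Px + sigma2)          # history-dependent gain
        w = w + k * (yn - x @ w)
        P = P - np.outer(k, Px)
    return w

err_lms = np.linalg.norm(lms_one_pass() - w_true)
err_kal = np.linalg.norm(kalman_one_pass() - w_true)
```

After a single pass, the Kalman estimate is essentially at the true weights, while LMS still carries residual error and would need further passes.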

Effect of state and measurement noise on the Kalman gain
[Figure: parameter uncertainty and Kalman gain over trials for different noise levels.]
High noise in the state-update model produces increased uncertainty in model parameters. This produces high learning rates. High noise in the measurement also increases parameter uncertainty, but this increase is small relative to measurement uncertainty. Higher measurement noise leads to lower learning rates.

Effect of state-transition auto-correlation on the Kalman gain
[Figure: Kalman gain over trials for different values of a.]
Learning rate is higher in a state model that has high auto-correlation (larger a). That is, if the learner assumes that the world is changing slowly (a is close to 1), then the learner will have a large learning rate.