BCS547 Neural Decoding


BCS547 Neural Decoding

Population Code. [Figure: tuning curves (activity vs. direction) for the population, and the pattern of activity r (activity vs. preferred direction) evoked by an unknown stimulus s.]

Nature of the problem. In response to a stimulus with unknown orientation s, you observe a pattern of activity r. What can you say about s given r? Bayesian approach: recover p(s|r) (the posterior distribution). Estimation theory: come up with a single-value estimate $\hat{s}$ from r.

Maximum Likelihood. [Figure: tuning curves (activity vs. direction) and the observed pattern of activity r (activity vs. preferred direction).]

Maximum Likelihood Template

Maximum Likelihood Template. [Figure: the noise-free template of population activity plotted against preferred direction.]

Maximum Likelihood. [Figure: the observed pattern of activity and the template, plotted against preferred direction.]

Maximum Likelihood. The maximum likelihood estimate is the value of s maximizing the likelihood p(r|s). Therefore, we seek $\hat{s}$ such that $\hat{s} = \arg\max_s p(\mathbf{r}|s)$, where $p(\mathbf{r}|s)$ is the noise distribution.

Activity distribution. [Figure: the response distributions $P(r_i|s=0)$ and $P(r_i|s=-60)$ for a single neuron.]

Maximum Likelihood. The maximum likelihood estimate is the value of s maximizing the likelihood p(r|s). Therefore, we seek $\hat{s}$ such that $\hat{s} = \arg\max_s p(\mathbf{r}|s)$, where $p(\mathbf{r}|s)$ is the noise distribution. The maximum likelihood estimate is asymptotically unbiased and efficient.

Estimation Theory. [Figure: the encoder (nervous system) converts the stimulus into an activity vector r (activity vs. preferred orientation), which the decoder converts into an estimate of the stimulus.]

[Figure: on each trial (trial 1, trial 2, …, trial 200) the encoder (nervous system) produces a different noisy activity vector (r1, r2, …, r200, plotted against preferred retinal location), each of which is fed to the decoder.]

Estimation Theory. If $E[\hat{s}] = s$, the estimate is said to be unbiased. If the variance $E[(\hat{s}-E[\hat{s}])^2]$ is as small as possible, the estimate is said to be efficient. [Figure: encoder (nervous system) mapping the stimulus to an activity vector r, followed by a decoder.]

Estimation theory. A common measure of decoding performance is the mean square error between the estimate and the true value, $E[(\hat{s}-s)^2]$. This error can be decomposed as: $E[(\hat{s}-s)^2] = \left(E[\hat{s}]-s\right)^2 + E\left[(\hat{s}-E[\hat{s}])^2\right] = \text{bias}^2 + \text{variance}$.
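As a concrete illustration (not from the lecture), the sketch below checks this decomposition numerically. The Gaussian tuning curves, Poisson noise, and center-of-mass decoder (introduced later in this lecture) are all assumptions made for the example.

```python
# A minimal sketch (assumptions mine): numerically checking MSE = bias^2 + variance
# for a decoder, using hypothetical Gaussian tuning curves with Poisson noise
# and a simple center-of-mass decoder.
import numpy as np

rng = np.random.default_rng(0)
prefs = np.linspace(-180, 180, 64, endpoint=False)   # preferred directions (deg)

def tuning(s, width=40.0, gain=20.0):
    """Mean firing rates f_i(s): Gaussian tuning with zero baseline."""
    return gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)

def decode_com(r):
    """Center-of-mass estimate: sum_i r_i * pref_i / sum_i r_i."""
    return np.sum(r * prefs) / np.sum(r)

s_true = 10.0
estimates = np.array([decode_com(rng.poisson(tuning(s_true))) for _ in range(5000)])

mse = np.mean((estimates - s_true) ** 2)
bias = np.mean(estimates) - s_true
var = np.var(estimates)
print(f"MSE = {mse:.3f}, bias^2 + variance = {bias**2 + var:.3f}")   # the two match
```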

Efficient Estimators. The smallest achievable variance for an unbiased estimator is known as the Cramer-Rao bound, $\sigma_{CR}^2$. An efficient estimator is one for which $\sigma_{\hat{s}}^2 = \sigma_{CR}^2$. In general: $\sigma_{\hat{s}}^2 \ge \sigma_{CR}^2$.

Fisher Information. Fisher information is defined as $I(s) = 1/\sigma_{CR}^2$, and it is equal to $I(s) = -E\left[\frac{\partial^2 \ln p(\mathbf{r}|s)}{\partial s^2}\right]$, where $p(\mathbf{r}|s)$ is the distribution of the neuronal noise.

Fisher Information

For one neuron with Poisson noise, $I(s) = \frac{f'(s)^2}{f(s)}$. For n independent neurons: $I(s) = \sum_{i=1}^{n} \frac{f_i'(s)^2}{f_i(s)}$. The more neurons, the better! Large slope is good! Small variance is good!
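The following sketch (mine, not from the slides) evaluates this sum for a hypothetical population of independent Poisson neurons with Gaussian tuning curves, illustrating that Fisher information grows with the number of neurons.

```python
# A minimal sketch (assumptions mine): Fisher information for a population of
# independent Poisson neurons, I(s) = sum_i f_i'(s)^2 / f_i(s), with
# hypothetical Gaussian tuning curves.
import numpy as np

def fisher_info(s, prefs, width=40.0, gain=20.0):
    f = gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)   # f_i(s)
    fprime = f * (prefs - s) / width ** 2                  # d f_i / ds
    return np.sum(fprime ** 2 / f)

for n in (16, 64, 256):
    prefs = np.linspace(-180, 180, n, endpoint=False)
    print(n, fisher_info(0.0, prefs))   # grows roughly linearly with n
```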

Fisher Information and Tuning Curves. Fisher information is maximum where the slope is maximum. This is consistent with adaptation experiments.

Fisher Information. In 1D, Fisher information decreases as the width of the tuning curves increases. In 2D, Fisher information does not depend on the width of the tuning curves. In 3D and above, Fisher information increases as the width of the tuning curves increases. WARNING: this is true for independent Gaussian noise.

Ideal observer. The discrimination threshold of an ideal observer, $\delta s$, is proportional to $\sigma_{CR}$, the standard deviation given by the Cramer-Rao bound. In other words, an efficient estimator is an ideal observer.

An ideal observer is an observer that can recover all the Fisher information in the activity (an easy link between Fisher information and behavioral performance). If all distributions are Gaussian, Fisher information is the same as Shannon information.

Estimation theory. Other examples of decoders. [Figure: encoder (nervous system), activity vector r, decoder.]

Voting Methods Optimal Linear Estimator

Linear Estimators

X and Y must be zero mean. For zero-mean X and Y, the optimal linear estimate is $\hat{Y} = wX$ with $w = \langle XY \rangle / \langle X^2 \rangle$: trust cells that have small variances and large covariances (with the stimulus).
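Here is a hedged sketch of how such an optimal linear estimator could be fit in practice by least squares on simulated data; the tuning curves, Poisson noise, and stimulus range are illustrative assumptions, not the lecture's.

```python
# A minimal sketch (assumptions mine): fitting an optimal linear estimator (OLE)
# by least squares, s_hat = w . r, on simulated population responses.
import numpy as np

rng = np.random.default_rng(1)
prefs = np.linspace(-90, 90, 32)

def rates(s, width=30.0, gain=10.0):
    return gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)

# Training set: many (r, s) pairs with Poisson noise
s_train = rng.uniform(-60, 60, size=2000)
R = np.array([rng.poisson(rates(s)) for s in s_train], dtype=float)

# Zero-mean the responses and the stimulus, then solve the normal equations
R0 = R - R.mean(axis=0)
s0 = s_train - s_train.mean()
w = np.linalg.lstsq(R0, s0, rcond=None)[0]     # w = argmin ||R0 w - s0||^2

def ole_decode(r):
    return (r - R.mean(axis=0)) @ w + s_train.mean()

print(ole_decode(rng.poisson(rates(20.0))))    # should be near 20
```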

Voting Methods Optimal Linear Estimator

Voting Methods: Optimal Linear Estimator, Center of Mass. The center-of-mass estimate is linear in $r_i / \sum_j r_j$, with the weights set to $s_i$ (the preferred stimulus of cell i).

Center of Mass/Population Vector. The center of mass is optimal (unbiased and efficient) iff the tuning curves are Gaussian with a zero baseline and uniformly distributed, and the noise follows a Poisson distribution. In general, the center of mass has a large bias and a large variance.

Voting Methods: Optimal Linear Estimator, Center of Mass, Population Vector. The population vector estimate is linear in $r_i$, with the weights set to $\mathbf{P}_i$ (the preferred direction vector of cell i), followed by a nonlinear step.

Population Vector. [Figure: the weighted preferred-direction vectors $r_i \mathbf{P}_i$ and their sum, the population vector $\mathbf{P}$, which points close to the stimulus direction s.]

Typically, the population vector is not the optimal linear estimator.

Population Vector. The population vector is optimal iff the tuning curves are cosine and uniformly distributed, and the noise follows a normal distribution with fixed variance. In most cases, the population vector is biased and has a large variance. The variance of the population vector estimate does not reflect Fisher information.
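For concreteness, a minimal population-vector decoder might look like the following sketch; the cosine tuning parameters and Poisson noise are assumptions for illustration.

```python
# A minimal sketch (assumptions mine): the population vector decoder.
# Each cell votes with a unit vector along its preferred direction, weighted
# by its response; the estimate is the angle of the vector sum.
import numpy as np

rng = np.random.default_rng(2)
prefs_deg = np.linspace(0, 360, 64, endpoint=False)
prefs_rad = np.deg2rad(prefs_deg)

def rates(s_deg, gain=20.0, baseline=5.0):
    # hypothetical cosine tuning, rectified at zero
    return np.maximum(baseline + gain * np.cos(prefs_rad - np.deg2rad(s_deg)), 0.0)

def pop_vector(r):
    x = np.sum(r * np.cos(prefs_rad))
    y = np.sum(r * np.sin(prefs_rad))
    return np.rad2deg(np.arctan2(y, x)) % 360   # nonlinear step: take the angle

r = rng.poisson(rates(120.0))
print(pop_vector(r))   # should be near 120
```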

Population Vector. [Figure: variance of the population vector estimate compared with the Cramer-Rao bound.] The population vector should NEVER be used to estimate information content! The indirect method is prone to severe problems…

Population Vector

Maximum Likelihood. [Figure: the observed pattern of activity and the fitted template, plotted against preferred direction.]

Maximum Likelihood. If the noise is Gaussian and independent, $p(\mathbf{r}|s) \propto \prod_i \exp\left(-\frac{(r_i - f_i(s))^2}{2\sigma^2}\right)$. Therefore, maximizing the likelihood amounts to minimizing the Euclidean distance $d = \sum_i (r_i - f_i(s))^2$, and the estimate is given by template matching: $\hat{s} = \arg\min_s \sum_i (r_i - f_i(s))^2$.
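A minimal sketch of this template-matching decoder follows; the stimulus grid, Gaussian tuning curves, and noise level are illustrative assumptions.

```python
# A minimal sketch (assumptions mine): maximum likelihood decoding under
# independent Gaussian noise reduces to template matching with the Euclidean
# distance over a grid of candidate stimuli.
import numpy as np

rng = np.random.default_rng(3)
prefs = np.linspace(-90, 90, 32)
s_grid = np.linspace(-90, 90, 721)

def f(s, width=30.0, gain=10.0):
    """Tuning curves f_i(s): mean response of each neuron to stimulus s."""
    return gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)

templates = np.array([f(s) for s in s_grid])           # one template per candidate s

def ml_gaussian(r):
    d = np.sum((templates - r) ** 2, axis=1)           # Euclidean distance to each template
    return s_grid[np.argmin(d)]

r = f(25.0) + rng.normal(0.0, 1.0, size=prefs.size)    # Gaussian noise, sigma = 1
print(ml_gaussian(r))                                  # should be near 25
```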

Gradient descent for ML. To minimize this distance (the negative log likelihood) with respect to s, one can use a gradient descent technique in which s is updated according to $s \leftarrow s - \epsilon\, \frac{\partial E(s)}{\partial s}$, where $E(s)$ is the distance being minimized and $\epsilon$ is a small learning rate.
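The same estimate can be obtained iteratively; the sketch below (mine, not the lecture's) runs gradient descent on the squared-error distance $E(s)=\sum_i (r_i - f_i(s))^2$, with the same hypothetical Gaussian tuning curves as above.

```python
# A minimal sketch (assumptions mine): gradient descent on E(s) = sum_i (r_i - f_i(s))^2.
import numpy as np

prefs = np.linspace(-90, 90, 32)

def f(s, width=30.0, gain=10.0):
    return gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)

def df_ds(s, width=30.0, gain=10.0):
    return f(s, width, gain) * (prefs - s) / width ** 2

def ml_gradient_descent(r, s0=0.0, lr=0.05, n_steps=200):
    s = s0
    for _ in range(n_steps):
        grad = -2.0 * np.sum((r - f(s)) * df_ds(s))   # dE/ds
        s -= lr * grad
    return s

r = f(25.0)                      # noiseless responses for a quick check
print(ml_gradient_descent(r))    # should converge near 25
```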

Gaussian noise with variance proportional to the mean. If the noise is Gaussian with variance proportional to the mean, the distance being minimized changes to $d = \sum_i \frac{(r_i - f_i(s))^2}{f_i(s)}$. Data points with small variance are weighted more heavily.

Poisson noise. If the noise is Poisson, then $p(\mathbf{r}|s) = \prod_i \frac{e^{-f_i(s)} f_i(s)^{r_i}}{r_i!}$, and: $\log p(\mathbf{r}|s) = \sum_i \left[ r_i \log f_i(s) - f_i(s) \right] + \text{const}$.
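A sketch of maximum likelihood decoding under Poisson noise, maximizing $\sum_i [r_i \log f_i(s) - f_i(s)]$ over a grid of candidate stimuli; the tuning curves and parameters are again illustrative assumptions.

```python
# A minimal sketch (assumptions mine): grid-search ML decoding under independent
# Poisson noise, using the Poisson log likelihood instead of the Euclidean distance.
import numpy as np

rng = np.random.default_rng(4)
prefs = np.linspace(-90, 90, 32)
s_grid = np.linspace(-90, 90, 721)

def f(s, width=30.0, gain=10.0):
    return gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)

templates = np.array([f(s) for s in s_grid])                   # f_i(s) on the grid

def ml_poisson(r):
    loglik = r @ np.log(templates).T - templates.sum(axis=1)   # sum_i r_i log f_i(s) - f_i(s)
    return s_grid[np.argmax(loglik)]

r = rng.poisson(f(-40.0))
print(ml_poisson(r))     # should be near -40
```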

ML and template matching. Maximum likelihood is a template matching procedure, BUT the metric used is not always the Euclidean distance; it depends on the noise distribution.

Bayesian approach. We want to recover p(s|r). Using Bayes' theorem, we have: $p(s|\mathbf{r}) = \frac{p(\mathbf{r}|s)\,p(s)}{p(\mathbf{r})}$, where $p(s|\mathbf{r})$ is the posterior distribution over s, $p(\mathbf{r}|s)$ is the likelihood of s, $p(s)$ is the prior distribution over s, and $p(\mathbf{r})$ is the prior distribution over r.

Bayesian approach. What is the likelihood of s, $p(\mathbf{r}|s)$? It is the distribution of the noise… It is the same distribution we used for maximum likelihood.

Bayesian approach. The prior p(s) corresponds to any knowledge we may have about s before we get to see any activity. Ex: a prior for smooth and slow motions.

Bayesian approach. Once we have $p(s|\mathbf{r})$, we can proceed in two different ways. We can keep this distribution for Bayesian inferences (as we would do in a Bayesian network), or we can make a decision about s. For instance, we can estimate s as being the value that maximizes $p(s|\mathbf{r})$. This is known as the maximum a posteriori (MAP) estimate. For a flat prior, ML and MAP are equivalent.
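As an illustration (not the lecture's code), the posterior and the MAP estimate can be computed on a grid; here with independent Poisson noise and a Gaussian prior over s, both chosen as assumptions for the example.

```python
# A minimal sketch (assumptions mine): compute p(s|r) on a grid for independent
# Poisson noise and a Gaussian prior over s, then take the MAP estimate.
import numpy as np

rng = np.random.default_rng(5)
prefs = np.linspace(-90, 90, 32)
s_grid = np.linspace(-90, 90, 721)

def f(s, width=30.0, gain=10.0):
    return gain * np.exp(-0.5 * ((prefs - s) / width) ** 2)

templates = np.array([f(s) for s in s_grid])

def posterior(r, prior_mean=0.0, prior_sd=30.0):
    log_lik = r @ np.log(templates).T - templates.sum(axis=1)    # log p(r|s) + const
    log_prior = -0.5 * ((s_grid - prior_mean) / prior_sd) ** 2   # log p(s) + const
    log_post = log_lik + log_prior
    post = np.exp(log_post - log_post.max())                     # avoid overflow
    return post / post.sum()                                     # normalized p(s|r)

r = rng.poisson(f(35.0))
post = posterior(r)
print("MAP estimate:", s_grid[np.argmax(post)])   # pulled slightly toward the prior mean
```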

Bayesian approach. Limitations: the Bayesian approach and ML require a lot of data (estimating p(r|s) requires at least $n + n(n+1)/2$ parameters for a multivariate Gaussian)… Alternatives: (1) Naïve Bayes: assume independence and hope for the best; (2) use a clever method for fitting p(r|s); (3) estimate p(s|r) directly using a nonlinear estimate; (4) hope the brain uses likelihood functions that have only N free parameters, e.g., the exponential family with linear sufficient statistics.

Bayesian approach: logistic regression. Example: decoding finger movements in M1. On each trial, we observe 100 cells and we want to know which one of the 5 fingers is being moved. [Figure: a network with 100 input units (the activity vector r) feeding 5 output categories, each computing P(F|r) through a nonlinearity g(x).]

Bayesian approach: logistic regression. Example: 5N free parameters instead of O(N²). [Figure: the same network with 100 input units and 5 output categories.]
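A minimal sketch of such a decoder, using scikit-learn's multinomial logistic regression on synthetic data; the simulated "M1" responses and all parameters are made up for illustration.

```python
# A minimal sketch (assumptions mine): softmax/logistic-regression decoding of
# which of 5 fingers moved from the activity of 100 cells, on simulated data.
# The model has about 5N free parameters (5 x 100 weights plus 5 biases).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n_cells, n_fingers, n_trials = 100, 5, 1000

# Simulated tuning: each cell has a different mean rate for each finger
mean_rates = rng.uniform(2.0, 20.0, size=(n_fingers, n_cells))
fingers = rng.integers(0, n_fingers, size=n_trials)
R = rng.poisson(mean_rates[fingers])                  # trials x cells, Poisson noise

clf = LogisticRegression(max_iter=2000).fit(R[:800], fingers[:800])
print("held-out accuracy:", clf.score(R[800:], fingers[800:]))
print("P(finger | r) for one trial:", clf.predict_proba(R[800:801]).round(2))
```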

Bayesian approach: multinomial distributions. Example: decoding finger movements in M1. Each finger can take 3 mutually exclusive states: no movement, flexion, extension. [Figure: the activity of the N M1 neurons feeds through a weight matrix W into a softmax group for each of digits 1-5 and the wrist, each outputting the probability of no movement, flexion, and extension.]

Decoding time varying signals. [Figure: a time-varying stimulus s(t) and the corresponding response over time.]

Decoding time varying signals Note the time shift…

Decoding time varying signals. The estimate is a discrete sum of templates (kernels) centered on the spikes: $s_{est}(t) \approx \sum_j K(t - t_j)$.

Decoding time varying signals. [Figure: the stimulus s(t), the response, and the reconstruction over time.]

Decoding time varying signals Finding the optimal kernel (similar to OLE)

Decoding time varying signals. The optimal kernel is obtained from an equation relating the autocorrelation function of the spike train (Appendix A, chapter 2) to the correlation of the firing rate and stimulus. If the spike train is uncorrelated, the optimal kernel is the spike-triggered average of the stimulus.
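A sketch of the uncorrelated case: estimating the kernel as the spike-triggered average of a white-noise stimulus. The encoding model (rate driven by the stimulus 20 ms in the past) and all parameters are assumptions for illustration.

```python
# A minimal sketch (assumptions mine): estimating the decoding kernel as the
# spike-triggered average (STA) of the stimulus, valid when the spike train is
# (approximately) uncorrelated.
import numpy as np

rng = np.random.default_rng(7)
dt, T = 0.001, 200.0                          # 1 ms bins, 200 s of data
t = np.arange(0, T, dt)
stim = rng.normal(0.0, 1.0, size=t.size)      # white-noise stimulus

# Hypothetical encoder: rate depends on the stimulus ~20 ms in the past
lag = int(0.020 / dt)
rate = 20.0 + 15.0 * np.roll(stim, lag)       # Hz
spikes = rng.random(t.size) < np.clip(rate, 0, None) * dt

# Spike-triggered average over a +/-100 ms window around each spike
window = int(0.100 / dt)
spike_idx = np.nonzero(spikes)[0]
spike_idx = spike_idx[(spike_idx > window) & (spike_idx < t.size - window)]
sta = np.mean([stim[i - window:i + window + 1] for i in spike_idx], axis=0)

lags_ms = np.arange(-window, window + 1) * dt * 1000
print("peak of STA at lag (ms):", lags_ms[np.argmax(sta)])   # should be near -20 ms
```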