MACHINE LEARNING - Doctoral Class - EDIC EPFL - 2006 - Aude Billard. Information Theory and The Neuron - II.



Overview
LECTURE I: The Neuron - Biological Inspiration; Information Theory and the Neuron; Weight Decay + Anti-Hebbian Learning → PCA; Anti-Hebbian Learning → ICA
LECTURE II: Capacity of the Single Neuron; Capacity of Associative Memories (Willshaw Net, Extended Hopfield Network)
LECTURE III: Continuous Time-Delay NN; Limit Cycles, Stability and Convergence

Neural Processing - The Brain
[Figure: neuron anatomy (cell body, dendrites, synapse) and the electrical potential over time: integration, decay/depolarization, refractory period]
A neuron receives and integrates input from other neurons. Once the input exceeds a critical level, the neuron discharges a spike. This spiking event, also called depolarization, is followed by a refractory period during which the neuron is unable to fire.

Information Theory and The Neuron
[Figure: single neuron with four weighted inputs (w1..w4) and output y]
You can view the neuron as a memory. What can you store in this memory? What is its maximal capacity? How can you find a learning rule that maximizes this capacity?

Information Theory and The Neuron
A fundamental requirement for learning systems is robustness to noise. One way to measure a system's robustness to noise is to determine the mutual information between its inputs and its output.

Information Theory and The Neuron
[Figure: single neuron with weighted inputs w1..w4 and output y]
Consider the neuron as a sender-receiver system, with X the message sent and y the message received. Information theory gives a measure of the information conveyed by y about X. If the transmission system is imperfect (noisy), you must find a way to ensure minimal disturbance in the transmission.

Information Theory and The Neuron
[Figure: single neuron with weighted inputs w1..w4 and output y]
The mutual information between the neuron output y and its inputs x grows with the signal-to-noise ratio. To maximize this ratio, one can simply increase the magnitude of the weights.
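The slide's formula was lost in the transcript. A standard form, assuming the linear-Gaussian infomax setting of Linsker (output y = w^T x + \nu, input covariance C, additive output noise of variance \sigma_\nu^2; these assumptions are not stated on the slide), is

    I(x;y) = \frac{1}{2}\ln\frac{w^T C w + \sigma_\nu^2}{\sigma_\nu^2} = \frac{1}{2}\ln\left(1 + \frac{w^T C w}{\sigma_\nu^2}\right),

which grows without bound as the weights are scaled up, hence the remark about simply increasing their magnitude.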

Information Theory and The Neuron
[Figure: single neuron with weighted inputs w1..w4 and output y]
The mutual information between the neuron output y and its inputs X again depends on a signal-to-noise ratio, but this time one cannot simply increase the magnitude of the weights, as this affects the noise term as well.
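Again the slide's equation is missing; one consistent reconstruction (an assumption) is the case where independent noise of variance \sigma_\nu^2 is added to each input, y = \sum_i w_i (x_i + \nu_i), which gives

    I(x;y) = \frac{1}{2}\ln\frac{w^T C w + \sigma_\nu^2 \|w\|^2}{\sigma_\nu^2 \|w\|^2}.

Rescaling w multiplies numerator and denominator alike, so growing the weights no longer increases the mutual information; only the direction of w matters.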

Information Theory and The Neuron

How can we define a learning rule that optimizes the mutual information?

Hebbian Learning
[Figure: input and output units connected by weighted synapses]
If the input x_i and the output y_i fire simultaneously, the weight of the connection between them is strengthened in proportion to the strength of their firing.
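The rule itself appeared only as an image on the slide; its standard form, for a linear unit y = \sum_j w_j x_j, is

    \Delta w_i = \eta \, x_i \, y,

with \eta a small learning rate.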

Hebbian Learning - Limit Cycle / Stability?
At a fixed point the average weight change vanishes for all i; thus w is an eigenvector of C with associated eigenvalue 0. Under a small disturbance, the weights tend to grow in the direction of the eigenvector of C with the largest eigenvalue. C is a positive semi-definite, symmetric matrix, so all its eigenvalues are >= 0.
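The missing steps, reconstructed under the standard assumption of zero-mean inputs with covariance C = \langle x x^T \rangle: averaging the Hebbian rule over the training patterns gives

    \langle \Delta w \rangle = \eta \, \langle y\, x \rangle = \eta\, C w,

so a fixed point requires C w = 0, i.e. w is an eigenvector of C with eigenvalue 0. Writing a small perturbation of w in the eigenbasis of C, the component along eigenvector e_k is multiplied by (1 + \eta \lambda_k) at each step; the component with the largest eigenvalue grows fastest, and without an additional constraint \|w\| diverges.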

Hebbian Learning - Weight Decay
The simple weight decay rule belongs to a class of decay rules called subtractive rules. The only advantage of subtractive rules over simply clipping the weights is that they eliminate weights that have little importance. Another important type of decay rule is the multiplicative rule. The advantage of multiplicative rules is that, in addition to keeping the weights small, they also yield useful weights.
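The two families can be sketched as follows (generic forms, not the slide's exact equations): a subtractive rule removes an amount independent of the weight's size,

    \Delta w_i = \eta\, x_i y - \gamma,

while a multiplicative rule removes an amount proportional to the weight,

    \Delta w_i = \eta\, x_i y - \gamma\, w_i,

which is why multiplicative decay leaves a graded, informative weight profile rather than driving unimportant weights to a fixed bound.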

Oja's One-Neuron Model
[Figure: single neuron with weighted inputs w1..w4 and output y]
The weights converge toward the first eigenvector of the input covariance matrix and are normalized.
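Oja's rule itself was shown as an image; its usual form is \Delta w = \eta\, y\,(x - y\, w). A minimal numerical sketch (illustrative only; the data, learning rate and dimensions are arbitrary choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    # toy 2-D data with one dominant direction of variance
    X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 0.0],
                                               [0.0, 0.5]])

    eta = 0.005
    w = rng.normal(size=2)
    for x in X:
        y = w @ x                      # linear neuron output
        w += eta * y * (x - y * w)     # Hebbian term + multiplicative (Oja) decay

    # the learned vector should align with the leading eigenvector of the covariance
    C = np.cov(X, rowvar=False)
    _, eigvecs = np.linalg.eigh(C)
    print("learned w (normalized):", w / np.linalg.norm(w))
    print("leading eigenvector    :", eigvecs[:, -1])   # agreement up to an overall sign

The multiplicative decay term -eta * y**2 * w is what keeps the norm of w close to 1, so no explicit renormalization step is needed.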

Hebbian Learning - Weight Decay: Oja's Subspace Algorithm
Oja's subspace algorithm is equivalent to minimizing the generalized (multi-unit) form of the cost function J.
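In its usual formulation (the slide's own equations were images), with M linear outputs y = W x the subspace rule reads

    \Delta W = \eta \left( y\, x^T - y\, y^T W \right),

and it performs gradient descent on the generalized reconstruction error

    J(W) = \left\langle \| x - W^T W x \|^2 \right\rangle,

whose minima are reached when the rows of W span the M-dimensional principal subspace of the input.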

Why PCA, LDA, ICA with ANNs?
They suggest how the brain could derive important properties of the sensory and motor space, and they allow us to discover new modes of computation based on simple, iterative, local learning rules.

Recurrence in Neural Networks
So far, we have considered only feed-forward neural networks. Most biological networks have recurrent connections. This change in the direction of information flow is interesting because it allows the network: to keep a memory of the neuron's activation; to propagate information across output neurons.

Anti-Hebbian Learning
How can we maximize information transmission in a network, i.e. maximize I(x;y)? → Anti-Hebbian learning.

Anti-Hebbian Learning
Anti-Hebbian learning is also known as lateral inhibition. In its update rule, the angle brackets denote an average taken over all training patterns.
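The standard anti-Hebbian update on the lateral weight between output units i and j (presumably what the slide's equation showed) is a sign-flipped Hebbian rule,

    \Delta w_{ij} = -\eta \, \langle y_i\, y_j \rangle ,

so that correlated outputs inhibit each other until \langle y_i y_j \rangle = 0.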

Anti-Hebbian Learning
If two outputs are highly correlated, the weight between them grows to a large negative value and each tends to turn the other off. There is no need for weight decay or renormalization of anti-Hebbian weights: they are automatically self-limiting.

Anti-Hebbian Learning - Foldiak's First Model (in matrix terms)
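The model's equations were lost in the transcript; in the usual description (an assumption here), the outputs combine feed-forward drive with lateral anti-Hebbian connections U (zero diagonal),

    y = W x + U y  \quad\Rightarrow\quad  y = (I - U)^{-1} W x,

with the lateral weights adapted as \Delta u_{ij} = -\eta\, y_i y_j for i \neq j, which drives the outputs toward decorrelation.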

Anti-Hebbian Learning - Foldiak's First Model
One can further show that there is a stable point in the weight space.

Anti-Hebbian Learning - Foldiak's Second Model
Each neuron also receives its own output with weight 1. The network converges when: 1) the outputs are decorrelated, and 2) the expected variance of each output equals 1.
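A lateral update of the form (a plausible reconstruction of the slide's rule, not a quote from it)

    \Delta u_{ij} \propto -\left( \langle y_i y_j \rangle - \delta_{ij} \right)

has exactly these two conditions as its fixed point: the off-diagonal terms vanish when the outputs are decorrelated, and the diagonal terms vanish when each output has unit variance, i.e. \langle y\, y^T \rangle = I.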

PCA versus ICA
PCA looks at the covariance matrix only. What if the data are not well described by the covariance matrix? The only distribution that is uniquely specified by its covariance (once the mean is subtracted) is the Gaussian distribution. Distributions that deviate from the Gaussian are poorly described by their covariances.

PCA versus ICA
Even with non-Gaussian data, variance maximization leads to the most faithful representation in a reconstruction-error sense. The mean-square error measure implicitly assumes Gaussianity, since it penalizes datapoints close to the mean less than those that are far away. But it does not in general lead to the most meaningful representation. → We need to perform gradient descent on some function other than the reconstruction error.

PCA versus ICA - Uncorrelated versus Statistically Independent
For independent variables the factorization holds under any non-linear transformation f; statistical independence is therefore a stronger constraint than decorrelation.
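The definitions the slide contrasted (standard ones): two outputs are uncorrelated when

    \langle y_1 y_2 \rangle = \langle y_1 \rangle \langle y_2 \rangle ,

whereas they are statistically independent when p(y_1, y_2) = p(y_1)\,p(y_2), which is equivalent to

    \langle f(y_1)\, g(y_2) \rangle = \langle f(y_1) \rangle \langle g(y_2) \rangle

for all (suitably integrable) functions f and g; taking f and g to be the identity recovers decorrelation as a special case.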

Objective Function of ICA
We want to ensure that the outputs y_i are maximally independent. This is equivalent to requiring that the mutual information between them be small, or alternatively that their joint entropy be large.
[Figure: Venn diagram relating H(x,y), H(x), H(y), H(x|y), I(x,y) and H(y|x)]
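In symbols (standard information-theoretic identities, not the slide's own notation): for outputs y_1, ..., y_M,

    I(y_1, \dots, y_M) = \sum_i H(y_i) - H(y_1, \dots, y_M) \ge 0,

with equality exactly when the outputs are independent; so minimizing the outputs' mutual information amounts to maximizing their joint entropy for given marginal entropies.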

Anti-Hebbian Learning and ICA
Anti-Hebbian learning can also lead to a decomposition into statistically independent components, i.e. a decomposition of the ICA type. To ensure independence, the network must converge to a solution that satisfies the condition ⟨f(y_i) g(y_j)⟩ = ⟨f(y_i)⟩⟨g(y_j)⟩ for any given functions f and g.

ICA for Time-Dependent Signals
[Figure: the original source signals and the mixed signals; adapted from a 2000 publication]

ICA for Time-Dependent Signals
[Figure: the mixed signals; adapted from a 2000 publication]

Anti-Hebbian Learning and ICA - The Jutten and Hérault Model
The model uses a non-linear learning rule. If f and g are the identity, we recover the plain anti-Hebbian rule, which ensures convergence to uncorrelated outputs. To ensure independence, the network must converge to a solution that satisfies the condition ⟨f(y_i) g(y_j)⟩ = ⟨f(y_i)⟩⟨g(y_j)⟩ for any given functions f and g.
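One common formulation of the Hérault-Jutten update (the slide's exact equation was an image) adapts the lateral weights with a non-linear anti-Hebbian rule,

    \Delta c_{ij} \propto - f(y_i)\, g(y_j), \qquad i \neq j,

with f and g two different non-linear functions (choices such as f(y) = y^3 together with g(y) = y or \arctan(y) appear in the literature); with f = g = identity this reduces to the decorrelating anti-Hebbian rule above.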

Anti-Hebbian Learning and ICA
Hint: use two odd functions for f and g (f(-x) = -f(x)); their Taylor series expansions then consist solely of odd terms. Since most (audio) signals have an even (symmetric) distribution, the individual averages ⟨f(y_i)⟩ and ⟨g(y_j)⟩ vanish, so at convergence one has ⟨f(y_i) g(y_j)⟩ = 0 for i ≠ j.

Anti-Hebbian Learning and ICA - Application to Blind Source Separation
[Figure: mixed signals] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999)

Anti-Hebbian Learning and ICA - Application to Blind Source Separation
[Figure: signals unmixed through generalized anti-Hebbian learning] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999)

Anti-Hebbian Learning and ICA - Application to Blind Source Separation
[Figure: mixed signals] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999)

Anti-Hebbian Learning and ICA - Application to Blind Source Separation
[Figure: signals unmixed through generalized anti-Hebbian learning] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999)

Information Maximization
[Figure: single neuron with weighted inputs w1..w4, bias w0 and output y]
Bell & Sejnowski proposed a network that maximizes the mutual information between the output and the input when these are not subject to noise (or rather, when the input and the noise can no longer be distinguished, in which case H(Y|X) tends to negative infinity).
(Bell A.J. and Sejnowski T.J., An information maximisation approach to blind separation and blind deconvolution, Neural Computation, 7(6), 1995.)

Information Maximization
H(Y|X) is independent of the weights W, and so maximizing the mutual information I(X;Y) = H(Y) - H(Y|X) with respect to W amounts to maximizing the output entropy H(Y).
(Bell & Sejnowski, 1995.)

Information Maximization
The entropy of a distribution is maximized when all outcomes are equally likely. → We must choose an activation function at the output neurons that equalizes each neuron's chance of firing and so maximizes their collective entropy.
(Bell & Sejnowski, 1995.)

Anti-Hebbian Learning and ICA
The sigmoid is the optimal choice for evening out a Gaussian distribution so that all output values are equally probable.
(Bell & Sejnowski, 1995.)
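More precisely, the squashing function that makes the output exactly uniform is the cumulative distribution function of the input, which for a Gaussian input the logistic sigmoid closely approximates. A quick numerical illustration (not from the slides; it uses the exact Gaussian CDF rather than a fitted sigmoid):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = rng.normal(size=100_000)        # Gaussian "pre-activation" values
    u = norm.cdf(x)                     # squash through the Gaussian CDF (sigmoid-shaped)
    counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
    print(counts / len(u))              # each bin holds roughly 10% -> nearly uniform output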


Anti-Hebbian Learning and ICA
[Figure: single neuron with weighted inputs w1..w4, bias w0 and output y]
(Bell & Sejnowski, 1995.)

Anti-Hebbian Learning and ICA
The pdf of the output can be written in terms of the input pdf and the derivative of the output with respect to the input; the entropy of the output then follows, and the learning rules that optimize this entropy are obtained by gradient ascent on it.
(Bell & Sejnowski, 1995.)
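The three missing expressions, as given in Bell & Sejnowski (1995) for a single input x passed through a logistic unit y = 1/(1 + e^{-(w x + w_0)}):

    p_y(y) = \frac{p_x(x)}{\left| \partial y / \partial x \right|}, \qquad
    H(y) = -\langle \ln p_y(y) \rangle = \langle \ln \left| \partial y / \partial x \right| \rangle + H(x),

and stochastic gradient ascent on H(y) (H(x) does not depend on the weights) yields

    \Delta w \propto \frac{1}{w} + x\,(1 - 2y), \qquad \Delta w_0 \propto 1 - 2y.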

Anti-Hebbian Learning and ICA
In this rule, one term acts as an anti-weight-decay (it moves the weight away from the trivial solution w = 0), while the other is anti-Hebbian (it avoids the saturated solution y = 1).
(Bell & Sejnowski, 1995.)

Anti-Hebbian Learning and ICA
This can be generalized to a network with many inputs and many outputs, each output passed through a sigmoid. The learning rule that optimizes the mutual information between input and output is then obtained in the same way, by gradient ascent on the joint output entropy. Such a network can linearly decompose up to 10 sources.
(Bell & Sejnowski, 1995.)
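The multi-unit rule itself did not survive the transcript; in Bell & Sejnowski's paper it takes the form \Delta W \propto (W^T)^{-1} + (1 - 2y)\,x^T for logistic outputs y = g(W x + w_0). A minimal sketch of blind source separation built on this idea, using the natural-gradient variant of the update for stability (the data, mixing matrix, learning rate and iteration count are arbitrary illustrative choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20000

    # two independent, zero-mean, unit-variance super-Gaussian (Laplacian) sources
    S = rng.laplace(size=(2, n))
    S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])            # unknown mixing matrix
    X = A @ S                             # observed mixtures

    W = np.eye(2)                         # unmixing matrix to be learned
    eta = 0.1
    for _ in range(1000):
        U = W @ X                         # current source estimates
        Y = 1.0 / (1.0 + np.exp(-U))      # logistic outputs
        # batch-averaged, natural-gradient form of the infomax update
        dW = (np.eye(2) + (1.0 - 2.0 * Y) @ U.T / n) @ W
        W += eta * dW

    # if separation worked, W @ A is close to a scaled permutation matrix
    print(np.round(W @ A, 2))

The logistic non-linearity matches super-Gaussian sources, which is why Laplacian sources are used here; recovery is only up to the usual ICA ambiguities of scale and permutation.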