
1 MACHINE LEARNING: Information Theory and The Neuron - II. Aude Billard, LASA, EPFL (http://lasa.epfl.ch), Doctoral Class EDIC, 2006.

2 Overview.
LECTURE I: Neuron - biological inspiration; Information theory and the neuron; Weight decay + anti-Hebbian learning → PCA; Anti-Hebbian learning → ICA.
LECTURE II: Capacity of the single neuron; Capacity of associative memories (Willshaw net, extended Hopfield network).
LECTURE III: Continuous time-delay NN; Limit cycles, stability and convergence.

3 Neural Processing - The Brain. A neuron receives and integrates input from other neurons. Once the integrated input exceeds a critical level, the neuron discharges a spike. This spiking event is also called depolarization, and it is followed by a refractory period during which the neuron is unable to fire. [Figure: cell body, dendrites and synapse, and the electrical potential over time, showing integration, decay/depolarization and the refractory period.]

4 Information Theory and The Neuron. You can view the neuron as a memory. What can you store in this memory? What is the maximal capacity? How can you find a learning rule that maximizes the capacity? [Figure: single neuron with weights w1-w4 and output y.]

5 Information Theory and The Neuron. A fundamental requirement on learning systems is their robustness to noise. One way to measure a system's robustness to noise is to determine the mutual information between its inputs and its output.

6 Information Theory and The Neuron. Consider the neuron as a sender-receiver system, with X the message sent and y the message received. Information theory gives a measure of the information conveyed by y about X. If the transmission system is imperfect (noisy), you must find a way to ensure minimal disturbance in the transmission. [Figure: single neuron with weights w1-w4 and output y.]

7 Information Theory and The Neuron. Suppose the noise is added to the output only, y = w·x + ν. The mutual information between the neuron output y and its inputs x is then I(x; y) = ½ ln(1 + SNR), where SNR = σ_S²/σ_N² is the signal-to-noise ratio. In order to maximize this ratio, one can simply increase the magnitude of the weights (this increases the signal variance but not the noise variance). [Figure: single neuron with weights w1-w4 and output y.]

8 Information Theory and The Neuron. Suppose instead that the noise is added independently to each input, y = Σ_i w_i (x_i + ν_i). The mutual information between the output y and the inputs X then becomes I(X; y) = ½ ln(σ_y² / (σ_ν² Σ_i w_i²)). This time, one cannot simply increase the magnitude of the weights, as this scales the effective noise variance σ_ν² Σ_i w_i² as well. [Figure: single neuron with weights w1-w4 and output y.]
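A small numerical sketch of the two cases above (the Gaussian-channel expressions, the covariance matrix and all variable names are my own illustration, not taken from the slides): scaling the weights raises the mutual information when the noise sits on the output, but leaves it unchanged when the noise sits on the inputs.

```python
import numpy as np

# Mutual information of a linear Gaussian neuron under two noise models:
#   Case 1: noise added to the output,  y = w.x + nu
#   Case 2: noise added to each input,  y = w.(x + nu)
# For Gaussian signals, I = 0.5 * ln(var(y) / effective noise variance).

def mi_output_noise(w, C, var_noise):
    """Case 1: the effective noise variance is just var_noise."""
    var_signal = w @ C @ w
    return 0.5 * np.log((var_signal + var_noise) / var_noise)

def mi_input_noise(w, C, var_noise):
    """Case 2: the effective noise variance is var_noise * ||w||^2."""
    var_signal = w @ C @ w
    var_eff = var_noise * np.sum(w ** 2)
    return 0.5 * np.log((var_signal + var_eff) / var_eff)

C = np.array([[2.0, 0.5], [0.5, 1.0]])   # input covariance (arbitrary example)
w = np.array([1.0, 0.5])

for scale in (1.0, 10.0):
    print(scale,
          mi_output_noise(scale * w, C, var_noise=0.1),  # grows with the scale
          mi_input_noise(scale * w, C, var_noise=0.1))   # unchanged by the scale
```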

9 Information Theory and The Neuron.

10 How to define a learning rule to optimize the mutual information?

11 Hebbian Learning. If the input x_j and the output y_i fire simultaneously, the weight of the connection between them is strengthened in proportion to the strength of their firing.
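A minimal sketch of the plain Hebbian update for a single linear neuron (my notation; the learning rate, dimensions and data are arbitrary): y = w·x and Δw = η y x.

```python
import numpy as np

# Plain Hebbian learning for one linear neuron (illustrative sketch).
rng = np.random.default_rng(0)
eta = 0.01
w = rng.normal(scale=0.1, size=3)

for _ in range(1000):
    x = rng.normal(size=3)      # one input pattern
    y = w @ x                   # neuron output
    w += eta * y * x            # strengthen connections whose input and output co-fire

print(np.linalg.norm(w))        # nothing bounds the weights: the norm keeps growing
```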

12 Hebbian Learning - Limit Cycle / Stability? At a fixed point of the (averaged) Hebbian rule this holds for all components i, and thus w is an eigenvector of the input covariance matrix C with associated eigenvalue 0. But C is a symmetric, positive semi-definite matrix, so all its eigenvalues are >= 0, and under a small disturbance the weights tend to grow in the direction of the eigenvector of C with the largest eigenvalue: plain Hebbian learning is unstable.
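A short numerical sketch of that statement (the covariance matrix, step size and disturbance below are arbitrary choices of mine): iterating the averaged Hebbian update w ← w + η C w from a small disturbance makes the norm of w blow up while its direction aligns with the dominant eigenvector of C.

```python
import numpy as np

# Averaged Hebbian dynamics dw/dt = C w: unbounded growth along the
# eigenvector of C with the largest eigenvalue.
C = np.array([[3.0, 1.0],
              [1.0, 2.0]])                    # example input covariance
eigvals, eigvecs = np.linalg.eigh(C)
v_max = eigvecs[:, np.argmax(eigvals)]        # dominant eigenvector

w = np.array([1.0, -1.0])                     # small initial disturbance
eta = 0.01
for _ in range(2000):
    w = w + eta * (C @ w)                     # averaged Hebbian update

print(np.linalg.norm(w))                              # the norm has exploded
print(abs(np.dot(w / np.linalg.norm(w), v_max)))      # ~1: aligned with v_max
```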

13 Hebbian Learning - Weight Decay. The simple weight decay rule belongs to a class of decay rules called subtractive rules. The only advantage of subtractive rules over simply clipping the weights is that they eliminate weights that have little importance. Another important type of decay rule is the multiplicative rule. The advantage of multiplicative rules is that, in addition to keeping the weights small, they also keep them useful.
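One possible parameterization of the two families (this is my own generic sketch; the exact decay terms used in the lecture are not shown in the transcript): in a subtractive rule the amount removed from a weight does not depend on that weight's value, while in a multiplicative rule it is proportional to the weight.

```python
import numpy as np

# Hypothetical generic forms: a Hebbian update combined with a subtractive
# or a multiplicative decay term.
eta, gamma = 0.01, 0.005

def hebb_with_subtractive_decay(w, x):
    y = w @ x
    # the amount removed does not depend on the weight's magnitude
    # (an L1-style decay), so weights of little importance are driven to zero
    return w + eta * y * x - gamma * np.sign(w)

def hebb_with_multiplicative_decay(w, x):
    y = w @ x
    # the amount removed is proportional to the weight itself, so weights
    # stay small while keeping their relative (useful) structure
    return w + eta * y * x - gamma * w
```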

14 Information Theory and The Neuron - Oja's one-neuron model: Δw_i = η y (x_i − y w_i), with y = Σ_i w_i x_i. The weights converge toward the first eigenvector of the input covariance matrix and are normalized (||w|| → 1). [Figure: single neuron with weights w1-w4 and output y.]
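A runnable sketch of Oja's rule on toy correlated data (the data, learning rate and seed are arbitrary choices of mine): the weight vector ends up parallel to the leading eigenvector of the input covariance, with norm close to 1.

```python
import numpy as np

# Oja's one-neuron rule: delta w = eta * y * (x - y * w).
rng = np.random.default_rng(1)
A = np.array([[2.0, 0.0], [1.0, 0.5]])
X = rng.normal(size=(5000, 2)) @ A.T          # zero-mean, correlated inputs

eta = 0.01
w = rng.normal(scale=0.5, size=2)
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)                # Hebb term + normalizing decay

C = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(C)
print(w, eigvecs[:, -1])                      # same direction up to sign, ||w|| ~ 1
```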

15 Hebbian Learning - Weight Decay. Oja's subspace algorithm: ΔW = η (y xᵀ − y yᵀ W), with y = W x. It is equivalent to minimizing the generalized form of the cost J.
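A sketch of the subspace rule with two output neurons (the toy data are my own choice): the rows of W converge to a roughly orthonormal basis of the 2-D principal subspace of the inputs, rather than to the individual eigenvectors.

```python
import numpy as np

# Oja's subspace rule for m output neurons, y = W x:
#   delta W = eta * (y x^T - y y^T W)
rng = np.random.default_rng(2)
A = np.array([[2.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.2, 0.1, 0.1]])
X = rng.normal(size=(5000, 3)) @ A.T          # correlated 3-D inputs

eta, m = 0.005, 2
W = rng.normal(scale=0.1, size=(m, 3))
for x in X:
    y = W @ x
    W += eta * (np.outer(y, x) - np.outer(y, y) @ W)

print(W @ W.T)        # close to the identity: the rows are roughly orthonormal
```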

16 Hebbian Learning - Weight Decay. Why PCA, LDA, ICA with ANN? They suggest how the brain could derive important properties of the sensory and motor spaces, and they allow new modes of computation to be discovered with simple, iterative and local learning rules.

17 Recurrence in Neural Networks. So far, we have considered only feed-forward neural networks, yet most biological networks have recurrent connections. This change of direction in the flow of information is interesting, as it allows the network: to keep a memory of the activation of the neurons; to propagate information across output neurons.

18 Anti-Hebbian Learning. How can one maximize information transmission in a network, i.e. maximize I(x; y)? → Anti-Hebbian learning.

19 Anti-Hebbian Learning. Anti-Hebbian learning is also known as lateral inhibition. The weight update is driven by the correlation between outputs, averaged over all training patterns.

20 Anti-Hebbian Learning. If two outputs are highly correlated, the weight between them grows to a large negative value and each tends to turn the other off. There is no need for weight decay or renormalization on anti-Hebbian weights: they are automatically self-limiting.
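A sketch of this self-limiting behaviour for a single lateral connection between two output neurons (the recurrent setup, the anti-Hebbian form Δw = −η y₁ y₂, and all numbers are my own illustration): the lateral weight settles at a finite negative value once the outputs are decorrelated.

```python
import numpy as np

# Two output neurons connected by one lateral weight w.  The outputs solve
# the recurrent equations y = x + W y, and w is updated anti-Hebbianly.
rng = np.random.default_rng(3)
c = 0.8                                       # correlation between the two inputs
C = np.array([[1.0, c], [c, 1.0]])
L = np.linalg.cholesky(C)

eta, w = 0.01, 0.0
for _ in range(20000):
    x = L @ rng.normal(size=2)                # correlated input pair
    M = np.linalg.inv(np.eye(2) - np.array([[0.0, w], [w, 0.0]]))
    y = M @ x                                 # settled recurrent outputs
    w -= eta * y[0] * y[1]                    # anti-Hebbian update

print(w)   # settles at a finite negative value; E[y1*y2] -> 0 (self-limiting)
```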

21 Anti-Hebbian Learning - Foldiak's first model, and its formulation in matrix terms.

22 Anti-Hebbian Learning - Foldiak's first model. One can further show that there is a stable point in the weight space.

23 Anti-Hebbian Learning - Foldiak's 2nd model allows all neurons to receive their own outputs with weight 1. This network converges when: 1) the outputs are decorrelated; 2) the expected variance of each output equals 1.
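In symbols (standard notation, not shown on the slide, and assuming zero-mean outputs), the two convergence conditions say that the output covariance is the identity matrix:

```latex
\mathbb{E}[y_i y_j] = 0 \;\; (i \neq j), \qquad
\mathbb{E}[y_i^2] = 1 \;\; \forall i
\quad\Longleftrightarrow\quad
\mathbb{E}\!\left[\mathbf{y}\mathbf{y}^{\top}\right] = I .
```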

24 PCA versus ICA. PCA looks at the covariance matrix only. What if the data are not well described by the covariance matrix? The only distribution that is uniquely specified by its covariance (once the mean is subtracted) is the Gaussian distribution. Distributions that deviate from the Gaussian are poorly described by their covariances.

25 PCA versus ICA. Even with non-Gaussian data, variance maximization leads to the most faithful representation in the reconstruction-error sense. The mean-square error measure implicitly assumes Gaussianity, since it penalizes datapoints close to the mean less than those that are far away. But it does not, in general, lead to the most meaningful representation. → We need to perform gradient descent on some function other than the reconstruction error.

26 Uncorrelated versus Statistically Independent. Independent: E[f(y1) g(y2)] = E[f(y1)] E[g(y2)], true for any (non-linear) transformations f and g. Uncorrelated: E[y1 y2] = E[y1] E[y2]. Statistical independence is a stronger constraint than decorrelation.
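A quick numerical illustration of the gap between the two notions (my own example, not from the slides): y2 = y1² with y1 symmetric around zero is uncorrelated with y1 but clearly not independent of it.

```python
import numpy as np

# Uncorrelated but not independent: y1 uniform on [-1, 1], y2 = y1**2.
rng = np.random.default_rng(4)
y1 = rng.uniform(-1.0, 1.0, size=200_000)
y2 = y1 ** 2

print(np.mean(y1 * y2) - np.mean(y1) * np.mean(y2))        # ~0: uncorrelated
# Independence would require E[f(y1) g(y2)] = E[f(y1)] E[g(y2)] for all f, g.
# With f(a) = a**2 and g(b) = b this clearly fails:
print(np.mean(y1**2 * y2) - np.mean(y1**2) * np.mean(y2))  # ~0.09, not 0
```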

27 Objective Function of ICA. We want to ensure that the outputs y_i are maximally independent. This is identical to requiring that the mutual information between them be small, or alternatively that their joint entropy be large. [Venn diagram: the joint entropy H(x,y) decomposed into H(x|y), I(x;y) and H(y|x).]
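The standard identities behind that statement (textbook information theory, written in my notation):

```latex
I(x;y) = H(x) + H(y) - H(x,y) = H(y) - H(y|x), \qquad
I(y_1,\dots,y_n) = \sum_{i=1}^{n} H(y_i) - H(y_1,\dots,y_n) \;\ge\; 0 ,
```

so, for fixed marginal entropies, making the outputs independent (small mutual information between them) is the same as making their joint entropy as large as possible.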

28 Anti-Hebbian Learning and ICA. Anti-Hebbian learning can also lead to a decomposition into statistically independent components, and as such can perform an ICA-type decomposition. To ensure independence, the network must converge to a solution that satisfies the condition E[f(y_i) g(y_j)] = E[f(y_i)] E[g(y_j)] for any given functions f and g.

29 ICA for Time-Dependent Signals. [Figures: the original source signals and the mixed signals; adapted from Hyvärinen, 2000.]

30 ICA for Time-Dependent Signals. [Figure: the mixed signals; adapted from Hyvärinen, 2000.]

31 Anti-Hebbian Learning and ICA - the Jutten and Herault model: a non-linear learning rule. If f and g are the identity, we recover the anti-Hebbian rule seen above, which only ensures convergence to uncorrelated outputs. To ensure independence, the network must converge to a solution that satisfies the independence condition E[f(y_i) g(y_j)] = E[f(y_i)] E[g(y_j)] for any given functions f and g.
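A rough sketch of the Jutten-Herault recurrent network (my own implementation; the particular non-linearities f(y) = y³ and g(y) = y are a common choice in the literature and are not given in the transcript): the outputs settle to y = (I + W)⁻¹ x and only the off-diagonal lateral weights adapt, Δw_ij = η f(y_i) g(y_j) for i ≠ j.

```python
import numpy as np

# Jutten-Herault-style recurrent separation of two mixed signals.
rng = np.random.default_rng(5)
t = np.arange(20000)
S = np.vstack([np.sin(0.05 * t),                      # two toy sources
               rng.uniform(-1.0, 1.0, size=t.size)])
A = np.array([[1.0, 0.5], [0.6, 1.0]])                # unknown mixing matrix
X = A @ S                                             # observed mixtures

f = lambda y: y ** 3
g = lambda y: y
eta = 0.001
W = np.zeros((2, 2))
for x in X.T:
    y = np.linalg.solve(np.eye(2) + W, x)             # settled recurrent outputs
    dW = eta * np.outer(f(y), g(y))
    np.fill_diagonal(dW, 0.0)                         # only lateral weights adapt
    W += dW

# Fixed points of this rule are weight settings for which E[f(y_i) g(y_j)] = 0
# for i != j, i.e. the lateral weights cancel the cross-talk of the mixing.
print(W)
```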

32 Anti-Hebbian Learning and ICA. Hint: use two odd functions for f and g (f(−x) = −f(x)); their Taylor series expansions then consist solely of odd terms. Since most (audio) signals have an even (symmetric) distribution, their odd moments vanish, and independent outputs therefore satisfy E[f(y_i) g(y_j)] = 0 for i ≠ j at convergence.

33 Anti-Hebbian Learning and ICA - Application to blind source separation. [Figure: the mixed signals.] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999.)

34 Anti-Hebbian Learning and ICA - Application to blind source separation. [Figure: the signals unmixed through generalized anti-Hebbian learning.] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999.)

35 Anti-Hebbian Learning and ICA - Application to blind source separation. [Figure: the mixed signals.] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999.)

36 Anti-Hebbian Learning and ICA - Application to blind source separation. [Figure: the signals unmixed through generalized anti-Hebbian learning.] (Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999.)

37 Information Maximization. Bell & Sejnowski proposed a network that maximizes the mutual information between the output and the input when these are not subject to noise (or rather, when the input and the noise can no longer be distinguished, so that H(Y|X) tends to negative infinity). [Figure: single neuron with weights w1-w4, bias w0 and output y.] Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

38 Information Maximization (continued). Since H(Y|X) is independent of the weights W, maximizing the mutual information I(X;Y) = H(Y) − H(Y|X) with respect to W amounts to maximizing the output entropy H(Y). Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

39 Information Maximization. The entropy of a distribution is maximized when all outcomes are equally likely. → We must choose an activation function at the output neurons that equalizes each neuron's chances of firing and so maximizes their collective entropy. Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

40 Anti-Hebbian Learning and ICA. The sigmoid is the optimal activation function for evening out a Gaussian input distribution so that all output values are equally probable (the optimal squashing function is the cumulative distribution function of the input). Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.


42 Anti-Hebbian Learning and ICA. [Figure: single neuron with weights w1-w4, bias w0 and output y.] Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

43 Anti-Hebbian Learning and ICA. The pdf of the output can be written in terms of the pdf of the input; the entropy of the output then follows, and the learning rules that optimize this entropy are obtained by gradient ascent on that entropy (see the sketch below). Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.
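For a single input x passed through a monotonic squashing function y = g(wx + w₀), the missing formulas are presumably the standard ones from Bell & Sejnowski (1995), restated here in my notation:

```latex
p_y(y) = \frac{p_x(x)}{\left|\partial y / \partial x\right|}, \qquad
H(y) = -\mathbb{E}\!\left[\ln p_y(y)\right]
     = \mathbb{E}\!\left[\ln \left|\frac{\partial y}{\partial x}\right|\right] + H(x),
```

and, for the logistic choice y = 1/(1 + e^{-(wx + w₀)}), gradient ascent on H(y) gives

```latex
\Delta w \;\propto\; \frac{1}{w} + x\,(1 - 2y), \qquad
\Delta w_0 \;\propto\; 1 - 2y .
```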

44 Anti-Hebbian Learning and ICA. In this rule, the first term acts as an anti-weight-decay (it moves the weight away from the trivial solution w = 0), while the second term is anti-Hebbian (it avoids the saturated solution y = 1). Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

45 Anti-Hebbian Learning and ICA. This can be generalized to a network with many inputs and many outputs, with a sigmoid function at each output. The learning rules that optimize the mutual information between input and output are then the matrix form of the previous rule (a sketch follows below). Such a network can linearly decompose up to 10 sources. Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.
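A sketch of that matrix rule on a toy blind-source-separation problem (the update ΔW ∝ (Wᵀ)⁻¹ + (1 − 2y)xᵀ, Δw₀ ∝ 1 − 2y is the form published in Bell & Sejnowski (1995); the Laplacian sources, mixing matrix, learning rate and seed are my own illustrative choices):

```python
import numpy as np

# Infomax ICA with logistic outputs y = g(W x + w0), g(u) = 1/(1 + exp(-u)):
#   delta W  = eta * ( (W^T)^{-1} + (1 - 2y) x^T )
#   delta w0 = eta * ( 1 - 2y )
rng = np.random.default_rng(6)
n, T = 2, 50_000
S = rng.laplace(size=(n, T))                 # super-Gaussian independent sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
X = A @ S                                    # observed mixtures

eta = 0.001
W, w0 = np.eye(n), np.zeros(n)
for x in X.T:
    y = 1.0 / (1.0 + np.exp(-(W @ x + w0)))
    W += eta * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))
    w0 += eta * (1.0 - 2.0 * y)

# W @ A should move toward a scaled permutation matrix, i.e. the sources are
# recovered up to ordering and scale.
print(W @ A)
```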

