Information Theory and Learning

1 Information Theory and Learning
Tony Bell Helen Wills Neuroscience Institute University of California at Berkeley

2 One input, one output, deterministic. Infomax: match the non-linearity
to the input distribution: for y = g(x), information transfer is maximised when g is the cumulative density of the input, i.e. g'(x) = p(x), so that the output y is uniformly distributed.
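This matching principle is easy to check numerically. A minimal sketch, assuming NumPy (the Laplacian input, sample size, and histogram bin count are illustrative choices, not from the slides): pass a Laplacian input through its own cumulative density and through a mismatched logistic squashing, and compare output entropies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Laplacian-distributed input; its cumulative density is known in closed form.
x = rng.laplace(loc=0.0, scale=1.0, size=100_000)

def laplace_cdf(x, b=1.0):
    # Infomax-optimal squashing for a Laplacian input: g'(x) = p(x)
    return np.where(x < 0, 0.5 * np.exp(x / b), 1.0 - 0.5 * np.exp(-x / b))

y = laplace_cdf(x)                 # matched non-linearity: y is uniform on [0, 1]
z = 1.0 / (1.0 + np.exp(-x))       # mismatched logistic squashing, for comparison

def hist_entropy(u, bins=50):
    # Histogram estimate of the output entropy; a uniform output maximises it.
    p, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(hist_entropy(y), hist_entropy(z))
```

The matched CDF flattens the output histogram, so its entropy approaches the 50-bin maximum ln 50; the mismatched logistic piles mass near 0.5 and scores lower.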

3 Gradient descent learning rule to maximise the transferred information
(deterministic, sensory only). For a single logistic unit y = g(wx + w0), ascending the gradient of the transferred information gives Δw ∝ 1/w + x(1 − 2y) and Δw0 ∝ 1 − 2y.

4 Examples of score functions
The score function is the gradient of the log density, f(u) = ∂ ln p(u)/∂u. LOGISTIC: f(u) = 1 − 2g(u) = −tanh(u/2). LAPLACIAN: f(u) = −sign(u). In stochastic gradient algorithms (online training), we dispense with the ensemble averages, giving an update computed from a single training example, here with a Laplacian ‘prior’.
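The two score functions above can be written down and verified directly. A short sketch assuming NumPy (the finite-difference check is an illustrative addition):

```python
import numpy as np

def logistic_score(u):
    # f(u) = d/du ln p(u) for the logistic density p(u) = e^{-u} / (1 + e^{-u})^2
    return 1.0 - 2.0 / (1.0 + np.exp(-u))      # equals -tanh(u / 2)

def laplacian_score(u):
    # f(u) = d/du ln p(u) for the Laplacian density p(u) ∝ e^{-|u|}
    return -np.sign(u)

# Check the logistic score against a finite difference of ln p.
u = np.linspace(-4.0, 4.0, 201)
eps = 1e-5

def log_p(u):
    return -u - 2.0 * np.log1p(np.exp(-u))

numeric = (log_p(u + eps) - log_p(u - eps)) / (2.0 * eps)
print(np.max(np.abs(numeric - logistic_score(u))))
```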

5 Same theory for multiple dimensions: fire vectors into the unit hypercube uniformly:
maximise the output entropy H(y) = E[ln |J|] + H(x), where |J|, the absolute determinant of the Jacobian matrix, measures how stretchy the mapping is. For square or overcomplete transforms u = Wx, this gives ΔW ∝ (Wᵀ)⁻¹ + f(u)xᵀ. Undercomplete transformations are not invertible, and require the more complex formula in which |det W| is replaced by √(det WWᵀ).

6 Same theory for multiple dimensions: fire vectors into the unit hypercube uniformly:
ΔW ∝ (Wᵀ)⁻¹ + f(u)xᵀ. Post-multiplying this by the positive definite transform WᵀW rescales the gradient optimally (called the Natural Gradient, Amari), giving the pleasantly simple form: ΔW ∝ (I + f(u)uᵀ)W.
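The natural-gradient rule can be demonstrated on a toy problem. A minimal sketch, assuming NumPy (the 2x2 mixing matrix, learning rate, iteration count, and −tanh score are illustrative choices): unmix two Laplacian sources.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent Laplacian (sparse, super-Gaussian) sources, linearly mixed.
n = 20_000
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

W = np.eye(2)
eta = 0.05
for _ in range(500):
    U = W @ X
    f = -np.tanh(U)                            # score for a super-Gaussian prior
    dW = (np.eye(2) + (f @ U.T) / n) @ W       # natural gradient: (I + f(u)u^T) W
    W += eta * dW

# Each recovered component should match one source up to sign and scale.
U = W @ X
C = np.abs(np.corrcoef(np.vstack([U, S]))[:2, 2:])
print(C)
```

Each row of C should contain one correlation near 1 (the matched source) and one near 0; ICA recovers sources only up to permutation, sign, and scale.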

7 Decorrelation is not enough:
it only constrains the second-order statistics, making the covariance matrix E[uuᵀ] diagonal. The non-linear score function f captures higher-order statistics through its Taylor expansion.

8 Infomax/ICA on image patches: learn co-ordinates for natural scenes.
In this linear generative model x = As, we want u = s: recover the independent sources. After training, we calculate A = W⁻¹ and plot its columns. For 16x16 image patches, we get 256 basis functions.

9 f from logistic density

10 f from laplacian density

11 f from Gaussian density

12 But this does not actually make the neurons independent.
Many joint densities p(u1,u2) are decorrelated but still radially symmetric: they factorise in polar co-ordinates, p(r,θ) = p(r)p(θ), but not in Cartesian co-ordinates, p(u1,u2) ≠ p(u1)p(u2), unless they’re Gaussian. This happens when cells have similar position, spatial frequency, and orientation selectivity, but different phase. Dependent filters can combine to make non-linear complex cells (oriented but phase-insensitive).
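Dependence without correlation is easy to demonstrate. A sketch assuming NumPy (the exponential radius is an illustrative choice of non-Gaussian radial density): a radially symmetric density has zero linear correlation between u1 and u2, yet their energies u1², u2² are positively correlated, which is exactly the dependence a complex cell can exploit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Radially symmetric, non-Gaussian joint density: uniform angle, heavy-tailed radius.
n = 200_000
theta = rng.uniform(0.0, 2.0 * np.pi, n)
r = rng.exponential(scale=1.0, size=n)
u1 = r * np.cos(theta)
u2 = r * np.sin(theta)

lin = np.corrcoef(u1, u2)[0, 1]            # second-order dependence: ~0
energy = np.corrcoef(u1**2, u2**2)[0, 1]   # higher-order dependence: clearly > 0
print(lin, energy)
```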

13 ‘Dependent’ Component Analysis.
First, the maximum likelihood framework. What we have been doing is one thing under three names: Infomax = Maximum Likelihood = Minimum KL Divergence. We are fitting a model density q(x) to the data: maximising E_p[ln q(x)], or equivalently minimising KL[p ‖ q]. But a much more general model is the ‘energy-based’ model (Hinton): q(u) ∝ exp(−Σ_j f_j(u_{S_j})), a sum of functions f_j on subsets S_j of the variables, with a partition function Z to normalise.

14 ‘Dependent’ Component Analysis.
For the completely general model, the learning rule is: ΔW ∝ (⟨f(u)uᵀ⟩_data − ⟨f(u)uᵀ⟩_q)W, with the 2nd term reducing to −I (identity) in the case of ICA. Unfortunately this term involves an intractable integral over the model q. Nonetheless, we can still work with all dependency models which are non-loopy hypergraphs: learn as before, but with a modified score function. For a loopy hypergraph (overlapping subsets), the normalisation does not factorise and this shortcut fails.

15 For example, we can split the space into subspaces such
that the cells are independent between subspaces and dependent within the subspaces. Eg: for 4 cells, split into two 2-cell subspaces. We now show a sequence of symmetry-breaking occurring as we move from training, on images, a model which is one big 256-dimensional hyperball, down to a model which is 64 four-dimensional hyperballs:
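The within-subspace dependence enters through a modified score function. A minimal sketch, assuming NumPy (the L2-norm energy h(r) = r is one illustrative choice of subspace prior): for p(u) ∝ exp(−Σ_k ‖u_{S_k}‖), each cell's score depends on all the cells in its subspace through the shared radius.

```python
import numpy as np

def isa_score(u, subspaces):
    # Score for the subspace prior p(u) ∝ exp(-sum_k ||u_{S_k}||):
    # within a subspace, every cell's score shares the radius r_k,
    # coupling the cells; between subspaces they remain independent.
    f = np.zeros_like(u)
    for S in subspaces:
        r = np.linalg.norm(u[S]) + 1e-12
        f[S] = -u[S] / r                   # d/du_i ln p(u) = -u_i / r_k
    return f

# Eg: 4 cells, dependent within {0,1} and within {2,3}.
u = np.array([0.5, -1.0, 2.0, 0.3])
print(isa_score(u, [[0, 1], [2, 3]]))
```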

16 Logistic density 1 subspace

17 Logistic density 2 subspaces

18 Logistic density 4 subspaces

19 Logistic density 8 subspaces

20 Logistic density 16 subspaces

21 Logistic density 32 subspaces

22 Logistic density 64 subspaces

23 Topographic ICA Arrange the cells in a 2D map with a statistical model q constructed from overlapping subsets. This is a loopy hypergraph, an un-normalised model, but it still gives a nice result… The hyperedges of our hypergraph are overlapping 4x4 neighbourhoods, etc.

24 [figure: topographic ICA result]

25 That was from Hyvarinen & Hoyer.
Here’s one from Osindero & Hinton.

26 Conclusion. Well, we did get somewhere:
We seem to have an information-theoretic explanation of some properties of area V1 of visual cortex:
- simple cells (Olshausen & Field; Bell & Sejnowski)
- complex cells (Hyvarinen & Hoyer)
- topographic maps with singularities (Hyvarinen & Hoyer)
- colour receptive fields (Doi & Lewicki)
- direction sensitivity (van Hateren & Ruderman)
But we are stuck on:
- the gradient of the partition function
- still working with rate models, not spiking neurons
- no top-down feedback
- no sensory-motor loop (all passive world modelling)

27 References. The references for all the work in these 3 talks will be
forwarded separately. If you don’t have access to them, email me and I’ll send them to you.

