
1 Unsupervised learning of visual representations and their use in object & face recognition. Gary Cottrell, Chris Kanan, Honghao Shan, Lingyun Zhang, Matthew Tong, Tim Marks

2 Collaborators: Honghao Shan, Chris Kanan

3 Collaborators: Tim Marks, Matt Tong, Lingyun Zhang

4 Efficient Encoding of the World. Sparse Principal Components Analysis: a model of unsupervised learning for early perceptual processing (Honghao Shan). The model embodies three constraints: (1) keep as much information as possible, (2) while trying to equalize the neural responses, (3) and minimizing the number of connections.

5 Efficient encoding of the world leads to magno- and parvo-cellular response properties. Trained on video cubes (grayscale and color images), the learned filters separate by spatial and temporal extent: persistent, small filters (midget?) and transient, large filters (parasol?). This suggests that these cell types exist because they are useful for efficiently encoding the temporal dynamics of the world.

6 Efficient encoding of the world leads to gammatone filters, as in auditory nerves: using exactly the same algorithm, applied to speech, environmental sounds, etc.

7 Efficient Encoding of the World. A single unsupervised learning algorithm leads to: model cells with properties similar to those found in the retina, when applied to natural videos; model cells with properties similar to those found in the auditory nerve, when applied to natural sounds. One small step towards a unified theory of temporal processing.

8 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). Recursive ICA (RICA 1.0; Shan et al., 2008): alternately compress and expand the representation using PCA and ICA; ICA was modified by a component-wise nonlinearity; receptive fields expand at each ICA layer.

9 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). ICA was modified by a component-wise nonlinearity. Think of ICA as a generative model: the pixels are the sum of many independent random variables, and hence approximately Gaussian. So ICA prefers its inputs to be Gaussian-distributed. We apply an inverse cumulative Gaussian to the absolute value of the ICA components to gaussianize them.
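A minimal sketch of such a gaussianizing nonlinearity: take the magnitude of each ICA response, estimate its empirical CDF, and push that through the inverse Gaussian CDF. The empirical-CDF estimator used here is an assumption for illustration, not the published procedure.

```python
import numpy as np
from scipy.stats import norm

def gaussianize(responses):
    """Map ICA responses to Gaussian values via their magnitudes.

    Strong responses (large |s|) land in the positive tail, weak
    responses in the negative tail, and middling responses near zero.
    """
    mag = np.abs(responses)
    # Rank of each magnitude, turned into an empirical CDF value
    # kept strictly inside (0, 1) so norm.ppf stays finite.
    ranks = np.argsort(np.argsort(mag))
    cdf = (ranks + 0.5) / len(mag)
    # Inverse cumulative Gaussian of the magnitude CDF.
    return norm.ppf(cdf)
```

Note that the sign of the original response is discarded: a strong negative response is treated the same as a strong positive one, which is exactly the behavior described on the next slide.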

10 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). Strong responses, either positive or negative, are mapped to the positive tail of the Gaussian; weak ones, to the negative tail; ambiguous ones, to the center.

11 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). RICA 2.0: replace PCA by SPCA.

12 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). RICA 2.0 results: a multiple-layer system with center-surround receptive fields at the first layer; simple edge filters at the second (ICA) layer; spatial pooling of orientations at the third (SPCA) layer; and V2-like response properties at the fourth (ICA) layer.

13 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). V2-like response properties at the fourth (ICA) layer. These maps show the strengths of connections to layer-1 ICA filters: warm and cold colors are strong +/- connections, gray is weak connections, and orientation corresponds to layer-1 orientation. The left-most column displays two model neurons with uniform orientation preference for layer-1 ICA features. The middle column displays model neurons with non-uniform/varying orientation preference. The right column displays two model neurons that have location preference, but no orientation preference. The left two columns are consistent with Anzai, Peng, & Van Essen; the right-hand column is a prediction.

14 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19). Dimensionality reduction and expansion might be a general strategy of information processing in the brain: the first step removes noise and reduces complexity, and the second step captures the statistical structure. We showed that retinal ganglion cells and V1 complex cells may be derived from the same learning algorithm, applied to pixels in one case and to V1 simple cell outputs in the other. This highly simplified model of early vision is the first one that learns the receptive fields of all early visual layers using a consistent theory, the efficient coding theory. We believe it could serve as a basis for more sophisticated models of early vision. An obvious next step is to train higher layers and thus make predictions about them.

15 Nice, but is it useful? We showed in Shan & Cottrell (CVPR 2008) that we could achieve state-of-the-art face recognition with the non-linear ICA features and a simple softmax output. We showed in Kanan & Cottrell (CVPR 2010) that we could achieve state-of-the-art face and object recognition with a system that used an ICA-based salience map, simulated fixations, non-linear ICA features, and a kernel-density memory. Here I briefly describe the latter.

16 One reason why this might be a good idea: our attention is automatically drawn to interesting regions in images, and our salience algorithm is drawn to the same kinds of regions. These are useful locations for discriminating one object (face, butterfly) from another.

17 Main Idea. Training phase (learning object appearances): use the salience map to decide where to look (we use the ICA salience map); memorize these samples of the image, with labels such as Bob, Carol, Ted, or Alice (we store the compressed ICA feature values).

18 Main Idea. Testing phase (recognizing objects we have learned): given a new face, use the salience map to decide where to look; compare the new image samples to stored ones, and the closest ones in memory get to vote for their label.

19 Stored memories of Bob; stored memories of Alice; new fragments. Result: 7 votes for Alice, only 3 for Bob. It's Alice!

20 Voting. The voting process is based on Bayesian updating (with Naïve Bayes). The size of each vote depends on the distance from the stored sample, using kernel density estimation. Hence NIMBLE: NIM with Bayesian Likelihood Estimation.

21 Overview of the system. The ICA features do double duty: they are combined to make the salience map, which is used to decide where to look; and they are stored to represent the object at that location.

22 NIMBLE vs. Computer Vision. Compare this to (most, not all!) computer vision systems: one pass over the image, and global features. Image → Global Features → Global Classifier → Decision.

23

24 Belief after 1 fixation; belief after 10 fixations.

25 Robust Vision. Human vision works in multiple environments; our basic features (neurons!) don't change from one problem to the next. We tune our parameters so that the system works well on bird and butterfly datasets, and then apply the system unchanged to faces, flowers, and objects. This is very different from standard computer vision systems, which are (usually) tuned to a particular domain.

26 Caltech 101: 101 different categories. AR dataset: 120 different people, with varying lighting, expression, and accessories.

27 Flowers: 102 different flower species.

28 About 7 fixations are required to achieve at least 90% of maximum performance.

29 So, we created a simple cognitive model that uses simulated fixations to recognize things. But it isn't that complicated. How does it compare to approaches in computer vision?

30 Caveats: as of mid-. Only comparing to single-feature-type approaches (no Multiple Kernel Learning (MKL) approaches). Still superior to MKL with very few training examples per category.

31 [Chart: performance vs. number of training examples]

32 [Chart: performance vs. number of training examples]

33

34 Again, best among single-feature-type systems; with one training instance per category, better than all systems.

35 More neurally and behaviorally relevant gaze control and fixation integration (people don't randomly sample images). A foveated retina. Comparison with human eye-movement data during recognition/classification of faces, objects, etc.

36 A biologically inspired, fixation-based approach can work well for image classification. Fixation-based models can match, and even exceed, some of the best models in computer vision, especially when you don't have a lot of training images.

37 Software and paper available; see the paper for more details. This work was supported by the NSF (grant #SBE) to the Temporal Dynamics of Learning Center.

38 Thanks!

39 Sparse Principal Components Analysis. We minimize: (equation) subject to the following constraint: (equation).
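The equation images did not survive the transcript. Given the three constraints stated earlier (keep information, equalize responses, minimize connections), one plausible form of the objective is sketched below; the symbols, the L1 penalty, and the variance constraint are assumptions for illustration, not necessarily the authors' exact formulation.

```latex
\min_{A}\; \mathbb{E}\,\lVert x - A u \rVert^{2}
  \;+\; \lambda \sum_{i,j} \lvert A_{ij} \rvert
\qquad \text{s.t.} \qquad \operatorname{Var}(u_i) = \sigma^{2} \;\; \forall i
```

Read this against the three constraints: the reconstruction term keeps as much information as possible, the shared-variance constraint equalizes the neural responses, and the L1 penalty on the weight matrix minimizes the number of (nonzero) connections.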

40 The SPCA model as a neural net. It is A^T that is mostly 0.

41 Results, suggesting the 1/f power spectrum of images is where this is coming from…

42 Results. The role of the sparseness parameter: recall that this reduces the number of connections…

43 Results. The role of the sparseness parameter: higher values mean fewer connections, which alters the contrast sensitivity function (CSF). This matches recent data on malnourished kids and their CSFs: lower sensitivity at low spatial frequencies, but slightly better at high frequencies than normal controls…

44 NIMBLE represents its beliefs using probability distributions. Simple nearest-neighbor density estimation is used to estimate P(fixation_t | Category = k). Fixations are combined over time using Bayesian updating.
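A sketch of one such updating step: the likelihood P(fixation | category) is estimated with a Gaussian kernel density over that category's stored features, and the posterior over categories is accumulated in log space, naive-Bayes style. The Gaussian kernel and the bandwidth value are assumptions for illustration, not NIMBLE's published settings.

```python
import numpy as np

def update_beliefs(log_prior, fixation, class_features, bandwidth=1.0):
    """One Bayesian update of category beliefs from a single fixation.

    log_prior:      length-K array of log P(category) before this fixation
    fixation:       length-d feature vector for the current fixation
    class_features: list of K arrays, each (n_k, d), the stored features
                    for category k
    Returns the normalized posterior over the K categories.
    """
    log_post = np.array(log_prior, dtype=float)
    for k, feats in enumerate(class_features):
        # Kernel density estimate of P(fixation | category k).
        d2 = np.sum((feats - fixation) ** 2, axis=1)
        like = np.mean(np.exp(-d2 / (2.0 * bandwidth ** 2)))
        # Naive-Bayes accumulation: add the log-likelihood.
        log_post[k] += np.log(like + 1e-12)
    # Renormalize in a numerically stable way.
    log_post -= np.max(log_post)
    post = np.exp(log_post)
    return post / post.sum()
```

Calling this once per fixation, feeding each posterior back in as the next prior, is the "combined over time using Bayesian updating" step: evidence from every fixation accumulates in the belief distribution.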

45

