3/22/2017 Unsupervised learning of visual representations and their use in object & face recognition. Gary Cottrell, Chris Kanan, Honghao Shan, Lingyun Zhang, Matthew Tong, Tim Marks.


1 3/22/2017 Unsupervised learning of visual representations and their use in object & face recognition. Gary Cottrell, Chris Kanan, Honghao Shan, Lingyun Zhang, Matthew Tong, Tim Marks. OSHER

2 Collaborators: Chris Kanan, Honghao Shan

3 Collaborators: Lingyun Zhang, Matt Tong, Tim Marks

4 Efficient Encoding of the world
Sparse Principal Components Analysis: a model of unsupervised learning for early perceptual processing (Honghao Shan). The model embodies three constraints: (1) keep as much information as possible, (2) while trying to equalize the neural responses, (3) and minimizing the connections.
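A minimal sketch of how the three constraints could be folded into one scalar cost. This is not the objective from the talk, whose equations are not in this transcript. The linear encoder s = A^T x, the squared-error information term, the variance-spread penalty, the L1 connection penalty, and the weights `lam` and `mu` are all assumptions made for illustration.

```python
def spca_style_cost(X, A, lam=0.1, mu=1.0):
    """Illustrative cost for a sparse-PCA-like encoder (hypothetical form).

    X: list of input samples, each a list of d floats.
    A: d x k weight matrix as a list of d rows.
    lam: assumed weight on the number/strength of connections.
    mu: assumed weight on equalizing the output responses.
    """
    d, k, n = len(A), len(A[0]), len(X)
    recon_error = 0.0
    sum_sq = [0.0] * k
    for x in X:
        # Encode: s = A^T x
        s = [sum(A[i][j] * x[i] for i in range(d)) for j in range(k)]
        # Decode: x_hat = A s
        x_hat = [sum(A[i][j] * s[j] for j in range(k)) for i in range(d)]
        # Constraint 1: keep information (low reconstruction error)
        recon_error += sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
        for j in range(k):
            sum_sq[j] += s[j] ** 2
    recon_error /= n
    # Constraint 2: equalize responses (penalize spread of mean squared outputs)
    mean_sq = [v / n for v in sum_sq]
    avg = sum(mean_sq) / k
    equalize = sum((m - avg) ** 2 for m in mean_sq)
    # Constraint 3: few connections (L1 penalty on the weights)
    connections = sum(abs(a) for row in A for a in row)
    return recon_error + mu * equalize + lam * connections
```

Raising `lam` in this sketch trades reconstruction accuracy for fewer non-zero weights.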

5 Efficient Encoding of the world leads to magno- and parvo-cellular response properties… [Figure panels: trained on grayscale images; trained on color images; trained on video cubes (axes: spatial extent, temporal extent). Labels: "Persistent, small" (Midget?); "Transient, large" (Parasol?).] This suggests that these cell types exist because they are useful for efficiently encoding the temporal dynamics of the world.

6 Efficient Encoding of the world leads to gammatone filters as in auditory nerves, using exactly the same algorithm applied to speech, environmental sounds, etc.

7 Efficient Encoding of the world
A single unsupervised learning algorithm leads to model cells with properties similar to those found in the retina when applied to natural videos, and to model cells with properties similar to those found in the auditory nerve when applied to natural sounds. One small step towards a unified theory of temporal processing.

8 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
Recursive ICA (RICA 1.0; Shan et al., 2008): alternately compress and expand the representation using PCA and ICA. ICA was modified by a component-wise nonlinearity. Receptive fields expanded at each ICA layer.

9 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
ICA was modified by a component-wise nonlinearity. Think of ICA as a generative model: the pixels are the sum of many independent random variables, so (by the central limit theorem) they are approximately Gaussian. Hence ICA prefers its inputs to be Gaussian-distributed. We apply an inverse cumulative Gaussian to the absolute value of the ICA components to “gaussianize” them.
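A rough sketch of the gaussianizing nonlinearity, under one assumption the slide does not spell out: the absolute responses are first turned into cumulative probabilities by their rank in the sample (an empirical CDF), and only then passed through the inverse cumulative Gaussian. The function name `gaussianize` is made up for this example.

```python
from statistics import NormalDist


def gaussianize(responses):
    """Map the magnitudes of filter responses onto a standard Gaussian.

    Assumption: cumulative probabilities come from ranks within this sample.
    Ties are broken by position, so equal magnitudes get adjacent values.
    """
    n = len(responses)
    magnitudes = [abs(r) for r in responses]
    order = sorted(range(n), key=lambda i: magnitudes[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    out = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n  # empirical CDF value, strictly inside (0, 1)
        out[i] = std_normal.inv_cdf(u)
    return out


# Strong responses of either sign land in the positive tail,
# weak ones in the negative tail, middling ones near zero.
example = gaussianize([-3.0, 0.1, 2.5, -0.2, 1.0])
```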

10 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
Strong responses, either positive or negative, are mapped to the positive tail of the Gaussian; weak ones, to the negative tail; ambiguous ones, to the center.

11 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
RICA 2.0: replace PCA by SPCA.

12 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
RICA 2.0 results: a multiple-layer system with center-surround receptive fields at the first layer; simple edge filters at the second (ICA) layer; spatial pooling of orientations at the third (SPCA) layer; and V2-like response properties at the fourth (ICA) layer.

13 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
V2-like response properties at the fourth (ICA) layer. These maps show strengths of connections to layer-1 ICA filters. Warm and cold colors are strong positive and negative connections, gray is weak connections, and orientation corresponds to layer-1 orientation. The left-most column displays two model neurons that show uniform orientation preference to layer-1 ICA features. The middle column displays model neurons that have non-uniform/varying orientation preference to layer-1 ICA features. The right column displays two model neurons that have location preference, but no orientation preference, to layer-1 ICA features. The left two columns are consistent with Anzai, Peng, & Van Essen. The right-hand column is a prediction.

14 Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)
Dimensionality reduction & expansion might be a general strategy of information processing in the brain. The first step removes noise and reduces complexity; the second step captures the statistical structure. We showed that retinal ganglion cells and V1 complex cells may be derived from the same learning algorithm, applied to pixels in the first case and to V1 simple cell outputs in the second. This highly simplified model of early vision is the first one that learns the RFs of all early visual layers using a consistent theory - the efficient coding theory. We believe it could serve as a basis for more sophisticated models of early vision. An obvious next step is to train higher layers, and thus make predictions about them.

15 Nice, but is it useful?
We showed in Shan & Cottrell (CVPR 2008) that we could achieve state-of-the-art face recognition with the non-linear ICA features and a simple softmax output. We showed in Kanan & Cottrell (CVPR 2010) that we could achieve state-of-the-art face and object recognition with a system that used an ICA-based salience map, simulated fixations, non-linear ICA features, and a kernel-density memory. Here I briefly describe the latter.

16 One reason why this might be a good idea…
Our attention is automatically drawn to interesting regions in images. Our salience algorithm is automatically drawn to interesting regions in images. These are useful locations for discriminating one object (face, butterfly) from another.

17 Main Idea
Training Phase (learning object appearances): Use the salience map to decide where to look. (We use the ICA salience map.) Memorize these samples of the image, with labels (Bob, Carol, Ted, or Alice). (We store the (compressed) ICA feature values.)

18 Main Idea
Testing Phase (recognizing objects we have learned): Now, given a new face, use the salience map to decide where to look. Compare new image samples to stored ones - the closest ones in memory get to vote for their label.

19 Result: 7 votes for Alice, only 3 for Bob. It’s Alice!
[Figure: stored memories of Bob; stored memories of Alice; new fragments.]
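The counting version of this idea can be sketched in a few lines. This is only the cartoon from the slide (one equal vote per fragment, cast by its single nearest stored sample), not the weighted voting the system actually uses. The feature vectors, squared Euclidean distance, and function names are assumptions for illustration.

```python
from collections import Counter


def nearest_label(fragment, memory):
    """Return the label of the stored sample closest to this fragment.

    memory: list of (feature_vector, label) pairs.
    Distance is squared Euclidean (an assumed choice).
    """
    closest = min(
        memory,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(fragment, item[0])),
    )
    return closest[1]


def classify_by_votes(fragments, memory):
    """Each new fragment casts one vote; the label with most votes wins."""
    votes = Counter(nearest_label(f, memory) for f in fragments)
    winner = votes.most_common(1)[0][0]
    return winner, votes


# Toy memory with one stored sample per person.
memory = [([0.0, 0.0], "Bob"), ([1.0, 1.0], "Alice")]
fragments = [[0.9, 1.1]] * 7 + [[0.1, -0.1]] * 3
winner, votes = classify_by_votes(fragments, memory)
```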

20 Voting
The voting process is based on Bayesian updating (with Naïve Bayes). The size of the vote depends on the distance from the stored sample, using kernel density estimation. Hence NIMBLE: NIM with Bayesian Likelihood Estimation.
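A small sketch of distance-weighted voting with Bayesian updating. The slide names the ingredients but not the details, so the Gaussian kernel, the fixed `bandwidth`, the uniform prior, and all function names here are assumptions rather than the published implementation.

```python
import math


def log_kde_likelihood(fragment, stored, bandwidth=0.5):
    """Log of a Gaussian kernel density estimate at `fragment`.

    stored: feature vectors memorized for one category.
    Closer stored samples contribute more (a bigger "vote").
    """
    d = len(fragment)
    log_terms = [
        -sum((a - b) ** 2 for a, b in zip(fragment, s)) / (2.0 * bandwidth ** 2)
        for s in stored
    ]
    m = max(log_terms)  # log-sum-exp for numerical stability
    log_sum = m + math.log(sum(math.exp(t - m) for t in log_terms))
    log_norm = math.log(len(stored)) + 0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    return log_sum - log_norm


def update_beliefs(log_beliefs, fragment, memory, bandwidth=0.5):
    """One Bayesian update: add the log-likelihood, then renormalize."""
    new = {
        k: log_beliefs[k] + log_kde_likelihood(fragment, memory[k], bandwidth)
        for k in log_beliefs
    }
    m = max(new.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in new.values()))
    return {k: v - log_z for k, v in new.items()}


def posterior(fragments, memory, bandwidth=0.5):
    """Start from a uniform prior and fold in one fragment per fixation."""
    log_beliefs = {k: -math.log(len(memory)) for k in memory}
    for fragment in fragments:
        log_beliefs = update_beliefs(log_beliefs, fragment, memory, bandwidth)
    return {k: math.exp(v) for k, v in log_beliefs.items()}
```

Treating the fragments as independent given the category is the Naïve Bayes part, which is why each fixation simply adds its own log-likelihood.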

21 Overview of the system
The ICA features do double-duty: they are combined to make the salience map, which is used to decide where to look, and they are stored to represent the object at that location.
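The slide does not say how the features are combined into a salience value. One possible reading, assumed here purely for illustration, is to score each location by how improbable its feature responses are (a sum of negative log-probabilities under a Laplace density) and then to sample fixation locations in proportion to that score. The density, the `scale` parameter, and the function names are hypothetical.

```python
import math
import random


def salience(features, scale=1.0):
    """Sum of -log p(f) over features, with p an assumed Laplace density.

    Keep scale >= 0.5 so that every term, and hence every sampling
    weight below, is non-negative.
    """
    return sum(abs(f) / scale + math.log(2.0 * scale) for f in features)


def sample_fixations(feature_map, n_fixations, rng):
    """Draw fixation locations with probability proportional to salience.

    feature_map: dict mapping (row, col) -> list of feature responses.
    rng: a random.Random instance, passed in for reproducibility.
    """
    locations = list(feature_map)
    weights = [salience(feature_map[loc]) for loc in locations]
    return rng.choices(locations, weights=weights, k=n_fixations)


feature_map = {
    (0, 0): [0.1, -0.2],   # weak responses: low salience
    (0, 1): [2.0, -1.5],   # strong responses: high salience
    (1, 0): [0.0, 0.3],
}
fixations = sample_fixations(feature_map, 5, random.Random(0))
```

The same feature vectors at the sampled locations are what would then be stored in memory, which is the double-duty described above.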

22 NIMBLE vs. Computer Vision
Compare this to (most, not all!) computer vision systems: one pass over the image, and global features (Image -> Global Features -> Global Classifier -> Decision). NIMBLE is in stark contrast to the predominant methods used in computer vision, and even many models in computational neuroscience. Speaker notes: Line 1: one-shot system. Line 2: active vision. Note that the bottom approach is primate-like (although pretty dumbed down). Note that I’m leaving out most of the details.

23 Humans make ~170,000 saccades each day

24 [Figure panels: Belief After 1 Fixation; Belief After 10 Fixations.]
Speaker notes: explain how NIMBLE uses a saliency map to acquire information and how, as it serially acquires more information over time, it becomes more confident about the correct category.

25 Robust Vision
Human vision works in multiple environments - our basic features (neurons!) don’t change from one problem to the next. We tune our parameters so that the system works well on Bird and Butterfly datasets - and then apply the system unchanged to faces, flowers, and objects. This is very different from standard computer vision systems, which are (usually) tuned to a particular domain.

26 Cal Tech 101: 101 Different Categories
AR dataset: 120 Different People with different lighting, expression, and accessories

27 Flowers: 102 Different Flower Species

28 ~7 fixations required to achieve at least 90% of maximum performance
~64 fixations required to achieve 99% of maximum accuracy. Averaged over 10 cross-validation runs.

29 So, we created a simple cognitive model that uses simulated fixations to recognize things. But it isn’t that complicated. How does it compare to approaches in computer vision?

30 Caveats: As of mid-2010. Only comparing to single-feature-type approaches (no “Multiple Kernel Learning” (MKL) approaches). Still superior to MKL with very few training examples per category.

31 [Figure: x-axis is NUMBER OF TRAINING EXAMPLES.]
Speaker notes: this is a comparison versus the best results using a single feature type, and it looks at percent improvement in performance (not absolute improvement), so it is 1 - (Nimble Perf / Best One-Desc Perf). Mention training instances on the x-axis.

32 [Figure: x-axis is NUMBER OF TRAINING EXAMPLES.]
Note again that NIMBLE performs very well using few training images, even when dealing with disguises.

34 Again, best for single feature-type systems, and for 1 training instance better than all systems.

35 More neurally and behaviorally relevant gaze control and fixation integration. People don’t randomly sample images. A foveated retina. Comparison with human eye movement data during recognition/classification of faces, objects, etc.

36 A biologically-inspired, fixation-based approach can work well for image classification. Fixation-based models can match, and even exceed, some of the best models in computer vision… especially when you don’t have a lot of training images.

37 For more details: Software and Paper Available at www.chriskanan.com
We showed that NIMBLE is not a toy cognitive model, but one with real-world applicability. This work was supported by the NSF (grant #SBE ) to the Temporal Dynamics of Learning Center, G.W. Cottrell, PI.

38 Thanks!

39 Sparse Principal Components Analysis
We minimize: [equation not captured in the transcript]. Subject to the following constraint: [equation not captured in the transcript].

40 The SPCA model as a neural net…
It is A^T that is mostly 0…

41 Results suggesting the 1/f power spectrum of images is where this is coming from…

42 Results
The role of λ: Recall this reduces the number of connections…

43 Results
The role of λ: higher λ means fewer connections, which alters the contrast sensitivity function (CSF). This matches recent data on malnourished kids and their CSFs: lower sensitivity at low spatial frequencies, but slightly better sensitivity at high spatial frequencies than normal controls…

44 NIMBLE represents its beliefs using probability distributions
Simple nearest neighbor density estimation to estimate: P(fixation_t | Category = k). Evidence is combined across fixations (over time) using Bayesian updating.
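Written out, the update takes the following form under the Naïve Bayes assumption that fixations are conditionally independent given the category. The symbols C for the category, f_t for the features sampled at fixation t, and T for the number of fixations so far are notation introduced here, not taken from the slides.

```latex
P(C = k \mid f_1, \dots, f_T)
  = \frac{P(C = k)\,\prod_{t=1}^{T} P(f_t \mid C = k)}
         {\sum_{j} P(C = j)\,\prod_{t=1}^{T} P(f_t \mid C = j)}
```

Equivalently, the belief after fixation t is the belief after fixation t-1 multiplied by P(f_t | C = k) and renormalized over categories.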
