
1 What can computational models tell us about face processing?
Garrison W. Cottrell
Gary's Unbelievable Research Unit (GURU)
Computer Science and Engineering Department, Institute for Neural Computation, UCSD
Collaborators, Past, Present and Future: Ralph Adolphs, Luke Barrington, Serge Belongie, Kristin Branson, Tom Busey, Andy Calder, Eric Christiansen, Matthew Dailey, Piotr Dollar, Michael Fleming, Afm Zakaria Haque, Janet Hsiao, Carrie Joyce, Brenden Lake, Kang Lee, Joe McCleery, Janet Metcalfe, Jonathan Nelson, Nam Nguyen, Curt Padgett, Angelina Saldivar, Honghao Shan, Maki Sugimoto, Matt Tong, Brian Tran, Keiji Yamada, Lingyun Zhang

7 And now for something completely different…
The CIS goal is to "mimic nature for problem solving." My goal is to mimic nature in order to understand nature. In fact, as a cognitive scientist, I am glad when my models make the same mistakes people do, because that means the model is fitting the data better -- so maybe I have a better model! So don't look for a better problem solver here; instead, look for some insights into how people process faces.
IEEE Computational Intelligence Society, 4/12/2006

8 Why use models to understand thought?
Models rush in where theories fear to tread. Models can be manipulated in ways people cannot. Models can be analyzed in ways people cannot.

9 Models rush in where theories fear to tread
Theories are high-level descriptions of the processes underlying behavior. They are often not explicit about the processes involved, and they are difficult to reason about when no mechanisms are explicit -- they may be too high level to make explicit predictions. Theory formation itself is difficult. Using machine learning techniques, one can often build a working model of a task for which we have no theories or algorithms (e.g., expression recognition). A working model provides an "intuition pump" for how things might work, especially if it is "neurally plausible" (e.g., development of face processing -- Dailey and Cottrell). A working model may make unexpected predictions (e.g., the Interactive Activation Model and SLNT).

10 Models can be manipulated in ways people cannot
We can see the effects of variations in cortical architecture (e.g., split (hemispheric) vs. non-split models, as in Shillcock and Monaghan's word perception model). We can see the effects of variations in processing resources (e.g., variations in the number of hidden units in Plaut et al.'s models). We can see the effects of variations in environment (e.g., what if our parents were cans, cups, or books instead of humans? That is, is there something special about face expertise versus visual expertise in general? (Sugimoto and Cottrell; Joyce and Cottrell)). We can see variations in behavior due to different kinds of brain damage within a single "brain" (e.g., Juola and Plunkett; Hinton and Shallice).

11 Models can be analyzed in ways people cannot
In the following, I specifically refer to neural network models. We can do single-unit recordings. We can selectively ablate and restore parts of the network, even down to the single-unit level, to assess the contribution to processing. We can measure the individual connections -- e.g., the receptive and projective fields of a unit. We can measure responses at different layers of processing (e.g., which level accounts for a particular judgment: perceptual, object, or categorization? (Dailey et al., J. Cog. Neuro., 2002)).

12 How (I like) to build Cognitive Models
I like to be able to relate them to the brain, so "neurally plausible" models are preferred -- neural nets. The model should be a working model of the actual task, rather than a cartoon version of it. Of course, the model should nevertheless be simplifying (i.e., it should be constrained to the essential features of the problem at hand): do we really need to model the (supposed) translation invariance and size invariance of biological perception? As far as I can tell, NO! Then, take the model "as is" and fit the experimental data: 0 fitting parameters is preferred over 1, 2, or 3.

13 The other way (I like) to build Cognitive Models
Same as above, except: use them as exploratory models -- in domains where there is little direct data (e.g., no single-cell recordings in infants or undergraduates) -- to suggest what we might find if we could get the data. These can then serve as "intuition pumps." Examples: why we might get specialized face processors, and why those face processors get recruited for other tasks.

14 Outline
Review of our model of face and object processing
Some insights from modeling: What could "holistic processing" mean? Does a specialized processor for faces need to be innately specified? Why would a face area process BMWs?
Some new directions: How do we select where to look next? How is information integrated across saccades?


16 The Face Processing System
[Diagram: Pixel (Retina) Level -> Gabor Filtering (Perceptual/V1 Level) -> PCA (Object/IT Level) -> Neural Net (Category Level: Happy, Sad, Afraid, Angry, Surprised, Disgusted)]

17 The Face Processing System
[Diagram: the same pipeline trained for identity; Category Level outputs: Bob, Carol, Ted, Alice]

18 The Face Processing System
[Diagram: the same pipeline with objects added; levels: Pixel (Retina), Feature, Perceptual (V1), Object (IT), Category; Category Level outputs: Bob, Carol, Ted, Cup, Can, Book]

19 The Face Processing System
[Diagram: a variant with separate low spatial frequency (LSF) and high spatial frequency (HSF) PCA channels; Category Level outputs: Bob, Carol, Ted, Cup, Can, Book]

20 The Gabor Filter Layer
Basic feature: the 2-D Gabor wavelet filter (Daugman, 1985). These model the processing in early visual areas: convolve the image with the filters, take the magnitudes of the responses, and subsample them in a 29x36 grid.
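As a rough sketch of this layer, the following computes Gabor response magnitudes on a subsampled grid. The kernel size, wavelengths, orientations, and grid step are illustrative placeholders, not the parameters of the actual model (which subsampled in a 29x36 grid):

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """2-D Gabor wavelet (Daugman, 1985): a complex sinusoid at
    orientation theta, windowed by a Gaussian."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
    gauss = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return gauss * np.exp(2j * np.pi * xr / wavelength)

def gabor_features(image, wavelengths, orientations, step, ksize=15):
    """Apply a bank of Gabor filters, keep the response magnitudes,
    and subsample on a coarse grid of positions."""
    h, w = image.shape
    half = ksize // 2
    feats = []
    for lam in wavelengths:
        for th in orientations:
            kern = gabor_kernel(ksize, lam, th, sigma=lam / 2.0)
            for i in range(half, h - half, step):
                for j in range(half, w - half, step):
                    patch = image[i - half:i + half + 1, j - half:j + half + 1]
                    feats.append(abs(np.sum(patch * np.conj(kern))))
    return np.array(feats)

# Toy run: a 64x64 image, 2 wavelengths x 2 orientations, 4x4 grid
img = np.random.default_rng(0).random((64, 64))
feats = gabor_features(img, [4.0, 8.0], [0.0, np.pi / 2], step=16)
```

Taking the magnitude of the complex response is what makes the feature insensitive to the exact phase of the underlying edge, as in complex cells.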

21 Principal Components Analysis
The Gabor filters give us 40,600 numbers. We use PCA to reduce this to 50 numbers. PCA is like factor analysis: it finds the underlying directions of maximum variance. PCA can be computed in a neural network through a competitive Hebbian learning mechanism, so this is also a biologically plausible processing step. We suggest this leads to representations similar to those in inferior temporal cortex.
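The reduction step can be sketched with a standard SVD-based PCA; the toy dimensions below (500 Gabor numbers rather than 40,600) are stand-ins for illustration:

```python
import numpy as np

def pca_reduce(X, k=50):
    """Project each row of X onto the k directions of maximum variance.
    X: (n_samples, n_features) matrix of Gabor responses."""
    Xc = X - X.mean(axis=0)                  # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # top-k principal directions
    return Xc @ components.T, components

# e.g. 60 images, each a (toy) 500-number Gabor vector -> 50 numbers each
X = np.random.default_rng(1).normal(size=(60, 500))
codes, components = pca_reduce(X, k=50)
```

The rows of `components` are orthonormal, and the projected coordinates come out ordered by explained variance, so truncating to 50 keeps the most informative directions.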

22 How to do PCA with a neural network
(Cottrell, Munro & Zipser, 1987; Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; O'Toole et al., 1991) A self-organizing network that learns whole-object representations (features, Principal Components, Holons, eigenfaces): a layer of holons (the Gestalt layer) receiving input from the Perceptual Layer.
The next layer up extracts covariances from the previous layer -- essentially doing a principal components analysis (you can think factor analysis if you don't know what PCA is). That is, the features are sensitive to the directions of maximum variation in the Gabor filter responses. When you look at what the optimal stimulus for these units is (something that is easy to do in neural nets, relatively difficult in brains), you see rather ghostly looking faces.
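One standard way such a self-organizing network can be implemented is Sanger's generalized Hebbian rule, in which each unit subtracts out what earlier units already explain; this is a generic sketch of that idea under toy data, not the specific network from the cited papers:

```python
import numpy as np

def sanger_update(W, x, lr=0.005):
    """One step of Sanger's generalized Hebbian rule. The rows of W
    converge to the leading principal components; np.tril implements
    the competition in which each unit subtracts what earlier units
    already explain."""
    y = W @ x
    return W + lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Toy data whose variance is dominated by the first input dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 5))
for _ in range(5000):
    x = 0.1 * rng.normal(size=5)
    x[0] += 3.0 * rng.normal()
    W = sanger_update(W, x)

# The first unit's weights align with the direction of maximum variance
alignment = abs(W[0, 0]) / np.linalg.norm(W[0])
```

The Oja-style decay term keeps the weight norms bounded, so the rule is stable without any explicit normalization step.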

28 The "Gestalt" Layer: Holons
(Cottrell, Munro & Zipser, 1987; Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; O'Toole et al., 1991) A self-organizing network that learns whole-object representations (features, Principal Components, Holons, eigenfaces), with input from the Perceptual Layer.

29 Holons
They act like face cells (Desimone, 1991): the response of single units is strong despite occluding the eyes, for example; the response drops off with rotation; and some fire to my dog's face. A novel representation: distributed templates -- each unit's optimal stimulus is a ghostly looking face (template-like), but many units participate in the representation of a single face (distributed). For this audience: neither exemplars nor prototypes! They explain holistic processing. Why? If stimulated with a partial match, the firing represents votes for this template: units "downstream" don't know what caused this unit to fire. (More on this later…)
The representation is like distributed grandmother cells: each receptive field looks like a face, but each unit participates in some, but not all, of the faces. There is also a processing advantage to these representations: they do pattern completion. If part of a holon is matched, no one listening to this cell knows which half.

30 The Final Layer: Classification
(Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; Padgett & Cottrell, 1996; Dailey & Cottrell, 1999; Dailey et al., 2002) The holistic representation is then used as input to a categorization network trained by supervised learning. Output: Cup, Can, Book, Greeble, Face, Bob, Carol, Ted, Happy, Sad, Afraid, etc.
The last, and essential, layer is the classifier: a neural network trained to categorize the inputs using error-correction learning. My colleagues and I have used these successfully for many years to classify visual inputs by class, identity, and expression. Excellent generalization performance demonstrates the sufficiency of the holistic representation for recognition.

31 The Final Layer: Classification
Categories can be at different levels: basic, subordinate. Simple learning rule (~delta rule). It says (a mild lie here): add inputs to your weights (synaptic strengths) when you are supposed to be on; subtract them when you are supposed to be off. This makes your weights "look like" your favorite patterns -- the ones that turn you on. With no hidden units, there is no backpropagation of error. With hidden units, we get task-specific features (most interesting when we use the basic/subordinate distinction).
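The rule described above can be sketched for a one-layer network (no hidden units, hence no backpropagation). The sigmoid outputs, learning rate, and toy two-class data are illustrative assumptions:

```python
import numpy as np

def delta_rule_step(W, x, target, lr=0.1):
    """Delta rule for a one-layer classifier: each output unit moves its
    weights toward the input when it should be on (target 1) and away
    when it should be off (target 0)."""
    y = 1.0 / (1.0 + np.exp(-(W @ x)))       # sigmoid outputs
    return W + lr * np.outer(target - y, x)  # weights come to "look like"
                                             # each unit's favorite patterns

# Toy two-class problem: noisy versions of two opposite prototypes
rng = np.random.default_rng(0)
W = np.zeros((2, 4))
protos = np.array([[1.0, 1, 1, 1], [-1.0, -1, -1, -1]])
for _ in range(500):
    c = int(rng.integers(2))
    x = protos[c] + 0.3 * rng.normal(size=4)
    W = delta_rule_step(W, x, np.eye(2)[c])
```

After training, each weight row has come to resemble its class prototype, so the network classifies the clean prototypes correctly.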

32 Outline
Review of our model of face and object processing
Some insights from modeling: What could "holistic processing" mean? Does a specialized processor for faces need to be innately specified? Why would a face area process BMWs?
Some new directions: How do we select where to look next? How is information integrated across saccades?

33 Holistic Processing
Holistic processing refers to processing in which visual stimuli are treated "as a piece": we are unable to ignore other apparent "parts" of an image. Face processing, in particular, is thought to be holistic in nature. We are better at recognizing "Bob's nose" when it is on his face; changing the spacing between the eyes makes the nose look different; and we are unable to ignore conflicting information from other parts of a face. All of these might be summarized as "context influences perception," but the context is obligatory.

34 Who do you see? Context influences perception

35 Same/Different Task

36-38 [image-only slides]

39 These look like very different women

40 But all that has changed is the height of the eyes, right?

41 Take the configural processing test!
What emotion is being shown in the top half of the image below? Happy, Sad, Afraid, Surprised, Disgusted, or Angry? Now, what do you see? The answer is "sad." Notice how much easier it is when the misleading information is misaligned with the top half.

42 Do Holons explain these effects?
Recall that holons are templates -- each unit's optimal stimulus is a ghostly looking face. What will happen if there is a partial match? Suppose there is a holon that "likes happy faces." The mouth will match, causing this unit to fire. Units downstream have learned to associate this firing with a happy face, so they will "think" the top of the face is happier than it is…

43 Do Holons explain these effects?
Clinton/Gore: the outer part of the face votes for Gore. The nose effect: a match at the eyes votes for that template's nose. Expression/identity configural effects: with split faces, the bottom votes for one person and the top for another, but both vote for the WHOLE face; with split expressions, the bottom votes for one expression and the top for another…
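The voting intuition can be illustrated with a toy computation, treating each holon as a whole-face template whose activation is a normalized dot product with the stimulus. The random vectors standing in for faces are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
face_a = rng.normal(size=1000)   # stand-ins for whole-face patterns
face_b = rng.normal(size=1000)
face_c = rng.normal(size=1000)   # an unrelated face

def holon_response(template, stimulus):
    """A holon fires in proportion to how well the stimulus matches its
    whole-face template (normalized dot product)."""
    return float(template @ stimulus /
                 (np.linalg.norm(template) * np.linalg.norm(stimulus)))

# Composite stimulus: top half from face A, bottom half from face B
composite = np.concatenate([face_a[:500], face_b[500:]])
vote_a = holon_response(face_a, composite)  # fires from its matching top half
vote_b = holon_response(face_b, composite)  # fires from its matching bottom half
vote_c = holon_response(face_c, composite)  # no matching half: barely fires
# Downstream units cannot tell which half caused a holon to fire,
# so each firing is a vote for that template's WHOLE face.
```

Both whole-face holons fire at roughly half strength from their matching halves, while the unrelated template stays near zero; the votes are for whole faces, not halves.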

44 Attention to half an image
To model this, we give the network the ability to pay attention to half of an image by simply attenuating the part of the image that is not attended to, in line with "spotlight" theories of selective attention. [Diagram: input pixel image -> Gabor filtering -> attenuate -> attenuated Gabor pattern.] Note that if half of the data coming in is incorrect and attenuated, it will tend to make the network make errors and be less confident in its choices. This naturally leads to the correct effects, as we see in the next slide.
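The attenuation step itself is simple to sketch; the 0.5 attenuation factor below is an arbitrary illustrative choice:

```python
import numpy as np

def attend_half(image, attend_top=True, attenuation=0.5):
    """'Spotlight' attention: scale down the unattended half of the
    image before it reaches the Gabor filtering stage."""
    out = np.asarray(image, dtype=float).copy()
    mid = out.shape[0] // 2
    if attend_top:
        out[mid:] *= attenuation   # attenuate the bottom half
    else:
        out[:mid] *= attenuation   # attenuate the top half
    return out

attended = attend_half(np.ones((4, 4)))
```

Because the attenuated half still contributes (weakly) to the Gabor and holon responses, inconsistent information there can still pull the network toward errors, which is the effect being modeled.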

45 Composite vs. non-composite facial expressions (Calder et al. 2000)
Human reaction times vs. network errors: this is a direct comparison to human data obtained by Andy Calder using composites of facial expressions. Where the human confusion shows up in reaction time, ours shows up in errors. This could be due to a speed-accuracy tradeoff in the humans. (Error bars indicate one standard deviation.)

46 Is Configural Processing of Identity and Expression Independent?
Calder et al. (2000) found that adding additional inconsistent information that is not relevant to the task didn't further slow reaction times. E.g., when the task is "who is on the top?", having a different person's face on the bottom hurts your performance, but also having a different expression doesn't hurt you any more. Andy Calder did further experiments to look at whether, for example, incorrect expressions added any more to the composite deficit when the task was identity. Conditions: same identity, different expression; different identity, same expression; different identity, different expression.

47 (Lack of) Interaction between expression and identity
Human reaction time (ms) vs. network reaction time (1 - correct output): our networks show the same effects. (Cottrell, Branson, and Calder, 2002)

48 Why does this work?
We can explain these effects in our networks by examining what happens at every layer (pixel/retina, perceptual/V1, object/IT, category). Attenuated inconsistent information at the input leads to a weaker representation at the object (IT) layer, because the wrong template is only weakly activated, and so it has little impact at the category layer. Shifted (non-configural) information has little impact on the holistic representation because the misaligned half doesn't match any template.

49 Configural/holistic processing phenomena accounted for
Interference from incorrect information in the other half of the image; lack of interference from misaligned incorrect information. We have shown this for identity and expression, as well as the lack of interaction between the two. So we are able to account for the main effects of standard experiments in configural processing, as well as the subtleties introduced in Calder's recent experiments. Modeling can also help disprove inferences that people make from their data: Calder suggested from his data that there must be two representations, one for expression and one for identity, but our model accounts for his results with only one representation, so that is not a necessary conclusion.

50 Outline
Review of our model of face and object processing
Some insights from modeling: What could "holistic processing" mean? Does a specialized processor for faces need to be innately specified? Why would a face area process BMWs?
Some new directions: How do we select where to look next? How is information integrated across saccades?

51 Introduction
The brain appears to devote specialized resources to face processing. The issue: innate or learned? Our approach: computational models guided by neuropsychological and experimental data. The model: competing neural networks plus biologically plausible task and input biases. Results: the interaction between face discrimination and low visual acuity leads networks to specialize for face recognition. No innateness necessary!

52 Step one: a model with parts
Independent networks compete to perform new tasks, and a mediator rewards the winners. [Diagram: stimulus -> feature extraction units -> face processing / object processing -> decision, with the mediator assigning tasks based on their difficulty.] The question: what might cause a specialized face processor? We think that, over the course of development, a structure like this could result in a "specialist" and a "generalist," where the specialist learns and performs the more difficult tasks and the generalist learns and performs what's left. Note that there is a good deal of evidence that prosopagnosia is not a deficit in within-category discrimination, one of the earlier proposals. But it's important to note that the "face recognition unit" seems to be mandatory -- after damage, it is still used. Remember the face inversion effect.

53 Developmental biases in learning
The task: we have a strong need to discriminate between faces, but not between baby bottles -- mother's face recognition is present at 4 days (Pascalis et al., 1995). The input: low spatial frequencies, which tend to be more holistic in nature. Infant sensitivity to high spatial frequencies is low at birth, so a poor man's holistic information is embodied in LSFs (from Banks and Salapatek, 1981).

54 Neural Network Implementation
Architecture: the input stimulus is preprocessed into high- and low-spatial-frequency channels; separate expert networks compete, and their outputs are mixed by a gate network through multiplicative connections, with more error feedback going to the "winner." Learning rules: each expert's output error is normalized by an estimate of the posterior probability that the given input was drawn from that expert's distribution. The gate adjusts its parameters to maximize the log likelihood of the data under the mixture-of-Gaussians assumption. The learning rule maximizes h_i for the gates and uses h_i to modulate the error feedback each expert receives.
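The update described above can be sketched as follows. This is a minimal mixture-of-experts sketch, assuming linear experts, a unit-variance Gaussian likelihood, and illustrative shapes and learning rate; none of these settings are the model's actual values.

```python
import numpy as np

# Hedged mixture-of-experts sketch: h_i, the posterior probability that
# expert i produced the target, scales each expert's error feedback, and
# the gate is nudged toward the posteriors, raising the data's likelihood
# under the mixture-of-Gaussians assumption.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_step(x, y, experts, gate_w, lr=0.1):
    outs = [W @ x for W in experts]                 # linear expert outputs
    g = softmax(gate_w @ x)                         # gate mixing proportions
    # likelihood of the target under each expert (unit-variance Gaussian)
    lik = np.array([np.exp(-0.5 * np.sum((y - o) ** 2)) for o in outs])
    h = g * lik / np.sum(g * lik)                   # posterior responsibilities
    for i, (W, o) in enumerate(zip(experts, outs)): # winners get more feedback
        W += lr * h[i] * np.outer(y - o, x)
    gate_w += lr * np.outer(h - g, x)               # gate moves toward h
    return h
```

Called once per pattern, repeated steps tend to let the better-fitting expert claim each pattern, which is one route to the specialization measured later in the talk.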

55 Experimental methods: image data comprised 12 faces, 12 books, 12 cups, and 12 soda cans, five examples each, in 8-bit grayscale, cropped and scaled to 64x64 pixels.

56 Image Preprocessing
Pipeline: Gabor jet filter responses (512x5 elements) are reduced by PCA, applied separately per scale, to a pattern vector (8x5 elements). A Gabor filter is a Gaussian-restricted sinusoid with real and imaginary parts: the real part is a cosine logon and the imaginary part a sine logon, with scale and orientation as independent parameters. A Gabor jet is a collection of responses from filters at several scales and orientations. With PCA we get 192-element vectors, which we present as inputs for training the mixture model. With 192 principal components and 192 training-set images, the training set is most likely linearly separable: reconstruction is perfect, but generalization could be poor. It also means that either expert in the model could solve the problem alone.
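The pipeline can be sketched roughly as below. The filter bank (two scales, four orientations), kernel sizes, and component counts are placeholders for illustration, not the model's actual 512x5 / 8x5 configuration.

```python
import numpy as np

# Sketch of the Gabor-jet + PCA preprocessing: filter an image with a
# small bank of Gabor magnitudes, then reduce dimensionality with PCA.

def gabor_kernel(size, wavelength, theta):
    ys, xs = np.mgrid[-size // 2:size // 2 + 1, -size // 2:size // 2 + 1]
    xr = xs * np.cos(theta) + ys * np.sin(theta)
    env = np.exp(-(xs ** 2 + ys ** 2) / (2 * wavelength ** 2))
    return env * np.cos(2 * np.pi * xr / wavelength)  # real (cosine) part

def gabor_jet(img, scales=(4, 8), orientations=4):
    responses = []
    for wl in scales:
        for k in range(orientations):
            kern = gabor_kernel(4 * wl, wl, k * np.pi / orientations)
            # magnitude of the filter response via FFT convolution
            f = np.fft.fft2(img) * np.fft.fft2(kern, img.shape)
            responses.append(np.abs(np.fft.ifft2(f)).ravel())
    return np.array(responses)

def pca_reduce(X, k):
    """Project rows of X onto their top-k principal components."""
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T
```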

57 Task Manipulation
Trained networks for two types of task: superordinate four-way classification (book? face?) and subordinate classification within one class with simple classification for the others (book? John?). Network output units: Task 1 (superordinate): face, can, book, cup. Task 2 (subordinate): can, book, cup, Bob, Carol, Ted, ..., Alice. The gate network generates a pair of mixing parameters; ideally, we want each expert to specialize strongly on a sizable subset of the training set. Note that the variance of a gate network output is maximized when it is 1.0 for half of the training patterns and 0.0 for the other half. Since the induced partition was sensitive to training parameters, we searched for the parameters that maximized this variance; this is an objective means of setting the parameters, not giving the network knowledge of faces vs. books, etc. More recently, we experimented with adding a variance-maximization term to the objective function for the gate network. This works, generating solutions that are not so sensitive to the training parameter values.

58 Input spatial frequency manipulation
Used two input pattern formats: (1) each module receives the same full pattern vector [a b c d e]; (2) one module receives the low spatial frequencies and the other the high, with the middle band split between them ([a b 0.5c] and [0.5c d e]).

59 Measuring specialization
Train the network and record how the gate network's outputs change with each pattern (e.g., gate weights of 0.2/0.8 for Net 1/Net 2 on one pattern vs. 0.7/0.3 on another).

60 Specialization Results
[Figure: average gating-unit weight for each module, under two input conditions (all frequencies: Module 1 vs. Module 2; hi/lo split: high-frequency module vs. low-frequency module) and three tasks: four-way classification (Face, Book, Cup, Can?), book identification (Face, Cup, Can, Book1, Book2, ...?), and face identification (Book, Cup, Can, Bob, Carol, Ted, ...?).]

61 Modeling prosopagnosia
We can "damage" the specialized network: damage to the high-spatial-frequency network degrades object classification, while damage to the low-spatial-frequency network degrades face identification.

62 Conclusions so far… There is a strong interaction between task and spatial frequency in the degree of specialization for face processing. The model suggests that the infant's low visual acuity and the need to discriminate between faces, but not between other objects, could "lock in" a special face processor early in development. => General mechanisms (competition, known innate biases) could lead to a specialized face processing "module"; no need for an innately-specified processor. The good: we see a strong prosopagnosic effect when the face expert is damaged. The bad: the double dissociation is not very strong. This is partially explained by the simplicity of the task, and by the fact that both experts learn about all patterns early in training, giving the "face expert" some knowledge of the other patterns; weight decay might alleviate this. The main reason it works at all, we think, is the "local expert" idea. Nevertheless, the experiments show how task effects during learning can produce specialization.

63 Outline Review of our model of face and object processing
Some insights from modeling: What could "holistic processing" mean? Does a specialized processor for faces need to be innately specified? Why would a face area process BMWs? Some new directions: How do we select where to look next? How is information integrated across saccades?

64 Are you a perceptual expert?
Take the expertise test!!!** "Identify this object with the first name that comes to mind." **Courtesy of Jim Tanaka, University of Victoria

65 “2002 BMW Series 7” - Expert! “Car” - Not an expert
"Car" - Not an expert. "2002 BMW Series 7" - Expert!

66 “Indigo Bunting” - Expert!
"Bird" or "Blue Bird" - Not an expert. "Indigo Bunting" - Expert!

67 “George Dubya”- Expert!
"Face" or "Man" - Not an expert. "George Dubya" - Expert!

68 Greeble Experts (Gauthier et al. 1999)
Subjects trained over many hours to recognize individual Greebles. Activation of the FFA increased for Greebles as the training proceeded.

69 The visual expertise mystery
If the so-called "Fusiform Face Area" (FFA) is specialized for face processing, then why would it also be used for cars, birds, dogs, or Greebles? Our view: the FFA is an area associated with a process, fine-level discrimination of homogeneous categories. But the question remains: why would an area that presumably starts as a face area get recruited for these other visual tasks? Surely they don't share features, do they? Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society

70 Solving the mystery with models
Main idea: there are multiple visual areas that could compete to be the Greeble expert, "basic" level areas and the "expert" (FFA) area. The expert area must use features that distinguish similar-looking inputs; that's what makes it an expert. Perhaps these features will be useful for other fine-level discrimination tasks. We will create basic-level models, trained to identify an object's class, and expert-level models, trained to identify individual objects. Then we will put them in a race to become Greeble experts, and deconstruct the winner to see why it won. Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society

71 Model Database A network that can differentiate faces, books, cups and
We start with the Cottrell and Metcalfe face images, the Dailey and Cottrell object images, and Greebles. One kind of network is trained to output cup, can, book, face; another is trained to output cup, can, book, Bob, Carol, Ted, and Alice. A network that can differentiate faces, books, cups, and cans is a "basic level network." A network that can also differentiate individuals within ONE class (faces, cups, cans, OR books) is an "expert."

72 Model. Pretrain two groups of neural networks, each with a hidden layer, on different tasks: experts (outputs such as can, cup, book, Bob, Carol, Ted) and non-experts (outputs can, cup, book, face). Then compare their abilities to learn a new individual Greeble classification task (Greeble1, Greeble2, Greeble3).

73 Expertise begets expertise
[Graph: amount of training required to be a Greeble expert vs. training time on the first task.] The x-axis is the amount of training on the first task, either basic-level processing (cup, can, book, face) or expert-level processing; we trained four kinds of experts (face, book, can, and cup). The y-axis is the amount of training required to become a Greeble expert. Learning to individuate cups, cans, books, or faces first leads to faster learning of Greebles (can't try this with kids!!!), and the more expertise, the faster the learning of the new task. This is somewhat strange: neural networks usually show overtraining effects, but here we get positive transfer. Hence, in a competition with the object area, the FFA would win. If our parents were cans, the FCA (Fusiform Can Area) would win.
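A toy version of the race can be sketched as follows. All of the data, layer sizes, and label sets here are synthetic stand-ins (none come from the actual experiments): pretrain a small backprop net on one identification task, then count epochs to criterion on a new "Greeble" task, with or without the pretrained hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(W1, W2, X, Y, lr=0.5, criterion=0.05, max_epochs=5000):
    """Train a one-hidden-layer net until MSE < criterion; return epochs used."""
    for epoch in range(1, max_epochs + 1):
        H = np.tanh(X @ W1)
        P = 1.0 / (1.0 + np.exp(-(H @ W2)))       # sigmoid outputs
        err = Y - P
        if np.mean(err ** 2) < criterion:
            return epoch
        d_out = err * P * (1 - P)
        d_hid = (d_out @ W2.T) * (1 - H ** 2)
        W2 += lr * H.T @ d_out / len(X)           # in-place updates, so the
        W1 += lr * X.T @ d_hid / len(X)           # pretrained weights carry over
    return max_epochs

X_pre, Y_pre = rng.standard_normal((16, 20)), np.eye(4)[rng.integers(0, 4, 16)]
X_new, Y_new = rng.standard_normal((16, 20)), np.eye(3)[rng.integers(0, 3, 16)]

W1 = rng.standard_normal((20, 30)) * 0.1          # shared hidden layer
train(W1, rng.standard_normal((30, 4)) * 0.1, X_pre, Y_pre)
epochs_expert = train(W1, rng.standard_normal((30, 3)) * 0.1, X_new, Y_new)

W1_fresh = rng.standard_normal((20, 30)) * 0.1    # no pretraining
epochs_fresh = train(W1_fresh, rng.standard_normal((30, 3)) * 0.1, X_new, Y_new)
```

With random synthetic data the transfer advantage is not guaranteed on any particular run, so the comparison should be read qualitatively; the structured stimuli in the actual experiments are what produce the reliable positive transfer.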

74 Entry Level Shift: subordinate RT decreases with training (RT = uncertainty of response = 1.0 - max(output)). [Graphs: RT vs. number of training sessions, subordinate and basic curves, for human data and network data.] These graphs show the downward entry-level shift in reaction times for the humans, and we get the same shift in our networks.

75 How do experts learn the task?
Expert-level networks must be sensitive to within-class variation: their representations must amplify small differences. Basic-level networks must ignore within-class variation: their representations should reduce differences. So why is it better to be an expert? We can now deconstruct the networks to see why. An expert's job is to be sensitive to variation WITHIN a class (it has to respond differently to all of the faces), and this variability may generalize to new inputs. So we looked at the variance of the hidden layer responses to stimuli.

76 Observing hidden layer representations
Principal Components Analysis on hidden unit activations: PCA allows us to reduce the dimensionality (to 2) and plot the representations. We can then observe how tightly clustered stimuli are in a low-dimensional subspace. We expect basic-level networks to separate classes, but not individuals; we expect expert networks to separate both classes and individuals.
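The analysis can be sketched as below, with random activations standing in for the hidden layers and the helper names invented for illustration: project onto the top two principal components and compare each class's spread.

```python
import numpy as np

# Project hidden-unit activations onto their top two principal components,
# then measure how spread out each class's items are in that subspace.

def pca2(acts):
    """Project row vectors onto their top two principal components."""
    X = acts - acts.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

def class_spread(points, labels):
    """Total variance of each class's items in the projected space."""
    labels = np.array(labels)
    return {c: float(points[labels == c].var(axis=0).sum())
            for c in set(labels.tolist())}

rng = np.random.default_rng(3)
acts = rng.standard_normal((12, 20))              # stand-in activations
labels = ["face"] * 6 + ["greeble"] * 6
spread = class_spread(pca2(acts), labels)
```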

77 Subordinate level training magnifies small differences within object representations
[Figure: hidden-layer representations at 1, 80, and 1280 epochs for face, Greeble, and basic networks.]

78 Greeble representations are spread out prior to Greeble Training
[Figure: Greeble representations in the face-expert and basic networks prior to Greeble training.]

79 Variability Decreases Learning Time
[Figure: scatter of Greeble learning time against Greeble variance prior to learning Greebles (r = ).]

80 Examining the Net’s Representations
We want to visualize "receptive fields" in the network, but the Gabor magnitude representation is noninvertible. We can, however, learn an approximate inverse mapping: we used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel. Projecting each hidden unit's weight vector into image space with the same mapping then visualizes its "receptive field."
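The approximate-inverse idea can be sketched with synthetic data (all sizes here are arbitrary): fit a least-squares linear map from the feature space back to pixels, then push a hidden unit's weight vector through it to get a "receptive field" image.

```python
import numpy as np

# The Gabor-magnitude representation is non-invertible, so we fit a
# least-squares linear map from feature space to pixel space and use it
# to project hidden-unit weight vectors into image space.

rng = np.random.default_rng(1)
n_imgs, n_feat, n_pix = 50, 40, 64 * 64

F = rng.standard_normal((n_imgs, n_feat))       # feature vector per image
I = rng.standard_normal((n_imgs, n_pix))        # corresponding pixel images

# Best linear combination of feature components for each pixel:
M, *_ = np.linalg.lstsq(F, I, rcond=None)       # shape (n_feat, n_pix)

w_hidden = rng.standard_normal(n_feat)          # a hidden unit's weight vector
rf_img = (w_hidden @ M).reshape(64, 64)         # its "receptive field" image
```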

81 Two hidden unit receptive fields
AFTER TRAINING AS A FACE EXPERT / AFTER FURTHER TRAINING ON GREEBLES: HU 16, HU 36. NOTE: These are not face-specific!

82 Controlling for the number of classes
We obtained 13 classes from hemera.com: 10 of these are learned at the basic level; 10 faces, each with 8 expressions, make up the expert task; and 3 (lamps, ships, swords) are used for the novel expertise task.

83 Results: Pre-training
The new initial tasks are of similar difficulty (in previous work, the basic-level task was much easier). [Figure: learning curves for the 10 object classes and the 10 faces.]

84 Number of training epochs on faces or objects
Results: as before, experts still learned new expert-level tasks faster. [Graph: number of epochs to learn swords after learning faces or objects, plotted against the number of training epochs on faces or objects.]

85 Outline Review of our model of face and object processing
Some insights from modeling: What could "holistic processing" mean? Does a specialized processor for faces need to be innately specified? Why would a face area process BMWs? Some new directions: How do we select where to look next? How is information integrated across saccades?

86 Issues I haven’t addressed…
Development - what is the trajectory of the system from infant to adult? How do representations change over development? How do earlier acquired representations differ from later ones? I.e., what is the representational basis of Age of Acquisition effects? How do representations change based on familiarity? Does the FFA participate in basic level processing? Dynamics of expertise: eye movements. How do they change with expertise? Are there visual routines for different tasks? How much does the stimulus influence eye movements? I.e., how flexible are the routines? How do we decide where to look next? How are samples integrated across saccades?

87 How do we decide where to look next?
Both bottom-up and top-down influences play a role: local stimulus complexity == "interestingness," and task requirements mean looking for discriminative features. We've looked at at least two ideas: Gabor filter response variance, and mutual information between the features and the categories.
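One way the Gabor-variance idea could look in code. The filter bank, kernel window, and the choice to take variance across orientations are assumptions of this sketch, not the exact published operator:

```python
import numpy as np

# Interest operator sketch: at each pixel, compute the variance of a bank
# of oriented Gabor magnitude responses; high variance marks
# "interesting" locations.

def interest_map(img, orientations=4, wavelength=6):
    ys, xs = np.mgrid[-12:13, -12:13]
    responses = []
    for k in range(orientations):
        th = k * np.pi / orientations
        xr = xs * np.cos(th) + ys * np.sin(th)
        kern = (np.exp(-(xs ** 2 + ys ** 2) / (2 * wavelength ** 2))
                * np.cos(2 * np.pi * xr / wavelength))
        f = np.fft.fft2(img) * np.fft.fft2(kern, img.shape)
        responses.append(np.abs(np.fft.ifft2(f)))
    return np.var(np.stack(responses), axis=0)   # variance across orientations

def top_interest_points(img, k=5):
    """Return the (row, col) coordinates of the k highest-variance points."""
    m = interest_map(img)
    flat = np.argsort(m.ravel())[::-1][:k]
    return np.column_stack(np.unravel_index(flat, m.shape))
```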

88 Interest points created using Gabor filter variance

89 Where do we look next #2: Mutual Information
Ullman et al. (2002) proposed that features of intermediate complexity are best for classification. They used mutual information between image patches and categories to find patches that were good discriminators (faces vs. non-faces; cars vs. non-cars), and found that medium-sized, medium-resolution patches were best for these tasks. Our question: what features are best for subordinate-level classification tasks that need expertise, like facial identity recognition? We found that traditional features such as eyes, noses, and mouths are informative for identity ONLY in the context of each other, i.e., in a configuration. Conclusion: holistic processing develops because "it is good."

90 Ullman et al. 2002. Features of intermediate complexity (size and resolution) are best for classification; these were determined by computing the mutual information between an image patch and the class.

91 Facial Identity Classification
Will features that are good for telling faces from objects be good for identification? We expect that more specific features will be needed for face identification: wouldn't something face-like (a blurred face image? a nose?) be different from Gary-like? Can you tell several people apart just by their noses?

92 Data Set. We used 36 frontal images of 6 individuals (6 images each) from FERET [Phillips et al., 1998]. The images were aligned, and Gabor filter responses were extracted from rectangular grids.

93 Patches. Rectangular patches of different centers, sizes, and Gabor filter frequencies were taken from the images. Because we have normalized our images, we can define corresponding patches in image coordinates.

94 Corresponding Patches
Patches are defined as "corresponding" when they are in the same position, size, and Gabor filter frequency across images. If a "Fred patch" matches the corresponding patch in another image, this is evidence for the "Fredness" of the new image; we can then use some measure of how many Fred patches match, plus a threshold, to decide whether this face is "Fred." Because we have normalized our images, we can define corresponding patches in image coordinates.

95 Mutual Information. How useful the patches were for face identification was measured by mutual information: I(C,F) = H(C) - H(C|F), where C and F are binary variables standing for class and feature: C=1 when the image is of the individual, and F=1 when the patch is present in the image.
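The slide's quantity for two binary variables can be computed directly. The joint distribution below is a made-up example (a patch "present" in 5 of the 6 target images and 3 of the 30 others), not data from the study:

```python
import math

# Direct computation of I(C;F) = H(C) - H(C|F) for binary C and F.

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """joint[c][f] = P(C=c, F=f) for c, f in {0, 1}."""
    pc = [sum(joint[c]) for c in (0, 1)]                 # marginal P(C)
    pf = [joint[0][f] + joint[1][f] for f in (0, 1)]     # marginal P(F)
    h_c = entropy(pc)
    # H(C|F) = sum_f P(F=f) * H(C | F=f)
    h_c_given_f = sum(
        pf[f] * entropy([joint[c][f] / pf[f] for c in (0, 1)])
        for f in (0, 1) if pf[f] > 0
    )
    return h_c - h_c_given_f

# Hypothetical counts over 36 images:
joint = [[27 / 36, 3 / 36],   # C=0: patch absent, patch present
         [1 / 36, 5 / 36]]    # C=1: patch absent, patch present
mi = mutual_information(joint)
```

Independent C and F give mutual information 0; a perfectly predictive patch gives I(C;F) = H(C).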

96 Implementation For patch i in image j
C=1 for the 6 images belonging to the same individual as image j, C=0 for the other 30. F=1 for images in which patch i is present, F=0 for the rest. The best threshold for "the presence of a patch" was found by brute-force search, from -0.9 to 0.9 in steps of 0.1. Mutual information was calculated over every patch in each image; this measures how well each patch predicts the identity of the image. An average was then taken across corresponding patches, to measure how well a patch of a given center, size, and frequency predicts face identity.

97 Results: Best Patches. The 6 patches with the highest mutual information. Frequencies 1 to 5 run from the highest Gabor filter frequency to the lowest. The patches are similar to each other because we do not eliminate redundancy among them (they are not independent).

98 Conclusions so far… Counter to intuition, local features such as eyes and mouths are, by themselves, not very informative for face identity. Local features need to be processed within medium-sized face areas for identification, where they sit in a particular configuration with other features. This may explain why holistic processing has developed for face processing: simply because it is good, or even necessary, for identification.

99 Integration across saccades
Now, given these patches sampled from an image, what do we do with them? Joyca LaCroix's (2004) Natural Input Memory (NIM) model of recognition memory: at study, sample the image at random points and store the patches; at test, sample the new image at random points and count how many stored patches fall inside a ball of radius R around each new patch. The average of these counts is the recognition score. This is a kernel density estimation model, like the GCM, but the exemplars are patches, not whole images. I.e., the "NIM" answer to the integration problem is: don't integrate! NIM is a natural partner to our eye movement modeling.
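A bare-bones version of the NIM score described above, using random vectors as stand-ins for sampled image patches (the radius R and all sizes are arbitrary):

```python
import numpy as np

# NIM-style recognition sketch: store feature patches at study; at test,
# count stored patches within radius R of each test patch and average.

rng = np.random.default_rng(2)

def study(image_patches, memory):
    memory.extend(image_patches)                # just store the fragments

def recognition_score(test_patches, memory, R=1.0):
    mem = np.array(memory)
    counts = [np.sum(np.linalg.norm(mem - p, axis=1) <= R)
              for p in test_patches]
    return float(np.mean(counts))               # avg stored neighbors within R

memory = []
studied = rng.standard_normal((20, 8))
study(list(studied), memory)

# "Old" patches are small perturbations of studied ones; "new" patches
# are far from everything stored.
old_score = recognition_score(studied[:5] + 0.05 * rng.standard_normal((5, 8)),
                              memory)
new_score = recognition_score(rng.standard_normal((5, 8)) * 10, memory)
```

Old items land near their stored fragments and score high; unrelated items find no neighbors within R, which is the model's recognition signal.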

100 Implications of the NIM model
What would this mean for expertise? Lots of experience -> lots of fragments -> better discrimination. Familiarity also means lots of fragments, under many lighting conditions, all associated with one name. Augmenting the model with an interest operator, e.g., looking at high-variance points on the face (see the earlier slides and Yamada & Cottrell, 1994), could easily lead to parts-based representations!

101 Wrap up. We are able to explain a variety of results in face processing: we have a mechanistic way of talking about "holistic processing"; how a specialized area might arise for faces, and why low spatial frequencies (LSFs) appear to be important in face processing (specialization model: LSFs -> better learning and generalization); and why a face area would be recruited to be a Greeble area: expert-level (fine discrimination) processing leads to highly differentiated features that are useful for other discrimination tasks. And we have plans to go beyond simple passive recognition models…

102 END

