
1 Computational Architectures in Biological Vision, USC. Lecture 13: Scene Perception. Reading assignments: none.

3 How much can we remember? Memory is incomplete: how many domes does the Taj Mahal have? Most of us cannot say, despite a conscious experience of picture-perfect, iconic memorization.

5 Change blindness (Rensink, O'Regan & Clark, 1996). See the demo!

9 But… we can recognize complex scenes that we have seen before, so we do have some form of iconic memory. This lecture examines: how we perceive scenes; what the representation (that can be memorized) is; and what the underlying mechanisms are.

10 Extended Scene Perception. Attention-based analysis: scan the scene with attention, accumulating evidence from detailed local analysis at each attended location. Main issues: what is the internal representation? How detailed is memory? Do we really have a detailed internal representation at all? Gist: we can classify entire scenes or do simple recognition tasks very quickly (~120 ms), yet in that time we can shift attention only about twice!

11 Accumulating Evidence. Combine information across multiple eye fixations to build a detailed representation of the scene in memory, as in the sketch below.
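
A minimal sketch of what such a progressive buffer could look like (the patch size and data layout are assumptions for illustration, not the model under discussion):

```python
import numpy as np

def accumulate_fixations(frame, fixations, patch=64):
    """Toy 'visual buffer': paste the high-resolution patch seen at each
    fixation into an initially blank scene memory."""
    buffer = np.zeros_like(frame)
    h, w = frame.shape[:2]
    r = patch // 2
    for y, x in fixations:
        y0, y1 = max(0, y - r), min(h, y + r)
        x0, x1 = max(0, x - r), min(w, x + r)
        buffer[y0:y1, x0:x1] = frame[y0:y1, x0:x1]  # detailed local evidence
    return buffer
```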

12 Eye Movements. 1) Free examination; 2) estimate the material circumstances of the family; 3) give the ages of the people; 4) surmise what the family was doing before the arrival of the "unexpected visitor"; 5) remember the clothes worn by the people; 6) remember the positions of the people and objects; 7) estimate how long the "unexpected visitor" has been away from the family.

13 Clinical Studies. Studies of patients with visual deficits strongly argue that tight interaction between the "where" and "what" visual streams is necessary for scene interpretation. Visual agnosia: patients can see objects, copy drawings of them, etc., but cannot recognize or name them! Dorsal agnosia: cannot recognize objects if more than two are presented simultaneously, a problem with localization. Ventral agnosia: cannot identify objects.

14 These studies suggest that we bind features into objects (feature binding), we bind objects in space into some arrangement (space binding), and we thereby perceive the scene. Feature binding = "what" stream; space binding = "where/how" stream.

15 Schema-based Approaches. A schema (Arbib, 1989) describes objects in terms of their physical properties and spatial arrangements: an abstract representation of scenes, objects, actions, and other brain processes, at an intermediate level between neural firing and overall behavior. Schemas both cooperate and compete in describing the visual world:

17 VISOR (Leow & Miikkulainen, 1994): low-level features feed sub-schema activity maps (a coarse description of the components of objects); several candidate schemas then compete, and the winning schema is the percept (see the toy sketch below).
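
A toy illustration of schema cooperation and competition (the compatibility weights and the update dynamics below are assumptions for illustration, not VISOR's actual equations):

```python
import numpy as np

def schema_competition(subschema_activity, compatibility, steps=50):
    """Winner-take-all over candidate schemas, loosely in the spirit of
    VISOR. subschema_activity: activations of object-part detectors.
    compatibility: (n_schemas, n_subschemas) weight matrix."""
    support = compatibility @ subschema_activity   # cooperative evidence
    act = support.copy()
    for _ in range(steps):
        others = act.sum() - act                   # activity of rival schemas
        act = np.clip(act + 0.1 * (support - 0.2 * others - act), 0.0, None)
    return int(np.argmax(act))                     # the winner is the percept
```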

18 Biologically-Inspired Models. Rybak et al., Vision Research, 1998: "what" and "where" pathways, with a feature-based frame of reference.

20 Algorithm. At each fixation, extract the central edge orientation as well as a number of "context" edges. Transform these low-level features into more invariant "second-order" features, represented in a reference frame attached to the central edge (see the sketch below). Learning: manually select fixation points; store the sequence of second-order features found at each fixation in the "what" memory; also store the vector to the next fixation, based on the context points and expressed in the second-order reference frame.
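
A sketch of the "second-order" re-referencing step (the coordinate conventions here are assumptions, not Rybak et al.'s exact formulation):

```python
import numpy as np

def second_order_features(center_xy, center_ori, context):
    """Re-express context edges in a frame attached to the central edge:
    positions relative to the center and rotated by its orientation,
    orientations taken relative to the center's orientation."""
    c, s = np.cos(-center_ori), np.sin(-center_ori)
    rot = np.array([[c, -s], [s, c]])              # rotate by -center_ori
    feats = []
    for x, y, ori in context:                      # context: [(x, y, ori), ...]
        dx, dy = rot @ (np.array([x, y]) - np.asarray(center_xy))
        feats.append((dx, dy, (ori - center_ori) % np.pi))
    return feats  # invariant to image translation and rotation
```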

21 Algorithm. As a result, the sequence of retinal images is stored in the "what" memory, and the corresponding sequence of attentional shifts in the "where" memory.

22 Algorithm. Search mode: look for an image patch that matches one of the patches stored in the "what" memory. Recognition mode: reproduce the scanpath stored in memory and determine whether we have a match (sketched below).
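
A sketch of the two modes, under an assumed data layout for the "what" and "where" memories and an assumed patch-similarity predicate:

```python
def match_scene(image_patches, what_memory, where_memory, similar):
    """image_patches: {(x, y): patch} at candidate fixation points;
    what_memory[k]: sequence of patches stored along scanpath k;
    where_memory[k]: sequence of (dx, dy) attentional shifts;
    similar(a, b): patch-similarity predicate (assumption)."""
    for start, patch in image_patches.items():            # search mode
        for k, stored in enumerate(what_memory):
            if not similar(patch, stored[0]):
                continue
            pos, ok = start, True                          # recognition mode
            for step, (dx, dy) in enumerate(where_memory[k], start=1):
                pos = (pos[0] + dx, pos[1] + dy)
                if pos not in image_patches or \
                        not similar(image_patches[pos], stored[step]):
                    ok = False
                    break
            if ok:
                return k                                   # scene k recognized
    return None
```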

23 Robust to variations in scale, rotation, and illumination, but not to 3D pose.

24 Schill et al., JEI, 2001.

26 Dynamic Scenes. Extension to moving objects and dynamic environments. Rizzolatti: mirror neurons in monkey area F5 respond when the monkey observes an action (e.g., grasping an object) as well as when it executes the same action. Computer vision models decompose complex actions using grammars of elementary actions and precise composition rules, resembling a temporal extension of schema-based systems. Is this what the brain does?

27 Several Problems… with the "progressive visual buffer" hypothesis. Change blindness: attention seems to be required for us to perceive change in images, whereas changes could easily be detected in a visual buffer! The amount of memory required is huge! Interpretation of the buffer contents by high-level vision is very difficult if the buffer contains a very detailed representation (Tsotsos, 1990)!

28 The World as an Outside Memory. Kevin O'Regan, early 1990s: why build a detailed internal representation of the world? It would be too complex, there is not enough memory… and it may be useless: the world is the memory, and attention and the eyes are a look-up tool!

29 The "Attention Hypothesis" (Rensink, 2000). No "integrative buffer". Early processing extracts information up to "proto-object" complexity in a massively parallel manner. Attention is necessary to bind the different proto-objects into complete objects, as well as to bind objects to locations. Once attention leaves an object, the binding "dissolves". This is not a problem: it can be formed again whenever needed, by shifting attention back to the object. Only a rather sketchy "virtual representation" is kept in memory, and attention/eye movements are used to gather details as needed.

34 Back to accumulated evidence! Hollingworth et al., 2000, argue against the disintegration of coherent visual representations as soon as attention is withdrawn. Experiment: line drawings of natural scenes; change one object (the target) during a saccadic eye movement away from that object; instruct subjects to examine the scene, telling them they will later be asked questions about what was in it; also instruct subjects to monitor for object changes and press a button as soon as a change is detected. Hypothesis: attention is known to precede eye movements, so the change occurs outside the focus of attention. If subjects can notice it, some detailed memory of the object must have been retained.

35 Hollingworth et al., 2000. Subjects can see the change (26% correct overall), even if they only notice it long afterwards, on their next visit to the object.

36 Hollingworth et al. So, these results suggest that "the online representation of a scene can contain detailed visual information in memory from previously attended objects. Contrary to the proposal of the attention hypothesis (see Rensink, 2000), the results indicate that visual object representations do not disintegrate upon the withdrawal of attention."

37 Gist of a Scene. Biederman, 1981: from a very brief exposure to a scene (120 ms or less), we can already extract a lot of information about its global structure, its category (indoors, outdoors, etc.), and some of its components. "Riding the first spike": 120 ms is roughly the time it takes the first spike to travel from the retina to IT! Thorpe, van Rullen: very fast classification (down to 27 ms exposure, no mask), e.g., for tasks such as "was there an animal in the scene?"

38 Demo.

47 Gist of a Scene. Oliva & Schyns, Cognitive Psychology, 2000: investigate the effect of color on fast scene perception. Idea: rather than looking at the properties of the constituent objects of a given scene, look at the global effect of color on recognition. Hypothesis: diagnostic colors (predictive of scene category) will help recognition.

48 Color & Gist.

51 Color & Gist. Conclusion from the Oliva & Schyns study: "colored blobs at a coarse spatial scale concur with luminance cues to form the relevant spatial layout that mediates express scene recognition."
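
A minimal stand-in for such a coarse colored-blob descriptor (the grid size is an assumption; this is an illustration, not Oliva & Schyns' actual model):

```python
import numpy as np

def coarse_color_gist(img, grid=8):
    """Gist as 'colored blobs at a coarse spatial scale': the average color
    of each cell of a coarse grid over the image (H x W x 3 array)."""
    h, w, _ = img.shape
    gist = np.zeros((grid, grid, 3))
    for i in range(grid):
        for j in range(grid):
            cell = img[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            gist[i, j] = cell.reshape(-1, 3).mean(axis=0)
    return gist.ravel()
```

A scene's category could then be guessed, for instance, by nearest neighbor among stored gist vectors of labeled scenes.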

53 Combining Saliency and Gist. Torralba, JOSA-A, 2003. Idea: when looking for a specific object, gist may combine with saliency to guide attention (see the sketch below).
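
A sketch of the multiplicative combination (map names are assumptions; see Torralba, 2003, for the actual probabilistic model):

```python
import numpy as np

def guided_map(saliency, location_prior):
    """Contextual guidance sketch: bottom-up saliency is modulated
    pointwise by a prior over target locations predicted from the scene
    gist (both maps assumed non-negative and of the same shape)."""
    g = saliency * location_prior
    return g / (g.sum() + 1e-12)

# e.g., first attended location:
# y, x = np.unravel_index(np.argmax(guided_map(sal, prior)), sal.shape)
```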

61 Application: Beobots.

68 Outlook. It seems unlikely that we perceive scenes by building a progressive buffer and accumulating detailed evidence into it: it would take too many resources and be too complex to use. Rather, we may only have an illusion of a detailed representation, together with the availability of our eyes/attention to fetch details whenever they are needed: the world as an outside memory. In addition to attention-based scene analysis, we can very rapidly extract the gist of a scene, much faster than we can shift attention around. This gist may be computed by fairly simple processes that operate in parallel, and can then be used to prime memory and attention.

69 Goal-oriented Scene Understanding? Question: describe what is happening in the video clip shown on the following slide.

71 Goal for our algorithms: extract the "minimal subscene", that is, the smallest set of actors, objects and actions that describes the scene under a given task definition. E.g., if the task is "who is doing what and to whom?" and the input is the boy-on-scooter video clip, then the minimal subscene is "a boy with a red shirt rides a scooter around" (see the sketch below).
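
One possible data structure for a minimal subscene (the names and fields below are illustrative assumptions, not the lecture's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class MinimalSubscene:
    """Smallest task-relevant description of a scene: actors, objects and
    actions with attached relevance values."""
    task: str
    actors: dict = field(default_factory=dict)   # name -> relevance
    objects: dict = field(default_factory=dict)  # name -> relevance
    actions: list = field(default_factory=list)  # (actor, action, object)

scene = MinimalSubscene(task="who is doing what and to whom?")
scene.actors["boy with red shirt"] = 1.0
scene.objects["scooter"] = 0.9
scene.actions.append(("boy with red shirt", "rides", "scooter"))
```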

72 Challenge. The minimal subscene in our example takes 10 words, but the video clip has over 74 million pixel values (at 24 bits per pixel, 74 x 10^6 x 24 ≈ 1.8 billion bits once uncompressed and displayed, though with high spatial and temporal correlation).

73 Starting Point. We can attend to salient locations. Can we identify those locations? Can we evaluate the task-relevance of those locations, based on some general symbolic knowledge about how various entities relate to each other?

75 Task Influences Eye Movements. Yarbus, 1967: given one image, an eye tracker, and seven sets of instructions given to seven observers, Yarbus observed widely different eye-movement scanpaths depending on the task.

76 Yarbus, 1967: task influences human eye movements. 1) Free examination; 2) estimate the material circumstances of the family; 3) give the ages of the people; 4) surmise what the family was doing before the arrival of the "unexpected visitor"; 5) remember the clothes worn by the people; 6) remember the positions of the people and objects; 7) estimate how long the "unexpected visitor" has been away from the family. [1] A. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.

77 Towards a Computational Model. Consider the following scene (next slide), and let's walk through a schematic (partly hypothetical, partly implemented) diagram of the sequence of steps that may be triggered during its analysis.

79 Two Streams. Not where/what, but attentional/non-attentional. Attentional: local analysis of the details of various objects. Non-attentional: rapid global analysis yields a coarse identification of the setting (a rough semantic category for the scene, e.g., indoors vs. outdoors, rough layout, etc.).

80 Itti 2002 (see also Rensink, 2000): setting pathway and attentional pathway.

81 Step 1: Eyes Closed. Given a task, determine the objects that may be relevant to it using symbolic LTM (long-term memory), and store the collection of relevant objects in symbolic WM (working memory). E.g., if the task is to find a stapler, symbolic LTM may inform us that a desk is relevant. Then prime the visual system for the features of the most relevant entity, as stored in visual LTM: e.g., if the most relevant entity is a red object, boost red-selective neurons (cf. guided search and top-down attentional modulation of early vision; a sketch follows).
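
A sketch of such top-down feature biasing (the multiplicative gain model and parameter below are assumptions for illustration):

```python
import numpy as np

def bias_feature_maps(feature_maps, target_features, gain=2.0):
    """Multiply each feature channel by how diagnostic it is of the
    target, so e.g. red-selective responses are boosted when searching
    for a red object. feature_maps: {name: 2D array};
    target_features: {name: weight in [0, 1]} learned from visual LTM."""
    biased = {}
    for name, fmap in feature_maps.items():
        w = target_features.get(name, 0.0)
        biased[name] = fmap * (1.0 + gain * w)  # boost diagnostic channels
    return biased
```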

82 Navalpakkam & Itti, in press. 1. Eyes closed.

83 Step 2: Attend. The biased visual system yields a saliency map (biased toward the features of the most relevant entity); see Itti & Koch, 1998-2003, and Navalpakkam & Itti, 2003. The setting yields a spatial prior on where this entity may be, based on very rapid and very coarse global scene analysis; here we use this prior to initialize our "task-relevance map", a spatial pointwise filter applied to the saliency map (sketched below). E.g., if the scene is a beach and we are looking for humans, look around where the sand is, not in the sky! See Torralba, 2003, for a computer implementation.
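
A sketch of the resulting selection rule:

```python
import numpy as np

def next_fixation(saliency, trm):
    """Attend to the most (salient x relevant) location: the task-relevance
    map acts as a pointwise filter on the saliency map."""
    product = saliency * trm
    return np.unravel_index(np.argmax(product), product.shape)
```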

84 2. Attend.

85 Step 3: Recognize. Once the most (salient x relevant) location has been selected, it is fed (through Rensink's "nexus" or Olshausen et al.'s "shifter circuit") to object recognition. If the recognized entity was not already in WM, it is added.

86 3. Recognize.

87 Step 4: Update. As an entity is recognized, its relationships to the other entities in WM are evaluated, and the relevance of all WM entities is updated. The task-relevance map (TRM) is also updated with the computed relevance of the currently fixated entity. This ensures that we will later come back regularly to that location if it is relevant, or largely ignore it if it is irrelevant (sketched below).
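
A sketch of one possible update rule (the decay factor and the relatedness lookup are assumptions, not the actual model's equations):

```python
def update_relevance(wm, trm, entity, location, relatedness, decay=0.95):
    """Re-score every WM entity by its symbolic relation to the
    just-recognized one, then write that entity's relevance into the TRM
    so the location is later revisited (if relevant) or ignored (if not).
    wm: {entity: relevance}; relatedness(a, b) -> value in [0, 1]."""
    for other in wm:
        wm[other] = max(wm[other] * decay, relatedness(entity, other))
    trm[location] = wm.get(entity, 0.0)
    return wm, trm
```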

88 4. Update.

89 Iterate. The system keeps looping through steps 2-4. The current WM and TRM are a first approximation to what may constitute the "minimal subscene": a set of relevant spatial locations with attached object labels (see "object files"), and a set of relevant symbolic entities with attached relevance values. The loop below sketches the overall control flow.
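
A control-flow sketch of the whole loop, with all model components passed in as assumed callables (this illustrates the sequencing, not the actual implementation):

```python
import numpy as np

def scene_analysis_loop(frame, saliency_fn, recognize_fn, relatedness,
                        task_entities, n_fixations=20):
    """saliency_fn: frame -> biased saliency map;
    recognize_fn: (frame, loc) -> entity label;
    relatedness: (entity, entity) -> value in [0, 1];
    task_entities: entities deemed relevant by symbolic LTM."""
    wm = {e: 1.0 for e in task_entities}               # step 1: eyes closed
    trm = np.ones_like(saliency_fn(frame))             # uninformative prior
    for _ in range(n_fixations):
        attn = saliency_fn(frame) * trm                # step 2: attend
        loc = np.unravel_index(np.argmax(attn), attn.shape)
        entity = recognize_fn(frame, loc)              # step 3: recognize
        wm.setdefault(entity, 0.0)
        for other in wm:                               # step 4: update
            wm[other] = max(wm[other], relatedness(entity, other))
        trm[loc] = wm[entity]
    return wm, trm                                     # ~ minimal subscene
```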

90 Prototype Implementation.

91 Symbolic LTM.

92 A simple hierarchical representation of the visual features of objects.

93 The visual features of objects in visual LTM are used to bias attention top-down.

94 Once a location is attended, its local visual features are matched to those in visual LTM to recognize the attended object.

95 Learning object features and using them for biasing. Naïve: looking for salient objects. Biased: looking for a Coca-Cola can.

97 Exercising the model by asking it to find several objects.

98 Learning the TRM through sequences of attention and recognition.

99 Outlook. Open architecture: the model is not in any way dedicated to a specific task, environment, or knowledge base, just as our brain probably did not evolve specifically to allow us to drive cars. Task-dependent learning: of the TRM, the knowledge base, the object recognition system, etc., guided by an interaction between attention, recognition, and symbolic knowledge to evaluate the task-relevance of attended objects. Hybrid neuro/AI architecture: interplay between rapid, coarse, learnable global analysis (gist); symbolic knowledge-based reasoning; and local, serial, trainable attention and object recognition. Key new concepts: Minimal subscene: the smallest task-dependent set of actors, objects and actions that concisely summarizes the scene contents. Task-relevance map: a spatial map that helps focus computational resources on task-relevant portions of the scene.

