A neglected problem in the computational theory of mind Object Tracking and the Mind-World gap Zenon Pylyshyn Rutgers Center for Cognitive Science.


Before I begin, I would like you to see a ‘video game’ that will figure in the last part of my talk. The demonstration shows a task called “Multiple Object Tracking”: track the initially distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the “targets”. After each example I’d like you to ask yourself, “How do I do it?” If you are like most of our subjects you will have no idea, or a false idea…

Keep track of the objects that flash

How do we do it? What properties of individual objects do we use?

Going behind occluding surfaces does not disrupt tracking Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.

Not all well-defined features can be tracked: Track endpoints of these lines Endpoints move exactly as the squares did!

The basic problem of cognitive science. What determines our behavior is not how the world is, but how we represent it as being. As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent. Every naturally occurring behavioral regularity is cognitively penetrable: any information that changes beliefs can systematically and rationally change behavior.

Representation and Mind: why representations are essential. Do representations only come into play in “higher level” mental activities, such as reasoning? Even at early stages of perception many of the states that must be postulated are representations (i.e., what they are about plays a role in explanations).

Examples from vision (1): Intrapercept constraints Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.

Examples from vision (2): The Poggendorff illusion depends on perceived contours; they need not be physical edges

The rules of color mixing apply to perceived color. ‘Red light and yellow light mix to produce orange light.’ This “law” holds regardless of how the red light and yellow light are produced: the yellow may be light of 580 nanometer wavelength, or it may be a mixture of light of 530 nm and 650 nm wavelengths. So long as one light looks yellow and the other looks red, the “law” will hold and the mixture will look orange.

Another example of a classical representation

Other forms of representation… a) Lines FG, BC are parallel and equal. b) Lines EH, AD are parallel and equal. c) Lines FB, GC are parallel and equal. d) Lines EA, HD are parallel and equal. e) Vertices EF, HG, DC and AB are joined.... f) Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)} g) Part-Of{Top-Face(Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)},…

What’s wrong with this picture? What’s wrong is that the Computational Theory of Mind (CTM) is incomplete: it does not address a number of fundamental questions. It fails to specify how representations connect with what they represent. It is not enough to use English words in the representation (that has been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery). English labels and pictures may help the theorist recall which objects are being referred to, but what makes it the case that a particular mental symbol refers to one thing rather than another? How are concepts grounded? (The Symbol Grounding Problem)

Another way to look at what the Computational Theory of Mind lacks. The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly and nonconceptually: not as “whatever has properties P1, P2, P3, ...”, but as a singular term that refers directly to an individual and does not appeal to a representation of the individual’s properties. Such a reference is like a proper name or a pointer in a computer data structure, or like a demonstrative term (like this or that) in natural language. Note that in a computer a pointer does not refer via a location, despite what the term “pointer” suggests.
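The contrast between picking something out by description and referring to it directly can be sketched in code. This is only an illustration and not part of the talk: the class `Thing`, the function `pick_by_description`, and the example properties are hypothetical names. A description stops picking out an individual once its properties change, while a direct reference is unaffected.

```python
class Thing:
    """A hypothetical individual whose properties can change over time."""

    def __init__(self, color, shape):
        self.color = color
        self.shape = shape


def pick_by_description(scene, color, shape):
    """Descriptive reference: 'whatever has properties P1, P2, ...'."""
    return [t for t in scene if t.color == color and t.shape == shape]


scene = [Thing("red", "square"), Thing("green", "circle")]

# Direct reference: bind a name to the individual itself, not to its properties.
this = scene[0]

# At first the description picks out that same individual.
before = pick_by_description(scene, "red", "square")

# The individual changes its appearance.
scene[0].color = "blue"

# The description now picks out nothing; the direct reference still holds.
after = pick_by_description(scene, "red", "square")
still_same = this is scene[0]
```

The Python name `this` plays the role of the singular term here: nothing about color or shape is consulted when it is used.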

An example from personal history: why we need to pick out individual things without referring to their properties. We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram from which it would conjecture lemmas to prove. We wanted the system to be as psychologically realistic as possible, so we assumed that it had a narrow field of view and noticed only limited, spatially restricted information as it examined the drawing. This immediately raised the problem of coordinating noticings and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.

Begin by drawing a line…. L1

Now draw a second line…. L2

And draw a third line…. L3

Notice what you have so far… (noticings are local: you encode what you attend to). There is an intersection of two lines… But which of the two lines you drew are they? There is no way to indicate which individual things are seen again without a way to refer to individual (token) things. [Figure labels: L1, L2, V6]

Look around some more to see what is there… Here is another intersection of two lines… Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode? [Figure labels: L5, L2, V12]

In examining a geometrical figure one only gets to see a sequence of local glimpses

The incremental construction of visual representations requires solving a correspondence problem over time. We have to determine whether a particular individual element seen at time t is identical to another individual element seen at a previous time. This is one manifestation of the correspondence problem. Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location, or the way they are encoded or conceptualized. To do that we need the capacity to refer to token individuals (I will call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference that I call a Visual Index.
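Why encoded properties alone cannot settle the correspondence problem can be shown with a small sketch. It is an illustration under assumptions, not the geometry system mentioned in the talk: the function `candidates`, the dictionary keys, and the values `"rising"` and `"falling"` are all hypothetical. When two elements in a later glimpse fit the same stored description, the description cannot say which one was seen before.

```python
def candidates(description, glimpse):
    """Return every element of a glimpse that fits a stored description."""
    return [
        element
        for element in glimpse
        if all(element.get(key) == value for key, value in description.items())
    ]


# What was encoded about a line noticed in an earlier glimpse (hypothetical).
stored_description = {"kind": "line", "slope": "rising"}

# A later glimpse that happens to contain two lines fitting that description.
later_glimpse = [
    {"kind": "line", "slope": "rising"},
    {"kind": "line", "slope": "rising"},
    {"kind": "line", "slope": "falling"},
]

matches = candidates(stored_description, later_glimpse)
ambiguous = len(matches) > 1  # the description alone cannot settle identity
```

Adding more properties to the description only postpones the problem, since it is never known in advance which properties will be unique.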

A note about the use of labels in this example. There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex, ...). The other is to specify which individual it is, so that it is individuated and can thus be selected or bound to the argument of a predicate. The second of these is what I am concerned with, because indicating which individual it is is essential in vision. Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won’t do: one cannot literally place a tag on an object, and even if we could, it would not obviate the need to individuate and index, just as labels don’t help. Labeling things in the world is not enough because to refer to the line labeled L1 you would have to be able to think “this is line L1”, and you could not think that unless you had a way of first picking out the referent of this.

The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many “You are here” cartoons. It is also illustrated in this recent New Yorker cartoon…

The difference between descriptive and demonstrative ways of picking something out (illustrated in this New Yorker cartoon by Sipress )

‘Picking out’. Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction). This sort of picking out has been studied in psychology under the heading of focal or selective attention. Focal attention appears to pick out and adhere to objects rather than places. In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5) that I have called a visual index or a FINST. Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later). A visual index is like a pointer in a computer data structure: it allows access but does not itself tell you anything about what is being pointed to.

The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man

Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g. ‘what finger #2 is touching’) and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs).

FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that enable vision to refer to those things without doing so under a concept or a description

FINSTs and Object Files form the link between the world and its conceptualization. Object File contents are conceptual! [Diagram labels: information (causal) link; FINST; demonstrative reference link] The only nonconceptual contents in this picture are FINST indexes!

Summarizing FINSTs. A FINST is a primitive reference mechanism that normally references individual visible objects in the world. Only a small number (~4-5) of FINSTs are available at any one time. Objects are picked out and referred to without using any encoding of their properties, including their location: picking out objects is prior to encoding any properties! Indexing is nonconceptual because it does not represent an individual as a member of some conceptual category. An important function of FINST indexes is to bind arguments of visual predicates to the things in the world to which they refer. Only predicates with bound arguments can be evaluated. Since predicates are quintessential concepts, an index serves as a bridge from nonconceptual to conceptual representations. Similarly, indexes can bind arguments of motor commands, including the command to move focal attention or gaze to the indexed object, e.g., MoveGaze(x).
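The summary above can be read as a small data structure: a fixed pool of indexes, each bound to at most one object, with predicates evaluable only when every argument is bound. The following is a minimal sketch under that reading, not an implementation from the FINST literature. The class `IndexPool`, its methods, the default capacity of 4, and the example predicate `taller` are all hypothetical choices made for illustration.

```python
class IndexPool:
    """Hypothetical pool of a few indexes, each bound to at most one object."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.bound = {}  # index number -> the object itself, not a description

    def grab(self, obj):
        """Bind a free index to obj; return its number, or None if none is free."""
        for i in range(1, self.capacity + 1):
            if i not in self.bound:
                self.bound[i] = obj
                return i
        return None

    def release(self, i):
        """Free index i so that it can be bound to something else."""
        self.bound.pop(i, None)

    def evaluate(self, predicate, *indexes):
        """Evaluate a predicate only if every argument is bound to an object."""
        if any(i not in self.bound for i in indexes):
            raise LookupError("unbound argument: predicate cannot be evaluated")
        return predicate(*(self.bound[i] for i in indexes))


def taller(a, b):
    """Example two-place predicate; properties are consulted only after binding."""
    return a["height"] > b["height"]


pool = IndexPool(capacity=4)
things = [{"height": h} for h in (10, 20, 30, 40, 50)]
grabbed = [pool.grab(t) for t in things]  # the fifth request finds no free index
result = pool.evaluate(taller, 2, 1)      # both arguments are bound
```

Note the order of operations in the sketch: an object is bound first, and its properties are read only when a predicate is evaluated over the bound index.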

A note on terminology. A FINST provides a reference to an individual visible ‘thing’. I sometimes call this referent a FING, by analogy with FINST, and sometimes an object, to conform with usage in psychology; but FINGs are nonconceptual, so they do not pick out something as an object, because OBJECT is a concept. Maybe “proto-object”? I have also called it a pointer, but that erroneously suggests that it “points to” the location of an object, as opposed to the object itself. In a computer, a pointer is the name of a stored datum. I have said that a FINST is a visual demonstrative like ‘this’ or ‘that’, but that too is misleading because the reference of a demonstrative depends on the intentions of the speaker. I have also noted that a FINST is like a proper name, but that won’t do since a name can pick out something not in sensory contact, whereas a FINST can only refer to a visible item (or one that is briefly out of sight).

A quick tour of some evidence for FINSTs: the correspondence problem (mentioned earlier); the binding problem; evaluating multi-place visual predicates (recognizing multi-element patterns); operating over several visual elements at once without having to search for them first; subitizing; subset selection for search; Multiple-Object Tracking; cognizing space without requiring a spatial display in the head.


Individual objects and the binding problem. We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur; conjunction must not be obscured. This is called the binding problem. The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties.

The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate and becomes even more problematic if they are co-located – e.g., if their relation is “inside”

Pandemonium: an early architecture proposed by Oliver Selfridge in 1959. This idea continues to be at the heart of many psychological models, including ones implemented in contemporary connectionist or neural net models.

Binding as object-based. The proposal that properties are conjoined by virtue of their common location has many problems. In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation). Properties are properties of objects, not of locations, which is why properties move when objects move. Empty locations have no causal properties. The alternative to conjoining-by-location is conjoining by object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object’s properties (in its object file). If only properties of selected objects are encoded, and if those properties are recorded in object files specific to each object, then all conjoined properties will be recorded in the same object file, thus solving the binding problem.
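The point that per-object records preserve conjunctions can be illustrated with a toy comparison. This is a sketch under simplifying assumptions, not a model of early vision: the functions `unbound_features` and `object_files` and the two example scenes are hypothetical. Two scenes that contain the same features in different combinations are indistinguishable if only the features are recorded, and distinguishable if each selected object's features are kept together.

```python
def unbound_features(scene):
    """Record which features occur, but not which object has them."""
    return {feature for obj in scene for feature in obj}


def object_files(scene):
    """Record each selected object's features together in its own file."""
    return sorted(tuple(sorted(obj)) for obj in scene)


# Same four features, conjoined differently (hypothetical scenes).
scene_a = [{"red", "square"}, {"green", "circle"}]
scene_b = [{"red", "circle"}, {"green", "square"}]

same_features = unbound_features(scene_a) == unbound_features(scene_b)
same_files = object_files(scene_a) == object_files(scene_b)
```

Here `same_features` comes out true and `same_files` false: the difference between the scenes survives only in the per-object records.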

Attention spreads over perceived objects. Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to other parts of the same visual object more than to equally distant parts of different objects. [Figure panels: “Spreads to B and not C”; “Spreads to C and not B”]

Being able to pick out and refer to individual distal elements is essential for encoding patterns. Encoding relational predicates, e.g., Collinear(x,y,z,..), Inside(x, C), Above(x,y), Square(w,x,y,z), requires simultaneously binding the arguments of n-place predicates to n elements in the visual scene. Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual elements in the scene.

Several objects must be picked out at once in making relational judgments. When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties.

Several objects must be picked out at once in making relational judgments. The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual objects first. Are dots Inside-same contour? On-same contour?

A quick tour of some evidence for FINSTs: the correspondence problem; the binding problem; evaluating multi-place visual predicates (recognizing multi-element patterns); operating over several visual elements at once without first having to search for them; subitizing; subset selection; Multiple-Object Tracking; cognizing space without requiring a spatial display in the head.

More functions of FINSTs: further experimental explorations using different paradigms. Recognizing the cardinality of small sets of things: subitizing vs counting (Trick, 1994). Searching through subsets: selecting items to search through (Burkell, 1997). Selecting subsets and maintaining the selection during a saccade (Currie, 2002). Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc.). Indexes explain how children are able to acquire words for objects by ostension without suffering Quine’s Gavagai problem.

Signature subitizing phenomena only appear when objects are automatically individuated and indexed Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.

Subitizing results. There is evidence that a different mechanism is involved in enumerating small (n ≤ 4) numbers of items (even different brain mechanisms: Dehaene & Cohen, 1994). Rapid small-number enumeration (subitizing) only occurs when items are first (automatically) individuated.* Subitizing is not affected by precuing location, while counting is.* Subitizing is insensitive to distance among items.* Our explanation for what is special about subitizing is that once FINST indexes are assigned to n ≤ 4 individual objects, the objects can be enumerated without first searching for them. In fact they might be enumerated simply by counting active indexes, which is fast and accurate because it does not require visual scanning. * Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
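The proposed explanation can be caricatured as two enumeration routes. The sketch below is a toy illustration, not a model fitted to the Trick & Pylyshyn data: the function `enumerate_items`, the capacity of 4, and the use of "visits" as a stand-in for scanning effort are assumptions. Within the index capacity, the count is read off the active indexes with no serial visits; beyond it, each item is visited in turn.

```python
def enumerate_items(items, capacity=4):
    """Return (count, visits), where visits is the number of serial scanning steps."""
    if len(items) <= capacity:
        # Every item already holds an index: count the active indexes, no scanning.
        active_indexes = list(range(len(items)))
        return len(active_indexes), 0

    # Too many items to index at once: visit each item in turn and count it.
    count = 0
    visits = 0
    for _ in items:
        visits += 1
        count += 1
    return count, visits


small = enumerate_items(["a", "b", "c"])
large = enumerate_items(list("abcdefg"))
```

In this caricature the number of visits is flat up to the capacity and grows with set size beyond it, which is the qualitative pattern the slide describes.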

Subset selection for search Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258.

Subset search results: Only properties of the subset matter, but note that properties of the entire subset are taken into account simultaneously (since that is what distinguishes a feature search from a conjunction search). If the subset is a single-feature search set, search is fast and the slope (RT vs number of items) is shallow. If the subset is a conjunction search set, search takes longer and is more sensitive to the set size. As with subitizing, the distance between targets does not matter, so observers don’t seem to be scanning the display looking for the target.

The stability of the visual world entails the capacity to reidentify individuals after a saccade. There is no problem about how tactile selection can provide a stable world when you move around while keeping your fingers on the same objects, because in that case retaining individual identity is automatic. But with FINSTs the same can be true with vision, for a small number of visual objects. This is compatible with the fact that one appears to retain the relative location of only about 4 elements during saccadic eye movements (Irwin, 1996). [Irwin, D. E. (1996). Integrating information across saccadic eye movements. Current Directions in Psychological Science, 5(3), 94-100.]

The selective search experiment with a saccade induced between the late onset cues and start of search. Even with a saccade between selection and access, items can be accessed efficiently.

Demonstrating the function of FINSTs with Multiple Object Tracking (MOT). In a typical experiment, 8 simple identical objects are presented on a screen and 4 of them are briefly distinguished in some visual manner, usually by flashing them on and off. After these 4 targets are briefly identified, all objects resume their identical appearance and move randomly. The observers’ task is to keep track of the ones that had been designated as targets at the start. After a period of 5-10 seconds the motion stops and observers must indicate, using a mouse, which objects are the targets.
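The trial structure just described can be sketched as a toy simulation. Everything in it is an assumption made for illustration: the class `Dot`, the function `run_trial`, the field size, the step size, and the frame count are hypothetical, and the code is not the software used in the experiments. It contrasts two ways of answering at the end of a trial: holding on to references to the target items themselves, and consulting the locations recorded when the targets flashed.

```python
import random


class Dot:
    """A hypothetical display item; every dot looks the same."""

    def __init__(self, rng):
        self.x = rng.uniform(0.0, 100.0)
        self.y = rng.uniform(0.0, 100.0)

    def move(self, rng):
        # Small random step; the step size is an arbitrary choice.
        self.x += rng.uniform(-2.0, 2.0)
        self.y += rng.uniform(-2.0, 2.0)


def run_trial(n_objects=8, n_targets=4, n_frames=300, seed=0):
    """Return (hits_by_reference, hits_by_location) for one simulated trial."""
    rng = random.Random(seed)
    dots = [Dot(rng) for _ in range(n_objects)]
    targets = rng.sample(dots, n_targets)

    held = list(targets)                    # references to the dots themselves
    stored = [(t.x, t.y) for t in targets]  # locations recorded once, at the flash

    for _ in range(n_frames):
        for dot in dots:
            dot.move(rng)

    # Strategy 1: report whatever the held references now pick out.
    hits_by_reference = sum(1 for d in held if any(d is t for t in targets))

    # Strategy 2: report the dot now nearest to each stored location.
    picked = {
        id(min(dots, key=lambda d: (d.x - sx) ** 2 + (d.y - sy) ** 2))
        for sx, sy in stored
    }
    hits_by_location = sum(1 for t in targets if id(t) in picked)
    return hits_by_reference, hits_by_location


by_reference, by_location = run_trial()
```

The first strategy succeeds by construction, because the language runtime keeps each reference attached to its item; it stands in for, and does not explain, whatever keeps an index attached to a moving object. The second strategy is deliberately crude (the stored locations are never updated) and is included only to show that a record of where the targets were is a different thing from a reference to the targets.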

Another example of MOT: With self occlusion

Self occlusion does not seriously impair tracking

Some findings of Multiple Object Tracking. Basic finding: most people can track at least 4 targets that move randomly among identical non-target objects (even 5-year-old children can track 3 objects). Object properties do not appear to be recorded during tracking, and tracking is not improved if all objects are visually distinct (no two objects have the same color, shape or size). How is it done? We showed that it is unlikely that the tracking is done by keeping a record of the targets’ locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988). Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking. Hypothesis: FINST indexes get assigned to targets. At the end of the trial these pointers can be used to move attention to the targets and hence to select them.

What role do visual properties play in MOT? Certain properties may have to be present in order for an object to be indexed, and certain properties (probably different properties) may be required in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking. Compare this with Kripke’s distinction between properties that fix the referent of a proper name and the property that the name refers to. The former only play a role at the name’s initial “baptism.” Is there something special about location? Do we record and track properties-at-locations? Location in time and space may be essential for individuating objects, but locations need not be encoded or made cognitively available. The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property ‘P’ (where P happens to be at location L) ≠ Representing property ‘P-is-at-L’.

A way of viewing what goes on in MOT. According to Kahneman & Treisman’s Object File theory, the appearance of a new visual object causes a new Object File to be created. Each object file is associated with its respective object, presumably through a FINST Index. The object file may contain information about the object to which it is attached. But according to FINST Theory, keeping track of the object’s identity does not require the use of this information. The evidence suggests that in MOT, little or nothing is stored in the object file except maybe in special cases (e.g., when the object suddenly changes or disappears). What makes something the same object over time is that it remains connected to the same object file (by the same FINST). Thus, for vision to treat something as the same enduring individual does not require appeal to properties or concepts.

Why is this relevant to foundational questions in the philosophy of mind? According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals). But you also cannot pick out individuals with only concepts. Sooner or later you have to pick out individuals using nonconceptual causal connections between thoughts and things. The present proposal is that FINSTs provide the needed nonconceptual mechanism for individuating objects and for tracking their identity, which works most of the time in our kind of world. It relies on a natural constraint (Marr). FINST indexes provide the right sort of connection for predicating properties of the world by allowing the arguments of predicates to be bound to objects prior to the predicates being evaluated. They may thus be the basis for early vocabulary learning.

But there must be some properties that cause indexes to be grabbed! Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked. But these properties need not be represented (encoded) and used in tracking. The distinction between object properties that cause indexes to be assigned and those that are represented (in Object Files) is similar to Kripke’s distinction between properties that are needed to pick out (name) an object and those that constitute its meaning.

Effect of target properties on MOT. Changes of target properties are not reported nor even noticed during MOT. Keeping all targets at different color, size, or shape does not improve tracking. Observers do not use target speed or direction in tracking (e.g., by anticipating where the targets will be when they reappear after occlusion).

Some open questions. We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition). So what happens to the rest of the visual information? Visual information seems rich and fine-grained, while this theory only allows for the properties of 4 or 5 objects to be encoded! The present view leaves no room for nonconceptual representations whose content corresponds to the content of conscious experience. According to the present view, the only content that nonconceptual representations have is the demonstrative content of indexes that refer to perceptual objects. Question: Why do we need any more than that?

An intriguing possibility… Maybe the theoretically relevant information we take in is less than (or at least different from) what we experience. This possibility has received attention recently with the discovery of various “blindnesses” (e.g., change blindness, inattentional blindness, blindsight…) as well as the discovery of independent vision systems (e.g., recognition and motor control). The qualitative content of conscious experience may not play a role in explanations of cognitive processes. Even if unconceptualized information enters into causal processes (e.g., motor control), it may not be represented or made available to the cognitive mind, not even as a nonconceptual representation. For something to be a representation its content must figure in explanations: it must capture generalizations. It must have truth conditions and therefore allow for misrepresentation. It is an empirical question whether current proposals do (e.g., primal sketch, scenarios). Cf. Devitt: Pylyshyn’s Razor.

Vision science has always been deeply ambivalent about the role of conscious experience. Isn’t how things appear one of the things that our theories must explain? Answer: There is no a priori ‘must explain’! The content of subjective experience is a major type of evidence, but it may turn out not to be the most reliable source for inferring the relevant functional states. It competes with other types of evidence. How things appear cannot be taken at face value: it carries substantive theoretical assumptions. It also draws on many levels of processing. It was a serious obstacle to early theories of vision (Kepler). It has been a poor guide in the case of theories of mental imagery (e.g., color mixing, image size, image distances). ‘Reading X off an image’ is an illusion. It seems likely that vision science will use evidence of conscious experience the way linguistics uses evidence of grammatical intuitions: only as it is filtered through developing theories. The questions a science is expected to answer cannot be set in advance; they change as the science develops.

What next? This picture leaves many unanswered questions, but it does provide a mechanism for solving the binding problem and also explaining how mental representations could have a nonconceptual connection with objects in the world (something required if mental representations are to connect with actions)

Schema for how FINSTs function in hockey

For a copy of these slides see: http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt Or MIT Press Paperback

Index capacity and training. Daphne Bavelier’s lab (Rochester) has shown that videogame players can track a larger number of objects in MOT. Jose Rivest (York) has shown that some athletes can track more targets than non-athletes. Within individuals the main determiner of the number of targets that can be tracked is the spacing between them.

X You are now here But you are also here

Additional examples of MOT: MOT with occlusion; MOT with virtual occluders; MOT with matched nonoccluding disappearance; track endpoints of lines; track rubber-band linked boxes; track and remember ID by location; track and remember ID by name (number); track while everything briefly disappears (½ sec) and goes on moving while invisible; track while everything briefly disappears and reappears where they were when they disappeared.
