# Alignment Visual Recognition “Straighten your paths” Isaiah.

## Presentation on theme: "Alignment Visual Recognition “Straighten your paths” Isaiah."— Presentation transcript:


Main approaches to recognition:

- Pattern recognition
- Invariants
- Alignment
- Part decomposition
- Functional description

Alignment: an approach to recognition in which an object is first aligned with an image using a small number of pairs of model and image features, and then the aligned model is compared directly against the image.

Object Recognition Using Alignment. D. Huttenlocher and S. Ullman, 1st ICCV, 1987.

The Task

- Matching a 2D view of a rigid object against a potential model.
- The viewed object can have arbitrary 3D position, orientation, and scale, and may be touching or occluded by other objects.
- First, the domain of flat rigid objects is considered:
  - The problem is not 2D, because the flat object is positioned in 3D.
  - It still has to handle occlusion and the individuation of multiple objects in an image.
- Next: extension to rigid objects in general.

Aligning a Model With the Image

For 2D recognition, only two pairs of corresponding model and image points are needed to align a model with an image. Consider two pairs, such that model point $a_m$ corresponds to image point $a_i$ and model point $b_m$ corresponds to image point $b_i$. The 2D alignment of the contours has three steps:

- The model is translated such that $a_m$ is coincident with $a_i$.
- Then it is rotated about the new $a_m$ such that the edge $a_m b_m$ is coincident with the edge $a_i b_i$.
- Finally, the scale factor is computed to make $b_m$ coincident with $b_i$.

These two translations, one rotation, and a scale factor make each unoccluded point of the model coincident with its corresponding image point, as long as the initial correspondence of the two point pairs, $(a_m, a_i)$ and $(b_m, b_i)$, is correct.

For 3D from 2D recognition, the alignment method is similar, requiring three pairs of model and image points to perform a three-dimensional transformation and scaling of the model.
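The 2D case above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes points are represented as Python complex numbers, so that rotation and scale collapse into a single complex factor, and the name `align_2d` is hypothetical.

```python
def align_2d(a_m, b_m, a_i, b_i):
    """Return a function mapping model points (complex numbers) into the image.

    Assumption: the pairs (a_m, a_i) and (b_m, b_i) are correct correspondences.
    """
    if a_m == b_m:
        raise ValueError("the two model points must be distinct")
    # Rotation and scale together are one complex factor: the ratio of the
    # image edge to the model edge.
    rot_scale = (b_i - a_i) / (b_m - a_m)

    def transform(p):
        # Move a_m to the origin, rotate and scale, then move to a_i.
        return a_i + rot_scale * (p - a_m)

    return transform


# Model: corners of a unit square. Image: the same square rotated by
# 90 degrees, scaled by 2, and shifted by (3, 1).
model = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]
image = [3 + 1j + 2j * p for p in model]

transform = align_2d(model[0], model[1], image[0], image[1])
aligned = [transform(p) for p in model]
max_error = max(abs(p - q) for p, q in zip(aligned, image))
```

Only the first two corners are used to compute the transform; the remaining corners land on their image points because the correspondence is correct.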

The Alignment Method of Recognition

Two-stage approach:

- First: the position, orientation, and scale of an object are found using a minimal amount of information (e.g. three pairs of model and image points).
- Second: the alignment is used to map the object model into image coordinates for comparison with the image.

Given an object O in 3D and its 2D image I (perhaps along with other objects), find O in the image using the alignment computation.

- Assume that a feature detector returns a set P of potentially matching model and image feature pairs.
- Since three pairs of model and image features specify a potential alignment of a model with an image, any triplet in P may specify the position and orientation of the object.
- In general, some small number of triplets will specify the correct position and orientation, and the rest will be due to incorrect matching of model and image points.
- Thus the recognition problem is: determine which triplet in P defines the alignment transformation that best maps the model into the image.

- Given a set of pairs of model and image features, P, we solve for the alignment specified by each triplet in P.
- For some triplets, there will be no way to position and orient the three model points such that they project onto their corresponding image points.
- Such triplets do not specify a possible alignment of the model and the image.
- The remaining triplets each specify a transformation mapping model points to image points.
- An alignment is scored by using the transformation to map the model edges into the image, and comparing the transformed model edges with the image edges.
- The best alignment is the one that maps the most model edges onto image edges.
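The hypothesize-and-verify loop described above can be sketched as follows. This is an illustration under stated assumptions, not the method of the paper: a planar affine map fitted to three point pairs stands in for the 3D alignment computation, alignments are scored by counting model points (rather than edges) that land on image points, and all function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def affine_from_triplet(model_pts, image_pts):
    """Fit image = [x, y, 1] @ params from three point pairs; None if degenerate."""
    m = np.hstack([model_pts, np.ones((3, 1))])  # rows are [x, y, 1]
    if abs(np.linalg.det(m)) < 1e-9:             # collinear model points
        return None
    return np.linalg.solve(m, image_pts)         # 3x2 parameter matrix


def score(params, model, image, tol=1e-6):
    """Count model points mapped to within tol of some image point."""
    mapped = np.hstack([model, np.ones((len(model), 1))]) @ params
    dists = np.linalg.norm(mapped[:, None, :] - image[None, :, :], axis=2)
    return int((dists.min(axis=1) < tol).sum())


def best_alignment(pairs, model, image):
    """pairs: candidate (model_index, image_index) matches, i.e. the set P."""
    best_score, best_params = -1, None
    for triplet in combinations(pairs, 3):
        m_idx = [m for m, _ in triplet]
        i_idx = [i for _, i in triplet]
        if len(set(m_idx)) < 3 or len(set(i_idx)) < 3:
            continue  # a point cannot be matched twice within one triplet
        params = affine_from_triplet(model[m_idx], image[i_idx])
        if params is None:
            continue  # this triplet does not specify an alignment
        s = score(params, model, image)
        if s > best_score:
            best_score, best_params = s, params
    return best_score, best_params


# Five model points; the image holds a transformed copy plus two clutter points.
model = np.array([[0, 0], [2, 0], [2, 1], [1, 3], [0, 2]], dtype=float)
a = 1.5 * np.array([[0.8, -0.6], [0.6, 0.8]])
image = np.vstack([model @ a.T + [4.0, -1.0], [[10.0, 10.0], [-3.0, 7.0]]])

# Worst case for P: no labels, so every model point pairs with every image point.
pairs = [(m, i) for m in range(len(model)) for i in range(len(image))]
best_score, best_params = best_alignment(pairs, model, image)
```

With unlabeled features the loop visits every triplet of the 35 candidate pairs, which is the search that distinctive feature labels are meant to shrink.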

Complexity

- For m model features and i image features, the number of pairs of model and image features, p, is at most $m \times i$.
- A good labeling scheme will bring p close to m (then each model point has one corresponding image point).
- Given p pairs of features, there are $\binom{p}{3}$, or an upper bound of $p^3$, triplets of pairs. Each specifies a possible alignment.
- An alignment is scored by mapping the model edges into the image.
- If the model edges are of length l, then the worst-case running time of the algorithm is $O(l\,p^3)$.
- Alignment transforms recognition from the exponential problem of finding the largest consistent set of model and image points to the polynomial problem of finding the best triplet of model and image points.
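To make the counting argument concrete, the short sketch below tabulates the number of candidate alignments. It assumes the worst case p = m × i unless a smaller p is supplied; the function name is hypothetical.

```python
from math import comb


def alignment_counts(m, i, p=None):
    """Count candidate triplets for m model features and i image features.

    If p is not given, the worst case p = m * i (no labeling) is assumed.
    """
    p = m * i if p is None else p
    return {"pairs": p, "triplets": comb(p, 3), "bound": p ** 3}


unlabeled = alignment_counts(10, 20)       # every model point pairs with every image point
labeled = alignment_counts(10, 20, p=10)   # a good labeling brings p close to m
```

Here `unlabeled["triplets"]` is 1,313,400 while `labeled["triplets"]` is 120, which illustrates why distinctive labels matter for keeping the search small.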

Alignment points

- It is important to label features distinctively in order to limit the number of pairs.
- The labels must be relatively insensitive to partial occlusion, juxtaposition, and projective distortion, while being as distinctive as possible.
- If the number of pairs, p, is kept small, then little or no search is necessary to find the correct alignment.

Multi-Scale Description

- Significant inflection points and low-curvature regions are used to segment edge contours.
- Edge segments are labeled to produce distinctive labels for use in pairing together potentially matching image and model points.
- Context: the edge contour is smoothed at various scales, and the finer-scale descriptions are used to label the coarser-scale segments.
- The coarser-scale segments are used to group finer-scale segments together.

The tree corresponding to the curvature scale-space segmentation. The contours are segmented at inflections in the smoothed curvature.

Alignment of a widget with an image that does not match the model edge contour with image edges

Left: matching a widget against an image of two widgets in the plane. Right: matching a widget against an image of a foreshortened widget.

3D from 2D Alignment

- It is shown that the position, orientation, and scale of an object in 3D can be determined from a 2D image using three pairs of corresponding model and image points under the weak perspective model.
- Under full perspective: up to four distinct solutions.
- Next:
  - The use of orthographic projection and a linear scale factor (weak perspective) as an approximation for perspective viewing.
  - The alignment method using explicit 3D rotation.
  - The alignment method can be simulated using only planar operations.

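Weak Perspective Projection

Given a set of points $P_i = (x_i, y_i, z_i)$, in the new image:

$$x'_i = s\,(r_{11} x_i + r_{12} y_i + r_{13} z_i) + t_x, \qquad y'_i = s\,(r_{21} x_i + r_{22} y_i + r_{23} z_i) + t_y$$

where $r_{jk}$ are the entries of the rotation matrix R, $s$ is the scale factor, and $(t_x, t_y)$ is the translation in the image plane.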

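Projection model: W.P. is good enough

A point $(X, Y, Z)$ is projected, with focal length $f$ and depth estimate $Z_0$:

- under perspective: $x = \dfrac{fX}{Z}$
- under weak perspective: $\hat{x} = \dfrac{fX}{Z_0}$

The error is expressed by:

$$|\hat{x} - x| = f\,|X| \left| \frac{1}{Z_0} - \frac{1}{Z} \right| \quad \text{or} \quad |\hat{x} - x| = |x| \left| \frac{Z}{Z_0} - 1 \right|$$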

Allowed depth ratios as a function of x

Error is small when:

- The measured feature is close to the optical axis, or
- The estimate for the depth is close to the real depth (average depth of the observed environment).

This supports the intuition that for images with low depth variance and for fixed regions near the center, perspective distortions are relatively small.
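The two conditions can be checked numerically. The sketch below assumes the standard pinhole model x = fX/Z for perspective and x = fX/Z0 for weak perspective, where Z0 is the depth estimate; the function names and the default focal length are assumptions made for illustration.

```python
def perspective_x(X, Z, f=1.0):
    """Image x-coordinate under full perspective (pinhole model, assumed)."""
    return f * X / Z


def weak_perspective_x(X, Z0, f=1.0):
    """Image x-coordinate when every point is assigned the depth estimate Z0."""
    return f * X / Z0


def wp_error(X, Z, Z0, f=1.0):
    """Absolute difference between the weak and full perspective projections."""
    return abs(weak_perspective_x(X, Z0, f) - perspective_x(X, Z, f))


# The error equals |x| * |Z / Z0 - 1|: it shrinks when the feature is near
# the optical axis (small |x|) or when Z is close to Z0.
X, Z, Z0 = 0.5, 11.0, 10.0
direct = wp_error(X, Z, Z0)
factored = abs(perspective_x(X, Z)) * abs(Z / Z0 - 1.0)

on_axis = wp_error(0.0, Z, Z0)        # feature on the optical axis
exact_depth = wp_error(X, Z0, Z0)     # depth estimate equals the true depth
```

Both special cases give zero error, matching the two conditions listed above.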

Alignment

- Consider three model points $a_m$, $b_m$, and $c_m$ and three corresponding image points $a_i$, $b_i$, and $c_i$, where the model points specify 3D positions $(x, y, z)$ and the image points specify positions in the image plane, $(x, y, 0)$.
- The alignment task is to find a transformation that maps the plane defined by the three model points onto the image plane, such that each model point coincides with its corresponding image point. If no such transformation exists, then the alignment process must determine this fact.
- Since the viewing direction is along the z-axis, an alignment is a transformation that positions the model such that $a_m$ projects along the z-axis onto $a_i$, and similarly for $b_m$ onto $b_i$ and $c_m$ onto $c_i$.

- The transformation consists of translations in x and y, and rotations about three orthogonal axes. There is no translation in z (all points along the viewing axis are equivalent under orthographic projection).
- First we show how to solve for the alignment assuming no change in scale, and then modify the computation to allow for a linear scale factor.
- First, translate the model points so that one point projects along the z-axis onto its corresponding image point.
- Using $a_m$ for this purpose, the model points are translated by the x and y offsets between $a_m$ and $a_i$, yielding the translated model points.
- This brings the projection of $a_m$ into the image plane into correspondence with $a_i$.

Now it is necessary to rotate the model about three orthogonal axes to align $b_m$ and $c_m$ with their corresponding image points.

- First we align one of the model edges with its corresponding image edge by rotating the model about the z-axis.
- Using the edge $a_m b_m$, we rotate the model by the angle between the image edge $a_i b_i$ and the projected model edge, yielding the rotated model points $b_m$ and $c_m$ (stage b).
- To simplify the presentation, the coordinate axes are now shifted.
- Because the projection of $b_m$ into the image plane lies along the x-axis, it can be brought into correspondence with $b_i$ by simply rotating the model about the y-axis.
- The amount of rotation is determined by the relative lengths of the model edge $a_m b_m$ and the image edge $a_i b_i$.
- If the model edge is shorter than the image edge, there is no such rotation, and hence the model cannot be aligned with the image.
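The feasibility test in the last bullet can be illustrated with a small sketch. It assumes the model edge lies parallel to the image plane before the rotation, so that rotating it out of the plane by an angle shortens its orthographic projection by the cosine of that angle. The function name is hypothetical, and this is not a reconstruction of the slide's equation (1).

```python
import math


def foreshortening_angle(model_len, image_len):
    """Angle (radians) that foreshortens an edge of model_len to image_len.

    Assumption: the edge starts parallel to the image plane, so its
    orthographic projection has length model_len * cos(angle).
    Returns None when no rotation can produce the observed length.
    """
    if model_len <= 0:
        raise ValueError("model edge must have positive length")
    ratio = image_len / model_len
    if ratio > 1.0:
        return None  # model edge shorter than image edge: no alignment
    return math.acos(ratio)


feasible = foreshortening_angle(2.0, 1.0)    # rotation that halves the projected length
infeasible = foreshortening_angle(1.0, 2.0)  # a rotation cannot lengthen the projection
```

The sign of the angle is not determined by the lengths alone, since tilting the edge toward or away from the viewer gives the same projection.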

- The model points $b_m$ and $c_m$ are rotated about the y-axis by the angle given in (1), which applies when the model edge is at least as long as the image edge (stage c).
- $c_m$ is brought into correspondence with $c_i$ by rotation about the x-axis.
- The degree of rotation is again determined by the relative lengths of model and image edges.
- In the previous case the edges were parallel to the x-axis, and therefore the length was the same as the x component of the length.
- In this case, the edges need not be parallel to the y-axis, and therefore the y component of the lengths must be used.

- Thus, the rotation about the x-axis is by the angle given in (2), which applies when the model distance is at least as long as the image distance (stage d).
- If the model distance is shorter than the image distance, there is no transformation that aligns the model and the image.
- If the rotation does not actually bring $c_m$ into correspondence with $c_i$, then there is also no alignment.
- Verification: the combination of translations and rotations can now be used to map the model into the image.

Scale

- Linear scale factor: a sixth unknown.
- The final two rotations, which align $b_m$ with $b_i$ and $c_m$ with $c_i$, are the only computations affected by a change in scale.
- The alignment of $b_m$ involves movement of $b_m$ along the x-axis, whereas the alignment of $c_m$ involves movement of $c_m$ in both the x and y directions.

Because the movement of $b_m$ is a sliding along the x-axis, only its x-component changes. The change is given by the rotation about the y-axis, as in (1). With a scale factor s this becomes equation (3). Similarly, the movement of $c_m$ in the y direction is given by the rotation about the x-axis, as in (2). With a scale factor this becomes equation (4).

The movement of $c_m$ in the x direction is given by the rotations about both the x- and y-axes. Thus, with the scale factor, the x component of $c_m$ is given by equation (5). Now we have three equations in three unknowns: s and the two rotation angles. One method to solve for s is to substitute for the angle-dependent terms in (5). From (3) we obtain equation (6).

Similarly, from (4) we obtain equation (7). Substituting (6) and (7) into (5), simplifying, and expanding out the terms, we obtain a quadratic. While there are generally two possible solutions, it can be shown that only one of the solutions will specify possible values of the two rotation angles. Having solved for the scale of the object, the final two rotations can be computed using (1) and (2), modified to account for the scale factor s.

Issues

- 3D objects:
  - Maintain a single 3-D model, use the recovered T, and align (occlusion).
  - Store a number of models and alignment keys representing different viewing positions.
  - Object-centered vs. viewer-centered.
- Handling a database with multiple objects.

Examples

Alignment

Recognizing objects by compensating for variations.

Method:

- The stored library of objects contains their shape and allowed transformations.
- Given an image and an object model, a transformation is sought that brings the object to appear identical to the image.

Alignment (cont.)

Domain:

- Suitable mainly for recognition of specific objects.

Problems:

- Complexity: recovering the transformation is time-consuming.
- Indexing: the library is searched serially.
- Non-rigidities are difficult to model.

Linear Combinations Scheme

- Relates familiar views and novel views of objects in a simple way.
- Novel views are expressed by linear combinations of the familiar views.
- This is used to develop a recognition system that uses viewer-centered representations.
- An object is modeled by a small set of its familiar views.
- Recognition involves comparing the novel views to linear combinations of the model views.

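Weak Perspective Projection

Given a set of points $P_i = (x_i, y_i, z_i)$, in the new image:

$$x'_i = s\,(r_{11} x_i + r_{12} y_i + r_{13} z_i) + t_x, \qquad y'_i = s\,(r_{21} x_i + r_{22} y_i + r_{23} z_i) + t_y$$

where $r_{jk}$ are the entries of the rotation matrix R, $s$ is the scale factor, and $(t_x, t_y)$ is the translation in the image plane.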

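x', y' belong to a 4D linear subspace!

For $P_i = (x_i, y_i, z_i)$ under weak perspective:

$$x'_i = s\,(r_{11} x_i + r_{12} y_i + r_{13} z_i) + t_x$$

In vector equation form, stacking the coordinates of all n points and writing $\mathbf{1}$ for the all-ones vector:

$$\mathbf{x}' = s\,r_{11}\,\mathbf{x} + s\,r_{12}\,\mathbf{y} + s\,r_{13}\,\mathbf{z} + t_x\,\mathbf{1}$$

Consequently, $\mathbf{x}', \mathbf{y}' \in \operatorname{span}\{\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{1}\}$.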

The Coefficients

Theorem: The coefficients satisfy two quadratic constraints, which can be derived from three images.

Proof: Consider the coefficients $a_j = s\,r_{1j}$ and $b_j = s\,r_{2j}$ ($j = 1, 2, 3$) of $\mathbf{x}'$ and $\mathbf{y}'$ in the basis $\mathbf{x}, \mathbf{y}, \mathbf{z}$.

- Since R is a rotation matrix, its row vectors are orthonormal, and therefore the following equations hold for the coefficients:

$$a_1^2 + a_2^2 + a_3^2 = b_1^2 + b_2^2 + b_3^2, \qquad a_1 b_1 + a_2 b_2 + a_3 b_3 = 0$$

- Choosing a different basis to represent the object will change the constraints.
- The constraints depend on the transformation that separates the model views.

Denote the coefficients that represent a novel view in terms of the two model views, and denote by U the rotation matrix that separates the two model views. By substituting the new coefficients we obtain new constraints, which depend on U.

- To derive the constraints, the two unknown values they depend on should be recovered. A third view can be used for this purpose.
- When a third view of the object is given, the constraints supply two linear equations in these two unknowns, and therefore, in general, their values can be recovered from the two constraints.
- This proof suggests a simple, essentially linear structure-from-motion algorithm that resembles the method used in [Ullman79, Huang and Lee89].

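Linear Combination

For two views there exist coefficients $a_1, \dots, a_4$ and $b_1, \dots, b_4$ such that

$$\mathbf{x}' = a_1 \mathbf{x}_1 + a_2 \mathbf{y}_1 + a_3 \mathbf{x}_2 + a_4 \mathbf{1}, \qquad \mathbf{y}' = b_1 \mathbf{x}_1 + b_2 \mathbf{y}_1 + b_3 \mathbf{x}_2 + b_4 \mathbf{1}$$

where $\mathbf{x}_1, \mathbf{y}_1$ are the coordinate vectors of the first model view and $\mathbf{x}_2$ is the x-coordinate vector of the second. The coefficients satisfy two quadratic constraints. To derive these, the transformation between the model views should be recovered; a third image is needed.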
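The two-view statement can be checked numerically. The sketch below is an illustration under assumptions: synthetic random 3D points, weak perspective views generated from hand-picked rotations, and a least-squares fit of the novel view against the basis formed by the x and y coordinates of the first view, the x coordinates of the second view, and a constant vector. All names are hypothetical.

```python
import numpy as np


def rotation(ax, ay, az):
    """Rotation matrix Rz @ Ry @ Rx for angles in radians."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rz @ ry @ rx


def weak_perspective_view(points, r, s=1.0, t=(0.0, 0.0)):
    """Return the x and y image coordinate vectors of n x 3 points."""
    proj = s * points @ r[:2].T + np.asarray(t)
    return proj[:, 0], proj[:, 1]


rng = np.random.default_rng(0)
points = rng.normal(size=(12, 3))  # synthetic rigid object

# Two model views and one novel view with a different pose, scale, and shift.
x1, y1 = weak_perspective_view(points, rotation(0.0, 0.0, 0.0))
x2, _ = weak_perspective_view(points, rotation(0.2, 0.5, 0.1))
xn, yn = weak_perspective_view(points, rotation(-0.4, 0.9, 0.3), s=1.3, t=(2.0, -1.0))

# Fit the novel view as a linear combination of the model-view vectors.
basis = np.column_stack([x1, y1, x2, np.ones(len(points))])
a, *_ = np.linalg.lstsq(basis, xn, rcond=None)
b, *_ = np.linalg.lstsq(basis, yn, rcond=None)
residual = max(np.abs(basis @ a - xn).max(), np.abs(basis @ b - yn).max())
```

The residual is at the level of floating-point round-off, because the two model views differ by a rotation out of the image plane and the four basis vectors therefore span the same space as x, y, z, and the constant vector.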

LC - Formally

Given: an image P and a set of stored models M.

Find: a model in M such that P matches it, i.e. such that P is a linear combination of the views of that model.

- Object: modeled by a set of images with correspondence (quadratic constraints may also be stored).
- Recognition: recovering the LC that aligns model and image.
- 3-4 points are sufficient to determine the coefficients.
- Predict the appearance and verify.
- Worst-case complexity depends on the no. of models, no. of model points, no. of image points, and no. of points used for verification.

Results The bottom 2 lines were created by linear combinations of the top 2 lines:

Results (cont.)

- a) The model pictures
- b) A linear combination
- c) True images
- d) The error between b) and c)
- e) The error between b) and another car

Recognition Operator

- Operators that are invariant for a given space of views.
- Return a constant value for all views of the object and a different value for views of other objects.
- Correspondence is needed, but there is no need for explicit recovery of the alignment coefficients.

Idea

- LC => a view is a point in $\mathbb{R}^n$.
- Object's view => belongs to the space of views spanned by the object model.
- Recognition = how far the input view is from the view space of the object.
- Project the given view onto the view space of the object and compute the distance.
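The projection idea can be sketched with an ordinary least-squares projection. This is a generic orthogonal projection onto the span of the model views, offered as an illustration; it is not the recognition operator constructed on the following slides, and the function name and synthetic data are assumptions.

```python
import numpy as np


def distance_to_view_space(model_views, view):
    """Distance from view to the span of the columns of model_views.

    model_views: n x k matrix, one column per model-view vector.
    view: length-n vector representing the input view.
    """
    coeffs, *_ = np.linalg.lstsq(model_views, view, rcond=None)
    return float(np.linalg.norm(view - model_views @ coeffs))


rng = np.random.default_rng(1)
model_views = rng.normal(size=(20, 4))  # synthetic view space of one object

same_object = model_views @ np.array([0.5, -1.0, 2.0, 0.3])  # lies in the span
other_object = rng.normal(size=20)                           # generic vector

d_same = distance_to_view_space(model_views, same_object)
d_other = distance_to_view_space(model_views, other_object)
```

A view of the object gives a distance near zero, while a view of another object generally does not, so a threshold on this distance can serve as the recognition decision.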

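Recognition Operator (cont.)

Represent each model picture by a vector $p_i$. Construct a matrix L such that $L p_i = q$ for every model picture. For a linear combination $p = \sum_i a_i p_i$ we get:

$$L p = \Big(\sum_i a_i\Big)\, q$$

So if an image is a view of the object, it is mapped to the same vector q up to a scale, for every choice of coefficients $a_i$.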

- L measures the distance of the new view from the linear space spanned by the object model views.
- Note that L does not verify any of the quadratic constraints.
- To verify these, a quadratic invariant can be constructed.

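How to find L? Complete the model vectors with additional vectors such that all the vectors are independent, and let L map each additional vector to itself. We get: views of the object are mapped to q, and noise is mapped to itself.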

For quantitative results, a ratio can be tested whose value distinguishes a view of the object (recognition) from pure noise (no recognition).

Results

- Left column: a view of the object.
- Middle column: transformed to the canonical view by the recognition operator.
- Right column: another object transformed.

Recognize !
