Object Recognition using Local Invariant Features
Claudio Scordino
scordino@di.unipi.it
July 5th 2006

Object Recognition
- Widely used in industry for
  - Inspection
  - Registration
  - Manipulation
  - Robot localization and mapping
- Current commercial systems
  - Correlation-based template matching
  - Computationally infeasible when object rotation, scale, illumination and 3D pose vary
  - Even more infeasible with partial occlusion
- Alternative: Local Image Features

Local Image Features
- Unaffected by
  - Nearby clutter
  - Partial occlusion
- Invariant to
  - Illumination
  - 3D projective transforms
  - Common object variations
- ...but, at the same time, sufficiently distinctive to identify specific objects among many alternatives!

Related work
- Grouping of line segments, edges and regions
  - Detection not good enough for reliable recognition
- Peak detection in local image variations
  - Example: Harris corner detector
  - Drawback: image examined at only a single scale
  - Different key locations as the image scale changes
- Eigenspace matching, color and receptive field histograms
  - Successful on isolated objects
  - Cannot be extended to cluttered and partially occluded images

SIFT Method
- Scale Invariant Feature Transform (SIFT)
- Staged filtering approach
- Identifies stable points (image "keys")
- Computation time less than 2 secs

SIFT Method (2)
- Local features:
  - Invariant to image translation, scaling, rotation
  - Partially invariant to illumination changes and 3D projection (up to 20° of rotation)
  - Minimally affected by noise
  - Properties similar to those of neurons in the Inferior Temporal cortex, which are used for object recognition in primate vision

First stage
- Input: original image (512 x 512 pixels)
- Goal: key localization and image description
- Output: SIFT keys
  - Feature vector describing the local image region, sampled relative to its scale-space coordinate frame

First stage (2)
- Description:
  - Represents blurred image gradient locations in multiple orientation planes and at multiple scales
  - Approach based on a model of cells in the cerebral cortex of mammalian vision
  - Less than 1 sec of computation time
- Build a pyramid of images
  - Images are difference-of-Gaussian (DOG) functions
  - Resampling between each level

Key localization
- Algorithm:
  - Expand the original image by a factor of 2 using bilinear interpolation
  - For each pyramid level:
    1. Smooth the input image through a convolution with the 1D Gaussian function (horizontal direction)
       $g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / (2\sigma^2)}$ with $\sigma = \sqrt{2}$, obtaining Image A

Key localization (2)
    2. Smooth Image A through a further convolution with the 1D Gaussian function (vertical direction), obtaining Image B
    3. The DOG image of this level is B-A
    4. Resample Image B using bilinear interpolation with pixel spacing 1.5 in each direction and use the result as the Input Image of the new pyramid level
       - Each new sample is a constant linear combination of 4 adjacent pixels
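Steps 1 to 4 can be sketched for a single pyramid level as follows. This is a minimal pure-Python illustration, not the original implementation: the default sigma of sqrt(2), the kernel radius of 3 sigma, and the border clamping are assumptions, and all function names are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Sampled 1D Gaussian, normalised to sum to 1 (radius of 3*sigma is an assumption)."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(image, kernel):
    """Convolve every row with a 1D kernel; indices are clamped at the borders (assumption)."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + d - radius, 0), width - 1)] for d, k in enumerate(kernel))
             for x in range(width)]
            for row in image]

def transpose(image):
    return [list(column) for column in zip(*image)]

def resample_bilinear(image, spacing=1.5):
    """Sample the image every `spacing` pixels; each sample blends 4 adjacent pixels."""
    h, w = len(image), len(image[0])
    new_h = int((h - 1) / spacing) + 1
    new_w = int((w - 1) / spacing) + 1
    out = []
    for i in range(new_h):
        y = i * spacing
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = j * spacing
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
            bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out

def pyramid_level(image, sigma=math.sqrt(2)):
    """Return (DOG image, input image of the next level) for one pyramid level."""
    kernel = gaussian_kernel(sigma)
    a = convolve_rows(image, kernel)                    # step 1: horizontal smoothing
    b = transpose(convolve_rows(transpose(a), kernel))  # step 2: vertical smoothing
    dog = [[pb - pa for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]  # step 3: B - A
    return dog, resample_bilinear(b)                    # step 4: resample B
```

Calling `pyramid_level` repeatedly on the returned next-level image builds the whole pyramid; a flat image produces an all-zero DOG image, since smoothing leaves it unchanged.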

Key localization (3)
- Find maxima and minima of the DOG images
[Figure: pixel neighbourhoods at the 1st and 2nd pyramid levels]
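The extrema search can be sketched as below. The slide does not spell out the test, so the details are assumptions: a pixel counts as an extremum when it is strictly above or below its 8 neighbours, and the cross-level check (following the description in Lowe, 1999) compares it with the closest pixel at the next level, located through the 1.5 pixel spacing. Function names are hypothetical.

```python
def neighbours(image, i, j):
    """Values of the (up to 8) pixels surrounding (i, j)."""
    h, w = len(image), len(image[0])
    return [image[y][x]
            for y in range(max(i - 1, 0), min(i + 2, h))
            for x in range(max(j - 1, 0), min(j + 2, w))
            if (y, x) != (i, j)]

def local_extrema(dog):
    """Interior pixels strictly greater or strictly smaller than all 8 neighbours."""
    found = []
    for i in range(1, len(dog) - 1):
        for j in range(1, len(dog[0]) - 1):
            around = neighbours(dog, i, j)
            value = dog[i][j]
            if value > max(around) or value < min(around):
                found.append((i, j))
    return found

def survives_next_level(dog, next_dog, i, j, spacing=1.5):
    """Keep an extremum only if it also beats the closest pixel of the next level
    and that pixel's neighbours (assumed reading of the two-level figure)."""
    value = dog[i][j]
    ni = min(int(round(i / spacing)), len(next_dog) - 1)
    nj = min(int(round(j / spacing)), len(next_dog[0]) - 1)
    block = neighbours(next_dog, ni, nj) + [next_dog[ni][nj]]
    is_maximum = value > max(neighbours(dog, i, j))
    return value > max(block) if is_maximum else value < min(block)
```

Checking the same-level neighbours first keeps the cost low, because most pixels are rejected before the next level is ever consulted.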

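Key orientation
1. Extract image gradients and orientation at each pyramid level. For each pixel $A_{ij}$ compute
   - Image gradient magnitude: $M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}$
   - Image gradient orientation: $R_{ij} = \operatorname{atan2}(A_{ij} - A_{i+1,j},\ A_{i,j+1} - A_{ij})$
2. $M_{ij}$ is thresholded at a value of 0.1 times the maximum possible gradient value
   - Provides robustness to illumination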
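A minimal sketch of the two steps above, using the pixel-difference formulas from Lowe (1999). Two points are assumptions: "thresholded" is read as clipping the magnitude at the cap, and the maximum possible gradient is taken as sqrt(2), which holds for intensities in [0, 1]. The function name is hypothetical.

```python
import math

def gradient_maps(a, max_gradient=math.sqrt(2.0), fraction=0.1):
    """Return (magnitude, orientation) maps for a smoothed image `a`.

    Magnitudes are clipped at fraction * max_gradient (assumed meaning of
    "thresholded"); orientations are in radians in [-pi, pi].
    """
    cap = fraction * max_gradient
    rows, cols = len(a) - 1, len(a[0]) - 1
    magnitude = [[0.0] * cols for _ in range(rows)]
    orientation = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            down = a[i][j] - a[i + 1][j]    # A_ij - A_(i+1)j
            right = a[i][j + 1] - a[i][j]   # A_i(j+1) - A_ij
            magnitude[i][j] = min(math.hypot(down, right), cap)
            orientation[i][j] = math.atan2(down, right)
    return magnitude, orientation
```

Clipping limits the influence of very strong edges, whose magnitude changes a lot with lighting, while the orientation map is left untouched.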

Key orientation (2)
3. Create an orientation histogram using a circular Gaussian-weighted window with σ = 3 times the current smoothing scale
   - The weights are multiplied by $M_{ij}$
   - The histogram is smoothed prior to peak selection
   - The orientation is determined by the peak in the histogram
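Step 3 can be sketched as follows. The number of bins (36), the window radius (3 sigma) and the [0.25, 0.5, 0.25] smoothing kernel are assumptions chosen for illustration; the slide fixes only the window sigma and the weighting by the gradient magnitude. The function name is hypothetical.

```python
import math

def dominant_orientation(magnitude, orientation, ci, cj, smoothing_sigma, bins=36):
    """Return the centre (in radians) of the peak bin of a smoothed,
    Gaussian- and magnitude-weighted orientation histogram around (ci, cj)."""
    sigma = 3.0 * smoothing_sigma
    radius = int(math.ceil(3 * sigma))
    hist = [0.0] * bins
    for i in range(max(ci - radius, 0), min(ci + radius + 1, len(magnitude))):
        for j in range(max(cj - radius, 0), min(cj + radius + 1, len(magnitude[0]))):
            dist2 = (i - ci) ** 2 + (j - cj) ** 2
            weight = math.exp(-dist2 / (2 * sigma * sigma)) * magnitude[i][j]
            b = int((orientation[i][j] + math.pi) / (2 * math.pi) * bins) % bins
            hist[b] += weight
    # Circular smoothing before peak selection (hist[b - 1] wraps around for b = 0).
    smoothed = [0.25 * hist[b - 1] + 0.5 * hist[b] + 0.25 * hist[(b + 1) % bins]
                for b in range(bins)]
    peak = max(range(bins), key=smoothed.__getitem__)
    return -math.pi + (peak + 0.5) * 2 * math.pi / bins
```

The returned angle is quantised to the bin width (10° with 36 bins), so it can differ from the true dominant orientation by up to half a bin.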

Experimental results
- Original image
- Keys on image after rotation (15°), scaling (90%), horizontal stretching (110%), change of brightness (-10%) and contrast (90%), and addition of pixel noise
- 78% of the keys matched

Experimental results (2)

Image transformation        Location and scale match    Orientation match
Decrease contrast by 1.2    89.0 %                      86.6 %
Decrease intensity by 0.2   88.5 %                      85.9 %
Rotate by 20°               85.4 %                      81.0 %
Scale by 0.7                85.1 %                      80.3 %
Stretch by 1.2              83.5 %                      76.1 %
Stretch by 1.5              77.7 %                      65.0 %
Add 10% pixel noise         90.3 %                      88.4 %
All previous                78.6 %                      71.8 %

- 20 different images, around 15,000 keys

Image description
- Approach suggested by the response properties of complex neurons in the visual cortex
  - A feature position is allowed to vary over a small region, while orientation and spatial frequency are maintained
- Image described through 8 orientation planes
- Keys inserted according to their orientations

Second stage
- Goal: identify candidate object matches
- The best candidate match is the nearest neighbour (i.e., minimum Euclidean distance between descriptor vectors)
- The exact solution for high-dimensional vectors is known to have high complexity
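For reference, the exact nearest-neighbour search mentioned above can be sketched as a linear scan. This is only an illustrative baseline with a hypothetical function name; its cost grows with the number of stored keys times the descriptor length, which is why an approximate method is preferred.

```python
import math

def nearest_neighbour(query, database):
    """Return (index, distance) of the stored descriptor closest to `query`
    in Euclidean distance, by exhaustive comparison."""
    best_index, best_sq = None, float("inf")
    for index, descriptor in enumerate(database):
        sq = sum((q - d) ** 2 for q, d in zip(query, descriptor))
        if sq < best_sq:
            best_index, best_sq = index, sq
    return best_index, math.sqrt(best_sq)
```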

Second stage (2)
- Algorithm: approximate Best-Bin-First (BBF) search method (Beis and Lowe)
  - Modification of the k-d tree algorithm
  - Identifies the nearest neighbours with high probability and small computation
- The keys generated at the larger scale are given twice the weight of those at the smaller scale
  - Improves recognition by giving more weight to the least-noisy scale
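The idea behind BBF can be sketched as a k-d tree whose unexplored branches ("bins") are kept in a priority queue ordered by their distance from the query, with a cap on how many stored keys are examined. This is a simplified illustration rather than Beis and Lowe's implementation: the tree layout, the `max_checks` parameter and the function names are all assumptions, and the scale weighting from the slide is not modelled.

```python
import heapq
import itertools

def build_kdtree(points, indices=None, depth=0):
    """Split on the median of one coordinate per level; leaves hold one index."""
    if indices is None:
        indices = list(range(len(points)))
    if len(indices) <= 1:
        return {"leaf": indices}
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {"axis": axis,
            "split": points[indices[mid]][axis],
            "left": build_kdtree(points, indices[:mid], depth + 1),
            "right": build_kdtree(points, indices[mid:], depth + 1)}

def bbf_search(tree, points, query, max_checks=200):
    """Approximate nearest neighbour: visit bins closest-first, stop after
    `max_checks` point comparisons. Returns (index, distance)."""
    best_index, best_sq = None, float("inf")
    tie_breaker = itertools.count()  # keeps the heap from ever comparing dicts
    heap = [(0.0, next(tie_breaker), tree)]  # (lower bound on squared distance, _, node)
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_sq:
            break  # no remaining bin can contain a closer point
        while "leaf" not in node:
            diff = query[node["axis"]] - node["split"]
            if diff < 0:
                near, far = node["left"], node["right"]
            else:
                near, far = node["right"], node["left"]
            # Queue the branch not taken, keyed by its distance to the query.
            heapq.heappush(heap, (max(bound, diff * diff), next(tie_breaker), far))
            node = near
        for i in node["leaf"]:
            sq = sum((q - p) ** 2 for q, p in zip(query, points[i]))
            checks += 1
            if sq < best_sq:
                best_index, best_sq = i, sq
    return best_index, best_sq ** 0.5
```

With a very large `max_checks` the search returns the exact nearest distance; lowering it trades accuracy for speed, which is the point of the approximation.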

Third stage
- Description: final verification
- Algorithm: low-residual least-squares fit
  - Solution of a linear system: $x = [A^T A]^{-1} A^T b$
- When at least 3 keys agree with low residual, there is strong evidence for the presence of the object
  - Since there are dozens of keys in the image, this also works with partial occlusion
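The least-squares step can be sketched as below. The choice of unknowns is an assumption taken from Lowe (1999): an affine transform with parameters (m1, m2, m3, m4, tx, ty) mapping a model point (x, y) to the image point (m1 x + m2 y + tx, m3 x + m4 y + ty), so that 3 matched keys supply the 6 equations needed. The normal equations are solved by Gauss-Jordan elimination to keep the example free of dependencies; function names are hypothetical.

```python
def solve_least_squares(a, b):
    """Solve (A^T A) x = A^T b, i.e. x = [A^T A]^-1 A^T b, by Gauss-Jordan elimination."""
    n = len(a[0])
    ata = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    atb = [sum(row[i] * bi for row, bi in zip(a, b)) for i in range(n)]
    aug = [ata[i] + [atb[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration: A^T A is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def fit_affine(model_points, image_points):
    """Least-squares affine parameters [m1, m2, m3, m4, tx, ty] from matched points."""
    a, b = [], []
    for (x, y), (u, v) in zip(model_points, image_points):
        a.append([x, y, 0.0, 0.0, 1.0, 0.0])
        b.append(u)
        a.append([0.0, 0.0, x, y, 0.0, 1.0])
        b.append(v)
    return solve_least_squares(a, b)

def residual(params, model_point, image_point):
    """Distance between the predicted and the observed image position of one key."""
    m1, m2, m3, m4, tx, ty = params
    x, y = model_point
    u, v = image_point
    return ((m1 * x + m2 * y + tx - u) ** 2 + (m3 * x + m4 * y + ty - v) ** 2) ** 0.5
```

Keys whose `residual` is small agree with the fitted pose; collinear or fewer than 3 matches make the system singular, which the solver reports as an error.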

Perspective projection

Partial occlusion
- Computation time: 1.5 secs on Sun Sparc 10 (0.9 secs first stage)

Connections to human vision
- Performance of human vision is obviously far superior to that of current computer vision...
- The brain uses a highly computation-intensive parallel process instead of a staged filtering approach

Connections to human vision (2)
- However... the results are much the same
- Recent research in neuroscience showed that the neurons of the Inferior Temporal cortex
  - Recognize shape features
  - The complexity of the features is roughly the same as for SIFT
  - They also recognize color and texture properties in addition to shape
- Further research:
  - 3D structure of objects
  - Additional feature types for color and texture

Augmented Reality (AR)
- Registration of virtual objects into a live video sequence
- Current AR systems:
  - Rely on markers strategically placed in the environment
  - Need manual camera calibration

Related work
- Harris corner detector and Kanade-Lucas-Tomasi (KLT) tracker
  - Not enough feature invariance
- Tracking of parallelogram-shaped and elliptical image regions
  - Requires planar structures in the viewed scene
- Pre-built user-supplied CAD object models
  - Not always available
  - Limited to objects that can be easily modelled
- Off-line batch processing of the entire video

AR using SIFT
- Flexible automated AR
- Not needed:
  - Camera pre-calibration
  - Prior knowledge of scene geometry
  - Manual initialization of the tracker
  - Placement of special markers
  - Special tools or equipment (just a camera)
- Short time and small effort to set up
- Robust
- 6 degrees of freedom

AR using SIFT (2)
- Needs only a set of reference images
  - Acquired from unknown, spatially separated viewpoints by a handheld, uncalibrated camera
  - At least two images
  - 5 to 20 images separated by at most 45°
- Used to build a 3D model of the viewed scene

AR using SIFT (3)
- First (off-line) stage:
  1. Extract SIFT features from reference images
  2. Establish multi-view correspondences
  3. Build a metric model of the real world
  4. Compute calibration parameters and camera poses
  5. The user places the virtual object
     - The placement is achieved by anchoring the object projection in the first image
     - Then, a second projection is adjusted in the second image
     - Finally, the user fine-tunes position, orientation and size

AR using SIFT (4)
- Second (on-line) stage:
  1. Features are detected in the current frame
  2. Features are matched to those of the model using the BBF algorithm
  3. The matches are used to compute the current pose of the camera
  4. The solution is stabilized by using the values computed for the previous frame

AR using SIFT: prototype
- Software:
  - C programming language
  - OpenGL and GLUT libraries
- Hardware:
  - IBM ThinkPad, Pentium 4-M processor (1.8 GHz)
  - Logitech QuickCam Pro 4000 camera

Operation                  Computation time
Feature extraction         150 msec
Feature matching           40 msec
Camera pose computation    25 msec

- 4 FPS

AR using SIFT: drawbacks
- The tracker is very slow
  - 4 FPS (Frames Per Second)
  - Too slow for real-time operation (25 FPS)
  - The main bottleneck is feature extraction
- Unable to handle occlusion of inserted virtual content by real objects
  - A full model of the observed scene is required
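The frame-rate figures follow from the per-stage times reported for the prototype. The short calculation below is only a worked check of that arithmetic; it assumes the three stages run one after another for each frame, and the variable names are illustrative.

```python
# Per-frame stage times reported for the prototype, in milliseconds.
stage_ms = {
    "feature extraction": 150,
    "feature matching": 40,
    "camera pose computation": 25,
}

total_ms = sum(stage_ms.values())        # time spent on one frame
fps = 1000.0 / total_ms                  # achievable frame rate

target_fps = 25                          # real-time rate quoted on the slide
budget_ms = 1000.0 / target_fps          # time available per frame at that rate
speedup_needed = total_ms / budget_ms    # overall speed-up required

bottleneck = max(stage_ms, key=stage_ms.get)
print(f"{total_ms} ms per frame, {fps:.2f} FPS, "
      f"speed-up needed {speedup_needed:.2f}x, bottleneck: {bottleneck}")
```

The total of 215 msec per frame gives between 4 and 5 frames per second, consistent with the reported 4 FPS, and feature extraction alone already exceeds the 40 msec budget of a 25 FPS system.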

AR using SIFT: examples
- Videos: mug, tabletop

Conclusions
- Object recognition using SIFT
  - Reliable recognition
  - Several characteristics in common with human vision
- Augmented reality using SIFT
  - Very flexible
  - Not possible in real time due to the high computation times
  - Possible in the future using faster processors

References
- David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157
- Stephen Se, David G. Lowe and Jim Little, "Vision-based mobile robot localization and mapping using scale-invariant features", Proceedings of IEEE International Conference on Robotics and Automation, Seoul, Korea (May 2001), pp. 2051-2058
- Iryna Gordon and David G. Lowe, "Scene modelling, recognition and tracking with invariant image features", International Symposium on Mixed and Augmented Reality (ISMAR), Arlington, VA (Nov. 2004), pp. 110-119

For any question...
David Lowe
Computer Science Department
2366 Main Mall
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
E-mail: lowe@cs.ubc.ca
