1
Object Recognition using Local Invariant Features
Claudio Scordino
scordino@di.unipi.it
July 5th, 2006
2
Object Recognition
- Widely used in industry for:
  - Inspection
  - Registration
  - Manipulation
  - Robot localization and mapping
- Current commercial systems:
  - Correlation-based template matching
  - Computationally infeasible when object rotation, scale, illumination and 3D pose vary
  - Even more infeasible with partial occlusion
- Alternative: local image features
3
Local Image Features
- Unaffected by:
  - Nearby clutter
  - Partial occlusion
- Invariant to:
  - Illumination
  - 3D projective transforms
  - Common object variations
- ...but, at the same time, sufficiently distinctive to identify specific objects among many alternatives!
4
Related work
- Grouping of line segments, edges and regions
  - Detection not good enough for reliable recognition
- Peak detection in local image variations
  - Example: Harris corner detector
  - Drawback: image examined at only a single scale
  - Different key locations as the image scale changes
- Eigenspace matching, color and receptive field histograms
  - Successful on isolated objects
  - Do not extend to cluttered and partially occluded images
5
SIFT Method
- Scale Invariant Feature Transform (SIFT)
- Staged filtering approach
- Identifies stable points (image "keys")
- Computation time under 2 seconds
6
SIFT Method (2)
- Local features:
  - Invariant to image translation, scaling and rotation
  - Partially invariant to illumination changes and 3D projection (up to 20° of rotation)
  - Minimally affected by noise
- Share properties with the neurons in the inferior temporal cortex that are used for object recognition in primate vision
7
First stage
- Input: original image (512 x 512 pixels)
- Goal: key localization and image description
- Output: SIFT keys
  - Feature vector describing the local image region, sampled relative to its scale-space coordinate frame
8
First stage (2)
- Description:
  - Represents blurred image gradient locations in multiple orientation planes and at multiple scales
  - Approach based on a model of cells in the cerebral cortex of mammalian vision
  - Less than 1 second of computation time
- Build a pyramid of images
  - Images are difference-of-Gaussian (DOG) functions
  - Resampling between each level
9
Key localization
Algorithm:
- Expand the original image by a factor of 2 using bilinear interpolation
- For each pyramid level:
  1. Smooth the input image through a convolution with the 1D Gaussian function (horizontal direction)
     $g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / 2\sigma^2}$ with $\sigma = \sqrt{2}$, obtaining Image A
10
Key localization (2)
  2. Smooth Image A through a further convolution with the 1D Gaussian function (vertical direction), obtaining Image B
  3. The DOG image of this level is B-A
  4. Resample Image B using bilinear interpolation with a pixel spacing of 1.5 in each direction, and use the result as the input image of the new pyramid level
     - Each new sample is a constant linear combination of 4 adjacent pixels
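The per-level steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the presenter's or Lowe's implementation: it assumes that each smoothing step applies the 1D kernel along both axes (a horizontal pass followed by a vertical pass, which is how a separable 2D Gaussian is usually computed), that σ = √2, that the kernel is truncated at about 3σ, and that border pixels are clamped. The helper names (`gaussian_kernel`, `smooth`, `resample`, `dog_level`) are hypothetical. The sign convention B - A is the one given on the slide; since both maxima and minima are searched afterwards, the sign does not change which locations are found.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel, truncated at about 3 sigma (assumption)."""
    radius = int(math.ceil(3 * sigma))
    raw = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(raw)
    return [v / total for v in raw]

def convolve_rows(img, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    r = len(kernel) // 2
    w = len(img[0])
    return [[sum(k * row[min(max(x + d - r, 0), w - 1)] for d, k in enumerate(kernel))
             for x in range(w)] for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def smooth(img, sigma):
    """Separable smoothing: horizontal 1D pass, then vertical 1D pass."""
    kernel = gaussian_kernel(sigma)
    horizontal = convolve_rows(img, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))

def resample(img, spacing=1.5):
    """Bilinear resampling: each output pixel mixes up to 4 adjacent input pixels."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(int((h - 1) / spacing) + 1):
        y = i * spacing
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(int((w - 1) / spacing) + 1):
            x = j * spacing
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
            bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out

def dog_level(img, sigma=math.sqrt(2.0)):
    """Return (DOG image, input image for the next pyramid level)."""
    a = smooth(img, sigma)   # Image A
    b = smooth(a, sigma)     # Image B: A smoothed once more
    dog = [[pb - pa for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]
    return dog, resample(b)
```

Calling `dog_level` repeatedly on its own second return value would produce successive pyramid levels, each 1.5 times coarser than the previous one.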
11
Key localization (3)
- Find maxima and minima of the DOG images
[Figure: DOG images at the 1st and 2nd pyramid levels]
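As an illustration of the extrema search, the sketch below marks a pixel as a key candidate when it is strictly larger or strictly smaller than all of its 8 neighbours within one DOG image. The 8-neighbour rule is an assumption (the slide does not state the neighbourhood), the function name `local_extrema` is hypothetical, and any comparison across pyramid levels is deliberately left out.

```python
def local_extrema(dog):
    """Return (row, col, kind) for strict extrema among the 8 neighbours.

    `dog` is a 2D list of floats; border pixels are skipped because they
    do not have a full neighbourhood.
    """
    h, w = len(dog), len(dog[0])
    found = []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            centre = dog[i][j]
            neighbours = [dog[i + di][j + dj]
                          for di in (-1, 0, 1) for dj in (-1, 0, 1)
                          if (di, dj) != (0, 0)]
            if centre > max(neighbours):
                found.append((i, j, "max"))
            elif centre < min(neighbours):
                found.append((i, j, "min"))
    return found
```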
12
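Key orientation
1. Extract image gradients and orientations at each pyramid level. For each pixel $A_{ij}$ compute:
   - Image gradient magnitude: $M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}$
   - Image gradient orientation: $R_{ij} = \operatorname{atan2}(A_{ij} - A_{i+1,j},\ A_{i,j+1} - A_{ij})$
2. $M_{ij}$ is thresholded at a value of 0.1 times the maximum possible gradient value
   - Provides robustness to illumination changes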
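A minimal sketch of steps 1 and 2, assuming the gradient is computed from simple differences between a pixel and its lower and right neighbours (the formulation used in Lowe 1999, taken here as an assumption). Two further assumptions are labeled in the code: pixel values lie in [0, 1], so the largest possible magnitude is √2, and "thresholded" is read as capping the magnitude at 0.1 times that maximum rather than discarding pixels. The function name `gradients` is hypothetical.

```python
import math

def gradients(a, max_gradient=math.sqrt(2.0), cap_fraction=0.1):
    """Return (magnitude, orientation) images for a smoothed image `a`.

    Assumptions: pixel values in [0, 1] (hence max_gradient = sqrt(2)),
    and thresholding means capping the magnitude at cap_fraction * max_gradient.
    The outputs are one row and one column smaller than the input.
    """
    cap = cap_fraction * max_gradient
    h, w = len(a), len(a[0])
    mag = [[0.0] * (w - 1) for _ in range(h - 1)]
    ori = [[0.0] * (w - 1) for _ in range(h - 1)]
    for i in range(h - 1):
        for j in range(w - 1):
            d_row = a[i][j] - a[i + 1][j]      # difference with the pixel below
            d_col = a[i][j + 1] - a[i][j]      # difference with the pixel to the right
            mag[i][j] = min(math.hypot(d_row, d_col), cap)
            ori[i][j] = math.atan2(d_row, d_col)
    return mag, ori
```

Capping keeps strong edges from dominating the later orientation histogram, which is one plausible reading of why the slide associates this step with robustness to illumination.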
13
Key orientation (2)
3. Create an orientation histogram using a circular Gaussian-weighted window with σ equal to 3 times the current smoothing scale
   - The weights are multiplied by $M_{ij}$
   - The histogram is smoothed prior to peak selection
   - The orientation is determined by the peak in the histogram
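Step 3 might look roughly like the sketch below. It takes gradient magnitude and orientation images as inputs and returns the centre angle of the winning histogram bin. Several details are assumptions that the slide does not specify: 36 bins covering 360°, a window cut off at 2σ, and a single circular pass of a [0.25, 0.5, 0.25] filter for the histogram smoothing. The name `dominant_orientation` is hypothetical.

```python
import math

def dominant_orientation(mag, ori, centre, smoothing_scale, bins=36):
    """Return the dominant gradient orientation (radians) around `centre`.

    mag, ori: 2D lists of gradient magnitude and orientation (radians).
    The Gaussian window uses sigma = 3 * smoothing_scale, as on the slide;
    the bin count, window radius and smoothing filter are assumptions.
    """
    sigma = 3.0 * smoothing_scale
    radius = int(math.ceil(2 * sigma))
    ci, cj = centre
    hist = [0.0] * bins
    for i in range(max(ci - radius, 0), min(ci + radius + 1, len(mag))):
        for j in range(max(cj - radius, 0), min(cj + radius + 1, len(mag[0]))):
            d2 = (i - ci) ** 2 + (j - cj) ** 2
            # Gaussian window weight multiplied by the gradient magnitude.
            weight = math.exp(-d2 / (2 * sigma * sigma)) * mag[i][j]
            k = int((ori[i][j] + math.pi) / (2 * math.pi) * bins) % bins
            hist[k] += weight
    # Circular smoothing before peak selection (hist[-1] wraps around).
    smoothed = [0.25 * hist[k - 1] + 0.5 * hist[k] + 0.25 * hist[(k + 1) % bins]
                for k in range(bins)]
    peak = max(range(bins), key=lambda k: smoothed[k])
    return -math.pi + (peak + 0.5) * 2 * math.pi / bins
```

The returned angle is quantized to the bin width (10° with 36 bins); a finer estimate would need interpolation around the peak, which is not attempted here.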
14
Experimental results
[Figure: original image]
[Figure: keys on the image after rotation (15°), scaling (90%), horizontal stretching (110%), change of brightness (-10%) and contrast (90%), and addition of pixel noise]
- 78% of the keys matched
15
Experimental results (2)

Image transformation        Location and scale match    Orientation match
Decrease contrast by 1.2    89.0 %                      86.6 %
Decrease intensity by 0.2   88.5 %                      85.9 %
Rotate by 20°               85.4 %                      81.0 %
Scale by 0.7                85.1 %                      80.3 %
Stretch by 1.2              83.5 %                      76.1 %
Stretch by 1.5              77.7 %                      65.0 %
Add 10% pixel noise         90.3 %                      88.4 %
All previous                78.6 %                      71.8 %

- 20 different images, around 15,000 keys
16
Image description
- Approach suggested by the response properties of complex neurons in the visual cortex
  - A feature position is allowed to vary over a small region, while orientation and spatial frequency are maintained
- Image described through 8 orientation planes
- Keys inserted according to their orientations
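One way to picture the orientation planes is the sketch below, which accumulates gradient magnitudes from a square patch into a coarse spatial grid with 8 orientation planes per cell. Coarse cells are what let a feature shift slightly without changing the descriptor. The 4 x 4 grid, the hard (non-interpolated) binning, measuring orientations relative to the key orientation, and the name `orientation_planes` are all assumptions for illustration, not details given on the slide.

```python
import math

def orientation_planes(mag, ori, key_orientation, grid=4, planes=8):
    """Build a descriptor vector of length grid * grid * planes.

    mag, ori: square 2D lists (gradient magnitude, orientation in radians)
    for the patch around a key. Each pixel adds its magnitude to the cell
    it falls in and to the plane matching its orientation relative to the key.
    """
    cell = len(mag) // grid              # pixels per grid cell along one axis
    vec = [0.0] * (grid * grid * planes)
    for i in range(cell * grid):
        for j in range(cell * grid):
            rel = (ori[i][j] - key_orientation) % (2 * math.pi)
            plane = int(rel / (2 * math.pi) * planes) % planes
            idx = ((i // cell) * grid + (j // cell)) * planes + plane
            vec[idx] += mag[i][j]
    return vec
```

With the default parameters an 8 x 8 patch yields a 128-element vector; moving a gradient by one pixel inside the same 2 x 2 cell leaves the vector unchanged.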
17
Second stage
- Goal: identify candidate object matches
- The best candidate match is the nearest neighbour (i.e., minimum Euclidean distance between descriptor vectors)
- The exact solution for high-dimensional vectors is known to have high complexity
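For reference, the exact nearest-neighbour search can be written as a linear scan over all stored descriptors, as in this small sketch. The function name and the `(label, vector)` layout of the database are hypothetical. Every query touches every stored vector, which is the cost that motivates an approximate method.

```python
import math

def nearest_neighbour(query, database):
    """Exact nearest neighbour by linear scan.

    database: list of (label, vector) pairs; vectors have the same length
    as `query`. Returns (label, Euclidean distance) of the closest entry.
    """
    best_label, best_dist = None, math.inf
    for label, vector in database:
        dist = math.dist(query, vector)   # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```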
18
Second stage (2)
- Algorithm: approximate Best-Bin-First (BBF) search method (Beis and Lowe)
  - Modification of the k-d tree algorithm
  - Identifies the nearest neighbours with high probability and little computation
- The keys generated at the larger scale are given twice the weight of those at the smaller scale
  - Improves recognition by giving more weight to the least noisy scale
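The idea behind a best-bin-first search can be sketched as a k-d tree traversal in which unexplored branches are kept in a priority queue ordered by their distance to the query, and the search stops after a fixed number of point comparisons. This is a simplified illustration in the spirit of BBF, not Beis and Lowe's implementation: the tree stores one point per node, the dictionary-based node layout and the `max_checks` parameter are assumptions, and the scale-dependent key weighting from the slide is not modelled.

```python
import heapq

def build_kdtree(points, depth=0):
    """points: list of (vector, label) pairs. Returns a nested dict or None."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bbf_nearest(root, query, max_checks=200):
    """Approximate nearest neighbour; exact when max_checks is large enough.

    Returns ((vector, label), squared distance), or (None, inf) for an empty tree.
    """
    best, best_d = None, float("inf")
    counter = 0                      # tie-breaker so the heap never compares nodes
    heap = [(0.0, counter, root)] if root is not None else []
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d:
            break                    # no remaining branch can hold a closer point
        while node is not None:
            vec, _label = node["point"]
            d = sq_dist(vec, query)
            checks += 1
            if d < best_d:
                best, best_d = node["point"], d
            diff = query[node["axis"]] - vec[node["axis"]]
            near, far = ((node["left"], node["right"]) if diff < 0
                         else (node["right"], node["left"]))
            if far is not None:
                # Squared distance to the splitting plane bounds the far side.
                counter += 1
                heapq.heappush(heap, (diff * diff, counter, far))
            node = near
    return best, best_d
```

Lowering `max_checks` trades accuracy for speed: the closest branches are examined first, so the true nearest neighbour is usually found early even when the search is cut short.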
19
Third stage
- Description: final verification
- Algorithm: low-residual least-squares fit
  - Solution of a linear system: $x = [A^T A]^{-1} A^T b$
- When at least 3 keys agree with low residual, there is strong evidence for the presence of the object
  - Since there are dozens of keys in the image, this also works with partial occlusion
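To make the least-squares step concrete, the sketch below fits a 2D affine transform to matched point pairs by forming and solving the normal equations $(A^T A)\,x = A^T b$. The affine parameterization $x = (m_1, m_2, m_3, m_4, t_x, t_y)$, with $u = m_1 x + m_2 y + t_x$ and $v = m_3 x + m_4 y + t_y$, is an assumption about what the linear system contains; it has 6 unknowns, which is consistent with needing at least 3 matched keys (2 equations each). The function names are hypothetical, and the elimination routine is a plain Gauss-Jordan solver written out only to keep the example self-contained.

```python
def solve_normal_equations(A, b):
    """Solve (A^T A) x = A^T b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A[0])
    ata = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(n)]
    aug = [ata[i] + [atb[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate match configuration")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_affine(matches):
    """matches: list of ((x, y), (u, v)) model-to-image point pairs.

    Returns [m1, m2, m3, m4, tx, ty] minimizing the squared residual.
    """
    A, b = [], []
    for (x, y), (u, v) in matches:
        A.append([x, y, 0, 0, 1, 0])
        b.append(u)
        A.append([0, 0, x, y, 0, 1])
        b.append(v)
    return solve_normal_equations(A, b)

def residual(params, matches):
    """Root-mean-square distance between predicted and observed image points."""
    m1, m2, m3, m4, tx, ty = params
    total = 0.0
    for (x, y), (u, v) in matches:
        total += (m1 * x + m2 * y + tx - u) ** 2 + (m3 * x + m4 * y + ty - v) ** 2
    return (total / len(matches)) ** 0.5
```

A candidate object hypothesis would be accepted when `residual` stays small for the agreeing keys; the acceptance threshold is application-specific and is not given on the slide.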
20
Perspective projection
21
Partial occlusion
- Computation time: 1.5 seconds on a Sun Sparc 10 (0.9 seconds for the first stage)
22
Connections to human vision
- The performance of human vision is obviously far superior to that of current computer vision...
- The brain uses a highly computation-intensive parallel process instead of a staged filtering approach
23
Connections to human vision
- However... the results are much the same
- Recent research in neuroscience showed that the neurons of the inferior temporal cortex:
  - Recognize shape features
  - The complexity of the features is roughly the same as for SIFT
  - They also recognize color and texture properties in addition to shape
- Further research:
  - 3D structure of objects
  - Additional feature types for color and texture
24
Augmented Reality (AR)
- Registration of virtual objects into a live video sequence
- Current AR systems:
  - Rely on markers strategically placed in the environment
  - Need manual camera calibration
25
Related work
- Harris corner detector and Kanade-Lucas-Tomasi (KLT) tracker
  - Not enough feature invariance
- Tracking of parallelogram-shaped and elliptical image regions
  - Requires planar structures in the viewed scene
- Pre-built user-supplied CAD object models
  - Not always available
  - Limited to objects that can be easily modelled
- Off-line batch processing of the entire video
26
AR using SIFT
- Flexible automated AR
- Not needed:
  - Camera pre-calibration
  - Prior knowledge of scene geometry
  - Manual initialization of the tracker
  - Placement of special markers
  - Special tools or equipment (just a camera)
- Short time and small effort to set up
- Robust
- 6 degrees of freedom
27
AR using SIFT (2)
- Needs only a set of reference images taken by a handheld uncalibrated camera from arbitrary viewpoints
  - Acquired from unknown, spatially separated viewpoints by a handheld camera
  - At least two images
  - 5 to 20 images separated by at most 45°
  - Used to build a 3D model of the viewed scene
28
AR using SIFT (3)
First (off-line) stage:
1. Extract SIFT features from the reference images
2. Establish multi-view correspondences
3. Build a metric model of the real world
4. Compute calibration parameters and camera poses
5. The user places the virtual object
   - The placement is achieved by anchoring the object projection in the first image
   - Then, a second projection is adjusted in the second image
   - Finally, the user fine-tunes position, orientation and size
29
AR using SIFT (4)
Second (on-line) stage:
1. Features are detected in the current frame
2. Features are matched to those of the model using the BBF algorithm
3. The matches are used to compute the current pose of the camera
4. The solution is stabilized by using the values computed for the previous frame
30
AR using SIFT: prototype
- Software:
  - C programming language
  - OpenGL and GLUT libraries
- Hardware:
  - IBM ThinkPad, Pentium 4-M processor (1.8 GHz)
  - Logitech QuickCam Pro 4000 camera

Operation                  Computation time
Feature extraction         150 msec
Feature matching           40 msec
Camera pose computation    25 msec

- 4 FPS
31
AR using SIFT: drawbacks
- The tracker is very slow
  - 4 FPS (frames per second)
  - Too slow for real-time operation (25 FPS)
  - The main bottleneck is feature extraction
- Unable to handle occlusion of inserted virtual content by real objects
  - A full model of the observed scene is required
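The frame-rate figures can be cross-checked with a few lines of arithmetic on the per-stage timings reported for the prototype (150, 40 and 25 msec). The sketch assumes that the three stages run one after another and that nothing else contributes to the frame time; capture and rendering overhead, which the slides do not quantify, would plausibly account for the gap between the computed rate and the reported 4 FPS.

```python
# Per-stage timings reported for the prototype (milliseconds per frame).
stage_ms = {
    "feature extraction": 150,
    "feature matching": 40,
    "camera pose computation": 25,
}
REAL_TIME_FPS = 25  # real-time target quoted on the drawbacks slide

total_ms = sum(stage_ms.values())            # time per frame for the three stages
fps = 1000 / total_ms                        # achievable frame rate, stages only
bottleneck = max(stage_ms, key=stage_ms.get) # slowest stage
share = stage_ms[bottleneck] / total_ms      # its fraction of the frame time
budget_ms = 1000 / REAL_TIME_FPS             # time available per frame at 25 FPS
speedup_needed = total_ms / budget_ms        # overall speed-up required

print(f"total: {total_ms} ms per frame, about {fps:.2f} FPS")
print(f"bottleneck: {bottleneck} ({share:.0%} of the frame time)")
print(f"budget at {REAL_TIME_FPS} FPS: {budget_ms:.0f} ms, speed-up needed: {speedup_needed:.2f}x")
```

Under these assumptions feature extraction alone takes far longer than the whole 40 ms budget of a 25 FPS frame, so speeding up matching or pose computation would not be enough on its own.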
32
AR using SIFT: examples
- Videos: mug, tabletop
33
Conclusions
- Object recognition using SIFT
  - Reliable recognition
  - Several characteristics in common with human vision
- Augmented reality using SIFT
  - Very flexible
  - Not possible in real time due to the high computation times
  - May become possible in the future with faster processors
34
References
- David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157
- Stephen Se, David G. Lowe and Jim Little, "Vision-based mobile robot localization and mapping using scale-invariant features", Proceedings of IEEE International Conference on Robotics and Automation, Seoul, Korea (May 2001), pp. 2051-2058
- Iryna Gordon and David G. Lowe, "Scene modelling, recognition and tracking with invariant image features", International Symposium on Mixed and Augmented Reality (ISMAR), Arlington, VA (Nov. 2004), pp. 110-119
35
For any question...
David Lowe
Computer Science Department
2366 Main Mall
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
E-mail: lowe@cs.ubc.ca