17 So, what is SIFT?
Scale-Invariant Feature Transform (David Lowe, UBC)
Scale/rotation invariant
Currently the best-known feature descriptor
Many real-world applications: object recognition, panorama stitching, robot localization, video indexing, …
19 SIFT properties
Locality: features are local, so robust to occlusion and clutter
Distinctiveness: individual features can be matched to a large database of objects
Quantity: many features can be generated for even small objects
Efficiency: close to real-time performance
20 SIFT algorithm overview
Feature detection: detect points that can be repeatably selected under location/scale change
Feature description: assign an orientation to detected feature points; construct a descriptor for the image patch around each feature point
Feature matching
21 1. Feature detection
Detect points stable under location/scale change
Build a continuous space (x, y, scale), approximated by a multi-scale Difference-of-Gaussian (DoG) pyramid
Select maxima/minima in (x, y, scale), as sketched below
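As a rough illustration only (not Lowe's full implementation, which also organizes scales into octaves with downsampling), a Difference-of-Gaussian stack and its scale-space extrema could be found as follows; the sigma values and the brute-force scan are purely illustrative:

```python
# Sketch: build a DoG stack and keep points that are extrema among their
# 26 neighbors in (x, y, scale). Assumes `img` is a grayscale float image.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(img, sigmas=(1.0, 1.4, 2.0, 2.8, 4.0)):
    blurred = [gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    keypoints = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[s-1:s+2, y-1:y+2, x-1:x+2]  # 3x3x3 neighborhood
                v = dog[s, y, x]
                if v == cube.max() or v == cube.min():
                    keypoints.append((x, y, sigmas[s]))  # approximate scale
    return keypoints
```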
23 1. Feature detection
Localize extrema by fitting a quadratic
Sub-pixel/sub-scale interpolation using a Taylor expansion
Take the derivative and set it to zero (see below)
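For reference, the quadratic fit this slide refers to is the Taylor expansion of the DoG function D around the sample point (as in Lowe's 2004 paper); setting its derivative to zero gives the sub-pixel/sub-scale offset:

```latex
D(\mathbf{x}) \approx D
  + \frac{\partial D}{\partial \mathbf{x}}^{\!\top}\mathbf{x}
  + \tfrac{1}{2}\,\mathbf{x}^{\top}\,\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}},
\qquad
\mathbf{x} = (x, y, \sigma)^{\top}
```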
24 1. Feature detection
Discard low-contrast and edge points
Low contrast: discard keypoints whose interpolated DoG value falls below a threshold
Edge points: high contrast in one direction, low in the other; compute principal curvatures from the eigenvalues of the 2x2 Hessian matrix and limit their ratio (sketch below)
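A minimal sketch of the edge test, following the curvature-ratio criterion from Lowe's paper (r = 10 there); the finite differences assume `dog` is a single DoG layer indexed as [row, column]:

```python
# Sketch: reject keypoints whose 2x2 Hessian of the DoG layer indicates a
# large ratio of principal curvatures (an edge rather than a corner-like blob).
def is_edge_like(dog, y, x, r=10.0):
    dxx = dog[y, x+1] - 2*dog[y, x] + dog[y, x-1]
    dyy = dog[y+1, x] - 2*dog[y, x] + dog[y-1, x]
    dxy = (dog[y+1, x+1] - dog[y+1, x-1] - dog[y-1, x+1] + dog[y-1, x-1]) / 4.0
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                      # curvatures of opposite sign: discard
        return True
    return trace**2 / det >= (r + 1)**2 / r
```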
25 1. Feature detection
Example:
(a) 233x189 image
(b) 832 DoG extrema
(c) 729 left after peak-value threshold
(d) 536 left after testing ratio of principal curvatures
26 2. Feature description
Assign an orientation to each keypoint
Create a histogram of local gradient directions computed at the selected scale
Assign the canonical orientation at the peak of the smoothed histogram (sketch below)
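A simplified sketch of the orientation assignment; unlike the real algorithm it skips Gaussian weighting of the window and detection of secondary peaks, and the window radius is an illustrative choice:

```python
# Sketch: 36-bin histogram of gradient directions around the keypoint,
# weighted by gradient magnitude; the peak bin gives the canonical orientation.
import numpy as np

def assign_orientation(img, y, x, radius=8):
    patch = img[y-radius:y+radius+1, x-radius:x+radius+1].astype(float)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360
    hist, _ = np.histogram(ang, bins=36, range=(0, 360), weights=mag)
    hist = np.convolve(hist, np.ones(3) / 3, mode='same')   # smooth the histogram
    return (np.argmax(hist) + 0.5) * 10.0                    # orientation in degrees
```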
27 2. Feature description
Construct the SIFT descriptor
Create an array of orientation histograms
8 orientations x 4x4 histogram array = 128 dimensions (sketch below)
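A correspondingly simplified sketch of descriptor construction; the real descriptor also applies Gaussian weighting, trilinear interpolation between bins, and clamping of large values, all omitted here. `patch16` is assumed to be a 16x16 patch already rotated to the keypoint's canonical orientation:

```python
# Sketch: split a 16x16 patch into a 4x4 grid of cells, histogram gradient
# directions into 8 bins per cell, and normalize the 128-D result.
import numpy as np

def sift_descriptor(patch16):
    gy, gx = np.gradient(patch16.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360
    desc = []
    for cy in range(0, 16, 4):
        for cx in range(0, 16, 4):
            h, _ = np.histogram(ang[cy:cy+4, cx:cx+4], bins=8, range=(0, 360),
                                weights=mag[cy:cy+4, cx:cx+4])
            desc.extend(h)
    desc = np.asarray(desc)
    return desc / (np.linalg.norm(desc) + 1e-12)   # 4x4 cells x 8 bins = 128-D
```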
28 2. Feature description
Advantage over simple correlation
Gradients are less sensitive to illumination change
Gradients may shift: robust to deformation and viewpoint change
29 Performance: stability to noise
Match features after a random change in image scale & orientation, with differing levels of image noise
Find the nearest neighbor in a database of 30,000 features
30 Performance: stability to affine change
Match features after a random change in image scale & orientation, with 2% image noise and affine distortion
Find the nearest neighbor in a database of 30,000 features
31 Performance: distinctiveness
Vary the size of the feature database, with a 30 degree affine change and 2% image noise
Measure % correct for the single nearest neighbor match
32 3. Feature matching
For each feature in image A, find its nearest neighbor in image B
33 3. Feature matching
Nearest neighbor search is too slow for a large database of 128-dimensional data
Approximate nearest neighbor search:
Best-bin-first [Beis et al. 97]: a modification of the k-d tree algorithm
Use a heap data structure to identify bins in order of their distance from the query point
Result: can give a speedup by a factor of 1000 while finding the nearest neighbor (of interest) 95% of the time
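For a feel of the matching step, here is a minimal sketch using SciPy's k-d tree. Note this performs exact search; best-bin-first additionally caps the number of leaf bins examined (ordered via a heap of bin distances from the query) to obtain the ~1000x speedup described above:

```python
# Sketch: index the database descriptors in a k-d tree and, for each query
# descriptor, retrieve its two nearest neighbors (used by the ratio test below).
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_query, desc_database):
    tree = cKDTree(desc_database)
    dist, idx = tree.query(desc_query, k=2)   # (N, 2) distances and indices
    return dist, idx
```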
34 3. Feature matching
Reject false matches
Compare the distance of the nearest neighbor to that of the second-nearest neighbor
Features that match many database entries are not distinctive, and are therefore rejected
A distance-ratio threshold of 0.8 provides excellent separation
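And a minimal sketch of that ratio test, applied to the (distance, index) arrays returned by a 2-nearest-neighbor query such as the one above; the 0.8 threshold is the one quoted on the slide:

```python
def ratio_test(dist, idx, ratio=0.8):
    # dist, idx: (N, 2) arrays of distances/indices of the two nearest neighbors.
    matches = []
    for i in range(len(dist)):
        if dist[i, 0] < ratio * dist[i, 1]:   # clearly better than the runner-up
            matches.append((i, int(idx[i, 0])))
    return matches
```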
35 3. Feature matching
Now, given feature matches…
Find an object in the scene
Solve for a homography (panorama)
…
36 3. Feature matching
Example: 3D object recognition
37 3. Feature matching
3D object recognition
Assume an affine transform: look for clusters of >= 3 consistent matches
Looking for 3 matches out of 3000 that agree on the same object and pose: too many outliers for RANSAC or LMS
Use the Hough transform: each match votes for a hypothesis of object ID/pose
Voting for multiple bins and a large bin size allow for error due to the similarity approximation
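A loose sketch of the pose voting (simplified: Lowe votes each match into the two nearest bins in every dimension, omitted here, and the bin sizes below are illustrative):

```python
# Sketch: each match casts a vote for a coarse (object, location, scale,
# orientation) bin; bins collecting >= 3 votes become pose hypotheses.
from collections import defaultdict
import math

def hough_pose_votes(matches, loc_bin=64.0, ori_bin=30.0):
    # Each match is (object_id, x, y, scale_ratio, orientation_diff_degrees).
    votes = defaultdict(list)
    for obj, x, y, s, ori in matches:
        key = (obj,
               round(x / loc_bin), round(y / loc_bin),
               round(math.log2(s)),                       # one bin per scale octave
               round(ori / ori_bin) % int(360 / ori_bin))
        votes[key].append((obj, x, y, s, ori))
    return {k: v for k, v in votes.items() if len(v) >= 3}
```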
38 3. Feature matching
3D object recognition: solve for pose
Affine transform of [x, y] to [u, v]
Rewrite to solve for the transform parameters (see below)
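The equations this slide refers to (reconstructed here following Lowe's 2004 formulation) map a model point $[x, y]^\top$ to an image point $[u, v]^\top$, then rearrange into a linear system in the six unknown parameters, which is solved by least squares once three or more matches are available:

```latex
\begin{bmatrix} u \\ v \end{bmatrix}
=
\begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
+
\begin{bmatrix} t_x \\ t_y \end{bmatrix}
\quad\Longrightarrow\quad
\begin{bmatrix}
x & y & 0 & 0 & 1 & 0 \\
0 & 0 & x & y & 0 & 1 \\
  &   &   & \vdots &   &
\end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix}
=
\begin{bmatrix} u \\ v \\ \vdots \end{bmatrix}
```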
39 3. Feature matching
3D object recognition: verify model
Discard outliers from the pose solution in the previous step
Perform a top-down check for additional features
Evaluate the probability that the match is correct
Use a Bayesian model, with the probability that the features would arise by chance if the object were not present
Takes into account the object size in the image, textured regions, model feature count in the database, and accuracy of fit [Lowe 01]
57 System overview
[Setup diagram: video camera, computer, user, desk]
Here is an overview of our system. In the setup, a video camera is mounted above the desk, looking straight down to record the desktop.
58 System overview
Video of desk
Given the video of the physical desktop…
59 System overview
Video of desk, Images from PDF
…and images of the corresponding electronic documents extracted from PDFs…
60 System overview
Video of desk, Images from PDF → Track & recognize
…the system tracks and recognizes the paper documents by matching between the two…
61 System overview
Video of desk, Images from PDF → Track & recognize → Internal representation
…and produces an internal graphical representation that encodes the evolution of the stack structure over time.
62 System overview
We call each of these graphs a "scene graph".
63 System overview
"Where is my W-2?"
Then, when the user issues a query, such as "Where is my W-2 form?"…
64 System overview
"Where is my W-2?" → Answer
…the system answers the query by consulting the scene graphs.
65 Assumptions
Documents: a corresponding electronic copy exists; no duplicates of the same document
We make a number of assumptions to simplify the tracking & recognition problem. First, we assume that each paper document has a corresponding electronic copy on the computer, and also that there are no duplicate copies of the same document; in other words, each document is unique.
66 Assumptions
Motion: 3 event types (move/entry/exit); one document at a time; only the topmost document can move
A number of other assumptions are made to constrain the motion of the documents. For instance, we assume that there are 3 types of events (move, entry, exit), and that only one document, the one on top of a stack, can move at a time. Although these assumptions do limit the capability of our system to handle more realistic situations, they were carefully chosen to make the problem tractable while still allowing interesting applications, as we will demonstrate later in the talk.
67 Non-assumptions
Desk need not be initially empty
Also note that there are certain assumptions we don't make. For instance, we don't require the desk to be initially empty. The desk is allowed to start with unknown papers on it, and our system automatically discovers the documents as observations accumulate over time.
68 Non-assumptions
Desk need not be initially empty
Stacks may overlap
Also, the paper stacks are allowed to overlap with each other, forming a complex graph structure, rather than cleanly separated stacks.
69 Algorithm overview
Input frames …
Here is a step-by-step overview of the tracking & recognition algorithm. Given the input sequence…
70 Algorithm overview
Input frames … → Event Detection (before / after)
71 Algorithm overview
… → Event Detection → Event Interpretation: "A document moved from (x1,y1) to (x2,y2)"
72 Algorithm overview
… → Event Interpretation → Document Recognition (File1.pdf, File2.pdf, File3.pdf)
73 Algorithm overview
… → Document Recognition → Scene Graph Update (Desk)
Lastly, we update the scene graph according to the event. The above four steps are repeated for each event in the input sequence. Now, I'll explain each step of the algorithm.
74 Algorithm overview
Input frames … → Event Detection → Event Interpretation → Document Recognition (SIFT) → Scene Graph Update
75 Document tracking example
Here's an example of a move event (before and after frames).
76 Document tracking example
…where this top-left document…
77 Document tracking example
…moves to the right.
78 Document tracking example
To classify the event, we first extract image features in both images…
79 Document tracking example
…then we match them between the two images.
80 Document tracking example
We identify features that have no match (shown in green)…
81 Document tracking example
…and discard them.
82 Document tracking example
Next we cluster matching pairs of features according to their relative transformation. Red features moved under the same transformation, while blue ones stayed where they were.
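The slide does not spell out the clustering procedure, so as a loose illustration only (not necessarily the authors' exact method), here is one way to separate matched features into a "stationary" and a "moved" group by their displacement; both thresholds are placeholders:

```python
# Sketch: features with near-zero displacement form the stationary (blue)
# cluster; the rest are grouped around the dominant displacement (red cluster).
import numpy as np

def cluster_matches(pts_before, pts_after, still_thresh=3.0, agree_thresh=10.0):
    disp = np.asarray(pts_after, float) - np.asarray(pts_before, float)
    stationary = np.linalg.norm(disp, axis=1) < still_thresh
    moving = disp[~stationary]
    if len(moving) == 0:
        return stationary, np.zeros(len(disp), dtype=bool)
    dominant = np.median(moving, axis=0)          # a robust estimate of the common motion
    moved = (~stationary) & (np.linalg.norm(disp - dominant, axis=1) < agree_thresh)
    return stationary, moved
```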
83 Document tracking example
We look at the red cluster, and if it contains sufficiently many features, the event is classified as a move. Otherwise it's a non-move and is subjected to further classification.
84 Document tracking example
Motion: (x, y, θ)
If it's a move, we obtain the motion from the transformation of the red cluster.
85 Document Recognition
Match against PDF image database (File1.pdf, File2.pdf, …)
…where we match features in the region identified as the document against a database of PDF page images stored on the computer, also using SIFT features.
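As a rough sketch of how such matching could look with OpenCV's SIFT implementation (the dictionary of rendered page images and the match-count decision rule are assumptions for illustration, not the authors' exact pipeline):

```python
# Sketch: match the detected document region against each rendered PDF page
# and return the page with the most ratio-test matches. In practice the page
# descriptors would be precomputed once rather than inside the loop.
import cv2

def recognize(region_gray, page_images):           # page_images: {name: gray image}
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()
    _, query_desc = sift.detectAndCompute(region_gray, None)
    best_name, best_count = None, 0
    for name, page in page_images.items():
        _, page_desc = sift.detectAndCompute(page, None)
        pairs = matcher.knnMatch(query_desc, page_desc, k=2)
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
        if len(good) > best_count:
            best_name, best_count = name, len(good)
    return best_name
```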
86 Document Recognition
Performance analysis: tested 20 pages against a database of 162 pages
We tested the performance of our recognition method by testing 20 pages against a database of 162 pages of documents, both mostly from computer science research papers, and the method was able to correctly differentiate and recognize all of them.
87 Document Recognition
Performance analysis: recognition rate vs. document resolution
We also tested the performance with varying document image resolutions. In this graph, the X axis shows the length of the longer side of the document in pixels, and the Y axis shows the success rate of recognition.
88 Document Recognition
Performance analysis: ~200x300 pixels per document for a reliable match
We found that to achieve a recognition rate of 90%, the documents must be at least 200 by 300 pixels. Note that this resolution is not high enough for recognizing text using techniques such as OCR, but it is still good enough for reliable recognition of individual documents.
89 Results
Input video: ~40 minutes, 1024x768 @ 15 fps, 22 documents, 49 events
Running time: video processed offline, no optimization, a few hours for the entire video
Before showing a demo of our system, let me provide some statistics on the input data and video processing. The input video was recorded over a period of 40 minutes, at 1024x768 resolution and 15 frames per second. It contained 22 documents on the desk, with 49 events. The input video was analyzed offline, that is, after the recording was over. We did not optimize the performance at all, and it took a few hours to process the entire input sequence.
90 Demo – Paper tracking
Let me show a demo of the query interface to our system, using the same input sequence I demoed at the beginning of the talk. The right window is the visualization panel showing the current state of the desktop. The left window shows a list of thumbnails of the documents found by the system. The user can browse this list and click on the thumbnail of the document of interest to query its location in the stack. The visualization expands the stack that contains the selected document and highlights the document. The user can open the PDF file of the selected document as well.
The interface also supports a couple of alternative ways to specify a document. The user can locate a document by doing a keyword search for the title or the author. Here I'm looking for the document that contains the string "digitaldesk" in its title. The system tells me the paper is in this stack.
The user can also sort the thumbnails in various ways. For example, the documents can be sorted in decreasing order of the last time the user accessed each document. The oldest document at the end of this list lies at the bottom of this stack; the second oldest document no longer exists on the desk; and the next oldest document is at the bottom of this stack, and so forth. On the other hand, the most recent document at the beginning of this list is on top of this stack; the next most recent document is on top of this stack, and so forth.
91 Photo sorting example
Here's an example of using our system for sorting digital photographs. Sorting a large number of digital photographs using the computer interface is usually a fairly tedious task.
92 Photo sorting example
In contrast, it is very easy to sort printed photographs into physical stacks. So we printed out digital photographs on sheets of paper, and recorded the user sorting them into physical stacks on the desk. Here we sort the photographs from two source stacks, one shown on the bottom right of the video and the other outside the camera view in the user's hand, into three target stacks based on the content of the pictures.
93 Demo – Photo sorting
After processing this video with our system, we can click on each of the three stacks in the query interface and assign it to an appropriate folder on the computer. Then our system automatically organizes the corresponding digital photographs into the designated folder and pops up the folder in thumbnail view.
I should point out that one clear drawback is the overhead of first having to print out the photographs on paper. However, we think that this can be useful for people who are not familiar with computer interfaces.
94 Future work
Enhance realism: handle more realistic desktops; real-time performance
More applications: support other document tasks (e.g., attach reminders, cluster documents); go beyond documents (other 3D desktop objects, books/CDs)
95 Summary
SIFT is a scale/rotation-invariant local feature that is:
Highly distinctive
Robust to occlusion, illumination change, and 3D viewpoint change
Efficient (real-time performance)
Suitable for many useful applications
96 References
Distinctive image features from scale-invariant keypoints. David G. Lowe, International Journal of Computer Vision, 60(2), 2004.
Recognising panoramas. Matthew Brown and David G. Lowe, International Conference on Computer Vision (ICCV 2003), Nice, France, October 2003.
Video-Based Document Tracking: Unifying Your Physical and Electronic Desktops. Jiwon Kim, Steven M. Seitz and Maneesh Agrawala, ACM Symposium on User Interface Software and Technology (UIST 2004).