Human-Assisted Motion Annotation
Ce Liu, William T. Freeman, Edward H. Adelson (Massachusetts Institute of Technology); Yair Weiss (The Hebrew University of Jerusalem)

Motivations
Existing motion databases are either synthetic or limited to indoor, experimental setups. Can we obtain ground-truth motion for arbitrary, real-world videos?
Humans are experts at segmenting moving objects and at perceiving the differences between two frames. Can a computer vision system quantify human perception of motion and generate ground truth for motion analysis?
Several issues need to be addressed:
1. Is human labeling reliable (compared to the veridical ground truth) and consistent (across subjects)?
2. How can every pixel of every frame be labeled efficiently for hundreds of real-world videos?

Our work
We designed a human-in-the-loop system to annotate motion for real-world videos:
- Semiautomatic layer segmentation. The user labels contours using polygons, and the system automatically propagates the contours to other frames. The system can also propagate the user's corrections across frames.
- Automatic layer-wise optical flow. The system automatically computes dense optical flow fields for every layer at every frame using user-specified parameters. For each layer, the user picks the flow that yields the correct matching and agrees with the smoothness and discontinuities of the image.
- Semiautomatic motion labeling. When flow estimation fails, the user can label sparse correspondences between two frames, and the system automatically interpolates them to a dense flow field.
- Automatic full-frame motion composition.
Our methodology is validated by comparison with veridical ground-truth data and by user studies. We created a ground-truth motion database consisting of 10 real-world video sequences (still growing). This database can be used for evaluating motion analysis algorithms as well as other vision and graphics applications.
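The semiautomatic motion labeling step above interpolates a handful of user-labeled correspondences to a dense flow field. A minimal sketch of that idea, using simple Gaussian-weighted averaging rather than the system's actual regularized interpolation (function name, weighting scheme, and `sigma` are illustrative assumptions):

```python
import numpy as np

def sparse_to_dense_flow(points, flows, height, width, sigma=20.0):
    """Spread sparse (x, y) correspondences with flow vectors (u, v) into a
    dense (H, W, 2) flow field via Gaussian-weighted averaging. A simplified
    stand-in for the system's regularized interpolation."""
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)   # (H*W, 2)
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)   # squared distances
    w = np.exp(-d2 / (2 * sigma ** 2)) + 1e-12                        # Gaussian weights
    dense = (w @ flows) / w.sum(axis=1, keepdims=True)                # weighted mean
    return dense.reshape(height, width, 2)

# Two labeled correspondences: leftward motion on the left, rightward on the right.
pts = np.array([[5.0, 10.0], [25.0, 10.0]])
flo = np.array([[-1.0, 0.0], [1.0, 0.0]])
field = sparse_to_dense_flow(pts, flo, 20, 30)
```

Pixels near each labeled point inherit its motion, and the field blends smoothly in between; a real implementation would additionally respect layer boundaries and image smoothness.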
Figure 1. The graphical user interface (GUI) of our system: (a) main window for labeling contours and feature points; (b) depth controller to change the depth value; (c) magnifier; (d) optical flow viewer; (e) control panel.

Figure 3. (a) A selected frame; (b) layer labeling; (c) user-annotated motion; (d) ground-truth motion; (e) difference between (c) and (d). For the RubberWhale sequence, we labeled 20 layers in (b) and obtained the annotated motion in (c). The ground-truth motion is shown in (d). The error between (c) and (d) is 3.21° in average angular error (AAE) and 0.104 in average endpoint error (AEP), excluding the outliers (black dots) in (d).

Figure 4. The marginal ((a)~(h)) and joint ((i)~(n)) statistics of the ground-truth motion from the database we created (log histograms). Symbols u and v denote horizontal and vertical motion, respectively.

Figure 5. Some frames of the ground-truth motion database we created. We obtained ground-truth flow fields that are consistent with object boundaries, as shown in columns (3) and (4). In comparison, the output of an optical flow algorithm is shown in column (5). From Table 1, the performance of this algorithm on our database is worse than its performance on the Yosemite sequence (1.723° AAE, 0.071 AEP).

Sequence  (a)     (b)      (c)     (d)     (e)     (f)     (g)     (h)
AAE       8.996°  58.905°  2.573°  5.313°  1.924°  5.689°  5.243°  13.306°
AEP       0.976   4.181    0.456   0.346   0.085   0.196   0.385   1.567

References
S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In Proc. ICCV, 2007.
C. Liu, W. T. Freeman, E. H. Adelson, and Y. Weiss. Human-Assisted Motion Annotation. Submitted to CVPR 2008.
A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods. IJCV, 61(3):211–231, 2005.
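The AAE and AEP numbers reported above follow the standard flow-error definitions: the angular error between the space-time direction vectors (u, v, 1), and the Euclidean distance between the 2D flow vectors. A small self-contained sketch of both metrics (the function name is illustrative):

```python
import numpy as np

def flow_errors(uv_est, uv_gt):
    """Average angular error (AAE, degrees) and average endpoint error (AEP)
    between estimated and ground-truth flow fields of shape (H, W, 2)."""
    u1, v1 = uv_est[..., 0], uv_est[..., 1]
    u2, v2 = uv_gt[..., 0], uv_gt[..., 1]
    # Angular error between the space-time direction vectors (u, v, 1).
    num = 1.0 + u1 * u2 + v1 * v2
    den = np.sqrt(1 + u1 ** 2 + v1 ** 2) * np.sqrt(1 + u2 ** 2 + v2 ** 2)
    aae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))).mean()
    # Endpoint error: Euclidean distance between the 2D flow vectors.
    aep = np.sqrt((u1 - u2) ** 2 + (v1 - v2) ** 2).mean()
    return aae, aep
```

For example, an estimate of (1, 0) against a ground truth of (0, 0) gives a 45° angular error and an endpoint error of 1 pixel; masking out occluded or outlier pixels before averaging, as done for Figure 3, is left out of this sketch.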
From these statistics it is evident that horizontal motion dominates vertical motion; vertical motion is sparser than horizontal motion; flow fields are sparser than natural images; and spatial derivatives are sparser than temporal derivatives.

Table 1. The performance of an optical flow algorithm on our database.

Figure 2. The consistency of nine subjects' annotations. Clockwise from top left: the image frame, mean labeled motion, mean absolute error (red: higher error, white: lower error), and error histogram.

Experiment
We applied our system to annotating a veridical example, the RubberWhale sequence (Figure 3). Our annotation is very close to the ground truth: 3.21° AAE, 0.104 AEP. The main difference is at the occluding boundary.
We tested the consistency of human annotation (Figure 2). The mean error is 0.989° AAE, 0.112 AEP. The error magnitude correlates with the blurriness of the image.
We created a ground-truth motion database containing 10 real-world videos with 341 frames (Figure 5, Table 1), covering both indoor and outdoor scenes. The statistics of the ground-truth motion are plotted in Figure 4.

Color map for flow visualization

System Features
We used state-of-the-art computer vision algorithms to design our system. Many of the objective functions in contour tracking, flow estimation, and flow interpolation use L1 norms for robustness. Techniques such as iteratively reweighted least squares (IRLS), pyramid-based coarse-to-fine search, and occlusion/outlier detection were used intensively to optimize these nonlinear objective functions. The system was written in C++, and Qt 4.3 was used for the GUI (Figure 1). Our system has all the components to make annotation simple and easy, and it also gives the user full freedom to label motion manually.
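The IRLS machinery mentioned in System Features turns an L1-robust objective into a sequence of weighted L2 problems, each reweighted by the inverse of the current residual magnitude. A generic sketch of this technique on a robust line fit, not the poster's actual contour-tracking or flow objectives (function name and parameters are assumptions):

```python
import numpy as np

def irls_l1(A, b, n_iter=50, eps=1e-6):
    """Minimize ||Ax - b||_1 via iteratively reweighted least squares:
    each iteration solves a weighted L2 problem with weights 1/|residual|,
    floored at eps to avoid division by zero."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]           # plain L2 initialization
    for _ in range(n_iter):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r), eps)           # reweighting for the L1 norm
        Aw = A * w[:, None]
        x = np.linalg.solve(A.T @ Aw, Aw.T @ b)        # weighted normal equations
    return x

# Robust line fit: one gross outlier barely moves the L1 solution.
t = np.linspace(0, 1, 21)
y = 2.0 * t + 1.0
y[5] += 50.0
A = np.stack([t, np.ones_like(t)], axis=1)
slope, intercept = irls_l1(A, y)
```

The outlier's weight shrinks as its residual grows, so the fit converges to the line through the inliers; an ordinary least-squares fit on the same data would be pulled far off. The system's real objectives add coarse-to-fine search and occlusion handling on top of this core.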