Ren Haoyu 2009.10.30 ICCV 2009 Paper Reading


Ren Haoyu ICCV 2009 Paper Reading

Selected Paper Paper 1 –187 LabelMe Video: Building a Video Database with Human Annotations –J. Yuen, B. Russell, C. Liu, and A. Torralba Paper 2 –005 An HOG-LBP Human Detector with Partial Occlusion Handling –X. Wang, T. Han, and S. Yan

Paper 1 Overview Author information Abstract Paper content –To build a database –Database information –Database function Conclusion

Author information (1/4) Jenny Yuen –Education BS in Computer Science from the University of Washington; now a third-year PhD student at MIT-CSAIL, advised by Professor Antonio Torralba –Research Interest Computer vision (object/action recognition, scene understanding, image/video databases) –Papers 2 ECCV’08, 1 CVPR’09

Author information (2/4) Bryan Russell –Education... Postdoctoral Fellow at INRIA, WILLOW team –Research Interest 3D object and scene modeling, analysis, and retrieval Human activity capture and classification Category-level object and scene recognition –Papers 1 NIPS’09, 1 CVPR’09, 1 CVPR’08

Author information (3/4) Ce Liu –Education BS at the Department of Automation, Tsinghua University MS at the Department of Automation, Tsinghua University PhD at the Department of Electrical Engineering and Computer Science, MIT –Research Interest Computer vision, computer graphics, computational photography, applications of machine learning in vision and graphics –Publications ECCV’08, CVPR’08, PAMI’08…

Author information (4/4) Antonio Torralba –Education … Associate Professor at MIT-CSAIL –Research Interest Scene and object recognition –Papers 3 CVPR’09, 2 ICCV’09…

Abstract (1/1) Problem –Currently, video analysis algorithms suffer from a lack of information about the objects present and their interactions, as well as from the absence of comprehensive annotated video databases for benchmarking. Main contribution –We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos. –The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information. –Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations. We complement this paper by demonstrating potential uses of this database, studying motion statistics as well as cause-effect motion relationships between objects.

To build a database (1/1) Existing databases take little account of prior knowledge of motion, location, and appearance at the object and object-interaction levels in real-world videos. The goal is to build one that scales in quantity, variety, and quality like the benchmark databases currently available for both static images and videos. Diversity, accuracy, and openness –We want to collect a large and diverse database of videos that span many different scene, object, and action categories, and to accurately label the identity and location of objects and actions. –Furthermore, we wish to allow open and easy access to the data without copyright restrictions.

Database Information (1/1) 238 object classes, 70 action classes and 1,903 video sequences

Database function (1/1) Object Annotation Event Annotation Annotation interpolation –To fill the missing polygons in between key frames –2D/3D interpolation Occlusion handling and depth ordering Cause-effect relations within moving objects

Annotation interpolation (1/3) 2D interpolation –Given the object's 2D positions p0 and p1 at two key frames t = 0 and t = 1, assume that the points outlining the object are transformed by a 2D projection plus a residual term: p1 = S R p0 + T + r, where S, R, and T are the scaling, rotation, and translation matrices encoding the projection from p0 to p1 that minimizes the residual term r –The polygon at any frame t ∈ [0, 1] can then be linearly interpolated as p_t = S_t R_t p0 + t T + t r, where S_t and R_t interpolate linearly between the identity at t = 0 and S, R at t = 1
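The 2D interpolation above can be sketched in a few lines (assuming the similarity-transform parameters are interpolated linearly between the identity at t = 0 and the full transform at t = 1; the function name and parameterization are illustrative, not from the paper):

```python
import numpy as np

def interp_polygon_2d(p0, p1, t, s, theta, T):
    """Interpolate polygon points between two key frames at t=0 and t=1.

    p0, p1 : (N, 2) arrays of corresponding polygon points.
    s, theta, T : scale, rotation angle, and translation mapping p0 to p1
                  (assumed estimated beforehand, e.g. by least squares).
    """
    # Interpolate the transform parameters (identity at t=0, full at t=1).
    s_t = (1.0 - t) + t * s
    a_t = t * theta
    R_t = np.array([[np.cos(a_t), -np.sin(a_t)],
                    [np.sin(a_t),  np.cos(a_t)]])
    # Residual between the fully transformed p0 and the observed p1.
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    r = p1 - (s * (R @ p0.T).T + T)
    # Partially applied transform plus a linearly scaled residual.
    return s_t * (R_t @ p0.T).T + t * T + t * r
```

At t = 0 this returns p0 exactly, and at t = 1 it returns p1, so the interpolation is consistent with both key frames.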

Annotation interpolation (2/3) 3D interpolation –Assuming that a given point on the object moves in a straight line in the 3D world, the motion of point X(t) at time t in 3D can be written as X(t) = X0 + w(t) D, where X0 is the initial point, D is the 3D direction, and w(t) is the displacement along the direction vector –The image coordinates for points on the object are the projections of X(t) –Assuming perspective projection and a stationary camera, the intrinsic and extrinsic parameters of the camera can be expressed as a 3×4 matrix P, and the points projected onto the image plane are x(t) ∝ P X(t) (in homogeneous coordinates)

Annotation interpolation (3/3) 3D interpolation –Assuming that the point moves with constant velocity v, the displacement is w(t) = v t –In summary, to find the image coordinates for points on the object at any time, we simply need to know the coordinates of a point at two different times –Given a corresponding second point along the path projected into another frame, the velocity v can be recovered from the two projections Comparison of 2D and 3D interpolation –Pixel error per object class
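A minimal sketch of the constant-velocity 3D model and its projection (assuming a known 3×4 camera matrix P; the names are illustrative):

```python
import numpy as np

def project_point(P, X):
    """Project a 3D point through a 3x4 camera matrix (homogeneous division)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def image_track(P, X0, D, v, t):
    """Image coordinates at time t of a point starting at X0 and moving with
    constant speed v along 3D direction D: X(t) = X0 + v*t*D."""
    X_t = np.asarray(X0) + v * t * np.asarray(D)
    return project_point(P, X_t)
```

With two observed projections of the same point at different times, v can be solved for by inverting this relation, which is all the interpolation needs.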

Occlusion handling and depth ordering (1/1) –Method 1 (does not work): model the appearance of the object; wherever it overlaps with another object, infer which object owns the visible part by matching appearance –Method 2 (does not work): when two objects overlap, the polygon with more control points in the intersection region is in front –Method 3 (works): extract accurate depth information using the object labels and infer support relationships from a large database of annotated images; defining a subset of objects as ground objects (e.g., road, sidewalk), infer the support relationship by counting how many times the bottom part of a polygon overlaps with the supporting object (e.g., a person on a road)
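The counting idea in Method 3 can be sketched as a toy helper on a boolean ground mask (the actual system accumulates polygon-overlap statistics over a large annotated database; this function is hypothetical):

```python
def bottom_support_count(bottom_points, ground_mask):
    """Count how many of an object's bottom polygon points fall inside the
    ground object's region; a high count suggests the ground supports it."""
    return sum(1 for (x, y) in bottom_points if ground_mask[y][x])
```

Comparing counts against candidate ground objects (road vs. sidewalk, say) picks the most likely supporter.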

Cause-effect relations within moving objects (1/1) Cause-effect relations –Define a measure of causality: the degree to which an object of class C causes the motion of an object of class E

Examples (1/1)

Conclusion (1/1) We designed an open, easily accessible, and scalable annotation system to allow online users to label a database of real-world videos. Using our labeling tool, we created a video database that is diverse in samples and accurate, with human-guided annotations. Based on this database, we studied motion statistics and cause-effect relationships between moving objects to demonstrate examples of the wide array of applications for our database. Furthermore, we enriched our annotations by propagating depth information from a static and densely annotated image database.

Paper 2 Overview Author information Abstract Paper content –HOG feature –LBP feature –Occlusion handling Experimental result Conclusion

Author information (1/2) Wang Xiaoyu: homepage not found Tony Xu Han –Education PhD at University of Illinois at Urbana-Champaign (Advisor: Prof. Thomas Huang) MS at University of Rhode Island MS at Beijing Jiaotong University BS at Beijing Jiaotong University –Research Interest Computer vision, machine learning, human-computer interaction, elder care technology –Papers 1 CSVT’08, 1 CSVT’09, 1 CVPR’09

Author information (2/2) Yan Shuicheng –Education BS, MS, PhD at the Department of Mathematics, Peking University Assistant Professor in the Department of Electrical and Computer Engineering at the National University of Singapore Founding lead of the Learning and Vision Research Group –Research Interest Activity and event detection in images and videos, subspace learning and manifold learning, transfer learning… –Papers 2 CVPR’09, 1 IP’09, 1 PAMI’09, 2 ACM’09…

Abstract (1/1) Problem –Performance, occlusion handling Main contribution –By combining Histograms of Oriented Gradients (HOG) and Local Binary Patterns (LBP) as the feature set, we propose a novel human detection approach capable of handling partial occlusion. –For each ambiguous scanning window, we construct an occlusion likelihood map using the response of each block of the HOG feature to the global detector. –If partial occlusion is indicated with high likelihood in a certain scanning window, part detectors are applied on the unoccluded regions to achieve the final classification on the current scanning window. We achieve a detection rate of 91.3% at FPPW = 10^-6, 94.7% at FPPW = 10^-5, and 97.9% at FPPW = 10^-4 on the INRIA dataset, which, to our best knowledge, is the best human detection performance on the INRIA dataset.

HOG feature (1/4) HOG feature –Histogram of Oriented Gradients, the most famous and most successful feature in human detection –Calculates a histogram of gradient orientations, voted by gradient magnitude –Performs well with linear SVM, kernel SVM (RBF, intersection kernel, quadratic…), LDA + AdaBoost, SVM + AdaBoost, LogitBoost… HOG feature extraction

HOG feature (2/4) HOG feature extraction –For a 64x128 patch, the minimum cell size is 8x8 –Quantize the gradient orientation into 9 bins; use tri-linear interpolation + Gaussian weighting to vote the gradient magnitude –A block consists of 2x2 cells, overlapping 50%, for a total of 105 blocks with 3,780 bins –An integral image can speed up HOG feature extraction, but only without tri-linear interpolation and Gaussian weighting [Figure: a pixel's vote is distributed to cells C0–C3 with bilinear weights dx, 1−dx, dy, 1−dy; 9 orientation bins per cell]
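The block and dimension counts above follow from simple arithmetic; a small sketch of the standard HOG layout (illustrative helper, not code from the paper):

```python
def hog_layout(win_w=64, win_h=128, cell=8, block=2, bins=9):
    """Feature layout for a detection window: blocks of block x block cells
    slide with a one-cell stride, i.e. 50% overlap between adjacent blocks."""
    cells_x, cells_y = win_w // cell, win_h // cell  # 8 x 16 cells
    blocks_x = cells_x - block + 1                   # 7 block positions across
    blocks_y = cells_y - block + 1                   # 15 block positions down
    n_blocks = blocks_x * blocks_y                   # 105 blocks
    n_dims = n_blocks * block * block * bins         # 3,780 feature dimensions
    return n_blocks, n_dims
```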

HOG feature (3/4) Convoluted Tri-linear Interpolation (CTI) –Use CTI instead of tri-linear interpolation to fit the integral image framework –Vote each gradient, with a real-valued direction between 0 and π, into the 9 discrete bins according to its direction and magnitude –Use bilinear interpolation to distribute the gradient magnitude into the two adjacent bins –Design a 7x7 convolution kernel whose weights are distributed to the neighborhood linearly according to distance; convolve this kernel over the orientation bin images to achieve the tri-linear interpolation –Build the integral image on the convolved bin images

HOG feature (4/4) Convoluted Tri-linear Interpolation (CTI) –Using FFT for the convolution is efficient, so CTI does not increase the space complexity of the integral image approach
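One plausible construction of such a linear-falloff kernel (the paper's exact weights are not reproduced here; this is an illustrative separable triangular kernel):

```python
import numpy as np

def cti_kernel(size=7):
    """Separable kernel whose weights fall off linearly with distance from the
    center; convolving each of the 9 orientation-bin images with it spreads
    every vote over the neighborhood, mimicking tri-linear interpolation."""
    half = size // 2
    tri = 1.0 - np.abs(np.arange(-half, half + 1)) / (half + 1)
    return np.outer(tri, tri)
```

The kernel is symmetric with its maximum at the center, so nearby pixels receive the largest share of each vote.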

LBP feature (1/1) LBP feature –Local Binary Pattern, an exceptional texture descriptor widely used in various applications, which has achieved very good results in face recognition –Build pattern histograms in cells –Use LBP_{n,r}^u to denote the LBP feature that takes n sample points at radius r, where the number of 0-1 transitions is no more than u –Use bilinear interpolation to locate the sample points, with the Euclidean distance as the distance measure –Integral images for fast extraction
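A toy computation of a uniform LBP code for a single pixel (8 samples at radius 1 taken from the 3x3 neighborhood; a hypothetical helper, not the paper's implementation):

```python
def uniform_lbp(patch, u=2):
    """LBP code of the center pixel of a 3x3 patch (8 samples, radius 1).
    Returns the 8-bit code if the circular 0-1 transition count is <= u,
    else None (non-uniform patterns are typically pooled into one bin)."""
    # Clockwise neighbors starting at the top-left corner.
    idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    c = patch[1][1]
    bits = [1 if patch[i][j] >= c else 0 for i, j in idx]
    transitions = sum(bits[k] != bits[(k + 1) % 8] for k in range(8))
    if transitions > u:
        return None
    return sum(b << k for k, b in enumerate(bits))
```

For example, a flat patch yields the all-ones code 255 (0 transitions), while a checkerboard-like neighborhood has 8 transitions and is rejected as non-uniform.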

Occlusion handling (1/5) Basic idea –If a portion of the pedestrian is occluded, the densely extracted blocks of features in that area uniformly respond to the linear SVM classifier with negative inner products –Use the classification score of each block to infer whether occlusion occurs and where –When occlusion occurs, the part-based detector is triggered to examine the unoccluded portion

Occlusion handling (2/5) The decision function of the linear SVM is f(x) = w·x + b, where w is the weighting vector, split block-wise as w = (w_{B_1}, …, w_{B_105}). We distribute the constant bias b over the blocks as per-block terms b_{B_i} with b = Σ_i b_{B_i}. The real contribution of a block B_i is then obtained by subtracting the corresponding bias from the summation of feature inner products over this block. So the key problem is how to learn the b_{B_i}.
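The block-wise decomposition of the SVM score can be sketched as follows (assuming the bias has already been split into per-block terms b_i; a minimal illustration):

```python
def block_scores(w_blocks, b_blocks, x_blocks):
    """Per-block contributions f_i = w_i . x_i + b_i to a linear SVM score
    f(x) = w . x + b, where w and x are split block-wise and b = sum(b_i)."""
    return [sum(wk * xk for wk, xk in zip(w, x)) + bi
            for w, bi, x in zip(w_blocks, b_blocks, x_blocks)]
```

Summing the per-block scores recovers the full decision value, so each f_i can be read as that block's vote for or against the pedestrian hypothesis.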

Occlusion handling (3/5) Learn the b_{B_i}, i.e. the per-block bias constants, from the training part of the INRIA dataset by collecting the relative ratio of the bias constant in each block to the total bias constant. Denote the sets of HOG features of positive/negative training samples as X+ and X− (N+/N− is the number of positive/negative samples), and denote the feature of the ith block as x_{B_i}.

Occlusion handling (4/5) Denoted, we have i.e. where Then we have

Occlusion handling (5/5) Implementation –Construct a binary occlusion likelihood image from the response of each block of the HOG feature; its intensity is the sign of the block score –Use mean shift to segment possible occlusion regions on the binary occlusion likelihood image, with the magnitude of the block score as the weight –A segmented region of the window with an overall negative response is inferred as an occluded region, and the part detector is applied; but if all segmented regions are consistently negative, we tend to treat the image as a negative sample [Figure: original patches and their occlusion likelihood images]

Experimental result (1/3) Experiment 1: using the cell-structured LBP feature on the INRIA dataset –LBP with L1 norm and 16x16 cell size shows the best result, about 94.0% detection at FPPW = 10^-4

Experimental result (2/3) Experiment 2: using the HOG-LBP feature on the INRIA dataset –HOG-LBP outperforms all state-of-the-art algorithms under both FPPI and FPPW criteria

Experimental result (3/3) Experiment 3: occlusion handling –The occlusion handling strategy yields a clear improvement in detection performance –Occlusion is simulated by overlaying PASCAL-segmented objects on the testing images of the INRIA dataset

Conclusion (1/1) We propose a human detection approach capable of handling partial occlusion, with a feature set that combines tri-linear interpolated HOG with LBP in the integral image framework. Our experiments show that the HOG-LBP feature outperforms other state-of-the-art detectors on the INRIA dataset. However, our detector cannot yet handle the articulated deformation of people, which is the next problem to be tackled.

Thanks