Visual Event Recognition in Videos by Learning from Web Data
Lixin Duan, Dong Xu, Ivor Tsang, Jiebo Luo
Nanyang Technological University, Singapore

Presentation transcript:

Visual Event Recognition in Videos by Learning from Web Data
Lixin Duan, Dong Xu, Ivor Tsang, Jiebo Luo
Nanyang Technological University, Singapore
Kodak Research Labs, Rochester, NY, USA

Outline
– Overview of the Event Recognition System
– Similarity between Videos: Aligned Space-Time Pyramid Matching
– Cross-Domain Problem: Adaptive Multiple Kernel Learning
– Experiments
– Conclusion

Overview
– GOAL: Recognize consumer videos
– Large intra-class variability; limited labeled videos
– Example events: Sports, Picnic, Wedding

Overview
– GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube)
– Figure: consumer videos alongside a large number of web videos (Sports, Picnic, Wedding)

Overview
– Flowchart of the system: video database, test video, classifier, output

Similarity between Videos
– Pyramid matching methods
  – Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1]
  – Unaligned space-time pyramid matching, I. Laptev [2]
– Figure labels: time axis, space axes, space-time axes

Similarity between Videos
– Aligned Space-Time Pyramid Matching: Level 1 distance
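The partitioning step behind space-time pyramid matching can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each axis (x, y, t) is split into 2**level equal parts, so level 1 yields 8 blocks, and that each local feature has already been quantized to a visual-word index. The function names and the feature tuple layout are hypothetical.

```python
from collections import Counter


def block_index(x, y, t, width, height, duration, level):
    """Map a space-time point to its block id at a given pyramid level."""
    n = 2 ** level  # assumption: every axis is split into 2**level equal parts
    bx = min(int(x / width * n), n - 1)
    by = min(int(y / height * n), n - 1)
    bt = min(int(t / duration * n), n - 1)
    return (bt * n + by) * n + bx


def block_histograms(features, width, height, duration, level, vocab_size):
    """Build one normalized visual-word histogram per space-time block.

    features: iterable of (x, y, t, word) tuples (hypothetical layout).
    """
    n_blocks = (2 ** level) ** 3
    counts = [Counter() for _ in range(n_blocks)]
    for x, y, t, word in features:
        counts[block_index(x, y, t, width, height, duration, level)][word] += 1
    hists = []
    for c in counts:
        total = sum(c.values())
        hists.append([c[w] / total if total else 0.0 for w in range(vocab_size)])
    return hists


# Two features in opposite corners of a 100 x 100 frame, 10-second clip.
demo = block_histograms(
    [(0, 0, 0.0, 0), (99, 99, 9.9, 1)],
    width=100, height=100, duration=10.0, level=1, vocab_size=2,
)
```

The per-block histograms are what a block-to-block distance would then compare.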

Similarity between Videos
– Distance: integer-flow Earth Mover's Distance (EMD), Y. Rubner [3], minimized subject to (s.t.) flow constraints
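When both videos have the same number of equally weighted blocks, an integer-flow EMD reduces to a minimum-cost one-to-one matching between blocks. The sketch below illustrates that reduction under stated assumptions: the ground distance is Euclidean distance between block histograms (an assumption, not taken from the slides), and the matching is found by brute force, which is only practical for small block counts such as the 8 blocks of level 1. The function names are hypothetical.

```python
import math
from itertools import permutations


def ground_distance(h1, h2):
    """Euclidean distance between two block histograms (assumed ground distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def integer_flow_emd(blocks_a, blocks_b):
    """Average cost of the cheapest one-to-one block matching.

    Each block in one video sends exactly one unit of flow to exactly one
    block in the other video, so every flow value is 0 or 1.
    """
    r = len(blocks_a)
    if r != len(blocks_b):
        raise ValueError("both videos need the same number of blocks")
    cost = [[ground_distance(a, b) for b in blocks_b] for a in blocks_a]
    best = min(
        sum(cost[i][p[i]] for i in range(r)) for p in permutations(range(r))
    )
    return best / r


# The same two blocks in swapped order match perfectly after alignment.
aligned = integer_flow_emd([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

For larger block counts, an assignment solver such as the Hungarian algorithm would replace the permutation search.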

Cross-Domain Problem
– Data distribution mismatch between consumer videos and web videos
  – Consumer videos: naturally captured
  – Web videos: edited; selected
– Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]
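MMD measures the mismatch between two samples as the distance between their mean embeddings in a kernel feature space. A minimal sketch of the standard biased empirical estimate follows; the RBF kernel, its `gamma` value, and the function names are assumptions chosen for illustration, not details from the slides.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel between two feature vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(source, target, kernel=rbf_kernel):
    """Biased empirical estimate of squared MMD between two samples."""

    def mean_kernel(xs, ys):
        return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Toy stand-ins for web-video and consumer-video feature vectors.
web = [[0.0, 0.0], [0.1, 0.0]]
consumer = [[1.0, 1.0], [0.9, 1.1]]
gap = mmd_squared(web, consumer)
```

A value near zero indicates similar distributions; larger values indicate a stronger domain mismatch.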

Cross-Domain Problem
– Prior information

Cross-Domain Problem
– Adaptive Multiple Kernel Learning (A-MKL)
– Objective terms: MMD and the structural risk functional
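The two labels on this slide suggest an objective that trades a distribution-mismatch term against a classification term. One hedged way to write an objective of that shape is given below. Every symbol is notation assumed here for illustration and is not recovered from the slide: $d$ is a vector of kernel combination weights, $\Omega(d)$ is the MMD between the two domains under the combined kernel, $J(d)$ is the structural risk functional, and $\theta > 0$ balances the two terms.

```latex
\min_{d}\; \frac{1}{2}\,\Omega^{2}(d) \;+\; \theta\, J(d),
\qquad
k = \sum_{m=1}^{M} d_m k_m,\quad d_m \ge 0,\quad \sum_{m=1}^{M} d_m = 1 .
```

Under this reading, the weights $d$ are chosen so that the combined kernel both reduces the mismatch between domains and yields a low-risk classifier.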

Experiments
– Data set: 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set [5]
– 6 events: wedding, birthday, picnic, parade, show and sports
– Training data: 3 consumer videos per event, plus all web videos
– Test data: the remaining consumer videos
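The split sizes implied by these numbers can be worked out directly. The short sketch below only does that arithmetic; the variable names are hypothetical.

```python
EVENTS = ["wedding", "birthday", "picnic", "parade", "show", "sports"]
N_CONSUMER = 195      # consumer videos in the data set
N_WEB = 906           # web videos in the data set
TRAIN_PER_EVENT = 3   # labeled consumer videos used per event

n_consumer_train = TRAIN_PER_EVENT * len(EVENTS)  # consumer videos used for training
n_train = n_consumer_train + N_WEB                # plus all web videos
n_test = N_CONSUMER - n_consumer_train            # remaining consumer videos

print(n_consumer_train, n_train, n_test)  # prints: 18 924 177
```

Only 18 of the 924 training videos come from the consumer domain, which is why the mismatch between the two domains matters.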

Experiments
– Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM)
– ASTPM is better than USTPM at Level 1
– Figure panels: Aligned, Unaligned

Experiments
– Comparisons of cross-domain learning methods
  – (a) SIFT features
  – (b) ST features
  – (c) SIFT features and ST features
– parade: 75.7% (A-MKL) vs. 62.2% (FR)

Experiments
– Comparisons of cross-domain learning methods
– Relative improvements (A-MKL over each method)
  – SVM_T: 36.9%
  – SVM_AT: 8.6%
  – Feature Replication (FR) [6]: 7.6%
  – Adaptive SVM (A-SVM) [8]: 49.6%
  – Domain Transfer SVM (DTSVM) [7]: 9.9%
– MKL-based methods
  – Better fuse SIFT features and ST features
  – Handle noise in the loose labels
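As a worked example of how such a figure is computed, the sketch below assumes "relative improvement" means (new - baseline) / baseline and applies it to the per-event parade numbers quoted earlier (75.7% for A-MKL vs. 62.2% for FR). The definition and the function name are assumptions; the percentages listed above are overall figures and are not reproduced by this single-event calculation.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent (assumed definition)."""
    return (new - baseline) / baseline * 100.0


# Parade event: A-MKL vs. Feature Replication (FR).
parade_gain = relative_improvement(75.7, 62.2)
print(round(parade_gain, 1))  # prints: 21.7
```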

Conclusion
– We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.
– We develop a new aligned space-time pyramid matching method.
– We present a new cross-domain learning method, A-MKL, which handles the mismatch between the data distributions of the consumer video domain and the web video domain.

References
[1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997.
[2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR.
[3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover's distance as a metric for image retrieval. IJCV, 40(2).
[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.
[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML.
[6] H. Daumé III. Frustratingly easy domain adaptation. In ACL.
[7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR.
[8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM.
[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

Thank you!