Visual Event Recognition in Videos by Learning from Web Data
Lixin Duan¹, Dong Xu¹, Ivor Tsang¹, Jiebo Luo²
¹Nanyang Technological University, Singapore ²Kodak Research Labs, Rochester, NY, USA
Outline
– Overview of the Event Recognition System
– Similarity between Videos: Aligned Space-Time Pyramid Matching
– Cross-Domain Problem: Adaptive Multiple Kernel Learning
– Experiments
– Conclusion
Similarity between Videos
Distance via integer-flow Earth Mover's Distance (EMD), Y. Rubner. With $D_{rc}$ the distance between space-time volume $r$ of one video and volume $c$ of the other, find the flow matrix $F = [f_{rc}]$:
$\hat{F} = \arg\min_{F} \sum_{r=1}^{R}\sum_{c=1}^{C} f_{rc}\, D_{rc}$
s.t. $\sum_{r=1}^{R} f_{rc} = 1,\ \sum_{c=1}^{C} f_{rc} = 1,\ f_{rc} \in \{0, 1\}$
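With binary flows and both row and column sums fixed to one, the integer-flow EMD reduces to an assignment problem over the space-time volumes. A minimal brute-force sketch in plain Python (for the small volume counts used at level 1; a Hungarian-algorithm solver would be the practical choice for larger R). The distance values below are illustrative only:

```python
from itertools import permutations

def align_volumes(D):
    """Brute-force integer-flow alignment for small R = C.
    D[r][c] is the distance between space-time volume r of video A
    and volume c of video B; returns the minimum-cost matching."""
    R = len(D)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(R)):  # each perm encodes a 0/1 flow matrix
        cost = sum(D[r][perm[r]] for r in range(R))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Toy 3x3 distance matrix (values are illustrative, not from the paper)
D = [[0.9, 0.2, 0.7],
     [0.4, 0.8, 0.1],
     [0.3, 0.6, 0.5]]
perm, cost = align_volumes(D)  # volume r of A is matched to volume perm[r] of B
```

Here the matching (0→1, 1→2, 2→0) attains the minimum total cost 0.6; the aligned pairs then feed the pyramid-matching kernel.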
Cross-Domain Problem
Data distribution mismatch between consumer videos and web videos:
– Consumer videos: naturally captured
– Web videos: edited and selected
Measure the mismatch with the Maximum Mean Discrepancy (MMD), K. M. Borgwardt:
$\mathrm{MMD} = \left\| \frac{1}{n_A}\sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_B}\sum_{j=1}^{n_B} \varphi(x_j^B) \right\|$
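The squared MMD expands entirely in terms of kernel evaluations, so it can be estimated without an explicit feature map. A minimal sketch, assuming an RBF kernel and toy 2-D feature vectors (the paper's features would be SIFT/ST histograms):

```python
from math import exp

def rbf(x, y, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=1.0):
    """Squared empirical MMD between sample sets X (e.g. web videos)
    and Y (e.g. consumer videos): ||mean phi(X) - mean phi(Y)||^2
    expanded via the kernel trick."""
    n, m = len(X), len(Y)
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (n * n)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (m * m)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (n * m)
    return kxx + kyy - 2 * kxy

# Identical samples give MMD ~ 0; shifted samples give a larger MMD
X = [[0.0, 0.0], [1.0, 0.0]]
Y = [[5.0, 5.0], [6.0, 5.0]]
same = mmd2(X, X)
shifted = mmd2(X, Y)
```

A-MKL penalizes this quantity so that the learned kernel brings the web-video and consumer-video distributions closer together.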
Experiments
Data set:
– 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set
– 6 events: wedding, birthday, picnic, parade, show, and sports
– Training data: 3 videos per event from the consumer videos, plus all web videos
– Test data: the remaining consumer videos
Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM):
– ASTPM outperforms USTPM at Level 1
[Figure: per-event comparison of aligned vs. unaligned matching]
Comparisons of cross-domain learning methods:
– (a) SIFT features
– (b) ST features
– (c) SIFT and ST features combined
– e.g., parade: 75.7% (A-MKL) vs. 62.2% (FR)
Experiments
Comparisons of cross-domain learning methods, relative improvements over:
– SVM_T: 36.9%
– SVM_AT: 8.6%
– Feature Replication (FR): 7.6%
– Adaptive SVM (A-SVM): 49.6%
– Domain Transfer SVM (DTSVM): 9.9%
MKL-based methods:
– Better fuse SIFT features and ST features
– Handle noise in the loose labels
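A-MKL jointly learns the kernel weights and the classifier; as a minimal sketch of just the fusion step (not the full optimization), the combined kernel is a convex combination of base kernel matrices, e.g. one from SIFT features and one from ST features. The matrices and weights below are illustrative only:

```python
def fuse_kernels(kernels, weights):
    """Convex combination K = sum_m d_m * K_m of base kernel matrices,
    the core ingredient of MKL-style feature fusion.
    Weights must be non-negative and sum to 1 (simplex constraint)."""
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Two toy 2x2 base kernels (e.g. one from SIFT, one from ST features)
K_sift = [[1.0, 0.2], [0.2, 1.0]]
K_st   = [[1.0, 0.6], [0.6, 1.0]]
K = fuse_kernels([K_sift, K_st], [0.5, 0.5])
```

In A-MKL the weights are learned together with an MMD-based distribution-matching penalty rather than fixed by hand as here.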
Conclusion
– We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.
– We develop a new aligned space-time pyramid matching method.
– We present a new cross-domain learning method, A-MKL, which handles the mismatch between the data distributions of the consumer video domain and the web video domain.
References
D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008.
I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.
References
F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.
J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007.
D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.