Saliency-guided Video Classification via Adaptively Weighted Learning

Presentation transcript:

Saliency-guided Video Classification via Adaptively Weighted Learning. ICME 2017. Yunzhen Zhao and Yuxin Peng*, Institute of Computer Science and Technology, Peking University, Beijing 100871, China. pengyuxin@pku.edu.cn

Outline: Introduction, Method, Experiments, Conclusion

Introduction. Large-scale Internet video (per Cisco): by 2021, it would take an individual more than 5 million years to watch the amount of video that will cross global IP networks each month, and IP video traffic will be 82 percent of all consumer Internet traffic, up from 73 percent in 2016. Such big video data drives the need for video classification, one of the key techniques for video understanding and analysis. Source: Cisco Visual Networking Index, 2017.

Introduction. What is video classification? Learning semantics from video content and automatically classifying videos into pre-defined categories, e.g., human actions and multimedia events. Example classes: Birthday, Celebration, Parade, HorseRiding, PlayingGuitar.

Introduction. Wide applications: human-computer interaction, video search, sports analysis, surveillance.

Introduction. Deep video classification. Inspired by the great progress of DNNs on image classification, DNN-based video classification has become a research hotspot. A classic deep video classification method is the two-stream ConvNet architecture. K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.

Introduction. Deep video classification. Ji et al. developed a 3D CNN architecture: a 2D CNN computes features from the spatial dimensions only, while a 3D CNN computes features from both the spatial and temporal dimensions. S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE TPAMI, 2013.

Introduction. Two problems: (1) from the view of motion, video frames can be decomposed into salient and non-salient areas, which should be treated differently; (2) information from multiple streams plays different roles in video classification, so the streams should also be weighted differently.

Introduction. Main contributions: (1) use optical flow to segment video frames into salient and non-salient areas, without any supervision information; (2) propose a hybrid framework that combines 3D and 2D CNNs to model multi-stream information from salient and non-salient areas, respectively; (3) introduce an adaptively weighted learning method that adaptively learns different fusion weights for the multiple streams.

Guide Line Introduction Method Experiment Conclusion

Method. Framework: salient area prediction. Motion in videos guides the prediction of salient areas; salient areas carry static and motion information, while non-salient areas carry background information.

Method. Framework: hybrid CNN networks. The framework includes three CNN streams: two 3D CNNs model static and motion information from salient areas, and one 2D CNN models background information from non-salient areas.
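
The paper does not spell out the backbones in this transcript, so here is only a minimal PyTorch sketch of how such a hybrid three-stream model could be wired up; the tiny convolutional stacks are placeholders, not the authors' networks.

```python
import torch
import torch.nn as nn

class HybridThreeStream(nn.Module):
    """Sketch: two 3D-CNN streams for salient areas (RGB and optical flow)
    plus one 2D-CNN stream for the non-salient background. The backbones
    below are illustrative placeholders."""

    def __init__(self, num_classes: int):
        super().__init__()
        def stream3d(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                nn.Linear(16, num_classes))
        self.salient_rgb = stream3d(3)    # static information, salient areas
        self.salient_flow = stream3d(2)   # motion information (x/y flow)
        self.background = nn.Sequential(  # 2D CNN on non-salient background
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes))

    def forward(self, rgb_clip, flow_clip, bg_frame):
        # rgb_clip: (B, 3, T, H, W); flow_clip: (B, 2, T, H, W); bg_frame: (B, 3, H, W)
        return (self.salient_rgb(rgb_clip),
                self.salient_flow(flow_clip),
                self.background(bg_frame))
```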

Method. Framework: adaptively weighted learning. Adaptively learn the fusion weights of the three streams modeled by the hybrid CNN networks.

Method. Salient area prediction. Motivation: human brains are selectively sensitive to motion. Motion in videos has two sources: subject motion, caused by movement of objects in the video (useful information), and camera motion, caused by movement of the camera (needs to be eliminated).

Method. Salient area prediction:
Step 1: Estimate the homography by finding correspondences between two frames.
Step 2: Use the estimated homography to rectify the raw frames, removing the camera motion.
Step 3: Analyze the trajectory vectors in the flow field and remove vectors that are too small.
Step 4: Apply an edge detection algorithm and take the connected component as the salient region.
Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
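
For concreteness, below is a sketch of these four steps with OpenCV. The feature choice (ORB), the Farneback flow, and the magnitude threshold are my assumptions; the paper builds on improved dense trajectories, which this simplified version only approximates.

```python
import cv2
import numpy as np

def salient_mask(prev_gray, curr_gray, min_flow=1.0):
    """Sketch of the four-step salient area prediction; parameter values
    and the ORB/Farneback choices are illustrative, not the paper's."""
    # Step 1: estimate a homography from feature correspondences.
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Step 2: warp the previous frame to cancel camera motion.
    h, w = curr_gray.shape
    stabilized = cv2.warpPerspective(prev_gray, H, (w, h))

    # Step 3: compute residual flow and drop vectors that are too small.
    flow = cv2.calcOpticalFlowFarneback(stabilized, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    mask = (mag > min_flow).astype(np.uint8)

    # Step 4: keep the largest connected component as the salient region.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:
        return mask  # no moving region found
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8)
```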

Method. Hybrid CNN networks.
3D CNN: applies 3D convolution; computes features over three dimensions (spatial and temporal); suitable for salient areas.
2D CNN: applies 2D convolution; computes features over two dimensions (spatial only); suitable for non-salient areas.

Method. Formal description. The value of the unit at position $(x, y)$ of the $j$-th feature map in the $i$-th convolution layer (following Ji et al.):

For 2D convolution:
$$v_{ij}^{xy} = \tanh\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\Big)$$

For 3D convolution (the third index $z$ runs over the temporal dimension):
$$v_{ij}^{xyz} = \tanh\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\Big)$$

where $m$ indexes the feature maps of layer $i-1$, $w$ are the kernel weights, $b_{ij}$ is the bias, and $P_i \times Q_i$ ($\times\, R_i$) is the kernel size.
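
To make the dimensionality difference concrete, here is a small PyTorch check (an illustration, not the paper's code): a 3D convolution slides its kernel over time as well as space, so its output keeps a temporal axis.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, time, H, W)
frame = torch.randn(1, 3, 112, 112)      # (batch, channels, H, W)

conv3d = nn.Conv3d(3, 8, kernel_size=3)  # 3x3x3 kernel: space + time
conv2d = nn.Conv2d(3, 8, kernel_size=3)  # 3x3 kernel: space only

print(conv3d(clip).shape)   # torch.Size([1, 8, 14, 110, 110]) -- time axis kept
print(conv2d(frame).shape)  # torch.Size([1, 8, 110, 110])
```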

Method. Adaptively weighted learning. Motivation: information from different streams plays a different role for each class, so different fusion weights should be learned for different semantic classes. We propose an adaptively weighted learning method to learn the fusion weights for the multiple streams in an adaptive way.

Method. Adaptively weighted learning. Objective function: $P_j$ stands for the fusion score within the corresponding semantic class; $N_j$ stands for the fusion score within the non-corresponding semantic classes.
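
The objective function on the slide was an image and is not recoverable from this transcript. A plausible margin-style form consistent with the $P_j$ / $N_j$ description above (an assumption, not necessarily the paper's exact objective) is:

$$P_j = \sum_{v \in \mathcal{C}_j}\sum_{k=1}^{3} w_{jk}\, s_k(v), \qquad N_j = \sum_{v \notin \mathcal{C}_j}\sum_{k=1}^{3} w_{jk}\, s_k(v),$$
$$\max_{w_j}\; P_j - N_j \quad \text{s.t.} \quad \sum_{k=1}^{3} w_{jk} = 1,\; w_{jk} \ge 0,$$

where $s_k(v)$ is the score stream $k$ assigns to video $v$ for class $j$, and $w_j$ are the per-class fusion weights.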

Method. Adaptively weighted learning. Classification with adaptively learned weights: through the above formulation, different fusion weights are used for each class, and the final prediction is the class with the highest fusion score.
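
A minimal NumPy sketch of this decision rule (variable names are hypothetical): each class fuses the three stream scores with its own learned weights, and the prediction is the class with the highest fused score.

```python
import numpy as np

def classify(stream_scores, class_weights):
    """stream_scores: (3, C) per-stream class scores for one video.
    class_weights: (C, 3) learned per-class fusion weights.
    Each class j fuses the streams with its own weight vector w_j."""
    fused = np.einsum('jk,kj->j', class_weights, stream_scores)
    return int(np.argmax(fused))  # predicted class = highest fused score

# Usage with placeholder data: 101 classes, uniform weights.
scores = np.random.rand(3, 101)
weights = np.full((101, 3), 1 / 3)
print(classify(scores, weights))
```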

Guide Line Introduction Method Experiment Conclusion

Experiments. Datasets. UCF-101 consists of 13,320 video clips in 101 classes, totaling over 27 hours of video. All videos are collected from YouTube, with a fixed frame rate of 25 FPS and a resolution of 320x240. CCV is a consumer video dataset containing 9,317 web videos in 20 semantic categories. Its content is interesting and diverse, with fewer textual tags and content descriptions, and is thus more complex than UCF-101. Example classes: TaiChi and Punch (UCF-101); wedding dance and graduation (CCV).

Experiments. Evaluation metrics. UCF-101: results are measured by averaging accuracy over the three standard splits. CCV: average precision (AP) is calculated for each class, then mAP is reported over the whole dataset. The evaluation metrics are the same as in the following paper, for fair comparison: Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification," in ACM International Conference on Multimedia (ACM MM), pages 461-470, 2015.
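
As a reference for how these metrics are typically computed (a standard sketch, not the authors' evaluation code):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """CCV protocol: AP per class, then the mean over classes.
    y_true: (N, C) binary labels; y_score: (N, C) predicted scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))

def split_averaged_accuracy(accuracies):
    """UCF-101 protocol: average classification accuracy over the 3 splits."""
    return float(np.mean(accuracies))
```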

Experiments. Comparing different combinations of the three streams: the first group shows results achieved by each stream separately; the second group, results achieved by combining any two streams; the third group, results achieved by combining all three streams.

Experiments. The effect of modeling saliency versus not modeling it.

Experiments. The effectiveness of adaptively weighted learning.

Experiments. Comparison with the state of the art.

Guide Line Introduction Method Experiment Conclusion

Conclusion:
Optical flow can be used to predict salient areas in an unsupervised way.
Modeling multi-stream information from salient and non-salient areas separately boosts video classification performance.
The adaptively weighted learning method helps learn different fusion weights for different semantic classes.
Future directions:
Exploit the help of manual annotation and handcrafted labels.
Attempt to apply unsupervised learning further in our work.

Cross-media Retrieval. Beyond video: cross-media retrieval is our current research focus, performing retrieval across different media types such as image, text, audio, and video. We have released the XMedia dataset with 5 media types; the dataset and the source code of our related works are available at http://www.icst.pku.edu.cn/mipl/xmedia. Interested in cross-media retrieval? We hope our recent overview is helpful: Yuxin Peng, Xin Huang, and Yunzhen Zhao, "An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges," IEEE TCSVT, 2017. arXiv:1704.02223.

ICME 2017 Thank you!