Saliency-guided Video Classification via Adaptively weighted learning


1 Saliency-guided Video Classification via Adaptively weighted learning
ICME 2017 Saliency-guided Video Classification via Adaptively weighted learning Yunzhen Zhao and Yuxin Peng* Institute of Computer Science and Technology, Peking University, Beijing, China

2 Outline Introduction Method Experiment Conclusion

3 Introduction Large scale Internet videos (by CISCO)
In 2021, it would take an individual more than 5 million years to watch the amount of video that will cross global IP networks each month. Globally, IP video traffic will be 82 percent of all consumer Internet traffic by 2021, up from 73 percent in 2016. Big video data drives the need for video classification, one of the key techniques for video understanding and analysis. Source: Cisco Visual Networking Index, 2017

4 Introduction What is video classification?
Learn semantics from video content and automatically classify videos into pre-defined categories, e.g., human actions and multimedia events. Example classes: Birthday Celebration, Parade, HorseRiding, PlayingGuitar

5 Introduction Wide applications
Video classification has wide applications, including human-computer interaction, video search, sports analysis, and surveillance

6 Introduction Deep video classification
Inspired by the great progress of DNNs on image classification, DNN-based video classification has become a research hotspot. A classical deep video classification method: the two-stream ConvNet architecture K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.

7 Introduction Deep video classification
Ji et al. developed a 3D CNN architecture 2D CNN computes features from spatial dimensions only 3D CNN computes features from both spatial and temporal dimensions temporal 3D CNN 2D CNN S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE TPAMI, 2013.

8 Introduction Two problems
From the view of motion, video frames can be decomposed into salient and non-salient areas, which should be treated differently Information from multiple streams plays different roles for video classification, which should also be treated differently

9 Introduction Main contributions
Use optical flow to segment video frames into salient and non-salient areas, requiring no supervision information Propose a hybrid framework that combines 3D and 2D CNNs to model multi-stream information from salient and non-salient areas respectively Introduce an adaptively weighted learning method to adaptively learn different fusion weights for the multiple streams

10 Outline Introduction Method Experiment Conclusion

11 Method Framework Salient area prediction
Motion in videos may guide us to predict salient areas Salient areas present static and motion information Non-salient areas present background information

12 Method Framework Hybrid CNN networks Include three stream CNN networks
Two 3D CNNs model static and motion information from salient areas One 2D CNN models background information from non-salient areas

13 Method Framework Adaptively weighted learning
Adaptively learn the fusion weights of three stream information modeled by hybrid CNN networks

14 Method Salient area prediction
Motivation: human brains are selectively sensitive to motion. Motion in videos: Subject motion: caused by movement of the objects in the videos (useful information) Camera motion: caused by movement of cameras (need to be eliminated)

15 Method Salient area prediction
Step 1: Estimate the homography by finding correspondences between two frames Step 2: Use the estimated homography to rectify the raw frames, removing the camera motion Step 3: Analyze the trajectory vectors in the flow field and remove vectors that are too small Step 4: Apply an edge detection algorithm and take each connected domain as a salient region Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
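Steps 3 and 4 of the pipeline above can be sketched without any vision library. This is a minimal numpy sketch under simplifying assumptions: the flow field is assumed to be already camera-motion-compensated (Steps 1-2, which in practice use homography estimation as in Wang & Schmid), and a simple bounding box of suprathreshold pixels stands in for the edge-detection / connected-domain step.

```python
import numpy as np

def salient_mask(flow, mag_thresh=1.0):
    # flow: (H, W, 2) optical-flow field, assumed camera-motion-compensated.
    # Step 3: discard flow vectors whose magnitude is too small.
    mag = np.linalg.norm(flow, axis=2)
    return mag > mag_thresh

def salient_bbox(mask):
    # Crude stand-in for Step 4: bounding box of all salient pixels
    # (the paper uses edge detection + connected domains instead).
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # no subject motion detected
    return (ys.min(), xs.min(), ys.max(), xs.max())
```

For example, a synthetic flow field that is zero everywhere except a small moving block yields a mask covering exactly that block, and the bounding box localizes the "salient area" that is then cropped and fed to the 3D CNNs.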

16 Method Hybrid CNN networks
3D CNN: applies 3D convolution; computes features along three dimensions, both spatial and temporal; suitable for salient areas 2D CNN: applies 2D convolution; computes features along two dimensions, spatial only; suitable for non-salient areas

17 Method Formal description
The value of the unit at position (x, y) of the j-th feature map in the i-th convolution layer: For 2D convolution: $v_{ij}^{xy} = f\big(b_{ij} + \sum_m \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\big)$ For 3D convolution: $v_{ij}^{xyz} = f\big(b_{ij} + \sum_m \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\big)$ (notation as in Ji et al., TPAMI 2013)
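These two formulas can be checked with a direct numpy implementation of a single output unit. This is a sketch: the activation `f` is taken to be ReLU as an example nonlinearity, and the kernel/bias names mirror the symbols in the formulas.

```python
import numpy as np

def unit_2d(v_prev, w, b, x, y):
    # v_{ij}^{xy}: one output unit of a 2D convolution.
    # v_prev: (M, X, Y) feature maps of layer i-1;
    # w: (M, P, Q) kernel connecting map j to the M input maps; b: scalar bias.
    M, P, Q = w.shape
    s = b + np.sum(w * v_prev[:, x:x + P, y:y + Q])
    return max(s, 0.0)  # f = ReLU, as an example nonlinearity

def unit_3d(v_prev, w, b, x, y, z):
    # v_{ij}^{xyz}: the 3D case adds a temporal kernel offset r,
    # so features are computed over both spatial and temporal dimensions.
    M, P, Q, R = w.shape
    s = b + np.sum(w * v_prev[:, x:x + P, y:y + Q, z:z + R])
    return max(s, 0.0)
```

The only difference between the two is the extra temporal sum over r, which is exactly why the 3D CNN can capture motion in the salient areas while the 2D CNN sees only spatial structure.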

18 Method Adaptively weighted learning Motivation
Information from different streams plays different roles for each class, thus different fusion weights should be learned for different semantic classes We propose an adaptively weighted learning method to learn fusion weights for the multiple streams in an adaptive way

19 Method Adaptively weighted learning Objective function
Pj stands for the fusion score within the corresponding semantic class Nj stands for the fusion score within the non-corresponding semantic class

20 Method Adaptively weighted learning
Classification with adaptively learned weights Through the above equation, different fusion weights are considered for each class, and the final result is determined by the highest fusion score.
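Classification with per-class fusion weights can be sketched as follows. This is a hypothetical numpy sketch: `class_weights` stands for the weights produced by the adaptively weighted learning step, which is not reproduced here; only the final fusion-and-argmax decision rule is shown.

```python
import numpy as np

def classify(stream_scores, class_weights):
    # stream_scores: (K, C) — scores of the K streams for C classes
    #                (here K = 3: salient-static, salient-motion, background).
    # class_weights: (C, K) — learned fusion weight of each stream per class.
    K, C = stream_scores.shape
    # Fuse each class's score with that class's own weights ...
    fused = np.array([class_weights[j] @ stream_scores[:, j] for j in range(C)])
    # ... and pick the class with the highest fusion score.
    return int(np.argmax(fused)), fused
```

Because each row of `class_weights` is learned separately, a class dominated by motion (e.g. HorseRiding) can weight the 3D motion stream heavily while a scene-dominated class leans on the background stream.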

21 Outline Introduction Method Experiment Conclusion

22 Experiments Datasets UCF-101 consists of video clips from 101 classes; the total length of these clips is over 27 hours. All videos are collected from YouTube and have a fixed frame rate of 25 FPS at a resolution of 320x240. CCV is a consumer video database containing 9317 web videos over 20 semantic categories. It contains interesting and diverse content with fewer textual tags and content descriptions, and is thus more complex than UCF-101. Example classes: TaiChi, Punch (UCF-101); wedding dance, graduation (CCV)

23 Experiments Evaluation metrics
UCF-101: measure the results by averaging accuracy over three splits. CCV: first calculate the average precision (AP) for each class, then report mAP for the whole dataset. The evaluation metrics are the same as in the paper below, for fair comparison Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification," in ACM International Conference on Multimedia (ACMMM), pages 461-470, 2015.
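The CCV metric can be sketched directly. This is a minimal numpy sketch using the rank-based AP definition (mean precision at each positive in the ranked list); AP implementations differ slightly across toolkits (e.g. interpolated variants), so treat this as one common convention rather than the exact evaluation script.

```python
import numpy as np

def average_precision(scores, labels):
    # AP for one class: rank videos by score, then average the precision
    # measured at each position where a positive video appears.
    order = np.argsort(-scores)          # descending by score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)             # positives seen so far
    precisions = hits / (np.arange(len(labels)) + 1)
    return float(precisions[labels == 1].mean())

def mean_ap(score_matrix, label_matrix):
    # score_matrix, label_matrix: (num_videos, num_classes);
    # mAP is the unweighted mean of per-class APs.
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))
```

For instance, ranking [pos, neg, pos, neg] gives precisions 1 and 2/3 at the two positives, so AP = 5/6.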

24 Experiments Comparing different combinations of the three streams
The first group shows the results achieved by each separate stream The second group shows the results achieved by combining any two streams The third group shows the results achieved by combining all three streams

25 Experiments The effectiveness of saliency modeling

26 Experiments The effectiveness of adaptively weighted learning

27 Experiments Comparison with state-of-the-art methods

28 Outline Introduction Method Experiment Conclusion

29 Conclusion
Optical flow can be used to predict the salient areas in an unsupervised way Modeling multi-stream information from salient and non-salient areas respectively can boost the performance of video classification The adaptively weighted learning method helps learn different fusion weights for different semantic classes Further directions Exploit the help of manual indication and handcrafted labeling Attempt to apply unsupervised learning in our work

30 Cross-media Retrieval
More than video: cross-media retrieval, our current research focus Perform retrieval across different media types, such as image, text, audio and video We have released the XMedia dataset with 5 media types, together with the source code of our related works Interested in cross-media retrieval? Our recent overview may be helpful: Yuxin Peng, Xin Huang, and Yunzhen Zhao, "An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges", IEEE TCSVT, arXiv:

31 ICME 2017 Thank you!

