
Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Khalid Ashraf, Benjamin Elizalde, Forrest Iandola, Matthew Moskewicz, Gerald Friedland, Julia Bernd, Kurt Keutzer

THE PROBLEM
Manual video monitoring and evidence collection is intractable when we are flooded by video data.
The Vancouver 2011 riot brought 1600 hours of user-generated video into the city's police department.

Growth of Multimedia Data
On YouTube today:
- Number of viewers: 1 billion
- Hours of video watched every month: 6 billion
- Video uploaded every minute: 100 hours
- Average mobile views per day: 1 billion

Summary
- 78x speedup in training and 10.4x speedup in testing
  Accelerate development and deployment of systems that use deep neural nets
- 200x reduction in input video feature size
  Real-time analytics for applications such as video advertisement placement
- 18x reduction in deep neural network size (48 MB reduced to 3 MB)
  Enabling content analysis system deployment in mobile devices
- 11 percentage-point accuracy improvement in video event classification
  Can leverage this to improve intrusion detection
- 5x accuracy improvement in video event detection
  Can leverage this to improve content-based video search

Future Work
- Mobile and embedded deployment of audio-based event recognition
- Use audio and visual cues

References
[1] www.yli-corpus.org
[2] B. Elizalde, et al. An i-Vector Representation of Acoustic Environments for Audio-Based Video Event Detection on User Generated Content. International Symposium on Multimedia (ISM), 2013.
[3] Ravanelli et al. Audio concept classification with Hierarchical Deep Neural Networks. European Signal Processing Conference (EUSIPCO), 2014.

audioCaffe: DNN framework for audio analysis

Memory and Speedup

Method                         Train Time (hr)   Training Speedup   Test Time (hr)   Testing Speedup   Model Size (MB)
i-Vector [2] (baseline)        2.71              1x                 7.8              1x                5100
DNN: All Frames Input (ours)   10.33             0.26x              NA               NA                48
DNN: Sparse Sampled (ours)     0.034             78.4x              0.748            10.4x             3
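The speedup columns of the Memory and Speedup table are ratios of the i-Vector baseline time to each method's time. The following is a minimal sketch of that arithmetic, assuming the train and test times as read from the table (the dictionary layout and function names are illustrative, not part of audioCaffe). Because the table values are rounded, the recomputed training ratio for the sparse-sampled DNN comes out near 80x rather than exactly the reported 78.4x.

```python
# Times in hours, as read from the Memory and Speedup table (rounded values).
BASELINE = {"train_hr": 2.71, "test_hr": 7.8}  # i-Vector baseline [2]
METHODS = {
    # Test time is listed as NA for the all-frames DNN.
    "DNN: All Frames Input": {"train_hr": 10.33, "test_hr": None},
    "DNN: Sparse Sampled": {"train_hr": 0.034, "test_hr": 0.748},
}


def speedup(baseline_hr, method_hr):
    """Return baseline_time / method_time, or None if either time is unavailable."""
    if baseline_hr is None or method_hr is None:
        return None
    return baseline_hr / method_hr


def summarize(baseline, methods):
    """Compute training and testing speedups of each method over the baseline."""
    rows = {}
    for name, times in methods.items():
        rows[name] = {
            "train_speedup": speedup(baseline["train_hr"], times["train_hr"]),
            "test_speedup": speedup(baseline["test_hr"], times["test_hr"]),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(BASELINE, METHODS).items():
        print(name, row)
```

A ratio below 1, as for the all-frames DNN in training, means the method is slower than the baseline.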
We train our models 78.4x faster than the previous state of the art.

Code available at https://github.com/ashrafk/audioCaffeInitial

Framework features: Speed
- Fastest publicly available CPU/GPU CNN framework
- Optimal memory usage for handling large-scale audio data

[Figure: Our pipeline]

Event classification accuracy
Every video in the test set contains 1 of 10 possible events.

Method                                                        Per-Frame Classification Accuracy   Per-Video Classification Accuracy
DNN (2000:2000:2000:10), Dense Sampled w/ All Frames Input    18.3%                               27.4% [3]
DNN (2000:2000:2000:10), Sparse Sampled w/ 100 Frames Input   28.6%                               36.8%
DNN (600:600:10), Sparse Sampled w/ 100 Frames Input          29.3%                               37.4%

Event detection accuracy
In the test set, just 1 in 44 videos contains an event.

Event Category                         i-vector [2] (baseline)   audioCaffe (ours): DNN (600:600:10), Sparse Sampled
Birthday Party                         0.37                      1.10
Flash Mob                              0.12                      0.89
Getting a Vehicle Unstuck              0.12                      0.61
Parade                                 0.32                      1.62
Person Attempting a Board Trick        0.23                      1.34
Person Grooming an Animal              0.11                      1.12
Person Hand-Feeding an Animal          0.28                      1.71
Person Landing a Fish                  0.10                      0.67
Wedding Ceremony                       0.32                      0.92
Working on a Woodworking Project       0.19                      1.81
Overall MAP                            0.22%                     1.18%
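The "Overall MAP" row of the event detection results is consistent with a plain mean of the ten per-event scores. The sketch below recomputes it, assuming the per-event numbers are average precision values in percent (the poster does not label the unit of the individual rows) and using illustrative variable names.

```python
from statistics import mean

# Per-event scores from the event detection table: (i-vector baseline, audioCaffe DNN).
# Assumed to be average precision in percent, matching the "Overall MAP" row.
AP_PERCENT = {
    "Birthday Party": (0.37, 1.10),
    "Flash Mob": (0.12, 0.89),
    "Getting a Vehicle Unstuck": (0.12, 0.61),
    "Parade": (0.32, 1.62),
    "Person Attempting a Board Trick": (0.23, 1.34),
    "Person Grooming an Animal": (0.11, 1.12),
    "Person Hand-Feeding an Animal": (0.28, 1.71),
    "Person Landing a Fish": (0.10, 0.67),
    "Wedding Ceremony": (0.32, 0.92),
    "Working on a Woodworking Project": (0.19, 1.81),
}


def mean_average_precision(ap_values):
    """Mean average precision: the unweighted mean of per-event AP values."""
    return mean(ap_values)


baseline_map = mean_average_precision([b for b, _ in AP_PERCENT.values()])
dnn_map = mean_average_precision([d for _, d in AP_PERCENT.values()])
improvement = dnn_map / baseline_map  # relative gain of the DNN over the baseline

if __name__ == "__main__":
    print(f"baseline MAP: {baseline_map:.2f}%")
    print(f"DNN MAP: {dnn_map:.2f}%")
    print(f"relative improvement: {improvement:.1f}x")
```

The low absolute values reflect the detection setting, in which only about 1 in 44 test videos contains an event.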
Framework features: Ease of use
Simple definition of layer for network description:

    name: "mnist-small"
    layers {
      layer {
        name: "mnist"
        type: "data"
        source: "data/mnist-train-leveldb"
        batchsize: 64
        scale: 0.00390625
      }
      top: "data"
      top: "label"
    }
    layers {
      layer {
        name: "ip"
        type: "innerproduct"
        num_output: 10
      }
      bottom: "data"
      top: "ip"
    }

2/3 of mobile web traffic will be video by 2018.

YLI Event Detection Dataset
Event Detection on the YLI Dataset

Yahoo-Livermore-ICSI (YLI) Multimedia Event Detection Dataset [1]
- Total of 700,000 videos collected from Flickr
- 50,000 videos are labeled so far
- YLI is similar to TRECVID-MED
- Unlike TRECVID, YLI is openly available to all researchers

Training set and test set sizes:
- Foreground (contains an event): 1000
- Background (does not contain an event): 5000 (training set), 43000 (test set)

Observations on classification accuracy:
- sparse sampling → better accuracy
- fewer parameters → slightly better accuracy

Deep Learning to improve the speed and accuracy of video event recognition
University of California, Berkeley and International Computer Science Institute
Contact: ashrafkhalid@berkeley.edu
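The "Sparse Sampled w/ 100 Frames Input" models use only 100 frames per video instead of every frame. The poster does not say how those frames are chosen, so the following is only a sketch under one assumption: frames are drawn uniformly at random without replacement and kept in temporal order. The function name, the seed parameter, and the toy feature vectors are hypothetical.

```python
import random


def sparse_sample_frames(frames, num_samples=100, seed=None):
    """Keep at most num_samples frames from a video, preserving temporal order.

    Assumption: uniform random sampling without replacement. The poster only
    states that 100 frames per video are used, not how they are selected.
    """
    if len(frames) <= num_samples:
        return list(frames)  # short videos are kept whole
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(frames)), num_samples))
    return [frames[i] for i in indices]


if __name__ == "__main__":
    # Hypothetical video with 5,000 audio feature frames (one-element stand-in vectors).
    video = [[float(i)] for i in range(5000)]
    subset = sparse_sample_frames(video, num_samples=100, seed=0)
    print(len(subset), "of", len(video), "frames kept")
```

Fixing the number of frames per video bounds the training input size regardless of video length, which is one plausible reading of why training on the sparse-sampled input is so much faster than training on all frames.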

