Acoustic Scene Classification for Context Aware Human Computer Interaction
John H.L. Hansen & Taufiq Al Babba Hasan
John.Hansen@utdallas.edu
Center for Robust Speech Systems (CRSS)
Erik Jonsson School of Engineering & Computer Science
The University of Texas at Dallas
Richardson, Texas 75083-0688, U.S.A.

The Problem, Need and Industrial Relevance
- Intelligent Human Computer Interaction (HCI) systems rely heavily on context awareness.
- Being able to detect and classify acoustic scenes and events from the surroundings is an important aspect of context awareness.
- Need: audio from the environment provides contextual information that is complementary to other modalities.
- Industrial Relevance: context awareness is critical for effective interaction in speech systems. Mobile devices and many other consumer products are now equipped with microphones that can be used effectively for acoustic scene classification.

Project Goals and Objectives
Goals:
- Implement and benchmark state-of-the-art algorithms for audio event detection and acoustic scene classification on open-source datasets.
- Explore novel methods for improved acoustic scene classification performance.
- Design and optimize deep machine learning algorithms for classification of acoustic scenes.
Objectives:
- Implement state-of-the-art deep convolutional neural network (DCNN) algorithms for acoustic scene classification.
- Reach at least 80% accuracy for acoustic scene classification on a benchmark database (DCASE 2016).

Research Methods and Novelty of Approach
- Implement existing algorithms using public-domain data, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge data.
- Examine various acoustic and auditory features, including MFCC, Stabilized Auditory Image (SAI), and Spectrogram Image Features (SIF).
- Implement state-of-the-art DCNN classifiers and propose novel modifications for further improvement.
- Develop acoustic scene classification systems for the DCASE 2017 challenge.

Expected Outcomes and Deliverables
- Disclosure of the results and analysis of existing algorithms for audio event detection.
- Description of a novel feature extraction method and analysis of initial results.
- Comparative results and analysis of machine learning algorithms for audio event detection.
- Publications in peer-reviewed conferences/journals sharing the results with the wider research community.
- White papers sharing results with industry and consumers.

Project Timeframe and Budget
Project duration: 1 year
Timeline:
- Months 1-3: Implementation of DCNN architecture
- Months 4-6: Implementation of features for acoustic events
- Months 7-9: Implementation of novel algorithms for acoustic event detection
- Months 10-12: Further improvement of algorithms on different datasets
Estimated cost: $50k USD (total)

Impact of Proposed Work
- The ability to automatically detect surrounding audio events is extremely useful for automatic speech interaction tasks.
- The application can be extended to detect abnormal events in the user's surroundings and warn of possible danger.
- The methods can be applied to detecting pathological conditions from non-speech vocal sounds (e.g., cough).

Current Project Status
- Implemented features and classifiers for state-of-the-art acoustic scene classification (ASC) systems:
  - Features: Stabilized Auditory Image (SAI) features; linear, logarithmic, and Mel-scaled spectrogram image features (SIF)
  - Classifiers: Convolutional Neural Network (CNN), Gaussian Mixture Model (GMM), and Probabilistic Linear Discriminant Analysis (PLDA)
- Developed a novel multi-band deep CNN classifier.
- Published a conference paper at Interspeech 2017.
- Performed evaluation on the DCASE 2016 dataset and participated in the DCASE 2017 challenge.

Deep CNN System
System highlights:
- Input is a 149 × 149 spectrogram/auditory image feature.
- Two convolution and batch-normalization layers with max-pooling and dropout (128 and 256 kernels, 3 × 3 receptive fields).
- Rectified Linear Unit (ReLU) activations and a final soft-max layer.
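The slide lists only the layer inventory, so the following is a minimal Keras sketch of one plausible realization; the pooling sizes, dropout rates, dense head, optimizer, and the 15-class output (the DCASE scene label set) are assumptions rather than details from the slide.

```python
# A minimal sketch, assuming Keras; pooling sizes, dropout rates, and the
# optimizer are illustrative choices not specified on the slide.
from tensorflow.keras import layers, models

NUM_CLASSES = 15  # DCASE 2016/2017 define 15 acoustic scene classes

def build_dcnn(input_shape=(149, 149, 1)):
    model = models.Sequential([
        # Block 1: convolution + batch normalization, 128 kernels, 3x3 fields
        layers.Conv2D(128, (3, 3), activation="relu", input_shape=input_shape),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # Block 2: convolution + batch normalization, 256 kernels, 3x3 fields
        layers.Conv2D(256, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # Final soft-max layer over the scene classes
        layers.Flatten(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```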

GMM Super-Vector System
Motivation: the method is inspired by Gaussian Mixture Model (GMM) based methods used in speaker recognition.
Implementation details:
- Use 60-D MFCC features (19 coefficients + c0 + ∆ + ∆∆).
- Train a 64-component GMM with the EM algorithm.
- Perform audio-segment-dependent MAP adaptation and stack the adapted GMM means to form a super-vector (dimension: 64 × 60 = 3,840).
- Use the training data to reduce the super-vector dimensionality with Linear Discriminant Analysis (LDA).
- Use Probabilistic Linear Discriminant Analysis (PLDA) as the classifier to detect acoustic scenes.
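As a rough illustration of this pipeline, here is a scikit-learn sketch of the super-vector extraction; the relevance factor (r = 16), the diagonal covariances, and all helper names are assumptions, and the PLDA back-end, which scikit-learn does not provide, is only indicated in a closing comment.

```python
# A minimal sketch of mean-only MAP adaptation for GMM super-vectors.
# The relevance factor and covariance type are common defaults, not slide facts.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

N_COMPONENTS, RELEVANCE = 64, 16.0  # 64 Gaussians; r = 16 is a common choice

def train_ubm(all_mfcc_frames):
    """EM-trained universal background model on pooled 60-D MFCC frames."""
    ubm = GaussianMixture(n_components=N_COMPONENTS, covariance_type="diag")
    ubm.fit(all_mfcc_frames)          # all_mfcc_frames: (n_frames, 60)
    return ubm

def supervector(ubm, segment_frames):
    """MAP-adapt the UBM means to one audio segment and stack them."""
    gamma = ubm.predict_proba(segment_frames)        # (T, 64) responsibilities
    n_c = gamma.sum(axis=0)                          # soft counts per component
    # Posterior-weighted first-order statistics (guard against empty components)
    f_c = gamma.T @ segment_frames / np.maximum(n_c, 1e-8)[:, None]
    alpha = (n_c / (n_c + RELEVANCE))[:, None]       # adaptation coefficients
    adapted_means = alpha * f_c + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                     # 64 x 60 = 3,840-D

# LDA then reduces the 3,840-D super-vectors before PLDA scene scoring, e.g.:
# lda = LinearDiscriminantAnalysis().fit(train_supervectors, train_labels)
```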

Multi-band Deep CNN System
Motivation:
- Images are not fully comparable to spectrograms.
- Filters learned by a CNN on the higher-frequency components of a spectrogram may not be relevant for the lower frequency ranges.
Proposed method:
- The input spectrogram (log/Mel) of size 149 × 149 is divided into bands according to frequency range.
- For each band, the same CNN architecture is used for classification.
- The outputs of all the band CNNs are merged at the flatten layer, as sketched below.
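A minimal Keras sketch of this idea follows, assuming a two-band split near the middle frequency bin and separate (non-shared) weights per band; the slide does not specify the number of bands, the split points, or whether the band sub-networks share weights.

```python
# A minimal sketch, assuming two frequency bands and per-band copies of the
# same conv stack; split point and stack depth are illustrative assumptions.
from tensorflow.keras import layers, models, Input

NUM_CLASSES = 15

def band_branch(x):
    """Identical conv stack applied to one frequency band."""
    x = layers.Conv2D(128, (3, 3), activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(256, (3, 3), activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    return layers.Flatten()(x)

inp = Input(shape=(149, 149, 1))                   # (frequency, time, channel)
# Split the spectrogram into low- and high-frequency bands along the rows
low = layers.Cropping2D(((0, 74), (0, 0)))(inp)    # keeps rows 0-74
high = layers.Cropping2D(((75, 0), (0, 0)))(inp)   # keeps rows 75-148
# Merge the per-band branches at the flatten layer, then classify
merged = layers.Concatenate()([band_branch(low), band_branch(high)])
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```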

Results: DCASE 2017
Accuracy (%) over the four cross-validation folds. All spectrograms have a dimension of 149 × 149.

ID  System                             Fold1  Fold2  Fold3  Fold4  Mean
1   60D-MFCC-GSV                       81.11  80.22  81.02  81.97  81.08
2   CNN-Log-spectrogram                81.28  81.93  79.78  80.42  80.85
3   CNN-SV-Log-spectrogram             84.27  81.15  83.43  84.36  83.30
4   CNN-Mel-spectrogram                79.91  81.10  79.04  83.93  81.00
5   CNN-SV-Mel-spectrogram             82.65  81.25  83.92  84.70  83.13
6   Multiband-CNN-Log-spectrogram      83.85  80.81  80.02  83.42  82.02
7   Multiband-CNN-SV-Log-spectrogram   85.47  84.23  84.28  84.02  84.50
8   Multiband-CNN-Mel-spectrogram      82.31  80.31  77.23  80.95  n/a*
9   Multiband-CNN-SV-Mel-spectrogram   83.33  83.90  84.43  85.21  84.22

*One value for system 8 is missing in the source.

Thank you