A Novel Approach for Recognizing Auditory Events & Scenes Ashish Kapoor.

Presentation transcript:

A Novel Approach for Recognizing Auditory Events & Scenes Ashish Kapoor

Problem Description
How can we represent arbitrary environments, so that we can:
  – Label scene elements
  – Classify environments
  – Synthesize environmental sounds
Example: Coffee Shop
  – Basic spectral texture
  – Glasses clinking, doors opening, etc.

Outline of Our Approach
Create a palette of sounds
  – Epitomes (Jojic et al.) for audio
Given an audio segment
  – Generate distributions over the palette
Use the distribution for classification, detection, etc.

Representation: Palette of Sounds
[Figure: diagram with labels Palette, World, Input Audio, Features To Represent Audio]

Epitomes for Images
Epitome
  – Jojic, Frey, and Kannan, ICCV 2003
  – Developed for images

Epitomes for Audio
1-D signal
  – 2-D representation, but little vertical self-similarity
Lots of redundancy (silence, repeated background)
Much longer inputs, bigger ratio of input to epitome size
  – Hours of data => second epitome

Informative Sampling of Patches
Original epitome: take patches at random
Our approach: try to maximize coverage
  – Reduce sampling likelihood of patches similar to those we have covered
[Figure: probability of patch selection, with labels f, t, and t*]
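
The informative sampling idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the presenter's implementation: the function name `informative_patch_starts`, the Gaussian-kernel similarity, and the `bandwidth` parameter are all hypothetical choices made for the sketch. It starts from uniform (random) sampling and, after each patch is chosen at some position t*, lowers the selection probability of every patch that resembles it.

```python
import numpy as np

def informative_patch_starts(features, patch_len, n_patches, bandwidth=1.0, seed=0):
    """Pick patch start frames, down-weighting patches similar to those already chosen.

    features: (n_frames, n_dims) array of per-frame audio features.
    Returns a list of start indices (hypothetical helper, for illustration only).
    """
    rng = np.random.default_rng(seed)
    n_starts = features.shape[0] - patch_len + 1
    # Flatten every candidate patch so that patches can be compared directly.
    patches = np.stack([features[t:t + patch_len].ravel() for t in range(n_starts)])
    weights = np.ones(n_starts)  # uniform weights reproduce plain random sampling
    chosen = []
    for _ in range(n_patches):
        if weights.sum() <= 0:
            break  # everything left is already covered
        probs = weights / weights.sum()
        t_star = int(rng.choice(n_starts, p=probs))
        chosen.append(t_star)
        # Assumed similarity measure: Gaussian kernel on the mean squared distance.
        dist2 = np.mean((patches - patches[t_star]) ** 2, axis=1)
        similarity = np.exp(-dist2 / (2.0 * bandwidth ** 2))
        # Reduce the selection probability of patches that resemble the chosen one.
        weights *= 1.0 - similarity
    return chosen
```

With a recording that is mostly silence, the first silent patch that is drawn drives the weight of all other identical silent patches to zero, so later draws are pushed toward the rarer events. Plain random sampling would keep drawing silence in proportion to its duration.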

Examples: Toy Sequence
600-frame (10 sec) epitome from 3700 frames (2 min)
[Figure panels: Informative Sampling, Random Sampling]

Random vs. Informative
Simulation on the toy dataset
  – 2-sec-long epitome
  – Likelihood vs. number of patches
  – Averaged over 10 runs

Examples: Outdoor Sequence
1800-frame (1 min) epitome from frames (8 min)

Classification of Events/Scenes
Look at distributions over the epitome
  – Given an audio segment to classify
For all the patches in the audio
  – Recover the transformations given the epitome
Look at the distribution of the transformations to classify
[Figure: class distributions P(T|e,c=1) and P(T|e,c=2) compared with P(T|e,c=c’) for the segment c’ being classified; labels: Cars, Speech, Cafe, Highway]
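
A minimal sketch of this classification step follows, under several assumptions that are not in the slides: the epitome is reduced to an array of mean feature frames, each patch is hard-assigned to its closest epitome position (a stand-in for the full posterior over transformations T), and a segment is assigned to the class whose distribution is nearest in KL divergence. The function names `position_histogram`, `fit_class_distributions`, and `classify` are hypothetical.

```python
import numpy as np

def position_histogram(segment, epitome, patch_len, eps=1e-3):
    """Normalized histogram of best-matching epitome positions for one segment.

    segment: (n_frames, n_dims); epitome: (epitome_len, n_dims) of mean frames.
    """
    n_pos = epitome.shape[0] - patch_len + 1
    epi_patches = np.stack([epitome[p:p + patch_len].ravel() for p in range(n_pos)])
    counts = np.full(n_pos, eps)  # small floor so that logarithms stay finite
    for t in range(segment.shape[0] - patch_len + 1):
        patch = segment[t:t + patch_len].ravel()
        # Hard assignment: the closest epitome patch approximates the
        # posterior over transformations T (a simplifying assumption).
        best = np.argmin(np.sum((epi_patches - patch) ** 2, axis=1))
        counts[best] += 1.0
    return counts / counts.sum()

def fit_class_distributions(segments, labels, epitome, patch_len):
    """Estimate P(T|e,c) by averaging the histograms of each class's training segments."""
    hists = {}
    for c in sorted(set(labels)):
        rows = [position_histogram(s, epitome, patch_len)
                for s, lab in zip(segments, labels) if lab == c]
        hists[c] = np.mean(rows, axis=0)
    return hists

def classify(segment, class_hists, epitome, patch_len):
    """Return the class whose distribution is closest (in KL divergence) to the segment's."""
    q = position_histogram(segment, epitome, patch_len)
    kl = {c: float(np.sum(q * np.log(q / p))) for c, p in class_hists.items()}
    return min(kl, key=kl.get)
```

The KL comparison is one of several reasonable ways to compare the distributions; the slides do not say which measure was used.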

Experiments
3 Different Environments
  – Highway, Kitchen, Outdoor Parking
6 Minutes of data to train
30-sec-long epitome
4 Events to Detect (manually segmented)
  – Speech (22 examples)
  – Car (17 examples)
  – Utensil: Knife Chopping Vegetables (29 examples)
  – Bird Chirp (24 examples)
  – None of the above (30 examples)

[Figure panels: Speech, Car, Knife/Utensil, Chirp]

Detection Example
Speech Detection (hard case)
  – Very noisy environment (148th Ave)
  – Only 5 labeled examples of speech

Performance Comparison
Mixture of Gaussians
  – For each audio segment to classify:
    classify every frame using the mixture,
    then vote among the results
Nearest Neighbor
  – Same method as for mixture of Gaussians
  – Computationally too expensive!
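
The mixture-of-Gaussians baseline described above (classify each frame, then vote) might look roughly like the sketch below. It assumes one diagonal-covariance mixture per class whose parameters have already been fitted elsewhere; the slides do not specify the covariance type or the fitting procedure, and the names `gmm_log_likelihood` and `classify_by_frame_vote` are hypothetical.

```python
import numpy as np
from collections import Counter

def gmm_log_likelihood(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian mixture.

    frames: (n_frames, n_dims); weights: (k,); means, variances: (k, n_dims).
    """
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * np.sum(diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2)
    log_joint = log_gauss + np.log(weights)  # shape (n_frames, k)
    # Log-sum-exp over mixture components, computed stably.
    peak = log_joint.max(axis=1, keepdims=True)
    return (peak + np.log(np.sum(np.exp(log_joint - peak), axis=1, keepdims=True))).ravel()

def classify_by_frame_vote(frames, class_mixtures):
    """Label every frame with its most likely class, then take the majority vote.

    class_mixtures: dict mapping class name -> (weights, means, variances).
    """
    names = list(class_mixtures)
    scores = np.stack([gmm_log_likelihood(frames, *class_mixtures[n]) for n in names], axis=1)
    frame_labels = [names[i] for i in np.argmax(scores, axis=1)]
    return Counter(frame_labels).most_common(1)[0][0]
```

The nearest-neighbor baseline would follow the same vote, with each frame labeled by its closest training frame instead of by mixture likelihood. That requires a distance computation against every stored training frame, which is consistent with the slide's note about cost.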

Performance Vs Amount of Training Data

[Figure panels: Speech, Knife/Utensil, Car, Chirp]

Contributions
Framework for Acoustic Event Detection and Scene Classification
Epitomes for Audio
  – Informative Sampling (can be applied to any domain)
Distributions over epitomic indexes for discrimination

Future Work
Informative Sampling
  – Maximizing the Minimum Likelihood
Discriminative Epitomes
Novel Scene Classification
  – Rich Representation using Epitomes
  – Boosting, other ensemble techniques
Hierarchical Acoustic Sound Analysis
  – Same Model for: Acoustic Event Detection, Scene Classification & Synthesis
  – Clustering mechanisms for scene retrieval

Acknowledgments
Sumit Basu
Nebojsa Jojic
My friends and fellow interns