Toward Semantic Indexing and Retrieval Using Hierarchical Audio Models Wei-Ta Chu, Wen-Huang Cheng, Jane Yung-Jen Hsu and Ja-Ling Wu Multimedia Systems,

Toward Semantic Indexing and Retrieval Using Hierarchical Audio Models Wei-Ta Chu, Wen-Huang Cheng, Jane Yung-Jen Hsu and Ja-Ling Wu Multimedia Systems, 2005 Presented by Mohammad S. Al Awad, May 2008

Outline Introduction Background Audio Event? Semantic Context? Problem statement Hierarchical Framework Modeling Performance Indexing and Retrieval

Introduction Semantic indexing and content retrieval in: ◦ Audio: speech, music, noise and silence ◦ Audiovisual: shots, dialogue and action scenes Representation of high-level query semantics ◦ e.g., scenes associated with semantic meaning vs. color layouts and object positions

Background Previous work concentrated on identifying sounds such as applause, gunshot, cheering, or silence. Tools used: Bayesian networks and support vector machines (SVMs) to fuse information from different sounds. Critique: isolated sounds carry little solid semantics.

Audio Event A short audio clip that represents the sound of an object or event. Audio events can be characterized by statistical patterns and temporal evolution.

Semantic Context A semantic context is an analysis unit that represents a more reasonable granularity for multimedia content usage. Semantic concept: a gunplay scene. Semantic context: gunshots and explosions in an action movie.

Problem Statement Index multimedia documents by detecting high-level semantic contexts. To characterize a semantic context, audio events highly relevant to specific semantic concepts are collected and modeled. Occurrence patterns of gunshot and explosion events are used to characterize "gunplay" scenes, and the patterns of engine and car-braking events are used to characterize "car chasing" scenes.

Hierarchical Framework Low-level events, such as gunshot, explosion, engine, and car-braking sounds, are modeled first. Based on the statistical information collected from the audio event detection results, two methods are investigated to fuse this information: the Gaussian mixture model (GMM) and the hidden Markov model (HMM).

Hierarchical Framework (cont.)

Modeling Feature extraction Audio event modeling Confidence evaluation Semantic context modeling ◦ Gaussian mixture model ◦ Hidden Markov model

Feature Extraction Extract suitable time- and frequency-domain features to build a feature vector Audio streams: 16-kHz, 16-bit mono; frames of 400 samples (25 ms) with 50% overlap

Feature Extraction (tools) Perceptual features ◦ STE (short-time energy): the loudness or volume ◦ BER (band-energy ratio): the spectrum is divided into four sub-bands, and the energy of each sub-band is divided by the total energy ◦ ZCR (zero-crossing rate): average number of signal sign changes in an audio frame Mel-frequency cepstral coefficients (MFCC) Frequency centroid (FC) Bandwidth (BW)

Feature Extraction (feature vector) 16-dimension feature vector ◦ 1 (STE) + 4 (BER) + 1 (ZCR) + 1 (FC) + 1 (BW) + 8 (MFCC) 16-dimension delta feature vector ◦ Audio frame difference A_i − A_{i+1} Result: a 32-dimension feature vector
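A minimal sketch of the framing and two of the perceptual features (STE and ZCR) described above, assuming plain Python lists of samples. The frame size and hop follow the slide (400 samples, 50% overlap at 16 kHz); this is illustrative only, not the authors' implementation.

```python
import math

def frames(samples, size=400, hop=200):
    """Split an audio stream into frames of `size` samples with 50% overlap."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]

def short_time_energy(frame):
    """STE: mean squared amplitude, a proxy for loudness/volume."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """ZCR: fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def delta(curr, nxt):
    """Frame difference A_i - A_{i+1}, used to double the feature dimension."""
    return [c - n for c, n in zip(curr, nxt)]
```

For example, a 100 Hz sine at 16 kHz yields an STE near 0.5 and a low ZCR, while white noise would give a much higher ZCR at similar energy.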

Audio Event Modeling A hidden Markov model (HMM) is used to model audio samples Each HMM takes the extracted features as input ◦ The forward algorithm computes the log-likelihood of an audio segment with respect to each audio event ◦ The Baum-Welch algorithm estimates the transition probabilities between states (physical meaning) ◦ A clustering algorithm determines the model size and number of states
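The forward algorithm mentioned above can be sketched in the log domain for a discrete-observation HMM. The parameters `pi`, `A`, and `B` are hypothetical; the paper's HMMs operate on continuous feature vectors, so this is a simplified illustration assuming strictly positive probabilities.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_loglik(obs, pi, A, B):
    """Forward algorithm: returns log P(obs | model).

    obs: sequence of discrete observation symbols
    pi:  initial state probabilities
    A:   A[i][j] transition probability from state i to state j
    B:   B[i][k] probability of emitting symbol k in state i
    """
    n = len(pi)
    alpha = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[i] + math.log(A[i][j]) for i in range(n)])
                 + math.log(B[j][o]) for j in range(n)]
    return logsumexp(alpha)
```

A one-state sanity check: with a uniform emission over two symbols, a three-symbol sequence has probability 0.5³ = 0.125.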

Audio Event Modeling (training) HMM models: gunshot, explosion, engine, car braking Training data: 100 audio events of 3-10 sec each, representing each HMM model

Confidence Evaluation To determine how close a segment is to an audio event, a confidence metric is calculated ◦ Compare the audio segment with the audio event model in 1-second steps (analysis window) ◦ Use the log-likelihood from the forward algorithm ◦ An audio segment might not belong to any audio model ◦ Likelihood ratio test on the distribution of log-likelihoods
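One hedged reading of the likelihood-ratio idea above: score a segment under the event model and under a background model, then squash the log-likelihood ratio into a (0, 1) confidence with a logistic function. The 1-D Gaussian scorer below is an assumed stand-in for the HMM forward score, not the paper's actual metric.

```python
import math

def gaussian_loglik(xs, mean, var):
    """Log-likelihood of samples under a 1-D Gaussian (stand-in for an HMM score)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in xs)

def confidence(segment, event_model, background_model):
    """Logistic-squashed log-likelihood ratio: near 1 when the event model
    fits the segment much better than the background model."""
    ratio = (gaussian_loglik(segment, *event_model)
             - gaussian_loglik(segment, *background_model))
    return 1.0 / (1.0 + math.exp(-ratio))
```

With an event model at mean 0 and a background model at mean 5, a segment near 0 scores close to 1 and a segment near 5 close to 0.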

Confidence Evaluation (depicted) These confidence scores are the input to high-level modeling and provide important clues to bridge audio events and semantic contexts.

Semantic Context Modeling (GMM) Goal: detect high-level semantic contexts based on the confidence scores of audio events that are highly relevant to the semantic concept Training data: 30 gunplay and car-chasing scenes of 3-5 min each, selected from 10 Hollywood action movies Five-fold cross-validation (random 24 training, 6 testing)

GMM how does it work? A semantic context lasts for a period of time, and not all relevant audio events occur at once A texture window of 5 sec is defined with 2.5 sec overlap Go through the confidence values (analysis window of 1-sec step) Construct pseudo-semantic features
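A sketch of the texture-window step above, assuming one confidence score per second. The integer hop of 2 s approximates the slide's 2.5 s overlap, and the mean/max summary is an assumed stand-in for the paper's pseudo-semantic features.

```python
def texture_features(conf, win=5, hop=2):
    """Slide a texture window (win seconds, hop-second step) over per-second
    confidence scores and summarize each window by its mean and maximum."""
    feats = []
    for start in range(0, len(conf) - win + 1, hop):
        w = conf[start:start + win]
        feats.append((sum(w) / win, max(w)))
    return feats
```

For example, ten per-second scores produce three overlapping 5 s windows, each reduced to a (mean, max) pair that a GMM could then model.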

GMM how does it work? (cont.) Semantic context detection: in the case of gunplay scenes, if all the feature elements of the gunshot and explosion events fall in the detection regions, the segment is said to convey the semantics of gunplay.

Semantic Context Modeling (HMM) Critique of the GMM model: ◦ Does not model the time-duration density ◦ Segments get low or high confidence scores due to environmental or overlapping sounds The HMM captures the spectral variation of acoustic features in time by considering state transitions and assigning different likelihood values ◦ Ergodic (fully connected) HMM
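An ergodic (fully connected) HMM allows every state to transition to every other state, unlike a left-to-right topology. A minimal sketch of such a transition matrix, with an assumed self-loop probability:

```python
def ergodic_transitions(n, self_loop=0.8):
    """Fully connected transition matrix: each state keeps probability
    `self_loop` of staying put and spreads the remainder uniformly over
    the other n-1 states, so every entry is nonzero."""
    off = (1.0 - self_loop) / (n - 1)
    return [[self_loop if i == j else off for j in range(n)] for i in range(n)]
```

Every row sums to 1, and no transition has zero probability, which is what makes the topology ergodic.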

HMM how does it work? Calculate the probability of the partial observation sequence, ending in state i at time t, given a model λ Using the forward algorithm, calculate the log-likelihood value that represents how likely a semantic context is to occur

Performance Uncertainty is avoided: aural information tends to remain the same whether the visual scene is day or night, downtown or forest It is rare to have a car-chasing concept without engine sound  High precision indicates high confidence in the detection results  Short-length events, e.g. car braking, yield lower precision  False alarms, i.e. incorrect detections

Performance

Indexing and Retrieval Concepts are matched between aural and visual information If visual information is taken into account, characteristic consistency holds between different video clips with the same concept Generalized framework: replacing audio event models with visual object models, to detect both audio and audiovisual concepts

Future Work Careful design of pseudo-semantic feature vectors to construct a meta-classifier (feature selection pool) Blind source separation (media-aesthetic rules)

Thank you