FYP0202 Advanced Audio Information Retrieval System, by Alex Fok and Shirley Ng


Outline
1. Overview
2. Read in the raw speech
3. MFCC processing
4. Detect the audio scene change
5. Audio Clustering
6. Interleave Audio Clustering
7. Conclusion

Overview
Automatic segmentation of an audio stream and automatic clustering of audio segments have received considerable attention in recent years.
For example, in the task of automatic transcription of broadcast news, the data contains clean speech, telephone speech, music segments, and speech corrupted by music or noise.

Overview (cont')
We would like to SEGMENT the audio stream into homogeneous regions according to speaker identity.
We would like to CLUSTER the resulting speech segments into homogeneous clusters, again according to speaker identity.

Step 1: Read in the raw speech
Read in an MPEG file as input.
Convert the file from .mpeg format to .wav format.
This is needed because the MFCC library only processes .wav files.
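The slides do not name the conversion tool; a minimal sketch of this step, assuming the ffmpeg command-line utility is installed (the mpeg_to_wav name and the 16 kHz mono output are illustrative choices, not project values):

```python
import subprocess

def mpeg_to_wav(mpeg_path, wav_path):
    """Extract the audio track of an MPEG file into a WAV file.

    Assumes ffmpeg is on the PATH; the sample rate and channel count
    below are illustrative, not requirements from the slides.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path,   # -y: overwrite existing output
         "-ar", "16000", "-ac", "1",        # 16 kHz, mono
         wav_path],
        check=True,
    )
```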

Step 2: MFCC processing
A .wav file is viewed as a sequence of frames, each with its own features.
We use the MFCC library to convert the wav into MFCC features for processing.
We extract 24 features for each frame.
The results are stored in feature vectors, one per frame (Frame 1, Frame 2, Frame 3, ...).
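The project's MFCC library is not identified in the slides; the sketch below substitutes librosa to illustrate the same step, with extract_mfcc as a hypothetical wrapper:

```python
import librosa

def extract_mfcc(wav_path, n_features=24):
    """Convert a WAV file into one 24-dimensional MFCC vector per frame.

    Returns an (n_frames, n_features) array, one row per frame.
    """
    y, sr = librosa.load(wav_path, sr=None)                  # keep native rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_features)
    return mfcc.T                                            # rows = frames
```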

Step 3: Detect the audio scene change
Make use of the feature vectors to detect the audio scene changes.
The input audio stream is modelled as a Gaussian process.
A model selection criterion called BIC (Bayesian Information Criterion) is used to detect the change points.

Step 3: Detect the audio scene change (cont')
Denote x_i (i = 1, ..., N) as the feature vector of frame i, where N is the total number of frames, and describe a segment of frames by its mean vector μ and full covariance matrix Σ. To test frame i as a change point, compute

R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|

where Σ, Σ1, Σ2 are the sample covariance matrices estimated from all the data, from {x_1, ..., x_i}, and from {x_(i+1), ..., x_N} respectively, with N1 = i and N2 = N − i.

Step 3: Detect the audio scene change (cont')
BIC(i) = R(i) − constant, where the constant is a penalty for the extra parameters of the two-segment model.
If there is only one change point, the frame with the highest BIC score is the change point.
If there is more than one change point, the algorithm extends straightforwardly, e.g. by detecting one point and then repeating the search on each side.
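A minimal sketch of the single-change-point search described above, using the slides' R(i) and a standard BIC penalty; the function name bic_change_point and the penalty weight lam are assumptions, not project code:

```python
import numpy as np

def logdet_cov(Y):
    """Log-determinant of the sample covariance of Y (rows = frames)."""
    sign, logdet = np.linalg.slogdet(np.cov(Y, rowvar=False))
    return logdet

def bic_change_point(X, lam=1.0):
    """Find the most likely change point in an (N, d) array of MFCC vectors.

    Implements R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2| and
    BIC(i) = R(i) − λP; returns the best frame index, or None if no
    frame gives a positive BIC score.
    """
    N, d = X.shape
    # Penalty for the extra parameters of the two-segment hypothesis
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    full = N * logdet_cov(X)
    best_i, best_score = None, 0.0
    # Keep at least d+1 frames on each side so covariances are full rank
    for i in range(d + 1, N - d - 1):
        r = full - i * logdet_cov(X[:i]) - (N - i) * logdet_cov(X[i:])
        score = r - lam * penalty
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```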

Step 4: Audio Clustering
To speed up detection, we only locate the change points roughly.
As a result, some change points may be wrongly detected.
In this step we try to combine wrongly segmented neighboring segments.
Each segment is compared with its neighbors; if they are speech from the same person, they are combined.
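The slides do not state the merging criterion; one plausible sketch reuses the ΔBIC test from Step 3 on pairs of adjacent segments (delta_bic, merge_neighbors, and the weight lam are all illustrative assumptions):

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """ΔBIC for merging two segments of MFCC vectors X and Y.

    Negative values mean one Gaussian explains the combined data about
    as well as two, i.e. the segments likely share a speaker.
    """
    Z = np.vstack([X, Y])
    n, d = Z.shape

    def logdet(A):
        sign, ld = np.linalg.slogdet(np.cov(A, rowvar=False))
        return ld

    r = n * logdet(Z) - len(X) * logdet(X) - len(Y) * logdet(Y)
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return r - lam * penalty

def merge_neighbors(segments, lam=1.0):
    """Greedy left-to-right pass combining same-speaker adjacent segments.

    segments: list of (n_i, d) arrays in time order.
    """
    merged = [segments[0]]
    for seg in segments[1:]:
        if delta_bic(merged[-1], seg, lam) < 0:   # same speaker: combine
            merged[-1] = np.vstack([merged[-1], seg])
        else:
            merged.append(seg)
    return merged
```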

Step 5: Interleave Audio Clustering
Group all the segments of the same speaker into one node.
(Diagram: before clustering, Speaker 1's segments are interleaved with Speaker 2's; after, the Speaker 1 segments are combined into a single node.)
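Extending the same test to non-adjacent segments gives a greedy bottom-up clustering. The sketch below (reusing delta_bic from the Step 4 sketch) is an assumption about how the grouping could work, not the project's exact algorithm:

```python
import numpy as np

# Reuses delta_bic() from the Step 4 sketch above.

def cluster_speakers(segments, lam=1.0):
    """Greedy bottom-up clustering of possibly non-adjacent segments.

    Repeatedly merges the pair of clusters with the lowest ΔBIC until no
    pair scores below zero, so interleaved segments of the same speaker
    end up in one node. Sketch only; quadratically many ΔBIC tests per merge.
    """
    clusters = list(segments)
    while len(clusters) > 1:
        best, bi, bj = min(
            (delta_bic(clusters[i], clusters[j], lam), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best >= 0:          # no remaining pair looks like one speaker
            break
        clusters[bi] = np.vstack([clusters[bi], clusters[bj]])
        del clusters[bj]
    return clusters
```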

Conclusion
We would like to build a precise and fast engine that recognizes the identity of the speakers in a wave file.
We would like to group the segments of the same speaker within the wave.

Conclusion (cont')
Instead of making local decisions based on the distance between fixed-size samples, we expand the decision window as wide as possible.
Repetitive calculation is avoided by using dynamic programming.
The detection algorithm detects acoustic change points with reasonable detectability.