
1 FYP0202 Advanced Audio Information Retrieval System By Alex Fok, Shirley Ng

2 Outline 1. Overview 2. Read in the raw speech 3. MFCC processing 4. Detect the audio scene change 5. Audio Clustering 6. Interleave Audio Clustering 7. Conclusion

3 Overview Automatic segmentation of an audio stream and automatic clustering of audio segments have attracted considerable attention recently. For example, in the task of automatic transcription of broadcast news, the data contains clean speech, telephone speech, music segments, and speech corrupted by music or noise.

4 Overview (cont’) We would like to SEGMENT the audio stream into homogeneous regions according to speaker identity. We would like to CLUSTER speech segments into homogeneous clusters according to speaker identity.

5 Step 1: Read in the raw speech Read in an MPEG file as input. Convert the file from .mpeg format to .wav format, because the MFCC library only processes .wav files.
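The conversion step can be sketched with a call to an external decoder. The deck does not name the tool it used, so ffmpeg here is an assumption; any decoder that emits PCM .wav would do. Decoding to mono 16 kHz is also an assumption (a common input rate for MFCC front ends):

```python
import shutil
import subprocess

def mpeg_to_wav_cmd(src, dst, sample_rate=16000):
    # -y: overwrite output; -ac 1: downmix to mono; -ar: resample rate
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]

def mpeg_to_wav(src, dst):
    # ffmpeg is assumed to be on PATH; raise a clear error otherwise
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(mpeg_to_wav_cmd(src, dst), check=True)

cmd = mpeg_to_wav_cmd("input.mpeg", "output.wav")
```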

6 Step 2: MFCC processing A wav file is viewed as a sequence of frames, each containing different features. We make use of the MFCC library to convert the wav into MFCC features for processing. We extract 24 features for each frame. The results are stored in feature vectors. [Diagram: Frame 1 | Frame 2 | Frame 3]
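The framing and per-frame feature extraction can be sketched as below. This is a simplified stand-in for the MFCC library the deck relies on — it uses 24 equal-width spectral bands where real MFCCs use mel-spaced filters — but it shows the shape of the pipeline: frame the signal, take log band energies, decorrelate with a DCT, and store one 24-dimensional feature vector per frame. Frame and hop sizes (25 ms / 10 ms at 16 kHz) are assumptions:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Slice the waveform into overlapping fixed-size frames
    n = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def simple_features(x, n_feats=24, frame_len=400, hop=160):
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1))          # magnitude spectrum
    # Crude filterbank: 24 equal-width bands (real MFCC uses mel-spaced bands)
    bands = np.array_split(spec, n_feats, axis=1)
    log_e = np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)
    # DCT-II to decorrelate the band energies, as in MFCC
    k = np.arange(n_feats)
    dct = np.cos(np.pi / n_feats * (k[:, None] + 0.5) * k[None, :])
    return log_e @ dct          # one 24-dim feature vector per frame
```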

7 Step 3: Detect the audio scene change Make use of the feature vectors to detect audio scene changes. The input audio stream is modeled as a Gaussian process. A model selection criterion called BIC (Bayesian Information Criterion) is used to detect the change points.

8 Step 3: Detect the audio scene change Denote xi (i = 1, …, N) as the feature vector of frame i, where N is the total number of frames, mi is the mean vector of frame i, and Σi is the full covariance matrix of frame i. R(i) = N log |Σ| − N1 log |Σ1| − N2 log |Σ2|, where Σ, Σ1, Σ2 are the sample covariance matrices estimated from all the data, from {x1, …, xi}, and from {xi+1, …, xN} respectively, with N1 = i and N2 = N − i.
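The R(i) statistic above translates directly into code: split the feature matrix at frame i, estimate the three sample covariances, and compare log determinants. A minimal numpy sketch (the margin keeping a few frames on each side of the split is an added assumption, needed so the covariances are well conditioned):

```python
import numpy as np

def logdet_cov(M):
    # log |sample covariance|, via slogdet for numerical stability
    return np.linalg.slogdet(np.cov(M, rowvar=False))[1]

def r_stat(X, i):
    # R(i) = N log|Sigma| - N1 log|Sigma1| - N2 log|Sigma2|, N1 = i, N2 = N - i
    N = len(X)
    return (N * logdet_cov(X)
            - i * logdet_cov(X[:i])
            - (N - i) * logdet_cov(X[i:]))

def best_change_point(X, margin=10):
    # Single-change case: the frame with the highest score is the change point
    scores = [r_stat(X, i) for i in range(margin, len(X) - margin)]
    return margin + int(np.argmax(scores))
```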

9 Step 3: Detect the audio scene change BIC(i) = R(i) − P, where P is a model-complexity penalty (a constant with respect to i). If there is only one change point, the frame with the highest BIC score is the change point. If there is more than one change point, the algorithm can be extended straightforwardly by applying the test sequentially.
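The slide leaves the multi-change extension unstated; one common scheme (an assumption, following the usual sequential BIC segmentation) is to scan the stream, record a change whenever the penalized score goes positive, and restart the scan from that point:

```python
import numpy as np

def logdet_cov(M):
    return np.linalg.slogdet(np.cov(M, rowvar=False))[1]

def detect_changes(X, min_seg=20, lam=1.0):
    """Sequentially detect multiple change points with a BIC threshold.
    The penalty is doubled relative to the textbook 1/2-weighted form
    because R(i) here omits the usual 1/2 factor."""
    N, d = X.shape
    penalty = lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    changes, start = [], 0
    while N - start > 2 * min_seg:
        W = X[start:]
        n = len(W)
        scores = [n * logdet_cov(W)
                  - i * logdet_cov(W[:i])
                  - (n - i) * logdet_cov(W[i:])
                  for i in range(min_seg, n - min_seg)]
        best = int(np.argmax(scores))
        if scores[best] - penalty > 0:
            # Change confirmed: record it and restart the scan from there
            changes.append(start + min_seg + best)
            start = changes[-1]
        else:
            break
    return changes
```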

10 Step 4: Audio clustering Because we want to speed up change detection, we only roughly locate the change points. As a result, some change points may be wrongly calculated. In this step, we try to combine wrongly segmented neighboring segments: compare each pair of neighboring segments, and if they are speech from the same person, combine them.
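The deck does not specify its merge criterion; one natural sketch reuses the same ΔBIC test: model two neighboring segments jointly and separately, and merge whenever the joint (single-speaker) model wins, i.e. the ΔBIC score is negative:

```python
import numpy as np

def merge_cost(A, B):
    """Delta-BIC between one joint model and two separate models for
    neighboring segments A and B; negative means same-speaker-like."""
    def ld(M):
        return np.linalg.slogdet(np.cov(M, rowvar=False))[1]
    AB = np.vstack([A, B])
    n, d = AB.shape
    R = n * ld(AB) - len(A) * ld(A) - len(B) * ld(B)
    # Penalty doubled to match R's scale (R omits the usual 1/2 factor)
    return R - (d + 0.5 * d * (d + 1)) * np.log(n)

def merge_neighbors(segments):
    # Greedy left-to-right pass: absorb each segment into its left
    # neighbor when the delta-BIC cost says they match
    out = [segments[0]]
    for seg in segments[1:]:
        if merge_cost(out[-1], seg) < 0:
            out[-1] = np.vstack([out[-1], seg])
        else:
            out.append(seg)
    return out
```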

11 Step 5: Interleave audio clustering Group all the segments of the same speaker into one node. [Diagram: before — alternating segments Speaker 1, Speaker 2, Speaker 1, Speaker 2; after — the Speaker 1 segments combined into one node, and likewise for Speaker 2]
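Grouping non-adjacent segments of the same speaker can be sketched as greedy agglomerative clustering on the same ΔBIC score: repeatedly merge the best-matching pair of nodes, in any order, until no pair improves. This is one plausible scheme, not necessarily the exact one the project used:

```python
import numpy as np

def delta_bic(A, B):
    # Same delta-BIC score as in the neighbor-merging step
    def ld(M):
        return np.linalg.slogdet(np.cov(M, rowvar=False))[1]
    AB = np.vstack([A, B])
    n, d = AB.shape
    R = n * ld(AB) - len(A) * ld(A) - len(B) * ld(B)
    return R - (d + 0.5 * d * (d + 1)) * np.log(n)

def cluster_segments(segments):
    # Agglomerative: merge the globally best (most negative) pair,
    # regardless of adjacency, until every remaining pair scores >= 0
    clusters = list(segments)
    while len(clusters) > 1:
        pairs = [(delta_bic(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best >= 0:
            break
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```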

12 Conclusion We would like to build a precise and fast engine that recognizes the identity of the speakers in a wave file, and that groups the segments of the same speaker together.

13 Conclusion (cont’) Instead of making local decisions based on the distance between fixed-size samples, we expand the decision window as wide as possible. Repeated calculation is avoided by using dynamic programming. The detection algorithm can detect acoustic change points with reasonable detectability.

