ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University.


Observations

Prediction accuracy for the NIST 2006 results is relatively poor: a maximum correlation of around 46% means that only about 21% of the variance in the data is explained by the predictor.
The features used in this work are highly correlated.
KNN in the phonetic space shows better prediction capability.
Part of the error rate is related to factors beyond the "structure" of the word itself; for example, speech rate and the acoustic channel greatly affect the error rate associated with a word.
Because the data used in this research is not restricted to acoustically clean speech with a standard accent and speech rate, the trained models have some intrinsic inaccuracy.
Despite the relatively low correlation, the system can still be used in practice to help users choose better search terms.

Search Term Strength Prediction

We use machine learning algorithms to learn the relationship between a phonetic representation of a word and its word error rate (WER).
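The "variance explained" figures quoted in the Observations follow directly from squaring the correlation coefficient (R squared). A quick arithmetic check, using the values reported on the poster:

```python
# Variance explained by a linear predictor is the squared correlation (R^2).
r_baseline = 0.46   # maximum correlation reported for the NIST 2006 results
r_improved = 0.76   # correlation after the improvements noted under Future Work

print(round(r_baseline ** 2, 2))  # -> 0.21 (~21% of variance explained)
print(round(r_improved ** 2, 2))  # -> 0.58 (~58% of variance explained)
```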
Algorithms: Linear Regression, Neural Network, Regression Tree, and K-nearest neighbors (KNN) in the phonetic space.
Features include: duration, number of syllables, number of consonants, broad phonetic class (BPC) frequencies, biphone frequencies, and 2-grams of the BPC.
Preprocessing includes whitening of the features using singular value decomposition (SVD).
The final strength score is defined as 1 - WER.

Figure 3. An overview of our approach to search term strength prediction, which is based on decomposing terms into features.

Experimentation

Data: results from the NIST 2006 Spoken Term Detection Evaluation, which include the BBN, IBM, and SRI sites.
Correlation (R) and mean square error (MSE) are used to assess prediction quality.
Duration is the most significant feature, with around 40% correlation.
A duration model based on an N-gram phonetic representation was developed and trained on the TIMIT dataset.

Figure 4. The relationship between duration and error rate shows that longer words generally result in better performance.
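The preprocessing and prediction pipeline described above (SVD whitening followed by regression, with strength defined as 1 - WER) can be sketched minimally in numpy. The feature matrix below is an invented placeholder, not the poster's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-level feature matrix (rows: terms; columns stand in for
# duration, #syllables, #consonants, a BPC frequency). Invented values.
X = rng.normal(size=(100, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]    # mimic highly correlated features
wer = rng.uniform(0.0, 1.0, size=100)      # per-term word error rate (toy)

# Whiten the features via SVD: decorrelate and scale to unit variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_white = Xc @ Vt.T / s * np.sqrt(len(X))  # columns uncorrelated, unit variance

# Linear regression on the whitened features to predict WER.
A = np.column_stack([X_white, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, wer, rcond=None)
wer_hat = A @ coef

# Strength score of a search term, as defined on the poster.
strength = 1.0 - wer_hat
```

The same whitened features could be fed to any of the other listed regressors (neural network, regression tree) in place of the least-squares fit.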
Results

Table 1. Results for the feature-based method on the NIST data.
Table 2. Results for KNN in the phonetic space on the BBN dataset.

Future Work

Use data generated carefully from acoustically clean speech with proper speech rate and accent for training.
Find features with small correlation to the existing feature set; candidates include a confusability score and the expected number of occurrences of a word in the language model.
Combine the outputs of several machines using optimization techniques such as particle swarm optimization (PSO).
Use more sophisticated models, such as nonparametric Bayesian models (e.g., Gaussian processes), for regression.
We have developed algorithms based on some of these suggestions that improve the correlation to around 76%, which corresponds to explaining 58% of the variance of the observed data. These results will be published in the near future.

Introduction

Searching audio, unlike text data, is approximate and based on likelihoods.
Performance depends on the acoustic channel, speech rate, accent, language, and confusability.
Unlike text-based searches, the quality of the search term plays a significant role in the overall perception of the usability of the system.
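The KNN-in-the-phonetic-space method evaluated in Table 2 can be sketched as follows. The poster does not specify the distance metric; Levenshtein distance over phone sequences and the tiny lexicon below are illustrative assumptions:

```python
# KNN regression in a phonetic space: predict a term's error rate as the
# mean observed WER of its K nearest neighbors under edit distance over
# phone sequences. Lexicon and WER values are invented for illustration.

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (rolling-row DP)."""
    d = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, pb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (pa != pb))
    return d[-1]

# Hypothetical training terms: (phone sequence, observed WER).
train = [
    (["k", "ae", "t"], 0.40),
    (["k", "ae", "b"], 0.45),
    (["t", "eh", "l", "ah", "f", "ow", "n"], 0.10),
    (["eh", "l", "ah", "f", "ah", "n", "t"], 0.12),
]

def knn_predict_wer(phones, k=2):
    nearest = sorted(train, key=lambda t: edit_distance(phones, t[0]))
    return sum(w for _, w in nearest[:k]) / k

print(knn_predict_wer(["k", "ae", "p"]))  # short term near "cat"/"cab" -> 0.425
```

Longer phone sequences land near other long terms, which (per Figure 4) tend to have lower error rates, so the neighborhood average captures the duration effect implicitly.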
Goal: Develop a tool that assesses the strength of a search term, similar to the way password checkers assess the strength of a password.

Figure 1. A screenshot of our demonstration software.

Spoken Term Detection (STD)

STD goal: "…detect the presence of a term in a large audio corpus of heterogeneous speech…"
STD phases: (1) indexing the audio file; (2) searching through the indexed data.
Error types: (1) false alarms; (2) missed detections.

Figure 2. A common approach in STD is to use a speech-to-text system to index the speech signal (J. G. Fiscus, et al., 2007).

[Table 1 reports MSE and correlation (R) on the train and eval sets for the Regression, NN, and DT models over the feature sets: Duration; Duration + No. Syllables; Duration + No. Consonants; Duration + No. Syllables + No. Consonants; Duration + Length + No. Syllables/Duration; Duration + No. Consonants + Length/Duration + No. Syllables/Duration + CVC. Table 2 reports MSE and R on the train and eval sets as a function of K. The numeric entries were not preserved in this transcript.]

Figure 5. The predicted error rate is plotted against the reference error rate, demonstrating good correlation between the two.

Correlation between the prediction and the reference is not yet satisfactory: the amount of data is insufficient, and the training data is not based on clean speech.

College of Engineering, Temple University
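The two STD phases (indexing, then searching) can be illustrated with a toy inverted index over hypothesized transcripts. The clip names and transcripts are invented stand-ins for speech-to-text output, not part of the evaluation data:

```python
# Toy sketch of the two STD phases: (1) index audio by its hypothesized
# transcript, (2) search the index for a term. Misses and false alarms
# arise when the hypothesized transcript differs from the actual audio.

from collections import defaultdict

hypotheses = {
    "clip_001": "please call the help desk tomorrow",
    "clip_002": "the weather tomorrow looks clear",
}

# Phase 1: indexing -- map each word to the (clip, position) pairs where
# the recognizer hypothesized it.
index = defaultdict(list)
for clip, text in hypotheses.items():
    for pos, word in enumerate(text.split()):
        index[word].append((clip, pos))

# Phase 2: searching -- look the term up in the prebuilt index.
def search(term):
    return index.get(term.lower(), [])

print(search("tomorrow"))  # [('clip_001', 5), ('clip_002', 2)]
```

A word-based index like this cannot return out-of-vocabulary terms, which is why phonetics-based indexing is the common alternative in STD systems.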