
Presentation transcript:

ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION
Amir Harati and Joseph Picone
Institute for Signal and Information Processing, Temple University

Introduction
Searching audio, unlike searching text data, is approximate and is based on likelihoods. Performance depends on the acoustic channel, speech rate, accent, language and confusability. Unlike text-based searches, the quality of the search term plays a significant role in the overall perception of the usability of the system. Goal: develop a tool that assesses the strength of a search term, much as password checkers assess the strength of a password.

Figure 1. A screenshot of our demonstration software.

Spoken Term Detection (STD)
STD goal: "...detect the presence of a term in a large audio corpus of heterogeneous speech..."
STD phases: (1) indexing the audio file; (2) searching through the indexed data.
Error types: (1) false alarms; (2) missed detections.

Figure 2. A common approach in STD is to use a speech-to-text system to index the speech signal (J. G. Fiscus, et al., 2007).

Figure 3. An overview of our approach to search term strength prediction, which is based on decomposing terms into features.

Core Features
Duration
Length
No. of Syllables
No. of Vowels
No. of Consonants
Phoneme Frequency
BPC and CVC Frequency
Length/Duration
No. Syllables/Duration
No. Vowels/No. Consonants
Start-End Phoneme
2-Grams of Phonemes
2-Grams of BPC
2- and 3-Grams of CVCs

Example decomposition of the word "tsunami":
Word: tsunami
Phonemes: t s uh n aa m iy
Vowels: uh aa iy
Consonants: t s n m
Syllables: Tsoo nah mee
BPC: S F V N V N V
CVC: C C V C V C V

Broad phonetic classes (BPC):
Stops (S): b p d t g k
Fricatives (F): jh ch s sh z zh f th v dh hh
Nasals (N): m n ng en
Liquids (L): l el r w y
Vowels (V): iy ih eh ey ae aa aw ay ah ao ax oy ow uh uw er

Data Set
NIST Spoken Term Detection 2006 Evaluation results.
Sites: BBN, IBM, SRI.
Sources: Broadcast News (3 hrs), Conversational Telephone (3 hrs), Conference Meetings (2 hrs).

Machine Learning Algorithms
Machine learning algorithms are used to learn the relationship between a phonetic representation of a word and its word error rate (WER). The score is defined from the average WER predicted for a word: Strength Score = 1 - WER. Algorithms: Linear Regression, a Feed-Forward Neural Network, a Regression Tree, and K-Nearest Neighbors (KNN) in the phonetic space. Preprocessing includes whitening using singular value decomposition (SVD). The neural network is a two-layer, 30-neuron network trained with back-propagation.
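A minimal sketch of such a prediction pipeline, assuming scikit-learn, a hypothetical feature matrix X (one row of core features per search term) and a vector y of reference WERs; this is an illustration rather than the authors' implementation, with PCA(whiten=True) standing in for the SVD-based whitening step:

    # Illustrative sketch only. X holds core features (duration, syllable
    # counts, n-gram counts, ...); y holds reference WERs for the training terms.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 14))      # hypothetical: 200 terms x 14 core features
    y = rng.random(200)            # hypothetical reference WER in [0, 1]

    model = make_pipeline(
        StandardScaler(),
        PCA(whiten=True),          # SVD-based whitening of the feature space
        KNeighborsRegressor(n_neighbors=5),
    )
    model.fit(X, y)

    predicted_wer = model.predict(X[:5])
    strength_score = 1.0 - predicted_wer   # Strength Score = 1 - predicted WER
    print(strength_score)

The same pipeline could be refit with a different final estimator (for example, a regression tree) to mirror the comparisons in Table 1 below.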
Experimentation
Duration is the most significant feature, with around 40% correlation.

Table 1. The correlation between the hypothesis and the reference WERs for both training and evaluation subsets, for regression (Reg), neural network (NN) and regression tree (DT) predictors. Feature combinations evaluated:
Duration
Duration + No. Syllables
Duration + No. Consonants
Duration + No. Syllables + No. Consonants
Duration + Length + No. Syllables/Duration
Duration + No. Consonants + CVC2 + Length/Duration + No. Syllables/Duration

Figure 4. The relationship between duration and error rate shows that longer words generally result in better performance.

Results
Table 2. KNN's predictions, for various values of K on the training and evaluation subsets, correlate well with the reference WER.

Figure 5. Correlation between the predicted and reference error rates.

Duration is the single most important feature. The maximum correlation is 46%, which explains 21% of the variance.
Many of the core features are highly correlated.
KNN demonstrates the most promising prediction capability.
The data set is not balanced: the number of data points with a low error rate is much higher than the number with a high error rate, which reduces predictor accuracy.
A significant portion of the error rate is related to factors beyond the spelling of the search term, such as speech rate.

Summary
The overall correlation between the predictions and the reference is not high, indicating that there are factors beyond the phonetic content of a search term that influence performance. A serious limitation of the current work is the size and quality of the data set. Input from more word-based and phone-based systems is needed, as well as a much larger training set. Despite these problems, the demonstration system provides useful feedback to users and can serve as a valuable training aid.

Future Work
The next NIST STD evaluation should provide significantly more data from a variety of application environments. With more data, we can examine acoustic scoring-based metrics to move beyond word spelling as a predictor of performance.
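The correlation metric quoted throughout relates predicted WERs to reference WERs. A minimal sketch of that computation, using small hypothetical arrays rather than the actual evaluation data:

    # Illustrative sketch only: Pearson correlation between predicted and
    # reference WER, the metric reported in Tables 1-2 (hypothetical arrays).
    import numpy as np

    reference_wer = np.array([0.10, 0.35, 0.05, 0.60, 0.25])  # hypothetical
    predicted_wer = np.array([0.15, 0.30, 0.10, 0.50, 0.20])  # hypothetical

    r = np.corrcoef(reference_wer, predicted_wer)[0, 1]
    # r**2 is the fraction of variance explained; e.g., r = 0.46 gives
    # r**2 of roughly 0.21, the "46% ... 21% of the variance" figure above.
    print(f"correlation = {r:.2f}, variance explained = {r**2:.2f}")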