Arthur Kunkle ECE 5525 Fall 2008

Introduction and Motivation
- A Large Vocabulary Speech Recognition (LVSR) system converts speech data into textual transcriptions.
- This system will serve as a test-bed for the development of new speech recognition technologies.
- This design presentation assumes basic knowledge of the tasks an LVSR must accomplish, as well as some in-depth knowledge of the HTK framework.

System Technologies
- HMM Toolkit (HTK)
- Cygwin UNIX Emulation Environment
- Practical Extraction and Reporting Language (PERL)
- Subversion Configuration Management Tool

System Requirements
The LVSR shall:
1. Be capable of incorporating prepared data that conforms to a standard HTK interface (defined in "System Design").
2. Automatically generate language and acoustic models from all available conforming input data.
3. Be configurable to use multiple processors and/or remote computers to share the workload for model re-estimation and testing.
4. Have a scheduling mechanism to run different configuration profiles and create a separate results directory for each, containing the acoustic and language models.
5. Record all HTK tool output for a "run" in time-stamped log files.
6. Merge language models together and determine the optimum weighting for models by measuring model perplexity.
7. Notify a list of users with information regarding run errors and completion status.
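Requirement 5 could be met with a thin wrapper around each HTK tool invocation. The following Perl sketch is illustrative only: the logs/ directory and command-line convention are assumptions, not part of the stated design.

#!/usr/bin/perl
# run_logged.pl -- illustrative sketch: run one HTK command and capture its
# output in a time-stamped log file (System Requirement 5).
# Assumes a logs/ directory already exists.
use strict;
use warnings;
use POSIX qw(strftime);

my $stamp   = strftime("%Y%m%d_%H%M%S", localtime);
my $logfile = "logs/run_$stamp.log";

# Everything after the script name is treated as the HTK command to run,
# e.g.:  perl run_logged.pl HERest -C config -S train.scp ...
my $cmd = join(" ", @ARGV);

open my $log, '>', $logfile or die "Cannot open $logfile: $!";
print $log "[$stamp] $cmd\n";
my $output = `$cmd 2>&1`;          # capture stdout and stderr together
print $log $output;
print $log "[exit status] $?\n";   # a non-zero status would trigger the user notification of Requirement 7
close $log;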

System Design
- The following directory structure will capture each stage of the workflow:
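As a purely hypothetical sketch, a per-run directory tree could be created as below; the directory names are assumptions inferred from the phases described in the following slides, not the design's actual layout.

#!/usr/bin/perl
# make_run_dirs.pl -- hypothetical sketch of a per-run directory layout.
# Directory names are assumptions inferred from the workflow phases.
use strict;
use warnings;
use File::Path qw(make_path);
use POSIX qw(strftime);

my $profile = shift(@ARGV) // 'Basic';
my $run     = strftime("%Y%m%d_%H%M%S", localtime) . "_$profile";

make_path("results/$run/$_") for qw(
    data_prep/phase1
    data_prep/phase2
    acoustic_models
    language_models
    testing
    logs
);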

Data Preparation Phase 1
HTK needs the following items, which are custom to each corpus:
- (OPTIONAL) Dictionary – The list of all words found in both the testing and training files of the corpus, with their phonetic pronunciations. Should be named "<corpus_name>_dict.txt".
- Word List – A list of all unique words found in the transcriptions: "<corpus_name>_word_list.txt".
- Training Data List – A list of all MFCC data files contributed by the source, using their absolute location on disk. Rename all utterance files to "<corpus_name>_ _.mfcc".
- "Plain" MLFs – These include only the words of each utterance. Always create this, regardless of timing-information availability.
- "Timed" MLFs (OPTIONAL) – These include the time boundaries of the appearing words/phones. They must be converted to HTK timing as well (HTK uses time units of 100 ns).
- Audio Data – Convert wav/NIST/Sphere format into MFCC using common parameters. Make sure the maximum length allowed by HTK is observed, splitting files as necessary.
A custom Perl script handles each source, driven by a per-corpus configuration file:

# Corpus location on disk
Location: F:/CORPORA/TIMIT
# Sound-splitting threshold (in HTK units)
UtteranceSplit: 300
# Coding parameter config reference
CodingConfigFile: standard_mfcc_cfg.txt
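A minimal sketch of how such a per-corpus script might drive HTK's HCopy tool to code the audio into MFCC files is shown below. The list-file and target names are assumptions for this example; only the HCopy flags (-T trace level, -C coding config, -S script file of "source target" pairs) are standard HTK usage.

#!/usr/bin/perl
# code_corpus.pl -- illustrative sketch: convert a corpus's audio to MFCC
# with HCopy, using the CodingConfigFile referenced above.
# The corpus Location: setting would determine where the source .wav files live.
use strict;
use warnings;

my $coding   = "standard_mfcc_cfg.txt";       # CodingConfigFile: from the corpus config
my $wav_list = "timit_wav_files.txt";         # one source .wav path per line (assumed)
my $scp      = "timit_codetrain.scp";         # HCopy script: "source target" pairs

open my $in,  '<', $wav_list or die "Cannot read $wav_list: $!";
open my $out, '>', $scp      or die "Cannot write $scp: $!";
while (my $wav = <$in>) {
    chomp $wav;
    (my $mfcc = $wav) =~ s/\.wav$/.mfcc/i;    # target feature file next to the source
    print $out "$wav $mfcc\n";
}
close $in;
close $out;

# HCopy reads the coding parameters (e.g. TARGETKIND = MFCC) from the config file.
system("HCopy", "-T", "1", "-C", $coding, "-S", $scp) == 0
    or die "HCopy failed: $?";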

Data Preparation Phase 2
- Data from all sources must be merged together.
- Common data such as dictionaries should be added here.
- Dictionary – The list of all words found in all files contributed to the corpus, with their phonetic pronunciations.
- Indexed Data Files – All files from the individual sources are merged into a common area and their filenames are transformed to a common naming scheme.
- Word List
- Training Data List
- Testing Data List
- "Plain" MLFs – These include only the words of each utterance. Always created, regardless of timing-information availability.
- "Timed" MLFs (OPTIONAL) – These include the time boundaries of the appearing words/phones, converted to HTK timing (HTK uses time units of 100 ns).
- Transcription Files – Transcription files formatted for direct use by the Language Modeling process.
- Grammar File – By default, this step generates an "open" grammar from the word list: any word can legally follow any other word in the final word list. This grammar is used to test acoustic models only.

# Phone-set information
PhoneSet: TIMIT
# Coding parameter config reference
CodingConfigFile: standard_mfcc_cfg.txt
# Parameters to determine percentage of input data that is TRAIN/TEST
# (must add to 100)
TrainDataPercent: 80
TestDataPercent: 20
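One straightforward way to realize the TrainDataPercent/TestDataPercent split is a random partition of the merged utterance list. The sketch below is illustrative; the list and output file names are assumptions.

#!/usr/bin/perl
# split_train_test.pl -- illustrative sketch of the TrainDataPercent /
# TestDataPercent split over the merged utterance list (file names assumed).
use strict;
use warnings;
use List::Util qw(shuffle);

my $train_percent = 80;                      # TrainDataPercent from the Phase 2 config
my $list_file     = "all_mfcc_files.txt";    # merged, indexed data files (assumed name)

open my $in, '<', $list_file or die "Cannot read $list_file: $!";
chomp(my @files = <$in>);
close $in;

my @shuffled = shuffle(@files);
my $n_train  = int(@shuffled * $train_percent / 100);

open my $train, '>', "train.scp" or die $!;
open my $test,  '>', "test.scp"  or die $!;
print $train "$_\n" for @shuffled[0 .. $n_train - 1];
print $test  "$_\n" for @shuffled[$n_train .. $#shuffled];
close $train;
close $test;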

Acoustic Model Generation
The Acoustic Model generation phase generates multiple versions of HMM definition files that model the input utterances at the phone and tri-phone level:
1. Create the prototype HMM.
2. Create the first HMM model for all phones.
3. Tie the states of the silence model.
4. Re-align the models to use all word pronunciations.
5. Create tri-phone HMM models.
6. Use decision-tree based clustering to tie tri-phone model parameters.
7. Split the Gaussian mixtures used for each state.

# Acoustic Training Configuration Profiles
ProfileName: Basic
# Settings for pruning and floor values
VarianceFloor: 0.01
PruningThresholds:
RealignPruneThreshold:
# Which corpus contains bootstrap data for iteration 1
BootstrapCorpus: TIMIT
# How many calls to HERest to do in between major AM steps
ReestimationCount: 2
# File for tree-based clustering logic
TreeEditFile: basic_tree.hed
# Determine target mixtures to apply at end of training
GaussianMixtures: 8
MixtureStepSize: 2
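The seven steps above map onto a fairly standard HTK tool sequence: HCompV for the flat-start prototype, repeated HERest re-estimation, HHEd edit scripts for silence tying, tri-phone cloning, tree-based clustering and mixture splitting, and HVite for forced-alignment realignment. The outline below is a sketch in the style of the HTK Book tutorial; the intermediate directory and file names are assumptions, not taken from this design.

#!/usr/bin/perl
# train_am.pl -- illustrative outline of the HTK commands behind steps 1-7.
# Directory and edit-script names are assumptions; flags follow HTK Book usage.
use strict;
use warnings;

my @steps = (
    # 1-2. Flat-start prototype and initial monophone models
    "HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto",
    "HERest -C config -I phones.mlf -t 250.0 150.0 1000.0 -S train.scp "
      . "-H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones",
    # 3. Tie the silence/short-pause states
    "HHEd -H hmm1/macros -H hmm1/hmmdefs -M hmm2 sil.hed monophones",
    # 4. Forced alignment to pick the best pronunciation per utterance
    "HVite -a -l '*' -C config -H hmm2/macros -H hmm2/hmmdefs -i aligned.mlf "
      . "-I words.mlf -S train.scp -m -y lab dict monophones",
    # 5. Clone monophones into tri-phones
    "HLEd -n triphones -l '*' -i wintri.mlf mktri.led aligned.mlf",
    "HHEd -H hmm2/macros -H hmm2/hmmdefs -M hmm3 mktri.hed monophones",
    # 6. Decision-tree clustering of tri-phone states (the TreeEditFile)
    "HHEd -H hmm3/macros -H hmm3/hmmdefs -M hmm4 basic_tree.hed triphones",
    # 7. Split Gaussian mixtures (e.g. an HHEd script containing 'MU 4 {*.state[2-4].mix}')
    "HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 mix.hed tiedlist",
);

for my $cmd (@steps) {
    print "$cmd\n";
    system($cmd) == 0 or die "Step failed: $cmd";
    # ReestimationCount additional HERest passes would follow each major step.
}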

Language Model Generation
This phase of development creates n-gram language models that predict a symbol in a sequence given its n-1 predecessors:
1. Training text is scanned; n-grams are counted and stored in count (gram) files.
2. Unknown words are mapped to an "Out-of-Vocabulary" class. Other class mappings are applied for class-based language models.
3. The counts in the resulting gram files are used to compute n-gram probabilities, which are stored in the language model files.
4. The goodness of the language model is measured by calculating perplexity against testing text from the corpus.

# These settings dictate the Language Model generation process for all sources
MaxNewWords:
NGramBufferSize:
# Will generate models up to this n-gram order
NToGenerate: 4
FoFLevels: 32
# Must include N-1 cutoff values
Cutoffs: 1, 2, 3
# How much this LM should contribute to the overall model
OverallContribution: 0.5
# Class-model configuration items
ClassAmount: 150
ClusterIterations: 1
ClassContribution: 0.7
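For reference, the standard definitions behind steps 3 and 4 (these formulas are not stated on the slide): an n-gram model approximates the probability of a word sequence by conditioning each word on its n-1 predecessors, and perplexity is the geometric-mean inverse probability of the test text under the model.

P(w_1,\dots,w_N) \approx \prod_{i=1}^{N} P\bigl(w_i \mid w_{i-n+1},\dots,w_{i-1}\bigr)

\mathrm{PP} = P(w_1,\dots,w_N)^{-1/N}
            = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P\bigl(w_i \mid w_{i-n+1},\dots,w_{i-1}\bigr)}

A lower perplexity on the held-out testing text indicates a better-fitting language model, which is the criterion used when weighting merged models.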

Model Testing
- The final phase of the system tests the acoustic and language models generated to this point.
- The results are cataloged according to the timestamp and the profile name.
1. Recognition using acoustic models only and the "open" grammar (i.e., no LM applied).
2. Recognition using both AM and LM.

# Standard HMM/LM testing parameters
WordInsertionPenalty: 0.0
GrammarScaleFactor: 5.0
HMMNumbersToTest: 19
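The two recognition passes map naturally onto HTK's HVite decoder and HResults scorer, with -p carrying WordInsertionPenalty and -s carrying GrammarScaleFactor. The sketch below is illustrative; the network, list, and MLF file names are assumptions.

#!/usr/bin/perl
# test_models.pl -- illustrative sketch of the two recognition passes.
# File names are assumptions; -p and -s carry the profile values above.
use strict;
use warnings;

my $penalty = 0.0;   # WordInsertionPenalty
my $scale   = 5.0;   # GrammarScaleFactor

# Pass 1: acoustic models only, decoding against the "open" grammar network.
system("HVite -C config -H hmm/macros -H hmm/hmmdefs -S test.scp "
     . "-i recout_open.mlf -w open_wdnet -p $penalty -s $scale dict tiedlist") == 0
    or die "HVite (open grammar) failed";

# Pass 2: acoustic models plus the language model, compiled into a word
# network (e.g. with HBuild) before decoding.
system("HVite -C config -H hmm/macros -H hmm/hmmdefs -S test.scp "
     . "-i recout_lm.mlf -w lm_wdnet -p $penalty -s $scale dict tiedlist") == 0
    or die "HVite (AM+LM) failed";

# Score both passes against the reference transcriptions.
system("HResults -I testref.mlf tiedlist recout_open.mlf");
system("HResults -I testref.mlf tiedlist recout_lm.mlf");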

Milestones
The following actions are given in order, with time estimates for each:
1. TIMIT Data Prep: 6 hours
2. AMI Data Prep: 10 hours
3. Phase 2 Data Prep Sub-System: 20 hours
4. Acoustic Model Sub-System: 20 hours
5. Model Testing Sub-System: 12 hours
6. Language Model Sub-System: 15 hours
7. RTE '06 Data Prep: 14 hours
8. Scheduling / Reporting: 14 hours
9. Extra Features / Refactoring: 16 hours
10. Profile Authoring: 4 hours
Total Effort Estimate: 131 hours

Open Issues/Questions
- Can Acoustic and Language Model generation be run in parallel after a common data preparation workflow?
- Right now, all data input into the LVSR is tagged as training data. What is the best way to choose a subset of data for testing only? Have a percentage configuration value and pick random utterances? Have a configurable list of specific utterances set aside? If a source (corpus) specifies a testing set, should we use it by default?
- Which workflow makes more sense for multiple-source LM generation: (a) generate a source-specific word-level LM and a source-specific class-level LM, interpolate them together, and then combine with the other source-specific LMs; or (b) use all training text to create a single word-level LM, generate a class-level LM, and then combine them into the final LM?
- The proposed architecture is static, requiring the process to be restarted when new data is introduced. What requirements exist for dynamically adding new data to existing models?