CMU Sphinx Speech Recognition Engine. Reporter: Chun-Feng Liao, NCCU Dept. of Computer Science, Intelligent Media Lab.

Purposes of this project: Find out how an efficient speech recognition engine can be implemented. Examine the source code of Sphinx-II to find out the role and function of each component. Read key chapters of Dr. Mosur K. Ravishankar's thesis as a reference. Some demo programs will be given during the oral presentation.

Presentation Agenda: Project summary, agenda, and goals (in English). Introduction. Basics of Speech Recognition. Architecture of CMU Sphinx: –Acoustic Model and HMM. –Language Model. Java™ Platform Issues. Demo. Conclusion.

Voice Technologies: In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR). The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Basics of Speech Recognition

Speech Recognition: Capturing speech (analog) signals. Digitizing the sound waves and converting them into basic language units, or phonemes. Constructing words from phonemes, and analyzing the words in context to ensure correct spelling for words that sound alike (such as "write" and "right").
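Digitization (sampling plus quantization) can be sketched in a few lines. This is an illustration under stated assumptions, not Sphinx code: a pure sine tone stands in for the analog microphone signal, 16 kHz sampling and 16-bit samples are assumed (common settings for speech, not values taken from these slides), and the function name `sample_and_quantize` is hypothetical.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """Sample a sine tone (a stand-in for the analog signal) at fixed time
    steps and round each value to a signed integer level, as an A/D
    converter would. Hypothetical helper for illustration only."""
    max_int = 2 ** (bits - 1) - 1               # 32767 for 16-bit samples
    n_samples = int(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                     # time of the n-th sample in seconds
        analog = math.sin(2 * math.pi * freq_hz * t)  # continuous value in [-1, 1]
        samples.append(round(analog * max_int))       # quantize to an integer level
    return samples

pcm = sample_and_quantize(440, 0.01)  # 10 ms of a 440 Hz tone -> 160 samples
```

The resulting list of integers is the kind of digitized signal that the later phonetic breakdown and modeling steps operate on.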

Speech Recognition Process Flow (figure; source: Microsoft Speech.NET Home)

Recognition Process Flow Summary: Step 1: User Input –The system captures the user's voice in the form of an analog acoustic signal. Step 2: Digitization –Digitize the analog acoustic signal. Step 3: Phonetic Breakdown –Break the signal into phonemes.

Recognition Process Flow Summary (2): Step 4: Statistical Modeling –Map phonemes to their phonetic representation using a statistical model. Step 5: Matching –Based on the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., candidate words, each with a confidence score). –Grammar: the set of words or phrases that constrains the range of input or output in the voice application. –Dictionary: the mapping table between phonetic representations and words (e.g., thu, thee → the).
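A toy version of Step 5 may make the roles of the grammar and dictionary concrete. The sketch assumes a tiny hypothetical dictionary (pronunciation string → word) and a grammar given as a set of allowed words; the phone strings, scores, and the `n_best` function are invented for illustration and do not reflect Sphinx's actual data formats.

```python
from collections import defaultdict

# Hypothetical dictionary: several pronunciations can map to the same word
# (cf. "thu, thee -> the" on the slide). Phone strings are illustrative only.
DICTIONARY = {
    "DH AH": "the",
    "DH IY": "the",
    "R IY D": "read",
    "R AY T": "right",
}
# Hypothetical grammar: the only words this voice application accepts.
GRAMMAR = {"the", "read"}

def n_best(phone_hypotheses, n=3):
    """phone_hypotheses: list of (phone string, acoustic score >= 0).
    Returns up to n (word, confidence) pairs, highest confidence first."""
    word_scores = defaultdict(float)
    for phones, score in phone_hypotheses:
        word = DICTIONARY.get(phones)
        if word is None or word not in GRAMMAR:
            continue  # unknown pronunciation, or word ruled out by the grammar
        word_scores[word] = max(word_scores[word], score)
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(word_scores.values())
    if total == 0:
        return [(w, 0.0) for w, _ in ranked[:n]]
    # Confidence = share of the total score among all grammar-approved words.
    return [(w, s / total) for w, s in ranked[:n]]

print(n_best([("DH AH", 0.75), ("DH IY", 0.5), ("R IY D", 0.25), ("Z Z", 0.9)]))
# prints [('the', 0.75), ('read', 0.25)]
```

Note that "right" never appears in the output even though it is in the dictionary: the grammar excludes it, which is exactly how a grammar narrows the range of accepted input.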

Architecture of CMU Sphinx.

Introduction to CMU Sphinx: A speech recognition system developed at Carnegie Mellon University. Consists of a set of libraries: –core speech recognition functions. –low-level audio capture. Continuous speech decoding. Speaker-independent.

Brief History of CMU Sphinx: Sphinx-I (1987) –The world's first speaker-independent, high-performance ASR system. –Written in C by Kai-Fu Lee (Dr. Kai-Fu Lee, currently Chief Technical Advisor / Vice President of Microsoft Asia). Sphinx-II (1992) –Written in C by Xuedong Huang (Dr. Xuedong Huang, now leader of the Microsoft Speech.NET team). –5-state HMM / N-gram LM. (We can speculate that the core technology of CMU Sphinx has strongly influenced the Microsoft Speech SDK.)

Brief History of CMU Sphinx (2) Sphinx 3 (1996) –Built by Eric Thayer and Mosur Ravishankar. –Slower than Sphinx-II but the design is more flexible. Sphinx 4 (Originally Sphinx 3j) –Refactored from Sphinx 3. –Fully implemented in Java. –Not finished yet.

Components of CMU Sphinx

Front End (libsphinx2fe.lib / libsphinx2ad.lib): Low-level audio access. Continuous listening and silence filtering. Front End API overview.
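Continuous listening with silence filtering is commonly built on a short-time energy threshold. The sketch below is a simplified stand-in, not the algorithm in libsphinx2ad: the frame length (160 samples, i.e. 10 ms at an assumed 16 kHz), the threshold value, and the function name `speech_frames` are all hypothetical.

```python
def speech_frames(samples, frame_len=160, threshold=500.0):
    """Split integer samples into fixed-size frames and keep only frames whose
    root-mean-square (RMS) amplitude exceeds the threshold, i.e. drop silence.
    Returns a list of (start index, frame) pairs."""
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        if rms > threshold:
            kept.append((start, frame))
    return kept

# Usage: 320 samples of silence, one loud frame, then more silence.
signal = [0] * 320 + [1000, -1000] * 80 + [0] * 320
print([start for start, _ in speech_frames(signal)])  # prints [320]
```

A real front end would typically adapt the threshold to the background noise level; a fixed value keeps the idea visible.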

Knowledge Base: The data that drives the decoder. Three sets of data: –Acoustic Model. –Language Model. –Lexicon (Dictionary).

Acoustic Model (/model/hmm/6k): A database of statistical models. Each statistical model represents a phoneme. Acoustic models are trained by analyzing large amounts of speech data.
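As a greatly simplified illustration of "training by analyzing speech data", the sketch estimates one diagonal Gaussian (a mean and a variance per feature dimension) for each phoneme from labeled feature vectors, using maximum likelihood. Real acoustic model training (for example, Baum-Welch re-estimation of HMMs with Gaussian mixtures) is far more involved; the phoneme labels, the 2-D features, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def train_phoneme_models(labeled_frames, var_floor=1e-3):
    """labeled_frames: iterable of (phoneme, feature_vector) pairs.
    Returns {phoneme: (means, variances)} estimated by maximum likelihood."""
    groups = defaultdict(list)
    for phoneme, vec in labeled_frames:
        groups[phoneme].append(vec)
    models = {}
    for phoneme, vecs in groups.items():
        dims, count = len(vecs[0]), len(vecs)
        means = [sum(v[d] for v in vecs) / count for d in range(dims)]
        # A variance floor avoids zero variance when a phoneme has few samples.
        variances = [max(sum((v[d] - means[d]) ** 2 for v in vecs) / count, var_floor)
                     for d in range(dims)]
        models[phoneme] = (means, variances)
    return models

data = [("AA", [1.0, 2.0]), ("AA", [3.0, 2.0]), ("IY", [0.0, 0.0])]
print(train_phoneme_models(data)["AA"])  # prints ([2.0, 2.0], [1.0, 0.001])
```

More data per phoneme gives more reliable means and variances, which is why acoustic models need large speech corpora.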

HMM in Acoustic Model: HMMs represent each unit of speech in the acoustic model. A typical HMM uses 3 to 5 states to model a phoneme. Each HMM state is represented by a set of Gaussian mixture density functions. Sphinx-II default phone set.
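To show how a phoneme HMM scores a stretch of speech, the sketch runs the Viterbi algorithm over a 3-state left-to-right HMM in which each state may either loop on itself or advance to the next state. It assumes the per-frame log emission scores are already computed (in Sphinx these would come from the Gaussian mixtures); the self-loop probability and the example scores are invented.

```python
import math

NEG_INF = float("-inf")

def viterbi_left_to_right(emission_logp, self_loop=0.6):
    """emission_logp[t][s]: log-likelihood of frame t in state s.
    Transitions: stay in s (prob self_loop) or move to s+1 (1 - self_loop).
    The path must start in state 0 and end in the last state.
    Returns (best log score, best state sequence)."""
    n_frames, n_states = len(emission_logp), len(emission_logp[0])
    log_stay, log_next = math.log(self_loop), math.log(1.0 - self_loop)
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = emission_logp[0][0]  # must enter at the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            advance = score[t - 1][s - 1] + log_next if s > 0 else NEG_INF
            if stay >= advance:
                score[t][s], back[t][s] = stay, s
            else:
                score[t][s], back[t][s] = advance, s - 1
            score[t][s] += emission_logp[t][s]
    # Trace back from the final state to recover the best state sequence.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path

G, B = math.log(0.9), math.log(0.05)  # "good" and "bad" emission scores
frames = [[G, B, B], [G, B, B], [B, G, B], [B, G, B], [B, B, G], [B, B, G]]
best_score, best_path = viterbi_left_to_right(frames)
print(best_path)  # prints [0, 0, 1, 1, 2, 2]
```

Working in log probabilities turns products into sums and avoids numerical underflow over long utterances.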

Gaussian Mixtures: (Refer to the textbook, p. 33, eq. 38.) Gaussian mixtures represent each state in an HMM. Each set of Gaussian mixtures is called a “senone”. HMMs can share “senones”.
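A Gaussian mixture scores a feature vector x as p(x) = Σ_k w_k · N(x; μ_k, Σ_k). The sketch assumes diagonal covariances and evaluates the mixture in the log domain with the log-sum-exp trick for numerical stability. It also shows senone sharing as two HMMs whose state lists point at the same senone ID. The senone table, the triphone-style HMM names, and all parameters are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_gmm(x, senone):
    """senone: list of (weight, mean, var) components whose weights sum to 1.
    Returns log(sum_k w_k * N(x; mean_k, var_k)) computed via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in senone]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical senone table: each entry is one Gaussian mixture.
SENONES = {
    0: [(0.5, [0.0], [1.0]), (0.5, [2.0], [1.0])],
    1: [(1.0, [5.0], [2.0])],
    2: [(0.3, [1.0], [0.5]), (0.7, [-1.0], [0.5])],
}
# Two hypothetical HMMs (one senone ID per state); both share senone 0.
HMM_STATES = {
    "AA(B,T)": [0, 1, 1],
    "AA(D,T)": [0, 1, 2],
}

x = [1.0]
state_scores = [log_gmm(x, SENONES[sid]) for sid in HMM_STATES["AA(B,T)"]]
```

Because the first states of both HMMs use senone 0, a decoder only needs to evaluate that mixture once per frame and can reuse the score, which is one practical benefit of sharing senones.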

Language Model: Describes what is likely to be spoken in a particular context. Word transitions are defined in terms of transition probabilities. Helps to constrain the search space. See examples of LM.

N-gram Language Model: The probability of word N depends on the preceding words N-1, N-2, and so on. Bigrams and trigrams are most commonly used. Used for large-vocabulary applications such as dictation. Typically trained on a very large corpus (millions of words).
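A bigram model estimates P(w_N | w_{N-1}) from counts. The sketch below uses add-one (Laplace) smoothing on a four-sentence toy corpus; real dictation models are trained on millions of words and use more refined smoothing and back-off. The `<s>`/`</s>` sentence markers and the `BigramLM` class are assumptions for illustration. The example also shows how context separates the sound-alike words "write" and "right".

```python
import math
from collections import Counter

class BigramLM:
    """Bigram language model with add-one smoothing (illustrative sketch)."""

    def __init__(self, sentences):
        sentences = list(sentences)  # iterated twice below
        self.unigrams = Counter()    # counts of each word used as a history
        self.bigrams = Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(words[:-1])
            self.bigrams.update(zip(words[:-1], words[1:]))
        self.vocab = {w for s in sentences for w in s.split()} | {"</s>"}

    def prob(self, prev, word):
        """P(word | prev) = (C(prev, word) + 1) / (C(prev) + |V|)."""
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))

    def sentence_logprob(self, sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(words[:-1], words[1:]))

lm = BigramLM(["turn right", "turn left", "i write code", "write it down"])
print(lm.sentence_logprob("turn right") > lm.sentence_logprob("turn write"))  # prints True
```

After "turn", the model prefers "right" over "write" because that bigram was seen in training, which is how the language model narrows the search when acoustics alone cannot decide.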

Decoder: Selects the next set of likely states. Scores incoming features against these states. Drops low-scoring states. Generates results.
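The four decoder steps on this slide amount to a beam search. The sketch assumes a hypothetical transition table and a hypothetical per-frame scoring function (in a real decoder the score would come from the acoustic model); any state whose score falls more than `beam` below the best score in a frame is dropped. It illustrates the idea only and is not the Sphinx-II search code.

```python
import math

def beam_decode(frames, transitions, score_fn, start, beam=5.0):
    """frames: sequence of observations (e.g. feature vectors).
    transitions: {state: [(next_state, log_prob), ...]}.
    score_fn(state, frame) -> log-likelihood of the frame in that state.
    Returns the best (log score, state history) after the last frame."""
    active = {start: (0.0, [start])}
    for frame in frames:
        candidates = {}
        # 1. Select the next set of likely states reachable from active ones.
        for state, (score, history) in active.items():
            for nxt, log_p in transitions.get(state, []):
                # 2. Score the incoming feature against the candidate state.
                new_score = score + log_p + score_fn(nxt, frame)
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, history + [nxt])
        if not candidates:
            break
        # 3. Drop states that fall outside the beam around the best score.
        best = max(s for s, _ in candidates.values())
        active = {st: v for st, v in candidates.items() if v[0] >= best - beam}
    # 4. Generate the result: the highest-scoring surviving hypothesis.
    return max(active.values(), key=lambda v: v[0])

# Usage with a toy two-state model; each "frame" is just a label here.
half = math.log(0.5)
trans = {"A": [("A", half), ("B", half)], "B": [("B", half)]}
match = lambda state, frame: 0.0 if state == frame else -10.0
print(beam_decode(["A", "B", "B"], trans, match, "A")[1])  # prints ['A', 'A', 'B', 'B']
```

A narrower beam prunes more states and runs faster at the risk of discarding the correct path; a wider beam is safer but slower.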

Speech in Java™ Platform

Sun Java Speech API: First released on October 26, …. The Java™ Speech API allows Java applications to incorporate speech technology into their user interfaces. It defines a cross-platform API to support command-and-control recognizers, dictation systems, and speech synthesizers.

Implementations of the Java Speech API: Open source: –FreeTTS / CMU Sphinx4. IBM Speech for Java. Cloud Garden. L&H TTS for Java Speech API. Conversa Web 3.0.

FreeTTS: Fully implemented in Java. Based upon Flite 1.1, a small run-time speech synthesis engine developed at CMU. Partial support for JSAPI 1.0; not supported, for example: –Speech recognition functions. –JSML.

Sphinx 4 (Sphinx 3j): Fully implemented in Java. Its speed is equal to or faster than Sphinx 3. The acoustic model and language model are under construction. Source code is available via CVS (but you cannot run any applications without models!). For example, to check out Sphinx 4, you can use the following command:
cvs -z3 co sphinx4

Java™ Platform Issues: GC makes managing data much easier. Native engines typically optimize inner loops for the specific CPU; this cannot be done on the Java platform. Native engines arrange data to optimize cache hits; this cannot really be done on the Java platform either.

DEMO: Sphinx-II batch mode. Sphinx-II live mode. Sphinx-II client/server mode. A simple FreeTTS application. (Java-based) TTS vs. (C-based) SR. Motion Planner with FreeTTS, using Java Web Start™ (this is a GRA course final project).

Summary: Sphinx is an open-source speech recognition system developed at CMU. The front end (FE), knowledge base (KB), and decoder form the core of the SR system. The FE receives and processes the speech signal. The knowledge base provides data for the decoder. The decoder searches the states and returns the results. Speech recognition is a challenging problem for the Java platform.

Reference:
–Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU.
–Mosur K. Ravishankar, Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.
–Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.

Reference (on-line):
–On-line documents of the Java™ Speech API: media/speech/
–On-line documents of FreeTTS.
–On-line documents of Sphinx-II.

Q & A