
Slide 1: Ubiquitous Computing
Max Mühlhäuser, Iryna Gurevych (Editors)
Part V: Ease-of-use
Chapter 17: Mobile Speech Recognition
Dirk Schnelle

Slide 2: Introduction
Voice-based interaction with mobile devices
– Is NOT simply copying an existing speech recognizer to the device and running it
– Limitations of the device have to be considered
  – Computational power
  – Limited memory
  – …
Different architectures for enabling speech recognition on the device exist

Slide 3: Speech Recognition
The recognizer transcribes spoken language into text

Slide 4: General Architecture
Signal Processor
– Generates real-valued feature vectors x_i from the speech signal
– At regular intervals, e.g. every 10 msec
Model
– Contains a set of prototypes w_i
Decoder
– Converts the x_i into an utterance
– Finds the prototype w_i closest to each x_j for a given distance function d (see the sketch below)
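To make the decoder step concrete, here is a minimal nearest-prototype sketch in Python (not from the chapter; the prototype values and the Euclidean distance are illustrative assumptions):

    import numpy as np

    # Hypothetical prototype set: one vector per acoustic symbol (values are made up)
    prototypes = {"one": np.array([0.2, 0.7]), "two": np.array([0.9, 0.1])}

    def decode(x, d=lambda a, b: float(np.linalg.norm(a - b))):
        # Return the symbol whose prototype is closest to x under distance d
        return min(prototypes, key=lambda s: d(x, prototypes[s]))

    print(decode(np.array([0.3, 0.6])))  # -> one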

Slide 5: Recognizer Types
Word-based recognizer
– Acoustic symbols a_i are words
– Example: {a_1 = one, a_2 = two, a_3 = three}
– No post-processing required
– Inflexible
– For smaller vocabularies
Phoneme-based recognizer
– Acoustic symbols a_i are phonemes
– Phonemes are small sound units
– Example: {a_1 = W, a_2 = AH, a_3 = N, a_4 = ., a_5 = T, a_6 = UW, …}
– Requires post-processing (see the lexicon sketch below)
– More accurate: reduces decoding to small sound units
– More flexible
– Can handle a larger vocabulary more easily
Analogy: first attempts at writing
– Symbols for each word
– Symbols for each syllable
– Letters as we find them today
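As a toy illustration of the post-processing a phoneme-based recognizer needs (my sketch, not the chapter's; the lexicon entries are illustrative), a pronunciation lexicon maps recognized phoneme sequences back to words:

    # Hypothetical pronunciation lexicon with ARPAbet-style symbols
    lexicon = {("W", "AH", "N"): "one", ("T", "UW"): "two", ("TH", "R", "IY"): "three"}

    def postprocess(phonemes):
        # Greedily match the phoneme stream against lexicon entries, longest first
        words, i = [], 0
        while i < len(phonemes):
            for end in range(len(phonemes), i, -1):
                if tuple(phonemes[i:end]) in lexicon:
                    words.append(lexicon[tuple(phonemes[i:end])])
                    i = end
                    break
            else:
                i += 1  # skip a symbol that matches no entry
        return words

    print(postprocess(["W", "AH", "N", "T", "UW"]))  # -> ['one', 'two']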

Slide 6: Limitations of Embedded Devices
Memory
– Impossible to store large models
Computational Power
– Signal processor and decoder are computationally intensive
Power Consumption
– Computationally intensive tasks consume too much battery → lifetime is reduced
Floating Point Support
– Current processors (StrongARM, XScale) do not support floating point arithmetic
– Emulation is computationally intensive and slow (a fixed-point workaround is sketched below)
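A common workaround on processors without floating point support, not spelled out in the chapter, is fixed-point arithmetic: real values are scaled to integers so that only integer operations are needed. A minimal Q16.16 sketch (the format choice is mine):

    # Q16.16 fixed point: a real x is stored as the integer round(x * 2**16)
    FRAC_BITS = 16
    ONE = 1 << FRAC_BITS

    def to_fixed(x):
        return int(round(x * ONE))

    def fixed_mul(a, b):
        # The product of two Q16.16 numbers carries twice the scale,
        # so one rescaling shift is needed
        return (a * b) >> FRAC_BITS

    a, b = to_fixed(0.5), to_fixed(0.25)
    print(fixed_mul(a, b) / ONE)  # -> 0.125, computed with integer ops only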

Slide 7: Main Architectures
Classification by Schnelle
– Service-dependent speech recognition: recognizer as a service in the network
  – Audio streaming
  – Media Resource Control Protocol (MRCP)
  – Distributed Speech Recognition (DSR)
– Device-inherent speech recognition: recognizer on the device
  – Hardware-based speech recognition
  – Dynamic Time Warping (DTW)
  – Hidden Markov Models (HMM)
  – Artificial Neural Networks (ANN)
Different classification by Zaykovskiy
– Client: signal processor and decoder on the device
– Client-server: signal processor on the device, decoder as a service in the network
– Server: signal processor and decoder as a recognition service on the server

Slide 8: Parameters of Speech Recognition in UC
General parameters
– Speaking mode: isolated word recognition vs. continuous speech
– Speaking style: read speech vs. spontaneous speech
– Enrollment: speaker-dependent vs. speaker-independent
– Vocabulary: size of the vocabulary
– Perplexity: number of words that can follow a single word
– SNR: signal-to-noise ratio (see the sketch after this slide)
– Transducer: noise-cancelling headset vs. telephone
UC-specific parameters
– Network dependency: none vs. fully network-dependent
– Network bandwidth: amount of data to send over the network
– Transmission degradation: loss of information while transmitting the data
– Server load: scalability of the server, if any
– Integration and maintenance: ease of access and applying bugfixes
– Responsiveness: real-time capabilities
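As a quick illustration of the SNR parameter (my addition; the toy signal is illustrative), the signal-to-noise ratio in decibels is 10 · log10(P_signal / P_noise):

    import numpy as np

    def snr_db(signal, noise):
        # SNR in dB, with power estimated as the mean square of each sequence
        return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

    rng = np.random.default_rng(0)
    clean = np.sin(np.linspace(0, 8 * np.pi, 16000))  # toy "speech" signal
    noise = 0.1 * rng.standard_normal(16000)          # additive noise
    print(round(snr_db(clean, noise), 1))             # about 17 dB for this toy setup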

Slide 9: Audio Streaming
Use of the embedded device as a microphone replacement
Stream audio over the wireless network (a minimal sketch follows below)
– Bluetooth
– Wi-Fi
Signal processor on the server
Advantages
– Full-featured recognizer
– Large language models
Disadvantages
– Requires a stable wireless network connection
– Very large amount of data streamed over the network
– Proprietary protocols
– Real-time capabilities?
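A minimal sketch of the streaming idea (the host name, port, and chunk size are illustrative assumptions, and a real deployment would use RTP or a similar media protocol rather than raw TCP):

    import socket

    CHUNK = 1600  # e.g. 100 ms of 8 kHz, 16-bit mono audio

    def stream_file(path, host="recognizer.example.org", port=5004):
        # Push the captured audio to the recognition server in fixed-size chunks
        with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                sock.sendall(chunk)

    # stream_file("utterance.raw")  # hypothetical capture file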

Slide 10: MRCP
Standard for audio streaming
Adopted by industry
An API that enables clients to control media resources over a network
Based on the Real Time Streaming Protocol (RTSP)

Slide 11: DSR
Standard by ETSI (European Telecommunications Standards Institute)
Goals
– Reduce network traffic
– Use the computational capabilities of the embedded device

Slide 12: DSR Implementation
DSR front-end (signal processor)
DSR back-end (decoder)
Sphinx profiling

Slide 13: Front-End Processing
Quantization
Pre-emphasis
– The signal is weaker at higher frequencies
– High-pass filter
Framing
– Division into overlapping frames of N samples
Windowing
– Hamming window (the steps above are sketched below)
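These steps can be sketched compactly in numpy; the frame length, hop size, and pre-emphasis coefficient below are common defaults, not values from the chapter:

    import numpy as np

    def frontend_frames(signal, frame_len=400, hop=160, alpha=0.97):
        # Pre-emphasis: high-pass filter y[n] = x[n] - alpha * x[n-1]
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
        # Framing: overlapping frames of frame_len samples, shifted by hop
        starts = range(0, len(emphasized) - frame_len + 1, hop)
        frames = np.stack([emphasized[s:s + frame_len] for s in starts])
        # Windowing: apply a Hamming window to every frame
        return frames * np.hamming(frame_len)

    frames = frontend_frames(np.random.randn(16000))  # 1 s of audio at 16 kHz
    print(frames.shape)  # (98, 400)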

Slide 14: Front-End Processing II
Power spectrum
– DFT
Mel spectrum
Power cepstrum
– Cepstrum = inverse DFT of the logarithm of the spectrum (sketched below)
– The first 13 cepstral parameters are called the features
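Continuing the sketch above (the mel filterbank is omitted for brevity and the small constant inside the log is a numerical guard, both my additions):

    import numpy as np

    def features(frames, n_ceps=13):
        # Power spectrum via the DFT of each windowed frame
        power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        # (A mel filterbank would be applied here to obtain the mel spectrum.)
        # Power cepstrum: inverse DFT of the log power spectrum
        ceps = np.fft.irfft(np.log(power + 1e-10), axis=1)
        # The first 13 cepstral parameters are the features
        return ceps[:, :n_ceps]

    print(features(np.random.randn(98, 400)).shape)  # (98, 13)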

Slide 15: Front-End Profiling
Utterances
– Short: 2.05 sec
– Medium: 6.05 sec
– Long: 30.04 sec
Candidates for improvement
– fft
– spec magnitude

Slide 16: Hardware-Based Speech Recognition
Broad range
– Partial (e.g. the FFT computation in DSR)
– Full-featured recognizers
Advantages
– Fewer runtime problems
– Hardware is designed for this purpose
Disadvantage
– Loss of flexibility
General
– Advantages and disadvantages depend on the implemented technology

Slide 17: Dynamic Time Warping
Signal processor
– Comparable to the front-end in DSR
– Output: feature sequence X = (x_1, …, x_n)
Prototype storage
– Templates W_i = (w_i,1, …, w_i,m)
Comparator
– Compares d(X, W_i) against a threshold μ
– Distance function d
– Time warping function (relationship between the elements of X and W_i)
Problem
– Unlikely that the lengths of input and template are the same
– E.g. the length of the "o" in a word varies between utterances
DTW uses dynamic programming (see the sketch below)
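A minimal DTW sketch (scalar features and an absolute-difference local distance, both illustrative simplifications):

    import numpy as np

    def dtw_distance(x, w):
        # D[i, j] = cost of the best warping path aligning x[:i] with w[:j]
        n, m = len(x), len(w)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - w[j - 1])  # local distance d
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Input and template of different lengths still compare cleanly
    print(dtw_distance([1, 2, 3, 3, 3], [1, 2, 3]))  # -> 0.0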

Slide 18: Hidden Markov Models

Slide 19: Unit Matching
HMMs are described as λ = (S, A, B, π, V)
– States S = (s_1, …, s_n)
– Transition probabilities A = {a_ij}
  – a_ij denotes the probability p(s_j | s_i) of moving from state s_i to state s_j
– Output probabilities B = (b_1, …, b_n)
  – b_i(x) denotes the probability q(x | s_i) of observing x in state s_i
– Observations O
  – Domain of the b_i
Probability of the output sequence O
Result
– Scoring for different recognition hypotheses (a toy λ follows below)
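To fix the notation, here is a toy λ in numpy (two states, a two-symbol alphabet; every number is illustrative):

    import numpy as np

    # Toy HMM lambda = (S, A, B, pi) with V = {0, 1}
    A = np.array([[0.7, 0.3],   # A[i, j] = probability of moving from s_i to s_j
                  [0.4, 0.6]])
    B = np.array([[0.9, 0.1],   # B[i, x] = b_i(x), probability of observing x in s_i
                  [0.2, 0.8]])
    pi = np.array([0.5, 0.5])   # initial state distribution

    print(A.sum(axis=1), B.sum(axis=1), pi.sum())  # each distribution sums to 1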

Slide 20: Rabiner's Basic Questions
1. Given the observation sequence O = O_1 O_2 … O_T and a model λ, how do we efficiently compute p(O | λ), the probability of the observation sequence given the model? → Evaluation problem (see the forward-algorithm sketch below)
2. Given the observation sequence O = O_1 O_2 … O_T and the model λ, how do we choose a corresponding state sequence Q = Q_1 Q_2 … Q_T that is optimal in some meaningful sense (i.e. best "explains" the observations)? → Decoding; important for speech recognition: find the correct state sequence
3. How do we adjust the model parameters λ to maximize p(O | λ)? → Training
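Question 1 is classically answered with the forward algorithm; a sketch over the toy λ from the previous slide (repeated here so the snippet runs on its own):

    import numpy as np

    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])

    def forward(obs, A, B, pi):
        # alpha[i] = p(O_1 .. O_t, state s_i | lambda), updated step by step
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()  # p(O | lambda)

    print(forward([0, 1, 0], A, B, pi))  # the evaluation problem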

Slide 21: Viterbi Algorithm
Solves Rabiner's question 2
Based on dynamic programming
Tries to find the best score (highest probability) along a single path at time t (trellis)
Computationally intensive
– Requires on the order of |A_u| multiplications and additions per time step
– |A_u| is the number of transitions in the model
→ Rabiner, Jelinek
Computational optimizations
– Try to replace multiplications by additions, e.g. by working with log probabilities (see the sketch below)
– Usually increase speed at the cost of accuracy
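A Viterbi sketch over the same toy λ, using log probabilities so the inner loop needs only additions and comparisons, which is the kind of optimization the slide alludes to:

    import numpy as np

    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])

    def viterbi(obs, A, B, pi):
        # delta[i] = best log-score of any path ending in s_i; psi records predecessors
        logA, logB = np.log(A), np.log(B)
        delta = np.log(pi) + logB[:, obs[0]]
        psi = []
        for o in obs[1:]:
            trans = delta[:, None] + logA      # log-score of each (prev, next) transition
            psi.append(trans.argmax(axis=0))   # best predecessor per state
            delta = trans.max(axis=0) + logB[:, o]
        # Backtrack the single best path through the trellis
        path = [int(delta.argmax())]
        for back in reversed(psi):
            path.append(int(back[path[-1]]))
        return path[::-1]

    print(viterbi([0, 1, 0], A, B, pi))  # -> [0, 1, 0]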

Slide 22: Lexical Decoding and Semantic Analysis
Lexical decoding
– Eliminate those words that do not have a valid dictionary entry (see the filter sketch below)
– Alternative: statistical grammar
  – Sequences are reduced to a few phonemes in a row, e.g. trigrams
  – Output: a list of trigrams ordered by score
  – Not suitable for isolated word recognition
Semantic analysis
– Eliminate those parts that do not match an allowed sequence of words in a dictionary
Both steps are
– Not computationally intensive
– Fast memory access
Smaller vocabularies
– Are faster to handle
– Require less memory
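Lexical decoding can be pictured as a plain dictionary check (toy sketch; the hypothesis list and dictionary are illustrative):

    # Drop recognition hypotheses that have no valid dictionary entry
    dictionary = {"one", "two", "three"}
    hypotheses = ["one", "wun", "two", "tuh"]

    valid = [w for w in hypotheses if w in dictionary]
    print(valid)  # -> ['one', 'two']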

Slide 23: Artificial Neural Networks
Derived from the way the human brain works
Goal: create a system that
– Is able to learn
– Can be used for pattern classification
Alternative to HMMs
Output of a neuron: a weighted sum of the inputs passed through an activation function (see the sketch below)
Large amount of calculations
– Advantage: only additions and multiplications
– Disadvantage: too many operations
– Not usable on devices with a low CPU frequency
Nearly no optimization possible
– Solution: implement in hardware
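The neuron-output formula did not survive the transcript; the standard form is y = f(Σ_i w_i · x_i + b), sketched here with a sigmoid activation and made-up weights:

    import numpy as np

    def neuron(x, w, b):
        # Weighted sum of the inputs plus a bias, squashed by a sigmoid activation
        z = np.dot(w, x) + b  # only multiplications and additions
        return 1.0 / (1.0 + np.exp(-z))

    print(neuron(np.array([0.5, -1.0]), np.array([0.8, 0.3]), b=0.1))  # ~0.55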

Slide 24: Future Research Directions
Overview of the challenges of implementing speech recognition on embedded devices
None of the architectures is ideal in all aspects
Hope of researchers
– Embedded devices become powerful enough to run off-the-shelf speech recognizers
– This does not solve the problems we have today
Main approaches
– Hardware engineers try to improve the performance of embedded devices
  – Unable to meet all the challenges in the short term
  – Will address some of them
– Software engineers look for tricks to enable speech recognition on embedded devices
  – Currently gain speed at the cost of precision

Slide 25: Summary
Service-dependent speech recognition
– Requires a speech recognition service in the network
– Same potential as desktop speech recognition
– Speech recognition parameters depend on the technology used
– Full network dependency (slightly better for DSR)
– High server load
– The additional UC parameters fare worse
Device-inherent speech recognition
– HMM and ANN offer the highest flexibility
– DTW requires fewer resources
  – Speaker-dependent
  – Requires enrollment
  – Isolated word recognition
– Hardware: lowest flexibility
– Bad performance
– Requires too many resources
– Real time may not be achieved
– Smaller vocabularies
– Smaller models
– Lower perplexity

