2  Automatic speech recognition
- What is the task?
- What are the main difficulties?
- How is it approached?
- How good is it?
- How much better could it be?
3  What is the task?
- Getting a computer to understand spoken language
- By "understand" we might mean:
  - React appropriately
  - Convert the input speech into another medium, e.g. text
- Several variables impinge on this
4  How do humans do it?
- Articulation produces sound waves, which the ear conveys to the brain for processing
5  How might computers do it?
- Digitization: acoustic waveform -> acoustic signal
- Acoustic analysis of the speech signal
- Linguistic interpretation
Together, these stages constitute speech recognition.
6  Basic block diagram
[Figure: MATLAB speech recognition -> PC parallel port -> ATmega32 microcontroller (pins P2 to P9) -> LCD display]
7  What's hard about that?
- Digitization: converting an analogue signal into a digital representation
- Signal processing: separating speech from background noise
- Phonetics: variability in human speech
- Phonology: recognizing individual sound distinctions (similar phonemes)
- Lexicology and syntax: disambiguating homophones; features of continuous speech
- Syntax and pragmatics: interpreting prosodic features
- Pragmatics: filtering of performance errors (disfluencies)
8  Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Use filters to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
- E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
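The log-scale intuition above is what the mel scale formalizes: equal steps in mel correspond to ever-wider bands in Hz as frequency rises. A minimal sketch (the 2595/700 constants are the standard mel formula, not values from these slides):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the (approximately logarithmic) mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is calibrated so that 1000 Hz is about 1000 mel; above that,
# a fixed step in Hz covers fewer and fewer mels, so high-frequency bands
# can be merged into broader filters.
```

This is the same frequency warping used later in the MFCC front end.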
9  Separating speech from background noise
- Noise-cancelling microphones: two mics, one facing the speaker, the other facing away; ambient noise is roughly the same for both mics
- Knowing which bits of the signal relate to speech: spectrograph analysis
10  Variability in individuals' speech
- Variation among speakers due to:
  - Vocal range (f0 and pitch range, see later)
  - Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
  - ACCENT! (especially vowel systems, but also consonants, allophones, etc.)
- Variation within speakers due to:
  - Health, emotional state
  - Ambient conditions
  - Speech style: formal read vs. spontaneous
11  Speaker-(in)dependent systems
- Speaker-dependent systems:
  - Require "training" to "teach" the system your individual idiosyncrasies
  - The more the merrier, but typically nowadays 5 or 10 minutes is enough: the user is asked to pronounce some key words, which allow the computer to infer details of the user's accent and voice
  - Fortunately, languages are generally systematic
  - More robust, but less convenient, and obviously less portable
- Speaker-independent systems:
  - Language coverage is reduced to compensate for the need to be flexible in phoneme identification
- A clever compromise is to learn on the fly
12  (Dis)continuous speech
- Discontinuous speech is much easier to recognize: single words tend to be pronounced more clearly
- Continuous speech involves contextual coarticulation effects:
  - Weak forms
  - Assimilation
  - Contractions
13  Performance errors
- Performance "errors" include:
  - Non-speech sounds
  - Hesitations
  - False starts, repetitions
- Filtering these implies handling at the syntactic level or above
- Some disfluencies are deliberate and have pragmatic effect; this is not something we can handle in the near future
14  Approaches to ASR
- Template based
- Neural network based
- Statistics based
15  Template-based approach
- Store examples of units (words, phonemes), then find the example that most closely fits the input
- Extract features from the speech signal; then it's "just" a complex similarity-matching problem, using solutions developed for all sorts of applications
- OK for discrete utterances and a single user
16  Template-based approach
- Hard to distinguish very similar templates
- And quickly degrades when input differs from the templates
- Therefore needs techniques to mitigate this degradation:
  - More subtle matching techniques
  - Multiple templates which are aggregated
- Taken together, these suggested …
18  Statistics-based approach
- Collect a large corpus of transcribed speech recordings
- Train the computer to learn the correspondences ("machine learning")
- At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
19  Statistics-based approach
- Acoustic and lexical models
- Analyse training data in terms of relevant features
- Learn from a large amount of data the different possibilities:
  - different phone sequences for a given word
  - different combinations of elements of the speech signal for a given phone/phoneme
- Combine these into a Hidden Markov Model expressing the probabilities
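The decoding step of an HMM, picking the statistically most likely hidden state sequence for an observation sequence, is the Viterbi algorithm. A minimal sketch; the two-state model and all its probabilities below are invented toy values for illustration, not from these slides:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence (log domain)."""
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # best predecessor for state s at this time step
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            col[s] = V[-1][prev] + math.log(trans_p[prev][s] * emit_p[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # backtrack from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy two-state model (all numbers invented for illustration)
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
path = viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p)  # -> ['s1', 's2', 's2']
```

In a real recognizer the states would be phones (or sub-phone states) and the observations acoustic feature vectors, with emission densities rather than a lookup table.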
23  Block diagram description
- Speech acquisition unit:
  - Consists of a microphone to obtain the analog speech signal
  - Also contains an analog-to-digital converter
- Speech recognition unit:
  - Recognizes the words contained in the input speech signal
  - Implemented in MATLAB with the help of a template-matching algorithm
- Device control unit:
  - Consists of a microcontroller, the ATmega32, to control the various appliances
  - The microcontroller is connected to the PC via the PC parallel port
  - The microcontroller reads the input word and controls the device connected to it accordingly
24  Speech recognition pipeline
Digitized speech x(n) -> End-point detection -> xF(n) -> Feature extraction (MFCC) -> Dynamic time warping -> Recognized word
25  End-point detection
- Accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum, by processing only the parts of the input corresponding to speech
- We use the endpoint detection algorithm proposed by Rabiner and Sambur, based on two simple time-domain measurements of the signal: the energy and the zero-crossing rate
- The algorithm should handle the following cases:
  - Words which begin or end with a low-energy phoneme
  - Words which end with a nasal
  - Speakers ending words with a trailing-off in intensity or a short breath
26  Steps for EPD
- Noise removal: subtract the estimated noise values from the signal
- Word extraction, using three thresholds:
  - ITU (upper energy threshold)
  - ITL (lower energy threshold)
  - IZCT (zero-crossing rate threshold)
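An energy-only sketch of this style of endpoint detection: find frames above the upper threshold, then walk the boundaries outwards while energy stays above the lower threshold. The threshold heuristics below are simplifying assumptions, not the exact Rabiner-Sambur values, and the IZCT refinement of the boundaries is omitted:

```python
import numpy as np

def short_time_features(x, frame_len):
    """Per-frame short-time energy and zero-crossing rate."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).sum(axis=1)
    return energy, zcr

def detect_word(x, frame_len=160, noise_frames=5):
    """Return (start, end) sample indices of the word, or None if no speech found."""
    energy, _zcr = short_time_features(x, frame_len)
    noise_e = energy[:noise_frames].mean()   # leading frames assumed to be noise
    itl = noise_e * 2 + 1e-12                # ITL: lower energy threshold (heuristic)
    itu = itl * 5                            # ITU: upper energy threshold (heuristic)
    above = np.where(energy > itu)[0]
    if len(above) == 0:
        return None
    start, end = above[0], above[-1]
    # widen the region while energy stays above the lower threshold
    while start > 0 and energy[start - 1] > itl:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > itl:
        end += 1
    return start * frame_len, (end + 1) * frame_len
```

With a 440 Hz burst embedded in silence, `detect_word` recovers the burst's frame-aligned boundaries.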
27  Feature extraction
- Input data to the algorithm is usually too large to be processed
- Input data is highly redundant
- Raw analysis requires high computational power and large amounts of memory
- Thus we remove the redundancies and transform the data into a set of features: the DCT-based mel cepstrum
28  DCT-based MFCC
1. Take the Fourier transform of the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
29  MFCC computation
- As the log magnitude spectrum is real and symmetric, the IDFT reduces to a DCT
- The DCT produces highly uncorrelated features y_t(k)
- The zero-order MFCC coefficient y_t(0) is approximately equal to the log energy of the frame
- The number of MFCC coefficients chosen was 13
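The steps above can be sketched in NumPy for a single frame. The filter count, sample rate, and flooring constant below are illustrative choices, not values from the slides; only the 13 retained coefficients match the setup described here:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs for one frame: FFT power -> mel filterbank -> log -> DCT-II."""
    spec = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    n_fft = len(frame)
    # Triangular filterbank with edges equally spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    logmel = np.log(fbank @ spec + 1e-10)           # floor avoids log(0)
    # DCT-II of the mel log powers, keeping the first n_ceps coefficients
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n_filters)
    return dct @ logmel
```

Coefficient 0 of the result is (up to scaling) the summed mel log energy of the frame, matching the remark above about the zero-order coefficient.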
31  Dynamic time warping and minimum-distance paths
Isolated word recognition.
- Task: build an isolated word recognizer
- Method:
  1. Record, parameterise and store a vocabulary of reference words
  2. Record the test word to be recognized and parameterise it
  3. Measure the distance between the test word and each reference word
  4. Choose the reference word 'closest' to the test word
32  Frame-by-frame parameterisation
- Words are parameterised on a frame-by-frame basis
- Choose a frame length over which the speech remains reasonably stationary
- Overlap the frames, e.g. 40 ms frames with a 10 ms frame shift
- We want to compare frames of test and reference words, i.e. calculate distances between them
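Slicing a signal into overlapping frames can be sketched as follows; the 40 ms / 10 ms figures are the example from the slide, while the 16 kHz sample rate is an assumption:

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=40, shift_ms=10):
    """Slice a signal into overlapping frames (e.g. 40 ms frames, 10 ms shift)."""
    flen = int(sr * frame_ms / 1000)     # samples per frame
    fshift = int(sr * shift_ms / 1000)   # samples between frame starts
    n = 1 + max(0, (len(x) - flen) // fshift)
    return np.stack([x[i * fshift : i * fshift + flen] for i in range(n)])
```

Consecutive frames share 30 ms of signal, so slowly varying features change smoothly from frame to frame.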
33  Calculating distances
- Easy: sum the differences between corresponding frames
- Hard: the number of frames won't always correspond
34  Solution 1: linear time warping
- Stretch the shorter sound
- Problem? Some sounds stretch more than others
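Linear time warping can be sketched as uniform interpolation of the frame sequence. It stretches every region of the word equally, which is exactly the weakness noted above, since in real speech vowels stretch far more than stop consonants:

```python
import numpy as np

def linear_warp(frames, target_len):
    """Linearly stretch or compress a frame sequence to target_len frames."""
    idx = np.linspace(0, len(frames) - 1, target_len)  # fractional frame positions
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    w = idx - lo
    # Interpolate between neighbouring frames
    return (1 - w)[:, None] * frames[lo] + w[:, None] * frames[hi]
```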
35  Solution 2: dynamic time warping (DTW)
- Using a dynamic alignment, make the most similar frames of test and reference correspond
- Find the distance between two utterances using these corresponding frames
36  Dynamic programming
[Figure: waveforms showing the utterance of the word "Open" at two different instants; the signals are not time-aligned]
37  DTW process
- Place the distance between frame r of the test word and frame c of the reference word in cell (r, c) of a distance matrix
[Figure: example distance matrix with test frames along one axis and reference frames along the other]
38  Constraints
- Global:
  - Endpoint detection
  - Path should be close to the diagonal
- Local:
  - Must always travel upwards or eastwards; no jumps
  - Slope weighting: penalise consecutive moves upwards/eastwards
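The matrix-filling step with the local "upwards, eastwards, or diagonal, no jumps" moves can be sketched as below. Euclidean frame distance and unit slope weights are simplifying assumptions; a full implementation would also apply the global diagonal-band constraint:

```python
import numpy as np

def dtw_distance(test, ref):
    """Minimum cumulative DTW cost between two frame sequences."""
    n, m = len(test), len(ref)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(test[i - 1] - ref[j - 1])
            # local moves: up, east, or diagonal (no jumps)
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]
```

An isolated word recognizer then simply computes `dtw_distance` between the test word and every stored reference word, and picks the reference with the smallest cost.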
39  Empirical results: known speaker
[Table: per-word recognition counts for the test vocabulary SONY, SUVARNA, GEMINI, HBO, CNN, NDTV, IMAGINE, ZEE CINEMA]
43  Evolution of ASR
Dimensions: user population, noise environment, speech style, complexity.
- 1975: speaker-dependent; quiet room, fixed high-quality mic; careful reading; application-specific speech and language
- 1985: speaker independent and adaptive; normal office, various microphones, telephone; planned speech; expert years to create an application-specific language model
- 1995: regional accents, native speakers and competent foreign speakers; vehicle noise, radio, cell phones; natural human-machine dialog (user can adapt); some application-specific data and one engineer year
- 2015: all speakers of the language, including foreign; wherever speech occurs; all styles, including human-human (unaware); application independent or adaptive
44  Conclusive remarks
Processing chain applied to the recorded speech:
- Noise padding
- Gain adjustment
- DC offset elimination
- Spectral subtraction
- End-point detection