Automatic speech recognition
- What is the task?
- What are the main difficulties?
- How is it approached?
- How good is it?
- How much better could it be?
What is the task?
- Getting a computer to understand spoken language
- By "understand" we might mean:
  - React appropriately
  - Convert the input speech into another medium, e.g. text
- Several variables impinge on this
How do humans do it?
- Articulation produces sound waves, which the ear conveys to the brain for processing
How might computers do it?
- Digitization of the acoustic waveform into an acoustic signal
- Acoustic analysis of the speech signal
- Linguistic interpretation
Basic block diagram
[Block diagram: speech recognition in MATLAB → parallel-port control → ATmega32 microcontroller (pins P2 to P9) → LCD display]
What's hard about that?
- Digitization: converting the analogue signal into a digital representation
- Signal processing: separating speech from background noise
- Phonetics: variability in human speech
- Phonology: recognizing individual sound distinctions (similar phonemes)
- Lexicology and syntax: disambiguating homophones; features of continuous speech
- Syntax and pragmatics: interpreting prosodic features
- Pragmatics: filtering out performance errors (disfluencies)
Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Use filters to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
- E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
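The sampling-and-quantizing step can be sketched in a few lines. This toy example (in Python rather than the MATLAB used later in the deck; the sample rate, frequency, and bit depth are chosen arbitrarily for illustration) quantizes a sampled sine wave to signed PCM integers:

```python
import math

def quantize(samples, bits=16):
    """Uniformly quantize samples in [-1.0, 1.0] to signed integers.
    A minimal sketch of the quantizing step; a real ADC also applies
    an anti-aliasing filter before sampling."""
    levels = 2 ** (bits - 1) - 1
    return [max(-levels - 1, min(levels, round(s * levels)))
            for s in samples]

# Sample a 100 Hz sine at 8 kHz (well above the 200 Hz Nyquist rate)
fs, f = 8000, 100
analog = [math.sin(2 * math.pi * f * n / fs) for n in range(fs // 100)]
pcm = quantize(analog, bits=16)
```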
Separating speech from background noise
- Noise-cancelling microphones
  - Two mics, one facing the speaker, the other facing away
  - Ambient noise is roughly the same for both mics
- Knowing which bits of the signal relate to speech
  - Spectrograph analysis
Variability in individuals' speech
- Variation among speakers due to:
  - Vocal range (f0 and pitch range; see later)
  - Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
  - Accent (especially vowel systems, but also consonants, allophones, etc.)
- Variation within speakers due to:
  - Health, emotional state
  - Ambient conditions
  - Speech style: formal read vs. spontaneous
Speaker-(in)dependent systems
- Speaker-dependent systems
  - Require "training" to "teach" the system your individual idiosyncrasies
  - The more the better, but nowadays 5 or 10 minutes is typically enough
  - The user is asked to pronounce some key words, from which the computer infers details of the user's accent and voice
  - Fortunately, languages are generally systematic
  - More robust, but less convenient, and obviously less portable
- Speaker-independent systems
  - Language coverage is reduced to compensate for the need to be flexible in phoneme identification
  - A clever compromise is to learn on the fly
(Dis)continuous speech
- Discontinuous speech is much easier to recognize
  - Single words tend to be pronounced more clearly
- Continuous speech involves contextual coarticulation effects
  - Weak forms
  - Assimilation
  - Contractions
Performance errors
- Performance "errors" include:
  - Non-speech sounds
  - Hesitations
  - False starts, repetitions
- Filtering implies handling at the syntactic level or above
- Some disfluencies are deliberate and have pragmatic effect; this is not something we can handle in the near future
Approaches to ASR
- Template-based
- Neural-network-based
- Statistics-based
Template-based approach
- Store examples of units (words, phonemes), then find the example that most closely fits the input
- Extract features from the speech signal; then it's "just" a complex similarity-matching problem, using solutions developed for all sorts of applications
- OK for discrete utterances and a single user
Template-based approach
- Hard to distinguish very similar templates
- Quickly degrades when the input differs from the templates
- Therefore needs techniques to mitigate this degradation:
  - More subtle matching techniques
  - Multiple templates which are aggregated
- Taken together, these suggested …
Statistics-based approach
- Collect a large corpus of transcribed speech recordings
- Train the computer to learn the correspondences ("machine learning")
- At run time, apply statistical processes to search through the space of all possible solutions and pick the statistically most likely one
Statistics-based approach
- Acoustic and lexical models
  - Analyse training data in terms of relevant features
  - Learn from a large amount of data the different possibilities:
    - different phone sequences for a given word
    - different combinations of elements of the speech signal for a given phone/phoneme
  - Combine these into a Hidden Markov Model expressing the probabilities
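Picking the statistically most likely interpretation from an HMM is typically done with the Viterbi algorithm. The toy decoder below (a hypothetical two-state phone model with made-up probabilities; real systems work in log-probabilities over thousands of context-dependent states) illustrates the idea:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for an observation sequence.
    A toy sketch of HMM decoding by dynamic programming."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back the best path from the most probable final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = back[t][last]
        path.insert(0, last)
    return path

# Hypothetical two-phone model; all numbers are made up for illustration
states = ["ih", "t"]
start_p = {"ih": 0.6, "t": 0.4}
trans_p = {"ih": {"ih": 0.7, "t": 0.3}, "t": {"ih": 0.4, "t": 0.6}}
emit_p = {"ih": {"lo": 0.1, "hi": 0.9}, "t": {"lo": 0.8, "hi": 0.2}}
best = viterbi(["hi", "hi", "lo"], states, start_p, trans_p, emit_p)
```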
BLOCK DIAGRAM DESCRIPTION
- Speech acquisition unit
  - Consists of a microphone to obtain the analog speech signal
  - Also contains an analog-to-digital converter
- Speech recognition unit
  - Recognizes the words contained in the input speech signal
  - The speech recognition is implemented in MATLAB with the help of a template-matching algorithm
- Device control unit
  - Consists of a microcontroller, the ATmega32, to control the various appliances
  - The microcontroller is connected to the PC via the PC parallel port
  - The microcontroller reads the input word and controls the device connected to it accordingly
SPEECH RECOGNITION
[Block diagram: digitized speech x(n) → end-point detection → feature extraction (MFCC) xF(n) → dynamic time warping → recognized word]
END-POINT DETECTION
- Accurate detection of a word's start and end points means that subsequent processing can be kept to a minimum, by processing only the parts of the input corresponding to speech
- We use the endpoint detection algorithm proposed by Rabiner and Sambur, based on two simple time-domain measurements of the signal: the energy and the zero-crossing rate
- The algorithm should handle the following cases:
  - Words which begin or end with a low-energy phoneme
  - Words which end with a nasal
  - Speakers ending words with a trailing-off in intensity or a short breath
Steps for EPD
- Noise removal: subtract an estimate of the noise from the signal values
- Word extraction, using three thresholds:
  - ITU (upper energy threshold)
  - ITL (lower energy threshold)
  - IZCT (zero-crossing-rate threshold)
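The two time-domain measurements that the Rabiner-Sambur algorithm compares against ITU/ITL and IZCT can be sketched as follows (frame-level energy and zero-crossing rate only; the threshold values themselves are derived from a leading noise segment and are not shown here):

```python
def short_time_energy(frame):
    """Sum of squared sample values over one frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# In the Rabiner-Sambur scheme, energy is compared against ITU/ITL to
# find a coarse word interval, which is then widened while the
# zero-crossing rate stays above IZCT (to catch unvoiced fricatives).
silence = [0.001, 0.002, 0.001, 0.002]   # made-up low-energy frame
voiced = [0.5, 0.6, 0.4, 0.5]            # made-up voiced frame
```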
Feature Extraction
- Input data to the algorithm is usually too large to be processed directly
- Input data is highly redundant
- Raw analysis requires high computational power and large amounts of memory
- We therefore remove the redundancies and transform the data into a compact set of features
- Here: the DCT-based mel cepstrum
DCT-Based MFCC
1. Take the Fourier transform of the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
MFCC Computation
- As the log magnitude spectrum is real and symmetric, the IDFT reduces to a DCT
- The DCT produces highly uncorrelated features y_t(m)
- The zero-order coefficient y_t(0) is approximately equal to the log energy of the frame
- The number of MFCC coefficients chosen was 13
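Two of the building blocks above, the mel-scale mapping and the final DCT, can be sketched as follows. This uses the standard 2595·log10(1 + f/700) mel formula and a plain (unnormalized) DCT-II; a full MFCC pipeline additionally needs framing, windowing, an FFT, and the triangular filterbank:

```python
import math

def hz_to_mel(f):
    """Standard mel-scale mapping: equal mel steps approximate
    equal perceived pitch steps."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel, used to place the filterbank centres."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct_ii(x, n_out):
    """Plain DCT-II: the step that decorrelates the log mel powers.
    Keeping only the first n_out outputs gives the MFCC vector."""
    N = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / N)
                for i in range(N))
            for k in range(n_out)]
```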
Dynamic Time Warping and Minimum-Distance Paths
- Isolated word recognition
- Task: build an isolated word recognizer
- Method:
  1. Record, parameterize and store a vocabulary of reference words.
  2. Record the test word to be recognized and parameterize it.
  3. Measure the distance between the test word and each reference word.
  4. Choose the reference word "closest" to the test word.
Words are parameterised on a frame-by-frame basis
- Choose a frame length over which speech remains reasonably stationary
- Overlap the frames, e.g. 40 ms frames with a 10 ms frame shift
- We want to compare frames of the test and reference words, i.e. calculate distances between them
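Slicing the signal into overlapping frames is a one-liner; in this sketch the sizes are given in samples, so at an assumed 8 kHz sampling rate 320 samples and 80 samples would correspond to the 40 ms frames and 10 ms shift above:

```python
def frames(samples, frame_len, frame_shift):
    """Split a signal into overlapping, fixed-length frames.
    Any trailing samples that do not fill a whole frame are dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_shift)]

x = list(range(100))
f = frames(x, frame_len=40, frame_shift=10)
```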
Calculating Distances
- Easy case: sum the differences between corresponding frames
- Hard case: the numbers of frames won't always correspond
Solution 1: Linear Time Warping
- Stretch the shorter sound
- Problem? Some sounds stretch more than others
Solution 2: Dynamic Time Warping (DTW)
- Using a dynamic alignment, make the most similar frames of test and reference correspond
- Find the distance between the two utterances using these corresponding frames
Dynamic Programming
[Figure: waveforms of the word "Open" uttered at two different instants; the signals are not time-aligned.]
DTW Process
- Place the distance between frame r of the test word and frame c of the reference word in cell (r, c) of the distance matrix
[Figure: example distance matrix with a warping path through it]
Constraints
- Global
  - Endpoint detection
  - The path should stay close to the diagonal
- Local
  - Must always travel upwards or eastwards
  - No jumps
  - Slope weighting: limit consecutive moves upwards/eastwards
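Under the local constraints above (moves only upwards, eastwards, or diagonally, with no jumps), the accumulated minimum-distance computation can be sketched as below. For brevity the "frames" here are scalars with absolute difference as the distance; a real recognizer compares multidimensional MFCC vectors and may add slope weighting:

```python
def dtw_distance(test, ref, dist=lambda a, b: abs(a - b)):
    """Accumulated DTW distance between two frame sequences.
    Each cell takes its frame distance plus the cheapest of the
    three allowed predecessors (left, below, diagonal)."""
    R, C = len(test), len(ref)
    INF = float("inf")
    D = [[INF] * (C + 1) for _ in range(R + 1)]
    D[0][0] = 0.0
    for r in range(1, R + 1):
        for c in range(1, C + 1):
            d = dist(test[r - 1], ref[c - 1])
            D[r][c] = d + min(D[r - 1][c], D[r][c - 1], D[r - 1][c - 1])
    return D[R][C]

# A slowed-down utterance still matches its own reference exactly,
# while a different word accumulates distance (made-up feature values)
ref_open = [1, 3, 5, 3, 1]
test_open = [1, 1, 3, 5, 5, 3, 1]   # same shape, uttered more slowly
ref_close = [5, 1, 5, 1, 5]
```

The recognizer then simply picks the reference word with the smallest accumulated distance.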
Empirical Results: Known Speaker
[Table: per-word recognition counts for the vocabulary SONY, SUVARNA, GEMINI, HBO, CNN, NDTV, IMAGINE, ZEE CINEMA]
Evolution of ASR, along four dimensions: user population, noise environment, speech style, complexity
- 1975: speaker-dependent; quiet room, fixed high-quality mic; careful reading; application-specific speech and language
- 1985: speaker-independent and adaptive; normal office, various microphones, telephone; planned speech; expert years to create an application-specific language model
- 1995: regional accents, native speakers and competent foreign speakers; vehicle noise, radio, cell phones; natural human-machine dialog (user can adapt); some application-specific data and one engineer year
- 2015: all speakers of the language, including foreign; wherever speech occurs; all styles, including (unaware) human-human speech; application-independent or adaptive
Conclusive remarks
Processing pipeline: recorded speech → noise padding → gain adjustment → DC-offset elimination → spectral subtraction → end-point detection