
Slide 1 (title): Arab Open University (AOU), T209 Information and Communication Technologies: People and Interactions, Eighth Session. Prepared by: Eng. Ali H. Elaywe

Slide 2: This session is based on the following references:
1- Book S (Speech Recognition)
2- The T529 ICT CD-ROM Reference Material

Slide 3 - Topic 1: Introduction to Speech Recognition
Speech recognition is likely to be one of the keys to future developments in human–technology interaction. We use speech in preference to all other forms of communication; it is something we learn as children growing up, and so it is often claimed to be the most natural form of interaction. Far more important, though, is that speech recognition could dramatically increase access to information and communication technologies.

Slide 4: Generally speaking, Automatic Speech Recognition (ASR) is the recognition and understanding of human speech by machines, automatically and without human intervention. It is a difficult task to achieve because of the complexity of the speech signal and of the associated algorithms. ASR belongs to the field of Digital Signal Processing (DSP) and Computer Science.
To understand how ASR systems are developed we need to know about the following techniques:
1- Analogue and digital signals
2- Signal bandwidth
3- Fourier analysis and the spectrum
4- The spectrogram

Slide 5 - Topic 2: Speech Recognizers
A speech recognizer (SR) captures the speech signal and then recognizes the words and their meanings. Capturing the sounds of speech is easy, but recognizing the words and their meaning is much more difficult!

Slide 6 - Sub-topic 2.1: Types of speech recognition systems
Automatic speech recognition (ASR) systems fall into two broad categories:
1- Isolated-word recognizers: these systems try to recognize individual words or short phrases, often called 'utterances'. Pauses at the beginning and end of an utterance make the recognition process much easier because there are no transition stages between utterances. The CSLU Toolkit is classed as an isolated-word recognition system.
2- Continuous-speech recognizers: the words are spoken at a normal speech rate, without the need for short phrases. Recognition is more difficult because the end of one word runs into the beginning of the next.

Slide 7: Another important distinction between ASR systems is the number of speakers that can be recognized. All current systems require training of some form; that is, they must learn the statistical properties of the speakers' pronunciation. On this basis, ASR systems can be divided into:
1- Speaker-independent (small-vocabulary) recognizers: the system is trained with thousands of speech samples from thousands of different users (the general public), and is typically designed to recognize a restricted vocabulary of, say, 2000 words. This is perfectly adequate for general public systems such as telephone banking and travel information services. Such systems are referred to as speaker-independent.

Slide 8: 2- Speaker-enrolment (large-vocabulary) recognizers: these are usually trained for a few users (often a single individual). ASR software packages for personal dictation must handle a more extensive vocabulary, perhaps 50,000 words, and so they are trained to a single individual and referred to as speaker-enrolment systems.

Slide 9 - Activity 1 (reflection): Have you any personal experience of ASR systems, perhaps through the use of an automated banking system or dictation software? If you have encountered such systems, how would you rate your experience? Was it successful? Did you achieve your goal(s)? Would an alternative design of user interface have been more appropriate?

Slide 10: I have encountered two ASR systems:
1- Telephone banking (speaker-independent), in which the system uses automatic recognition to authenticate me as the user and then process my transaction. I have experienced few problems, and the system offers me greater flexibility in when and where I undertake banking services. I would judge this an appropriate use of speech recognition.
2- Personal dictation software (speaker-enrolment). This took quite a while to set up, owing to the need to train the system to my pronunciation, and never seemed to achieve better than about 85% accuracy. I found that my typing was more accurate than this, and I have learnt to notice when I make a mistake while typing at a keyboard. Whilst I can appreciate the appeal of such systems, for the moment I'll stick with the slower keyboard interface.

Slide 11 - Sub-topic 2.2: The contribution of Linguistics
Recognizing words vs. comprehending the spoken words: recognizing words from uttered sounds is only the first stage of speech interaction. Purposeful communication requires comprehension of the spoken words, and understanding what is spoken is part of the field of study known as linguistics. The more linguistic knowledge a system uses, the easier the process of speech recognition becomes. In an extremely simple system, such as a phone touch-tone system, decisions are made using very simple rules and are not based on any level of understanding of the user's response!

Slide 12: Linguistics (background reading)
Linguistics is concerned with the structure of a particular language, or of languages in general; that is, the rules for forming acceptable utterances of the language. The goal of linguists is not to ensure that people follow a standard set of rules in speaking, but rather to develop rational models that explain the language people actually use.
Four elements of linguistics are particularly important for automatic continuous-speech recognition systems:
1- Phonology: the study of vocal sounds (examined in more detail in Section 3)
2- Lexicon: defines the vocabulary, or words, used in a language
3- Syntax: deals with the grammar of the language
4- Semantics: defines the conventions for deriving the meaning of words and sentences
The combination of the lexicon, syntax rules and semantic rules is referred to as a language model.

Slide 13 - Activity 2 (self-assessment): Explain why continuous-speech recognizers are generally speaker-dependent.
Isolated-word speech recognizers are trained on the characteristics of individual words using hundreds of different utterances of the same word from many different speakers. In this way it is possible to build up measurements of the statistical properties of the words that are independent of the speaker (speaker-independent).
Continuous-speech recognizers, on the other hand, have to handle larger vocabularies as well as the transitions between words. Recognition results are better if these features are determined from the measurements of a single speaker (speaker-enrolment, or speaker-dependent).

Slide 14 - Sub-topic 2.3: Preparation: analogue and digital systems
Speech recognition builds on numerous ideas associated with the study of signals and signal processing, topics that are frequently taught from a mathematical perspective. A signal is a quantity that carries some useful information. Signal Processing (SP) is the mathematical and computational study of signals and systems; Digital Signal Processing (DSP) is the corresponding study of digital signals and systems.
I have deliberately chosen to avoid such a mathematical approach; after all, our goal is to explore human–computer interaction. Nevertheless, some terms and concepts are fundamental, such as the conversion of an analogue speech signal into a digital one.

Slide 15: In most practical situations, analogue signals are continuous-time signals and digital signals are discrete-time signals.
The key features of an analogue signal are:
1. It can take any value within a range
2. It can change continuously with time
3. The main drawback of analogue systems is their sensitivity to noise
The key features of a digital signal are:
1. It is restricted to a finite set of values within a range
2. It is allowed to change only at fixed, regular intervals
3. The main advantages of digital systems are their insensitivity to noise and their easy manipulation by digital computers
ASR is mainly a Digital Signal Processing (DSP) activity.

Slide 16 - Activity 3 (self-assessment / revision) (see T529 ICT CD-ROM)
(a) How are each of the following pairs of sinusoids related? The general equation for a sinewave can be written as y = A × sin(ωt), where A is the amplitude and ω is the angular frequency measured in radians per second.
(i) x = A × sin(ωt), y = A × sin(2ωt). Comparing the equations for x and y with the standard form, the sinewave for y has the same amplitude as the sinewave for x but twice the angular frequency, and hence twice the frequency.
(ii) x = A × sin(ωt), y = (A/2) × sin(ωt). In this case the sinewave for y has the same frequency as the sinewave for x but half the amplitude.

Slide 17: (iii) x = A × sin(ωt), y = A × sin(ωt + π/4). The sinewaves for x and y have the same amplitude and frequency, but y has been advanced by π/4 radians, or 45 degrees. If you compare the two graphs, as shown in Figure 1, y reaches its peak before x.
Figure 1: Phase and sinewaves

Slide 18: (b) Write down the expression for a sinewave of amplitude 4 and frequency 200 Hz.
Substituting the values into the general equation for a sinewave, x = A × sin(ωt) with ω = 2πf, where f is the frequency in hertz (Hz):
ω = 2 × π × 200 = 400π
x = 4 × sin(400πt)
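As a quick cross-check, here is a minimal Python sketch (NumPy assumed; the sampling instants are chosen arbitrarily for illustration) that evaluates this expression:

    import numpy as np

    A = 4.0                    # amplitude
    f = 200.0                  # frequency in Hz
    omega = 2 * np.pi * f      # angular frequency in rad/s (= 400*pi)

    t = np.linspace(0.0, 0.01, 101)   # 10 ms of samples = two full cycles
    x = A * np.sin(omega * t)         # x = 4 sin(400*pi*t)

    print(x[:5])   # first few sample values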

Slide 19 - Activity 4 (self-assessment / revision) (see T529 ICT CD-ROM)
(a) Briefly explain each of the following terms:
Periodic: the term applied to signals that repeat themselves at regular intervals. Periodic signals tend to exhibit strong peaks in their spectra.
Period: the time it takes a periodic signal to repeat itself; equivalently, the duration of one cycle. The period is the reciprocal of the frequency.

Slide 20: Bandwidth of analogue signals: the difference between the highest and lowest frequencies present in a signal, or the maximum range of frequencies that can be transmitted by a system.
Spectrum: a graph showing the frequencies present in a signal.
(b) A signal covers the frequency range from 100 Hz to 3.4 kHz. What is its bandwidth? The bandwidth of a signal extending from 100 Hz to 3400 Hz is 3300 Hz, or 3.3 kHz.
(c) A sinewave has a period of 50 ms. What is its frequency? For a periodic signal the frequency is the reciprocal of the period: if the period T is 50 ms then the frequency f = 1/T = 20 Hz.

Slide 21: The sampling rate is the frequency at which an analogue signal is sampled to create a digital representation. It is usually expressed in hertz (Hz), so it is easy to confuse the sampling rate with the frequency of the signal being sampled.
The more numbers, or levels, used to cover a given voltage range, the more closely packed the levels become, and so the smaller the interval between adjacent levels. The quantization interval is the size of the interval between adjacent levels: the input range divided by the number of levels available.
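These definitions translate directly into a few lines of Python; a minimal sketch (NumPy assumed; the ±2.5 V range and 12-bit resolution anticipate the activities that follow, and the function name is my own):

    import numpy as np

    v_min, v_max = -2.5, 2.5       # input voltage range of the converter
    bits = 12                      # resolution
    levels = 2 ** bits             # 4096 levels
    q = (v_max - v_min) / levels   # quantization interval ~ 1.22 mV
    peak_noise = q / 2             # peak quantization noise = half the interval

    def quantize(v):
        # round an analogue value to the nearest quantization level
        idx = np.clip(np.round((v - v_min) / q), 0, levels - 1)
        return v_min + idx * q

    print(q, peak_noise, quantize(0.3))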

Slide 22 - Activity 5 (self-assessment / revision) (see T529 ICT CD-ROM): This activity relates to analogue-to-digital conversion.
(a) What is the minimum sampling rate required for a signal with a bandwidth covering frequencies up to 6 kHz? The sampling rule states that the minimum sampling rate must equal twice the bandwidth of the signal. If the bandwidth of the signal is 6 kHz, the sampling rate must not be less than 12 kHz.
(b) An analogue-to-digital converter has an input voltage range of ±2.5 V. If the resolution of the converter is 12 bits, what is the quantization interval? The quantization interval equals the input voltage range divided by the number of binary codewords. For a 12-bit converter there are 2^12, or 4096, codewords. Hence the quantization interval of this converter is 5/4096 volts, or approximately 1 millivolt.

Slide 23: (c) What is the peak level of quantization noise produced by the converter defined in (b)? The peak quantization noise is generally taken to be equal to half the quantization interval, so in this case it will be 0.5 millivolts.

Slide 24 - Activity 1 (see T529 ICT CD-ROM, sound digitization): A complex waveform has a frequency spectrum that extends from 3 kHz to 7.5 kHz. What is the minimum sampling rate that meets the requirements of the sampling theorem?
Answer: The bandwidth of this signal is (7.5 - 3) kHz = 4.5 kHz. The minimum sampling rate is twice this, which is 9 kHz.
Suppose that this sampling frequency is used along with an 8-bit representation for each sample. What data rate is generated after digitization?
Answer: Data rate = 8 bits/sample × 9 ksamples/s = 72 kbit/s.

Slide 25 - Activity 2 (see T529 ICT CD-ROM, sound digitization): A converter with 4-bit resolution is used to cover an input range from +2.5 volts to -2.5 volts. What is the quantization interval? Hence find the peak quantization noise.
Answer: A resolution of 4 bits means 2^4, or 16, levels. The input range is 5 volts, so the quantization interval is 5/16 volts, or 0.3125 volts. The peak quantization noise is half of this, or approximately 0.16 volts.

Slide 26 - Activity 3 (see T529 ICT CD-ROM, sound digitization): (a) What does the graph above show? It shows that the samples are converted into 8-bit codewords.

Slide 27: (b) What is the quantization interval, assuming that the signal amplitude varies between +15 and -5 volts? Quantization interval = voltage range / number of levels = (15 + 5) / 2^8 = 20/256 = 0.078 volts.
(c) If the sample taken is +2.025 volts (see the previous graph), what will be the corresponding codeword? It will be 0000 0010, the same as for a +2 volt sample.
(d) How many different samples (see the previous graph) can be represented by codewords? The use of eight bits determines the number of different sample sizes that can be represented in binary form: 2^8 = 256 different samples.

Slide 28: Any well-behaved signal can be composed from a suitable number of sinewaves. This composition, and all its associated studies, is called Fourier analysis.
Activity 6 (self-assessment / revision) (see T529 ICT CD-ROM): This activity relates to Fourier analysis.
(a) Briefly explain the term Fourier analysis. Fourier analysis is the process of determining the frequency components (the frequency domain) of a time-domain signal. The resulting spectrum is termed a line spectrum.
(b) Match each of the signals shown on the left of Figure 2 to its corresponding spectrum on the right.
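The idea can be illustrated with a short NumPy sketch (illustrative only, not from the course materials): compose a signal from two sinewaves and recover their frequencies from the amplitude spectrum.

    import numpy as np

    fs = 8000                        # sampling rate in Hz
    t = np.arange(0, 1.0, 1 / fs)    # one second of samples
    # signal: sum of two sinewaves at 300 Hz and 900 Hz
    x = 1.0 * np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)

    spectrum = np.abs(np.fft.rfft(x)) / (len(x) / 2)   # amplitude spectrum
    freqs = np.fft.rfftfreq(len(x), 1 / fs)

    # the two dominant peaks appear at the component frequencies
    peaks = freqs[np.argsort(spectrum)[-2:]]
    print(sorted(peaks))   # -> [300.0, 900.0]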

Slide 29: Figure 2: Signals and spectra

Slide 30: The matching signals and spectra are shown in Figure 3.
Signal (a) is the result of combining two sinewaves; hence its spectrum displays two peaks at the frequencies of those sinewaves.
Signal (b) comprises three sinewaves; hence its spectrum displays three peaks at the corresponding frequencies.
Signal (c) is known as a square wave. Its spectrum consists of a series of decaying peaks.

Slide 31: Figure 3: Signals and their corresponding spectra

Slide 32: A spectrum is defined mathematically as the magnitude squared of the Fourier transform of a signal. More generally, it is the idea behind a frequency spectrum graph: a frequency spectrum can show the amplitude, phase or power of the components of a waveform. (Another example is the radio spectrum.)
Part of a periodic, non-sinusoidal waveform is shown in Figure 4(a). The corresponding amplitude line spectrum, composed of three sinewaves (also called harmonics), is shown in Figure 4(b).
Note that for periodic signals the frequency spectrum is always a line spectrum.

Slide 33: Figure 4(a): Periodic, non-sinusoidal waveform composed of component sinewaves. Figure 4(b): Amplitude line spectrum of the periodic, non-sinusoidal waveform.

Slide 34: FFT: The (discrete) Fourier transform is calculated on digital computers by means of the Fast Fourier Transform (FFT) algorithm, developed by Cooley and Tukey. The computer-generated frequency pictures (spectra) that you see in textbooks are usually calculated using the FFT. So, in practice, the spectrum is the magnitude squared of the FFT of a signal.

Slide 35: Important notes: The speech signals of humans have distinct frequency signatures for different sounds and words.
From Fourier analysis (see Figure 3) we conclude that there are two views of a sound signal:
1- a time-domain view, which describes how the signal amplitude varies over time
2- a frequency-domain view (or spectrum), which gives the amplitude of the frequencies present in the signal over a specified interval of time
The time-domain and frequency-domain representations can be combined into a spectrogram, a (3-D) graph that displays the changes in frequency and amplitude over time.
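As an aside, the standard scipy.signal.spectrogram routine computes exactly this combined view; a minimal sketch (SciPy and NumPy assumed; the chirp test signal is illustrative):

    import numpy as np
    from scipy.signal import spectrogram

    fs = 11025                       # sampling rate (also used for the recordings later)
    t = np.arange(0, 2.0, 1 / fs)
    x = np.sin(2 * np.pi * (200 + 300 * t) * t)   # a chirp: frequency rises over time

    # f: frequency axis, ts: time axis, Sxx: power at each (frequency, time) cell
    f, ts, Sxx = spectrogram(x, fs=fs, nperseg=256)
    print(Sxx.shape)   # (frequency bins, time segments)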

Slide 36 - Sub-topic 2.4: Preparation: getting ready for the experiments
All of the experimental work that you will undertake in this module uses your computer's sound card and the CSLU Toolkit. You will need to ensure that you know how to configure your microphone and sound card, and that you can record speech samples. You will also need to install the CSLU Toolkit and learn how to use the SpeechView package.
The following experiments, detailed in Book E, Part 1 (Speech Recognition), will explain what you need to do:
Experiment 1: Sound recording set-up
Experiment 2: Installation of the CSLU Toolkit
Experiment 3: The SpeechView program

Slide 37 - Topic 3: Speech recognition
All ASR systems measure various acoustic features of speech and use this information as the starting point of the recognition process. In this part we describe the characteristics of a speaker-independent, isolated-word recognizer, such as that built into the CSLU Toolkit.
An isolated-word recognizer (e.g. the CSLU Toolkit) can be viewed as comprising three separate stages (see Figure 5):
Stage 1: Capturing the data. This captures a digital representation of the spoken word, referred to as the waveform.
Stage 2: Phoneme estimation. This converts the waveform into a series of elemental sound units, referred to as phonemes, so as to classify the word(s) prior to recognition.
Stage 3: Word estimation. This uses various forms of mathematical analysis to estimate the most likely word consistent with the series of recognized phonemes.
Let's now take each stage in more detail.

Slide 38: Figure 5: The speech recognition process. The first part of the figure is a time-domain signal, whereas the next two parts are mixed time-frequency domain representations of the speech signal.

Slide 39 - Sub-topic 3.1: Capturing speech
The speech signal is usually captured in the time domain by recording equipment such as a microphone and associated DSP circuitry, like that found on a sound card. A speech waveform consists of the individual quantized samples (A/D conversion) of the analogue signal derived from the output of a microphone.
Figure 6(a) shows a recording of the words 'Mary had a little lamb...' captured with my computer's sound card. Two important settings were used: a sampling rate of 11.025 kHz and a resolution of 16 bits.
Zooming in on a small segment of this recording, as shown in Figure 6(b), it is possible to see the individual samples and the step-like effect resulting from quantization.

Slide 40: Figure 6: Digital speech recording - the waveform

Slide 41: Quantization and noise: An undesired signal in DSP is called noise. Noise can be of many types and can arise from many sources.
1- Types of noise: noise can be of various types, such as white noise and coloured noise.
A- White noise contains all frequencies (just as white light contains all colours).
B- Coloured noise contains components from certain frequency ranges only, not all frequencies, hence the name.
2- Sources of noise: there are many sources of noise, including the quantization process in A/D converters (the step-like, non-smooth waveform effect), noise from other electronic equipment, thermal noise, electromagnetic noise, lightning, etc.
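A small sketch of these ideas (NumPy assumed; the signal, noise level and SNR calculation are illustrative): white Gaussian noise has its power spread across all frequencies, and in the additive model it is simply summed with the clean samples.

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 11025
    t = np.arange(0, 1.0, 1 / fs)

    clean = np.sin(2 * np.pi * 440 * t)           # stand-in for a clean recording
    noise = rng.normal(0.0, 0.1, size=t.shape)    # white noise: flat spectrum
    noisy = clean + noise                         # additive noise model

    snr_db = 10 * np.log10(np.mean(clean**2) / np.mean(noise**2))
    print(f"SNR ~ {snr_db:.1f} dB")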

Slide 42: Quality of recording: Most speech recognizers are very sensitive to the quality of the recording and can produce many errors if there is too much extraneous noise. It is a bit like trying to hold a conversation at a football match: the background noise makes it difficult to understand what other people are saying!
The effect of such noise can clearly be seen in the waveforms of Figure 7, recorded with my microphone and computer:
The black waveform (lower amplitude) is virtually free of noise.
The grey waveform (higher amplitude) was recorded with the microphone positioned too close to my computer's cooling fan. The noise has hidden, or masked, some of the fine detail visible in the black waveform.
This type of noise is usually called additive noise (other types can be multiplicative, convolutional, etc.).

Slide 43: Figure 7: Magnified speech waveforms illustrating the effects of noise

Slide 44: So noise increases the difficulty of retrieving the original signal, especially if the power of the noise signal is high!
Speech corpora: Capturing clean speech samples has been a key step in the development of ASR systems, for they provide the raw data used to train the recognizer. Thousands of examples from different speakers are required, all speaking the same words under similar recording conditions. The resulting data sets are known as 'speech corpora'.
This would be an appropriate point to break off and complete Experiment 4 (recording speech) in Book E, Part 1.

Slide 45 - Sub-topic 3.2: Phonemes, the elemental parts of speech
The fundamental sound elements of spoken language are called phonemes. Once a speech sample has been captured, it can be processed to determine the phonemes.
Although there are a great many speech sounds available in the languages of the world, any single language comprises only a limited subset of the possible sounds. The English language, for example, comprises 42 different phonemes. Some of these are exclusive to English; others may be found in other languages.

Slide 46: The set of phonemes for English can be thought of as an alphabet, for they represent the elemental sounds of speech. If we combine the appropriate sequence of phonemes, we can make the correct sound corresponding to any word.
Speech recognition reverses this process: it detects the sequence of phonemes in order to recognize the spoken word. The challenge, therefore, is to find a technique that enables identification of each phoneme.
In the English language there are two broad classes of phonemes (important):
1- Vowels (voiced sounds)
2- Consonants (unvoiced sounds)

Slide 47: 1- Vowels are voiced sounds; that is, the sounds are dominated by a stable vibration of the vocal cords, with very little movement of the lips, tongue or teeth. To see what I mean, try making the following sounds with your fingers lightly pressed against the lower part of your neck: 'a' as in hay, 'ee' as in beet, 'oa' as in boat, and 'i' as in bite.
Vowels are further subdivided into:
A- monophthongs, which have a single sound (e.g. the 'ee' of beet), and
B- diphthongs, where there is a distinct change in sound quality from start to finish (e.g. the 'i' of bite).

Slide 48: 2- Consonants (unvoiced sounds) involve rapid movements of the lips, tongue or teeth, and much less voicing, if any. Again, try making these sounds: 'p' as in pat, 'b' as in bat, 'th' as in there, 'ch' as in church, 's' as in sit.
Consonants are subdivided into:
A- approximants, or semivowels (e.g. the 'y' in yes)
B- nasals (e.g. 'm')
C- fricatives (e.g. the 'th' in thing)
D- plosives (e.g. the 'p' in pat)
E- affricates (e.g. the 'ch' in church)
You can learn about these subgroups via the course CD-ROM: from the Start menu, select Speech Toolkit, then Getting Started, click on Tutorials, then click on Spectrogram Reading.
Important note: you will not be assessed on the classification of the phonemes of the English language!

Slide 49: The vocal tract model (the biological model): All these sounds are produced by the vocal tract, which includes the lips, tongue and teeth (referred to as the articulators), the oral and nasal cavities (separated by the velum), the oesophagus and the glottis (or vocal cords). Figure 8 shows a cross-section of the human vocal tract.
Now try to complete Experiment 5 in Book E, Part 1.

Slide 50: Figure 8: The human vocal tract (the biological model)

Slide 51 - Sub-topic 3.3: Spectrograms: time and frequency combined
So far we have viewed speech only as a digitized (A/D) representation of an analogue signal, such as the sample shown in Figure 6. We can expect utterances of phonemes to exhibit variations in the amplitudes and frequencies of the speech signal over time.
The spectrogram (or voice-print) is the ideal tool for measuring variations of amplitude and frequency over time; it is a joint time-frequency analysis.
Recall from Fourier analysis that a spectrum is defined as the magnitude squared of the Fourier transform (FT). A spectrogram is defined as the magnitude squared of the Short-Time Fourier Transform (STFT).
What is the STFT? Simply stated, it is the Fourier transform of a small segment of the speech signal, where the small segment is created using a window function. Hence the STFT is also known as a class of windowed Fourier transforms.

Slide 52: How do you read a spectrogram? It is a time-frequency plot. A sample three-dimensional (3-D) spectrogram generated by SpeechView is shown in Figure 9.
The top part of Figure 9 shows the sampled waveform; the units of the time scale are milliseconds (ms). This is not part of the spectrogram but is plotted together with it.
The bottom part of Figure 9 (the 3-D spectrogram) combines amplitude and frequency information: the vertical scale corresponds to frequency, whilst the darkness of the grey tone is related to amplitude, or strength.

Slide 53: Figure 9: A 3-D spectrogram

Slide 54: How is the spectrogram constructed? The details of how the spectrogram is constructed for a short sample of speech are shown in Figure 10 (a code sketch of these steps follows this slide):
First, the waveform is divided into short time segments of perhaps 10-20 ms duration. These segments are numbered 1, 2, 3 in Figure 10(a).
Second, a spectrum (frequency-domain graph) is calculated for each segment, as shown in Figure 10(b).
Third, all three spectra are displayed on a single time axis, as illustrated in Figure 11. The time axis runs into the thickness of the paper, and the resulting graph is commonly referred to as a 'waterfall' display. The key advantage is that we can see how the peaks and troughs (channels) of the spectra change over time.
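A minimal NumPy sketch of these three steps (the function name, segment length and test signal are illustrative) that builds the 'waterfall' matrix of per-segment spectra:

    import numpy as np

    def simple_spectrogram(x, fs, seg_ms=20):
        # divide x into short segments and compute a spectrum for each
        seg = int(fs * seg_ms / 1000)      # samples per segment
        n_segs = len(x) // seg
        spectra = []
        for i in range(n_segs):                       # step 1: short segments
            frame = x[i * seg:(i + 1) * seg]
            frame = frame * np.hanning(seg)           # taper the segment edges
            spectra.append(np.abs(np.fft.rfft(frame)) ** 2)   # step 2: spectrum
        return np.array(spectra)                      # step 3: one row per segment

    fs = 11025
    t = np.arange(0, 0.5, 1 / fs)
    x = np.sin(2 * np.pi * (300 + 400 * t) * t)       # test signal
    S = simple_spectrogram(x, fs)
    print(S.shape)    # (segments, frequency bins)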

Slide 55: Figure 10: Time and frequency domain representations of the spoken phrase 'Mary had a'

Slide 56: Figure 11: Waterfall spectral display

Slide 57: Colour or greyscale coding of spectrum amplitude:
Fourth, the 3-D spectrogram goes one stage further, in that the calculated frequency amplitudes are colour coded, or greyscale coded, as illustrated in Figure 12(a). Now imagine yourself looking down onto the spectrum: what you would see (on screen) is the bar of greyscales (or colours) shown in Figure 12(b).
Fifth, if we apply greyscale (or colour) coding to each of the spectral segments and arrange the bars vertically, the result might look something like Figure 13 (a contrived spectrogram). Assuming that the highest-amplitude (strongest) peaks of the individual spectra are shown in dark grey, you can see that over time the peak increases and then decreases in frequency.
The colour coding will become clearer when you perform Experiment 6.

Slide 58: Figure 12: Greyscale coding of spectrum amplitude. Figure 13: Contrived spectrogram

Slide 59: So the time-domain and frequency-domain (spectrum) representations can be combined into a (3-D) spectrogram, a graph that displays the changes in frequency and amplitude over time.

Slide 60 - Example 1: How is the spectrogram interpreted? An example will illustrate how effective the spectrogram can be in identifying the strong resonances associated with the vocal tract. Figure 14 shows the spectrogram for an exaggerated utterance of the sound 'a' in the word 'hay'.
The scale at the top of the figure shows the elapsed recording time in milliseconds. Below that is the quantized waveform display. At the bottom is the 3-D greyscale spectrogram. The vertical frequency scale of the spectrogram covers the range 0-8 kHz, corresponding to the default 16 kHz sampling rate (Fs = 2 × Fmax).

Slide 61: Figure 14: Formant frequencies for the vowel 'a' as in 'hay'

Slide 62: Formants (very important): The spectrogram in Figure 14 shows four black (or dark grey) bands, corresponding to strong frequency peaks, or resonances. We measured the first resonant peak at a frequency of 213 Hz; the second, third and fourth peaks occurred at 1600, 2453 and 3467 Hz respectively.
These resonances of the vocal tract are called formants and are usually referred to as F1, F2, F3, F4, and so on. The first three formants are key characteristics for phoneme recognition, whilst F4 and F5 are thought to indicate the tonal quality of the voice.
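One simple way to locate such resonant peaks programmatically is peak-picking on a frame's spectrum; a rough sketch (SciPy assumed; this is not the CSLU Toolkit's method, and the synthetic frame, function name and thresholds are illustrative):

    import numpy as np
    from scipy.signal import find_peaks

    def rough_formants(frame, fs, n=4):
        # return the frequencies of the n strongest spectral peaks in a frame
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1 / fs)
        peaks, _ = find_peaks(spec, distance=10)     # candidate resonances
        strongest = peaks[np.argsort(spec[peaks])[-n:]]
        return np.sort(freqs[strongest])

    # a synthetic 'vowel-like' frame with resonances near 213 Hz and 1600 Hz
    fs = 16000
    t = np.arange(0, 0.02, 1 / fs)
    frame = np.sin(2 * np.pi * 213 * t) + 0.6 * np.sin(2 * np.pi * 1600 * t)
    print(rough_formants(frame, fs, n=2))   # roughly [200., 1600.]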

Slide 63 - Example 2: The value of spectrograms for speech recognition. The power of spectrograms for speech recognition is best demonstrated by comparing common elements of speech. The five parts of Figure 15 show spectrograms for utterances of the word equivalents of the five English vowels 'a', 'e', 'i', 'o' and 'u' respectively.
Look carefully at the dark horizontal lines in Figure 15. These lines track the shift of the formants with time; that is, the variation in the resonant frequency of the vocal tract during an utterance of each vowel. You can see that each utterance has its own distinct set of lines, or formant contours, providing another acoustic characteristic that helps to recognize individual phonemes, and hence words.

Slide 64: Figure 15: Spectrograms for the spoken equivalents of the vowels /a/, /e/, /i/, /o/, /u/

Slide 65 - Example 3: 'bow' and 'cow'. Let's use the rhyming words 'bow' and 'cow'. These two words have the same ending, 'ow', but different starts corresponding to the consonants 'b' and 'c'. The spectrograms are shown in Figure 16. Again, look closely at the dark lines representing the formant contours, particularly their number and shape.
As expected, the right-hand halves of the spectrograms are very similar, albeit not identical. The left-hand sides show distinctive features for each consonant ('b' or 'c'). The central portions of each spectrogram show some differences, corresponding to the transition from one phoneme to another; this effect is referred to as co-articulation.

Slide 66: Figure 16: Spectrograms of the words 'bow' and 'cow'

Slide 67: Co-articulation: This effect is the set of phonetic changes created as the articulators move from their initial position to a new position so as to create the new sound. It has been observed experimentally that co-articulation effects hold important clues for word recognition.

Slide 68 - Activity 7 (self-assessment): Look carefully at the three spectrograms shown in Figure 17. Two of the words represented by these spectrograms end in the same vowel sound. By tracking the first four formants, can you identify which two?
Figure 17: Spectrograms for Activity 7

Slide 69: We need to track the formants by drawing some lines across the spectrograms, as shown in Figure 18. Once these are drawn, it becomes clear that samples (a) and (b) share the same vowel ending, whilst sample (c) is quite different. In fact the first two words are 'bay' and 'hay'; the third word is 'pow'.
Figure 18: Spectrograms with formant lines

Slide 70 - Experiment 6 (vowel spectrograms): This would be an appropriate point to break off and complete Experiment 6 in Book E, Part 1.

Slide 71 - Example 4: Transitions between phonemes in 'pan' and 'ban'. Figure 19 shows a speech recording made to explore the transitions between phonemes. Each word is made up of three phonemes. The words 'pan' and 'ban' differ in the first phoneme, whilst 'ban' and 'bat' differ in the last phoneme. These differences show up quite clearly in the spectrogram (see Figure 19).
Comparing 'pan' and 'ban', you can see that the initial plosive phoneme ('p' or 'b') slightly changes the second phoneme; the transitions are different. Similarly, the final phoneme of 'ban' and 'bat' is influenced by the initial phoneme pair 'ba'. The word 'ban' has a long vowel sound that runs into the nasal 'n', whilst 'bat' has a short vowel separated from the terminal plosive 't'.

Slide 72: Figure 19: Spectrograms for the words 'pan', 'ban' and 'bat'

Slide 73: So why is speech recognition difficult?
1- The previous examples highlight the fact that many measurements are involved, such as capturing the correct formant contours, the transitions between phonemes (co-articulation) and the strengths of the formants. These measurements have been associated with the spectrogram only; other measurements may be associated with extracting further features of the speech signal, such as pitch period extraction.
2- Writing the computer programs to extract these features, and then writing the code that makes correct decisions about what was said, is a very difficult task! There are many other factors that add to the difficulty, but we will not go into their details here.

Slide 74 - Experiment 7 (consonant spectrograms): This would be an appropriate point to break off and complete Experiment 7 in Book E, Part 1.

Slide 75 - Sub-topic 3.4: Phoneme characterization
The developers of the CSLU Toolkit have:
1- recorded thousands of word pronunciations
2- measured the formant contours for individual phonemes, and
3- measured the transitions between phonemes (co-articulation)
Based on this experimental data they have determined that all the various combinations of phonemes can be represented by 544 distinct phoneme categories. The process of speech analysis for recognition purposes, somewhat simplified, consists of two steps, as follows:

Slide 76 - Step 1: Feature extraction (Figure 20): The digital speech is analysed in frames of 5-20 ms duration, with successive frames spaced 10 ms apart. For each frame the spectrum is calculated and a number of spectral features (such as the formant frequencies) are extracted and stored.
The short duration of the frames means that they cannot capture all the co-articulation effects. To overcome this, the spectral data of a single frame is combined with the spectral data from the frames at -60, -30, +30 and +60 ms relative to it, as illustrated in Figure 20. Five neighbouring time frames thus make up a single context window, which is represented by some 130 acoustic feature values with implicit temporal dependencies.
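A sketch of the context-window construction (NumPy assumed; 26 features per frame is my assumption so that five frames give the 130 values mentioned above, and the function name is illustrative):

    import numpy as np

    def context_windows(frames, offsets=(-6, -3, 0, 3, 6)):
        # Stack each frame with its neighbours at the given frame offsets.
        # With 10 ms frame spacing, offsets of +/-3 and +/-6 frames
        # correspond to the -60, -30, +30 and +60 ms neighbours.
        n, d = frames.shape               # n frames, d features per frame
        windows = []
        for i in range(n):
            parts = []
            for off in offsets:
                j = min(max(i + off, 0), n - 1)   # clamp at the edges
                parts.append(frames[j])
            windows.append(np.concatenate(parts))
        return np.array(windows)          # shape (n, 5 * d)

    frames = np.random.default_rng(1).normal(size=(100, 26))  # 26 features/frame
    print(context_windows(frames).shape)   # -> (100, 130)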

Slide 77: Figure 20: Multiple-frame context window

Slide 78 - Step 2: Phoneme estimation (Figure 21) for one context window: this step is mainly based on estimating probabilities. From the set of 130 numerical values it is possible to estimate the probability that the context window represents any one of the 544 phoneme categories. The calculation is repeated for each context window until the entire waveform has been processed.
A representation of all this processing is shown in Figure 21 for an utterance of the word 'two'. This utterance is made up of three of the 544 phoneme categories:
1- the plosive 't'
2- a transition from 't' to 'u', and
3- the 'u'
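In outline, this is a classification step: each 130-value window is mapped to a probability over the 544 categories. A toy stand-in (NumPy; the random weights here merely stand in for the trained model that the Toolkit would supply):

    import numpy as np

    rng = np.random.default_rng(2)
    W = rng.normal(scale=0.05, size=(544, 130))   # stand-in for trained parameters
    b = np.zeros(544)

    def category_probs(window):
        # map one 130-value context window to probabilities over 544 categories
        scores = W @ window + b
        scores -= scores.max()                    # numerical stability
        p = np.exp(scores)
        return p / p.sum()                        # softmax: probabilities sum to 1

    window = rng.normal(size=130)
    p = category_probs(window)
    print(p.shape, p.sum())                       # (544,) ~1.0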

Slide 79: Figure 21: Time-aligned phoneme categorization

Slide 80: The vertical axis of Figure 21 represents the phoneme categories and the horizontal axis represents time:
1- Each cell represents the probability of occurrence of a single phoneme category within a single context window. The darker the cell colour (as for 't' or 'u'), the higher the probability that the data within the cell represents that phoneme category.
2- Cells across a single row represent the change of probability over time (as in the transition from 't' to 'u'; see Figure 21). The dark grey squares indicate a high probability that those context windows represent a transition from a 't' to a 'u' sound.
Experiment 8 (phoneme transitions): This would be an appropriate point to break off and complete Experiment 8 in Book E, Part 1.

Slide 81 - Sub-topic 3.5: Word recognition
The final stage of the recognition process is to extract entire words, or phrases, from the captured speech data. In the case of the CSLU Toolkit, the words to be recognized are known a priori; that is, the application defines a set of words. The time-aligned phoneme categorization sequence for each word can therefore be calculated and searched for within the measured data (e.g. by the CSLU search algorithm).

Slide 82 - CSLU search algorithm (Activity 8, exploratory): The goal is to decide whether the captured data (see Figure 22) represent an utterance of the word 'yes' or the word 'no'. We can also assume that these utterances will be preceded and followed by silence, since this is an isolated-word recognizer. All other words are regarded as 'garbage'.
Step 1: Extracting phoneme category results from the measured data (Figure 22): the time-aligned phoneme category results extracted from the measured data (for the two words, i.e. 'yes' or 'no') are shown in Figure 22.

Slide 83: Figure 22: Time-aligned phoneme categorization results for the captured utterance

Slide 84 - Step 2: Generating the word search template (Figure 23): convert the two target words ('yes' or 'no') into their equivalent time-aligned sequences of phoneme categories (the word search template).
A- For the word 'yes' the known sequence comprises seven phoneme categories (see Figure 23):
1- $sil < y: transition from silence to the start of the 'y'
2- y > $mid: transition from 'y' to the next phoneme
3- $front < E: the front of the 'e' phoneme
4- the middle of the 'e' phoneme
5- E > $fric: end of the 'e' into a fricative phoneme
6- $mid < s: transition to 's'
7- s > $sil: transition from 's' to silence

Slide 85: Figure 23: Word search template

Slide 86: B- For the word 'no' the known sequence comprises five phoneme categories (see Figure 23):
1- $sil < n: transition from silence to the start of the 'n'
2- n > $back: transition from 'n'
3- $nas < oU: transition from the nasal to the start of the 'o' sound
4- the middle of the 'o' sound
5- oU > $sil: transition from the 'o' to silence
Important note: you do not need to remember the details of these categorizations; they are included here for illustration only!
Step 3: Deciding which word was spoken (Figure 23 laid over Figure 22): Let's now go back to the question of Activity 8: which of the words 'yes' or 'no' is represented by the data shown in Figure 22?

Slide 87: A- Imagine that you can pick up Figure 23 (the search templates for the known sequences of the words 'yes' and 'no') and place it over Figure 22 (the time-aligned phoneme category results extracted from the measured data).
B- Then compare the holes in the word search template (Figure 23) with the locations of the high-probability squares (the darkest shade of grey) in Figure 22.
C- Assume each square that shows through a hole has a numeric value (the probability estimate); these values can be combined to provide a final estimate of the probability of each word. This must be repeated for both the top and bottom parts of the search template, and whichever part gives the higher total decides the word that was spoken.
The correct answer in the present case is that 'yes' was the word spoken.
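A schematic version of steps A-C (NumPy; the probability matrix and the template cell positions below are made up purely for illustration): sum the probabilities that show through each word's template holes and pick the larger total.

    import numpy as np

    rng = np.random.default_rng(3)
    n_categories, n_windows = 544, 60
    probs = rng.random((n_categories, n_windows))     # stand-in for Figure 22 data

    # each template is a list of (category index, time window) 'holes'
    templates = {
        "yes": [(10, t) for t in range(5, 25)],       # hypothetical cell positions
        "no":  [(20, t) for t in range(5, 20)],
    }

    def score(template):
        # total probability showing through the template's holes
        return sum(probs[c, t] for c, t in template)

    best = max(templates, key=lambda w: score(templates[w]))
    print({w: round(score(tpl), 2) for w, tpl in templates.items()}, "->", best)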

Slide 88: Word spotting: The phoneme category matching process can also be applied to the task of recognizing single words or phrases within longer utterances. Suppose that in designing a vending machine we want to recognize the type of drink ordered by a customer, where the options on offer are 'tea', 'coffee', 'hot chocolate' or 'orange juice'.
The user might not give a clear answer such as 'tea'. They may give a succinct order by saying 'Tea, please', or they might say something like 'Well, let me think, umm... I'd like some coffee'.
The solution is to use the technique of word spotting; that is, looking for a key word (or phrase) within the spoken phrase, in our case the key words 'tea', 'coffee', 'hot chocolate' or 'orange juice'.

Slide 89: The ASR would be set to recognize the combination ANY <key word> ANY, where 'ANY' stands for anything other than the key word or silence. Provided the key word has a much higher probability than any other word (or silence) in its set of time-aligned phoneme categories, it will be recognized.
Word spotting plays a key part in the performance of the speech recognition engine built into the CSLU RAD design tool that you will meet in Books D and E.

Slide 90 - Topic 4: Preparation for the next session
1) Read Book S (Speech Recognition)
2) Do all the activities in Book S
3) Do Experiments 1 to 8 in Book E
4) Read Parts 1 and 2 of Book D
5) Familiarize yourself further with the CSLU Toolkit's Rapid Application Developer (RAD)
6) Try to finish TMA01 (the cut-off date is 6/12/2008)

