
1 Pattern and Speech Recognition: Speech Recognition. John Beech, School of Psychology, PS1000

2 Speech Recognition Listening to speech is not like reading. Speech sounds are produced by changing the position and shape of the tongue and the position and shape of the lips. The shape of the vocal tract changes continuously in a fluid way, and each shape depends on the previous ones. The individual sounds of words are called phonemes. E.g. 'rat' and 'cat' differ by just one phoneme. There are about 40 phonemes in English, with an average duration of 50 msec (1/20 sec). In the context of a word, this is a very short time in which to identify each phoneme from among 40 alternatives. Also, phonemes are not simple and distinct in the way that the letters of printed words are.

3 Speech perception Often there are quite subtle differences between the sounds of phonemes. E.g. in both [ba] and [pa] the lips open and soon afterwards the vocal folds vibrate: in [ba] after about 20 ms, and in [pa] after about 40 ms. The longer gap between the lip opening and the onset of voicing for [pa] arises because there is more of a puff of air, which lasts longer. This interval is known as the 'voice onset time'. Thus [b] and [p] sounds are differentiated by voice onset time. [Figure: timelines of lip opening and voice onset, showing voice onset at 20 ms for 'ba' and 40 ms for 'pa'.]

4 Speech perception If we take any point in time, the sound being produced is not just the intended sound: it is affected by the previous sound AND the following sound. This subtle variation in sound is known as co-articulation. Thus phonemes can vary according to context; for example, the [p] in [pit] sounds different from the [p] in [spit]: in [pit] the [p] is more aspirated.

5 Speech perception Phonemes are produced by distinct articulations, e.g. [t] at the front of the mouth and [k] at the rear. Phonemes vary on three dimensions: amplitude, frequency and duration. These can be measured with a speech spectrogram.
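To make this concrete, here is a minimal sketch (not part of the original lecture) of how a spectrogram of the kind shown on the next slides can be computed with standard Python tools; the filename speech.wav and the 25 ms analysis window are illustrative assumptions.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

# Read a mono recording: sampling rate in Hz plus the raw waveform.
fs, samples = wavfile.read("speech.wav")
samples = samples.astype(np.float64)

# Short-time Fourier analysis: frequency content within ~25 ms windows.
freqs, times, power = spectrogram(samples, fs=fs, nperseg=int(0.025 * fs))

# Plot power on a decibel scale: time on x, frequency on y.
plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12))
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Speech spectrogram")
plt.show()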

6 Speech perception Speech spectrograms show how words look when depicted by their frequencies over time. Note how, even when the beginning consonant is the same, the shape of the speech signal differs depending on the following vowel. [Figure: eight spectrograms. Words used: first column, going down: bet, bee, boat and bird; second column, going down: debt, deal, dome and dirt.]

7 [Figure: spectrogram and phonetic transcription of "speech perception lab".]

8 The McGurk effect: a demonstration What we think we hear is sometimes not what we actually hear. The effect is named after Harry McGurk. In the following demonstration: 1. Sound and sight: looking at the face while listening to the sound, you should hear 'Da'. This is an illusory sound midway between [ba] and [ga]. 2. Sound alone: next, close your eyes and listen to the sound, and you should hear 'Ba'. 3. Sight alone: put your fingers in your ears and look at the face, and you should perceive the sound to be 'Ga'. In summary: sound and sight, 'Da'; sound alone, 'Ba'; sight alone, 'Ga'.

9 Speech perception What we think we hear is sometimes not what we actually hear. Bastian et al. (1961) inserted a gap of silence after the [s] of a word. If the gap was short, people thought it was [sleet]; if it was longer, they thought it was [spleet].

10 Categorical perception Liberman et al. (1957) demonstrated that the perception of phonemes is categorical. As mentioned before, phonemes with voice onset times of less than 20 msec are perceived as /b/, and those with voice onsets of more than 40 msec are perceived as /p/. [Figure: timelines showing the lips opening at the start, with voice onset at 20 msec for 'ba' and 40 msec for 'pa'.]

11 Categorical perception In both [ba] and [pa] the lips open and soon afterwards the vocal folds vibrate: in [ba] voice onset is less than 20 ms, and in [pa] it is after 40 ms. Liberman et al. created artificial versions of [ba] and [pa] with different voice onset times, then played pairs in sequence and asked whether the sounds were the same or different. In the Liberman et al. (1957) task, if the participant thinks he or she hears 'pa'-'pa', they say 'same'; 'ba'-'ba', 'same'; 'pa'-'ba', 'different'. The stimulus materials varied in voice onset time.

12 Categorical perception Liberman et al. found that if both voice onsets were 0-20 ms, or both were 40-60 ms, i.e. on the same side of the boundary, then participants would say that there was no difference. If they were on different sides of the boundary (e.g. the first with a voice onset of 15 msec and the second of 45 msec), they said 'different'. Liberman et al. showed that the whole range of voice onset times between 0 and 20 msec is classified within the category of a 'b' sound.
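The logic of the task can be captured in a small sketch (my illustration, not Liberman et al.'s materials); the 30 ms boundary between the 0-20 ms and 40-60 ms ranges is an assumption for illustration.

# Perceived category flips at an assumed /b/-/p/ boundary of 30 ms.
BOUNDARY_MS = 30

def perceive(vot_ms: float) -> str:
    """Map a voice onset time to a perceived phoneme category."""
    return "ba" if vot_ms < BOUNDARY_MS else "pa"

def same_or_different(vot1_ms: float, vot2_ms: float) -> str:
    """Listeners compare categories, not raw onset times."""
    return "same" if perceive(vot1_ms) == perceive(vot2_ms) else "different"

print(same_or_different(0, 20))   # same: both heard as 'ba'
print(same_or_different(40, 60))  # same: both heard as 'pa'
print(same_or_different(15, 45))  # different: opposite sides of the boundary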

13 Categorical perception One hypothesis is that we learn which voice onset times are relevant. Unfortunately for that hypothesis, categorical perception is also found in 1-month-olds (Eimas et al., 1971)! This suggests that we are born able to recognize phonemes, including phonemes of other languages. By 10 months, the processing of phonemes is specific to the mother tongue and not to other languages. Do we have phoneme detectors? Not in the case of vowels: unlike consonants, we do not categorize vowels.

14 Phonemic boundaries Summerfield (1981) showed that the voice onset boundary between [ba] and [pa] changed according to the rate of the speech in which it was heard. Thus at a fast speech rate a [ba] with a short onset time could be perceived as a [pa], as though it had a long onset time. This suggests that we learn to interpret phonemes according to the rate of speech. However, babies can do it too!
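A minimal sketch of this rate-normalization idea (my illustration of the principle, with made-up numbers): the perceptual boundary is scaled by the speaking rate rather than being an absolute duration.

def perceive(vot_ms: float, syllables_per_sec: float) -> str:
    """Interpret voice onset time relative to speaking rate."""
    baseline_rate = 4.0  # assumed 'normal' rate in syllables per second
    # At faster rates the boundary shrinks proportionally.
    boundary_ms = 30.0 * baseline_rate / syllables_per_sec
    return "ba" if vot_ms < boundary_ms else "pa"

print(perceive(25, 4.0))  # 'ba': 25 ms is short against a 30 ms boundary
print(perceive(25, 6.0))  # 'pa': at a fast rate the boundary drops to 20 ms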

15 Phonemic boundaries Eimas and Miller (1980) showed that babies can judge relative durations across different speech contexts. This must be useful, as different speech rates would otherwise be very confusing when learning a language. However, at the moment it is difficult to see how this is achieved.

16 Our perception of words Words are built up from phonemes and syllables; words in turn are used to construct sentences. In making sense of speech sounds (going up), we start at the level of processing phonemes, and these in turn are processed into higher-order units as shown. In producing a sentence (going down), we have a thought at the level of the phrase or longer, which we then need to decompose in order to produce speech.

17 Our perception of words An early theory was Morton's (1969, 1970) logogen model. The ears (and/or eyes) are bombarded with word features (acoustic, semantic and visual). Logogens are word units that fire if a critical level of activation is reached. This level changes with word frequency, just as in Treisman's model.
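A minimal sketch of a logogen-style unit (my illustration, not Morton's formal specification; the numbers are arbitrary assumptions): evidence accumulates, and high-frequency words fire at a lower threshold.

class Logogen:
    def __init__(self, word: str, frequency: float):
        self.word = word
        self.activation = 0.0
        # More frequent words get a lower firing threshold.
        self.threshold = 10.0 / (1.0 + frequency)

    def receive(self, evidence: float) -> bool:
        """Accumulate feature evidence; report whether the unit fires."""
        self.activation += evidence
        return self.activation >= self.threshold

common = Logogen("the", frequency=9.0)   # threshold 1.0
rare = Logogen("thane", frequency=0.1)   # threshold about 9.1
for logogen in (common, rare):
    fired = any(logogen.receive(1.0) for _ in range(5))  # five bursts of evidence
    print(logogen.word, "fired" if fired else "did not fire")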

18 Our perception of words On average we know 60,000-75,000 words, excluding variants (e.g. like, likes, liked, liking, alike, liken, likeness, likewise). This means that each word we hear is accessed from this lexicon. How? The question is: which elements of the sounds in a word are used to access the correct entry? Do we use phonemes?

19 Our perception of words Infants seem to use syllables for organisation, or at least they organise according to the rhythms of their language. Experimental work suggests that for adults syllabic structure is important in French but not in English (Mehler et al., 1981). However, we are not much further on: we are still not sure how a word is accessed from the lexicon.

20 What's in a word? We have already described the constituent sounds, but there is more to a word than this: there is its meaning. Words consist of morphemes (not to be confused with morphine). These are speech elements which have a meaning and which cannot be subdivided into further meaningful elements. E.g. 'book' is one morpheme, while 'books' is two: 'book' and '-s' for the plural.

21 What's in a word? Morphemes are of two kinds: stems and affixes ('-s', '-ing'). In English the more common affix is the suffix at the end of a word (e.g. '-s'), and the less common is the prefix (e.g. 'un-stable'). Other languages can have infixes, affixes put into the stem, but this occurs rarely and usually informally in English. (A deliberately naive sketch of stem-plus-suffix segmentation follows.)
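A deliberately naive sketch (my illustration; the suffix list is an assumption, and real morphological analysis is far harder) of splitting a word into a stem plus one suffix:

# Longest suffixes first, so 'ness' is matched before 's'.
SUFFIXES = ["ness", "less", "ing", "ed", "s"]

def split_morphemes(word: str) -> list[str]:
    """Strip one known suffix, if any, returning [stem] or [stem, suffix]."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return [stem, "-" + suffix]
    return [word]

print(split_morphemes("books"))     # ['book', '-s']
print(split_morphemes("likeness"))  # ['like', '-ness']
print(split_morphemes("book"))      # ['book']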

22 What's in a word? With some verbs the vowel in the stem is modified to change the meaning (e.g. say-said, run-ran). This goes back to earlier times, before the suffix '-ed' was introduced to denote the past tense (cook-cooked).

23 What's in a word? Inflectional affixes do not change a word's meaning (e.g. '-ed', '-ing'), but derivational affixes do (e.g. 'un-usable', 'use-less'). It might seem that to get at meaning we analyse a word's stem and affixes. Unfortunately, there are times when this does not work. There are words with common stems (e.g. pro-blem, em-blem, blem-ish) where the stem ('blem') does not have a common meaning. (There may have been one at one time for 'blem', from 14th-century French 'blemir', to make pale.)

24 What's in a word? We have only sampled the complexity of this subject as studied by linguists. But our concern is more with psycholinguistics: we want to know how words are accessed in the mental lexicon, or dictionary.

25 How does our mental lexicon work? Marslen-Wilson (1987) found, in a shadowing task, a correlation between the time taken to recognise a word and the point at which its sound becomes unique in relation to other words. E.g. when the [z] sound in 'trousers' is encountered, this eliminates 'trowel', 'trounce' and 'trout'. We recognise a word before its acoustic offset.

26 Speech perception: How does our mental lexicon work? Marslen-Wilson (1987): listening to the word "trousers". [Figure: the number of candidate words remaining at each point, starting with "tr", falls from 238 to 4 to 1. Candidates: /tr/ /ow/ /z/ /er/ /z/ (trousers); /tr/ /ow/ /el/ (trowel); /tr/ /ow/ /n/ /s/ (trounce); /tr/ /ow/ /t/ (trout).]
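The elimination process can be sketched as follows (my illustration of the idea; the toy lexicon and the rough phoneme strings are assumptions):

# Toy lexicon of phoneme sequences; real work would use proper transcriptions.
LEXICON = {
    "trousers": ["tr", "ow", "z", "er", "z"],
    "trowel":   ["tr", "ow", "el"],
    "trounce":  ["tr", "ow", "n", "s"],
    "trout":    ["tr", "ow", "t"],
}

def cohort(heard: list[str]) -> list[str]:
    """Words still consistent with the phonemes heard so far."""
    n = len(heard)
    return [word for word, phonemes in LEXICON.items() if phonemes[:n] == heard]

heard = []
for phoneme in LEXICON["trousers"]:
    heard.append(phoneme)
    print(heard, "->", cohort(heard))
# After /tr/ /ow/ all four words survive; the /z/ leaves only 'trousers'.
# That is the word's uniqueness point, reached before its acoustic offset.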

27 Speech Perception: Zwitserlood (1989) Zwitserlood (1989) used cross-modal priming (i.e. listening and reading together) with Dutch participants; in translation, this is what happened. People heard just the segment of speech 'capt', which is part of the word 'captain'. As soon as the [t] had been sounded, the printed word 'ship' was shown on the screen. The task was to press either a 'word' button or a 'non-word' button according to whether the string on the screen was a real word. [See next figure]

28 Speech Perception: Zwitserlood (1989) Heard 'capt…', shown the word 'ship': press the 'yes' (word) button. Heard 'capt…', shown 'blag': press the 'no' (non-word) button. Priming effects: 1. A control word (e.g. 'boot') was responded to more slowly than the word 'ship'. 2. 'Captive' was also responded to faster than a control word. Conclusion: when listening to 'captain' and reaching the /t/ sound, all candidate words are activated ('captain', 'captive', 'capture').

29 Speech Perception: Zwitserlood (1989) However, if the full word was spoken (either 'captive' or 'captain'), then only the related word was activated. Thus the spoken word 'captive' would influence only the printed word 'captive' and not the printed word 'captain'. Hearing the full word meant that the competitor (e.g. 'capture') had been rapidly de-activated. [As shown in the next figure]

30 Speech Perception: Zwitserlood (1989) BUT if participants heard the full word 'captain' and were shown the word 'ship' (pressing the 'yes' button), then only 'captain' primed the word 'ship'. Conclusion: after passing the /t/, hearing the full word meant that 'captive' had been rapidly de-activated.

31 Getting at the meaning of a word: words with several meanings Entries in a dictionary describe the meanings of words, and the mental lexicon stores these word meanings. However, what about the many words that have several meanings, such as 'bay'? E.g. bay: 1. a type of shore; 2. an alcove; 3. the cry of a hound; 4. the laurel; 5. a reddish-brown colour. Which way could our mental lexicon work? 1. Activate all five meanings in parallel? 2. Activate only one of the five, given the context?

32 Swinney (1979): words with several meanings Swinney (1979) wondered whether the other meanings are also activated even when the context singles out one meaning. Participants listened to sentences with an ambiguous word embedded, e.g. 'the villains decided to kill their master so they hatched a plot…'. Immediately after the ambiguous word was spoken, they saw a printed word. For instance, after hearing the ambiguous word 'plot' they might see a word related to one of its two different meanings: either 'land' or 'plan'. The task was to decide whether the word on the screen was a real word or not: for 'land' they would respond 'word', but for 'blag' they would respond 'non-word'.

33 Swinney (1979): words with several meanings Start of the sentence they listened to: 'the villains decided to kill their master so they hatched a plot…'. Having listened to '…hatched a plot…', they saw a word or non-word: 'plan' (congruent with the sentence); 'land' (incongruent with the sentence, congruent with 'plot'); 'bike' (incongruent with the sentence, neutral with respect to 'plot'); or 'blag' (a non-word). Result: 'plan' and 'land' were faster than 'bike'. Conclusion: both meanings of 'plot' are highly activated at this point, even though the sentence context points to 'plan' and not to 'land'.

34 Swinney (1979): words with several meanings Thus Swinney found that this sentence context activated both of the different meanings, 'land' and 'plan': decision times were faster for both words compared with a neutral condition. In other words, a sentence context activates both alternative meanings of the word 'plot'. This showed that the activation of both meanings was high at the point when the printed word was shown, even though the sentence context was pointing to only one particular meaning.

35 Swinney (1979): words with several meanings Swinney THEN tried '…hatched a plot to kill…' before presenting the words. This time 'land' had been de-activated while 'plan' was still active. Further work refined Swinney's findings and showed that the activation of a particular meaning was also determined by its frequency of use: if an alternative meaning was rarely used, its activation was weak in the first place. So the effect happens most strongly when the two meanings are roughly equally likely.
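The overall pattern, broad activation weighted by frequency followed by suppression of the context-inappropriate sense, can be sketched as follows (my numbers, not Swinney's data):

# Assumed relative frequencies of the two senses of 'plot'.
SENSES = {"plot": {"plan": 0.6, "land": 0.4}}

def activate(word: str) -> dict[str, float]:
    """Initial access: every sense is activated in proportion to its frequency."""
    return dict(SENSES[word])

def suppress(activations: dict[str, float], context_sense: str) -> dict[str, float]:
    """Later selection: the context-appropriate sense survives, the rest decay."""
    return {sense: (level if sense == context_sense else level * 0.1)
            for sense, level in activations.items()}

activations = activate("plot")        # both senses primed at first
print(activations)                    # {'plan': 0.6, 'land': 0.4}
print(suppress(activations, "plan"))  # 'land' rapidly de-activated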

36 Tanenhaus et al. (1979): words with several meanings Tanenhaus et al. (1979) used the same type of paradigm. They looked at words whose alternative meanings also differ in grammatical terms: would both forms still be activated? For instance, 'watch' has two meanings, one for the noun and one for the verb; similarly 'cross'. In the sentence 'Jim started to…' only a verb can follow. In our mental dictionary, do we scan only for a verb at this point? The answer is 'no': we activate both the noun and the verb meanings. [See next figure]

37 Tanenhaus et al. (1979): words with several meanings Listened to: 'Jim started to watch…'. Saw a word: 'look' (congruent with the sentence); 'clock' (incongruent); 'chair' (neutral); or 'blag' (a non-word). Result: 'look' and 'clock' were faster than 'chair'. Conclusion: both meanings of 'watch' are highly activated, even though one is a noun and the other a verb.

38 Summary of Speech Recognition Phonemes are the smallest units of speech and vary in loudness, pitch and duration. When we speak, each phoneme can vary according to the surrounding phonemes (co-articulation). Also, we interpret phonemes according to the rate of speech rather than the absolute duration of each phoneme. Thus we differentiate [t] and [d] in terms of their duration, but if the rate of speech is fast we do not become confused, as we take into account that, relatively speaking, [t] and [d] still have different durations.

39 Summary of Speech Recognition We can hypothesize what we hear (e.g. s__eet). Also, we categorize consonant phonemes: we do not tolerate ambiguity. But we do not categorize vowel phonemes. In English we probably do not use syllables or morphemes in order to understand a word, but French listeners do seem to use syllables. We look up words in an ordinary dictionary sorted in alphabetical order, getting closer to the word by a series of decisions; this could be described as a linear process.

40 Summary of Speech Recognition The mental lexicon is different. There is a wide activation of meanings compatible with the stream of sounds, irrespective of the appropriateness of the syntax (grammar). Word frequency matters: highly frequent words are strongly activated. There is also a countervailing suppression of inappropriate entries: although several entries are activated, once there is more information to narrow things down to one word, the other words are rapidly de-activated. This process takes place during, or just after, hearing the word.

