1 Syllables and Concepts in Large Vocabulary Continuous Speech Recognition
Paul De Palma, Ph.D. Candidate, Department of Linguistics, University of New Mexico
Slides available at: www.cs.gonzaga.edu/depalma

2 An Engineered Artifact
Syllables: a principled word segmentation scheme; no claim about human syllabification
Concepts: words and phrases with similar meanings; no claim about cognition

3 Reducing the Search Space
ASR answers the question: what is the most likely sequence of words given an acoustic signal? It considers many candidate word sequences. To reduce the search space, reduce the number of candidates:
1. Using syllables in the language model
2. Using concepts in a concept component

4 Syllables in LM: Why?
Switchboard (Greenberg, 1999, p. 167)
[Figure: Cumulative Frequency as a Function of Frequency Rank]

5 Most Frequent Words are Monosyllabic
[Table: Number of Syllables per Word vs. % of Corpus by Token and % of Corpus by Type; values lost in transcription]
Polysyllabic words are easier to recognize (Hamalainen, et al., 2007)
And (of course) there are fewer syllables than words (Greenberg, 1999, p. 167)

6 Reduce the Search Space 2: Concept Component
Word Map and Syllable Map for the concept GO (each phrase and its syllable string maps to GO):
A flight → ax f_l_ay_td
A ticket → ax t_ih k_ax_td
book airline travel → b_uh_kd eh_r l_ay_n t_r_ae v_ax_l
book reservations → b_uh_kd r_eh s_axr v_ey sh_ax_n_z
Create a reservation → k_r_iy ey_td ax r_eh z_axr v_ey sh_ax_n
Departing → d_ax p_aa_r dx_ix_ng
Fly → f_l_ay
Flying → f_l_ay ix_ng
get → g_eh_td
I am leaving → ay ae_m l_iy v_ix_ng

7 The (Simplified) Architecture of an LVCSR System
1. Feature Extractor: transforms an acoustic signal into a sequence of 39-dimensional feature vectors; the province of digital signal processing
2. Acoustic Model: a collection of probabilities of acoustic observations given word sequences
3. Language Model: a collection of probabilities of word sequences
4. Decoder: guesses a probable sequence of words given an acoustic signal by searching the product of the probabilities found in the acoustic and language models

8 Simplified Schematic
signal → Feature Extractor → Decoder → Words
The Decoder draws on the Acoustic Model and the Language Model.

9 Enhanced Recognizer
signal → Feature Extractor → Decoder → Syllables → Concept Component (my work) → Syllables, Concepts
The Decoder draws on the Acoustic Model, P(O|S) (assumed), and the Syllable Language Model, P(S) (my work).

10 How ASR Works
Input is a sequence of acoustic observations: O = o_1, o_2, ..., o_t
Output is a string of words: W = w_1, w_2, ..., w_n
Then "the hypothesized word sequence is that string W in the target language with the greatest probability given a sequence of acoustic observations":
W* = argmax_{W ∈ L} p(W | O)   (1)

11 Operationalizing Equation 1
W* = argmax_W p(W | O)   (2)
Using Bayes' Rule:
W* = argmax_W p(O | W) p(W) / p(O)   (3)
Since the acoustic signal is the same for each candidate, p(O) can be dropped and (3) rewritten:
W* = argmax_W p(O | W) p(W)   (4)
p(O | W) is the Acoustic Model (likelihood); p(W) is the Language Model (prior probability); the argmax search is the Decoder.
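Equation (4) can be illustrated with a toy decoder that scores each candidate by the product of its acoustic and language model probabilities. The candidates and all probabilities below are invented for illustration; a real decoder searches a vast lattice, not a two-item list.

```python
# Toy illustration of equation (4): the decoder returns the candidate
# word sequence W maximizing p(O|W) * p(W). All numbers here are
# invented for illustration only.
def decode(acoustic_scores, lm_scores, candidates):
    """Pick the argmax over candidates of acoustic likelihood times LM prior."""
    return max(candidates, key=lambda w: acoustic_scores[w] * lm_scores[w])

acoustic_scores = {"recognize speech": 0.0004, "wreck a nice beach": 0.0007}
lm_scores = {"recognize speech": 0.3, "wreck a nice beach": 0.01}

best = decode(acoustic_scores, lm_scores, list(acoustic_scores))
print(best)  # recognize speech: the LM prior outweighs the acoustic edge
```

Even though the acoustically better match is "wreck a nice beach", the language model prior tips the product toward the sensible hypothesis, which is exactly the role p(W) plays in (4).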

12 LM: A Closer Look
A collection of probabilities of word sequences:
p(W) = p(w_1 ... w_n)   (5)
By the probability chain rule:
p(w_1 ... w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1 w_2) ... p(w_n | w_1 ... w_{n-1})   (6)

13 Markov Assumption
Approximate the full decomposition of (6) by looking only a specified number of words into the past:
Bigram → 1 word into the past
Trigram → 2 words into the past
...
n-gram → n-1 words into the past

14 Bigram Language Model
Def. bigram probability:
p(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})   (7)
Minicorpus, with sentence-boundary markers <s> and </s>:
<s> paul wrote his thesis </s>
<s> james wrote a different thesis </s>
<s> paul wrote a thesis suggested by george </s>
<s> the thesis </s>
<s> jane wrote the poem </s>
p(paul|<s>) = count(<s> paul)/count(<s>) = 2/5
P(paul wrote a thesis) = p(paul|<s>) * p(wrote|paul) * p(a|wrote) * p(thesis|a) * p(</s>|thesis) = .075
P(paul wrote the thesis) = p(paul|<s>) * p(wrote|paul) * p(the|wrote) * p(thesis|the) * p(</s>|thesis) = .0375
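The bigram arithmetic on this slide can be checked in a few lines of Python. The sentence segmentation of the minicorpus below is reconstructed so that the quoted probabilities (.075 and .0375) come out exactly; `<s>` and `</s>` are the sentence-boundary markers.

```python
from collections import Counter

# Minicorpus from the slide, with sentence-boundary markers restored.
corpus = [
    "<s> paul wrote his thesis </s>",
    "<s> james wrote a different thesis </s>",
    "<s> paul wrote a thesis suggested by george </s>",
    "<s> the thesis </s>",
    "<s> jane wrote the poem </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens[:-1])           # every token can start a bigram except </s>
    bigrams.update(zip(tokens, tokens[1:]))

def p(w, prev):
    """Equation (7): p(w | prev) = count(prev w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(words):
    """Probability of a sentence under the bigram model, boundary markers added."""
    tokens = ["<s>"] + words.split() + ["</s>"]
    prob = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        prob *= p(w, prev)
    return prob

print(sentence_prob("paul wrote a thesis"))   # ≈ 0.075
print(sentence_prob("paul wrote the thesis")) # ≈ 0.0375
```

The two products decompose exactly as on the slide: 2/5 * 2/2 * 2/4 * 1/2 * 3/4 = .075 and 2/5 * 2/2 * 1/4 * 1/2 * 3/4 = .0375.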

15 Experiment 1: Perplexity
Perplexity PP(X) is functionally related to entropy H(X); entropy is a measure of information.
Hypothesis: PP(syllable LM) < PP(word LM), i.e., the syllable LM contains more information.

16 Definitions
Let X be a random variable and p(x) its probability function. Defs:
H(X) = -Σ_{x ∈ X} p(x) lg p(x)   (1)
PP(X) = 2^H(X)   (2)
Given certain assumptions [1] and the definition of H(X), PP(X) can be transformed to:
PP(W) = p(w_1 ... w_n)^{-1/n}
Perplexity is the n-th root of the inverse probability of a word sequence.
1. X is an ergodic and stationary process, and n is arbitrarily large.

17 Entropy As Information [1]
Suppose the letters of a Polynesian alphabet are distributed as follows:
p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8
Calculate the per-letter entropy:
H(P) = -Σ_{i ∈ {p,t,k,a,i,u}} p(i) lg p(i) = 2.5 bits
2.5 bits on average are required to encode a letter (p: 100, t: 00, etc.)
1. Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
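The per-letter entropy is easy to verify in Python, using the distribution above (two frequent letters at 1/4, four at 1/8, as in Manning and Schutze's example):

```python
from math import log2

# Letter distribution for the hypothetical Polynesian alphabet.
dist = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

def entropy(p):
    """H(X) = -sum over x of p(x) * lg p(x), in bits."""
    return -sum(px * log2(px) for px in p.values())

print(entropy(dist))  # 2.5
```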

18 Reducing the Entropy
Suppose this language consists entirely of CV syllables and we know their distribution. We can then compute the per-syllable entropy from the conditional entropy of vowels given consonants, H(V|C), where V ∈ {a,i,u} and C ∈ {p,t,k}:
per-syllable entropy = 2.44 bits
Entropy for two letters under the letter model: 2 × 2.5 = 5 bits
Conclusion: the syllable model contains more information than the letter model.

19 Perplexity As Weighted Average Branching Factor
Suppose the letters of the alphabet occur with equal frequency. Then at every fork we have 26 equally likely choices.

20 Reducing the Branching Factor
Suppose 'E' occurs 75 times more frequently than any other letter, and let p(any other letter) = x.
Then 75x + 25x = 1, since there are 25 such letters, so x = .01.
Since any letter w_i is either E or one of the other 25 letters, p(E) = 75 × .01 = .75 and p(any other letter) = .01.
There are still 26 choices at each fork, but 'E' is 75 times more likely than any other choice. Perplexity is reduced; the model contains more information.

21 Perplexity Experiment
Reduced perplexity in a language model is used as an indicator that an experiment with real data might be fruitful.
Technique (for both syllable and word corpora):
1. Randomly choose 10% of the utterances from a corpus as a test set
2. Generate a language model from the remaining 90%
3. Compute the perplexity of the test set given the language model
4. Compute the mean over twenty runs of step 3
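The four steps can be sketched as a generic loop. Here `build_lm` and `perplexity_of` are hypothetical placeholders for whatever LM toolkit is in use, not functions from the slides:

```python
import random

def perplexity_experiment(utterances, build_lm, perplexity_of,
                          runs=20, held_out=0.10):
    """Steps 1-4 from the slide: repeated random 90/10 splits, then the
    mean test-set perplexity over `runs` runs. `build_lm(train)` and
    `perplexity_of(lm, test)` are caller-supplied placeholders."""
    results = []
    for _ in range(runs):
        shuffled = random.sample(utterances, len(utterances))
        cut = int(len(shuffled) * held_out)
        test, train = shuffled[:cut], shuffled[cut:]
        lm = build_lm(train)
        results.append(perplexity_of(lm, test))
    return sum(results) / len(results)
```

Averaging over twenty fresh random splits, rather than reporting a single split, reduces the variance introduced by which utterances happen to land in the test set.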

22 The Corpora
Air Travel Information System (Hemphill, et al., 2009):
Word types: 1,604; word tokens: 219,009
Syllable types: 1,314; syllable tokens: 317,578
Transcript of simulated human-computer speech (NextIt, 2008):
Word types: 482; word tokens: 5,782
Syllable types: 537 (this will have repercussions in Exp. 2); syllable tokens: 8,587

23 Results
[Table: mean bigram and trigram perplexities, words vs. syllables, for NextIt and ATIS; values lost in transcription]
Notice the drop in perplexity from words to syllables. The trigram syllable ATIS model offers less than half as many choices at every turn as the trigram word ATIS model.

24 Experiment 2: Syllables in the Language Model
Hypothesis: a syllable language model will perform better than a word-based language model.
By what measure?

25 Symbol Error Rate
SER = (100 × (I + S + D)) / T
where I is the number of insertions, S the number of substitutions, D the number of deletions, and T the total number of symbols.
Example: SER = 100 × (2 + 1 + 1) / 5 = 80
Alignment is performed by a dynamic programming algorithm: Minimum Edit Distance.
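A minimal Levenshtein-style alignment suffices to compute SER over symbol sequences. This sketch counts substitutions, insertions, and deletions by backtracing the dynamic programming table:

```python
def edit_ops(ref, hyp):
    """Align hyp against ref by dynamic programming (Levenshtein) and
    return (substitutions, insertions, deletions)."""
    m, n = len(ref), len(hyp)
    # cost[i][j] = minimum edits turning ref[:i] into hyp[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to count each kind of edit.
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dels += 1
            i -= 1
    return subs, ins, dels

def ser(ref, hyp):
    """SER = 100 * (I + S + D) / T, with T the number of reference symbols."""
    s, i, d = edit_ops(ref, hyp)
    return 100 * (i + s + d) / len(ref)

print(ser("abcde", "abfde"))  # 20.0: one substitution over five symbols
```

Real scoring tools assign per-operation costs and handle ties differently, so counts can vary slightly between implementations; the formula itself is as on the slide.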

26 Technique
1. Phonetically transcribe the corpus and reference files
2. Syllabify the corpus and reference files
3. Build language models
4. Run a recognizer on 18 short human-computer telephone monologues
5. Compute the mean, median, and standard deviation of SER for 1-gram, 2-gram, 3-gram, and 4-gram models over all monologues

27 Results
Syllables compared to words: [table of mean SER for words vs. syllables at N = 2, 3, 4; values lost in transcription]
Syllables normed by words, means over N = 2, 3, 4: [partially lost; surviving figures include 81.74% and 85.3%]

28 Experiment 3: A Concept Component
Hypothesis: a recognizer equipped with a post-processor that transforms syllable output to syllable/concept output will perform better than one not equipped with such a processor.

29 Technique
1. Develop equivalence classes from the training transcript: BE, WANT, GO, RETURN
2. Map the equivalence classes onto the reference files used to score the output of the recognizer
3. Map the equivalence classes onto the output of the recognizer
4. Determine the SER of the modified output in step 3 with respect to the reference files in step 2

30 Results
Concepts compared to syllables: [table of mean SER for syllables vs. concepts at N = 2, 3, 4; values lost in transcription]
Concepts normed by syllables, N = 2, 3, 4: substitution 1.06, insertion 0.95, deletion 1.09, SER 1.02
A 2% decline overall. Why?

31 Mapping Was Intended to Produce an Upper Bound on SER
For each distinct syllable string that appears in the hypothesis or reference files, search each of the concepts for a match. If there is a match, substitute the concept for the syllable string:
ay w_uh_dd l_ay_kd → WANT
Misrecognition of a single syllable → no insertion
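The substitution scheme can be sketched as longest-match-first string replacement. The WANT and GO membership lists below are partial and illustrative, drawn from examples on these slides; the full equivalence classes are not reproduced here.

```python
# Illustrative (partial) equivalence classes: each member syllable
# string is replaced by its concept label.
CONCEPTS = {
    "WANT": ["ay w_uh_dd l_ay_kd"],           # "I would like" (from the slide)
    "GO": ["f_l_ay", "d_ax p_aa_r dx_ix_ng"], # "fly", "departing" (slide 6)
}

def apply_concepts(line, concepts=CONCEPTS):
    """Substitute concept labels for member syllable strings in a decoded
    line, trying longer members first so they are not shadowed."""
    pairs = sorted(
        ((member, label) for label, members in concepts.items() for member in members),
        key=lambda p: -len(p[0]),
    )
    for member, label in pairs:
        line = line.replace(member, label)
    return line

print(apply_concepts("ay w_uh_dd l_ay_kd f_l_ay"))  # WANT GO
```

Note the limitation the slide points out: this is exact string matching, so if even one syllable inside a member string is misrecognized, no concept is inserted.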

32 Misalignment Between Training and Reference Files
Equivalence classes were constructed using only the LM training transcript.
More frequent in the reference files: first person singular (I want); imperatives (List all flights)
Less frequent in the reference files: first person plural (My husband and me want); polite forms (I would like)
BE does not appear (There should be, There's going to be, etc.)

33 Summary
1. Perplexity: a syllable language model contains more information than a word language model (and probably will perform better)
2. A syllable language model results in a 14.7% mean improvement in SER
3. The very slight increase in mean SER for a concept language model justifies further research

34 Further Research
1. Test the given system over a large production corpus
2. Develop a probabilistic concept language model
3. Develop the software necessary to pass the output of the concept language model on to an expert system

35 The (Almost, Almost) Last Word
"But it must be recognized that the notion 'probability of a sentence' is an entirely useless one under any known interpretation of the term."
Cited in Jurafsky and Martin (2009) from Chomsky's 1969 essay on Quine.

36 The (Almost) Last Word
He just never thought to count.

37 The Last Word
Thanks to my generous committee:
Bill Croft, Department of Linguistics
George Luger, Department of Computer Science
Caroline Smith, Department of Linguistics
Chuck Wooters, U.S. Department of Defense

38 References
Cover, T., Thomas, J. (1991). Elements of Information Theory. Hoboken, NJ: John Wiley & Sons.
Greenberg, S. (1999). Speaking in shorthand: A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29.
Hamalainen, A., Boves, L., de Veth, J., ten Bosch, L. (2007). On the utility of syllable-based acoustic models for pronunciation variation modeling. EURASIP Journal on Audio, Speech, and Music Processing.
Hemphill, C., Godfrey, J., Doddington, G. (2009). The ATIS Spoken Language Systems Pilot Corpus. Retrieved 6/17/09.
Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Jurafsky, D., Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
NextIt. (2008). Retrieved 4/5/08.
NIST. (2007). Syllabification software. NIST Spoken Language Technology Evaluation and Utility. Retrieved 11/30/07.

39 Additional Slides

40 Transcription of a Recording
REF: (3.203,5.553) GIVE ME A FLIGHT BETWEEN SPOKANE AND SEATTLE
REF: (15.633,18.307) UM OCTOBER SEVENTEENTH
REF: (26.827,29.606) OH I NEED A PLANE FROM SPOKANE TO SEATTLE
REF: (43.337,46.682) I WANT A ROUNDTRIP FROM MINNEAPOLIS TO
REF: (58.050,61.762) I WANT TO BOOK A TRIP FROM MISSOULA TO PORTLAND
REF: (73.397,77.215) I NEED A TICKET FROM ALBUQUERQUE TO NEW YORK
REF: (87.370,94.098) YEAH RIGHT UM I NEED A TICKET FROM SPOKANE SEPTEMBER THIRTIETH TO SEATTLE RETURNING OCTOBER THIRD
REF: ( , ) I WANT TO GET FROM ALBUQUERQUE TO NEW ORLEANS ON OCTOBER THIRD TWO THOUSAND SEVEN

41 Transcribed and Segmented [1]
REF: (3.203,5.553) GIHV MIY AX FLAYTD BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW AY NIYDD AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) AY WAANTD AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) AY WAANTD TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY NIYDD AX TIH KAXTD FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RAX TER NIXNG AAKD TOW BAXR THERDD
REF: ( , ) AY WAANTD TUW GEHTD FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN
1. Produced by University of Colorado transcription software (to a version of ARPAbet), the National Institute of Standards (NIST) syllabifier, and my own Python classes that coordinate the two.

42 With Inserted Equivalence Classes [1]
REF: (3.203,5.553) GIHV MIY GO BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW WANT AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) WANT AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) WANT TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY WANT GO FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RETURN AAKD TOW BAXR THERDD
REF: ( , ) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN
1. An equivalence class is a subset of a set in which all members share an equivalence relation. WANT is an equivalence class with members I need, I would like, and so on.

43 Including Flat Language Model Word Perplexity
[Table: mean, median, and standard deviation of word perplexity for the flat LM and N = 1 through 4, on Appendix C and ATIS; values lost in transcription]

44 Including Flat LM Syllable Perplexity
[Table: mean, median, and standard deviation of syllable perplexity for the flat LM and N = 1 through 4, on Appendix C and ATIS; values lost in transcription]

45 Words and Syllables Normed by Flat LM
Word data normed by flat LM:
                 N = 1    N = 2    N = 3    N = 4
Mean Appendix C  33.14%   6.69%    4.15%    3.92%
Mean ATIS        17.58%   1.71%    1.13%    1.07%
Syllable data normed by flat LM:
                 N = 1    N = 2    N = 3    N = 4
Mean Appendix C  23.35%   8.18%    6.52%    6.31%
Mean ATIS        11.95%   11.96%   1.96%    1.95%

46 Syllabifiers
Syllabifier from the National Institute of Standards and Technology (NIST, 2007)
Based on Daniel Kahn's 1976 dissertation from MIT (Kahn, 1976)
Generative in nature and English-biased

47 Syllables
Estimates of the number of English syllables range from 1,000 to 30,000, which suggests some difficulty in pinning down what a syllable is.
The usual hierarchical approach:
syllable → onset (C) + rhyme
rhyme → nucleus (V) + coda (C)
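For orthographic toy input, the onset/nucleus/coda split can be illustrated with a single anchored regular expression. This is a deliberately naive sketch: it treats a, e, i, o, u as the only possible nuclei and everything else as consonants, and ignores real phonology entirely.

```python
import re

def parse_syllable(syl):
    """Split a single toy orthographic syllable into onset, nucleus,
    and coda: consonants, then vowels, then consonants. Returns None
    for strings that do not fit the C*V+C* shape."""
    m = re.fullmatch(r"([^aeiou]*)([aeiou]+)([^aeiou]*)", syl)
    if m is None:
        return None
    return {"onset": m.group(1), "nucleus": m.group(2), "coda": m.group(3)}

print(parse_syllable("spit"))   # {'onset': 'sp', 'nucleus': 'i', 'coda': 't'}
print(parse_syllable("depth"))  # {'onset': 'd', 'nucleus': 'e', 'coda': 'pth'}
```

Even this toy parser shows why the examples on the next slide are awkward: the clusters it returns for depth ('pth') and spit ('sp') are exactly the ones that strain the sonority hierarchy.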

48 Sonority
Sonority rises to the nucleus and falls to the coda.
Speech sounds appear to form a sonority hierarchy (from highest to lowest): vowels, glides, liquids, nasals, obstruents.
Useful but not absolute: e.g., both depth and spit seem to violate the sonority hierarchy.

