
1 Lexicon Optimization Approaches for Language Models of Agglutinative Language. Mijit Ablimit, Tatsuya Kawahara, Askar Hamdulla. Media Archiving Laboratory, Kyoto University; Xinjiang University.

2 Outline. 1. Introduction. 2. Review of lexicon optimization methods. 3. Morphological segmenter for the Uyghur language; ASR results based on various morphological units. 4. Morpheme concatenation approaches comparing two layers of ASR results. 5. Discriminative lexicon optimization approaches.

3 Introduction

4 Introduction to Uyghur. The Uyghur (Uyghur: ئۇيغۇر; simplified Chinese: 维吾尔) are a Turkic ethnic group living in Eastern and Central Asia. Today Uyghur people live primarily in the Xinjiang Uyghur Autonomous Region of the People's Republic of China.

5

6

7 Uyghur language. Uyghur belongs to the Turkic branch of the Altaic language family. Present-day Uyghur is written in Arabic letters with some modifications. Writing direction: right to left. Sentences in Uyghur consist of words separated by spaces or punctuation marks.

8 Uyghur Language (Text)

9

10 Uyghur language (morphological structure). Uyghur is an agglutinative language in which words are formed by attaching suffixes to a stem (or root). Each word consists of a root (or stem) followed by zero to many suffixes (the longest chains have about 10 suffixes, possibly more) with no delimiter between them. A few words also take a prefix before the stem; only 7 prefixes are used (it is difficult to find more). Thus the morpheme structure of a Uyghur word is: prefix + stem (or root) + suffix1 + suffix2 + ...

11 Uyghur language (morphological structure). Suffixes that make semantic changes to a root are derivational suffixes; suffixes that make syntactic changes are inflectional suffixes. A root linked with derivational suffixes becomes a stem, so the root set is included in the stem set; sometimes the words "stem" and "root" are used interchangeably. Suffixation involves a number of phonetic phenomena such as vowel weakening and phonetic harmony.

12 Uyghur language (morphological structure). A morpheme is the smallest meaning-bearing unit; here, morpheme refers to any prefix, stem, or suffix. A morpheme consists of syllables, and a syllable is a sequence of phonemes.

13 Uyghur Language (phonology)

14

15

16 Uyghur-to-English character transcription rules for internal processing (table of the Latin representations of the Uyghur Arabic-script letters):
ئابچدئەئېفگغھئىجكلمن → abqdEefgGHijklmn
ڭئوئۆپقرسشتئۇئۈۋخيزژ → NoOpKrsxtuvwhyzZ
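To make the mapping concrete, here is a minimal Python transliteration sketch (not part of the original slides). The letter pairs are read off the flattened table above, with the hamza-carrier vowel forms (e.g. ئا) matched before single letters, and bare vowel letters added for word-internal positions; individual pairs are my reading of the table and should be checked against the original slide.

```python
# Uyghur Arabic script -> internal Latin codes, as read from the table
# above. Hamza-carrier vowels (two codepoints) must be matched first;
# bare vowel letters are added because the hamza carrier appears only
# syllable-initially in Uyghur orthography.
UY2LAT = {
    "ئا": "a", "ب": "b", "چ": "q", "د": "d", "ئە": "E", "ئې": "e",
    "ف": "f", "گ": "g", "غ": "G", "ھ": "H", "ئى": "i", "ج": "j",
    "ك": "k", "ل": "l", "م": "m", "ن": "n",
    "ڭ": "N", "ئو": "o", "ئۆ": "O", "پ": "p", "ق": "K", "ر": "r",
    "س": "s", "ش": "x", "ت": "t", "ئۇ": "u", "ئۈ": "v", "ۋ": "w",
    "خ": "h", "ي": "y", "ز": "z", "ژ": "Z",
    # Bare (non-initial) vowel letters.
    "ا": "a", "ە": "E", "ې": "e", "ى": "i",
    "و": "o", "ۆ": "O", "ۇ": "u", "ۈ": "v",
}

def transcribe(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        # Greedy longest match: try the 2-codepoint vowel forms first.
        if text[i:i + 2] in UY2LAT:
            out.append(UY2LAT[text[i:i + 2]])
            i += 2
        elif text[i] in UY2LAT:
            out.append(UY2LAT[text[i]])
            i += 1
        else:
            out.append(text[i])  # pass through spaces and punctuation
            i += 1
    return "".join(out)
```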

17 For instance

18 Introduction: the practical formulation for ASR. Let U be the acoustic features of the speech, S the corresponding text, and X the phoneme sequence. The recognizer searches for

S* = argmax_S P(U|X) P(X|S) P(S)

where P(U|X) is the acoustic model (AM), P(X|S) is the lexical model, and P(S) is the language model (LM). Text S is represented by smaller units, and the n-gram LM predicts each unit u_t from the previous n-1 units: P(S) = ∏_t P(u_t | u_{t-n+1}, ..., u_{t-1}).

19 Problems with agglutinative languages (1 Introduction). Morpheme units provide an optimal lexicon: good coverage and a small vocabulary size; they preserve linguistic information, including both word and morpheme boundaries; they are applicable to NLP tasks such as information retrieval and machine translation. But morphemes are too short and easily confused. Previous work used rule-based or data-driven optimization: based on frequency and length, splitting less frequent units while keeping frequent units; this may not outperform the word lexicon.

20 Problems with agglutinative languages (1 Introduction). Unsupervised segmentation methods are reported to give better ASR performance for agglutinative languages, but they carry no linguistic information and are dependent on the training data. An ASR system involves both text and speech, yet previous methods are based only on text and rarely consider the phonological similarity relevant to the ASR system. There are co-articulation problems related to unit selection: phonetic changes (phonetic harmony or disharmony) and morphological changes (omission, insertion).

21 Review of lexicon optimization methods

22 Evaluations for ASR systems (2 Review of lexicon optimization methods). Error rates: Word Error Rate (WER), and morpheme, syllable, or character error rates (MER, SER, CER); WER is typically higher than MER, SER, and CER. Lexicon size. Perplexity: the entropy on a test corpus S, normalized by word tokens so that unit sets of different granularity are comparable, PPL = P(S)^(-1/N_w), where N_t is the unit-token count and N_w the word-token count.
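For concreteness, here is a minimal Python sketch (not the authors' evaluation code) of these two measures: an error rate computed by Levenshtein alignment over unit sequences, and a perplexity normalized by word tokens; the function names and the base-2 convention are assumptions.

```python
import math

def error_rate(ref, hyp):
    """Levenshtein distance between unit sequences, normalized by the
    reference length. With word units this is WER; with morpheme,
    syllable, or character units it is MER, SER, or CER."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def word_normalized_perplexity(log2_prob, n_words):
    """PPL = P(S)^(-1/N_w): corpus log-probability normalized by the
    word-token count N_w, so different unit sets are comparable."""
    return 2 ** (-log2_prob / n_words)

# Toy usage: one deleted morpheme out of six reference units.
ref = "yash chegh ing lar da bilim".split()
hyp = "yash chegh ing da bilim".split()
print(error_rate(ref, hyp))                      # ~0.167
print(word_normalized_perplexity(-2000.0, 250))  # word-normalized PPL
```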

23 Data-driven approaches (2 Review of lexicon optimization methods). Merge short, frequently co-occurring units while splitting infrequent, long units; morphological attributes can also be used. A Mutual Bigram (MB) score can provide a merging threshold. These methods can reduce WER or lexicon size, but tend to over-generalize.
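A sketch of this merge-by-frequency idea, assuming a pointwise mutual-bigram score MB(a,b) = P(a,b) / (P(a) P(b)) and hand-picked thresholds; the exact criterion in the cited work may differ.

```python
from collections import Counter

def merge_frequent_pairs(corpus, min_count=500, mb_threshold=10.0):
    """Merge short, frequently co-occurring unit pairs into single
    lexicon entries. `corpus` is a list of unit sequences (sentences)."""
    total = sum(len(sent) for sent in corpus)
    uni = Counter(u for sent in corpus for u in sent)
    bi = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    merges = set()
    for (a, b), c in bi.items():
        # Pointwise "mutual bigram" score: P(a,b) / (P(a) P(b)).
        mb = (c / total) / ((uni[a] / total) * (uni[b] / total))
        if c >= min_count and mb >= mb_threshold:
            merges.add((a, b))
    merged_corpus = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in merges:
                out.append(sent[i] + "+" + sent[i + 1])  # one merged unit
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus, merges
```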

24 Statistical modeling approaches (2 Review of lexicon optimization methods). T. Shinozaki and S. Furui (2002) built evaluation functions based on WER and unit properties, iteratively trained on ASR results. Perplexity is calculated for basic phoneme units as a criterion for building new units (K. Hwang, 1997), merging based on the largest log-likelihood increase. Sub-word units are used to detect OOV words (Choueiter 2009, Parada 2011), and other methods address this problem: sub-word modeling; online learning of OOV words.

25 Unsupervised lexicon extraction (2 Review of lexicon optimization methods). Discover a lexicon from untagged corpora (Goldsmith 2001, Brent 1999, Jurafsky 2000, Chen 1998, Goldwater 2009, etc.). A Bayesian framework is used for unit selection to overcome the overfitting problem of Maximum Likelihood estimation (Goldwater, Brent, etc.). Maximum a Posteriori (MAP) estimation is applied to different structures (Creutz et al. 2005); frequency and length properties are utilized.

26 Discriminative language modeling approach (2 Review of lexicon optimization methods). Discriminative approaches have been used to optimize the language model (Brian Roark, Murat Saraclar et al. 2004), not the lexicon: ASR hypotheses serve as positive and negative samples, and various features are trained on the ASR results. The discriminative lexicon optimization method proposed in this research is inspired by this approach.

27 Lexicon optimization approaches for various languages (2 Review of lexicon optimization methods). Much lexicon optimization research has been done for agglutinative languages like Japanese, Korean, and Turkish, and for other highly inflectional languages like Finnish, German, Estonian, Arabic, and Thai. Agglutinative languages have a clear morphological structure compared to fusional languages: stem + suf1 + suf2 + ... + sufn. It is therefore possible to extract linguistic sub-word units with high accuracy. For all these inflectional languages, the investigated methods were data-driven or relied on frequency and length.

28 Lexicon optimization approaches for Turkish (2 Review of lexicon optimization methods). Murat Saraclar (2007) and K. Hacioglu (2003) compared ASR results based on all morphological units: linguistic morphemes extracted by two-level morphology; statistical morphemes extracted by the Morfessor program; a mutual-probability threshold; and a stem-ending approach with frequency-based optimization, which improved results. K. Oflazer et al. (2005) investigated multi-layer sub-word optimization based on stem-ending units, with syllable-based units adopted for OOV words. Discriminative language modeling has been applied to Turkish, with a 1% improvement reported.

29 Lexicon optimization approaches for Japanese and Korean (2 Review of lexicon optimization methods). Japanese has no word delimiters; both supervised and unsupervised lexicon extraction have been investigated. L.M. Tomokiyo and K. Ries (1998) investigated a concatenative approach based on perplexity. R.K. Ando and Lillian Lee (2000) investigated segmentation based on the three types of Japanese characters (kanji, hiragana, katakana), which often indicate word boundaries. Unsupervised lexicon extraction was investigated by M. Nagata (1997), starting from a small number of words iteratively augmented on a raw corpus. A Bayesian model (Mochihashi 2009) has been applied to Japanese and Chinese lexicon extraction. Lexicon optimization is widely used for Korean: O.W. Kwon, K. Hwang et al. (2003) concatenated morphemes or pseudo-morphemes based on morphological and phonological properties; O.W. Kwon (2000) applied a holistic combination of rule-based and statistical approaches.

30 Lexicon optimization approaches for other inflectional languages (2 Review of lexicon optimization methods). The Morfessor program has been used to investigate many inflectional languages, such as Finnish, Turkish, Arabic, Estonian, and Basque. Morphological and phonological criteria, along with statistical methods, have been applied to many languages, such as German, Arabic, and Thai. Improved ASR results are reported.

31 Our approaches to lexicon optimization (1 Introduction). 1. Linguistic morpheme unit segmentation: preserves linguistic information and is convenient for downstream NLP such as IR and MT; supervised segmentation is necessary. 2. Feature extraction from two layers of ASR results: directly linked to the ASR accuracy of word and morpheme units; features such as frequency, length, and morphological attributes are used to extract useful samples. 3. Discriminative approach to morpheme concatenation: automatically learns the effective features; machine learning methods such as SVM and the perceptron are used.

32 Flow chart of the proposed lexicon optimization (1 Introduction).

33 Morpheme segmentation in the Uyghur language and baseline ASR results

34 Outline (3 Sub-word segmentation in the Uyghur language and baseline ASR results): morphological segmenters for the Uyghur language; statistical properties of various units; ASR results based on various morphological units.

35 Uyghur language and morphology (3.1 Morphological segmenters for the Uyghur language). Uyghur is an agglutinative language of the Turkic branch of the Altaic family. Example: Müshükning kǝlginini korgǝn chashqan hoduqup qachti. (The mouse who saw the cat coming was startled and escaped.) Words are separated naturally. Morpheme sequence, format "prefix + stem + suffix1 + suffix2 + ...": Müshük+ning kǝl+gǝn+i+ni kor+gǝn chashqan hoduq+up qach+ti. Morphemes suffer from phonological and morphological changes. Syllable sequence, format "CV[CC]" (C: consonant; V: vowel): Mü+shük+ning kǝl+gi+ni+ni kor+gǝn chash+qan ho+du+qup qach+ti.

36 Inducing Uyghur morphemes (3.1 Morphological segmenters for the Uyghur language). Stem + suffix structure, with few prefixes (7 in this research). Stems are longer, independent linguistic units, fairly unchanged after suffixation. Two types of suffixes: derivational and inflectional. The stem-suffix boundary is chosen as the primary target of segmentation: first, split the word into stem and word-ending; second, split the word-ending into single suffixes.

37 Phonological and morphological changes in surface forms (3.1 Morphological segmenters for the Uyghur language). Insertion, deletion, phonetic harmony, and disharmony (assimilation or weakening) occur; tense, person, genitive, etc. are all expressed in suffixes. 1) Assimilation is recovered to standard surface forms: almiliring = alma+lar+ing. 2) Morphological change, i.e. deletion and insertion: oghli = oghul+i; binaying = bina+[y]+ing. 3) Phonetic harmony: Kyotodin = Kyoto+din; Newyorktin = Newyork+tin. 4) Ambiguity: berish = bar(go/have)+ish; berish = bǝr(give)+ish.
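A toy sketch of how such surface changes can be undone with rewrite rules before lexicon lookup; the rules cover only the four examples above and are illustrative, not the authors' rule set (ambiguities such as berish need a lexicon and context to resolve).

```python
import re

# Illustrative surface -> standard rewrite rules, one per phenomenon
# shown above. A real analyzer needs a full rule set plus a lexicon.
RULES = [
    (re.compile(r"^almiliring$"), "alma+lar+ing"),   # vowel weakening
    (re.compile(r"^oghli$"), "oghul+i"),             # deletion in stem
    (re.compile(r"^binaying$"), "bina+y+ing"),       # y-insertion
    (re.compile(r"(din|tin)$"), "+\\1"),             # harmony: -din/-tin
]

def to_standard_form(word: str) -> str:
    """Apply the first matching rule; otherwise return the word as-is."""
    for pattern, replacement in RULES:
        new, n = pattern.subn(replacement, word)
        if n:
            return new
    return word

print(to_standard_form("Kyotodin"))   # Kyoto+din
print(to_standard_form("oghli"))      # oghul+i
```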

38 Inducing Uyghur morphemes, rule-based segmentation results (3.1 Morphological segmenters for the Uyghur language). We segmented an 18,400-word list into morphemes; the accuracy is around 92%. Being vocabulary-based, this is difficult to generalize to text data; when incorporated into specific applications such as a spell checker, dictionary, or search engine, some revisions are necessary. The syllable structure is C+V+C+C (V: vowel, C: consonant), except for some syllables imported from Chinese or European languages. Rule-based syllable segmentation accuracy is higher than 99%.

39 Supervised morpheme segmentation, statistical modeling (3.1 Morphological segmenters for the Uyghur language). A statistical model can be trained in a fully supervised way. A text corpus of sentences collected from general topics, and their manual segmentations, were prepared; more than 30K stems were prepared independently and used for the segmentation task. Corpus statistics (tokens / vocabulary): words 139.0k / 35.37k; morphemes 261.7k / 11.8k; characters 936.8k; sentences 10,025.

40 Probabilistic model for morpheme segmentation (3.1 Morphological segmenters for the Uyghur language). For a candidate word, all possible segmentation results are enumerated and their probabilities computed to select the best one. The intra-word bigram formulation scores a segmentation m1 ... mk as P(m1 ... mk) = ∏_i P(m_i | m_{i-1}), with a word-boundary symbol as m0. Surface realization is considered, and the standard morpheme format is exported.
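A minimal sketch of this search as dynamic programming over split points, assuming intra-word bigram probabilities with a word-start symbol; names, the length cap, and the fallback are illustrative assumptions.

```python
import math

def best_segmentation(word, bigram, max_len=12):
    """Viterbi over all split points of `word`.
    bigram[(prev, cur)] = P(cur | prev); '<w>' marks the word start.
    Returns the highest-probability morpheme sequence."""
    n = len(word)
    # best[i] = (log-prob, last morpheme, backpointer) for word[:i]
    best = {0: (0.0, "<w>", None)}
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if j not in best:
                continue
            score, prev, _ = best[j]
            m = word[j:i]
            p = bigram.get((prev, m), 0.0)
            if p <= 0.0:
                continue
            cand = score + math.log(p)
            if i not in best or cand > best[i][0]:
                best[i] = (cand, m, j)
    if n not in best:
        return [word]  # fall back to the unsegmented word
    morphs, i = [], n  # trace back the morpheme sequence
    while i:
        _, m, j = best[i]
        morphs.append(m)
        i = j
    return morphs[::-1]

# Toy model: "oqughuchilar" (students) = oqughuchi + lar.
bg = {("<w>", "oqughuchi"): 0.01, ("oqughuchi", "lar"): 0.3}
print(best_segmentation("oqughuchilar", bg))  # ['oqughuchi', 'lar']
```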

41 Morpheme segmentation accuracy and coverage (3.1 Morphological segmenters for the Uyghur language). Word coverage is 86.85%; morpheme coverage is 98.44%; the morpheme segmentation accuracy is 97.66%.

42 N-gram language models based on various morphological units, corpora preparation (3.2 Statistical properties based on various units). A corpus of about 630k sentences from general topics such as novels, newspapers, and books (newspaper, novels, science, ...). The surface forms of morphemes are kept unchanged; linguistic information such as word and morpheme boundaries is preserved. Training corpus: 620k sentences, 11.58m words, 21.88m morphemes, 31.74m syllables, 78.18m characters. Test corpus: 0.408m morphemes, 0.592m syllables, 1.46m characters.

43 Tri-gram language models based on various units (3.2 Statistical properties based on various units). (Table: training-corpus size in millions of characters, token counts in millions, and word-normalized perplexity for word, morpheme, and syllable units at increasing fractions of the training corpus.)

44 Vocabulary comparison of various units (3.2 Statistical properties based on various units). Vocabulary size explodes with word units.

45 Perplexity comparison of various n-grams, normalized by words (3.2 Statistical properties based on various units). Perplexities of the various unit sets converge to similar values. There is a slight gain from longer units at smaller n. Morphemes slightly outperform words because of the small OOV rate.

46 Uyghur ASR experiments, Uyghur acoustic model (3.3 ASR results based on various morphological units). The training speech corpus is selected from general topics and used for building the Uyghur acoustic model (AM): 13.7K unique sentences, 158.6 hours of utterances in total. The test corpus is 550 different sentences from news; each sentence is read by at least one male and one female speaker.

47 ASR systems based on various morphological units (3.3 ASR results based on various morphological units). Four different language models are built: 1) word-based; 2) stem-suffix (word-endings) based; 3) morpheme-based (3-, 4-, and 5-gram); 4) syllable-based. Vocabulary sizes: word 227.9k; stem-suffix 74.5k; morpheme 55.2k; syllable 6.58k. (Table: MER and WER for each LM.) The syllable error rate is 28.73%. The word-based ASR result, automatically segmented into morphemes and syllables, gives a corresponding MER of 18.88% and SER of 15.42%.

48 Conclusion (3 Sub-word segmentation in the Uyghur language and baseline ASR results). Supervised morphological unit segmentation achieved 97.6% accuracy for the Uyghur language. Morphemes provide syntactic and semantic information, which is convenient for ASR and NLP research. Uyghur LVCSR systems were built on various linguistic units. Longer units (words) outperform the sub-word units in the ASR application.

49 Morpheme concatenation approach based on feature extraction from two layers of ASR results

50 Outline (4 Morpheme concatenation approach based on feature extraction from two layers of ASR results): corpora and baseline systems; problematic sample extraction; experimental results.

51 Aligned ASR results of word and morpheme units (4 Morpheme concatenation approach). Reference (word): Yash cheghinglarda bilim elishinglar kerǝk. Reference (morph): Yash chegh_ing_lar_da bilim el_ish_ing_lar kerǝk. Word ASR result: Yash cheghinglarda bilim berishinglar kerǝk (O O O X O). Morph ASR result: Yash chegh_ing_da bilim el_ish_ing_lar kerǝk (O X O O O). The word unit provides better ASR performance, but the vocabulary size explodes, causing OOV. The morpheme unit has a smaller vocabulary and high coverage, but morphemes are often short and easily confused.

52 Comparison of training and test speech corpora (4 Morpheme concatenation approach). (Table: 1-, 2-, and 3-gram coverage and perplexity of the training and test speech corpora for morpheme and word units.) The training data is prepared to cover all phoneme sequences, so it contains a larger stem vocabulary. The Julius system is used as the decoder.

53 ASR results on various morphological units (4 Morpheme concatenation approach). Morpheme 4-grams generate the best results among the morpheme-based LMs. Cutoff-F means units whose frequency is less than F are treated as UNKNOWN. Frequent Morpheme Sequences (FMS): frequently co-occurring morpheme sequences (500-occurrence threshold) are merged. (Table: WER and vocabulary size for the baseline models: morpheme 4-gram at two cutoffs, word 3-gram at two cutoffs, FMS, and stem & word-endings.) The best ASR results are aligned for sample extraction.

54 Comparing ASR results of word and morpheme units (4 Morpheme concatenation approach). We extract useful patterns from the ASR results and analyze the reasons for the confusion. Main reasons for misrecognition, with examples: phonetic harmony or co-articulation (yigirmǝ-yigirmi "twenty", vottura-votturi "middle"); confusion of frequent short stems with many derivatives (biz, vu, bash, yǝr: "we, he/she, head, land"); phonetic similarity (hǝmmǝ-ǝmma: "all, but"); ambiguity (uni vs. u+ni: "he, him"); too many suffix insertions (ish+lap+p+i+ish+ni). We focus on the reasons for confusion and ascribe them to several features: error frequency, length, and attribute (stem or word-ending). The co-articulation problem can be solved by merging units.

55 Features for extracting samples from ASR results (4 Morpheme concatenation approach). The patterns we consider are three types of features: first, the error frequency of the word unit; second, the length of the morphemes; third, attributes (stem vs. word-ending). The stem and word-ending features are specific to agglutinative languages.
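A sketch of how word candidates could be filtered with these three features; the thresholds and field names are invented for illustration.

```python
from collections import Counter

def select_candidates(aligned, min_errors=3, max_morph_len=3):
    """aligned: list of (word, morphs, word_ok, morph_ok) tuples from
    the two aligned ASR outputs. Returns words to add as single
    entries to the morpheme-unit lexicon."""
    # Feature 1: error frequency -- how often the morpheme system got
    # this word wrong while the word system got it right.
    errors = Counter(w for w, _, w_ok, m_ok in aligned if w_ok and not m_ok)
    selected = set()
    for word, morphs, w_ok, m_ok in aligned:
        if errors[word] < min_errors:
            continue
        # Feature 2: length -- only merge if some morpheme is short
        # enough to be acoustically confusable.
        if min(len(m) for m in morphs) > max_morph_len:
            continue
        # Feature 3: attribute -- require a stem + word-ending split,
        # i.e. at least one suffix after the stem (morphs[0]).
        if len(morphs) < 2:
            continue
        selected.add(word)
    return selected
```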

56 Experimental evaluation of the error-frequency feature (4 Morpheme concatenation approach). Word candidates are selected according to these features and added to the lexicon of the morpheme unit. Iterative application of the error-frequency feature shows cumulative improvements: the vocabulary grows from 27.0k (baseline) to 40.3k (first round) and 46.0k (second round), with WER on the training and test data reported at each round. When we extract misrecognized words from the test set, only 50% of them are covered by the training data set.

57 Experimental evaluation of the different features (4 Morpheme concatenation approach). The effects of the various features, and of their combinations, accumulate. (Table: WER, ΔWER, and vocabulary size for the morpheme-based baseline and for the error-frequency, length, attribute, attribute + length, and attribute + length + error-frequency feature models.) The result significantly outperforms both baseline models.

58 Conclusion (4 Morpheme concatenation approach). We have proposed a manual feature extraction approach based on comparing two layers of ASR results. Instead of speculating about linguistic or statistical properties, we directly analyze the ASR results and identify useful features. The proposed method significantly reduced both WER and lexicon size compared to the best word-based model. It is directly related to the ASR results and has no direct link with OOV.

59 Discriminative lexicon optimization approach for ASR of agglutinative languages

60 Outline (5 Discriminative models for lexicon optimization for ASR of agglutinative languages): feature extraction from the aligned ASR results; discriminative approaches for lexicon optimization; lexicon design and baseline ASR systems; experimental evaluations.

61 Aligned ASR results of morpheme- and word-based models (5 Discriminative models for lexicon optimization). Reference (word): Yash cheghinglarda bilim elishinglar kerǝk. Reference (morph): Yash chegh_ing_lar_da bilim el_ish_ing_lar kerǝk. Word ASR result: Yash cheghinglarda bilim elishinglar kerǝk (O O O O O). Morph ASR result: Yash chegh_ing_da bilim el_ish_ing_lar kerǝk (O X O O O). (Meaning: study hard when you are young.) This is the CRITICAL case: we automate the manual feature extraction approach.

62 Sample extraction from the aligned ASR results (5 Discriminative models for lexicon optimization). The CRITICAL samples are extracted. Distribution where word ≠ morph (word result, morph result): XX 68%; OX 28.5%; XO 3.5%. The naïve method (= error-frequency sampling): misrecognized morphemes are merged into words; WER is greatly reduced in this experiment, but it is difficult to cover all words.

63 Feature extraction from the aligned ASR results (5 Discriminative models for lexicon optimization). Given all the training sample pairs (w, m_i m_j ...) where the word and morpheme results differ, we extract binary features and the desired value y_i (i: sample index). Example: cheghinglarda (O) vs. chegh_ing_lar_da (X) gives y_i = 1.
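A sketch of turning the aligned outputs into binary training samples: only CRITICAL cases (exactly one side correct) are kept, the label says whether the merged word form helped, and morpheme n-grams serve as binary features. The representation details are assumptions.

```python
def make_samples(aligned):
    """aligned: list of (word, morphs, word_ok, morph_ok).
    Keep only the CRITICAL cases where exactly one side is correct;
    y = +1 means the merged word form should enter the lexicon."""
    samples = []
    for word, morphs, word_ok, morph_ok in aligned:
        if word_ok == morph_ok:
            continue  # both right or both wrong: not informative
        feats = set()
        for m in morphs:                      # unigram features
            feats.add(("uni", m))
        for a, b in zip(morphs, morphs[1:]):  # bigram features
            feats.add(("bi", a, b))
        y = 1 if word_ok else -1
        samples.append((feats, y))
    return samples

aligned = [("cheghinglarda", ["chegh", "ing", "lar", "da"], True, False)]
print(make_samples(aligned))  # one positive sample with its features
```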

64 Feature extraction from the aligned ASR results (5 Discriminative models for lexicon optimization). The supervised variant uses the labeled disagreements; the unsupervised variant uses all CRITICAL_CASE samples where word ≠ morph (XX 68%, OX 28.5%, XO 3.5%). All the binary training samples are extracted and fed to machine learning algorithms: perceptron, SVM, LR.

65 Discriminative approach: perceptron (5 Discriminative models for lexicon optimization). For the perceptron, we define an evaluation function by applying the standard sigmoid to the linear estimation: g(x) = 1 / (1 + exp(-w·x)). The weight vector is updated per sample as w ← w + η (y - g(x)) x. It easily converges to a local optimum with a large feature dimension.
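Below is a minimal Python sketch of this sigmoid perceptron, assuming sparse binary features as extracted above; the learning rate, epoch count, and function names are illustrative, not the authors' settings.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_perceptron(samples, epochs=10, lr=0.1):
    """samples: list of (feature_set, y) with y in {-1, +1}.
    g(x) = sigmoid(w . x); delta-rule update on each sample."""
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, y in samples:
            out = sigmoid(sum(w[f] for f in feats))
            target = 1.0 if y > 0 else 0.0
            for f in feats:          # binary features: x_f = 1
                w[f] += lr * (target - out)
    return w

def g(w, feats):
    """Evaluation function, used later for the g(w) > 0.5 decision."""
    return sigmoid(sum(w[f] for f in feats))
```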

66 Discriminative approach: SVM & LR (5 Discriminative models for lexicon optimization). Both methods solve the following unconstrained optimization problem with different loss functions: min_w (1/2)||w||² + C Σ_i loss(w; x_i, y_i). For SVM, the loss is the hinge loss max(0, 1 - y_i w·x_i); for logistic regression, it is the log loss log(1 + exp(-y_i w·x_i)).
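These two objectives correspond to standard linear classifiers, so in practice they can be trained with off-the-shelf tools. A sketch using scikit-learn (the toolkit is my assumption; the slides do not name one), reusing the samples from the previous sketch and assuming both classes are present:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def to_matrix(samples):
    """samples: list of (feature_set, y); binary indicator features."""
    vec = DictVectorizer()
    X = vec.fit_transform([{str(f): 1 for f in feats}
                           for feats, _ in samples])
    y = [label for _, label in samples]
    return X, y, vec

X, y, vec = to_matrix(samples)               # `samples` as built above
svm = LinearSVC(C=1.0).fit(X, y)             # hinge loss + L2 penalty
lr = LogisticRegression(C=1.0).fit(X, y)     # log loss + L2 penalty
```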

67 Lexicon design (5 Discriminative models for lexicon optimization). The learned features are generalized to all units in the text corpus. The concatenation process is repeated in sequence while the condition g(w) > 0.5 is met; it can also be applied to sub-words within word boundaries, where the search continues while the condition holds. All speech and text training corpora are consistent with the previous experiments; a 4-gram LM with cutoff-5 is used for all experiments.
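A sketch of this concatenation pass, assuming the evaluation function g and weights w from the perceptron sketch above; the left-to-right greedy strategy is one reading of "repeated in sequence while the condition is met".

```python
def concatenate(morphs, w, feature_fn, threshold=0.5):
    """Greedily merge adjacent morphemes left to right while the
    learned score of the merged unit exceeds the threshold.
    feature_fn(unit_a, unit_b) -> feature set for a candidate merge."""
    out = [morphs[0]]
    for m in morphs[1:]:
        if g(w, feature_fn(out[-1], m)) > threshold:
            out[-1] = out[-1] + "_" + m   # merge into one lexical entry
        else:
            out.append(m)
    return out

# Toy usage with the perceptron score g from the earlier sketch:
# concatenate(["chegh", "ing", "lar", "da"], w,
#             lambda a, b: {("bi", a, b)})
```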

68 Baseline ASR results with various units (5 Discriminative models for lexicon optimization). (Table: WER, vocabulary size, and OOV rate for the morpheme 4-gram models, OOV 0.3% and 0.7% at the two cutoffs, and the word 3-gram models, OOV 2.8% and 4.4%.) Outlier samples are removed when the frequency of a CRITICAL sample is less than a filtering threshold N. The n-gram feature dimensions covered by the speech training corpus are: unigram features 17K; bigram features 53K. The best ASR results are compared for feature extraction.

69 Flow chart of the discriminative approach (5 Discriminative models for lexicon optimization): the text training data feeds the word LM and morph LM; the acoustic model and speech training data feed the ASR system; the ASR results drive feature & weight estimation and lexical entry selection, which produce the enhanced LM.

70 Effect of the sample-filtering threshold with the unigram feature (5 Discriminative models for lexicon optimization). Lexicon sizes for N = 0..5: perceptron 104.5K, 90.2K, 74.8K, 63.6K, 55.3K, 50.1K; LR 102.4K, 91.2K, 79.9K, 70.1K, 62.4K, 56.5K; SVM 103.4K, 94.6K, 83.7K, 73.5K, 65.4K, 59.2K. (Table also reports WER at each threshold.) SVM and LR are more robust against less reliable samples.

71 Comparison of results on different units and features (5 Discriminative models for lexicon optimization). Lexicon sizes (word unigram, word bigram, sub-word unigram, sub-word bigram): perceptron 74.8K, 67.3K, 40.7K, 49.9K; LR 102.4K, 85.4K, 44.0K, 65.8K; SVM 103.4K, 80.1K, 34.7K, 55.1K. (Table also reports WER for each configuration.) Sub-word optimization significantly reduces both WER and lexicon size.

72 Supervised and unsupervised feature extraction (5 Discriminative models for lexicon optimization). With the sub-word bigram feature, lexicon sizes (supervised / unsupervised): perceptron 49.7K / 49.9K; LR 46.3K / 65.8K; SVM 45.1K / 55.1K. (Table also reports WER for each setting; word ≠ morph distribution: XX 68%, OX 28.5%, XO 3.5%.) The unsupervised training is scalable to large speech data.

73 Summary of the results (5 Discriminative models for lexicon optimization). (Table: WER, lexicon size, and OOV for the baseline morpheme model, OOV 0.7%; the baseline word model, OOV 2.8%; the best MI method, OOV 0.7%; and the SVM sub-word bigram feature model at two cutoffs, OOV 0.7% and 0.9%.) The optimized system is very stable and not much affected by the cutoff rate. SVM & LR are more robust than the perceptron.

74 Conclusion (5 Discriminative models for lexicon optimization). A novel discriminative approach to lexicon optimization for highly inflectional languages: it automatically optimizes the lexical units and outperformed the other methods, with the smallest WER and lexicon size. SVM & LR are more effective than the perceptron. Sub-word optimization based on the bigram feature produces the best result. The model can be trained on un-transcribed speech data.

75 Concluding remarks. Uyghur morphological analyzers and ASR systems for various morphological units were investigated and optimized. Supervised and unsupervised morpheme segmentation methods and their ASR applications were compared. A novel approach based on comparing two layers of ASR results was proposed. The proposed discriminative approach significantly reduced both WER and lexicon size, and the ASR system based on the optimized lexicon is very stable compared to the baseline systems.

76 Discussion on the generality of the proposed method. The basic assumption is that the word unit outperforms the predefined morpheme unit; similar tendencies hold in other languages such as Turkish and Korean. For some languages the predefined morphemes may already give good ASR performance, which does not satisfy the basic assumption: units in Japanese are not clear-cut, so the extracted lexicon can already be optimal, and rule-based or statistical approaches may have already optimized the lexicon. The larger training data of resource-rich languages benefits longer units, which is consistent with our basic assumption.

77 Thank you

