1 Lexicon Optimization Approaches for Language Models of Agglutinative Language
Mijit Ablimit, Tatsuya Kawahara, Askar Hamdulla
Media archiving laboratory, Kyoto University / Xinjiang University
2 Outline
Introduction
Review of lexicon optimization methods
Morphological segmenter for the Uyghur language; ASR results based on various morphological units
Morpheme concatenation approach comparing two layers of ASR results
Discriminative lexicon optimization approaches
4 Introduction to Uyghur
The Uyghur (Uyghur: ئۇيغۇر; simplified Chinese: 维吾尔) are a Turkic ethnic group living in Eastern and Central Asia.
Today Uyghur people live primarily in the Xinjiang Uyghur Autonomous Region of the People's Republic of China.
7 Uyghur language
Uyghur belongs to the Turkic language family of the Altaic language system.
Present-day Uyghur is written in Arabic letters with some modifications.
Writing direction: right to left.
Sentences in Uyghur consist of words separated by spaces or punctuation marks.
10 Uyghur language (morphological structure)
Uyghur is an agglutinative language in which words are formed by attaching suffixes to a stem (or root).
Each word consists of a root (or stem) followed by zero to many suffixes (the longest observed is about 10, possibly more) with no delimiter between them.
A few words also take a prefix before the stem; only 7 prefixes are used (it is difficult to find more). Thus the morpheme structure of a Uyghur word is:
prefix + stem (or root) + suffix1 + suffix2 + …
11 Uyghur language (morphological structure)
Suffixes that make semantic changes to a root are derivational suffixes; suffixes that make syntactic changes are inflectional suffixes.
A root joined with derivational suffixes becomes a stem, so the root set is included in the stem set. The terms "stem" and "root" are sometimes used interchangeably.
Suffixation involves a number of phonetic phenomena such as vowel weakening and phonetic harmony.
12 Uyghur language (morphological structure)
A morpheme is the smallest meaning-bearing unit. Here, "morpheme" refers to any prefix, stem, or suffix.
A morpheme consists of one or more syllables, and a syllable is a sequence of phonemes.
18 Introduction: the practical formulation for ASR
S* = argmax_S P(S|U) ≈ argmax_S P(U|X) P(X|S) P(S), where
P(U|X): acoustic model (AM)
P(X|S): lexical model
P(S): language model (LM)
U: the acoustic features of the speech; S: the corresponding text; X: the phoneme sequence.
The text S is represented as a sequence of smaller units t_1 … t_N, and the n-gram LM predicts each unit from the previous n-1 units:
P(S) ≈ ∏_i P(t_i | t_{i-n+1} … t_{i-1})
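To make the n-gram factorization concrete, here is a minimal Python sketch of scoring a unit sequence with a bigram LM (n = 2). The unit names and probabilities are made up for illustration; a real system would use smoothed estimates from the training corpus.

```python
# Toy bigram LM: P(S) = product over units of P(t_i | t_{i-1}).
# Probabilities and unit names are illustrative, not from the paper's corpus.
bigram = {
    ("<s>", "yash"): 0.1,
    ("yash", "chegh"): 0.05,
    ("chegh", "ing"): 0.4,
}

def sentence_prob(units, lm, unk=1e-6):
    """Score a unit sequence with a bigram LM, backing off to `unk`
    for unseen pairs (a stand-in for real smoothing)."""
    prob = 1.0
    for prev, cur in zip(["<s>"] + units, units):
        prob *= lm.get((prev, cur), unk)
    return prob

print(sentence_prob(["yash", "chegh", "ing"], bigram))  # 0.1 * 0.05 * 0.4
```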
19 Introduction: problems with agglutinative languages
The morpheme unit provides an optimal lexicon:
good coverage with a small vocabulary size;
preserves linguistic information, including both word and morpheme boundaries;
applicable to NLP tasks such as information retrieval and machine translation.
But: morphemes are too short and easily confused.
Previous work: rule-based or data-driven optimization,
based on frequency and length,
splitting less frequent units while keeping frequent units;
may not outperform the word lexicon.
20 Introduction: problems with agglutinative languages
Unsupervised segmentation methods are reported to give better ASR performance for agglutinative languages, but they
do not carry linguistic information and
are dependent on the training data.
An ASR system involves both text and speech, yet previous methods are based only on text,
rarely considering the phonological similarity relevant to the ASR system.
There are also co-articulation problems related to unit selection:
phonetic changes (phonetic harmony or disharmony);
morphological changes (omission, insertion).
22 Review of lexicon optimization methods: evaluations for an ASR system
Word Error Rate (WER) and Morpheme, Syllable, and Character Error Rates (MER, SER, CER); WER ≥ MER, SER, CER.
Lexicon size.
Perplexity, based on the entropy of the test corpus S. To compare different unit sets, perplexity is normalized by word tokens:
PPL_word = P(S)^(-1/N_w)
where N_t is the number of unit tokens and N_w the number of word tokens.
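A minimal sketch of word-normalized perplexity as defined above; the log-probability sum and word count below are illustrative numbers, not results from the paper.

```python
import math

def word_normalized_ppl(logprob_sum, n_word_tokens):
    """PPL_word = P(S)^(-1/N_w), from a natural-log probability sum.
    Normalizing by word tokens N_w rather than unit tokens N_t makes
    word, morpheme, and syllable LMs directly comparable."""
    return math.exp(-logprob_sum / n_word_tokens)

# Illustrative numbers: total ln P(S) of a test corpus and its word count.
print(word_normalized_ppl(logprob_sum=-1.5e6, n_word_tokens=217000))
```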
23 Review of lexicon optimization methods: data-driven approaches
Merge short, frequently co-occurring units while splitting infrequent, long units.
Morphological attributes can be used.
Mutual Bigram (MB) statistics can provide a merging threshold.
These methods can reduce WER or lexicon size, but tend to over-generalize.
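A sketch of one merge pass under these ideas. The score used here (pair frequency normalized by both unit frequencies) is one plausible formulation of a mutual-bigram statistic; the surveyed papers differ in the exact criterion, so the threshold and score are assumptions.

```python
from collections import Counter

def merge_pass(corpus, threshold=0.1):
    """One pass of frequency-based merging: join adjacent unit pairs whose
    mutual-bigram score exceeds a threshold. Score and threshold are
    illustrative, not a specific paper's definition."""
    unigrams = Counter(u for sent in corpus for u in sent)
    bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    keep = {p for p, c in bigrams.items()
            if c / (unigrams[p[0]] * unigrams[p[1]]) > threshold}
    merged = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in keep:
                out.append(sent[i] + sent[i + 1])  # concatenate the pair
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

print(merge_pass([["el", "ish", "ing", "lar"], ["el", "ish"]]))
```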
24 Review of lexicon optimization methods: statistical modeling approaches
T. Shinozaki and S. Furui (2002) built evaluation functions based on WER and unit properties, iteratively trained on ASR results.
Perplexity is calculated for basic phoneme units as a criterion for building new units (K. Hwang, 1997); units are merged based on the largest log-likelihood increase.
Sub-word units are used to detect OOV words (Choueiter 2009, Parada 2011), and other methods address this problem:
sub-word modeling;
online learning of OOV words.
25 Review of lexicon optimization methods: unsupervised lexicon extraction
Discover a lexicon from untagged corpora (Goldsmith 2001, Brent 1999, Jurafsky 2000, Chen 1998, Goldwater 2009, etc.).
A Bayesian framework is utilized for unit selection to overcome the overfitting problem of Maximum Likelihood estimation (Goldwater, Brent, etc.).
Maximum a Posteriori (MAP) estimation is utilized for different structures (Creutz et al. 2005); frequency and length properties are utilized.
26 Review of lexicon optimization methods: discriminative language modeling
Discriminative approaches have been used to optimize the language model (Brian Roark, Murat Saraclar et al. 2004), not the lexicon:
ASR samples are utilized as positive and negative examples;
various features are trained on ASR results.
The discriminative lexicon optimization method proposed in this research is inspired by this approach.
27 Review of lexicon optimization methods: approaches for various languages
Much lexicon optimization research has been done for agglutinative languages like Japanese, Korean, and Turkish, and for other highly inflectional languages like Finnish, German, Estonian, Arabic, and Thai.
Agglutinative languages have a clear morphological structure compared to fusional languages:
stem + suf1 + suf2 + … + sufn
so it is possible to extract linguistic sub-word units with high accuracy.
For all these inflectional languages, the investigated methods are data-driven or rely on frequency and length.
28 Review of lexicon optimization methods: approaches for Turkish
Murat Saraclar (2007) and K. Hacioglu (2003) compared ASR results for all morphological units:
linguistic morphemes extracted by two-level morphology;
statistical morphemes extracted by the Morfessor program, using a mutual-probability threshold.
A stem-ending approach with frequency-based optimization improved results.
K. Oflazer et al. (2005) investigated multi-layer sub-word optimization: stem-ending units, with syllable-based units adopted for OOV.
Discriminative language modeling was applied to Turkish, and a 1% improvement was reported.
29 Review of lexicon optimization methods: approaches for Japanese and Korean
Japanese has no word delimiters; both supervised and unsupervised lexicon extraction have been investigated.
L.M. Tomokiyo and K. Ries (1998) investigated a concatenative approach based on perplexity. R.K. Ando and Lillian Lee (2000) investigated segmentation based on the three types of Japanese characters (kanji, hiragana, katakana), which often indicate word boundaries.
Unsupervised lexicon extraction was investigated by M. Nagata (1997), starting from a small number of words iteratively augmented on a raw corpus.
A Bayesian model (Mochihashi 2009) has been applied to Japanese and Chinese lexicon extraction.
Lexicon optimization is widely used for Korean:
O.W. Kwon, K. Hwang et al. (2003) investigated morpheme or pseudo-morpheme concatenation based on morphological and phonological properties;
O.W. Kwon (2000) applied a holistic combination of rule-based and statistical approaches.
30 Review of lexicon optimization methods: approaches for other inflectional languages
The Morfessor program has been used to investigate many inflectional languages such as Finnish, Turkish, Arabic, Estonian, and Basque.
Morphological and phonological criteria, along with statistical methods, have been applied to many languages such as German, Arabic, and Thai.
Improved ASR results are reported.
31 Introduction: our approaches to lexicon optimization
1. Linguistic morpheme unit segmentation.
Preserves linguistic information; convenient for downstream NLP processing such as IR and MT.
Supervised segmentation is necessary.
2. Feature extraction from two layers of ASR results.
Directly linked to the ASR accuracy of word and morpheme units.
Features like frequency, length, and morphological attributes are utilized to extract useful samples.
3. Discriminative approach to morpheme concatenation.
Automatically learns the effective features.
Machine learning methods such as SVM and perceptron are used.
32 Introduction: flow chart of the proposed lexicon optimization (figure)
33 Morpheme segmentation in the Uyghur language and baseline ASR results
34 Outline
Morphological segmenters for the Uyghur language
Statistical properties of various units
ASR results based on various morphological units
35 Uyghur language and morphology
Uyghur is an agglutinative language belonging to the Turkic language family of the Altaic language system.
Müshükning kǝlginini korgǝn chashqan hoduqup qachti.
(The mouse that saw the cat coming was startled and escaped.)
Words are separated naturally.
Morpheme sequence, format "prefix + stem + suffix1 + suffix2 + …":
Müshük+ning kǝl+gǝn+i+ni kor+gǝn chashqan hoduq+up qach+ti.
Surface forms undergo phonological and morphological changes.
Syllable sequence, format "CV[CC]" (C: consonant; V: vowel):
Mü+shük+ning kǝl+gi+ni+ni kor+gǝn chash+qan ho+du+qup qach+ti.
36 Inducing Uyghur morphemes
Stem + suffix structure, with few prefixes (7 in this research).
Stems are longer, independent linguistic units, fairly unchanged after suffixation.
Two types of suffixes: derivational and inflectional.
The boundary between stem and suffix is chosen as the primary target of segmentation:
first, split the word into stem and word-ending;
second, split the word-ending into single suffixes.
37 Phonological and morphological changes in surface forms
Insertion, deletion, phonetic harmony, and disharmony (assimilation or weakening).
Tense, person, genitive, etc. are all expressed in suffixes.
1) Assimilation is recovered to standard surface forms: almiliring = alma+lar+ing
2) Morphological change, i.e., deletion and insertion: oghli = oghul+i; binaying = bina+[y]+ing
3) Phonetic harmony: Kyotodin = Kyoto+din; Newyorktin = Newyork+tin
4) Ambiguity: berish = bar(go/have)+ish; berish = bǝr(give)+ish
38 Inducing Uyghur morphemes: rule-based segmentation results
We segmented an 18,400-word list into morphemes; the accuracy is around 92%.
Being vocabulary-based, the rules are difficult to generalize to text data.
When incorporated into specific applications, like a spell checker, dictionary, or search engine, some revisions are necessary.
The syllable structure is C+V+C+C (V: vowel, C: consonant), except for some syllables imported from Chinese or European languages.
Rule-based syllable segmentation accuracy is higher than 99% (a sketch follows).
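A minimal sketch of a rule-based CV[CC] syllabifier, assuming a Latin transliteration; the vowel inventory and digraph list are assumptions (the paper works on the Arabic-script orthography), and this is not the authors' implementation.

```python
VOWELS = set("aeiouüöǝ")
DIGRAPHS = ("sh", "ch", "gh", "ng", "zh")  # assumed Latin transliteration

def letters(word):
    """Split a word into phoneme-like letters, keeping digraphs intact."""
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            out.append(word[i:i + 2]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def syllabify(word):
    """CV[CC] syllabifier: of the consonants between two vowels, the last
    becomes the onset of the next syllable and the rest close the coda."""
    ls = letters(word)
    v_idx = [i for i, l in enumerate(ls) if l in VOWELS]
    if not v_idx:
        return [word]
    bounds = [b - 1 if b - a > 1 else b for a, b in zip(v_idx, v_idx[1:])]
    cuts = [0] + bounds + [len(ls)]
    return ["".join(ls[s:e]) for s, e in zip(cuts, cuts[1:])]

print(syllabify("müshükning"))  # ['mü', 'shük', 'ning']
```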
39 Supervised morpheme segmentation: statistical modeling
A statistical model can be trained in a fully supervised way, given a text and its manual segmentation.
A text corpus of 10,025 sentences, collected from general topics, and their manual segmentations were prepared.
More than 30K stems were prepared independently and used for the segmentation task.

units      tokens   vocabulary
word       139.0k   35.37k
morpheme   261.7k   11.8k
character  936.8k
40 Probabilistic model for morpheme segmentation
The intra-word bigram probabilistic formulation is:
m* = argmax_{m_1 … m_k} ∏_{i=1..k} P(m_i | m_{i-1})   (with m_0 a word-start symbol)
Surface realization is considered, and the standard morpheme format is exported.
For a candidate word, all possible segmentations are enumerated and their probabilities computed to find the best result, as in the sketch below.
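A sketch of the enumerate-and-score search under the intra-word bigram model; the morpheme inventory and probabilities are illustrative, and real systems would estimate them from the manually segmented corpus.

```python
import math

# Illustrative intra-word bigram probabilities P(m_i | m_{i-1}); "<w>" marks
# the word start. Real values would be estimated from the segmented corpus.
BIGRAM = {("<w>", "chegh"): 0.01, ("chegh", "ing"): 0.3,
          ("ing", "lar"): 0.5, ("lar", "da"): 0.4}
MORPHS = {"chegh", "ing", "lar", "da", "cheghing"}

def segmentations(word, morphs):
    """Enumerate every split of `word` into in-vocabulary morphemes."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in morphs:
            for rest in segmentations(word[i:], morphs):
                yield [word[:i]] + rest

def best_segmentation(word, unk=1e-8):
    def score(seg):
        return sum(math.log(BIGRAM.get(p, unk))
                   for p in zip(["<w>"] + seg, seg))
    return max(segmentations(word, MORPHS), key=score, default=None)

print(best_segmentation("cheghinglarda"))  # ['chegh', 'ing', 'lar', 'da']
```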
41 Morpheme segmentation accuracy and coverage
Word coverage is 86.85%; morpheme coverage is 98.44%.
The morpheme segmentation accuracy is 97.66%.
42 n-gram language models based on various morphological units: corpora preparation
A corpus of about 630k sentences from general topics (novels, newspapers, books, science, ...).
The surface forms of morphemes are kept unchanged.
Linguistic information, such as word and morpheme boundaries, is preserved.

corpus            sentences  words    morphs   syllables  chars
training corpus   620k       11.58m   21.88m   31.74m     78.18m
test corpus       11,888     0.217m   0.408m   0.592m     1.46m
43 Tri-gram language models based on various units

Training   Characters   tokens (millions)           perplexity normalized by words
corpus     (millions)   word     morph    syllable   word    morph   syllable
L 1/64     1.234        0.182    0.345    0.501      23566   14384   27740
L 1/32     2.474        0.366    0.692    1.005      14376   8987    20919
L 1/16     4.948        0.732    1.384    2.009      9153    6119    17037
L 1/8      9.818        1.453    2.748    3.987      5935    4343    14640
L 1/4      19.601       2.904    5.487    7.959      3847    3232    13148
L 1/2      39.178       5.806    10.966   15.906     2416    2447    12078
L 1/1      78.187       11.587   21.869   31.744     1408    1860    11335
44 Vocabulary comparison of various units (figure)
Vocabulary size explodes for word units.
45 Perplexity comparison of various n-grams, normalized by words (figure)
Perplexities of the various unit sets converge to similar values.
There is a slight gain from longer units at smaller n.
The morpheme unit slightly outperformed the word unit because of its small OOV rate.
46 Uyghur ASR experiments: Uyghur acoustic model
The training speech corpus is selected from general topics and used to build the Uyghur acoustic model (AM).
The test corpus is 550 distinct sentences from news; each sentence is read by at least one male and one female speaker.

Corpus    Unique sentences  Speakers (F/M)  Age    Utterances  Time (hours)
Training  13.7K             353 (187/166)   19-28  62k         158.6
Test      550               23 (13/10)      22-28  1468        2.4
47 ASR systems based on various morphological units
Four different language models are built:
1) word-based; 2) morpheme-based; 3) stem-suffix (word-ending) based; 4) syllable-based.

LM                      Word    Stem-Suffix  morph 3-gram  morph 4-gram  morph 5-gram
Vocabulary              227.9k  74.5k        55.2k (shared across morph n-grams)
Morph Error Rate (%)    18.88   21.69        22.73         21.64         22.98
Word Error Rate (%)     25.72   28.13        28.96         27.92         29.31

The syllable vocabulary is 6.58k and the syllable error rate is 28.73%.
The word-based ASR result, automatically segmented into morphemes and syllables, gives an MER of 18.88% and an SER of 15.42%.
48 Conclusion (sub-word segmentation and baseline ASR results)
Supervised morphological unit segmentation achieved 97.6% accuracy for the Uyghur language.
Morphemes provide syntactic and semantic information, which is convenient for ASR and NLP research.
Uyghur LVCSR systems based on various linguistic units were built.
Longer units (words) outperform the sub-word units in the ASR application.
49 Morpheme Concatenation Approach based on feature extraction from two layers of ASR results
50 Outline
Corpora and baseline systems
Problematic sample extraction
Experimental results
51 Aligned ASR results of word and morpheme units

reference (word):   Yash cheghinglarda bilim elishinglar kerǝk
reference (morph):  Yash chegh_ing_lar_da bilim el_ish_ing_lar kerǝk
word ASR result:    Yash cheghinglarda bilim berishinglar kerǝk   (O O O X O)
morph ASR result:   Yash chegh_ing_da bilim el_ish_ing_lar kerǝk  (O X O O O)

The word unit provides better ASR performance, but the vocabulary size explodes, causing OOV.
The morpheme unit has a smaller vocabulary size and high coverage, but morphemes are often short and easily confused.
52 Comparison of training and test speech corpora

                 Morpheme unit       Word unit
                 training   test     training   test
1-gram coverage  98.4       99.7     94.5       97.2
2-gram coverage  96.4       n/a      75.5       71
3-gram coverage  82         81.6     47.4       31.3
perplexity       82.8       52.2     2436       2497

The training data is prepared to cover all the phoneme sequences, so it contains a larger stem vocabulary.
The Julius system is used as the decoder.
53 ASR results on various morphological units

Baseline models           WER (%)  Vocabulary size
Morph 4-gram, cutoff-2    27.92    55.2k
Morph 4-gram, cutoff-5    28.11    27.4k
Word 3-gram               25.77    227.9k
Word 3-gram               26.64    108.1k
FMS-500                   28.14    274.9k
Stem & word endings       28.13    74.5k

The best ASR results are aligned for sample extraction.
Frequent Morpheme Sequences (FMS): frequent morpheme sequences are merged, with a co-occurrence threshold of 500.
The morpheme 4-gram generates the best results among morpheme-based LMs.
Cutoff-F means units whose frequency is less than F are treated as UNKNOWN.
54 Comparing ASR results of word and morpheme units
We extract useful patterns from the ASR results and analyze the reasons for the confusion.

Main reasons for misrecognition                           Examples (English translation)
Phonetic harmony or co-articulation                       yigirmǝ-yigirmi (twenty), vottura-votturi (middle)
Confusion of frequent short stems with many derivatives   biz, vu, bash, yǝr (we, he/she, head, land)
Phonetic similarity                                       hǝmmǝ-ǝmma (all, but)
Ambiguity                                                 uni-u+ni (he, him)
Too many suffix insertions                                ish+lap+p+i+ish+ni

The co-articulation problem can be solved by merging units.
We focus on the reasons for confusion and ascribe them to several features: error frequency, length, and attribute (stem or word-ending).
55 Features for extracting samples from ASR results
The patterns we consider involve three types of features:
first, the error frequency of the word unit;
second, the length of the morphemes;
third, the attribute (stem or word-ending).
The stem and word-ending features are specific to agglutinative languages (see the sketch below).
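A sketch of candidate selection using the three feature types. The thresholds, the tuple layout of the aligned results, and the attribute test are assumptions for illustration; the paper specifies the feature types but not these exact rules.

```python
from collections import Counter

def select_candidates(aligned, min_err=2, max_morph_len=5):
    """Select word candidates to add to the morpheme lexicon.
    `aligned` holds (word, morph_seq, word_ok, morph_ok) tuples from the two
    aligned ASR outputs. Thresholds are illustrative, not the paper's."""
    # Error-frequency feature: how often the word unit was right
    # where the morpheme unit was wrong.
    err = Counter(w for w, m, wok, mok in aligned if wok and not mok)
    cands = set()
    for w, morphs, wok, mok in aligned:
        freq_ok = err[w] >= min_err                           # error frequency
        len_ok = all(len(m) <= max_morph_len for m in morphs)  # length feature
        attr_ok = len(morphs) >= 2                 # stem + word-ending present
        if freq_ok and len_ok and attr_ok:
            cands.add(w)
    return cands

sample = [("cheghinglarda", ["chegh", "ing", "lar", "da"], True, False)] * 2
print(select_candidates(sample))  # {'cheghinglarda'}
```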
56 Experimental evaluation of the error frequency feature
Word candidates are selected according to these features and added to the lexicon of the morpheme unit.
Iterative application of the error frequency feature shows cumulative improvements.
When we extract misrecognized words from the test set, only 50% of them are covered by the training data set.

Iteration                 Baseline  First round  Second round
WER (%) on training data  31.95     28.62        27.01
WER (%) on test data      28.11     26.11        25.82
Vocabulary size           27.0k     40.3k        46.0k
57 Experimental evaluation of different features
The effects of the various features and their combinations; the features have cumulative effects.

Model                                    WER (%)  ΔWER (%)  Vocabulary size
Morpheme-based baseline                  28.11    -         27.3k
Error frequency feature                  26.11    2.00      40.3k
Length feature                           27.19    0.92      32.8k
Attribute features                       26.74    1.36      36.3k
Attribute + Length features              25.80    2.31      41.2k
Attribute + Length + Error freq features 24.89    3.22      56.7k

The result significantly outperforms both baseline models.
58 Conclusion (morpheme concatenation approach)
We have proposed a manual feature extraction approach based on comparing two layers of ASR results.
Instead of speculating about linguistic or statistical properties, we directly analyze the ASR results and identify useful features.
The proposed method significantly reduced both WER and lexicon size compared to the best word-based model.
It is directly tied to the ASR results, with no direct handling of OOV.
59 Discriminative Lexicon Optimization Approach for ASR
60 Outline
Feature extraction from the aligned ASR results
Discriminative approaches for lexicon optimization
Lexicon design and baseline ASR systems
Experimental evaluations
61 Aligned ASR results of morpheme- and word-based models
Meaning: "Study hard when you are young."

reference (word):   Yash cheghinglarda bilim elishinglar kerǝk
reference (morph):  Yash chegh_ing_lar_da bilim el_ish_ing_lar kerǝk
word ASR result:    Yash cheghinglarda bilim elishinglar kerǝk    (O O O O O)
morph ASR result:   Yash chegh_ing_da bilim el_ish_ing_lar kerǝk  (O X O O O)
                         ^ CRITICAL CASE

We automate the manual feature extraction approach.
62 Sample extraction from the aligned ASR results
The CRITICAL samples, where the word result differs from the morpheme result (Word ≠ Morph), are extracted:

Word ≠ Morph   percentage
X              68%
O              28.5%
               3.5%

Naive method (= error frequency sampling): misrecognized morphemes are merged into words.
WER is greatly reduced with this experiment, but it is difficult to cover all words.
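A sketch of CRITICAL-case extraction from the two hypothesis layers. It assumes the three sequences are already word-aligned (no insertions or deletions), which simplifies the edit-distance alignment a real ASR comparison needs; the '_' joining convention follows the example above.

```python
def critical_samples(word_hyp, morph_hyp, reference):
    """Extract CRITICAL cases: aligned positions where the word-based and
    morpheme-based hypotheses disagree. Assumes pre-aligned sequences."""
    samples = []
    for w, m, ref in zip(word_hyp, morph_hyp, reference):
        if w != m.replace("_", ""):
            y = 1 if w == ref else 0   # 1: word right, morph wrong
            samples.append((w, tuple(m.split("_")), y))
    return samples

print(critical_samples(
    ["Yash", "cheghinglarda", "bilim"],
    ["Yash", "chegh_ing_da", "bilim"],
    ["Yash", "cheghinglarda", "bilim"]))
# [('cheghinglarda', ('chegh', 'ing', 'da'), 1)]
```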
63 Feature extraction from the aligned ASR results
Given all the training sample pairs (word w_i vs. its morpheme sequence m_i1 m_i2 …, where word ≠ morph; i: sample index), we extract binary features φ(w_i) and a desired value y_i.
Example: for the pair cheghinglarda / chegh_ing_lar_da with word result O and morph result X, y_i = 1.
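A sketch of the binary feature extraction; the feature naming scheme is illustrative, since the paper only specifies that unigram and bigram morpheme features are used (roughly 17K and 53K of them, per slide 68).

```python
def binary_features(morphs, use_bigrams=True):
    """Binary n-gram features of a morpheme sequence, as a set of names.
    Each feature fires (value 1) when its unigram/bigram is present."""
    feats = {f"uni:{m}" for m in morphs}
    if use_bigrams:
        feats |= {f"bi:{a}+{b}" for a, b in zip(morphs, morphs[1:])}
    return feats

print(sorted(binary_features(("chegh", "ing", "lar", "da"))))
```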
64 Feature extraction from the aligned ASR results
Supervised: the desired value y_i is set from the reference transcript.
Unsupervised: all CRITICAL cases (Word ≠ Morph) are used as positive samples, without reference transcripts.

Word ≠ Morph   percentage
X              68%
O              28.5%
               3.5%

All the training binary samples are extracted and fed to machine learning algorithms: perceptron, SVM, LR.
65 Discriminative approach: perceptron
For the perceptron, we define an evaluation function by applying the standard sigmoid to the linear estimation function:
g(w) = σ(Σ_k λ_k φ_k(w)),  σ(x) = 1 / (1 + e^(-x))
The weight vector is updated as:
λ ← λ + α (y_i - g(w_i)) φ(w_i)
With a large feature dimension, it easily converges to a local optimum.
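A minimal training sketch of the sigmoid-squashed perceptron with the update shown above; the learning rate, epoch count, and sample data are illustrative, and it reuses binary_features() from the earlier sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def g(lam, feats):
    """Evaluation function g(w): sigmoid of the sparse dot product."""
    return sigmoid(sum(lam.get(f, 0.0) for f in feats))

def train_perceptron(samples, epochs=20, alpha=0.5):
    """Delta-style update from the slide: lambda += alpha * (y - g(w)) * phi(w).
    `samples` are (feature_set, y) pairs; the weight vector is a sparse dict.
    Hyperparameters are illustrative."""
    lam = {}
    for _ in range(epochs):
        random.shuffle(samples)
        for feats, y in samples:
            delta = alpha * (y - g(lam, feats))
            for f in feats:
                lam[f] = lam.get(f, 0.0) + delta
    return lam

samples = [({"uni:chegh", "bi:chegh+ing"}, 1), ({"uni:el", "bi:el+ish"}, 0)]
lam = train_perceptron(samples)
print(g(lam, {"uni:chegh", "bi:chegh+ing"}))  # should move toward 1
```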
66 Discriminative approach: SVM & LR
Both methods solve the following unconstrained optimization problem, with different loss functions:
min_λ (1/2)‖λ‖² + C Σ_i ℓ(y_i, λ·φ(w_i))
For SVM, the loss function is the hinge loss: ℓ = max(0, 1 - y_i λ·φ(w_i)).
For logistic regression, the loss function is: ℓ = log(1 + exp(-y_i λ·φ(w_i))).
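A small sketch of the two loss terms side by side, with y in {-1, +1}. The slide does not show whether the plain or squared hinge is used, so the plain hinge is assumed here.

```python
import math

def hinge_loss(y, score):
    """SVM loss term: max(0, 1 - y*score), y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    """Logistic regression loss term: log(1 + exp(-y*score))."""
    return math.log1p(math.exp(-y * score))

# Both plug into the same objective:
#   min_lambda 0.5 * ||lambda||^2 + C * sum_i loss(y_i, lambda . phi(w_i))
for s in (-2.0, 0.0, 2.0):
    print(s, hinge_loss(+1, s), round(logistic_loss(+1, s), 3))
```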
67 Lexicon design
The learned features are then generalized to all units in the text corpus.
The concatenation process is repeated in sequence while the condition g(w) > 0.5 is met.
It can be applied to sub-words within a word boundary; the search continues while the condition is met (a sketch follows).
All the speech and text training corpora are the same as before.
A 4-gram LM and cutoff-5 are used for all experiments.
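A sketch of the sequential concatenation pass; it reuses g() and binary_features() from the earlier sketches. The greedy left-to-right order and the word-boundary bookkeeping via `word_ids` are assumptions, since the paper describes the order only as "in sequence".

```python
def concatenate(morphs, word_ids, lam):
    """Greedy left-to-right concatenation: keep merging a unit with its right
    neighbour while the merged candidate scores g(w) > 0.5 and both units
    come from the same source word."""
    out, i = [], 0
    while i < len(morphs):
        unit, j = morphs[i], i
        while (j + 1 < len(morphs)
               and word_ids[j] == word_ids[j + 1]  # stay within word boundary
               and g(lam, binary_features((unit, morphs[j + 1]))) > 0.5):
            unit += morphs[j + 1]
            j += 1
        out.append(unit)
        i = j + 1
    return out

# e.g. concatenate(["chegh", "ing", "lar", "da", "bilim"], [0, 0, 0, 0, 1], lam)
```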
68 Baseline ASR results with various units

Baseline models           WER (%)  Vocabulary size  OOV
Morph 4-gram, cutoff-2    27.92    55.2k            0.3%
Morph 4-gram, cutoff-5    28.11    27.4k            0.7%
Word 3-gram               25.72    227.9k           2.8%
Word 3-gram               26.64    108.1k           4.4%

The best ASR results are compared for feature extraction.
Outlier samples are removed when the frequency of a CRITICAL sample is less than a filtering threshold N.
The n-gram feature dimensions covered by the speech training corpus are:
unigram features: 17K; bigram features: 53K.
69 Flow chart of the discriminative approach (figure)
Speech training data is decoded by the ASR system (acoustic model with morph LM and word LM); features and weights are estimated from the aligned ASR results; lexical entries are selected from the text training data to build the enhanced LM.
70 Effect of the sample filtering threshold with the unigram feature

                          N=0     N=1    N=2    N=3    N=4    N=5
perceptron  WER (%)       26.69   25.93  25.87  26.18  26.28  26.54
            Lexicon size  104.5K  90.2K  74.8K  63.6K  55.3K  50.1K
LR          WER (%)       25.99   25.57  25.91  26.01  26.22  n/a
            Lexicon size  102.4K  91.2K  79.9K  70.1K  62.4K  56.5K
SVM         WER (%)       26.05   26.03  26.00  n/a    n/a    n/a
            Lexicon size  103.4K  94.6K  83.7K  73.5K  65.4K  59.2K

SVM and LR are more robust against less reliable samples.
71 Comparison of results on different units and features

Units                     Word              Sub-word
Features                  unigram  bigram   unigram  bigram
perceptron  WER (%)       25.87    25.99    25.96    25.27
            Lexicon size  74.8K    67.3K    40.7K    49.9K
LR          WER (%)       25.75    25.77    n/a      24.87
            Lexicon size  102.4K   85.4K    44.0K    65.8K
SVM         WER (%)       26.05    25.86    27.05    24.61
            Lexicon size  103.4K   80.1K    34.7K    55.1K

Sub-word optimization significantly reduces both WER and lexicon size.
72 Supervised and unsupervised feature extraction (sub-word, bigram feature)

                          supervised  unsupervised
perceptron  WER (%)       25.55       25.27
            lexicon size  49.7K       49.9K
LR          WER (%)       25.34       24.87
            lexicon size  46.3K       65.8K
SVM         WER (%)       25.42       24.61
            lexicon size  45.1K       55.1K

Unsupervised training treats all CRITICAL cases as positives (percentages as on slide 62) and is scalable to large speech data.
73 Summary of the results

Model                                    WER (%)  Lexicon size  OOV
baseline morpheme                        28.11    27.4k         0.7%
baseline word                            25.72    227.9k        2.8%
best MI method                           25.60    53.3k
SVM, sub-word bigram feature, cutoff-2   24.64    101.2k
SVM, sub-word bigram feature, cutoff-5   24.61    55.1k         0.9%

The optimized system is very stable and not much affected by the cutoff rate.
SVM & LR are more robust than the perceptron.
74 Conclusion (discriminative lexicon optimization)
A novel discriminative approach to lexicon optimization for highly inflectional languages.
It automatically optimizes the lexical units and outperformed the other methods, with the smallest WER and lexicon size.
SVM & LR are more effective than the perceptron.
Sub-word optimization based on the bigram feature produces the best result.
It can be trained on un-transcribed speech data.
75 Concluding remarks
Uyghur morphological analyzers and ASR systems for various morphological units were investigated and optimized.
Supervised and unsupervised morpheme segmentation methods and their ASR applications were compared.
A novel approach based on comparing two layers of ASR results was proposed.
The proposed discriminative approach significantly reduced both WER and lexicon size.
The ASR system based on the optimized lexicon is very stable compared to the baseline systems.
76 Discussion on the generality of the proposed method
The basic assumption is that the word unit outperforms the predefined morpheme unit; similar tendencies exist in other languages like Turkish and Korean.
For some languages, the predefined morphemes may already give good ASR performance and thus not satisfy the basic assumption:
units in Japanese are not clearly defined, so the extracted lexicon can already be optimal;
rule-based or statistical approaches may have already optimized the lexicon.
A larger training data size for resource-rich languages benefits longer units, which is consistent with our basic assumption.