Part-of-Speech Tagging for Bengali with Hidden Markov Model
Sandipan Dandapat, Sudeshna Sarkar
Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur

Machine Learning to Resolve POS Tagging
- HMM
  - Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
  - Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)
- Maximum Entropy (Ratnaparkhi, 96; etc.)
- TB(ED)L (Brill, 92, 94, 95; etc.)
- Decision Tree (Black, 92; Marquez, 97; etc.)

Our Approach
- HMM based
  - Simplicity of the model
  - Language independence
  - Reasonably good accuracy
  - Data intensive
  - Sparseness problem when extending the order
- We adopt a first-order HMM

POS Tagging Schema
Raw text → POS tagging → Tagged text
POS tagging components: Language Model, Disambiguation Algorithm, Possible POS Class Restriction, …

POS Tagging: Our Approach
Raw text → POS tagging → Tagged text
POS tagging components: First-order HMM, Disambiguation Algorithm, Possible POS Class Restriction, …
First-order HMM: the current state depends on the previous state

POS Tagging: Our Approach
Raw text → POS tagging → Tagged text
POS tagging components: First-order HMM with model parameters µ = (π, A, B), Disambiguation Algorithm, Possible POS Class Restriction, …

POS Tagging: Our Approach
Raw text → POS tagging → Tagged text
POS tagging components: First-order HMM µ = (π, A, B), Disambiguation Algorithm, POS class restriction $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$, …
$\{T\}$: set of all tags
$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer

POS Tagging: Our Approach
Raw text → POS tagging → Tagged text
POS tagging components: First-order HMM µ = (π, A, B), Viterbi Algorithm, POS class restriction $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$, …
$\{T\}$: set of all tags
$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer

Disambiguation Algorithm
Text:
Tags:
where $t_i \in \{T\}$, $\forall w_i$
$\{T\}$ = set of tags

Disambiguation Algorithm
Text:
Tags:
where $t_i \in T_{MA}(w_i)$, $\forall w_i$
$\{T\}$ = set of tags
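The disambiguation step can be illustrated with a small Viterbi decoder for a first-order HMM in which each word's candidate tags may be restricted. This is only a sketch under assumptions: the function name `viterbi`, the `allowed` callback (a stand-in for the morphological analyzer's tag set), the words `w1`/`w2`, and all probabilities are hypothetical and do not come from the slides.

```python
import math

def viterbi(words, tags, pi, A, B, allowed=None):
    """Most likely tag sequence under a first-order HMM (log-space Viterbi).

    pi[t], A[s][t] and B[t][w] hold probabilities. allowed(w) may return a
    restricted candidate tag set for word w; otherwise all tags are tried.
    """
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    def candidates(w):
        cands = allowed(w) if allowed else None
        return list(cands) if cands else list(tags)  # fall back to all tags

    best = [{}]  # best[i][t]: best log-score of a path ending in tag t at position i
    back = [{}]  # back[i][t]: previous tag on that best path
    for t in candidates(words[0]):
        best[0][t] = logp(pi.get(t, 0.0)) + logp(B[t].get(words[0], 0.0))
        back[0][t] = None
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in candidates(words[i]):
            prev, score = max(
                ((s, best[i - 1][s] + logp(A[s].get(t, 0.0))) for s in best[i - 1]),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + logp(B[t].get(words[i], 0.0))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model (all numbers are made up for illustration).
tags = ["NN", "VFM"]
pi = {"NN": 0.8, "VFM": 0.2}
A = {"NN": {"NN": 0.3, "VFM": 0.7}, "VFM": {"NN": 0.6, "VFM": 0.4}}
B = {"NN": {"w1": 0.5, "w2": 0.1}, "VFM": {"w1": 0.1, "w2": 0.6}}

unrestricted = viterbi(["w1", "w2"], tags, pi, A, B)
restricted = viterbi(["w1", "w2"], tags, pi, A, B,
                     allowed=lambda w: {"NN"} if w == "w2" else None)
```

In this toy setup the unrestricted search prefers `VFM` for `w2`, while the restricted search can only choose from the tags that `allowed` returns for it, which is the effect the tag restriction is meant to have.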

Learning HMM Parameters
- Supervised Learning (HMM-S)
  - Estimates the three parameters (π, A, B) directly from the tagged corpus
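One common way to estimate the three parameters from tagged text is by relative frequency. The sketch below assumes that reading; the function name `estimate_hmm` and the toy corpus are hypothetical, and no smoothing is applied here.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of (pi, A, B) from tagged sentences."""
    start = Counter()             # tag counts at sentence start -> pi
    trans = defaultdict(Counter)  # trans[prev][tag] counts      -> A
    emit = defaultdict(Counter)   # emit[tag][word] counts       -> B
    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    pi = normalize(start)
    A = {s: normalize(c) for s, c in trans.items()}
    B = {t: normalize(c) for t, c in emit.items()}
    return pi, A, B

# Toy tagged corpus (placeholder words, tags taken from the tagset names above).
corpus = [
    [("w1", "NN"), ("w2", "VFM")],
    [("w3", "JJ"), ("w1", "NN"), ("w2", "VFM")],
]
pi, A, B = estimate_hmm(corpus)
```

Unseen transitions and emissions simply do not appear in the dictionaries, which is why some form of smoothing is needed before decoding.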

Learning HMM Parameters
- Semi-supervised Learning (HMM-SS)
  - Untagged data (observations) are used to find the model that most likely produced the observation sequence
  - The initial model is created from the tagged training data
  - The model parameters are updated based on the initial model and the untagged data
  - New model parameters are estimated using the Baum-Welch algorithm
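The re-estimation loop can be sketched as a single Baum-Welch (EM) step over untagged sequences. This is a minimal, unscaled version suitable only for short toy sequences; the function names, the integer encoding of states and observations, and the initial numbers are assumptions. In the setting described above, the initial `pi`, `A`, `B` would come from the tagged data rather than being chosen by hand.

```python
import math

def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def normalize(row):
    total = sum(row)
    return [x / total for x in row] if total > 0 else row

def baum_welch_step(sequences, pi, A, B):
    """One EM update; also returns the log-likelihood under the input model."""
    n, m = len(pi), len(B[0])
    pi_num = [0.0] * n
    A_num = [[0.0] * n for _ in range(n)]
    B_num = [[0.0] * m for _ in range(n)]
    log_lik = 0.0
    for obs in sequences:
        alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
        lik = sum(alpha[-1])
        log_lik += math.log(lik)
        for t, o in enumerate(obs):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / lik  # P(state i at t | obs)
                if t == 0:
                    pi_num[i] += gamma
                B_num[i][o] += gamma
                if t + 1 < len(obs):
                    for j in range(n):  # expected i -> j transitions at t
                        A_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / lik)
    return (normalize(pi_num), [normalize(r) for r in A_num],
            [normalize(r) for r in B_num], log_lik)

# Toy initial model: 2 states, 3 observation symbols (made-up numbers).
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
untagged = [[0, 1, 2], [2, 2, 1, 0]]

pi1, A1, B1, ll0 = baum_welch_step(untagged, pi0, A0, B0)
_, _, _, ll1 = baum_welch_step(untagged, pi1, A1, B1)
```

Each step cannot decrease the likelihood of the untagged data, so comparing `ll0` and `ll1` is a simple sanity check; a practical implementation would add scaling or log-space arithmetic for long sentences.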

Smoothing and Unknown Word Hypothesis
- Not all emissions and transitions are observed in the training data
  - Add-one smoothing is used to estimate both emission and transition probabilities
- Not all words are known to the Morphological Analyzer
  - Unknown words are assumed to belong to the open-class grammatical categories
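Add-one smoothing can be sketched as follows for a single row of transition counts. The helper name `add_one` and the counts are hypothetical; the same function would apply to emission counts over a fixed vocabulary.

```python
from collections import Counter

def add_one(counts, outcomes):
    """Add-one (Laplace) estimate over a fixed set of outcomes."""
    total = sum(counts.get(o, 0) for o in outcomes) + len(outcomes)
    return {o: (counts.get(o, 0) + 1) / total for o in outcomes}

tags = ["NN", "VFM", "JJ"]
# Toy counts of tags following NN; JJ never follows NN in this made-up data.
trans_from_nn = Counter({"VFM": 3, "NN": 1})
probs = add_one(trans_from_nn, tags)
# VFM gets 4/7, NN gets 2/7 and the unseen JJ gets 1/7.
```

The unseen transition receives a small non-zero probability, so a path through it is no longer ruled out during decoding.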

Experiments
- Baseline Model
- Supervised bigram HMM (HMM-S)
  - HMM-S
  - HMM-S + IMA
  - HMM-S + CMA
- Semi-supervised bigram HMM (HMM-SS)
  - HMM-SS
  - HMM-SS + IMA
  - HMM-SS + CMA

Data Used
- Tagged data: 3085 sentences (~41,000 words)
  - Includes the data in both non-privileged and privileged mode
- Untagged corpus from CIIL: 11,000 sentences (100,000 words), unclean
  - Used to re-estimate the model parameters with the Baum-Welch algorithm

Tagset and Corpus Ambiguity
- Tagset consists of 27 grammatical classes
- Corpus ambiguity
  - Mean number of possible tags for each word
  - Measured in the tagged training data

Language   Dutch  Spanish  German  English  French  Bengali
Ambiguity  1.11   1.19     1.3     1.34     1.69    2.09
(Dermatas et al 1995)
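The ambiguity measure can be sketched as below. The slide does not say whether the mean is taken over word tokens or word types; this sketch assumes a token-weighted mean, and the function name and toy corpus are hypothetical.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_sentences):
    """Mean number of distinct tags per word token (token-weighted: an assumption)."""
    tags_of = defaultdict(set)
    tokens = []
    for sent in tagged_sentences:
        for word, tag in sent:
            tags_of[word].add(tag)
            tokens.append(word)
    return sum(len(tags_of[w]) for w in tokens) / len(tokens)

# Toy corpus: w1 is seen with two tags, the other words with one each.
toy = [
    [("w1", "NN"), ("w2", "VFM")],
    [("w1", "JJ"), ("w3", "NN")],
]
ambiguity = corpus_ambiguity(toy)  # (2 + 1 + 2 + 1) / 4 = 1.5
```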

Results on Development Set

Method         Accuracy (%)
Baseline       69.11
ACOPOST        83.45
HMM-S          74.53
HMM-S + IMA    78.65
HMM-S + CMA    88.83
HMM-SS         73.77
HMM-SS + IMA   77.98
HMM-SS + CMA   89.65

Error Analysis

Actual Class  Predicted Class  % of total error  % of class error
NNC           NN               14.2              4.0
VRB           VFM              7.1               8.7
JJ            NN               5.9               1.7
QF            JJ               5.1               3.7
RB            JJ               5.0               3.6
NLOC          NN               4.5               1.3
VNN           VFM              3.7               4.5

Results on Test Set
- Tested on 458 sentences (5127 words)
  - Precision: 84.32%
  - Recall: 84.36%
  - F β=1: 84.34%

Top 4 classes in terms of F-measure

Type    Precision (%)  Recall (%)  F β=1   Frequency
SYM     100            99.78       99.89   911
NEG     95.45          100         97.67   44
PRP     95.72          93.18       94.43   257
QFNUM   94.70          91.24       92.94   132

Results on Test Set (continued)

Bottom 4 classes in terms of F-measure

Type    Precision (%)  Recall (%)  F β=1   Frequency
VJJ     0              0           0       0
NVB     0              0           0       28
JVB     0              0           0       12
INF     100            12.5        22.22   1
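The F β=1 figures follow from precision and recall through the standard F-measure formula. The short sketch below (the helper name `f_beta` is hypothetical) recomputes three of the reported values from their precision and recall.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure; with beta = 1 this is the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

overall = round(f_beta(84.32, 84.36), 2)  # overall test-set figures
sym = round(f_beta(100, 99.78), 2)        # SYM row
inf = round(f_beta(100, 12.5), 2)         # INF row
```

The zero guard covers classes such as VJJ, NVB and JVB, where both precision and recall are 0 and the formula would otherwise divide by zero.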

Further Improvement
- Uses suffix information to handle unknown words
- Calculates the probability of a tag given the last m letters (suffix) of a word
- The symbol emission probabilities of each unknown word are normalized
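A simple way to realize the suffix idea is to count which tags occur with each word ending and back off from longer to shorter suffixes. The sketch below is an assumption about one possible implementation, not the method used in the talk: the function names, the back-off order, the uniform fallback, and the placeholder strings are all hypothetical.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_len=3):
    """Count tags for every word suffix of length 1..max_len."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for m in range(1, min(max_len, len(word)) + 1):
            counts[word[-m:]][tag] += 1
    return counts

def tag_given_suffix(word, counts, tags, max_len=3):
    """P(tag | longest seen suffix of word); uniform if no suffix was seen."""
    for m in range(min(max_len, len(word)), 0, -1):
        seen = counts.get(word[-m:])
        if seen:
            total = sum(seen.values())
            return {t: seen.get(t, 0) / total for t in tags}
    return {t: 1 / len(tags) for t in tags}

tags = ["NN", "VFM"]
# Placeholder strings standing in for training words (not real corpus data).
training = [("pqker", "VFM"), ("rsker", "VFM"), ("tuker", "NN"), ("vwla", "NN")]
model = train_suffix_model(training)

known_suffix = tag_given_suffix("zzker", model, tags)  # backs off to "ker"
no_suffix = tag_given_suffix("qqq", model, tags)       # nothing seen: uniform
```

Each returned distribution sums to one over the tag set, so it can be combined with the tag priors to obtain normalized emission scores for an unknown word.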

Further Improvement
- Accuracy reflected on development set

Conclusion and Future Scope
- Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available
- Semi-supervised learning performs better compared to supervised learning
- Better adjustment of emission probabilities can be adopted for both unknown words and less frequent words
- A higher-order Markov model can be adopted

Thank You
