Dept. of Computer Science & Engg. Indian Institute of Technology Kharagpur Part-of-Speech Tagging for Bengali with Hidden Markov Model Sandipan Dandapat,


1 Part-of-Speech Tagging for Bengali with Hidden Markov Model
Sandipan Dandapat, Sudeshna Sarkar
Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur

2 Machine Learning to Resolve POS Tagging
- HMM
  - Supervised (DeRose, 88; Mcteer, 91; Brants, 2000; etc.)
  - Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)
- Maximum Entropy (Ratnaparkhi, 96; etc.)
- TB(ED)L (Brill, 92, 94, 95; etc.)
- Decision Tree (Black, 92; Marquez, 97; etc.)

3 Our Approach
- HMM based
  - Simplicity of the model
  - Language independence
  - Reasonably good accuracy
  - Data intensive
  - Sparseness problem when extending the order
- We adopt a first-order HMM

4 POS Tagging Schema
[Diagram: Raw text → POS tagging → Tagged text; components: Language Model, Disambiguation Algorithm, Possible POS Class Restriction, …]

5 POS Tagging: Our Approach
[Diagram: Raw text → POS tagging → Tagged text; components: First-order HMM, Disambiguation Algorithm, Possible POS Class Restriction, …]
First-order HMM: the current state depends only on the previous state

6 POS Tagging: Our Approach
[Diagram: Raw text → POS tagging → Tagged text; components: First-order HMM, Disambiguation Algorithm, Possible POS Class Restriction, …]
Model parameters of the first-order HMM: µ = (π, A, B)

7 POS Tagging: Our Approach
[Diagram: Raw text → POS tagging → Tagged text; components: First-order HMM with µ = (π, A, B), Disambiguation Algorithm, …]
t_i ∈ {T} or t_i ∈ T_MA(w_i)
{T}: set of all tags
T_MA(w_i): set of tags computed by the Morphological Analyzer

8 POS Tagging: Our Approach
[Diagram: Raw text → POS tagging → Tagged text; components: First-order HMM with µ = (π, A, B), Viterbi Algorithm, …]
t_i ∈ {T} or t_i ∈ T_MA(w_i)
{T}: set of all tags
T_MA(w_i): set of tags computed by the Morphological Analyzer
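
The decoding step on slides 7 and 8 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `viterbi`, the dictionary layout chosen for π, A and B, and the `allowed` callback standing in for the morphological analyzer's T_MA(w_i) are all assumptions, and the toy words and probabilities are made up.

```python
import math

def viterbi(words, tags, pi, A, B, allowed=None):
    """Most probable tag sequence under a first-order HMM (pi, A, B).

    pi[t]: initial probability, A[p][t]: transition p -> t,
    B[t][w]: emission of word w from tag t. `words` must be non-empty.
    allowed(w) optionally returns the tag subset for w (hypothetical
    stand-in for T_MA(w)); otherwise every tag in `tags` is a candidate.
    """
    def logp(x):
        return math.log(x) if x > 0 else float("-inf")

    def candidates(w):
        return list(allowed(w)) if allowed else list(tags)

    # best[i][t] = (log-score of the best path ending in t at i, backpointer)
    best = [{t: (logp(pi.get(t, 0)) + logp(B[t].get(words[0], 0)), None)
             for t in candidates(words[0])}]
    for i in range(1, len(words)):
        column = {}
        for t in candidates(words[i]):
            emit = logp(B[t].get(words[i], 0))
            prev, score = max(
                ((p, s + logp(A[p].get(t, 0))) for p, (s, _) in best[-1].items()),
                key=lambda item: item[1],
            )
            column[t] = (score + emit, prev)
        best.append(column)

    # Follow backpointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy model (made-up numbers) over two tags and two placeholder words.
toy_tags = ["NN", "VFM"]
toy_pi = {"NN": 0.6, "VFM": 0.4}
toy_A = {"NN": {"NN": 0.3, "VFM": 0.7}, "VFM": {"NN": 0.8, "VFM": 0.2}}
toy_B = {"NN": {"w1": 0.5, "w2": 0.1}, "VFM": {"w1": 0.1, "w2": 0.6}}

unrestricted = viterbi(["w1", "w2"], toy_tags, toy_pi, toy_A, toy_B)
restricted = viterbi(["w1", "w2"], toy_tags, toy_pi, toy_A, toy_B,
                     allowed=lambda w: {"NN"} if w == "w2" else set(toy_tags))
```

Passing `allowed` shrinks the search to the analyzer's candidate tags for each word, which is the only difference between the two settings in this sketch.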

9 Disambiguation Algorithm
Text:
Tags:
where t_i ∈ {T}, ∀ w_i
{T} = set of tags

10 Disambiguation Algorithm
Text:
Tags:
where t_i ∈ T_MA(w_i), ∀ w_i
{T} = set of tags
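
For reference, the standard decoding objective of a first-order HMM tagger can be written as below. This is the textbook form, offered as an assumption about what the disambiguation step computes rather than as the authors' exact formula; the convention P(t_1 | t_0) = π(t_1) is likewise an assumption.

```latex
\hat{t}_1 \dots \hat{t}_n
  = \arg\max_{t_1 \dots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i),
\qquad P(t_1 \mid t_0) = \pi(t_1),
\qquad t_i \in \{T\} \ \text{or} \ t_i \in T_{MA}(w_i)
```

Here the transition terms correspond to A and the emission terms to B in µ = (π, A, B).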

11 Learning HMM Parameters
- Supervised Learning (HMM-S)
  - Estimates the three parameters (π, A, B) directly from the tagged corpus
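
One way to read "directly from the tagged corpus" is plain relative-frequency counting. The sketch below assumes that reading; the function name `estimate_hmm`, the input layout, and the toy corpus are hypothetical, and no smoothing is applied here.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of (pi, A, B) from tagged sentences.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                start[tag] += 1          # sentence-initial tag -> pi
            else:
                trans[prev][tag] += 1    # tag bigram -> A
            emit[tag][word] += 1         # (tag, word) pair -> B
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    pi = normalize(start)
    A = {p: normalize(c) for p, c in trans.items()}
    B = {t: normalize(c) for t, c in emit.items()}
    return pi, A, B

# Toy corpus with placeholder words (not taken from the presentation).
toy_corpus = [
    [("w1", "NN"), ("w2", "VFM")],
    [("w1", "NN"), ("w3", "NN"), ("w2", "VFM")],
]
est_pi, est_A, est_B = estimate_hmm(toy_corpus)
```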

12 Learning HMM Parameters
- Semi-supervised Learning (HMM-SS)
  - Untagged data (observations) are used to find the model that most likely produced the observation sequence
  - The initial model is created from the tagged training data
  - The model parameters are updated based on the initial model and the untagged data
  - New model parameters are estimated using the Baum-Welch algorithm
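
A single Baum-Welch re-estimation step might look like the sketch below. It is a generic scaled forward-backward update on one observation sequence, not the authors' code: the function name `baum_welch_step`, the array layout, and the toy numbers are assumptions, and a real run would accumulate statistics over all untagged sentences starting from the supervised model.

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch (EM) re-estimation step on a single observation sequence.

    obs: sequence of word indices; pi: (N,), A: (N, N), B: (N, V) float arrays.
    Returns updated (pi, A, B) and the log-likelihood under the input model.
    """
    obs = np.asarray(obs)
    n_steps, n_states = len(obs), len(pi)

    # Scaled forward pass: each alpha[t] sums to 1, scale[t] keeps the factor.
    alpha = np.zeros((n_steps, n_states))
    scale = np.zeros(n_steps)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n_steps):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Scaled backward pass using the same scaling factors.
    beta = np.ones((n_steps, n_states))
    for t in range(n_steps - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    # gamma[t, i]: posterior of state i at t; xi[t, i, j]: posterior of i -> j.
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]
          / scale[1:, None, None])

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, np.log(scale).sum()

# Toy initial model (made-up numbers): 2 states, 3 observation symbols.
pi0 = np.array([0.6, 0.4])
A0 = np.array([[0.7, 0.3], [0.4, 0.6]])
B0 = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
toy_obs = [0, 1, 2, 2, 1, 0]
pi1, A1, B1, loglik0 = baum_welch_step(toy_obs, pi0, A0, B0)
```

Repeating the step with the updated parameters should not decrease the log-likelihood of the untagged data, which is the property that makes the re-estimation usable as an iterative procedure.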

13 Smoothing and Unknown Word Hypothesis
- Not all emissions and transitions are observed in the training data
  - Add-one smoothing is used to estimate both emission and transition probabilities
- Not all words are known to the Morphological Analyzer
  - Unknown words are assumed to belong to the open-class grammatical categories
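
The two ideas on this slide can be sketched as below. The helper names `add_one` and `candidate_tags`, the dictionary used as a stand-in for the morphological analyzer, and the particular open-class tag set are hypothetical; the slides do not list which categories count as open class.

```python
from collections import Counter

def add_one(counts, outcomes):
    """Add-one (Laplace) smoothed distribution over a fixed outcome set."""
    total = sum(counts.get(o, 0) for o in outcomes) + len(outcomes)
    return {o: (counts.get(o, 0) + 1) / total for o in outcomes}

def candidate_tags(word, analyzer_lexicon, open_class_tags):
    """Tags allowed for `word`: the analyzer's tags if it knows the word,
    otherwise the (assumed) open-class categories."""
    return analyzer_lexicon.get(word) or set(open_class_tags)

# Toy transition counts out of one tag; "JJ" was never observed after it.
toy_counts = Counter({"NN": 3, "VFM": 1})
smoothed = add_one(toy_counts, ["NN", "VFM", "JJ"])

# Toy analyzer lexicon and open-class set (illustrative only).
toy_lexicon = {"w1": {"NN", "VFM"}}
known = candidate_tags("w1", toy_lexicon, {"NN", "JJ"})
unknown = candidate_tags("w9", toy_lexicon, {"NN", "JJ"})
```

With add-one smoothing the unseen outcome receives a small non-zero probability, so a path through it is penalized rather than ruled out.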

14 Experiments
- Baseline Model
- Supervised bigram HMM (HMM-S)
  - HMM-S
  - HMM-S + IMA
  - HMM-S + CMA
- Semi-supervised bigram HMM (HMM-SS)
  - HMM-SS
  - HMM-SS + IMA
  - HMM-SS + CMA

15 Data Used
- Tagged data: 3085 sentences (~41,000 words)
  - Includes the data in both non-privileged and privileged mode
- Untagged corpus from CIIL: 11,000 sentences (100,000 words), unclean
  - Used to re-estimate the model parameters with the Baum-Welch algorithm

16 Tagset and Corpus Ambiguity
- Tagset consists of 27 grammatical classes
- Corpus Ambiguity
  - Mean number of possible tags for each word
  - Measured in the tagged training data

| Dutch | Spanish | German | English | French | Bengali |
| --- | --- | --- | --- | --- | --- |
| 1.11 | 1.19 | 1.3 | 1.34 | 1.69 | 2.09 |

(Dermatas et al 1995)
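
The corpus ambiguity measure could be computed along the lines below. This sketch assumes a token-weighted mean (each running word contributes the number of distinct tags seen for its word form in the tagged data); the slide does not say whether the mean is taken over tokens or over word types, and the function name `corpus_ambiguity` and toy data are hypothetical.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_tokens):
    """Token-weighted mean number of distinct tags observed per word form."""
    tags_of = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_of[word].add(tag)
    return sum(len(tags_of[w]) for w, _ in tagged_tokens) / len(tagged_tokens)

# Toy data: "w1" is seen with two tags, "w2" and "w3" with one each.
toy_tokens = [("w1", "NN"), ("w1", "VFM"), ("w2", "NN"), ("w3", "JJ")]
toy_ambiguity = corpus_ambiguity(toy_tokens)
```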

17 Results on Development set

18 Results on Development set

| Method | Accuracy |
| --- | --- |
| Baseline | 69.11 |
| ACOPOST | 83.45 |
| HMM-S | 74.53 |
| HMM-S + IMA | 78.65 |
| HMM-S + CMA | 88.83 |
| HMM-SS | 73.77 |
| HMM-SS + IMA | 77.98 |
| HMM-SS + CMA | 89.65 |

19 Error Analysis

| Actual Class | Predicted Class | % of total error | % of class error |
| --- | --- | --- | --- |
| NNC | NN | 14.2 | 4.0 |
| VRB | VFM | 7.1 | 8.7 |
| JJ | NN | 5.9 | 1.7 |
| QF | JJ | 5.1 | 3.7 |
| RB | JJ | 5.0 | 3.6 |
| NLOC | NN | 4.5 | 1.3 |
| VNN | VFM | 3.7 | 4.5 |

20 Results on Test Set
- Tested on 458 sentences (5127 words)
  - Precision: 84.32%
  - Recall: 84.36%
  - F β=1: 84.34%

Top 4 classes in terms of F-measure

| Type | Precision (%) | Recall (%) | F β=1 | Frequency |
| --- | --- | --- | --- | --- |
| SYM | 100 | 99.78 | 99.89 | 911 |
| NEG | 95.45 | 100 | 97.67 | 44 |
| PRP | 95.72 | 93.18 | 94.43 | 257 |
| QFNUM | 94.70 | 91.24 | 92.94 | 132 |

21 Results on Test Set
- Tested on 458 sentences (5127 words)
  - Precision: 84.32%
  - Recall: 84.36%
  - F β=1: 84.34%

Bottom 4 classes in terms of F-measure

| Type | Precision (%) | Recall (%) | F β=1 | Frequency |
| --- | --- | --- | --- | --- |
| VJJ | 0 | 0 | 0 | 0 |
| NVB | 0 | 0 | 0 | 28 |
| JVB | 0 | 0 | 0 | 12 |
| INF | 100 | 12.5 | 22.22 | 1 |
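
The F β=1 column is the harmonic mean of precision and recall. A small sketch, with the hypothetical helper name `f_measure`, that can be used to check the reported figures:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta from precision and recall given on the same scale (e.g. percent)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

overall_f = round(f_measure(84.32, 84.36), 2)   # overall test-set figures
inf_f = round(f_measure(100, 12.5), 2)          # INF row
sym_f = round(f_measure(100, 99.78), 2)         # SYM row
```

The rounded values agree with the F β=1 figures reported on the test-set slides for these rows.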

22 Further Improvement
- Uses suffix information to handle unknown words
- Calculates the probability of a tag given the last m letters (suffix) of a word
- The emission probability of each symbol for an unknown word is normalized
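
A rough sketch of the suffix idea follows, assuming P(tag | suffix) is estimated by relative frequency over the tagged words. The function names, the fixed suffix length, and the English toy words are hypothetical; how the slides turn these probabilities into normalized emission probabilities is not specified and is not shown here.

```python
from collections import Counter, defaultdict

def suffix_tag_model(tagged_words, m):
    """Estimate P(tag | last m letters of the word) by relative frequency."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word[-m:]][tag] += 1
    return {
        suffix: {tag: n / sum(c.values()) for tag, n in c.items()}
        for suffix, c in counts.items()
    }

def tag_probs_for_unknown(word, model, m):
    """Tag distribution for an unseen word from its suffix; empty if the suffix is unseen."""
    return model.get(word[-m:], {})

# Toy training pairs (illustrative English words, not from the Bengali corpus).
toy_pairs = [("walked", "VFM"), ("talked", "VFM"), ("naked", "JJ")]
toy_model = suffix_tag_model(toy_pairs, m=2)
guess = tag_probs_for_unknown("jumped", toy_model, m=2)
```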

23 Further Improvement
- Accuracy reflected on development set

24 Conclusion and Future Scope
- Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available
- Semi-supervised learning performs better than supervised learning
- Better adjustment of the emission probability can be adopted for both unknown words and less frequent words
- A higher-order Markov model can be adopted

25 Thank You
