Dept. of Computer Science & Engg. Indian Institute of Technology Kharagpur Part-of-Speech Tagging and Chunking with Maximum Entropy Model Sandipan Dandapat Department of Computer Science & Engineering Indian Institute of Technology Kharagpur
Goal
- Lexical Analysis: Part-Of-Speech (POS) Tagging, i.e. assigning a part of speech (e.g. Noun, Verb...) to each word
- Syntactic Analysis: Chunking, i.e. identifying and labeling phrases as verb phrase, noun phrase, etc.
Machine Learning to Resolve POS Tagging and Chunking
- HMM
  - Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
  - Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)
- Maximum Entropy (Ratnaparkhi, 96; etc.)
- TB(ED)L (Brill, 92, 94, 95; etc.)
- Decision Tree (Black, 92; Marquez, 97; etc.)
Our Approach
- Maximum Entropy based
  - Diverse and overlapping features
  - Language independence
  - Reasonably good accuracy
  - Data intensive
  - Absence of sequence information
POS Tagging Schema
[Figure: Raw text -> POS tagging (Language Model, Possible POS Class Restriction, Disambiguation Algorithm, …) -> Tagged text]
POS Tagging: Our Approach
[Figure: Raw text -> POS tagging (ME Model, Possible POS Class Restriction, Disambiguation Algorithm, …) -> Tagged text]
- ME Model: the current state depends on the history (features)
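The dependence of the current tag on its history is usually written in the standard conditional maximum entropy form. The slide does not show a formula, so the notation below ($h_i$ for the history of word $w_i$, $f_j$ for binary feature functions, $\lambda_j$ for their weights) is an assumption for illustration, not the author's own notation.

```latex
p(t_i \mid h_i) = \frac{1}{Z(h_i)} \exp\Big( \sum_j \lambda_j f_j(h_i, t_i) \Big),
\qquad
Z(h_i) = \sum_{t \in \{T\}} \exp\Big( \sum_j \lambda_j f_j(h_i, t) \Big)
```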
Learning ME Model
- GIS (Generalized Iterative Scaling): finds the model parameters that define the maximum entropy classifier for a given feature set and training corpus
- The parameters of the ME model are estimated using an off-the-shelf toolkit ( )
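To make the GIS step concrete, here is a minimal sketch in plain Python. It is not the off-the-shelf toolkit mentioned on the slide; the function names `predict` and `train_gis`, the data layout (each sample is a set of active context features plus a gold label) and the omission of the usual correction feature are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def predict(weights, feats, labels):
    """Return p(label | feats) under a log-linear model with the given weights."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_gis(samples, labels, iterations=50):
    """samples: list of (set of active context features, gold label)."""
    # Empirical count of every (context feature, label) pair seen in training.
    empirical = defaultdict(float)
    for feats, gold in samples:
        for f in feats:
            empirical[(f, gold)] += 1.0
    # GIS constant: the largest number of features active for any sample.
    c = max(len(feats) for feats, _ in samples)
    weights = {key: 0.0 for key in empirical}
    for _ in range(iterations):
        # Expected count of each feature under the current model.
        expected = defaultdict(float)
        for feats, _ in samples:
            for y, p in predict(weights, feats, labels).items():
                for f in feats:
                    if (f, y) in weights:
                        expected[(f, y)] += p
        # GIS update: move each weight by (1/C) * log(empirical / expected).
        for key in weights:
            weights[key] += math.log(empirical[key] / expected[key]) / c
    return weights
```

On a toy corpus where a hypothetical feature `suf=ing` co-occurs with one tag twice and another tag once, the trained model should assign roughly two thirds of the probability mass to the more frequent tag.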
POS Tagging: Our Approach
[Figure: Raw text -> POS tagging (ME Model, Disambiguation Algorithm, …) -> Tagged text]
- $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
- $\{T\}$: set of all tags
- $T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer
POS Tagging: Our Approach
[Figure: Raw text -> POS tagging (ME Model, Beam Search, …) -> Tagged text]
- $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
- $\{T\}$: set of all tags
- $T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer
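A beam search over tag sequences with an optional morphological restriction could look roughly like the sketch below. The function name, the `tag_prob(words, i, prev_tag, tag)` callback standing in for the ME model, the `allowed_tags(word)` callback standing in for the morphological analyzer, and the beam width of 3 are all hypothetical; the slides do not give these details. The sketch also assumes the model returns a positive probability for at least one candidate at every position.

```python
import math

def beam_search(words, tag_prob, all_tags, allowed_tags=None, beam_width=3):
    """Return the highest-scoring tag sequence found by beam search.

    tag_prob(words, i, prev_tag, tag) -> probability of `tag` at position i.
    allowed_tags(word) -> iterable of tags permitted for `word`, or None/empty
    when there is no restriction (then the full tag set is used).
    """
    beam = [(0.0, [])]  # each entry: (log-probability, tags chosen so far)
    for i, word in enumerate(words):
        candidates = list(allowed_tags(word) or []) if allowed_tags else []
        if not candidates:
            candidates = list(all_tags)  # fall back to the full tag set {T}
        expanded = []
        for log_p, tags in beam:
            prev_tag = tags[-1] if tags else "<s>"
            for tag in candidates:
                p = tag_prob(words, i, prev_tag, tag)
                if p > 0.0:
                    expanded.append((log_p + math.log(p), tags + [tag]))
        # Keep only the `beam_width` best partial sequences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]
```

Passing `allowed_tags=None` corresponds to the unrestricted case $t_i \in \{T\}$, while supplying a callback corresponds to $t_i \in T_{MA}(w_i)$.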
Disambiguation Algorithm
- Text: [word sequence not preserved]
- Tags: [tag sequence not preserved]
- where $t_i \in \{T\}$ for every $w_i$, and $\{T\}$ = set of tags
Disambiguation Algorithm
- Text: [word sequence not preserved]
- Tags: [tag sequence not preserved]
- where $t_i \in T_{MA}(w_i)$ for every $w_i$, and $\{T\}$ = set of tags
What are Features?
- Feature function: a binary function of the history and the target
- Example: [not preserved]
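As an illustration of a binary function of the history and target, the sketch below defines two hypothetical feature functions. The history layout (a dict with `word` and `prev_tag` keys) and the specific conditions are assumptions, not the example originally shown on the slide.

```python
def f_prev_jj_target_nn(history, tag):
    """Fires (returns 1) when the previous tag is JJ and the target tag is NN."""
    return 1 if history.get("prev_tag") == "JJ" and tag == "NN" else 0

def f_suffix_ing_target_vfm(history, tag):
    """Fires when the current word ends in 'ing' and the target tag is VFM."""
    return 1 if history.get("word", "").endswith("ing") and tag == "VFM" else 0

# Each feature looks at both the history and the candidate (target) tag.
history = {"word": "house", "prev_tag": "JJ"}
active = [f(history, "NN") for f in (f_prev_jj_target_nn, f_suffix_ing_target_vfm)]
```

Here only the first feature fires for the candidate tag `NN`, so `active` holds one 1 and one 0.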
POS Tagging Features
[Figure: context window over positions i-3 … i+3 (rows: pos, word, POS_Tag) with words W1 … W7 and POS tags T1 … T7; T4 is the estimated tag]
- Feature set: 40 different experiments were conducted, taking several combinations from the set 'F'
POS Tagging Features
[Figure: context window over positions i-3 … i+3 (rows: pos, word, POS_Tag) with words and POS tags; an "Estimated Tag" label marks the tag being predicted]
Feature Set
Condition                         Features
Static features for all words     Current word ($w_i$); previous word ($w_{i-1}$); next word ($w_{i+1}$); |prefix| ≤ 4; |suffix| ≤ 4
Dynamic features for all words    POS tag of previous word ($t_{i-1}$)
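The feature set in the table can be sketched as a small extraction function. The function name, the string encoding of features (`w=`, `pre=`, and so on) and the `<s>`/`</s>` boundary markers are assumptions chosen for illustration.

```python
def pos_features(words, i, prev_tag):
    """Collect the static and dynamic features for the word at position i."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    feats = {
        "w=" + word,          # current word
        "w-1=" + prev_word,   # previous word
        "w+1=" + next_word,   # next word
        "t-1=" + prev_tag,    # dynamic feature: POS tag of the previous word
    }
    # All prefixes and suffixes of length 1 to 4 (or shorter for short words).
    for k in range(1, min(4, len(word)) + 1):
        feats.add("pre=" + word[:k])
        feats.add("suf=" + word[-k:])
    return feats
```

The previous tag is called dynamic because it is only known at tagging time, once the tagger has committed to a tag for position i-1.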
Chunking Features
[Figure: context window over positions i-3 … i+3 (rows: pos, word, POS_Tag, Chunk_Tag) with words W1 … W7, POS tags T1 … T7 and chunk tags C1 … C7; an "Estimated Tag" label marks the tag being predicted]
Feature Set
- Static features for all words
  - Current word ($w_i$)
  - POS tag of the current word ($t_i$)
  - POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
  - POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)
- Dynamic features for all words
  - Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)
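The chunking feature set can be sketched in the same style. The function name, the feature string encoding and the `<pad>` marker for positions outside the sentence are hypothetical, as are the tag names used in any example input.

```python
def chunk_features(words, pos_tags, i, prev_chunks):
    """Collect chunking features for position i.

    prev_chunks holds the chunk tags already assigned to positions 0..i-1.
    """
    def pos_at(j):
        # POS tag at position j, or a padding marker outside the sentence.
        return pos_tags[j] if 0 <= j < len(pos_tags) else "<pad>"

    # Pad on the left so the last two entries always exist.
    padded = ["<pad>", "<pad>"] + list(prev_chunks)
    return {
        "w=" + words[i],           # current word
        "t=" + pos_at(i),          # POS tag of the current word
        "t-1=" + pos_at(i - 1),    # POS tags of the previous two words
        "t-2=" + pos_at(i - 2),
        "t+1=" + pos_at(i + 1),    # POS tags of the next two words
        "t+2=" + pos_at(i + 2),
        "c-1=" + padded[-1],       # dynamic: chunk tags of the previous two words
        "c-2=" + padded[-2],
    }
```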
Experiments: POS tagging
- Baseline Model
- Maximum Entropy Model
  - ME (Bengali, Hindi and Telugu)
  - ME + IMA (Bengali)
  - ME + CMA (Bengali)
Data Used
Language               Bengali   Hindi    Telugu
Training data          20,396    21,470   21,416
Development data       5,023     5,681    6,098
Test data              5,226     4,924    5,193
No. of POS tags        2725 [column split not preserved]
No. of Chunk labels    6         7        6
Tagset and Corpus Ambiguity
- Tagset consists of 27 grammatical classes
- Corpus ambiguity: mean number of possible tags for each word, measured in the tagged training data
Language            Dutch   German   English   French   Bengali   Hindi   Telugu
Corpus Ambiguity    [values not preserved]
Accuracy            96%     97%      96.5%     94.5%    ?         ?       ?
Unknown Words       13%     9%       11%       5%       33%       21%     56%
(Dermatas et al 1995)
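One way to compute corpus ambiguity from tagged training data is sketched below. The slide does not say whether the mean is taken over tokens or over distinct word types; this sketch assumes a token-level mean, and the function name and input layout are hypothetical.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_tokens):
    """Mean number of possible tags per token.

    tagged_tokens: list of (word, tag) pairs from the tagged training data.
    The possible tags of a word are all tags it carries anywhere in the data.
    """
    tags_of = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_of[word].add(tag)
    total = sum(len(tags_of[word]) for word, _ in tagged_tokens)
    return total / len(tagged_tokens)
```

For a toy corpus of four tokens in which one word appears twice with two different tags, the two occurrences each count 2 and the other two tokens count 1, giving (2 + 2 + 1 + 1) / 4 = 1.5.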
POS Tagging Results on Development Set: Overall Accuracy
Language            Bengali   Hindi    Telugu
Corpus Ambiguity    [values not preserved]
Accuracy            79.74%    83.10%   67.12%
Unknown Words       33%       21%      56%
POS Tagging Results on Development Set
[Figure: accuracy for Known Words, Unknown Words, and Overall Accuracy]
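The three figures in this breakdown can be computed as in the sketch below, which assumes that a word counts as known when it occurs in the training vocabulary. The function name and argument layout are hypothetical.

```python
def accuracy_breakdown(gold, predicted, train_vocab):
    """Return accuracy on known words, unknown words, and overall.

    gold: list of (word, gold_tag); predicted: list of predicted tags;
    train_vocab: set of words seen in the training data.
    A category with no tokens gets None instead of a number.
    """
    counts = {"known": [0, 0], "unknown": [0, 0], "overall": [0, 0]}
    for (word, gold_tag), pred_tag in zip(gold, predicted):
        group = "known" if word in train_vocab else "unknown"
        for key in (group, "overall"):
            counts[key][0] += int(gold_tag == pred_tag)  # correct
            counts[key][1] += 1                          # total
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}
```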
POS Tagging Results - Bengali
[Figure not preserved]
Results on Development Set
Method      Bengali         Hindi           Telugu
Baseline    [values not preserved]
ME          (89.3, 60.5)    (90.9, 53.7)    ( )
ME + IMA    (84.2, 82.1)    -               -
ME + CMA    (89.3, 86.2)    -               -
Chunking Results
- Two different measures
  - Per word basis
  - Per chunk basis: correctly identified groups along with correctly labeled groups
Evaluation Criteria   Method        Bengali, Hindi, Telugu
Per word basis        ME + I_POS    [values not preserved]
Per chunk basis       ME + I_POS    87.3, , ,56.7
Per chunk basis       ME + C_POS    93.3, ,74.4-
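The difference between the two measures can be sketched as follows. The slides do not specify the chunk tag encoding or the exact per-chunk formula, so this sketch assumes BIO-style tags such as `B-NP` and `I-NP` and scores a gold chunk as correct only when both its boundaries and its label match; all function names are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (start, end, label) chunks encoded by BIO-style tags."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues and start is not None:
            chunks.add((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return chunks

def per_word_accuracy(gold, pred):
    """Fraction of words whose chunk tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def per_chunk_accuracy(gold, pred):
    """Fraction of gold chunks reproduced with the same span and label."""
    gold_chunks = extract_chunks(gold)
    return len(gold_chunks & extract_chunks(pred)) / len(gold_chunks)
```

A single wrong tag inside a multi-word chunk costs one word under the first measure but the whole chunk under the second, which is why per-chunk figures are typically lower.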
Assessment of Error Types
[Percentage values in the "% of total error" and "% of class error" columns are not preserved]
Bengali
Predicted Class   Actual Class   % of total error   % of class error
NN                NNC
NN                JJ
NN                NNP
VFM               VRB
NNP               NNPC
Hindi
Predicted Class   Actual Class   % of total error   % of class error
NN                NNP
NN                JJ
NN                NNC
JJ                NN
VFM               VAUX
Telugu
Predicted Class   Actual Class   % of total error   % of class error
NN                JJ
NN                NNP
PREP              NLOC
NN                RB
Results on Test Set
- Bengali data has been tagged using the ME + IMA model
- Hindi and Telugu data have been tagged with the simple ME model
- Chunk accuracy has been measured on a per-word basis
Language   Number of Words   POS Tagging Accuracy   Chunking Accuracy
Bengali    [values not preserved]
Hindi      [values not preserved]
Telugu     [values not preserved]
Conclusion and Future Scope
- Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available
- Performance on Hindi and Telugu can be improved using morphological analyzers for those languages
- Linguistic prefix and suffix information can be adopted
- More features can be explored for chunking
Thank You