Hindi Parts-of-Speech Tagging & Chunking Baskaran S MSRI.


1 Hindi Parts-of-Speech Tagging & Chunking Baskaran S MSRI

2 4 July 2006 NWAI. What's in? Why POS tagging & chunking? Approach. Challenges: unseen tag sequences, unknown words. Results. Future work. Conclusion.

3 Intro & Motivation

4 POS (Parts-of-Speech): Dionysius Thrax (ca. 100 BC) described 8 types: noun, verb, pronoun, preposition, adverb, conjunction, participle and article. "I get my thing in action. (Verb, that's what's happenin') To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)" (Schoolhouse Rock)

5 Tagging: assigning the appropriate POS or lexical class marker to each word in a given text. Symbols, punctuation markers etc. are also assigned specific tag(s).

6 Why POS tagging? It gives significant information about a word and its neighbours (an adjective occurs near a noun, an adverb near a verb). It gives a clue to how a word is pronounced (OBject as a noun, obJECT as a verb). Applications: speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.

7 Chunking: identifying simple phrases (noun phrase, verb phrase, adjectival phrase…). Useful as a first step to parsing and to named entity recognition.

8 POS tagging & Chunking

9 Stochastic approaches: made possible by the availability of tagged corpora in large quantity. Most are based on HMMs: Weischedel ’93; DeRose ’88; Skut and Brants ’98 (extending HMM to chunking); Zhou and Su ’00; and lots more…

10 HMM: two components, the tag-sequence probability and the word-emit probability, both estimated from an annotated corpus. Assumptions: the probability of a word depends only on its tag; the tag history is approximated by the two most recent tags.
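The two HMM components on this slide can be illustrated with a minimal sketch. It assumes plain relative-frequency estimates over a trigram tag history, with no smoothing; the function names, the `<s>` padding symbol and the two-sentence toy corpus are illustrative assumptions, not material from the talk.

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """Collect the counts behind a trigram HMM tagger from (word, tag) sentences."""
    trigram_counts = Counter()   # (t[i-2], t[i-1], t[i])
    context_counts = Counter()   # (t[i-2], t[i-1])
    emit_counts = Counter()      # (tag, word)
    tag_counts = Counter()       # tag
    for sentence in tagged_sentences:
        # Pad with two start symbols so the first real tag has a full history.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            trigram_counts[(tags[i - 2], tags[i - 1], tags[i])] += 1
            context_counts[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sentence:
            emit_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return trigram_counts, context_counts, emit_counts, tag_counts

def transition_prob(counts, t2, t1, t):
    """Tag-sequence probability P(t | t2, t1), unsmoothed."""
    trigram_counts, context_counts, _, _ = counts
    context = context_counts[(t2, t1)]
    return trigram_counts[(t2, t1, t)] / context if context else 0.0

def emission_prob(counts, tag, word):
    """Word-emit probability P(word | tag), unsmoothed."""
    _, _, emit_counts, tag_counts = counts
    return emit_counts[(tag, word)] / tag_counts[tag] if tag_counts[tag] else 0.0

# Toy corpus (hypothetical taggings, for illustration only).
corpus = [
    [("जीवन", "NN"), ("का", "PREP"), ("रूप", "NN")],
    [("रूप", "NN"), ("आया", "VFM")],
]
counts = train_hmm(corpus)
print(transition_prob(counts, "<s>", "NN", "PREP"))
print(emission_prob(counts, "NN", "रूप"))
```

Unsmoothed estimates like these give zero probability to any tag trigram or word not seen in training, which is the problem the later slides on unseen tag sequences and unseen words address.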

11 Structural tags: a triple of POS tag, structural relation and chunk tag. Originally proposed by Skut & Brants ’98. Seven relations. Enables embedded and overlapping chunks.

12 Structural relations. Example sentence: परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ । [Figure: fragments of the sentence (परीक्षा में, परीक्षा, श्रेणी प्राप्त, ।) annotated with the relation codes 00, 90, 09 and 99 and the labels NP, VG, SSF, Beg and End.]

13 Decoding: Viterbi is mostly used (also A* or stack decoding). It aims at finding the best path (tag sequence) given the observation sequence. Possible tags are identified for each transition, with associated probabilities. The best path is the one that maximizes the product of these transition probabilities.
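The decoding step described above can be sketched as follows. This is a simplified illustration, not the tagger from the talk: it assumes a bigram tag history (the slides use the two most recent tags), works in log space to avoid underflow, and uses hand-set toy probabilities; all names and numbers are hypothetical.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a bigram HMM (log space)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of any tag path ending in tag t at word i.
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximizes the path score into t.
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(emit_p[t].get(words[i], 0))
            back[i][t] = prev
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy model (hypothetical probabilities, for illustration only).
tags = ["NN", "PREP"]
start_p = {"NN": 0.8, "PREP": 0.2}
trans_p = {"NN": {"NN": 0.3, "PREP": 0.7}, "PREP": {"NN": 0.9, "PREP": 0.1}}
emit_p = {"NN": {"जीवन": 0.5, "रूप": 0.5}, "PREP": {"का": 1.0}}
best_path = viterbi(["जीवन", "का", "रूप"], tags, start_p, trans_p, emit_p)
print(best_path)
```

Summing log-probabilities is equivalent to maximizing the product of probabilities mentioned on the slide, and it keeps long sentences from underflowing to zero.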

14-16 Decoding example (trellis figure, shown over three slides). Sentence: अब जीवन का एक अन्य रूप उनके सामने आया । Candidate tags: JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM, SYM.

17 Issues

18 1. Unseen tag sequences: smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation). The idea is to redistribute a fraction of the probability mass of seen events to unseen ones. Good-Turing re-estimates the probability mass of lower-count N-grams from that of higher counts: c* = (c + 1) N_{c+1} / N_c, where N_c is the number of N-grams occurring c times.
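A minimal sketch of the Good-Turing re-estimation mentioned on this slide, using the standard formula c* = (c + 1) N_{c+1} / N_c. The toy tag-bigram counts are hypothetical, and falling back to the raw count when N_{c+1} is zero is a simplifying assumption; practical implementations usually smooth the N_c values first.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count c to c* = (c + 1) * N_{c+1} / N_c."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c: how many N-grams occur c times
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        # Assumption: keep the raw count when no N-gram occurs c + 1 times.
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    return adjusted

def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen N-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    singletons = sum(1 for c in ngram_counts.values() if c == 1)
    return singletons / total

# Toy tag-bigram counts (hypothetical, for illustration only).
bigram_counts = {
    ("NN", "PREP"): 3,
    ("PREP", "NN"): 2,
    ("QFN", "JJ"): 2,
    ("NN", "VFM"): 1,
    ("JJ", "NN"): 1,
    ("RB", "NN"): 1,
}
adjusted = good_turing_counts(bigram_counts)
print(adjusted)
print(unseen_mass(bigram_counts))
```

In this toy example the singletons are discounted from 1 to about 1.33 only because the sample is tiny; on realistic corpora N_1 is much larger than N_2, so low counts shrink and the freed mass goes to unseen sequences.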

19 2. Unseen words: no corpus is sufficient (unknown words remain even after 10 mn words). Not all of them are proper names. Treat them as rare words that occur once in the corpus (Baayen and Sproat ’96, Dermatas and Kokkinakis ’95). Experiment: a known Hindi corpus of 25 K words and an unseen corpus of 6 K words, comparing all words vs. hapax vs. unknown.
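The idea of treating unknown words like words seen once can be sketched as below: the tag distribution of hapax legomena in the training data stands in for P(tag | unknown word). The function name and the toy data are assumptions for illustration, not the experiment reported in the talk.

```python
from collections import Counter

def hapax_tag_distribution(tagged_words):
    """Estimate P(tag | unknown word) from words seen exactly once in training."""
    word_freq = Counter(word for word, _ in tagged_words)
    hapax_tags = Counter(tag for word, tag in tagged_words if word_freq[word] == 1)
    total = sum(hapax_tags.values())
    if total == 0:
        return {}  # no hapax words, so nothing to estimate from
    return {tag: n / total for tag, n in hapax_tags.items()}

# Toy training data (hypothetical taggings, for illustration only).
training = [
    ("जीवन", "NN"), ("का", "PREP"), ("रूप", "NN"),
    ("का", "PREP"), ("आया", "VFM"), ("अन्य", "JJ"),
]
unknown_dist = hapax_tag_distribution(training)
print(unknown_dist)
```

Closed-class tags such as PREP drop out because their words repeat, which matches the intuition that unknown words are mostly open-class items.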

20 Tag distribution analysis

21 3. Features: can we use other features, such as capitalization, word endings and hyphenation? Weischedel ’93 reports about 66% reduction in error rate with word endings and hyphenation. Capitalization, though useful for proper nouns, is not very effective.

22 Contd… Further candidate features: string length; prefix & suffix of fixed character width; character encoding range. Complete analysis remains to be done. Expected to be very effective for morphologically rich languages. To be experimented with Tamil.
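The candidate features listed on this slide could be extracted roughly as follows. This is a sketch under stated assumptions: the width of 3, the feature names, and the reading of "character encoding range" as a test for the Unicode Devanagari block (U+0900 to U+097F) are all illustrative choices, not details from the talk.

```python
def word_features(word, width=3):
    """Surface features of a word for unknown-word tagging (illustrative only)."""
    return {
        "length": len(word),                # string length in code points
        "prefix": word[:width],             # fixed-width prefix
        "suffix": word[-width:],            # fixed-width suffix
        # Assumption: encoding range means membership in the Devanagari block.
        "devanagari": all("\u0900" <= ch <= "\u097f" for ch in word),
    }

hindi = word_features("परीक्षा")
latin = word_features("tagging")
print(hindi)
print(latin)
```

Slicing by code points can split a Devanagari conjunct or separate a vowel sign from its consonant, so a fuller treatment might segment by grapheme cluster instead.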

23 4. Multi-part words. Examples: In/ terms/ of/; United/ States/ of/ America/. More problematic in Hindi: United/NNPC States/NNPC of/NNPC America/NNP; Central/NNC government/NN. Tags: NNPC = compound proper noun, NNP = proper noun, NNC = compound noun, NN = noun. How does the system identify the last word in a multi-part word? 10% of errors are due to this in Hindi (6 K words tested).

24 Results

25 Evaluation metrics. Tag precision. Unseen word accuracy: % of unseen words that are correctly tagged; estimates how well unseen words are handled. % reduction in error: reduction in error after the application of a particular feature.
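These metrics reduce to simple ratios, sketched below. The test-set counts (3961 of 5000 words, 421 of 1012 unseen words) are the ones reported on the tagger results slide; the 80% and 90% figures in the error-reduction example are hypothetical, and the function names are assumptions.

```python
def precision(correct, total):
    """Percentage of items tagged correctly."""
    return 100.0 * correct / total

def error_reduction(baseline_precision, new_precision):
    """Percentage of the baseline errors removed after adding a feature."""
    baseline_error = 100.0 - baseline_precision
    new_error = 100.0 - new_precision
    return 100.0 * (baseline_error - new_error) / baseline_error

# Test-set figures from the tagger results slide.
tag_precision = precision(3961, 5000)
unseen_precision = precision(421, 1012)
print(round(tag_precision, 2), round(unseen_precision, 2))

# Hypothetical example: a feature lifts precision from 80% to 90%.
print(error_reduction(80.0, 90.0))
```

Error reduction is relative to the remaining errors, so a move from 80% to 90% precision halves the error rate even though precision rises by only ten points.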

26 Results - Tagger. No structural tags → better smoothing. Unseen data: significantly more unknowns.

                    Dev     S-1     S-2     S-3     S-4     Test
# words             8511    6388    6397    6548    5847    5000
Correctly tagged    6749    5538    5504    5558    5060    3961
Precision           79.29   86.69   86.04   86.06   86.54   79.22
# Unseen            1543    660     648     589     603     1012
Correctly tagged    672     354     323     265     312     421
Unseen precision    43.55   53.63   49.84   44.99   51.74   41.6

27 Results – Chunk tagger. Training: 22 K; development data: 8 K; 4-fold cross-validation. Test data: 5 K.

             POS tagging   Chunk identification   Chunk labelling
             Precision     Pre       Rec          Pre       Rec
Dev data     76.16         69.54     69.05        66.73     66.27
Average      85.02         72.26     73.52        70.01     71.35
Test data    76.49         58.72     61.28        54.36     56.73

28 Results – Tagging error analysis. Significant issues with nouns/multi-part words: NNP → NN, NNC → NN. Also VAUX → VFM, VFM → VAUX and NVB → NN, NN → NVB.

29 HMM performance (English): > 96% reported accuracies; about 85% for unknown words. Advantage: simple, and most suitable when annotated data is available.

30 Conclusion

31 Future work: handling unseen words; smoothing; exploiting other features, especially morphological ones; multi-part words.

32 Summary: statistical approaches now include linguistic features for higher accuracies. Improvement required. Tagging: precision 79.22%, unknown words 41.6%. Chunking: precision 60%, recall 62%.

