Published by Steven Barber. Modified over 9 years ago.
1
Stochastic and Rule Based Tagger for Nepali Language
Krishna Sapkota, Shailesh Pandey, Prajol Shrestha
nec & MPP
2
POS Tagger for Nepali
What is a tagger?
– A POS tagger disambiguates the lexical category of words in a language.
Why do we need it?
– It is a basic necessity for NLP research.
– With the recent development of Nepalinux, Nepal has reached a position where it feels the need for one.
3
Our Approach
Build a rule based tagger.
Simultaneously build a statistical tagger.
Combine both for a flexible tagger with better overall accuracy.
4
Stochastic Tagger Prerequisites
A relatively large and diverse annotated corpus.
The larger and more diverse the corpus, the better the tagger.
5
Foundations of Stochastic Approach
Markov Assumption
Hidden Markov Model
Viterbi Search
6
System Diagram
7
HMM based Tagger
N-Gram Models
– Unigram
– Bigram
– Trigram
Consider
– तिमी /PMH एउटा /NCD गीत /NN लेख /VCN । /PUNE
8
TAGGING PROCESS
Find the probability of occurrence of each category in the corpus and store it.
– For example, the probability of a noun occurring.
– Number of probabilities = $N$, the number of tags in the tagset.
For bigrams, extract and store the bigram probabilities.
– For example, a noun followed by a determiner.
– Number of bigram probabilities = $N^2$.
Search the paths through the transition probabilities for the best sequence of tags.
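The counting step described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming relative-frequency estimates with no smoothing. The one-sentence corpus reuses the example from the HMM slide, and the function name `estimate_probabilities` is hypothetical.

```python
from collections import Counter

# Toy annotated corpus: a single sentence of (word, tag) pairs from the slide example.
corpus = [
    [("तिमी", "PMH"), ("एउटा", "NCD"), ("गीत", "NN"), ("लेख", "VCN"), ("।", "PUNE")],
]

def estimate_probabilities(sentences):
    """Return (unigram, bigram) relative-frequency estimates over tags (no smoothing)."""
    tag_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        tag_counts.update(tags)
        pair_counts.update(zip(tags, tags[1:]))

    total = sum(tag_counts.values())
    unigram = {tag: n / total for tag, n in tag_counts.items()}

    # P(next | prev) = count(prev, next) / count(prev followed by any tag)
    prev_counts = Counter()
    for (prev, _), n in pair_counts.items():
        prev_counts[prev] += n
    bigram = {(prev, nxt): n / prev_counts[prev]
              for (prev, nxt), n in pair_counts.items()}
    return unigram, bigram

unigram, bigram = estimate_probabilities(corpus)
print(unigram["NN"])          # relative frequency of the tag NN
print(bigram[("NCD", "NN")])  # estimate of P(NN | NCD)
```

With a realistic corpus the unigram table has at most one entry per tag and the bigram table at most one entry per ordered tag pair, which matches the counts given on the slide.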
9
Tagging Process: Transitional Probabilities
        NN     PMH    VCN    NCD    PUNE
PMH     0.82   0      0.7    0.89   0.1
NCD     0.91   0.76   0.6    0.2    0.01
NN      0.2    0.66   0.8    0.75   0.01
VCN     0.56   0.3    0.35   0.4    0.95
PUNE    0      0      0      0      0
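As a rough illustration of how such a table is used, the sketch below multiplies the transition values along a candidate tag sequence. Two things are assumptions here, not stated on the slide: that rows are the preceding tag and columns the following tag, and that the run-together digits split into five values per row as shown. The helper names `transition` and `path_score` are hypothetical.

```python
TAGS = ["NN", "PMH", "VCN", "NCD", "PUNE"]  # column order of the table

# Assumed orientation: row = preceding tag, column = following tag.
# Values are illustrative slide values; the rows do not sum to 1.
TRANSITION = {
    "PMH":  [0.82, 0.0,  0.7,  0.89, 0.1],
    "NCD":  [0.91, 0.76, 0.6,  0.2,  0.01],
    "NN":   [0.2,  0.66, 0.8,  0.75, 0.01],
    "VCN":  [0.56, 0.3,  0.35, 0.4,  0.95],
    "PUNE": [0.0,  0.0,  0.0,  0.0,  0.0],
}

def transition(prev_tag, next_tag):
    """Look up the table entry for prev_tag followed by next_tag."""
    return TRANSITION[prev_tag][TAGS.index(next_tag)]

def path_score(tags):
    """Product of the transition values along a tag sequence."""
    score = 1.0
    for prev_tag, next_tag in zip(tags, tags[1:]):
        score *= transition(prev_tag, next_tag)
    return score

slide_sequence = ["PMH", "NCD", "NN", "VCN", "PUNE"]  # tags of the example sentence
alternative = ["PMH", "NN", "NCD", "VCN", "PUNE"]     # a competing ordering
print(path_score(slide_sequence) > path_score(alternative))
```

Under this reading, the tag sequence of the example sentence scores higher than the competing ordering. A full tagger would also multiply in the word-given-tag probabilities.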
10
TAGGING PROCESS: AN EXAMPLE
The tags are hidden, but we see the words.
Is tag sequence X likely with these words?
Among the possible tag sequences, find the X that maximizes the product of probabilities.
11
Exploiting Markov Assumption
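The body of this slide did not survive extraction. For reference, the standard first-order (bigram) HMM formulation is sketched below. It is the textbook form, not necessarily the authors' notation. The symbols $w_i$ (the $i$-th word), $t_i$ (its tag) and $t_0$ (a sentence-start marker) are assumptions introduced here.

```latex
\begin{align*}
\hat{t}_{1:T}
  &= \arg\max_{t_{1:T}} P(t_{1:T} \mid w_{1:T}) \\
  &= \arg\max_{t_{1:T}} P(w_{1:T} \mid t_{1:T}) \, P(t_{1:T}) \\
  &\approx \arg\max_{t_{1:T}} \prod_{i=1}^{T} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})
\end{align*}
```

The second line applies Bayes' rule and drops the denominator $P(w_{1:T})$, which is the same for every candidate tag sequence. The last line uses the Markov assumption: each tag depends only on the previous tag, and each word depends only on its own tag.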
12
Viterbi Search
Find the best sequence in the minimal number of steps.
– For T words and N lexical categories, the brute-force method would require $N^T$ steps.
– The Viterbi algorithm reduces this to $k \cdot T \cdot N^2$ steps and is guaranteed to find the solution.
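A minimal sketch of Viterbi search for a bigram HMM follows. All probabilities in the toy model are made up for illustration, and the parameter names (`start_p`, `trans_p`, `emit_p`) are assumptions, not taken from the slides.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Best predecessor for tag t: N candidates for each of N tags.
            prob, prev = max(
                ((best[-1][p][0] * trans_p[p].get(t, 0.0), p) for p in tags),
                key=lambda pair: pair[0],
            )
            column[t] = (prob * emit_p[t].get(word, 0.0), prev)
        best.append(column)

    # Backtrack from the best final tag using the stored predecessors.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Toy model with invented numbers (hypothetical, for illustration only).
tags = ["NN", "VCN"]
start_p = {"NN": 0.8, "VCN": 0.2}
trans_p = {"NN": {"NN": 0.3, "VCN": 0.7}, "VCN": {"NN": 0.6, "VCN": 0.4}}
emit_p = {"NN": {"गीत": 0.7, "लेख": 0.3}, "VCN": {"गीत": 0.2, "लेख": 0.8}}

result = viterbi(["गीत", "लेख"], tags, start_p, trans_p, emit_p)
print(result)
```

The two nested loops over tags inside the loop over words give the $T \cdot N^2$ behaviour mentioned on the slide, compared with enumerating all $N^T$ sequences.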
13
Rule Based Tagger
Rule Based POS Tagging Methodology
Each word is given its corresponding POS tag. We have a POS tagset of 91 tags generated by MPP for general use in NLP.
Three parts of tagging
1st: tag root words by lexicon lookup.
2nd: tag words based on their morphemes.
3rd: tag ambiguous or untagged words based on context.
14
Root word tagging
A lexicon contains root words and their corresponding POS tags. Each word is compared against the words in the lexicon and tagged accordingly. A word may receive multiple POS tags.
मानिस /NN
घर /NN
अँध्यारो /NC_ADQ
अचम्म /NC_ADQ
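Lexicon lookup could be sketched as a plain dictionary lookup. The lexicon below holds only the four entries shown on the slide. Reading the underscore in `NC_ADQ` as a separator between two candidate tags is an assumption, based on the later slide that describes `VNE_ADR` as "2 POS tags". The function name `lookup_tags` is hypothetical.

```python
# Hypothetical lexicon built from the slide examples only.
LEXICON = {
    "मानिस": "NN",
    "घर": "NN",
    "अँध्यारो": "NC_ADQ",
    "अचम्म": "NC_ADQ",
}

def lookup_tags(word, lexicon=LEXICON):
    """Return the list of candidate tags for a root word, or [] if it is not listed."""
    entry = lexicon.get(word)
    # Assumption: "_" joins alternative tags for an ambiguous word.
    return entry.split("_") if entry else []

single = lookup_tags("घर")
ambiguous = lookup_tags("अचम्म")
missing = lookup_tags("not-in-lexicon")
print(single, ambiguous, missing)
```

Words that come back with more than one tag, or with none, are the ones left for the later morpheme and context steps.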
15
Morpheme based tagging
Nepali is a language very rich in morphemes, so words are tagged with the help of their morphemes.
गर् + दै /VDAI
गर् + आउ + छु /VCHU
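One possible reading of the two examples is that the final morpheme selects the tag. The sketch below implements only that reading. It assumes the word has already been segmented (for instance by the stemmer mentioned under Current Direction), and the table `SUFFIX_TAGS` and function `tag_by_morphemes` are hypothetical.

```python
# Hypothetical suffix table covering only the two slide examples.
SUFFIX_TAGS = {
    "दै": "VDAI",
    "छु": "VCHU",
}

def tag_by_morphemes(segmented):
    """Tag a word given as 'root + morpheme + ...' by its final morpheme.

    Returns None when the final morpheme is not in the table.
    Assumption: the last morpheme alone decides the tag.
    """
    morphemes = [m.strip() for m in segmented.split("+")]
    return SUFFIX_TAGS.get(morphemes[-1])

EXAMPLES = ["गर् + दै", "गर् + आउ + छु"]
tagged = [tag_by_morphemes(e) for e in EXAMPLES]
print(tagged)
```

A real rule set would likely need to inspect more than the last morpheme. This is only meant to show the shape of the lookup.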
16
Context based tagging
Context based rules are used when a word is ambiguous or has not been tagged. In this step we consider the context in which the word occurs and write rules based on that context. For example:
गर्ने /VNE_ADR (2 POS tags)
To disambiguate it we use rules such as:
If the word is followed by an NN it is ADR, and if it is followed by a verb it is VNE.
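The disambiguation rule above can be written as a small function over the candidate tags and the tag of the following word. Treating any tag that starts with `V` as a verb tag is an assumption made for this sketch, and the name `disambiguate` is hypothetical.

```python
def is_verb_tag(tag):
    # Assumption: verb tags in this tagset begin with "V" (e.g. VCN, VYO, VNE).
    return tag is not None and tag.startswith("V")

def disambiguate(candidates, next_tag):
    """Resolve a VNE/ADR ambiguity from the tag of the following word.

    Returns the chosen tag, or None when the rule does not apply.
    """
    if set(candidates) == {"VNE", "ADR"}:
        if next_tag == "NN":
            return "ADR"
        if is_verb_tag(next_tag):
            return "VNE"
    return None

before_noun = disambiguate(["VNE", "ADR"], "NN")
before_verb = disambiguate(["VNE", "ADR"], "VCN")
print(before_noun, before_verb)
```

Returning `None` when no rule fires keeps the word marked as ambiguous, so that another rule or the statistical tagger can decide.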
17
Context based tagging
Similarly, if the word is untagged:
झर्झरी पानी पर्यो ।
If झर्झरी is not tagged, the words after it are an NN (common noun) and a VYO, so we could give it ADQL (qualitative adverb).
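The rule for untagged words can be sketched in the same style. The function below encodes just the single pattern from the slide (an untagged word followed by NN and then VYO becomes ADQL). The name `tag_unknown` and the use of `None` for "untagged" are assumptions.

```python
def tag_unknown(tags):
    """Fill untagged positions (None) using the rule: None, NN, VYO -> ADQL."""
    result = list(tags)
    for i, tag in enumerate(result):
        # Slicing past the end is safe; it simply yields a shorter list.
        if tag is None and result[i + 1:i + 3] == ["NN", "VYO"]:
            result[i] = "ADQL"
    return result

# Tags for: झर्झरी पानी पर्यो ।  (first word untagged)
sentence_tags = [None, "NN", "VYO", "PUNE"]
filled = tag_unknown(sentence_tags)
print(filled)
```

Positions that match no rule stay `None`, which makes it easy to see which words the rule set still cannot handle.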
18
Current Direction
Common Stemmer Design
Corpus Study
Format of Input/Output/Storage
19
References
J. Allen, “Natural Language Understanding”, Pearson Edition.
Scott M. Thede and Mary P. Harper. A second-order Hidden Markov Model for part-of-speech tagging. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 175--182. http://citeseer.ist.psu.edu/thede99secondorder.html
Automated Part of Speech Tagging, Handout for LING361, Fall 1995. Georgetown University. http://www.georgetown.edu/faculty/ballc/ling361/tagging_overview.html
Hardie et al. Nelralec/Bhasha Sanchar Working Paper 2. Categorisation for automated morphosyntactic analysis of Nepali: introducing the Nelralec Tagset (NT-01). http://www.bhashasanchar.org./dfs/nelralec-wp-tagset.pdf
D. Jurafsky and J. H. Martin, “Speech and Language Processing”, Pearson Edition.
IITB India, Seminar report.
20
Questions
Thank You!