Authors: N.A.K.B.D. Gunasekara, Mr. W.V. Welgama, Dr. A.R. Weerasinghe

Overview: Introduction, Literature Review, Aims & Objectives, Methodology, Design & Implementation, Results & Evaluation, Conclusion, Future Work

Introduction
What is Part of Speech Tagging?
  The process of assigning a corresponding POS tag, such as noun, verb or preposition, to every token in the text.
  [Example sentence with its words labelled Pronoun, Noun, Adverb, Verb]

Motivation
  POS tagging is an important preprocessing task in many NLP areas, such as:
    Information retrieval: stemming, selection of high-content words
    Word sense disambiguation
    Speech synthesis (e.g. Text-to-Speech)
    Speech recognition
    Machine translation

Literature Review
Different POS tagging approaches
  POS Tagging
    Supervised: Neural, Stochastic, Rule-based
    Unsupervised: Neural, Stochastic, Rule-based
  Stochastic methods named in the diagram: Baum-Welch, HMM, CRF, MEMM

Literature Review
Related Work
  Hidden Markov Model Based Part of Speech Tagger for Sinhala Language (2014)
    HMM with N-gram probabilities
    90% accuracy using a test set of 1024 words
  Learning a Stochastic Part of Speech Tagger for Sinhala (2013)
    HMM with tri-gram probabilities
    62% accuracy for both known and unknown words
  A Stochastic Part of Speech Tagger for Sinhala (2004)
    HMM with bi-gram probabilities
    Tagging error below 60% when the unknown word percentage is 100%

Aims & Objectives
Aims of the Research
  To find out whether a hybrid approach that incorporates both stochastic and rule-based approaches can give better POS tagging accuracy than a solely stochastic approach for the Sinhala language.
  To analyze how Sinhala POS tagging can be improved by improving the tag set:
    “UCSC TagSet Version 1” - “LTRL/UCSC POS TAG SET FOR SINHALA”, developed in 2007
    “UCSC TagSet Version 2” - an improved version of “UCSC TagSet Version 1”
    “UCSC TagSet Version 3” - “UCSC NEW SINHALA TAGSET”, developed by LTRL of UCSC in 2015

Aims & Objectives
Objectives
  To carry out a comparative analysis between the HMM-based approach and the hybrid approach, which incorporates both stochastic and rule-based approaches.
  To re-annotate the corpus with improved versions of the tag set:
    “UCSC Annotated Corpus Version 1”, annotated with “UCSC TagSet Version 1” - collected from LTRL of UCSC
    “UCSC Annotated Corpus Version 2”, annotated with “UCSC TagSet Version 2” - contribution of this research
    “UCSC Annotated Corpus Version 3”, annotated with “UCSC TagSet Version 3” - contribution of this research

Methodology
  1. Implementation of the HMM tagger
  2. Integration of the stemmer
  3. Extension of the HMM tagger into a hybrid tagger
  4. Evaluation of the taggers using different tag set versions
  5. Comparative analysis of the two POS tagging approaches

Design & Implementation: Architecture of the HMM Tagger, Incorporation of the Stemmer, Architecture of the Hybrid Tagger, Tag guessing using Suffix Rules

Architecture of the HMM Tagger
  Trainer: POS tag annotated corpus -> Tokenization -> Probability calculator -> Transition & emission probabilities
  Tagger: Testing input -> Preprocessor -> Viterbi path finder -> Back tracer -> Tagger output
  Stemmer: Unknown word -> Stem of unknown word

Incorporation of the Stemmer
  First approach - in both the training phase and the tagging phase
    Training phase - before calculating the emission probabilities
    Tagging phase - to stem the input given to the tagger
  [Diagram labels: Training set, P_E(word | NNN), Unseen word, Stemmer]

Incorporation of the Stemmer
  Issues in integrating the stemmer in the training phase:
    Changes in part of speech due to stemming
    Changes in meaning due to stemming
    e.g. [Sinhala word lost in extraction] - appropriate tag is NNM; [Sinhala word lost in extraction] - appropriate tag is NNN
  Second approach - only in the tagging phase
    Increases the tagger accuracy
  [Diagram labels: Unseen word, Stemmer]

Architecture of the Hybrid Tagger
  Trainer: POS tag annotated corpus -> Tokenization -> Probability calculator -> Transition & emission probabilities
  Tagger: Testing input -> Preprocessor -> Viterbi path finder -> Back tracer -> Tagger output
  Unknown word -> Stemmer -> Is the stem known?
    YES: the known stem is used by the tagger
    NO: tag guessing using morphological rules
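The unknown-word branch of the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `candidate_tags`, the toy stemmer, the toy suffix rule and the romanised strings are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Optional

TagWeights = Dict[str, float]


def candidate_tags(
    word: str,
    emission: Dict[str, TagWeights],
    stem: Callable[[str], str],
    guess_by_suffix: Callable[[str], Optional[str]],
) -> TagWeights:
    """Return tag -> weight for one token: try the word, then its stem, then suffix rules."""
    if word in emission:              # known word: use trained emissions
        return emission[word]
    root = stem(word)
    if root in emission:              # unknown word whose stem is known
        return emission[root]
    guessed = guess_by_suffix(word)   # stem unknown: fall back to suffix rules
    return {guessed: 1.0} if guessed is not None else {}


# Hypothetical stand-ins for the trained model, the stemmer and the rule table.
toy_emission = {"pota": {"NNN": 1.0}}


def toy_stem(word: str) -> str:
    return word[:-1] if word.endswith("k") else word


def toy_guess(word: str) -> Optional[str]:
    return "VFM" if word.endswith("nawa") else None


print(candidate_tags("potak", toy_emission, toy_stem, toy_guess))
print(candidate_tags("liyanawa", toy_emission, toy_stem, toy_guess))
```

The returned weights would feed the Viterbi path finder in place of the missing emission probabilities; how the original system weights a guessed tag is not stated on the slides.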

Tag guessing using Suffix Rules
  Affix - a morpheme that is combined with a stem/root of a word to form a new word
    Prefix - an element that is placed at the beginning of a root word
    Suffix - an element that is placed at the end of a root word
  e.g. suffixes in the Sinhala language, in nouns, adjectives and verbs [Sinhala suffix examples lost in extraction]

Tag guessing using Suffix Rules
  All open class words are categorized by their POS tags into separate files
  e.g. all NNN words in the annotated corpus are written into a file called NNN.txt

Tag guessing using Suffix Rules
  Extract the common suffixes in each category
  A suffix is treated as common only if it occurs in more than five distinct words
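A rough sketch of the "more than five distinct words" filter is shown below. The slides do not say how candidate suffixes are enumerated, so taking every word ending of up to `max_len` characters is an assumption, and the word list is invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def common_suffixes(words: Iterable[str], max_len: int = 3, threshold: int = 5) -> Dict[str, int]:
    """Map each suffix to its distinct-word count, keeping suffixes seen in more than `threshold` words."""
    support: Dict[str, Set[str]] = defaultdict(set)
    for word in set(words):                       # count distinct words only
        # Assumption: candidate suffixes are endings of 1..max_len characters,
        # always leaving at least one character for the stem.
        for n in range(1, min(max_len, len(word) - 1) + 1):
            support[word[-n:]].add(word)
    return {s: len(ws) for s, ws in support.items() if len(ws) > threshold}


# Invented word list standing in for one tag category file (e.g. NNN.txt).
toy_words = ["aya", "bya", "cya", "dya", "eya", "fya",   # six words ending in "ya"
             "axo", "bxo", "cxo", "dxo", "exo"]          # only five ending in "xo"

print(common_suffixes(toy_words, max_len=2))
```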

Tag guessing using Suffix Rules
  Calculate the probability of each listed suffix within its tag category
  Example (the Sinhala suffix itself was lost in extraction):
    No. of distinct NNM words in which the suffix appears = 180
    Total no. of distinct words tagged as NNM in the training set = 1,457
    Probability that the suffix appears in a word tagged as NNM = 180/1457 = 0.12
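The worked example above is a simple ratio, reproduced here as a sanity check. The function name is hypothetical; only the two counts come from the slide.

```python
def suffix_probability(words_with_suffix: int, distinct_words_in_tag: int) -> float:
    """Share of a tag's distinct words that end with a given suffix."""
    return words_with_suffix / distinct_words_in_tag


# Counts from the slide: 180 of the 1,457 distinct NNM words carry the suffix.
p_nnm = suffix_probability(180, 1457)
print(f"{p_nnm:.2f}")
```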

Tag guessing using Suffix Rules
  Create one file that lists all the common suffixes, each labelled with the tag that has the highest probability for it
  When the hybrid tagger comes to a previously unseen word, it analyses the word’s suffix and predicts a tag
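One way to picture the suffix file and the lookup is sketched below. The suffixes, probabilities and words are made up, and trying the longest matching suffix first is an assumption; the slides do not state how competing suffix matches are resolved.

```python
from typing import Dict, Optional, Tuple


def build_suffix_table(suffix_probs: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """From {tag: {suffix: probability}}, keep for each suffix the tag with the highest probability."""
    best: Dict[str, Tuple[str, float]] = {}
    for tag, probs in suffix_probs.items():
        for suffix, p in probs.items():
            if suffix not in best or p > best[suffix][1]:
                best[suffix] = (tag, p)
    return {suffix: tag for suffix, (tag, _) in best.items()}


def guess_tag(word: str, table: Dict[str, str]) -> Optional[str]:
    """Guess a tag for an unseen word, preferring the longest matching suffix (assumption)."""
    for n in range(len(word) - 1, 0, -1):
        tag = table.get(word[-n:])
        if tag is not None:
            return tag
    return None


# Hypothetical per-tag suffix probabilities, not values from the corpus.
toy_probs = {
    "NNM": {"ek": 0.12, "a": 0.05},
    "VFM": {"nawa": 0.30, "a": 0.20},
}
toy_table = build_suffix_table(toy_probs)

print(toy_table)
print(guess_tag("minisek", toy_table), guess_tag("gasa", toy_table), guess_tag("xyz", toy_table))
```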

Results & Evaluation: HMM Tagger based on UCSC Tagset Version 1, HMM Tagger based on UCSC Tagset Version 2, HMM Tagger based on UCSC Tagset Version 3, Hybrid Tagger based on UCSC Tagset Version 2, Hybrid Tagger based on UCSC Tagset Version 3, Summary of Results

Results & Evaluation
Evaluation of taggers
                 Total words   Distinct words
  Training set   75,830        14,027
  Test set       25,087        6,954

HMM Tagger based on UCSC Tagset Version 1
  “LTRL/UCSC POS TAG SET FOR SINHALA”, developed in 2007
  Contains 22 tags
  Total no. of words in the input = 25,087
  No. of correctly tagged words = 17,689
  Accuracy of the tagger = 70.51%

HMM Tagger based on UCSC Tagset Version 2
  Adds a new tag, “UNK”, for all the words which do not fall into any tag category
  Total no. of words in the input = 25,087
  No. of correctly tagged words = 17,688
  Accuracy of the tagger = 70.51%

Difference in accuracy rates between Tagset Version 1 and Tagset Version 2
  With the addition of the UNK tag:
    Accuracy of the question mark tag increased to 100%
    Accuracies of 10 tag categories increased

HMM Tagger based on UCSC Tagset Version 3
  “UCSC NEW SINHALA TAGSET”, developed by LTRL of UCSC in 2015
  Contains 29 tags, including the UNK tag
  Total no. of words in the input = 25,087
  No. of correctly tagged words = 17,548
  Accuracy of the tagger = 69.95%

Hybrid Tagger based on UCSC Tagset Version 2
  Increased overall accuracy
  Increased accuracy rates of the open class tag categories
    New words are most often added to open class categories
    The hybrid approach is a good solution for the unknown word problem
  Total no. of words in the input = 25,087
  No. of correctly tagged words = 18,098
  Accuracy of the tagger = 72.14%

Hybrid Tagger based on UCSC Tagset Version 3
  Increased overall accuracy
  Total no. of words in the input = 25,087
  No. of correctly tagged words = 17,657
  Accuracy of the tagger = 70.38%

Summary of Results
  The overall accuracy of the hybrid tagger is higher than that of the HMM tagger.
  The gain from the hybrid tagger is larger with “UCSC Tagset Version 2” (1.63 percentage points) than with “UCSC Tagset Version 3” (0.43 percentage points).
  “UCSC Tagset Version 3” is more descriptive, so the number of collisions among tag categories is high.
                      HMM Tagger   Hybrid Tagger
  TagSet Version 2    70.51%       72.14%
  TagSet Version 3    69.95%       70.38%
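The reported accuracies can be recomputed from the counts on the result slides. The snippet below is only an arithmetic check; the dictionary layout and names are illustrative, while the counts are the ones reported for the 25,087-word test set.

```python
TEST_WORDS = 25_087

# Correctly tagged words as reported on the result slides.
CORRECT = {
    ("HMM", "V1"): 17_689,
    ("HMM", "V2"): 17_688,
    ("HMM", "V3"): 17_548,
    ("Hybrid", "V2"): 18_098,
    ("Hybrid", "V3"): 17_657,
}


def accuracy(correct: int, total: int = TEST_WORDS) -> float:
    """Tagging accuracy as a percentage of test-set tokens."""
    return 100.0 * correct / total


for (tagger, tagset), correct in CORRECT.items():
    print(f"{tagger:6s} {tagset}: {accuracy(correct):.2f}%")

# Extra correctly tagged words gained by the hybrid tagger per tag set.
gain_v2 = CORRECT[("Hybrid", "V2")] - CORRECT[("HMM", "V2")]
gain_v3 = CORRECT[("Hybrid", "V3")] - CORRECT[("HMM", "V3")]
print(gain_v2, gain_v3)
```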

Conclusion
  1. The addition of the ‘UNK’ tag leads towards a more meaningful tagging process
  2. ‘UCSC TagSet Version 3’ results in decreased tagger accuracy due to its high level of descriptiveness
  3. The hybrid approach gives a higher POS tagging accuracy than the solely HMM-based approach for the Sinhala language

Future Work
  The hybrid POS tagging approach proposed in this research is based on bi-gram transition probabilities. To further improve the tagging results, it can be extended to use tri-gram transition probabilities.
  Integrating a named entity recognizer and a morphological analyzer with the hybrid tagger may help to increase the tagger accuracy.

Example of Tagging a Sentence
  [Sinhala words lost in extraction. Tag sequence of the tagged sentence: PRP NNN , NNN CC NNN VP NNN RP .]

Enhancing the size of the annotated corpus
  New input data taken from different newspapers are used
  The hybrid tagger based on “UCSC TagSet Version 2” is selected
  The new input data are tagged using the selected hybrid tagger
  The tagged output is added to the training data set
  The unchanged test data set is re-tagged with the enhanced training set

Change in Hybrid Tagger accuracy with the addition of newly tagged data
  Tagger accuracy decreases when tagger output is added to the training set
  Reason: the hybrid tagger used to enhance the training set has a tagging error of 27.86%
  Solution: manual inspection of the tagger output by a group of linguistic experts before it is added to the annotated corpus

Content: Introduction, Literature Review, Aims & Objectives, Data, Methodology, Design & Implementation, Results & Evaluation, Conclusion, Future Work

Data
Data collection phase
  Implementation of the HMM & hybrid taggers
    “LTRL/UCSC POS TAG SET FOR SINHALA”, developed in 2007
    UCSC Sinhala Tagged Corpus V1
  Improving the size of the annotated corpus
    120 new articles from the “UCSC 10M Words Sinhala Corpus”
  Re-annotation of the corpus with the improved tag set
    “UCSC NEW SINHALA TAGSET”, developed by LTRL of UCSC in 2015

Data
Corpus analysis
  UCSC Annotated Corpus Version 1
    Total no. of words which do not fall into any tag category: 3,989
    No. of distinct words which do not fall into any tag category: 759

Data
Solution from “UCSC TagSet Version 3”
  sidu - QVB
  lak - QVB
  path - PAVB
  QVB - Question Word in Kriya Mula
  PAVB - Adjective in Kriya Mula

Aims & Objectives
Research Question
  How does the accuracy of hybrid POS tagging differ from that of a solely HMM-based stochastic tagging approach when unknown words are present?

Sentence Boundary Disambiguation (SBD)
  A common problem in many languages
  Initials in person names, acronyms and abbreviations make sentence boundary identification more challenging
  Solution: a separate list of person name initials, commonly used acronyms and abbreviations is maintained in our preprocessing step
  This resulted in increased tagger accuracy
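The list-based solution can be illustrated with a small whitespace-token splitter. This is a sketch under assumptions: the function name, the terminator set and the English example stand in for the real Sinhala preprocessing step, whose details are not given on the slide.

```python
from typing import Iterable, List


def split_sentences(text: str, non_terminal: Iterable[str]) -> List[str]:
    """Split on sentence-final punctuation unless the token is a listed initial or abbreviation."""
    protected = set(non_terminal)
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in protected:
            sentences.append(" ".join(current))
            current = []
    if current:                      # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences


# Hypothetical exception list: initials and abbreviations that end in a full stop.
EXCEPTIONS = {"Dr.", "Mr.", "A.", "R."}

toy_text = "Dr. A. R. Perera arrived. He spoke."
print(split_sentences(toy_text, EXCEPTIONS))
```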

Challenges involved in POS Tagging
  Some words can act as more than one part of speech, e.g. as a verb and as an adjective [Sinhala examples lost in extraction]
  Handling unknown words, which are not in the training set

Upper Bound for our POS Tagging Accuracy
  Words which do not fall into any tag category = 3,989
  Total no. of words in the corpus = 100,917
  Total no. of words that can be tagged precisely in manual tagging = 96,928
  Maximum accuracy in the manual approach = (96,928/100,917)*100% = 96.05%
  In other words, the tagging error present in the manual approach = 3.95%
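The upper bound follows directly from the corpus counts, as the short check below shows. Variable names are illustrative; the figures are the ones on the slides, and the corpus total matches the training and test set sizes reported in the evaluation.

```python
TRAIN_WORDS = 75_830
TEST_WORDS = 25_087
UNCATEGORISED = 3_989            # words that fit no tag category

corpus_words = TRAIN_WORDS + TEST_WORDS
taggable = corpus_words - UNCATEGORISED

upper_bound = 100.0 * taggable / corpus_words
manual_error = 100.0 - upper_bound

print(corpus_words, taggable)
print(f"{upper_bound:.2f}% {manual_error:.2f}%")
```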

Emission probability P(w_i | t_i)
  The emission probability is the likelihood that a given word is tagged with a particular tag (assuming that the word depends only on its tag)
  Calculated by dividing the number of times a particular tag appears in the corpus with the given word by the total number of times that tag appears in the corpus
  [Example sentence; Sinhala words lost in extraction. Tag sequence: JJ NNN NNN VNF VFM .]

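Emission probability P(w_i | t_i)
  P(w_i | t_i) = C(t_i, w_i) / C(t_i), where C(t_i, w_i) is the number of times tag t_i appears with word w_i in the corpus and C(t_i) is the total number of times t_i appears in the corpus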
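A count-based estimate of the emission probability might look like the sketch below. The corpus is a toy one: the placeholder tokens `w1` ... `w5` stand in for Sinhala words, and no smoothing is applied, which may differ from the original tagger.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]   # (word, tag) pairs


def emission_probabilities(corpus: List[TaggedSentence]) -> Dict[Tuple[str, str], float]:
    """Estimate P(word | tag) as count(tag, word) / count(tag)."""
    pair_counts: Counter = Counter()
    tag_counts: Counter = Counter()
    for sentence in corpus:
        for word, tag in sentence:
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1
    return {(word, tag): n / tag_counts[tag] for (word, tag), n in pair_counts.items()}


# Toy corpus with placeholder tokens; the tags follow the JJ NNN NNN VNF VFM . pattern.
toy_corpus = [
    [("w1", "JJ"), ("w2", "NNN"), ("w3", "NNN"), ("w4", "VNF"), ("w5", "VFM"), (".", ".")],
    [("w2", "NNN"), ("w5", "VFM"), (".", ".")],
]
emit = emission_probabilities(toy_corpus)
print(emit[("w2", "NNN")], emit[("w3", "NNN")], emit[("w5", "VFM")])
```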

Transition probability P(t_i | t_{i-1})
  The bi-gram transition probability is the probability of a tag given the previous tag
  Calculated by dividing the number of times the tag sequence t_{i-1}, t_i appears by the total number of times the tag t_{i-1} appears in the corpus
  [Example sentence; Sinhala words lost in extraction. Tag sequence: JJ NNN NNN VNF VFM .]

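Transition probability P(t_i | t_{i-1})
  P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}), where C(t_{i-1}, t_i) is the number of times the tag sequence t_{i-1}, t_i appears in the corpus and C(t_{i-1}) is the total number of times t_{i-1} appears in the corpus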
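The bi-gram transition estimate can be sketched in the same count-based way. The start and end markers `<s>` and `</s>` are assumptions added so that every tag occurrence is counted exactly once as a predecessor; the slides do not say how sentence boundaries are handled.

```python
from collections import Counter
from typing import Dict, List, Tuple


def transition_probabilities(
    tag_sequences: List[List[str]], start: str = "<s>", end: str = "</s>"
) -> Dict[Tuple[str, str], float]:
    """Estimate P(tag | previous tag) as count(prev, tag) / count(prev)."""
    bigram_counts: Counter = Counter()
    prev_counts: Counter = Counter()
    for tags in tag_sequences:
        padded = [start] + list(tags) + [end]   # assumed boundary markers
        for prev, cur in zip(padded, padded[1:]):
            bigram_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {(prev, cur): n / prev_counts[prev] for (prev, cur), n in bigram_counts.items()}


# Toy tag sequences; the first one is the JJ NNN NNN VNF VFM . pattern.
toy_tags = [
    ["JJ", "NNN", "NNN", "VNF", "VFM", "."],
    ["NNN", "VFM", "."],
]
trans = transition_probabilities(toy_tags)
print(trans[("JJ", "NNN")], trans[("NNN", "VFM")], trans[("<s>", "JJ")])
```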

Hidden Markov Model
  The main goal in an HMM is to find the most probable tag sequence t_1 ... t_n given the word sequence w_1 ... w_n, such that P(t_1 ... t_n | w_1 ... w_n) is maximal.
  Applying Bayes' rule: P(X|Y) = [ P(Y|X) * P(X) ] / P(Y)

Hidden Markov Model
  Remove the denominator P(W), as it is the same for all tag sequences
  Apply the likelihood and transition assumptions
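The equations on these slides did not survive extraction. The steps described correspond to the standard bi-gram HMM derivation, written out below as a reconstruction rather than a copy of the original slide; treating t_0 as a sentence-start symbol is an assumption.

```latex
\begin{align*}
\hat{t}_{1:n}
  &= \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) \\
  &= \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
     && \text{(Bayes' rule)} \\
  &= \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
     && \text{(denominator is the same for all tag sequences)} \\
  &\approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(likelihood and transition assumptions)}
\end{align*}
```

The two factors in the final product are the emission probability P(w_i | t_i) and the bi-gram transition probability P(t_i | t_{i-1}).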

Tagging With Hidden Markov Model

Viterbi Algorithm
  Finds the best possible POS tag path, given a sequence of words
  Used for the task of decoding
  A dynamic programming algorithm
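A compact bi-gram Viterbi decoder with back-tracing is sketched below. It is a generic textbook version, not the authors' code: the probability tables are toy values, and the `floor` used for unseen events and the `<s>` start symbol are assumptions.

```python
import math
from typing import Dict, List, Tuple

Prob = Dict[Tuple[str, str], float]


def viterbi(words: List[str], tags: List[str], trans: Prob, emit: Prob,
            start: str = "<s>", floor: float = 1e-12) -> List[str]:
    """Return the most probable tag path; trans is keyed (prev, tag), emit is keyed (word, tag)."""
    if not words:
        return []

    def lp(table: Prob, key: Tuple[str, str]) -> float:
        return math.log(table.get(key, floor))   # floor stands in for unseen events

    # best[i][t]: best log score of any path ending in tag t at position i
    best = [{t: lp(trans, (start, t)) + lp(emit, (words[0], t)) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans, (p, t)))
            best[i][t] = best[i - 1][prev] + lp(trans, (prev, t)) + lp(emit, (words[i], t))
            back[i][t] = prev
    # Back-trace from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with placeholder tokens.
toy_tagset = ["JJ", "NNN", "VFM"]
toy_trans = {("<s>", "JJ"): 0.6, ("<s>", "NNN"): 0.4, ("JJ", "NNN"): 0.9,
             ("JJ", "JJ"): 0.1, ("NNN", "VFM"): 0.7, ("NNN", "NNN"): 0.3}
toy_emit = {("w1", "JJ"): 0.5, ("w1", "NNN"): 0.1, ("w2", "NNN"): 0.6,
            ("w3", "VFM"): 0.8, ("w3", "NNN"): 0.05}

print(viterbi(["w1", "w2", "w3"], toy_tagset, toy_trans, toy_emit))
```

Working in log space avoids numeric underflow on long sentences; the original tagger may handle this differently.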

Accuracy Rate per Tag
  [Figure: per-tag accuracy rates for UCSC TagSet Version 1 and UCSC TagSet Version 2]

UCSC TagSet Version 1
  TAG     Description
  CC      Conjunction
  DET     Determiner
  FRW     Foreign Word
  JJ      Adjective
  JVB     Adjective in Kriya Müla
  NNF     Common Noun Feminine
  NNM     Common Noun Masculine
  NNN     Common Noun Neuter
  NNPA    Proper Noun Animate
  NNPI    Proper Noun Inanimate
  NVB     Noun in Kriya Müla
  POST    Postposition
  PRP     Pronoun
  QFNUM   Number Quantifier
  RB      Adverb
  RP      Particle
  SYM     Not Classified
  UH      Interjection
  VFM     Verb Finite Main
  VNF     Verb Non Finite
  VNN     Verbal Non Finite Noun
  VP      Verb Participle

UCSC TagSet Version 2
  TAG     Description
  CC      Conjunction
  DET     Determiner
  FRW     Foreign Word
  JJ      Adjective
  JVB     Adjective in Kriya Müla
  NNF     Common Noun Feminine
  NNM     Common Noun Masculine
  NNN     Common Noun Neuter
  NNPA    Proper Noun Animate
  NNPI    Proper Noun Inanimate
  NVB     Noun in Kriya Müla
  POST    Postposition
  PRP     Pronoun
  QFNUM   Number Quantifier
  RB      Adverb
  RP      Particle
  SYM     Not Classified
  UH      Interjection
  VFM     Verb Finite Main
  VNF     Verb Non Finite
  VNN     Verbal Non Finite Noun
  VP      Verb Participle
  UNK     Unknown (Tag is unknown)

UCSC TagSet Version 3
  TAG     Description
  CC      Conjunction
  CMVNF   Present Participle Verb Non Finite
  DET     Determiner
  FRW     Foreign Word
  JJ      Adjective
  JVB     Adjective in Kriya Müla
  NNF     Common Noun Feminine
  NNM     Common Noun Masculine
  NNN     Common Noun Neuter
  NPF     Proper Noun Feminine
  NPM     Proper Noun Masculine
  NPN     Proper Noun Neuter
  NVB     Noun in Kriya Müla
  PAVB    Participle Adjective in Kriya Mula
  PAVNF   Past Participle Verb Non Finite
  POST    Postposition
  PRP     Pronoun
  PPVB    Past Participle in Kriya Mula
  PRVNF   Present Participle Verb Non Finite
  QFNUM   Number Quantifier
  QVB     Question Word in Kriya Mula
  RB      Adverb
  RP      Particle
  SYM     Not Classified
  UH      Interjection
  VFM     Verb Finite Main
  VNN     Verbal Non Finite Noun
  VP      Verb Participle
  UNK     Unknown (Tag is unknown)
