
1 Comparative Study of Various Machine Learning Methods for Telugu Part-of-Speech Tagging. Avinesh.PVS, Sudheer, Karthik (IIIT Hyderabad)

2 Introduction POS tagging is the process of marking up the words in a text with their corresponding parts of speech. It is not as simple as looking words up in a list of word-tag pairs, because many words can take more than one tag. The problem is widespread, since a large number of word forms are ambiguous.

3 Introduction (cont.) Building a POS tagger for Telugu is complicated by the fact that Telugu words can be freely formed by agglutinating morphemes, which increases the number of distinct word forms. For example:
tinu (eat)
tinAli (have to eat)
tintunnADu ({he} is eating)
tinAlianukuntunnADu ({he} wants to eat)
...
The stem tinu is common to all of these words; new words are formed by combining it with different morphemes.

4 Introduction (cont.) Such combination is especially common for verbs (multiple words joining to form a single verb). Because of this, the number of distinct word forms increases, which in turn decreases tagging accuracy.

5 Types of Taggers Taggers can be characterized as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity, while stochastic taggers use the probabilities with which words occur with particular tags. Since Indian languages are morphologically rich, developing rule-based taggers is a cumbersome process; stochastic taggers, however, require a large amount of annotated data to train on.

6 Models of Statistical Taggers We have tried out four different models of statistical taggers:
1. Hidden Markov Model (HMM)
2. Conditional Random Fields (CRF)
3. Maximum Entropy Model (MEMM)
4. Memory Based Learning (MBL)

7 Hidden Markov Model A Hidden Markov Model (HMM) is a finite set of states, each associated with a (generally multidimensional) probability distribution. To implement an HMM tagger we need three kinds of probabilities:
Initial state probabilities
State transition probabilities P(t2 | t1)
Emission probabilities P(w | t)
Statistical methods based on Markov sources, or hidden Markov modeling, have become increasingly popular over the last several years.
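The sketch below (my own illustration, not taken from the paper) shows how these three probability tables can be estimated from a tagged corpus by simple relative-frequency counts; the Telugu words and tags used here are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical toy tagged corpus standing in for the Telugu training data.
tagged_sents = [
    [("vAdu", "PRP"), ("annam", "NN"), ("tinnADu", "VFM")],
    [("nenu", "PRP"), ("pustakam", "NN"), ("caduvutAnu", "VFM")],
]

init, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
for sent in tagged_sents:
    init[sent[0][1]] += 1                      # tag of the first word
    for w, t in sent:
        emit[(t, w)] += 1                      # word emitted by tag
        tag_count[t] += 1
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        trans[(t1, t2)] += 1                   # tag-to-tag transition

# Relative-frequency estimates (a rough MLE; no smoothing for unseen events).
p_init = {t: c / len(tagged_sents) for t, c in init.items()}
p_trans = {pair: c / tag_count[pair[0]] for pair, c in trans.items()}
p_emit = {pair: c / tag_count[pair[0]] for pair, c in emit.items()}

print(p_init["PRP"], p_trans[("PRP", "NN")], p_emit[("NN", "annam")])  # 1.0 1.0 0.5
```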

8 Experiments using HMMs The HMM is implemented using the Trigram Tagger (TnT). Experiment 1 (trigram): this experiment uses the previous two tags, i.e. P(t2 | t1, t0) is used to calculate the best tag sequence, where
t2: current tag
t1: previous tag
t0: tag before the previous one
Correct Tags: 81.59% Incorrect Tags: 18.41%
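Purely as an illustration, a trigram setup of this kind could be approximated with NLTK's TnT implementation (the slides refer to the original TnT tool, and the data below is a hypothetical placeholder for the Telugu corpus):

```python
from nltk.tag import tnt

# Hypothetical toy (word, tag) data in place of the annotated Telugu corpus.
train_sents = [
    [("vAdu", "PRP"), ("annam", "NN"), ("tinnADu", "VFM")],
    [("nenu", "PRP"), ("pustakam", "NN"), ("caduvutAnu", "VFM")],
]

tagger = tnt.TnT()          # second-order (trigram) tagger
tagger.train(train_sents)
print(tagger.tag(["nenu", "annam", "tinnADu"]))
```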

9 Experiments using HMMs (cont.) Experiment 2 (bigram): this experiment uses only the tag of the previous word, i.e. P(t2 | t1) is used to calculate the best tag sequence, where
t2: current tag
t1: previous tag
Correct Tags: 82.32% Incorrect Tags: 17.68%

10 Experiments using HMMs (cont.) Experiment 3 (bigram with 1/100th approximation): with the 1/100th approximation, the tagger also outputs an alternative tag whenever that tag's probability is at least 1/100th of the best tag's probability. Correct Tags: 82.47% Incorrect Tags: 17.53%
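A toy sketch of that thresholding idea (my reading of the slide, not TnT's actual code): besides the best tag, keep any alternative tag whose probability is within a factor of 100 of the best one.

```python
def alternate_tags(tag_probs, ratio=0.01):
    """tag_probs: mapping from tag to probability for a single token."""
    best = max(tag_probs.values())
    return [t for t, p in tag_probs.items() if p >= best * ratio]

print(alternate_tags({"NN": 0.60, "JJ": 0.35, "VFM": 0.002}))  # ['NN', 'JJ']
```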

11 Conditional Random Fields A conditional random field (CRF) is a probabilistic framework for segmenting and labeling sequence data. A conditional model specifies the probabilities of possible label sequences given an observation sequence. The need to segment and label sequences arises in many problems across several scientific fields.

12 Experiments using CRFs The conditional random field is implemented using a CRF toolkit. Experiment 1 features:
Unigram: current word, previous 2nd word, previous word, next word, next 2nd word.
Bigram: previous 2nd word + previous word, previous word + current word, current word + next word.
Correct Tags: 75.11% Incorrect Tags: 24.89%
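Purely as an illustration (the slides do not show any code), the unigram word-window features of Experiment 1 could be written as feature dictionaries for sklearn-crfsuite. Note that this library adds tag-transition features automatically, and the slide's explicit word-bigram features would need extra conjoined entries; all names and data below are hypothetical.

```python
import sklearn_crfsuite

def word_window_features(sent, i):
    """Unigram word-window features for position i (Experiment 1 style)."""
    w = lambda j: sent[j] if 0 <= j < len(sent) else "<PAD>"
    return {"w-2": w(i - 2), "w-1": w(i - 1), "w0": w(i),
            "w+1": w(i + 1), "w+2": w(i + 2)}

# Hypothetical toy data in place of the Telugu corpus.
train_sents = [["vAdu", "annam", "tinnADu"], ["nenu", "pustakam", "caduvutAnu"]]
train_tags = [["PRP", "NN", "VFM"], ["PRP", "NN", "VFM"]]

X = [[word_window_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)
print(crf.predict(X[:1]))
```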

13 Experiments using CRFs (cont.) Experiment 2 features:
Unigram: current word, previous 2nd word, previous word, next word, next 2nd word, and their corresponding first 4 and last 4 letters.
Bigram: previous 2nd word + previous word, previous word + current word.
Correct Tags: 80.26% Incorrect Tags: 19.74%

14 Experiments using CRFs (cont.) Experiment 3 features:
Unigram: current word, previous 2nd word, previous word, and their corresponding first 3 and last 3 letters.
Bigram: previous 2nd word + previous word, previous word + current word.
Correct Tags: 80.55% Incorrect Tags: 19.45%

15 Maximum Entropy Model The principle of maximum entropy (Ratnaparkhi, 1999) states that when one has only partial information about the probabilities of possible outcomes of an experiment, one should choose the probabilities so as to maximize the uncertainty about the missing information. Put another way, since entropy is a measure of randomness, one should choose the most random distribution subject to whatever constraints the problem imposes.
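For reference, the resulting model has the standard exponential form used in maximum entropy tagging (this equation is not on the slide; it is the usual Ratnaparkhi-style parameterization, with feature functions f_j over the context h and tag t, and weights lambda_j):

```latex
p(t \mid h) \;=\; \frac{1}{Z(h)} \exp\Big(\sum_{j} \lambda_j f_j(h, t)\Big),
\qquad
Z(h) \;=\; \sum_{t'} \exp\Big(\sum_{j} \lambda_j f_j(h, t')\Big)
```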

16 Experiments using MEMMs The Maximum Entropy Markov Model is implemented using the "Maxent" tool. Experiment 1 features (unigram): current word, previous 2nd word, previous word, tag of the previous 2nd word, tag of the previous word, next word, next 2nd word, and the first and last 4 letters of the current word. Correct Tags: 80.3% Incorrect Tags: 19.69%
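As a rough sketch of this feature set (an analogous setup only; the slides used the "Maxent" tool, and the sentence and names below are hypothetical), NLTK's MaxentClassifier can be trained on per-token feature dictionaries:

```python
from nltk.classify import MaxentClassifier

def context_features(words, tags, i):
    """Experiment-1-style context: word window, two previous tags, affixes."""
    w = lambda j: words[j] if 0 <= j < len(words) else "<PAD>"
    t = lambda j: tags[j] if 0 <= j < i else "<PAD>"
    cur = w(i)
    return {"w-2": w(i - 2), "w-1": w(i - 1), "w0": cur,
            "w+1": w(i + 1), "w+2": w(i + 2),
            "t-2": t(i - 2), "t-1": t(i - 1),
            "prefix4": cur[:4], "suffix4": cur[-4:]}

# One hypothetical toy sentence stands in for the training corpus.
words, tags = ["vAdu", "annam", "tinnADu"], ["PRP", "NN", "VFM"]
train = [(context_features(words, tags, i), tags[i]) for i in range(len(words))]

clf = MaxentClassifier.train(train, algorithm="iis", trace=0, max_iter=10)
print(clf.classify(context_features(words, tags, 2)))
```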

17 Experiments using MEMMs (cont.) Experiment 2 features: current word, previous 2nd word with its tag, previous word with its tag, tag of the previous 2nd word together with the tag of the previous word, and suffixes and prefixes for rare words (frequency less than 2). Correct Tags: 80.09% Incorrect Tags: 19.9%

18 Experiments using MEMMs (cont.) Experiment 3 features: current word, previous 2nd word with its tag, previous word with its tag, tag of the previous 2nd word together with the tag of the previous word, and suffixes and prefixes for all words. Correct Tags: 82.27% Incorrect Tags: 17.72%

19 Memory Based Learning Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. MBL consists of two components:
a learning component, which is memory-based;
a performance component, which is similarity-based.
The learning component simply adds training examples to memory. At performance time, the stored examples are used as the basis for mapping new input to output.

20 Experiments using MBL Memory-based learning is implemented using TiMBL.
Experiment 1: using the IB1 algorithm. Correct Tags: 75.39% Incorrect Tags: 24.61%
Experiment 2: using the IGTREE algorithm. Correct Tags: 75.75% Incorrect Tags: 24.25%
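To make the idea concrete, here is a toy analogue of IB1-style tagging (my own simplification, not TiMBL's code): classify each token by the single nearest stored training instance under a simple feature-overlap distance. The words and tags are hypothetical.

```python
def extract(words, i):
    """(previous word, current word, next word) context features."""
    w = lambda j: words[j] if 0 <= j < len(words) else "<PAD>"
    return (w(i - 1), w(i), w(i + 1))

def overlap(a, b):
    return sum(x != y for x, y in zip(a, b))   # number of mismatching features

# "Memory": stored training instances (hypothetical toy data).
memory = [
    (("<PAD>", "vAdu", "annam"), "PRP"),
    (("vAdu", "annam", "tinnADu"), "NN"),
    (("annam", "tinnADu", "<PAD>"), "VFM"),
]

def tag(words, i):
    feats = extract(words, i)
    return min(memory, key=lambda inst: overlap(feats, inst[0]))[1]

print(tag(["nenu", "annam", "tinnAnu"], 1))   # closest stored instance gives 'NN'
```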

21 Overall results

Model   Result (%)
HMM     82.47
MEMM    82.27
CRF     80.55
MBL     75.75

The results for Telugu were low compared to other languages because of the limited amount of annotated data (27,336 words).

22 Observations on the Telugu corpus. Error analysis (most frequent confusions):

Actual tag   Assigned tag   Count
NN           JJ             87
VFM          NN             34
PRP          NN             31
VRB          VFM            25
JJ           NN             23
NNP          NN             18
NLOC         PREP           10
VJJ          JJ             14
QF           JJ              7
RP           RB              7

23 Observations on the Telugu corpus (cont.) The Telugu corpus of 27,336 words contains 9,801 distinct word forms. Looking at low-frequency words, 7,143 of these occur only once. This is due to the morphological richness of the language.
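Such corpus statistics are straightforward to compute; a trivial sketch (the file name is hypothetical) counting tokens, distinct word forms and hapax legomena:

```python
from collections import Counter

# "telugu_corpus.txt" is a hypothetical path; any whitespace-tokenised text works.
with open("telugu_corpus.txt", encoding="utf-8") as f:
    tokens = f.read().split()

freq = Counter(tokens)
hapaxes = [w for w, c in freq.items() if c == 1]
# For the paper's corpus these would be 27336, 9801 and 7143 respectively.
print(len(tokens), len(freq), len(hapaxes))
```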

24 Observations on the Telugu corpus (cont.) Word-frequency table (most frequent words):

Frequency   Word
322         A
280         I
199         oVka
189         lo
183         Ayana
178         ani
 89         kUdA

25 Conclusion The accuracy of Telugu POS tagging is low compared to other Indian languages due to the agglutinative nature of the language. One could explore using a morphological analyzer to split verbs into their stems and morphemes, thereby reducing the number of distinct word forms.

26 Thank you

