
Slide 1: Persian POS Tagging
7 November 2006, University of Tehran
Hadi Amiri, Database Research Group (DBRG), ECE Department, University of Tehran

Slide 2: Outline
- What is POS tagging?
- How is data tagged for POS?
- Tagged corpora
- POS tagging approaches
- Corpus training
- How to evaluate a tagger?
- Bijankhan corpus
- Memory-based POS tagger
- MLE-based POS tagger
- Neural network POS tagger

Slide 3: What is POS tagging?
Annotating each word in a sentence with its part of speech (grammatical category).
e.g. I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN
Properties:
- It helps parsing.
- It resolves pronunciation ambiguities: "As the water grew colder, their hands grew number." (number = ADJ, not N)
- It resolves semantic ambiguities: "Patients can bear pain." (bear = V, not N)
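The word/TAG notation used in the example above can be sketched as a list of (word, tag) pairs, the usual in-memory representation of a tagged sentence (a toy illustration, not from the slides):

```python
# Toy illustration: represent a POS-tagged sentence as (word, tag) pairs
# and render it in the word/TAG notation used on this slide.
tagged = [("I", "PRP"), ("would", "MD"), ("prefer", "VB"), ("to", "TO"),
          ("study", "VB"), ("at", "IN"), ("a", "DT"),
          ("traditional", "JJ"), ("school", "NN")]

def render(pairs):
    """Join (word, tag) pairs into the conventional word/TAG string."""
    return " ".join(f"{w}/{t}" for w, t in pairs)

print(render(tagged))
# → I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN
```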

Slide 4: POS applications
Part-of-speech (POS) tagging is important for many applications:
- Word sense disambiguation
- Parsing
- Language modeling
- Q&A and information extraction
- Text-to-speech
Tagging techniques can be used for a variety of tasks:
- Semantic tagging
- Dialogue tagging
- Information retrieval ...

Slide 5: POS tags
- N (noun): baby, toy
- V (verb): see, kiss
- ADJ (adjective): tall, grateful, alleged
- ADV (adverb): quickly, frankly, ...
- P (preposition): in, on, near
- DET (determiner): the, a, that
- WhPron (wh-pronoun): who, what, which, ...
- COORD (coordinator): and, or
(N, V, ADJ, and ADV form the open class.)

Slide 6: POS tags
There is no standard set of POS tags.
- Some use coarse classes: e.g., N, V, A, Aux, ...
- Others prefer finer distinctions (e.g., Penn Treebank):
  - PRP: personal pronouns (you, me, she, he, them, him, ...)
  - PRP$: possessive pronouns (my, our, her, his, ...)
  - NN: singular common nouns (sky, door, theorem, ...)
  - NNS: plural common nouns (doors, theorems, women, ...)
  - NNP: singular proper names (Fifi, IBM, Canada, ...)
  - NNPS: plural proper names (Americas, Carolinas, ...)

Slide 7: How is data tagged for POS?
We are trying to model human performance, so we have humans tag a corpus and try to match their performance.
To create a model:
- A corpus is hand-tagged for POS by more than one annotator.
- The annotations are then checked for reliability.
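The slide does not name a reliability metric; a common choice for checking agreement between two annotators is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with toy tag sequences:

```python
# Illustrative sketch (the metric is an assumption; the slide only says
# "checked for reliability"): Cohen's kappa over two annotators' tags.
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two equal-length tag sequences."""
    assert len(tags_a) == len(tags_b)
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    # Expected agreement if both annotators tagged at random with their
    # own marginal tag frequencies.
    chance = sum(freq_a[t] * freq_b.get(t, 0) for t in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

a = ["N", "V", "N", "ADJ", "N", "V"]
b = ["N", "V", "N", "ADV", "N", "N"]
print(round(cohens_kappa(a, b), 3))  # → 0.455
```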

Slide 8: History
- Greene and Rubin: rule-based tagging, ~70%
- Brown Corpus created (EN-US), 1 million words; Brown Corpus tagged
- HMM tagging (CLAWS): 93%-95%
- LOB Corpus created (EN-UK), 1 million words; LOB Corpus tagged
- DeRose/Church: efficient HMM handling sparse data, 95%+
- British National Corpus (tagged by CLAWS)
- Penn Treebank Corpus (WSJ, 4.5M); POS tagging separated from other NLP
- Transformation-based tagging (Eric Brill): rule-based, 95%+
- Tree-based statistics (Helmut Schmid): 96%+
- Neural network: 96%+
- Trigram tagger (Kempe): 96%+
- Combined methods: 98%+

Slide 9: Tagged corpora

Corpus             | # Tags | # Tokens
Brown              | 87     | 1 million
British National   | 61     | 100 million
Penn Treebank      | 45     | 4.8 million
Original Bijankhan | 550    | ?
Bijankhan          | 40     | 2.6 million

Slide 10: POS tagging approaches
- Supervised: rule-based, stochastic, neural
- Unsupervised: rule-based, stochastic, neural

Slide 11: Rule-based POS tagger
Lexicon with tags identified for each word, e.g. for "that":
  ADV, PRON DEM SG, DET CENTRAL DEM SG, CS
Constraints to eliminate tags:
  If the next word is an adjective, adverb, or quantifier,
  and the following is a sentence boundary,
  and the previous word is not a consider-type verb,
  then eliminate non-ADV tags.
Example: "He was that drunk." (that = ADV)

Slide 12: Probabilistic POS tagging
Provides the possibility of automatic training rather than painstaking rule revision. Automatic training means that a tagger can be easily adapted to new text domains.
e.g. a moving/VBG house vs. a moving/JJ ceremony

Slide 13: Probabilistic POS tagging
- Needs a large tagged corpus for training.
- Unigram statistics (the most common part of speech for each word) get us to about 90% accuracy.
- For greater accuracy, we need some information on adjacent words.

Slide 14: Corpus training
The probabilities in a statistical model come from the corpus it is trained on.
- If the corpus is too domain-specific, the model may not be portable to other domains.
- If the corpus is too general, it will not capitalize on the advantages of domain-specific probabilities.

Slide 15: Tagger evaluation
Once a tagging model has been built, how is it tested?
- Typically, a corpus is split into a training set (usually ~90% of the data) and a test set (10%).
- The test set is held out from training.
- The tagger learns the tag sequences that maximize the probabilities for that model.
- The tagger is tested on the test set.
The tagger is not trained on the test data, but the test data is highly similar to the training data.

Slide 16: Current performance
How many tags are correct?
- About 98% currently
- But the baseline is already 90%
- Baseline algorithm: tag every word with its most frequent tag; tag unknown words as nouns
How well do people do?
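The baseline algorithm on this slide can be sketched directly: learn each word's most frequent tag from a hand-tagged corpus, fall back to the noun tag for unknown words (the toy training pairs below are illustrative, not from the slides):

```python
# Minimal sketch of the 90% baseline: most-frequent-tag per word,
# unknown words tagged as nouns. Toy data, not the Bijankhan corpus.
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, lexicon, unknown_tag="N"):
    """Tag known words from the lexicon; unknown words default to nouns."""
    return [(w, lexicon.get(w, unknown_tag)) for w in words]

train = [("the", "DET"), ("dog", "N"), ("runs", "V"),
         ("the", "DET"), ("run", "N"), ("run", "V"), ("run", "N")]
lexicon = train_baseline(train)
print(tag(["the", "run", "zebra"], lexicon))
# → [('the', 'DET'), ('run', 'N'), ('zebra', 'N')]
```

"run" appears twice as N and once as V, so the baseline always picks N; "zebra" is unseen, so it falls to the noun default.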

Slide 17: Memory-Based Part-of-Speech Tagging: Experiments with Persian Text

Slide 18: Corpus study
At first the corpus had 550 tags. The content is gathered from daily news and common texts. Each document is assigned a subject such as political, cultural, and so on.
- In total, there are 4300 different subjects.
- This subject categorization provides an ideal experimental environment for clustering, filtering, and categorization research.
In this research, we simply ignored the subject categories of the documents and concentrated on POS tags.

Slide 19: Selecting suitable tags
First, the frequency of each tag was gathered. Then many of the tags were grouped together and a smaller tag set was produced. Each tag in the tag set is placed in a hierarchical structure.
As an example, consider the tag "N_PL_LOC":
- N stands for a noun
- PL describes the plurality of the tag
- LOC defines the tag as being about locations
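The hierarchical structure described above can be sketched by splitting a tag on its underscores; the helper names below are illustrative, not from the slides:

```python
# Illustrative sketch: decompose a hierarchical Bijankhan-style tag such
# as "N_PL_LOC" into its levels, and collapse it to its top category.
def parse_tag(tag):
    """Return the hierarchy levels of an underscore-separated POS tag."""
    return tag.split("_")

def coarse(tag):
    """Collapse a fine-grained tag to its top-level category."""
    return parse_tag(tag)[0]

print(parse_tag("N_PL_LOC"))  # → ['N', 'PL', 'LOC']
print(coarse("N_PL_LOC"))     # → N
```

Collapsing to the top level is one simple way to derive the smaller tag set the slide mentions.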

Slide 20: The tags distribution (chart)

Slide 21: Max, min, average, and total number of tags in the training set (table)

Slide 22: Number of different tags
For instance, the word "آسمان" ("the sky") is always tagged "N_SING" throughout the corpus, but a word like "بالا" ("high" or "above") has been tagged with several tags ("ADJ_SIM", "ADV", "ADV_NI", "N_SING", "P", and "PRO").

Slide 23: Classifying the rare words
Tags that occur fewer than 5000 times in the corpus are gathered into an "ETC" group.
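The rare-tag grouping can be sketched as a simple mapping over tag counts; the 5000 threshold is from the slide, while the toy counts and function name are illustrative:

```python
# Sketch of the rare-tag grouping above: tags occurring fewer than 5000
# times collapse into the "ETC" group. Toy counts, not corpus statistics.
from collections import Counter

def collapse_rare(tag_counts, threshold=5000, etc="ETC"):
    """Return a mapping from each tag to itself or to the ETC group."""
    return {t: (t if c >= threshold else etc) for t, c in tag_counts.items()}

counts = Counter({"N_SING": 120000, "V_PA": 30000, "ADV_NI": 1200})
mapping = collapse_rare(counts)
print(mapping["ADV_NI"])  # → ETC
print(mapping["N_SING"])  # → N_SING
```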

Slide 24: Bijankhan corpus (overview)

Slide 25: Implemented methods
- MLE-based POS tagger
- Neural network POS tagger
- Memory-based POS tagger

Slide 26: Implemented methods
- MLE-based POS tagger
- Neural network POS tagger
- Memory-based POS tagger

Slide 27: Memory-based POS tagging
Memory-based POS tagging is also called lazy learning, example-based learning, or case-based learning. MBT uses some specifications of each word, such as its possible tags and a fixed-width context, as features. We used MBT, a tool for memory-based tagger generation and tagging.

Slide 28: Memory-based POS tagging
The MBT tool generates a tagger by working through the annotated corpus and creating three data structures:
- a lexicon, associating words with the tags evident in the training corpus
- a case base for known words (words occurring in the lexicon)
- a case base for unknown words
Selecting appropriate feature sets for known and unknown words has an important impact on the accuracy of the results.

Slide 29: Memory-based POS tagging
After different experiments, we chose "ddfa" as the feature set for known words. With "ddfa", the appropriate tag for each known word is chosen based on the disambiguated tags of the two words before it and the possible tags of the word after it. The pattern reads d d f a:
- d stands for a disambiguated tag
- f means the focus (current) word
- a is the ambiguous word after the current word
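A sketch of extracting the "ddfa" pattern for one position: two disambiguated tags to the left, the focus word, and the ambiguous tag set of the word to the right. The function, padding symbol, and toy lexicon are assumptions for illustration; MBT builds these features internally:

```python
# Illustrative sketch of the "ddfa" known-word features described above.
def ddfa_features(words, assigned_tags, lexicon, i):
    """Feature vector (d, d, f, a) for the known word at position i."""
    pad = "<s>"                                    # assumed boundary symbol
    d2 = assigned_tags[i - 2] if i >= 2 else pad   # tag two words back
    d1 = assigned_tags[i - 1] if i >= 1 else pad   # tag one word back
    f = words[i]                                   # focus word itself
    nxt = words[i + 1] if i + 1 < len(words) else pad
    a = "-".join(sorted(lexicon.get(nxt, ["UNK"])))  # ambiguous tags after
    return (d2, d1, f, a)

lexicon = {"the": ["DET"], "book": ["N", "V"], "flies": ["N", "V"]}
words = ["the", "book", "flies"]
tags_so_far = ["DET"]                # tags already disambiguated, left to right
print(ddfa_features(words, tags_so_far, lexicon, 1))
# → ('<s>', 'DET', 'book', 'N-V')
```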

Slide 30: Memory-based POS tagging
The feature set chosen for unknown words is "dFass". The pattern reads d F a s s:
- d is the disambiguated tag of the word before the current word
- F indicates the position of the focus word; it is not included in the actual feature set used for tagging
- a stands for the ambiguous tags of the word after the current word
- ss are two suffix letters of the current word
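The "dFass" features for an unknown word can be sketched the same way: the previous word's tag, the next word's ambiguous tags, and the unknown word's last two letters (F marks position only). Function and data are again illustrative assumptions:

```python
# Illustrative sketch of the "dFass" unknown-word features described above.
def dfass_features(words, assigned_tags, lexicon, i):
    """Feature vector (d, a, s, s) for the unknown word at position i."""
    pad = "<s>"                                    # assumed boundary symbol
    d = assigned_tags[i - 1] if i >= 1 else pad    # previous word's tag
    nxt = words[i + 1] if i + 1 < len(words) else pad
    a = "-".join(sorted(lexicon.get(nxt, ["UNK"])))  # ambiguous tags after
    word = words[i]
    s1, s2 = word[-2], word[-1]                    # two suffix letters
    return (d, a, s1, s2)

lexicon = {"quickly": ["ADV"]}
words = ["he", "zorped", "quickly"]                # "zorped" is unknown
print(dfass_features(words, ["PRO"], lexicon, 1))
# → ('PRO', 'ADV', 'e', 'd')
```

For Persian text the suffix letters would be Persian characters; the mechanism is identical.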

Slide 31: MBT results, known words ("ddfa") (table)

Slide 32: MBT results, unknown words ("dFass") (table)

Slide 33: MBT results, overall (table)

Slide 34: Implemented methods
- Neural network POS tagger
- MLE-based POS tagger
- Memory-based POS tagger

Slide 35: Maximum likelihood estimation
As a benchmark of POS tagging accuracy, we chose the maximum likelihood estimation (MLE) approach:
- Calculate the maximum likelihood probability for each tag assigned to any word in the training set.
- Choose the tag with the greatest maximum likelihood probability (the designated tag) for each word, and make it the only tag assignable to that word.
To evaluate this method, we analyze the words in the test set and assign the designated tags to them.
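The two steps above can be sketched as follows: estimate P(tag | word) by relative frequency, then pick each word's designated tag. The training pairs are toy counts chosen so that "پراكنده اند" splits 50/50 between V_PRE and V_PA, matching the 0.5000 value on the next slide; the real corpus counts are not in the slides:

```python
# Sketch of the MLE benchmark: P(tag | word) by relative frequency,
# then one designated (highest-probability) tag per word. Toy counts.
from collections import Counter, defaultdict

def mle_probabilities(tagged_corpus):
    """P(tag | word) = count(word, tag) / count(word)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: {t: c / sum(tags.values()) for t, c in tags.items()}
            for w, tags in counts.items()}

def designated_tags(probs):
    """The single tag assignable to each word under MLE."""
    return {w: max(dist, key=dist.get) for w, dist in probs.items()}

train = [("پراكنده اند", "V_PRE"), ("پراكنده اند", "V_PA"),
         ("پدرانه", "ADV_NI"), ("پدرانه", "ADJ_SIM"), ("پدرانه", "ADJ_SIM")]
probs = mle_probabilities(train)
print(probs["پراكنده اند"]["V_PA"])      # → 0.5
print(designated_tags(probs)["پدرانه"])  # → ADJ_SIM
```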

Slide 36: Maximum likelihood estimation (example)

Occurrence | Word        | Tag     | MLE
1          | پدرانه      | ADV_NI  |
           | پدرانه      | ADJ_SIM |
           | پديدار      | ADJ_SIM |
           | پديدار      | N_SING  |
           | پذيرفته     | N_SING  |
           | پذيرفته     | ADJ_SIM |
           | پذيرفته     | V_PA    |
           | پذيرفته     | ADJ_INO |
           | پراكنده اند | V_PRE   |
           | پراكنده اند | V_PA    | 0.5000

Slide 37: MLE results, known words (table)

Slide 38: MLE results, unknown words ("DEFAULT") (table)
Each unknown word is assigned the "DEFAULT" tag.

Slide 39: MLE results, overall ("DEFAULT") (table)
Each unknown word is assigned the "DEFAULT" tag.

Slide 40: MLE results, unknown words ("N_SING") (table)
Each unknown word is assigned the "N_SING" tag.

Slide 41: MLE results, overall ("N_SING") (table)
Each unknown word is assigned the "N_SING" tag, the most frequently assigned tag.

Slide 42: Comparison with other languages (table)

Slide 43: Implemented methods
- MLE-based POS tagger
- Neural network POS tagger
- Memory-based POS tagger

Slide 44: Neural network
Each unit corresponds to one of the tags in the tag set. (Architecture diagram with inputs for the preceding and following words.)

Slide 45: Neural network
For each POS tag pos_j and each of the p + 1 + f words in the context (p preceding words, the focus word, and f following words), there is an input unit whose activation in_ij represents the probability that word i has tag pos_j.
- Input representation for the currently tagged word and the following words: lexical tag probabilities P(pos_j | word_i) estimated from the corpus, as in Schmid's Net-Tagger.
- Activation value for the preceding words: the tagger's already-computed output for those words.
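The input encoding described above can be sketched as one activation per (context position, tag) pair. This follows Schmid's Net-Tagger scheme, which the slide appears to be based on (the slide's own formulas did not survive); the function, tag set, and padding are illustrative assumptions:

```python
# Hedged sketch of the network input: preceding words contribute the
# already-decided tag (one-hot), the focus and following words contribute
# lexical probabilities P(tag | word). Toy tag set, not the Bijankhan set.
TAGS = ["N", "V", "ADJ"]

def input_vector(decided_tags, focus_probs, following_probs):
    """Concatenate per-tag activations for preceding, focus, and following words."""
    vec = []
    for tag in decided_tags:                        # preceding words: one-hot
        vec.extend(1.0 if t == tag else 0.0 for t in TAGS)
    for probs in [focus_probs] + following_probs:   # focus + following words
        vec.extend(probs.get(t, 0.0) for t in TAGS)
    return vec

v = input_vector(decided_tags=["N"],
                 focus_probs={"V": 0.7, "N": 0.3},
                 following_probs=[{"ADJ": 1.0}])
print(v)  # → [1.0, 0.0, 0.0, 0.3, 0.7, 0.0, 0.0, 0.0, 1.0]
```

With p preceding and f following words, the vector has (p + 1 + f) * len(TAGS) activations, one per unit as the slide describes.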

Slide 46: Neural network results on the Bijankhan corpus

Training algorithm       | Hidden layers | Training inputs | Training time | Test inputs | Accuracy
MLP                      | 2             | 1 million       | 120:00:?      | ?           | too low
MLP                      | 3             | 1 million       | ?             | 1000        | too low
Generalized feed-forward | 1             | 1 million       | 95:30:57      | 1000        | too low
Generalized feed-forward | 2             | 1 million       | ?             | 1000        | too low
Generalized feed-forward | ?             | ?               | ?:53:35       | 1000        | 58%

Slide 47: Neural network on other languages: English (table)

Slide 48: Neural network on other languages: Chinese (table)

Slide 49: Future work
- Using more than one level of the POS tag hierarchy.
- Unsupervised POS tagging using the Hamshahri collection.
- Investigating other methods for Persian POS tagging, such as support vector machine (SVM) based tagging.
- Kasre-ye ezafe in Persian!

Slide 50: Thank you. Questions?

