Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Period 1.

1 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Period 1

2 Pay Attention! -cor·pus n. pl. cor·po·ra –A large collection of writings of a specific kind or on a specific subject. –A collection of writings or recorded remarks used for linguistic analysis. –...

3 Corpus Linguistics -The study of language based on sample texts of the language in use, often with computational methods. “Text” in this case means any sample of the language. One of the bodies of text may be called a corpus—Latin for “body”. The plural is corpora.

4 Part-of-Speech (POS) Tagging -The processing of a corpus to apply tags to words or other semantic units corresponding to part-of-speech. “Part-of-speech” here has a very general meaning. –Not the seven classes you learn in English class. –Not the categories of transformational syntax (within “pure” linguistics). –Such categories as are best suited for computer processing of language (esp. parsing).

5 Training Corpora Traditional POS taggers must be trained on a tagged corpus. This corpus is usually a list of words, in the order they appear in the text, with each word's tag next to it. Other information may be included. Most taggers work from corpora that have been hand-tagged by humans. Some can make do with partially tagged or imperfect machine-tagged corpora.
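As a rough illustration of what such a training corpus looks like to a program, the sketch below reads one line of tagged text into (word, tag) pairs. The `word/TAG` layout, the tag names, and the function name `parse_tagged_line` are assumptions made for this example; the slides do not say which file format was used. Tokens with no slash are not handled.

```python
def parse_tagged_line(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST '/', so a word that itself
        # contains a slash (e.g. '1/2/NUM') keeps its slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs


# Hypothetical one-sentence corpus in the assumed format.
sentence = parse_tagged_line("The/DET dog/NOUN barks/VERB")
print(sentence)
```

Lower-casing the words is a design choice that merges sentence-initial and mid-sentence forms, which matters more when the training corpus is small.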

6 Limited Training Corpora Modern tagging methods call for the largest possible amount of training data. Such data can often be hard to obtain. (Not quite the case with English anymore, but other languages can be more problematic.) The behavior of POS taggers trained on smaller corpora may differ from that of taggers trained on larger ones.

7 Genre and Tagging Many corpora are divided by the genre of their source texts: transcribed speech, newswire articles, fiction, etc. Training knowledge about a specific genre can be, and usually is, more valuable than general knowledge. When the training corpus becomes small enough, general knowledge may in some cases become more valuable than genre knowledge.

8 Implementation Basics The frequencies of each tag, word, tag-tag transition, and word-tag correspondence are needed. These frequencies are generalized into probabilities: that a given word has a given tag, and that a given tag follows another tag. For each sentence these probabilities are combined, and the most probable sequence of tags is chosen. This is the Viterbi algorithm, which treats the system as a Hidden Markov Model.
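The counting and decoding steps described on this slide can be sketched as follows. This is a minimal bigram HMM tagger, not the presenter's implementation: the add-one smoothing, the `<s>` sentence-start marker, the toy corpus, and the names `train` and `viterbi` are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def train(tagged_sentences):
    """Count tags, tag-tag transitions, and word-tag pairs."""
    tag_counts, from_counts = Counter(), Counter()
    trans_counts, emit_counts = Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed sentence-start marker
        for word, tag in sentence:
            tag_counts[tag] += 1
            from_counts[prev] += 1          # times prev had a successor
            trans_counts[(prev, tag)] += 1
            emit_counts[(tag, word)] += 1
            vocab.add(word)
            prev = tag
    return tag_counts, from_counts, trans_counts, emit_counts, vocab


def viterbi(words, model):
    """Return the most probable tag sequence for words."""
    tag_counts, from_counts, trans_counts, emit_counts, vocab = model
    if not words:
        return []
    tags = list(tag_counts)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unseen words

    def log_trans(prev, tag):  # add-one smoothed P(tag | prev)
        return math.log((trans_counts[(prev, tag)] + 1) / (from_counts[prev] + n_tags))

    def log_emit(tag, word):   # add-one smoothed P(word | tag)
        return math.log((emit_counts[(tag, word)] + 1) / (tag_counts[tag] + n_words))

    # best[i][t] = (log prob of the best path ending in tag t at position i,
    #               the tag that precedes t on that path)
    best = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_trans(p, t), p) for p in tags)
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy training corpus (hypothetical data, for illustration only).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(viterbi(["the", "cat", "barks"], model))
```

Working in log probabilities avoids numeric underflow on long sentences, and the smoothing gives unseen words and unseen transitions a small nonzero probability, which is especially relevant when the training corpus is limited.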

9 Intermission Who names their kid Corpus? I mean, seriously?

10 Questions for Investigation How do POS taggers trained on limited corpora perform? How does performance differ among genres, and between general and genre-specific tagging? Is this in accordance with theories based on large corpora? Can a transition point, the minimal size, be found? How general is such information?

11 Results Results were found for genre-specific tagging for each of the four genres in the corpus as well as for general tagging on an exemplar of each genre. A “primed” genre-specific case wherein the system is trained on the target text as well is included for comparison.

13 Comparison of Taggings Primed genre-specific taggings are most accurate in all cases. General tagging is somewhat less accurate in two cases, very slightly less accurate in one case, and somewhat more accurate in one case.

17 Training Corpus Size No clear trend can be drawn from the sizes investigated. The small differences involved are largely inconsequential compared to differences between texts.

18 Conclusions Performance Evaluation: –Relatively poor compared to large-corpora taggers but much better than random. Effect of Genres: –Genred taggings seem to work somewhat better, although numbers found are not overwhelmingly convincing. –Existence of “transition point” still uncertain. –Best for G (biographies; memoirs) and N (adventure; fiction), second is A (press reportage), last is J (learned writing).

19 Potential Improvements This tagger was designed to be purely statistical. Given some human knowledge it could improve. Basic conjugation, declension, and other word-formation methods would be very helpful in determinations of POS. Using further depths of association among sequences of tags would also help.

20 Final Words Limited-corpora training is inadequate for many uses if one uses the standard methods. An efficient, possibly improved implementation could be helpful in preliminary work for a Baum-Welch re-estimation or for tagging with a larger corpus.

21 Quiz and Q&A Define corpus. Any questions?

