SIMS 290-2: Applied Natural Language Processing. Marti Hearst, Sept 13, 2004.


2 Today
– Purpose of part-of-speech tagging
– Training and testing collections
– Intro to n-grams and language modeling
– Using NLTK for POS tagging

3 Class Exercise
I will read off a few words from the beginning of a sentence. You should write down the very first two words that come to mind that should follow these words.
Example: I say "One fish", you write "two fish".
Don't second-guess or try to be clever. Note: there are no correct answers.

4 Modified from Diane Litman's version of Steve Bird's notes
Terminology
– Tagging: the process of associating labels with each token in a text
– Tags: the labels
– Tag set: the collection of tags used for a particular task

5 Modified from Diane Litman's version of Steve Bird's notes
Example
Typically a tagged text is a sequence of white-space separated base/tag tokens:
The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.

6 Modified from Diane Litman's version of Steve Bird's notes
What does Tagging do?
1. Collapses distinctions: lexical identity may be discarded, e.g. all personal pronouns tagged with PRP
2. Introduces distinctions: ambiguities may be removed, e.g. deal tagged with NN or VB, or deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction

7 Modified from Diane Litman's version of Steve Bird's notes
Significance of Parts of Speech
A word's POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), pronunciation (the noun OBject vs. the verb obJECT), or both (wind)
– Helps in stemming
– Limits the range of following words for speech recognition
– Can help select nouns from a document for IR
– Basis for partial parsing (chunked parsing)
– Parsers can build trees directly on the POS tags instead of maintaining a lexicon

8 Slide modified from Massimo Poesio's
Choosing a tagset
The choice of tagset greatly affects the difficulty of the problem. We need to strike a balance between:
– Getting better information about context (best served by introducing more distinctions)
– Making it possible for classifiers to do their job (which requires minimizing distinctions)

9 Slide modified from Massimo Poesio's
Some of the best-known tagsets
– Brown corpus: 87 tags
– Penn Treebank: 45 tags
– Lancaster UCREL C5 (used to tag the BNC): 61 tags
– Lancaster C7: 145 tags

10 Modified from Diane Litman's version of Steve Bird's notes
The Brown Corpus
– The first digital corpus (1961); Francis and Kucera, Brown University
– Contents: 500 texts, each 2000 words long, from American books, newspapers, and magazines
– Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

11 Modified from Diane Litman's version of Steve Bird's notes
Penn Treebank
– First syntactically annotated corpus
– 1 million words from the Wall Street Journal
– Part-of-speech tags and syntax trees

12 Slide modified from Massimo Poesio's
How hard is POS tagging?
Number of tags:       1      2     3    4   5   6  7
Number of word types: 35340  3760  264  61  12  2  1
In the Brown corpus, 11.5% of word TYPES are ambiguous, but 40% of word TOKENS are.
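The type/token ambiguity figures above can be computed with a few lines of Python. The toy tagged corpus below is purely illustrative, standing in for the full Brown corpus:

```python
# Toy tagged corpus: (word, tag) pairs. A real study would run this
# over the whole Brown corpus; the counts here are illustrative only.
tagged = [("the", "AT"), ("deal", "NN"), ("is", "BEZ"),
          ("a", "AT"), ("deal", "VB"), ("to", "TO"),
          ("race", "VB"), ("the", "AT"), ("race", "NN")]

# Collect the set of tags seen for each word type.
tags_per_type = {}
for word, tag in tagged:
    tags_per_type.setdefault(word, set()).add(tag)

# A word type is ambiguous if it was seen with more than one tag.
ambiguous_types = {w for w, ts in tags_per_type.items() if len(ts) > 1}
type_ambiguity = len(ambiguous_types) / len(tags_per_type)

# Token-level ambiguity: fraction of tokens whose type is ambiguous.
token_ambiguity = sum(1 for w, _ in tagged if w in ambiguous_types) / len(tagged)
```

Note how the token-level rate exceeds the type-level rate, as on the slide: ambiguous words tend to be frequent ones.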

13 Slide modified from Massimo Poesio's
Important Penn Treebank tags

14 Slide modified from Massimo Poesio's
Verb inflection tags

15 Slide modified from Massimo Poesio's
The entire Penn Treebank tagset

16 Slide modified from Massimo Poesio's
Quick test: DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.

17 Tagging methods
– Hand-coded
– Statistical taggers
– Brill (transformation-based) tagger

18 Modified from Diane Litman's version of Steve Bird's notes
Reading Tagged Corpora
>>> corpus = brown.read('ca01')
>>> corpus['WORDS'][0:10]
[The/at, Fulton/np-tl, County/nn-tl, Grand/jj-tl, Jury/nn-tl, said/vbd, Friday/nr, an/at, investigation/nn, of/in]
>>> corpus['WORDS'][2]['TAG']
'nn-tl'
>>> corpus['WORDS'][2]['TEXT']
'County'

19 Modified from Diane Litman's version of Steve Bird's notes
Default Tagger
We need something to use for unseen words, e.g., guess NNP for a word with an initial capital.
How to do this?
– Apply a sequence of regular expression tests
– If one matches, assign the word a suitable tag
If there are no matches, assign the most frequent tag for unknown words, NN.
– Other common choices are verb, proper noun, and adjective
– Note the role of closed-class words in English (prepositions, auxiliaries, etc.): new ones do not tend to appear.
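The regular-expression fallback described above can be sketched in plain Python. The patterns and tag names below are illustrative guesses, not NLTK's actual defaults:

```python
import re

# Ordered regular-expression tests; the catch-all ".*" guarantees that
# every word gets some tag, with NN as the default for unknown words.
patterns = [
    (r"^[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r"^[A-Z][a-z]+$",       "NNP"),  # initial capital -> proper-noun guess
    (r".*ing$",              "VBG"),  # -ing ending -> gerund guess
    (r".*",                  "NN"),   # default: most frequent unknown-word tag
]

def default_tag(word):
    # Try each pattern in order and return the tag of the first match.
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag

tags = [default_tag(w) for w in ["John", "saw", "3", "running", "bears"]]
```

The order of the patterns matters: the catch-all must come last, or it would shadow the more specific tests.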

20 Modified from Diane Litman's version of Steve Bird's notes
A Default Tagger
>>> from nltk.tokenizer import *
>>> from nltk.tagger import *
>>> text_token = Token(TEXT="John saw 3 polar bears.")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
[John/nn, saw/nn, 3/cd, polar/nn, bears./nn]
NN_CD_tagger assigns cd to numbers and nn to everything else. It has poor performance (20-30%) in isolation, but when used with other taggers it can significantly improve performance.

21 Modified from Diane Litman's version of Steve Bird's notes
Finding the most frequent tag
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import brown
>>> fd = FreqDist()
>>> corpus = brown.read('ca01')
>>> for token in corpus['WORDS']:
...     fd.inc(token['TAG'])
>>> fd.max()
>>> fd.count(fd.max())
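The same most-frequent-tag computation can be done with the standard library's Counter; the toy tagged corpus below merely stands in for the ca01 file:

```python
from collections import Counter

# Count tag occurrences over a (word, tag) corpus and report the most
# frequent tag, mirroring fd.max() / fd.count(fd.max()) above.
tagged = [("the", "at"), ("Fulton", "np-tl"), ("County", "nn-tl"),
          ("jury", "nn"), ("said", "vbd"), ("investigation", "nn")]

tag_counts = Counter(tag for _, tag in tagged)
most_frequent = tag_counts.most_common(1)[0]  # (tag, count) pair
```

On the real Brown corpus the winner is nn, which is why NN makes a sensible default tag.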

22 Evaluating the Tagger
This gets 3 wrong out of 16, an error rate of 18.75%. We can also say it has an accuracy of 81.25%.

23 Training vs. Testing
A fundamental idea in computational linguistics:
– Start with a collection labeled with the right answers (supervised learning); usually the labels are done by hand
– "Train" (or "teach") the algorithm on a subset of the labeled text
– Test the algorithm on a different set of data
Why? If memorization worked, we'd be done. We need to generalize so the algorithm works on examples it hasn't seen yet; thus testing only makes sense on examples it wasn't trained on.
NLTK has an excellent interface for doing this easily.
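A minimal sketch of the split described above, assuming a common 90/10 convention and placeholder data:

```python
# Hold out part of the labeled data for testing. The labeled examples
# here are placeholders; in practice they would be tagged sentences.
labeled = [("sentence %d" % i, "label") for i in range(100)]

cut = int(len(labeled) * 0.9)
train_set = labeled[:cut]   # used to train the tagger
test_set = labeled[cut:]    # never seen during training

# Training and test sets must not overlap, or accuracy is inflated.
assert not set(train_set) & set(test_set)
```

In practice the data is usually shuffled (or split by document) before cutting, so the test portion is not systematically different from the training portion.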

24 Training the Unigram Tagger

25 Creating Separate Training and Testing Sets

26 Modified from Diane Litman's version of Steve Bird's notes
Evaluating a Tagger
– Start with tagged tokens (the original data)
– Untag (exclude) the tags from the data
– Tag the data with your own tagger
– Compare the original and new tags: iterate over the two lists, checking for identity and counting
– Accuracy = fraction correct
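The evaluation loop above reduces to a few lines. Here `tag_fn` is a stand-in for any word-to-tag function, and the gold data is a toy example:

```python
# Strip the gold tags, re-tag with the tagger under test, and count
# how often the predicted tag matches the original one.
def accuracy(tag_fn, gold):
    correct = sum(1 for word, tag in gold if tag_fn(word) == tag)
    return correct / len(gold)

gold = [("the", "AT"), ("deal", "NN"), ("is", "BEZ"), ("good", "JJ")]

# A deliberately crude tagger: everything is NN except "the".
crude_tagger = lambda w: "AT" if w == "the" else "NN"
acc = accuracy(crude_tagger, gold)
```

For a meaningful number, `gold` must come from the held-out test set, never from the training data.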

27 Assessing the Errors
Why the tuple method? Dictionaries cannot be indexed by lists, so convert the lists to tuples.
exclude returns a new token containing only the properties that are not named in the given list.

28 Assessing the Errors

29 Language Modeling
Another fundamental concept in NLP. Main idea: for a given language, some words are more likely than others to follow each other; equivalently, you can predict (with some degree of accuracy) the probability that a given word will follow another word.
Illustration: the distributions of words in the class-participation exercise.
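The idea can be sketched as a tiny bigram model: count which words follow which, then turn the counts into conditional probabilities. The corpus below is a toy example in the spirit of the class exercise:

```python
from collections import Counter, defaultdict

# Estimate P(next | previous) from bigram counts in a toy corpus.
words = "one fish two fish red fish blue fish".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    # Maximum-likelihood estimate: count(prev, nxt) / count(prev, *).
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0
```

After "one" the model is certain the next word is "fish", while after "fish" the probability mass is split three ways, much like the spread of answers in the exercise.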

30 N-Grams
The N stands for how many terms are used:
– Unigram: 1 term
– Bigram: 2 terms
– Trigram: 3 terms (we usually don't go beyond this)
You can use different kinds of terms, e.g. character-based n-grams, word-based n-grams, or POS-based n-grams.
Ordering: the terms are often adjacent, but this is not required.
We use n-grams to help determine the context in which some linguistic phenomenon happens, e.g., look at the words before and after a period to decide whether it ends a sentence or not.
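Extracting n-grams is the same sliding-window operation whether the terms are characters or words; a minimal sketch:

```python
# Slide a window of width n over any sequence and collect the windows.
def ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Word-based bigrams: pass a list of words.
word_bigrams = ngrams("the race for outer space".split(), 2)

# Character-based trigrams: pass a string (a sequence of characters).
char_trigrams = ngrams("race", 3)
```

A sequence of length L yields L - n + 1 n-grams, so "race" has exactly two character trigrams.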

31 Modified from Diane Litman's version of Steve Bird's notes
Features and Contexts
  w_{n-2}   w_{n-1}   w_n      w_{n+1}
  t_{n-2}   t_{n-1}   t_n      t_{n+1}
  CONTEXT   CONTEXT   FEATURE  CONTEXT

32 Modified from Diane Litman's version of Steve Bird's notes
Unigram Tagger
– Trained using a tagged corpus to determine which tag is most common for each word, e.g., in a tagged WSJ sample, "deal" is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
– Performance is highly dependent on the quality of the training set: it can't be too small, and it can't be too different from the texts we actually want to tag
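A unigram tagger of this kind can be sketched as a table of per-word tag counts; the tiny training set below mimics the "deal" counts from the slide and is not real WSJ data:

```python
from collections import Counter, defaultdict

# Training: tally how often each word carries each tag.
train = [("deal", "NN")] * 11 + [("deal", "VB")] + [("a", "AT")] * 3

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def unigram_tag(word):
    # Unseen words get None here; a backoff tagger would handle them.
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]
```

So "deal" is always tagged NN, even in contexts where VB would be correct; that systematic error is exactly what higher-order taggers address.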

33 Modified from Diane Litman's version of Steve Bird's notes
Nth-Order Tagging
Order refers to how much context is used. It is one less than the N in N-gram because the target word itself counts as part of the context:
– 0th order = unigram tagger
– 1st order = bigrams
– 2nd order = trigrams
Bigram tagger: for tagging, in addition to considering the token's type, the context also considers the tags of the n preceding tokens. What is the most likely tag for w_n, given w_{n-1} and t_{n-1}? The tagger picks the tag that is most likely for that context.
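A 1st-order (bigram) tagger can be sketched by conditioning on the previous tag together with the current word. The toy training sentences below (echoing the to-race / the-race contrast used later in the lecture) are illustrative:

```python
from collections import Counter, defaultdict

# Two tiny training sentences as lists of (word, tag) pairs.
train = [[("to", "TO"), ("race", "VB")],
         [("the", "AT"), ("race", "NN")]]

# Count tags per (previous tag, current word) context.
contexts = defaultdict(Counter)
for sent in train:
    prev_tag = None  # sentence-initial context
    for word, tag in sent:
        contexts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def bigram_tag(prev_tag, word):
    # None for unseen contexts; a backoff tagger would handle those.
    seen = contexts[(prev_tag, word)]
    return seen.most_common(1)[0][0] if seen else None
```

Unlike the unigram tagger, this one can tag "race" as VB after TO but NN after AT.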

34 Reading the Bigram table
– The current word
– The previously seen tag
– The predicted POS

35 Modified from Massimo Poesio's lecture
Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to "race" given its lexical frequency.
Solution: choose the tag with the greater probability, P(race|VB) vs. P(race|NN).
Actual estimates from the Switchboard corpus: P(race|NN) = .00041, P(race|VB) = .00003

36 Modified from Diane Litman's version of Steve Bird's notes
Combining Taggers
Use more accurate algorithms when we can; back off to wider coverage when needed:
1. Try tagging the token with the 1st-order tagger.
2. If the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order tagger.
3. If the 0th-order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.
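The backoff chain described above amounts to trying taggers in order until one returns a tag. The component taggers below are simple stand-ins (plain word-to-tag functions returning None on failure), not NLTK classes:

```python
# Combine taggers from most specific to most general: the first one
# that returns a non-None tag wins.
def make_backoff(*taggers):
    def tag(word):
        for t in taggers:
            result = t(word)
            if result is not None:
                return result
    return tag

# A "trained" tagger that only knows two words (dict.get gives None
# for anything else), backed off to the NN/CD-style default.
specific = {"race": "NN", "to": "TO"}.get
default = lambda w: "CD" if w.isdigit() else "NN"

tagger = make_backoff(specific, default)
```

Ordering matters: putting the catch-all default first would make the specific tagger unreachable.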

37 Modified from Diane Litman's version of Steve Bird's notes
BackoffTagger class
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
# Construct the taggers
>>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')
>>> tagger2 = UnigramTagger()  # 0th order
>>> tagger3 = NN_CD_Tagger()
# Train the taggers
>>> for tok in train_toks:
...     tagger1.train(tok)
...     tagger2.train(tok)

38 Modified from Diane Litman's version of Steve Bird's notes
Backoff (continued)
# Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
# Use the combined tagger
>>> accuracy = tagger_accuracy(tagger, unseen_tokens)

39 Modified from Diane Litman's version of Steve Bird's notes
Rule-Based Tagger: The Linguistic Complaint
– Where is the linguistic knowledge of a tagger? It's just a massive table of numbers.
– Aren't there any linguistic insights that could emerge from the data?
– We could instead use handcrafted sets of rules to tag input sentences, for example: if a word follows a determiner, tag it as a noun.

40 Slide modified from Massimo Poesio's
The Brill tagger
– An example of TRANSFORMATION-BASED LEARNING
– Very popular (freely available, works fairly well)
– A SUPERVISED method: requires a tagged corpus
– Basic idea: do a quick job first (using frequency), then revise it using contextual rules

41 Brill Tagging: In more detail
Start with simple (less accurate) rules and learn better ones from the tagged corpus:
1. Tag each word initially with its most likely POS
2. Examine a set of transformations to see which most improves the tagging decisions compared to the tagged corpus
3. Re-tag the corpus using the best transformation
4. Repeat until, e.g., performance no longer improves
Result: a tagging procedure (an ordered list of transformations) that can be applied to new, untagged text.

42 Slide modified from Massimo Poesio's
An example
Examples: "It is expected to race tomorrow." / "The race for outer space."
Tagging algorithm:
1. Tag all uses of "race" as NN (the most likely tag in the Brown corpus):
   It is expected to race/NN tomorrow
   the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO:
   It is expected to race/VB tomorrow
   the race/NN for outer space

43 Slide modified from Massimo Poesio's
Transformation-based learning in the Brill tagger
1. Tag the corpus with the most likely tag for each word
2. Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one, such that the resulting tagged corpus has the lowest error rate
3. Apply that transformation to the training corpus
4. Repeat
5. Return a tagger that (a) first tags using unigrams, and (b) then applies the learned transformations in order
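The learning loop can be sketched, drastically simplified, with a single rule template ("change tag OLD to NEW when the previous tag is PREV") and one learned rule; real Brill tagging uses many templates and iterates until no rule helps:

```python
from itertools import product

# Gold-standard tags and the initial unigram guesses ("race" -> NN).
gold = [("to", "TO"), ("race", "VB"), ("the", "AT"), ("race", "NN")]
initial = ["TO", "NN", "AT", "NN"]

def apply_rule(tags, rule):
    # rule = (old, new, prev): rewrite OLD as NEW after tag PREV.
    old, new, prev = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == old and out[i - 1] == prev:
            out[i] = new
    return out

def errors(tags):
    # Number of positions where the tagging disagrees with the gold data.
    return sum(1 for (_, g), t in zip(gold, tags) if g != t)

# Search all instantiations of the template for the error-minimizing rule.
tagset = ["AT", "NN", "TO", "VB"]
best_rule = min(product(tagset, repeat=3),
                key=lambda r: errors(apply_rule(initial, r)))
```

The learned rule is exactly the one from the "race" example on the previous slide: change NN to VB when the preceding tag is TO.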

44 Slide modified from Massimo Poesio's
Examples of learned transformations

45 Slide modified from Massimo Poesio's
Templates

46 Adapted from Massimo Poesio's
Additional issues
– Most of the difference in performance between POS algorithms depends on their treatment of UNKNOWN WORDS
– Multiple-token words ("Penn Treebank")
– Class-based n-grams

47 Upcoming
– I will email the procedures for turning in the first assignment on Wed Sept 15; it will be over the web.
– On Wed I'll discuss shallow parsing. Start reading the Chunking (Shallow Parsing) tutorial; I will assign homework from it on Wed, due in one week on Sept 22.
– Next Monday I'll briefly discuss syntactic parsing. There is a tutorial on this; feel free to read it, but in the interests of reducing workload I'm not assigning it.

