1 SIMS 290-2: Applied Natural Language Processing
Marti Hearst
Sept 22, 2004

2 Today
Cascaded Chunking
Example of Using Chunking: Word Associations
Evaluating Chunking
Going to the next level: Parsing

3 Cascaded Chunking
Goal: create chunks that include other chunks
Examples:
–A PP consists of a preposition followed by an NP
–A VP consists of a verb followed by PPs or NPs
How to make it work in NLTK: the tutorial is a bit confusing, so I attempt to clarify it here (see the grammar sketch below).
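
A minimal sketch of what such a cascaded grammar can look like in present-day NLTK (the 2004 API used in the lecture differed); the tag patterns and chunk names here are illustrative assumptions, not the exact rules from the lecture:

```python
import nltk

# Later stages can refer to chunks built by earlier stages:
# PP contains an NP, and VP contains NPs and/or PPs.
grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # noun phrase: optional determiner, adjectives, nouns
  PP: {<IN><NP>}                  # prepositional phrase: preposition + NP chunk
  VP: {<VB.*><NP|PP>*}            # verb phrase: verb followed by NPs/PPs
"""
chunker = nltk.RegexpParser(grammar, loop=2)  # loop re-applies the stages so chunks can nest
```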

4 Creating Cascaded Chunkers
Start with a sentence token
–A list of words with parts of speech assigned
–Create a fresh one or use one from a corpus (see the example below)
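
For instance (a sketch; the sentence and corpus section are arbitrary choices), you can tag fresh text yourself or take an already-tagged sentence from a corpus:

```python
import nltk
from nltk.corpus import brown

# A fresh sentence token: tokenize and tag it ourselves
# (requires the punkt and POS-tagger data packages).
fresh = nltk.pos_tag(nltk.word_tokenize("The doctor examined the patient in the clinic."))

# Or take a sentence that already has parts of speech assigned.
from_corpus = brown.tagged_sents(categories='news')[0]

print(fresh[:3])   # [('The', 'DT'), ('doctor', 'NN'), ('examined', 'VBD')]
```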

5 Creating Cascaded Chunkers
Create a set of chunk parsers
–One for each chunk type
–Each one takes as input some kind of list of tokens, and produces as output a new list of tokens
–You can decide what this new list is called
  Examples: NP-CHUNK, PP-CHUNK, VP-CHUNK
–You can also decide what to name each occurrence of the chunk type, as it is assigned to a subset of tokens
  Examples: NP, VP, PP
How are higher-level tags matched? The parser just matches their string labels, so be certain that your chunk names do not overlap with POS tag names.
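
The slide describes the old NLTK interface, where each chunk type had its own parser producing a new token list; in current NLTK a single RegexpParser with several named stages plays the same role. A sketch (the sentence and tag patterns are assumptions):

```python
import nltk

sent = [("The", "DT"), ("doctor", "NN"), ("examined", "VBD"),
        ("the", "DT"), ("patient", "NN"), ("in", "IN"),
        ("the", "DT"), ("clinic", "NN")]

# The label before each colon names the chunk occurrences in the output
# tree (NP, PP, VP here), so keep those names distinct from POS tags.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP>*}
"""
chunker = nltk.RegexpParser(grammar, loop=2)
print(chunker.parse(sent))
# (S (NP The/DT doctor/NN)
#    (VP examined/VBD (NP the/DT patient/NN) (PP in/IN (NP the/DT clinic/NN))))
```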

6-8 [Slides contained only images; no transcript text]

9 Let's do some text analysis
Let's try this on more complex sentences:
–First, read in part of a corpus
–Then, count how often each word occurs with each POS
–Determine some common verbs, and choose one
–Make a list of sentences containing that verb
–Test out the chunker on them; examine further (a code sketch follows)
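
A sketch of these steps in current NLTK; the corpus section, the verb "said", and the chunk grammar are assumptions made for illustration:

```python
import nltk
from nltk.corpus import brown

# 1. Read in part of a corpus (tagged sentences from the news section).
sents = brown.tagged_sents(categories='news')

# 2. Count how often each word occurs with each POS.
cfd = nltk.ConditionalFreqDist((w.lower(), t) for s in sents for (w, t) in s)

# 3. Determine some common verbs (here: most frequent past-tense forms).
common_verbs = sorted((w for w in cfd if cfd[w]['VBD']),
                      key=lambda w: cfd[w]['VBD'], reverse=True)[:10]
print(common_verbs)

# 4. Make a list of sentences containing one chosen verb.
target = 'said'
hits = [s for s in sents if any(w.lower() == target for (w, t) in s)]

# 5. Test the chunker on them; examine further.
chunker = nltk.RegexpParser(r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP>*}
""")
for s in hits[:3]:
    print(chunker.parse(s))
```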

10-12 [Slides contained only images; no transcript text]

13-16 Why didn't this parse work? [Example parses shown as slide images; no transcript text]

17 Corpus Analysis for Discovery of Word Associations
A classic paper by Church & Hanks showed how to use a corpus and a shallow parser to find interesting dependencies between words
–"Word Association Norms, Mutual Information, and Lexicography", Computational Linguistics, 16(1), 1990
–http://www.research.att.com/~kwc/publications.html
Some cognitive evidence from word association norms: which word do people say most often after hearing another word?
–Given "doctor": nurse, sick, health, medicine, hospital…
People also respond more quickly to a word if they've seen an associated word
–E.g., if you show "bread", they're faster at recognizing "butter" than "nurse" (vs. a nonsense string)

18 Corpus Analysis for Discovery of Word Associations
Idea: use a corpus to estimate word associations
Association ratio: log( P(x,y) / (P(x)P(y)) )
–The probability of seeing x followed by y vs. the probability of seeing x anywhere times the probability of seeing y anywhere
–P(x) is how often x appears in the corpus
–P(x,y) is how often y follows x within w words
Interesting associations with "doctor":
–X: honorary Y: doctor
–X: doctors Y: dentists
–X: doctors Y: nurses
–X: doctors Y: treating
–X: examined Y: doctor
–X: doctors Y: treat
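
A sketch of computing the association ratio from a corpus; the window size w, the corpus section, and the use of log base 2 roughly follow Church & Hanks's setup, but the exact normalization here is a simplifying assumption:

```python
import math
from collections import Counter
from nltk.corpus import brown

words = [w.lower() for w in brown.words(categories='news')]
N = len(words)
window = 5   # count y as "following x" if it occurs within w = 5 words after x

unigrams = Counter(words)                      # for P(x), P(y)
pairs = Counter()                              # for P(x, y)
for i, x in enumerate(words):
    for y in words[i + 1:i + 1 + window]:
        pairs[(x, y)] += 1

def association_ratio(x, y):
    """log2( P(x,y) / (P(x) P(y)) ): large positive values suggest association."""
    if not pairs[(x, y)]:
        return float('-inf')
    return math.log2((pairs[(x, y)] / N) / ((unigrams[x] / N) * (unigrams[y] / N)))

print(association_ratio('united', 'states'))   # strongly associated pair
print(association_ratio('the', 'states'))      # much weaker association
```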

19 Corpus Analysis for Discovery of Word Associations
Now let's make use of syntactic information: look at which words and syntactic forms follow a given verb, to see what kinds of arguments it takes
Compute subject-verb-object triples
Example: nouns that appear as the object of the verb "drink":
–martinis, cup_water, champagne, beverage, cup_coffee, cognac, beer, cup, coffee, toast, alcohol…
–What can we note about many of these words?
Example: verbs that take "telephone" as their object:
–sit_by, disconnect, answer, hang_up, tap, pick_up, return, be_by, spot, repeat, place, receive, install, be_on
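
A rough way to pull out the verb-object part of such triples from chunked output (a sketch: it reuses an NP/VP grammar like the one above and simply takes the last word of the NP after the verb as the object head):

```python
import nltk

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  VP: {<VB.*><NP>}
"""
chunker = nltk.RegexpParser(grammar)

def verb_object_pairs(tagged_sent):
    """Yield (verb, object-head-noun) pairs from one tagged sentence."""
    tree = chunker.parse(tagged_sent)
    for vp in tree.subtrees(lambda t: t.label() == 'VP'):
        verb = vp[0][0]                         # first leaf of the VP is the verb
        nps = [c for c in vp if isinstance(c, nltk.Tree) and c.label() == 'NP']
        if nps:
            head = nps[0].leaves()[-1][0]       # last word of the NP as its head
            yield (verb.lower(), head.lower())

sent = nltk.pos_tag(nltk.word_tokenize("She drank a cup of strong coffee."))
print(list(verb_object_pairs(sent)))            # e.g. [('drank', 'cup')]
```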

20 Corpus Analysis for Discovery of Word Associations
The approach has become standard, and entire collections are available
Dekang Lin's Dependency Database
–Given a word, retrieve the words that have a dependency relationship with it
Dependency-based Word Similarity
–Given a word, retrieve the words that are most similar to it, based on dependencies
http://www.cs.ualberta.ca/~lindek/demos.htm

21 Example Dependency Database: "sell"

22 Example Dependency-based Similarity: "sell"

23 Homework Assignment
Choose a verb of interest
Analyze the context in which the verb appears
Can use any corpus you like
–Can train a tagger and run it on some fresh text (see the sketch below)
Example: What kinds of arguments does it take?
Improve on my chunking rules to get better characterizations
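
If you go the train-your-own-tagger route, a minimal sketch (the training corpus, backoff tag, and test sentence are assumptions):

```python
import nltk
from nltk.corpus import brown

train = brown.tagged_sents(categories='news')

# Unigram tagger with a default-tag backoff for unseen words.
tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN'))

fresh = nltk.word_tokenize("The committee will discuss the proposal tomorrow.")
print(tagger.tag(fresh))
```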

24 Evaluating the Chunker
Why not just use accuracy? (Accuracy = #correct / total number)
Definitions:
–Total: number of chunks in the gold standard
–Guessed: number of chunks the chunker labeled
–Correct: of the guessed chunks, how many were correct
–Missed: how many correct chunks were not guessed
Precision: #correct / #guessed
Recall: #correct / #total
F-measure: 2 * (Prec * Recall) / (Prec + Recall)

25 Example
Assume the following numbers:
–Total: 100
–Guessed: 120
–Correct: 80
–Missed: 20
Precision: 80 / 120 = 0.67
Recall: 80 / 100 = 0.80
F-measure: 2 * (0.67 * 0.80) / (0.67 + 0.80) = 0.73
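
Working through those numbers in code (the F-measure comes out near 0.73):

```python
total, guessed, correct = 100, 120, 80

precision = correct / guessed          # 0.666...
recall = correct / total               # 0.80
f_measure = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f_measure, 2))  # 0.67 0.8 0.73
```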

26 Evaluating in NLTK
We have some already-chunked text from the Treebank
The code (shown on the slide) uses the existing parse to compare against, and generates tokens of type word/tag to parse with our own chunker
We have to add location information so the evaluation code can compare which words have been assigned which labels
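
In current NLTK the location bookkeeping is handled internally, and a chunker can be scored against gold-standard chunked sentences with its evaluate method; a sketch using the chunked Treebank sample distributed with NLTK (the NP-only grammar is an assumption):

```python
import nltk
from nltk.corpus import treebank_chunk

# Our own (simple) NP chunker to evaluate.
chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}")

# Gold-standard NP chunks from the Treebank sample.
gold = treebank_chunk.chunked_sents()[:100]

# Prints precision, recall, and F-measure against the gold standard.
print(chunker.evaluate(gold))
```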

27 How to get better accuracy?
Use a full syntactic parser
–These days the probabilistic ones work surprisingly well, and they are getting faster too
–Prof. Dan Klein's is very good and easy to run
–http://nlp.stanford.edu/downloads/lex-parser.shtml

28-31 [Slides contained only images; no transcript text]

32 Next Week
Shallow Parsing Assignment due on Wed Sept 29
Next week:
–Read the paper on end-of-sentence disambiguation
–Presley and Barbara will lecture on categorization
–We will read the categorization tutorial the following week

