1 I256: Applied Natural Language Processing Marti Hearst Sept 6, 2006.


1 I256: Applied Natural Language Processing
Marti Hearst
Sept 6, 2006

2 Today
Tokenization
Using Regular Expressions

3 Tokenization (slide by Diane Litman)
Tokenization: the task of converting a text from a single string to a list of tokens.
It is harder than it seems:
  – I’ll see you in New York.
  – The aluminum-export ban.
The simplest approach is to use “graphic words” (i.e., separate words using whitespace).
Another approach is to use regular expressions to specify which substrings are valid words.
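A minimal sketch of the two approaches on the slide's own example sentences, using only Python's standard `re` module. The regular expression here is an illustrative assumption, not the pattern used in class: it treats a word with internal apostrophes or hyphens as one token and each remaining punctuation character as its own token.

```python
import re

text = "I'll see you in New York. The aluminum-export ban."

# "Graphic words": split on runs of whitespace.
graphic_words = text.split()

# Regular-expression approach (assumed pattern): a token is a word,
# optionally with internal apostrophes or hyphens, or a single
# punctuation character.
token_pattern = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")
regex_tokens = token_pattern.findall(text)

print(graphic_words)
# ["I'll", 'see', 'you', 'in', 'New', 'York.', 'The', 'aluminum-export', 'ban.']
print(regex_tokens)
# ["I'll", 'see', 'you', 'in', 'New', 'York', '.', 'The', 'aluminum-export', 'ban', '.']
```

Whitespace splitting leaves the period glued to "York." and "ban."; the regular expression separates it. Neither approach knows that "New York" might be better treated as a single unit.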

4 Tokenization Issues (modified from Dorr and Habash, after Jurafsky and Martin)
Sentence Boundaries
  – Include parens around sentences?
  – What about quotation marks around sentences?
  – Periods: end of line or not?
Proper Names
  – What to do about “New York-New Jersey train”?
  – “California Governor Arnold Schwarzenegger”?
Contractions
  – That Fred’s jacket’s pocket.
  – I’m doing what you’re saying “Don’t do!”.

5 Tokens vs. Types (slide by Diane Litman)
The term word can be used in two different ways:
  1. To refer to an individual occurrence of a word
  2. To refer to an abstract vocabulary item
  – For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
To avoid confusion use more precise terminology:
  1. Word Token: an occurrence of a word
  2. Word Type: a vocabulary item
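The token/type distinction can be checked directly on the slide's example sentence. This sketch assumes whitespace tokenization, which is enough for this sentence.

```python
sentence = "my dog likes his dog"

tokens = sentence.split()   # every occurrence counts
types = set(tokens)         # each distinct vocabulary item counts once

print(len(tokens))  # 5 word tokens
print(len(types))   # 4 word types ("dog" occurs twice)
```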

6 Example: Recognizing Dates
Years:
Date ranges:
  1 JANUARY 1995
  JULY 11-15, 1995
  20-23 May 1995
  February 15, 1995
  September 21-23, 1995
  May 15th, ‘95
  Thursday, 23rd March, 1995
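One possible way to recognize the date strings listed above is a single regular expression built from small named pieces. This is a sketch under assumptions, not the solution presented in class: the names `MONTH`, `DAY`, `YEAR`, `WEEKDAY`, `DATE`, and `find_dates` are invented here, only full English month names are accepted, and nothing checks that a day number is valid for its month.

```python
import re

MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DAY = r"\d{1,2}(?:st|nd|rd|th)?"          # 1, 15th, 23rd
DAY_OR_RANGE = rf"{DAY}(?:-{DAY})?"        # 15 or 11-15
YEAR = r"(?:\d{4}|['\u2018\u2019]\d{2})"   # 1995 or '95 (straight or curly quote)
WEEKDAY = r"(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day,\s+)?"

# Either "day month" or "month day", an optional comma, then the year.
# IGNORECASE lets the same pattern accept "JULY" and "July".
DATE = re.compile(
    rf"{WEEKDAY}(?:{DAY_OR_RANGE}\s+{MONTH}|{MONTH}\s+{DAY_OR_RANGE}),?\s+{YEAR}",
    re.IGNORECASE,
)

def find_dates(text):
    """Return every substring of text that the DATE pattern matches."""
    return [match.group(0) for match in DATE.finditer(text)]

examples = [
    "1 JANUARY 1995",
    "JULY 11-15, 1995",
    "20-23 May 1995",
    "February 15, 1995",
    "September 21-23, 1995",
    "May 15th, '95",
    "Thursday, 23rd March, 1995",
]
for example in examples:
    print(example, "->", bool(DATE.fullmatch(example)))
```

Building the pattern from named parts keeps each piece readable and makes it easier to add cases such as abbreviated month names later.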

7 Regular Expressions
Very powerful in Python
Easy to get started, then add on more.
In-class example: pig latin!
  If word starts with consonant(s)
    – Move them to the end, append “ay”
  Else word starts with vowel(s)
    – Keep as is, but add “zay”
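The two pig latin rules above can be sketched with one regular expression. The in-class code is not part of this transcript, so the function name `pig_latin` and the choice to treat "y" as a consonant are assumptions made for illustration.

```python
import re

# One or more leading consonants (everything except a, e, i, o, u),
# followed by the rest of the word.
LEADING_CONSONANTS = re.compile(r"([b-df-hj-np-tv-z]+)(.*)", re.IGNORECASE)

def pig_latin(word):
    """Apply the slide's two rules to a single word."""
    match = LEADING_CONSONANTS.match(word)
    if match:
        consonants, rest = match.groups()
        # Starts with consonant(s): move them to the end, append "ay".
        return rest + consonants + "ay"
    # Starts with a vowel: keep as is, but add "zay".
    return word + "zay"

print(pig_latin("pig"))     # igpay
print(pig_latin("string"))  # ingstray
print(pig_latin("apple"))   # applezay
```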

8 nltk_lite.tokenize

9 Regexes for Tokenizing
Some simple ones:
  hyphen = r'(\w+\-\s?\w+)'
    » Allows for a space after the hyphen
  apostrophe = r'(\w+\'\w+)'
  numbers = r'((\$|#)?\d+(\.)?\d+%?)'
    » Needs to handle large numbers with commas
  punct = r'([^\w\s]+)'
  wordr = r'(\w+)'
A nice Python trick:
  r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])
    – Makes one string in which a "|" goes in between each substring
Now run it:
  pattern = re.compile(r)
  sentence = "That art-deco poster costs $23.40."
  print str(list(tokenize.regexp(sentence, pattern)))
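The slide's code targets `nltk_lite` and Python 2. The following sketch reproduces the same idea with only the standard `re` module in Python 3. The helper `regexp_tokenize` is a stand-in written for this example, not the `nltk_lite` function; the `url` pattern is omitted because the slide does not show it; and `numbers_with_commas` is an assumed extension addressing the slide's note about commas.

```python
import re

# Patterns copied from the slide.
hyphen = r'(\w+\-\s?\w+)'
apostrophe = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
punct = r'([^\w\s]+)'
wordr = r'(\w+)'

# Alternatives are tried left to right, so the more specific patterns
# come before the generic word and punctuation patterns.
pattern = re.compile("|".join([hyphen, apostrophe, numbers, wordr, punct]))

def regexp_tokenize(text, compiled_pattern):
    """Return every non-overlapping match of the pattern, left to right."""
    # group(0) is the whole match; re.findall would return tuples here
    # because the slide's patterns contain capturing groups.
    return [match.group(0) for match in compiled_pattern.finditer(text)]

sentence = "That art-deco poster costs $23.40."
tokens = regexp_tokenize(sentence, pattern)
print(tokens)
# ['That', 'art-deco', 'poster', 'costs', '$23.40', '.']

# Assumed extension (not from the slide): allow comma-separated thousands.
numbers_with_commas = r'(?:\$|#)?\d+(?:,\d{3})*(?:\.\d+)?%?'
big_number = re.fullmatch(numbers_with_commas, "$1,234,567.89")
print(bool(big_number))  # True
```

The order of the joined list matters: if `wordr` came first, "art-deco" would be split into "art", "-", and "deco".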

10 Tokenization Exercises
The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively
Write an end-of-sentence recognizer. Test it on some text that I am going to give to you.
Date ranges:
  1 JANUARY 1995
  JULY 11-15, 1995
  20-23 May 1995
  February 15, 1995
  September 21-23, 1995
  May 15th, 1995
  Thursday, 23rd March, 1995

11 Auto-Puzzler
Will Shortz-style puzzles
Name a make of car containing the letter "N." Rearrange the letters to get a new word starting with "N" that names something you might put a car in. What is it?
How might we implement this?
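One way the puzzle might be attacked is an anagram lookup: two words are anagrams exactly when their sorted letters are equal. The sketch below assumes we already have a list of car makes and a vocabulary list; both lists here are tiny made-up samples, and the function names are invented for this example. Deciding whether a candidate "names something you might put a car in" is left to a human reader.

```python
def letter_signature(word):
    """Sorted lowercase letters; equal signatures mean the words are anagrams."""
    return "".join(sorted(word.lower()))

def solve_puzzle(car_makes, vocabulary, letter="n"):
    """Pair each car make containing `letter` with its anagrams starting with `letter`."""
    # Index the vocabulary words that start with the letter by signature.
    by_signature = {}
    for word in vocabulary:
        word = word.lower()
        if word.startswith(letter):
            by_signature.setdefault(letter_signature(word), []).append(word)

    solutions = []
    for make in car_makes:
        if letter not in make.lower():
            continue
        for word in by_signature.get(letter_signature(make), []):
            if word != make.lower():
                solutions.append((make, word))
    return solutions

# Hypothetical sample data; a real run would use full word lists.
car_makes = ["Toyota", "Renault", "Nissan", "Honda"]
vocabulary = ["garage", "neutral", "nation", "north"]

print(solve_puzzle(car_makes, vocabulary))
# [('Renault', 'neutral')]
```

Indexing the vocabulary by signature means each car make needs only one dictionary lookup instead of a comparison against every word.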

12 Assignment
Due in one week
Write a tokenizer and sentence boundary recognizer.
It will be helpful to use a list of abbreviations for distinguishing abbreviations from end-of-sentence markers.
Do the best you can in the time we have; you won’t get everything right.
Don’t spend more than about 6 hours on it.
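To illustrate how an abbreviation list helps, here is a deliberately small sketch, not a reference solution. It assumes a candidate boundary is ".", "!" or "?" followed by whitespace or the end of the text, and it rejects the candidate when the word ending there appears in a short, hypothetical abbreviation set. It will still get many real cases wrong, for example an abbreviation that genuinely ends a sentence.

```python
import re

# Hypothetical, far-from-complete abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: ., ! or ? followed by whitespace or end of text.
# A period inside a number such as 4.53 is never a candidate.
CANDIDATE = re.compile(r"[.!?](?=\s|$)")

def split_sentences(text):
    """Split text at candidate boundaries that do not end an abbreviation."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

text = "Dr. Smith saw the results, e.g. 4.53 and 8.16. They were high!"
for sentence in split_sentences(text):
    print(sentence)
# Dr. Smith saw the results, e.g. 4.53 and 8.16.
# They were high!
```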

13 Next Time
Elementary Morphology
Stemming

