1 Basic Text Processing Regular Expressions

2 Dan Jurafsky
The original slides are from: http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
Some changes have been made to these slides to fit our NLP course.

3 Regular expressions
A formal language for specifying text strings.
How can we search for any of these?
  woodchuck
  woodchucks
  Woodchuck
  Woodchucks

4 Regular Expressions: Disjunctions
Letters inside square brackets [] match any one of the listed characters.
  Pattern         Matches
  [wW]oodchuck    Woodchuck, woodchuck
  [1234567890]    Any digit
Ranges such as [A-Z] abbreviate a run of consecutive characters.
  Pattern   Matches                 Example (first match)
  [A-Z]     An upper case letter    "D" in "Drenched Blossoms"
  [a-z]     A lower case letter     "m" in "my beans were impatient"
  [0-9]     A single digit          "1" in "Chapter 1: Down the Rabbit Hole"
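The bracket patterns on this slide can be tried directly in code. The following is a minimal sketch using Python's standard `re` module (the choice of Python and the variable names are assumptions for illustration; the slides themselves are tool-neutral).

```python
import re

# [wW] matches either letter, so both capitalizations are found.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")
print(woodchucks)

# re.search returns only the first match, which is what the slide highlights.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()
print(first_upper, first_lower, first_digit)
```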

5 Regular Expressions: Negation in Disjunction
Negations: [^Ss]
The caret ^ means negation only when it is the first character inside square brackets [].
  Pattern   Matches                    Example (first match)
  [^A-Z]    Not an upper case letter   "y" in "Oyfn pripetchik"
  [^Ss]     Neither 'S' nor 's'        "I" in "I have no exquisite reaSon"
  [^e^]     Neither e nor ^            "L" in "Look here ^"
  a^b       The pattern a caret b      "a^b" in "Look up a^b now"
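A short sketch of negated brackets in Python's `re` module follows (Python is an assumption here, not the tool the slides prescribe). One caveat worth checking in whatever engine is used: in Python, a caret outside brackets is always a start-of-line anchor, so the `a^b` pattern from the slide has to be written with an escaped caret to match the literal text.

```python
import re

# A leading ^ inside brackets negates the set.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
not_s = re.search(r"[^Ss]", "I have no exquisite reaSon").group()

# A ^ that is not first inside the brackets is an ordinary character.
not_e_or_caret = re.findall(r"[^e^]", "e^x")

# In Python, an unescaped ^ outside brackets is an anchor, so a^b never
# matches here; escaping the caret matches the literal three characters.
unescaped = re.search(r"a^b", "Look up a^b now")
escaped = re.search(r"a\^b", "Look up a^b now").group()

print(not_upper, not_s, not_e_or_caret, unescaped, escaped)
```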

6 Regular Expressions: More Disjunction
Woodchuck is another name for groundhog!
The pipe | is used for disjunction.
  Pattern                      Matches
  groundhog|woodchuck          groundhog, woodchuck
  yours|mine                   yours, mine
  a|b|c                        = [abc]
  [gG]roundhog|[Ww]oodchuck    groundhog, Groundhog, woodchuck, Woodchuck
(Photo: D. Fletcher)
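The pipe patterns above can be checked with a small sketch in Python's `re` module (the sample sentence and variable names are made up for illustration).

```python
import re

# Hypothetical sample sentence, not from the slides.
text = "The groundhog, also called a Woodchuck, stars in Groundhog Day."

# Without bracketed alternatives, only the lower-case spellings match.
lower_only = re.findall(r"groundhog|woodchuck", text)

# Combining brackets with the pipe covers both capitalizations of both words.
any_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

# For single characters, a|b|c finds the same matches as [abc].
pipe_chars = re.findall(r"a|b|c", "cabbage")
bracket_chars = re.findall(r"[abc]", "cabbage")

print(lower_only, any_case, pipe_chars == bracket_chars)
```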

7 Regular Expressions: ? * + .
A period (.) by itself means any character, but backslash-period (\.) means a literal period.
  Pattern   Meaning                       Matches
  colou?r   Optional previous char        color, colour
  oo*h!     0 or more of previous char    oh! ooh! oooh! ooooh!
  o+h!      1 or more of previous char    oh! ooh! oooh! ooooh!
  baa+                                    baa, baaa, baaaa, baaaaa
  beg.n                                   begin, begun, beg3n
The * and + operators are called Kleene * and Kleene +, after Stephen C. Kleene.
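As a quick check of the four operators, here is a minimal sketch in Python's `re` module. The test strings are illustrative additions and include a few near misses to show what each pattern rejects.

```python
import re

# ? makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color colour colouur")

# * allows zero or more of the preceding character, + requires at least one.
# The patterns oo*h! and o+h! accept the same strings.
star = re.findall(r"oo*h!", "h! oh! ooh! oooh!")
plus = re.findall(r"o+h!", "h! oh! ooh! oooh!")
sheep = re.findall(r"baa+", "ba baa baaaa")

# . matches any single character; \. matches only a literal period.
any_char = re.findall(r"beg.n", "begin begun beg3n begn")
literal_dot = re.findall(r"beg\.n", "begin beg.n")

print(optional_u, star, plus, sheep, any_char, literal_dot)
```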

8 Regular Expressions: Anchors ^ $
^ matches the beginning of the line.
$ matches the end of the line.
  Pattern       Example (match)
  ^[A-Z]        "P" in "Palo Alto"
  ^[^A-Za-z]    "1" in "1"; the opening quotation mark in “Hello”
  \.$           "." in "The end."
  .$            "?" in "The end?"; "!" in "The end!"
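The anchor examples can be reproduced with the sketch below, again assuming Python's `re` module. By default Python anchors `^` to the start of the whole string; the `re.MULTILINE` flag makes it match at the start of every line, which is closer to the line-based behavior the slide describes. The three-line sample is made up.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_digit = re.search(r"^[^A-Za-z]", "1").group()
starts_quote = re.search(r"^[^A-Za-z]", '"Hello"').group()

# \.$ needs a literal period at the end; .$ accepts any final character.
ends_period = re.search(r"\.$", "The end.").group()
no_period = re.search(r"\.$", "The end?")
ends_any = [re.search(r".$", s).group() for s in ("The end?", "The end!")]

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
lines = "Palo Alto\nmenlo park\nStanford"
line_initial_caps = re.findall(r"^[A-Z]", lines, re.MULTILINE)

print(starts_upper, starts_digit, starts_quote, ends_period, no_period,
      ends_any, line_initial_caps)
```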

9 Example
Question: Find all instances of the word “the” in a text.
Solutions:
  the
    Problem 1: misses capitalized examples
    Problem 2: incorrectly returns other or theology
  [tT]he
    Problem 2: incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]
    Solves both problems 1 and 2
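The three attempts can be compared on a sample sentence, as in the Python sketch below (the sentence is made up for illustration). It also surfaces a limitation the slide does not mention: the third pattern requires a non-letter on both sides, so it misses a "The" at the very start of the text. The last line uses the word-boundary anchor `\b`, which is not part of these slides and is shown only as one possible alternative.

```python
import re

# Hypothetical sample sentence, not from the slides.
text = "The other day the theology class met; then the bell rang."

# Attempt 1: misses "The" and also fires inside other, theology, then.
attempt1 = re.findall(r"the", text)

# Attempt 2: now finds "The", but still fires inside longer words.
attempt2 = re.findall(r"[tT]he", text)

# Attempt 3: requires a non-letter on each side; the match includes them.
attempt3 = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

# Alternative (not from the slides): \b matches at a word boundary
# without consuming a character, so the text-initial "The" is found.
with_boundary = re.findall(r"\b[tT]he\b", text)

print(len(attempt1), len(attempt2), attempt3, with_boundary)
```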

10 Errors
The process we just went through was based on fixing two kinds of errors:
  Matching strings that we should not have matched (there, then, other): false positives (Type I)
  Not matching things that we should have matched (The): false negatives (Type II)

11 Errors cont.
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two antagonistic efforts:
  Increasing accuracy or precision (minimizing false positives)
  Increasing coverage or recall (minimizing false negatives)
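To make the trade-off concrete, the sketch below scores the three "the" patterns against a hand-labeled answer set, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The sample sentence, the gold offsets, and the helper names `precision_recall` and `word_starts` are all assumptions made for this illustration.

```python
import re

def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of match positions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def word_starts(pattern, text):
    """Start offsets of capture group 1 for every match of pattern."""
    return {m.start(1) for m in re.finditer(pattern, text)}

text = "The other day the theology class met."
gold = {0, 14}  # hand-labeled start offsets of the word "The"/"the"

# Each pattern wraps the word itself in a group so offsets are comparable.
scores = {
    pattern: precision_recall(word_starts(pattern, text), gold)
    for pattern in (r"(the)", r"([tT]he)", r"[^a-zA-Z]([tT]he)[^a-zA-Z]")
}
for pattern, (p, r) in scores.items():
    print(f"{pattern:32} precision={p:.2f} recall={r:.2f}")
```

On this sentence the second pattern raises recall at the cost of false positives, while the third removes the false positives but loses the text-initial "The", which illustrates why the two efforts pull against each other.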

12 Summary
Regular expressions play a surprisingly large role.
  Sophisticated sequences of regular expressions are often the first model for any text processing task.
For many hard tasks, we use machine learning classifiers.
  But regular expressions are used as features in the classifiers.
  They can be very useful in capturing generalizations.

13 Exercises in the Class
1- Use this link for practice: http://regexpal.com
2- Enter the following test text:
   We looked ! Then we saw him step in on the mat. We looked ! And we saw him ! The cat in the Hat !
3- Practice these expressions:
   [Ww]  [em]  [A-Z]  [a-z]  [A-Za-z]  [ !]
   [^Aa]  [^!]  [^A-Za-z]
   looked|step  at|ook
   o+
   [A-Z]$  !$  \.
   the  [tT]he
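For practicing without the web tool, the exercise can also be run locally. The sketch below assumes Python's `re` module and treats the test text as a single line; the helper name `matches` is hypothetical, and results may differ slightly from regexpal, which uses a different regex engine.

```python
import re

test_text = ("We looked ! Then we saw him step in on the mat. "
             "We looked ! And we saw him ! The cat in the Hat !")

def matches(pattern, text=test_text):
    """Return every non-overlapping match of pattern, left to right."""
    return [m.group() for m in re.finditer(pattern, text)]

# Try a few of the exercise patterns; add the others to this tuple.
for pattern in (r"looked|step", r"o+", r"[A-Z]$", r"!$", r"[tT]he"):
    print(pattern, "->", matches(pattern))
```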

