
Regular expressions Day 2 LING 681.02 Computational Linguistics Harry Howard Tulane University.




1 Regular expressions Day 2 LING 681.02 Computational Linguistics Harry Howard Tulane University

2 Course organization (24-Aug-2009, LING 681.02, Prof. Howard, Tulane University)

3 Regular expressions SLP 2.1

4 Questions
- What is a string?
  - A sequence of symbols.
  - In text, a sequence of alphanumeric characters.
- What is a regular expression (RE or regex)?
  - A language for specifying text search strings, requiring a pattern to search for and a corpus to search through.
- What is an algebra?
  - A set of elements and a group of operations defined for them,
  - e.g. the set of real numbers and the operations +, –, *, and /.
- What is a false positive?
  - A string that is incorrectly matched; it decreases accuracy.
- What is a false negative?
  - A string that is incorrectly excluded; it decreases coverage.
- What is precedence?
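The difference between false positives and false negatives can be sketched with Python's re module. The sample sentence and the variable names below are made up for illustration; the three patterns are just increasingly careful attempts to find the word "the".

```python
import re

# Hypothetical sample text, chosen so that "the" also occurs inside longer words.
text = "The other day the weather there was fine"

# Misses the capitalized "The" (a false negative) and also
# matches inside "other", "weather" and "there" (false positives).
narrow = re.findall(r"the", text)

# Covers both cases of the first letter, but still matches inside longer words.
loose = re.findall(r"[tT]he", text)

# Word boundaries remove the embedded matches.
strict = re.findall(r"\b[tT]he\b", text)

print(narrow, loose, strict)
```

Only the last pattern returns exactly the two standalone occurrences of the word.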

5 Notation in Perl
*      0 or more occurrences of the previous character or RE
+      1 or more occurrences of the previous character or RE
-      the two ends of a range (inside [ ])
^      not (negation, inside [ ]) or beginning of line; "caret"
?      the previous character is optional
.      any character
|      either … or; "pipe"
()     grouping, or put in a register
{n}    n occurrences of the previous character or RE
\b     word boundary
\w     word character (letter, digit, or underscore)
\s     white space
$      end of line
\1     refers back to the text matched in register 1
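Most of this notation carries over unchanged to Python's re module. The following is a small sketch, with made-up test strings, that checks one example per operator; each entry records whether the pattern is expected to be found in the string.

```python
import re

# (pattern, test string, expected to match?) -- all strings are illustrative.
examples = [
    (r"^ab*$", "abbb", True),        # * : zero or more b
    (r"^ab+$", "a", False),          # + : at least one b is required
    (r"^[a-c]+$", "cab", True),      # - : range inside brackets
    (r"^[^a-c]+$", "xyz", True),     # ^ : negation inside brackets
    (r"^colou?r$", "color", True),   # ? : the u is optional
    (r"^b.g$", "bug", True),         # . : any character
    (r"^(cat|dog)$", "dog", True),   # | : either ... or, grouped by ()
    (r"^a{3}$", "aa", False),        # {n} : exactly three a's
    (r"\bthe\b", "other", False),    # \b : "the" is not a whole word here
    (r"\s", "no_space", False),      # \s : no white space in the string
    (r"(\w+) \1", "the the", True),  # \1 : repeats what group 1 matched
]

results = [bool(re.search(pattern, s)) for pattern, s, _ in examples]
print(results)
```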

6 Exercise 2.1: REs
1. The set of all alphabetic strings.
   [a-zA-Z][a-zA-Z]*
   [a-zA-Z]+
2. The set of all lower case alphabetic strings ending in a b.
   [a-z]*b
3. The set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and “the the” but not “the bug” or “the big bug”).
   ([a-zA-Z]+)\s+\1
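These three answers can be tried out in Python. This is a sketch only: the variable names are invented, and the last pattern, with \b added on both sides, is a suggested variant rather than the answer given on the slide. It is included because the slide's pattern for item 3 also fires when the repeated letters are only part of a longer word.

```python
import re

alphabetic = re.compile(r"[a-zA-Z]+")           # item 1
ends_in_b = re.compile(r"[a-z]*b")              # item 2
repeated = re.compile(r"([a-zA-Z]+)\s+\1")      # item 3, as on the slide

# Assumed variant: word boundaries force both copies to be whole words.
repeated_bounded = re.compile(r"\b([a-zA-Z]+)\s+\1\b")

# fullmatch tests the whole string, which is what "the set of all strings" asks for.
print(bool(alphabetic.fullmatch("Humbert")))
print(bool(ends_in_b.fullmatch("crab")))

# search looks for the repeated pair anywhere in the string.
print(bool(repeated.search("Humbert Humbert")))
print(bool(repeated.search("the then")))          # matches "the the" inside "the then"
print(bool(repeated_bounded.search("the then")))  # no match with boundaries
```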

7 Exercise 2.1: REs, cont.
4. The set of all strings from the alphabet a, b such that each a is immediately preceded by and immediately followed by a b.
   (b+(ab+)+)?
5. All strings that start at the beginning of the line with an integer and that end at the end of the line with a word.
   ^\d+\b.*\b[a-zA-Z]+$
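A quick check of items 4 and 5 in Python, as a sketch with invented names and test strings. One observation worth hedging: the slide's pattern for item 4 requires at least one a in any non-empty string, so it rejects strings made only of b's, which arguably satisfy the condition vacuously. The variant with * instead of the final + is an assumption about the intended set, not the slide's answer.

```python
import re

item4 = re.compile(r"(b+(ab+)+)?")          # as on the slide
item4_variant = re.compile(r"(b+(ab+)*)?")  # assumed variant: also accepts "b", "bb", ...
item5 = re.compile(r"^\d+\b.*\b[a-zA-Z]+$")

for s in ["", "bab", "babab", "ab", "baab", "bb"]:
    print(repr(s), bool(item4.fullmatch(s)), bool(item4_variant.fullmatch(s)))

for s in ["42 is the answer", "42", "x 42 answer", "3 apples."]:
    print(repr(s), bool(item5.search(s)))
```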

8 Exercise 2.1: REs, cont.
6. All strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto).
   \bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b
7. Write a pattern that places the first word of an English sentence in a register. Deal with punctuation.
   ^[^a-zA-Z]*([a-zA-Z]+)
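Items 6 and 7 can be exercised the same way. In this sketch the sentences and names are made up; in Python the "register" of item 7 is capture group 1, read with group(1).

```python
import re

both_words = re.compile(r"\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b")
first_word = re.compile(r"^[^a-zA-Z]*([a-zA-Z]+)")

# Either order of the two words is accepted; "grottos" does not count as "grotto".
print(bool(both_words.search("the raven flew into the grotto")))
print(bool(both_words.search("grottos and a raven")))

# Leading quotes, spaces and dots are skipped before the first word is captured.
print(first_word.search('"Hello," she said.').group(1))
print(first_word.search("  ...and then").group(1))
```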

9 Exercise 2.2
import re
import string

# (pattern, replacement) pairs
patterns = [
    (r"\b(i'm|i am)\b", "YOU ARE"),
    (r"\b(i|me)\b", "YOU"),
    (r"\b(my)\b", "YOUR"),
    (r"\b(well,?) ", ""),
    (r".* YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* YOU ARE (depressed|sad).*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all.*", "IN WHAT WAY"),
    (r".* always.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r"[%s]" % re.escape(string.punctuation), ""),
]
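The slide lists only the patterns, not how they are applied. One plausible driver, offered as an assumption rather than the course's own code, lowercases the input and runs every substitution in order with re.sub; the function name respond and the test sentences are hypothetical.

```python
import re
import string

patterns = [
    (r"\b(i'm|i am)\b", "YOU ARE"),
    (r"\b(i|me)\b", "YOU"),
    (r"\b(my)\b", "YOUR"),
    (r"\b(well,?) ", ""),
    (r".* YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* YOU ARE (depressed|sad).*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all.*", "IN WHAT WAY"),
    (r".* always.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r"[%s]" % re.escape(string.punctuation), ""),
]

def respond(utterance):
    """Apply every (pattern, replacement) pair, in order, to the lowercased input.

    Assumed driver: the slide does not say how the list is meant to be used.
    """
    text = utterance.lower()
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text

print(respond("They all ignore me"))
print(respond("My boss always yells"))
print(respond("Today I am sad"))
```

Under this reading the order of the list matters. The two depressed|sad rules share one pattern, and the second also matches the output of the first, so only the later response survives. Both rules also need a space before YOU ARE, so an input that begins with "I am" is not caught by either.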

10 NLPP

11 REs in Python
- The re module provides Perl-style regular expression patterns; see http://www.amk.ca/python/howto/regex/
- NLPP goes into REs in §3.4, p. 97ff.
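For orientation, here is a minimal sketch of the re functions most often needed in exercises like the ones above: match, search, findall and sub. The sentence is invented for the example.

```python
import re

# Hypothetical sample sentence.
sentence = "Regular expressions, day 2: strings and patterns."

# match only succeeds at the very start of the string; search scans all of it.
at_start = re.match(r"day", sentence)
anywhere = re.search(r"day", sentence)

# findall returns every non-overlapping match as a list of strings.
plurals = re.findall(r"\b[a-zA-Z]+s\b", sentence)

# sub replaces every match and returns a new string.
spelled_out = re.sub(r"\d", "two", sentence)

print(at_start, anywhere.start(), plurals, spelled_out)
```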

12 Next time
- SLP Automata: §2.2-end & Ex. 2.3-end
- NLPP: finish §1, do as many of the exercises as you can

