Regular expressions Day 2

LING Computational Linguistics Harry Howard Tulane University

LING 681.02, Prof. Howard, Tulane University
Course organization
24-Aug-2009, LING 681.02, Prof. Howard, Tulane University

Regular expressions SLP 2.1

Questions
What is a string?
  A sequence of symbols. In text, a sequence of alphanumeric characters.
What is a regular expression (RE or regex)?
  A language for specifying text search strings. A search requires a pattern to search for and a corpus to search through.
What is an algebra?
  A set of elements and a group of operations defined for them, e.g. the set of real numbers and the operations +, –, *, and /.
What is a false positive?
  A string that is incorrectly matched; it decreases accuracy.
What is a false negative?
  A string that is incorrectly excluded; it decreases coverage.
What is precedence?
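The difference between false positives and false negatives can be made concrete with a small Python sketch. The toy corpus and its labels below are invented for illustration (they are not from the slides), and the search target is the word "the".

```python
import re

# Hypothetical toy corpus: each text is paired with whether it really
# contains the word "the" (these labels are an assumption for the example).
corpus = [
    ("the cat sat", True),
    ("The dog barked", True),
    ("other people", False),
    ("bathe daily", False),
    ("no article here", False),
]

def evaluate(pattern):
    """Return (false_positives, false_negatives) for a pattern over the corpus."""
    false_pos = [text for text, wanted in corpus
                 if re.search(pattern, text) and not wanted]
    false_neg = [text for text, wanted in corpus
                 if not re.search(pattern, text) and wanted]
    return false_pos, false_neg

# The bare pattern matches inside "other" and "bathe" and misses "The".
naive = evaluate(r"the")

# Word boundaries remove the false positives; [tT] removes the false negative.
refined = evaluate(r"\b[tT]he\b")
```

Here `naive` holds two false positives and one false negative, while `refined` holds none on this corpus.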

Notation in Perl
  *     0 or more occurrences of the previous character or RE
  +     1 or more occurrences of the previous character or RE
  -     the two ends of a range (inside brackets)
  ^     not (negation, inside brackets) or beginning of line; "caret"
  ?     the previous character is optional
  .     any character
  |     either … or; "pipe"
  ()    grouping, or put in a register
  {n}   n occurrences of the previous character or RE
  \b    word boundary
  \s    white space
  $     end of line
  \1    the string matched by the RE in register 1
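As a rough illustration of the notation, the Python sketch below pairs each operator with one string that should match and one that should not. The example strings are made up for this purpose, and Python's re module is assumed to behave like Perl for these basic operators.

```python
import re

# (pattern, string that should match, string that should not match)
examples = [
    (r"ab*c",          "abbbc",   "adc"),      # *   zero or more b
    (r"ab+c",          "abc",     "ac"),       # +   one or more b
    (r"[a-c]x",        "bx",      "dx"),       # -   range inside brackets
    (r"[^a-c]x",       "dx",      "bx"),       # ^   negation inside brackets
    (r"^cat",          "cat nap", "a cat"),    # ^   beginning of line
    (r"colou?r",       "color",   "colouur"),  # ?   previous character optional
    (r"c.t",           "cut",     "ct"),       # .   any character
    (r"cat|dog",       "dog",     "cow"),      # |   either ... or
    (r"(ab){2}",       "abab",    "ab"),       # ()  grouping, {n} repetition
    (r"\bcat\b",       "a cat!",  "concat"),   # \b  word boundary
    (r"a\sb",          "a b",     "a_b"),      # \s  white space
    (r"end$",          "the end", "endless"),  # $   end of line
    (r"([a-z]+) \1",   "the the", "the bug"),  # \1  contents of register 1
]

# For each row: does the pattern find the first string, and the second?
results = [(bool(re.search(p, yes)), bool(re.search(p, no)))
           for p, yes, no in examples]
```

Every entry of `results` should come out as `(True, False)`.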

Exercise 2.1: REs
The set of all alphabetic strings.
  [a-zA-Z][a-zA-Z]*
  [a-zA-Z]+
The set of all lower case alphabetic strings ending in a b.
  [a-z]*b
The set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and “the the” but not “the bug” or “the big bug”).
  ([a-zA-Z]+)\s+\1
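These answers can be checked in Python. The sketch below uses `fullmatch` for the two "set of all strings" answers and `search` for the repeated-word answer. The `repeated_strict` variant with word boundaries is an addition for comparison, not part of the slide answer; it is there because the slide pattern also fires on a string like "the theme".

```python
import re

alphabetic = re.compile(r"[a-zA-Z]+")
ends_in_b = re.compile(r"[a-z]*b")
repeated = re.compile(r"([a-zA-Z]+)\s+\1")

# Assumption (not from the slides): anchoring both copies at word boundaries
# so that a partial repetition such as "the theme" is rejected.
repeated_strict = re.compile(r"\b([a-zA-Z]+)\s+\1\b")

checks = {
    "alphabetic accepts 'Howard'": bool(alphabetic.fullmatch("Howard")),
    "alphabetic accepts 'abc1'": bool(alphabetic.fullmatch("abc1")),
    "ends_in_b accepts 'crab'": bool(ends_in_b.fullmatch("crab")),
    "ends_in_b accepts 'crabs'": bool(ends_in_b.fullmatch("crabs")),
    "repeated finds 'Humbert Humbert'": bool(repeated.search("Humbert Humbert")),
    "repeated finds 'the big bug'": bool(repeated.search("the big bug")),
    "repeated finds 'the theme'": bool(repeated.search("the theme")),
    "repeated_strict finds 'the theme'": bool(repeated_strict.search("the theme")),
}

for description, outcome in checks.items():
    print(f"{description}: {outcome}")
```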

Exercise 2.1: REs, cont.
The set of all strings from the alphabet a, b such that each a is immediately preceded by and immediately followed by a b.
  (b+(ab+)+)?
All strings that start at the beginning of the line with an integer and that end at the end of the line with a word.
  ^\d+\b.*\b[a-zA-Z]+$
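One way to test the a/b answer is to compare it against a brute-force check over all short strings. The sketch below does that in Python. The helper `satisfies` and the variant `(b+(ab+)*)?` are illustrative additions, not from the slides. The variant is included because strings with no a at all, such as "bb", satisfy the condition vacuously, and the slide answer does not accept them.

```python
import re
from itertools import product

slide_re = re.compile(r"(b+(ab+)+)?")

# Assumption (not from the slides): * in place of the inner +, so that
# strings made only of b's are accepted as well.
variant_re = re.compile(r"(b+(ab+)*)?")

def satisfies(s):
    """True if every 'a' in s has a 'b' immediately before and after it."""
    return all(0 < i < len(s) - 1 and s[i - 1] == "b" and s[i + 1] == "b"
               for i, ch in enumerate(s) if ch == "a")

# All strings over {a, b} of length 0 to 5.
strings = ["".join(p) for n in range(6) for p in product("ab", repeat=n)]

# Strings that meet the condition but that the slide answer rejects.
slide_misses = [s for s in strings
                if satisfies(s) and not slide_re.fullmatch(s)]

# Strings on which the variant disagrees with the brute-force check.
variant_misses = [s for s in strings
                  if satisfies(s) != bool(variant_re.fullmatch(s))]

# The integer-to-word answer, checked on a few made-up lines.
line_re = re.compile(r"^\d+\b.*\b[a-zA-Z]+$")
line_hits = [line for line in ["42 is the answer", "42nd street",
                               "answer is 42", "7 dwarfs."]
             if line_re.search(line)]
```

On strings up to length 5, `slide_misses` contains only the all-b strings and `variant_misses` is empty. Whether all-b strings belong in the set depends on how the exercise is read.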

Exercise 2.1: REs, cont.
All strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto).
  \bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b
Write a pattern that places the first word of an English sentence in a register. Deal with punctuation.
  ^[^a-zA-Z]*([a-zA-Z]+)
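Both answers can be tried out in Python. In this sketch the test sentences are invented for illustration, and the register is read back with `group(1)`.

```python
import re

both = re.compile(r"\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b")
first_word = re.compile(r"^[^a-zA-Z]*([a-zA-Z]+)")

# Either order of the two words is accepted.
in_order = bool(both.search("the grotto hid a raven"))
reversed_order = bool(both.search("the raven flew into the grotto"))

# "grottos" merely contains "grotto", so the word boundary rejects it.
plural_only = bool(both.search("grottos shelter the raven"))

# Leading punctuation is skipped before the first word is captured.
quoted = first_word.match('"Hello," she said.').group(1)
dotted = first_word.match("...and then it rained").group(1)
```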

Exercise 2.2 patterns
  (r"\b(i'm|i am)\b", "YOU ARE"),
  (r"\b(i|me)\b", "YOU"),
  (r"\b(my)\b", "YOUR"),
  (r"\b(well,?) ", ""),
  (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
  (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
  (r".* all .*", "IN WHAT WAY"),
  (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
  (r"[%s]" % re.escape(string.punctuation), ""),
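The slide lists only the pattern/replacement pairs, so the driver below is a guess at how they might be applied, not the course's actual program. The function name `respond` is hypothetical. The sketch assumes that the input is lowercased, that it is padded with one space on each side so the patterns written with surrounding spaces can match at the edges, and that the substitutions run once each in the order listed.

```python
import re
import string

patterns = [
    (r"\b(i'm|i am)\b", "YOU ARE"),
    (r"\b(i|me)\b", "YOU"),
    (r"\b(my)\b", "YOUR"),
    (r"\b(well,?) ", ""),
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r"[%s]" % re.escape(string.punctuation), ""),
]

def respond(utterance):
    """Apply every substitution in order (hypothetical driver)."""
    # Assumption: lowercase the input and pad it with spaces so that
    # patterns like ".* all .*" can match at the start or end.
    text = " " + utterance.lower() + " "
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text.strip()

reply_sad = respond("I am sad today.")
reply_always = respond("My boyfriend always says that.")
reply_all = respond("Men are all alike.")
```

Under these assumptions the second depressed|sad rule never fires, because the first one has already rewritten the string. A fuller program would presumably choose between the two responses in some other way.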

NLPP

REs in Python
The re module provides Perl-type regular expression patterns.
NLPP goes into REs in §3.4, p. 97ff.
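A brief sketch of the re module's main entry points follows: compiling a pattern, searching, finding all matches, and substituting with registers. The sample sentence is made up for the example.

```python
import re

# Hypothetical sample text for illustration.
text = "LING 681.02 met on 24-Aug-2009 at Tulane University"

# compile() builds a reusable pattern object; the parentheses fill
# registers 1 to 3 with day, month and year.
date = re.compile(r"(\d{1,2})-([A-Z][a-z]{2})-(\d{4})")

# search() returns the first match anywhere in the string, or None.
match = date.search(text)
day, month, year = match.groups()

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)

# sub() rewrites each match; \1, \2, \3 refer back to the registers.
reordered = date.sub(r"\3 \2 \1", text)
```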

Next time SLP Automata: §2.2-end & Ex. 2.3-end
NLPP: finish §1, do as many of the exercises as you can