# Python: Regular Expressions

Patterns (Regular Expressions)

Patterns are a very useful technique for processing textual data.

- A pattern defines a set of strings.
- The fundamental operation is set membership: given a string S, we can ask whether S is a member of the set defined by some pattern P.

Case Study: Hugs and Kisses

A fixed pattern is one with no variability. A hugs-and-kisses pattern is an example.

- The hugs-and-kisses pattern: XOXO
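The set-membership question can be asked directly in Python. The sketch below assumes the standard `re` module and uses `re.fullmatch`, which succeeds only when the whole string is in the set; the helper name `is_member` is made up for illustration.

```python
import re

def is_member(pattern, s):
    """Return True if the whole string s is in the set defined by pattern."""
    return re.fullmatch(pattern, s) is not None

print(is_member("XOXO", "XOXO"))    # True: the fixed pattern contains exactly this string
print(is_member("XOXO", "XOXOXO"))  # False: a fixed pattern has no variability
print(is_member("XOXO", "XOX"))     # False
```

A fixed pattern defines a set with exactly one member, so only the string XOXO itself passes.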

Case Study: MPAA Ratings

There are 5 MPAA ratings:

- G
- PG
- PG-13
- R
- NC-17

The MPAA rating pattern:

- G|PG|PG-13|R|NC-17
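A minimal sketch of testing strings against the rating pattern, assuming Python's `re` module. The variable name `MPAA` and the two non-ratings in the list are illustrative choices, not part of the slides.

```python
import re

MPAA = r"G|PG|PG-13|R|NC-17"

# fullmatch asks whether the entire string is one of the five ratings.
for rating in ["G", "PG", "PG-13", "R", "NC-17", "X", "PG-14"]:
    print(rating, re.fullmatch(MPAA, rating) is not None)

# re.match only anchors at the start and takes the first alternative that fits,
# so for "PG-13" it stops after "PG".
print(re.match(MPAA, "PG-13").group(0))  # PG
```

When only a prefix match is wanted, listing longer alternatives first (PG-13 before PG) avoids the early stop shown in the last line.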

Case Study: SSN

A social security number can be understood as any 3 digits, followed by a dash (-), followed by any 2 digits, followed by a dash, followed by any 4 digits.

The language of regular expressions

Here is an inductive definition of the syntax of the basic elements of a regular expression.

- Any single character is a regular expression.
- If A and B are both regular expressions, then so are
  - AB : A followed by B (concatenation)
  - A|B : A or B (alternation); the vertical bar is special
  - (A) : a group; the parentheses are special

Examples

Do the following match the regex '(c|h)a?rt*'?

- hart
- cat
- car
- chart
- chaarrtt
- hrtttt

Do the following match the regex '(x|y)*'?

- xx
- xy
- xxyxyyx
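One way to check answers to these exercises is to let Python decide. This is a sketch that assumes "match" means the whole string belongs to the pattern's set, so it uses `re.fullmatch`; the helper name `in_set` is hypothetical. Here `?` means "zero or one" and `*` means "zero or more" of the preceding element.

```python
import re

def in_set(pattern, s):
    """True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

first = r"(c|h)a?rt*"
for s in ["hart", "cat", "car", "chart", "chaarrtt", "hrtttt"]:
    print(s, in_set(first, s))
# hart True, cat False, car True, chart False, chaarrtt False, hrtttt True

second = r"(x|y)*"
for s in ["xx", "xy", "xxyxyyx"]:
    print(s, in_set(second, s))
# all three are True
```

The string chart fails because (c|h) consumes only one character, so the h after c cannot be matched by a? or r.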

Repetition

Patterns often include a notion of repetition. Notations are introduced to control repetition.
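As a hedged illustration, the repetition notations below are the ones Python's `re` module accepts (`*`, `+`, `?`, `{n}`, `{m,n}`); the sample strings are chosen only for this sketch.

```python
import re

# (pattern, strings inside the set, strings outside the set)
cases = [
    (r"a*",     ["", "a", "aaa"],        ["b"]),           # zero or more
    (r"a+",     ["a", "aaa"],            [""]),            # one or more
    (r"a?",     ["", "a"],               ["aa"]),          # zero or one
    (r"a{3}",   ["aaa"],                 ["aa", "aaaa"]),  # exactly three
    (r"a{2,4}", ["aa", "aaa", "aaaa"],   ["a", "aaaaa"]),  # between two and four
]

for pattern, inside, outside in cases:
    accepted = [s for s in inside + outside if re.fullmatch(pattern, s)]
    print(pattern, "accepts", accepted)
```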

Case Study: SSN

We can simplify our SSN pattern.
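A sketch of what the simplification might look like. It assumes "any digit" is spelled with alternation only (the name `DIGIT` and both pattern variables are hypothetical), and compares a version without repetition to one that uses `{n}`.

```python
import re

# Assumed spelling of "any digit" using only single characters, | and ().
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Without repetition: the digit group is written out once per position.
ssn_long = DIGIT * 3 + "-" + DIGIT * 2 + "-" + DIGIT * 4

# With repetition: state how many digits each part has.
ssn_short = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"

for candidate in ["123-45-6789", "123-456-789", "12-345-6789"]:
    long_ok = re.fullmatch(ssn_long, candidate) is not None
    short_ok = re.fullmatch(ssn_short, candidate) is not None
    print(candidate, long_ok, short_ok)
```

Both patterns define the same set of strings; the second is simply shorter to write and read.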

Case Study: Hugs and Kisses Two

Consider a hugs and kisses pattern that includes any string composed of pairs of XO’s:

- XO
- XOXO
- XOXOXO

Write the following patterns:

- A binary string that is odd
- A binary string that contains at least 3 consecutive 1's
- A binary string that contains no more than 3 consecutive 1's
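One possible set of answers, offered as a sketch rather than the intended solutions. It assumes each pattern must match the whole string, and the variable names are made up for illustration.

```python
import re

hugs = r"(XO)+"                       # one or more XO pairs
odd = r"(0|1)*1"                      # an odd binary number ends in 1
at_least_three = r"(0|1)*111(0|1)*"   # 111 appears somewhere
at_most_three = r"(1{0,3}0)*1{0,3}"   # every run of 1's has length 0 to 3

checks = [
    (hugs, "XOXOXO"), (hugs, "XOX"),
    (odd, "1011"), (odd, "1010"),
    (at_least_three, "0011100"), (at_least_three, "0110110"),
    (at_most_three, "1110111"), (at_most_three, "011110"),
]
for pattern, s in checks:
    print(pattern, s, re.fullmatch(pattern, s) is not None)
```

The last pattern reads as "blocks of up to three 1's, each closed by a 0, followed by a final block of up to three 1's". Note that it also accepts the empty string.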

Character Classes

A character class is a pattern that concisely defines a set of characters. The term digit, for example, names a character class, since a digit is defined as the set of the 10 characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Additional pattern-writing rules allow us to define our own character classes and also provide several predefined, commonly used character classes.

Regular Expressions and Character Classes

Square brackets denote a set of characters. Set members are listed explicitly.

- [abc] will match an 'a', 'b', or 'c'
- [uwl] will match a 'u', 'w', or 'l'

Special characters are not special in character classes.

- [.a*] will match a '.', 'a', or '*'

There are, however, two special characters that are still special in character classes. (The backslash and the closing bracket ] also keep their special meaning.)

- '-' : denotes a range, meaning left through right
- '^' : when it occurs at the beginning, denotes logical set negation

Examples

- [a-c] will match an a or b or c and nothing else
- [a-z] will match any lowercase alphabetic symbol
- [a-zA-Z0-9] will match any alphanumeric symbol
- [^a] will match anything but lowercase a
- [^0-9] will match anything but a digit character
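The character-class rules above can be tried out with `re.findall`, which returns every matching substring. This is a small sketch; the sample strings are chosen only for illustration.

```python
import re

# Explicit members
print(re.findall(r"[abc]", "a big cab"))        # ['a', 'b', 'c', 'a', 'b']

# '.' and '*' are ordinary characters inside brackets
print(re.findall(r"[.a*]", "a*b.c"))            # ['a', '*', '.']

# Negation with a leading ^
print(re.findall(r"[^0-9]", "a1b22c"))          # ['a', 'b', 'c']

# Ranges
print(re.findall(r"[a-zA-Z0-9]", "R2-D2!"))     # ['R', '2', 'D', '2']
```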

Examples

Do the following match the regex '[a-z][0-9]*'?

- abc
- 1z93
- a-9

Do the following match the regex '[0-9]*[^02468]'?

- 03
- 999
- 354

Give a regex for social security numbers:

- [0-9]{3}-[0-9]{2}-[0-9]{4}
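A sketch for checking these exercises, again assuming "match" means the whole string is in the pattern's set (`re.fullmatch`). The helper name `whole` and the two extra strings "a" and "z2024" are added for illustration.

```python
import re

def whole(pattern, s):
    """True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

for s in ["abc", "1z93", "a-9", "a", "z2024"]:
    print(repr(s), whole(r"[a-z][0-9]*", s))
# 'abc' False, '1z93' False, 'a-9' False, 'a' True, 'z2024' True

for s in ["03", "999", "354"]:
    print(repr(s), whole(r"[0-9]*[^02468]", s))
# '03' True, '999' True, '354' False
```

If only the beginning of the string has to match (the behaviour of `re.match`), the answers change: `re.match(r"[a-z][0-9]*", "abc")` succeeds and matches just the leading a.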

Predefined classes

Some character classes are common and have shorthand definitions. The equivalences below hold for ASCII text; on Python 3 strings the shorthands also match non-ASCII digits, letters, and whitespace unless the re.ASCII flag is used.

- \d : matches any decimal digit; equivalent to [0-9]
- \D : matches any non-digit character; equivalent to [^0-9]
- \w : matches any 'word' character; equivalent to [a-zA-Z0-9_]
- \W : matches any non-word character; equivalent to [^a-zA-Z0-9_]
- \s : matches any whitespace character (space, tab, newline); equivalent to [ \t\n\r\f\v]
- \S : matches any character that is not whitespace; equivalent to [^ \t\n\r\f\v]

Give a regex for social security numbers:

- \d{3}-\d{2}-\d{4}
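A short sketch comparing the explicit and shorthand SSN patterns and showing what `\w` and `\S` pick up. Variable names and sample strings are illustrative.

```python
import re

ssn_explicit = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
ssn_shorthand = r"\d{3}-\d{2}-\d{4}"

for candidate in ["123-45-6789", "123-45-678", "abc-de-fghi"]:
    print(candidate,
          re.fullmatch(ssn_explicit, candidate) is not None,
          re.fullmatch(ssn_shorthand, candidate) is not None)

# \w includes the underscore as well as letters and digits
print(re.findall(r"\w+", "snake_case, CamelCase & 42"))  # ['snake_case', 'CamelCase', '42']

# \S+ grabs runs of non-whitespace
print(re.findall(r"\S+", "tabs\tand  spaces"))           # ['tabs', 'and', 'spaces']
```

Raw strings (the `r"..."` prefix) keep Python from interpreting the backslashes before the regular expression engine sees them.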

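Predefined classes

There are two 'positional' matches:

- $ : matches at the end of a string, or just before a newline at the end of the string
- ^ : matches at the start of a string

With the re.MULTILINE flag, $ also matches before every newline and ^ also matches right after every newline.

What do the following mean?

- ^.*s$
- ^\s.*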
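A sketch of how the two anchors behave, including the effect of the `re.MULTILINE` flag. The sample strings are assumptions made for this example.

```python
import re

# ^.*s$ : a line whose last character is s
print(re.match(r"^.*s$", "regular expressions") is not None)  # True
print(re.match(r"^.*s$", "regular expression") is not None)   # False

# ^\s.* : a line that begins with a whitespace character
print(re.match(r"^\s.*", "  indented") is not None)           # True
print(re.match(r"^\s.*", "flush left") is not None)           # False

# By default ^ matches only at the very start of the text;
# with re.MULTILINE it also matches right after each newline.
text = "first\nsecond"
print(re.findall(r"^\w+", text))                # ['first']
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']
```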

Finding patterns in text (Examples)

- 'a' matches every lowercase a in: Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
- 'Mary' matches both occurrences of Mary in the same text.
- '.*' matches the whole of: Special characters must be escaped.*
- '\.\*' matches only the literal .* at the end of: Special characters must be escaped.*

http://www.duke.edu/~dgraham/ETM/LearningtoUseRegularExpressions.html
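The positions of these matches can be listed with `re.finditer` and `re.search`. This is a sketch; the variable names are made up for illustration.

```python
import re

text = ("Mary had a little lamb. And everywhere that Mary went, "
        "the lamb was sure to go.")

# Start index of every match of each pattern
a_positions = [m.start() for m in re.finditer("a", text)]
mary_positions = [m.start() for m in re.finditer("Mary", text)]
print(len(a_positions), "matches of 'a'")
print("'Mary' found at", mary_positions)

escaped = "Special characters must be escaped.*"

# Unescaped: '.' is any character and '*' repeats it, so the whole line matches.
whole_match = re.search(r".*", escaped).group(0)
print(whole_match == escaped)

# Escaped: only the two literal characters '.' and '*' match.
literal = re.search(r"\.\*", escaped)
print(literal.group(0), literal.span())
```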

Matching and Searching

Regular expressions are in the "re" module.

- match(pattern, text): determines whether the pattern matches at the beginning of the text. Returns either None or a Match object.
- search(pattern, text): determines whether the pattern occurs anywhere in the text. Returns either None or a Match object.
- findall(pattern, text): returns a list of all non-overlapping substrings of the text that match the pattern
- finditer(pattern, text): returns an iterator of Match objects, one for each match

For example:

    >>> import re
    >>> re.match("c", "abcdef")   # No match
    >>> re.match("a", "abcdef")   # Match
    >>> re.search("c", "abcdef")  # Match
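A short sketch that exercises all four functions on one string, so the difference between them is visible. The text "abcdef abc" is chosen only for this example.

```python
import re

text = "abcdef abc"

print(re.match("c", text) is None)    # True: "c" is not at the beginning
print(re.search("c", text).span())    # (2, 3): first place "c" occurs

# findall returns the matching substrings themselves
print(re.findall("abc", text))        # ['abc', 'abc']

# finditer yields Match objects, which also carry positions
spans = [m.span() for m in re.finditer("abc", text)]
print(spans)                          # [(0, 3), (7, 10)]
```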

Match Objects

Match objects support the following methods:

- start(): returns the index of the start of the match
- end(): returns the index of the end of the match
- groups(): returns a tuple of all the group matches
- group(n): returns the nth group match. If n = 0, returns the entire match.

For example:

    >>> m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")
    >>> m.group(0)     # The entire match
    'Lazy hands'
    >>> m.group(1)     # The first parenthesized subgroup.
    'Lazy'
    >>> m.group(2)     # The second parenthesized subgroup.
    'hands'
    >>> m.group(1, 2)  # Multiple arguments give us a tuple.
    ('Lazy', 'hands')
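The position methods and `groups()` can be seen on the same match as above. This is a minimal sketch using only standard Match object methods.

```python
import re

m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")

print(m.start(), m.end())    # 0 10: the match covers text[0:10]
print(m.groups())            # ('Lazy', 'hands')

# start() and end() also accept a group number
print(m.start(2), m.end(2))  # 5 10: where 'hands' sits in the text
```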

Example: Phone Numbers

Consider creating a regular expression to match phone numbers. The phone numbers can take on the following forms:

- 800-555-1212
- 800 555 1212
- 800.555.1212
- (800) 555-1212
- 1-800-555-1212
- 800-555-1212-1234
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- 1-(800) 555.1212 #1234

Example: Phone Numbers

It is good to define a test for your code prior to writing the code. Consider testing our pattern against the examples on the previous slide.

    import re

    def test(regex):
        """Print each sample number with the result of matching regex against it."""
        tests = ['800-555-1212', '800 555 1212', '800.555.1212', '(800) 555-1212',
                 '800-555-1212', '800-555-1212-1234', '800-555-1212x1234',
                 '800-555-1212 ext. 1234', '1-(800) 555.1212 #1234']
        for number in tests:
            # re.match returns None or a Match object
            print(number + ": ", re.match(regex, number))

Example: Phone Numbers

The previous phone numbers had only four components:

- area code
- trunk (first three digits)
- rest (next 4 digits)
- extension (last digits; may be between 1 and 4 in length)

Consider defining these parts with regular expressions:

- area code would be \d{3}
- trunk would be \d{3}
- rest would be \d{4}
- extension would be \d{1,4}
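A sketch of building the full pattern from the four named parts, assuming for now that every part is separated by a dash. The variable names are illustrative.

```python
import re

area_code = r"\d{3}"
trunk = r"\d{3}"
rest = r"\d{4}"
extension = r"\d{1,4}"

# Join the parts with a literal dash between each one.
phone = "-".join([area_code, trunk, rest, extension])
print(phone)  # \d{3}-\d{3}-\d{4}-\d{1,4}

print(re.match(phone, "800-555-1212-1234") is not None)  # True
print(re.match(phone, "800-555-1212") is not None)       # False: extension is required
```

Naming the parts keeps the intent of each piece visible when the pattern grows.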

Example: Phone Numbers

Consider the following for phone numbers:

- area code-trunk-rest-extension
- \d{3}-\d{3}-\d{4}-\d{1,4}

In the output below, None means no match; a line with nothing after the colon is one where a Match object was returned.

    >>> test(r'\d{3}-\d{3}-\d{4}-\d{1,4}')
    800-555-1212: None
    800 555 1212: None
    800.555.1212: None
    (800) 555-1212: None
    800-555-1212: None
    800-555-1212-1234:
    800-555-1212x1234: None
    800-555-1212 ext. 1234: None
    1-(800) 555.1212 #1234: None

Example: Phone Numbers

How to modify our regex to say that extensions are optional?

- \d{3}-\d{3}-\d{4}(-\d{1,4})?

Running the test:

    >>> test(r'\d{3}-\d{3}-\d{4}(-\d{1,4})?')
    800-555-1212:
    800 555 1212: None
    800.555.1212: None
    (800) 555-1212: None
    800-555-1212:
    800-555-1212-1234:
    800-555-1212x1234:
    800-555-1212 ext. 1234:
    1-(800) 555.1212 #1234: None
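Some of the numbers with x or ext. extensions now pass, but not because the extension was recognized. A small sketch showing what was actually matched:

```python
import re

regex = r'\d{3}-\d{3}-\d{4}(-\d{1,4})?'

results = []
for number in ['800-555-1212', '800-555-1212-1234', '800-555-1212x1234']:
    m = re.match(regex, number)
    # group(0) is the matched text, group(1) is the optional extension group
    results.append((m.group(0), m.group(1)))
    print(number, '->', m.group(0), m.group(1))
```

For 800-555-1212x1234 the matched text is only 800-555-1212 and the extension group is None: `re.match` accepts a match at the beginning of the string and ignores the leftover x1234.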

Example: Phone Numbers

How to handle different separators? Consider the following test cases especially:

- 800-555-1212
- 800 555 1212
- 800.555.1212
- (800) 555-1212
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- 1-(800) 555.1212 #1234

Let's say that a separator is

- optional
- any number of non-digit characters

Example: Phone Numbers

How to modify our regex to deal with separators?

- Before: \d{3}-\d{3}-\d{4}(-\d{1,4})?
- After: \d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?

Running the test:

    >>> test(r'\d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?')
    800-555-1212:
    800 555 1212:
    800.555.1212:
    (800) 555-1212: None
    800-555-1212:
    800-555-1212-1234:
    800-555-1212x1234:
    800-555-1212 ext. 1234:
    1-(800) 555.1212 #1234: None

Example: Phone Numbers

How to get the four parts from sub-groups? Let's first modify our test routine.

    def test(regex):
        """Print each sample number with the groups captured by regex, or None."""
        tests = ['800-555-1212', '800 555 1212', '800.555.1212', '(800) 555-1212',
                 '800-555-1212', '800-555-1212-1234', '800-555-1212x1234',
                 '800-555-1212 ext. 1234', '1-(800) 555.1212 #1234']
        for number in tests:
            match = re.match(regex, number)
            if match:
                print(number + ': ', match.groups())
            else:
                print(number + ': ', None)

Example: Phone Numbers

How to get the four parts from sub-groups?

- (\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?

Running the test:

    >>> test(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?')
    800-555-1212: ('800', '555', '1212', None)
    800 555 1212: ('800', '555', '1212', None)
    800.555.1212: ('800', '555', '1212', None)
    (800) 555-1212: None
    800-555-1212: ('800', '555', '1212', None)
    800-555-1212-1234: ('800', '555', '1212', '1234')
    800-555-1212x1234: ('800', '555', '1212', '1234')
    800-555-1212 ext. 1234: ('800', '555', '1212', '1234')
    1-(800) 555.1212 #1234: None
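Two test cases still fail: the ones that begin with a parenthesis or with a leading 1. The following is one possible way to cover them, offered as a sketch and not as the slides' solution. It assumes leading non-digits may be skipped and that a leading 1 followed by a separator is a country code; it also uses a non-capturing group `(?:...)`, which has not been introduced above. The names `PHONE` and `parse` are hypothetical.

```python
import re

# Hypothetical extension: skip leading non-digits, allow an optional "1" prefix,
# and require the pattern to reach the end of the string with $.
PHONE = r'\D*(?:1\D+)?(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?$'

def parse(number):
    """Return (area code, trunk, rest, extension) or None if there is no match."""
    m = re.match(PHONE, number)
    return m.groups() if m else None

for number in ['(800) 555-1212', '1-800-555-1212', '1-(800) 555.1212 #1234',
               '800-555-1212 ext. 1234', '800-555-121']:
    print(number + ': ', parse(number))
```

The trailing `$` makes the whole string count, so a number with too few digits, such as the last one, is rejected instead of being partially matched.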

Splitting

Consider reading words from a text file. In the past we have split lines on whitespace. This is not a thorough splitting. Consider a text file having punctuation symbols and separators:

- The words of the Teacher, son of David, king in Jerusalem: "Meaningless! Meaningless!" says the Teacher.

Splitting on whitespace leaves the punctuation attached to the words:

    with open(filename, "r") as file:   # filename: path to the text file
        for line in file:
            for word in line.split():
                print(word)

Splitting on runs of non-word characters removes the punctuation:

    with open(filename, "r") as file:
        for line in file:
            for word in re.split(r"\W+", line):
                print(word)
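The difference between the two kinds of splitting can be seen without a file by applying both to the sample sentence. This is a sketch; the variable names are illustrative.

```python
import re

line = ('The words of the Teacher, son of David, king in Jerusalem: '
        '"Meaningless! Meaningless!" says the Teacher.')

# Whitespace splitting keeps punctuation glued to the words.
print(line.split()[4])   # Teacher,

# re.split on \W+ drops punctuation, but leaves an empty string when the
# line ends with a separator, so empty pieces are filtered out here.
words = [w for w in re.split(r"\W+", line) if w]
print(words)
```

Filtering out the empty pieces matters in practice, since most lines of ordinary text end with punctuation or a newline.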