REGULAR EXPRESSIONS 1 DAY 6 - 9/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
Course organization
The syllabus is under construction.
08-Sept-2014 NLP, Prof. Howard, Tulane University
Review
The quiz was the review.
Open Spyder
§4. Regular expressions
Regular expressions, or regex
>>> import re
General form: re.findall(pattern, target_string), which returns a list of all non-overlapping matches of pattern in target_string.
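To make the behavior of re.findall concrete, here is a minimal sketch. The sample string "banana" is a hypothetical example, not part of the lecture; it is chosen only to show that matches are returned left to right, never overlap, and that a failed search gives an empty list.

```python
import re

# Hypothetical sample text, not taken from the slides.
text = "banana"

# Every non-overlapping match, scanning left to right.
hits = re.findall("an", text)
print(hits)      # ['an', 'an']

# 'ana' starts at index 1 and again at index 3, but the two
# occurrences overlap, so only the first one is returned.
overlap = re.findall("ana", text)
print(overlap)   # ['ana']

# No match at all gives an empty list, not an error.
nothing = re.findall("x", text)
print(nothing)   # []
```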
4.2. Fixed-length matching
The test string
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
Strings as regular expressions
>>> re.findall(' be ', S)
[' be ', ' be ']
Match one alternative of a disjunction with |
>>> re.findall(' to | be | it | as ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']
>>> set(re.findall(' to | be | it | as ', S))
set([' it ', ' as ', ' to ', ' be '])
(This is Python 2 output. Python 3 prints a set in braces, and the order of its elements is arbitrary.)
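Because a set has no fixed order, it can help to sort the unique matches or to count them. The following is a minimal sketch assuming Python 3; the use of sorted and collections.Counter is an addition for illustration and is not part of the slides.

```python
import re
from collections import Counter

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

matches = re.findall(' to | be | it | as ', S)

# Unique matches in a stable, alphabetical order.
unique_sorted = sorted(set(matches))
print(unique_sorted)   # [' as ', ' be ', ' it ', ' to ']

# How often each alternative matched.
counts = Counter(matches)
print(counts[' be '], counts[' it '])   # 2 1
```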
Match a group of characters with capturing or non-capturing parentheses, ()
>>> re.findall(' (to|be|it|as) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(' (?:to|be|it|as) ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']
By default, parentheses capture the string inside them, and findall returns only the captured string. The ?: prefix turns capturing off, so findall returns the whole match. For the rest of this discussion, we use capturing parentheses to exclude the spaces from the output.
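The effect of capturing on the output of findall can be sketched as follows, assuming Python 3. The two-group pattern ' (t|b)(o|e) ' is a hypothetical example added here to show that more than one capturing group makes findall return tuples.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# One capturing group: only the captured text is returned.
one_group = re.findall(' (to|be) ', S)
print(one_group)    # ['to', 'be', 'be', 'to']

# No capturing group: the whole match, spaces included.
no_group = re.findall(' (?:to|be) ', S)
print(no_group)     # [' to ', ' be ', ' be ', ' to ']

# Two capturing groups: one tuple per match, one item per group.
two_groups = re.findall(' (t|b)(o|e) ', S)
print(two_groups)   # [('t', 'o'), ('b', 'e'), ('b', 'e'), ('t', 'o')]
```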
Match one character of a range with [] and its negation with [^]
>>> re.findall(' ([a-z][a-z]) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(' ([^0-9][^0-9]) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(' ([a-e][a-e]) ', S)
['be', 'be']
>>> re.findall(' ([^a-e][^a-e]) ', S)
['to', 'it', 'to']
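Two properties of ranges are easy to overlook: they are case-sensitive, and a negated range matches anything outside the range, including spaces, digits and punctuation. A minimal sketch, assuming Python 3; the string 'bead 42!' is a hypothetical example, not from the slides.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# [A-Z] matches uppercase letters only; [a-z] would skip these.
capitals = re.findall('[A-Z]', S)
print(capitals)   # ['T', 'A', 'T']

# [^a-e] is "any character that is not a, b, c, d or e",
# so it also matches the space, the digits and the '!'.
non_ae = re.findall('[^a-e]', 'bead 42!')
print(non_ae)     # [' ', '4', '2', '!']
```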
Match a number of repetitions of a character with {}
>>> re.findall(' ([a-z]{2}) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
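As a quick check, the {2} form can be compared with the range written out twice, and the count can be changed to pick out words of another length. This is a minimal sketch assuming Python 3; the three-letter pattern is an added illustration, not part of the slides.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# {2} is shorthand for writing the range twice.
spelled_out = re.findall(' ([a-z][a-z]) ', S)
with_braces = re.findall(' ([a-z]{2}) ', S)
print(spelled_out == with_braces)   # True

# Changing the count to 3 finds three-letter lowercase words
# that have a space on both sides.
three = re.findall(' ([a-z]{3}) ', S)
print(three)   # ['own', 'the', 'the', 'not', 'any']
```

Words next to punctuation or a line break, such as "all:" or "And", are not found, because the pattern requires a literal space on each side.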
Match any character with . (by default, any character except a newline)
>>> re.findall(' (..) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(' (.{2}) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
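The dot is broader than a letter range, but it has one exception and one trap. The sketch below, assuming Python 3, uses hypothetical sample strings (not from the slides) to show that the dot skips newlines unless re.DOTALL is given, and that a literal period must be escaped.

```python
import re

# Hypothetical sample with a letter, a hyphen, a space and a newline
# between 'a' and 'c'.
sample = 'abc a-c a c a\nc'

# By default the dot matches any character except a newline.
default = re.findall('a.c', sample)
print(default)   # ['abc', 'a-c', 'a c']

# With the re.DOTALL flag the dot matches a newline as well.
dotall = re.findall('a.c', sample, re.DOTALL)
print(dotall)    # ['abc', 'a-c', 'a c', 'a\nc']

# To match a literal period, escape the dot with a backslash.
literal = re.findall(r'a\.c', 'abc a.c')
print(literal)   # ['a.c']
```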
and following Next time