Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Slides:



Advertisements
Similar presentations
Regular Expressions Pattern and Match objects Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Python: Regular Expressions
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
More Regular Expressions. List/Scalar Context for m// Last week, we said that m// returns ‘true’ or ‘false’ in scalar context. (really, 1 or 0). In list.
Regular Expressions Comp 2400: Fall 2008 Prof. Chris GauthierDickey.
UNIX Filters.
Regular Expressions. String Matching The problem of finding a string that “looks kind of like …” is common  e.g. finding useful delimiters in a file,
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
Regular Expression A regular expression is a template that either matches or doesn’t match a given string.
Mechanics Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
Introduction to Computing Using Python Regular expressions Suppose we need to find all addresses in a web page How do we recognize addresses?
Copyright © 2012 Pearson Education, Inc. Publishing as Pearson Addison-Wesley C H A P T E R 2 Input, Processing, and Output.
Chapter 6 Functions -- QuickStart. "The Practice of Computing Using Python", Punch & Enbody, Copyright © 2013 Pearson Education, Inc. What is a function?
Numeric Types, Expressions, and Output ROBERT REAVES.
CIS 218 Advanced UNIX1 CIS 218 – Advanced UNIX (g)awk.
Input, Output, and Processing
Strings The Basics. Strings can refer to a string variable as one variable or as many different components (characters) string values are delimited by.
CIS 451: Regular Expressions Dr. Ralph D. Westfall January, 2009.
Finding the needle(s) in the textual haystack
Programming Languages Meeting 13 December 2/3, 2014.
Regular Expression (continue) and Cookies. Quick Review What letter values would be included for the following variable, which will be used for validation.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
Operating Systems COMP 4850/CISG 5550 File Systems Files Dr. James Money.
Regular Expressions CSC207 – Software Design. Motivation Handling white space –A program ought to be able to treat any number of white space characters.
Copyright © 2012 Pearson Education, Inc. Publishing as Pearson Addison-Wesley C H A P T E R 2 Input, Processing, and Output.
Clearly Visual Basic: Programming with Visual Basic 2008 Chapter 24 The String Section.
REGEX. Problems Have big text file, want to extract data – Phone numbers (503)
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Searching and Regular Expressions. Proteins 20 amino acids Interesting structures beta barrel, greek key motif, EF hand... Bind, move, catalyze, recognize,
Regular Expressions The ultimate tool for textual analysis.
More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
GREP. Whats Grep? Grep is a popular unix program that supports a special programming language for doing regular expressions The grammar in use for software.
Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
More Patterns Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
1 Lecture 9 Shell Programming – Command substitution Regular expressions and grep Use of exit, for loop and expr commands COP 3353 Introduction to UNIX.
Text Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Finding Things Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Introduction to LISP Atoms, Lists Math. LISP n LISt Processing n Function model –Program = function definition –Give arguments –Returns values n Mathematical.
Operators Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Lecture 5: Expressions and Interactivity Professor: Dr. Miguel Alonso Jr. Fall 2008 CGS2423/COP1220.
Exceptions Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Introduction to Programming the WWW I CMSC Winter 2004 Lecture 13.
1 CSE 2337 Chapter 7 Organizing Data. 2 Overview Import unstructured data Concatenation Parse Create Excel Lists.
Strings Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
CompSci 101 Introduction to Computer Science April 7, 2015 Prof. Rodger.
Awk 2 – more awk. AWK INVOCATION AND OPERATION the "-F" option allows changing Awk's "field separator" character. Awk regards each line of input data.
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
Java Basics Regular Expressions.  A regular expression (RE) is a pattern used to search through text.  It either matches the.
Regular Expressions Copyright Doug Maxwell (
Finding the needle(s) in the textual haystack
Regular Expressions Upsorn Praphamontripong CS 1110
Regular Expressions 'RegEx'.
Strings and Serialization
CSC1018F: Regular Expressions
Lecture 19 Strings and Regular Expressions
Lecture 9 Shell Programming – Command substitution
Miscellaneous Items Loop control, block labels, unless/until, backwards syntax for “if” statements, split, join, substring, length, logical operators,
Finding the needle(s) in the textual haystack
Finding the needle(s) in the textual haystack
Pattern Matching in Strings
Number and String Operations
CS 1111 Introduction to Programming Fall 2018
Lesson 3: Find and Replace Tools
String Processing 1 MIS 3406 Department of MIS Fox School of Business
REGEX.
Presentation transcript:

Introduction Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See for more information. Regular Expressions

Introduction

Regular ExpressionsIntroduction

Regular ExpressionsIntroduction

Regular ExpressionsIntroduction

Regular ExpressionsIntroduction

Regular ExpressionsIntroduction

Regular ExpressionsIntroduction

Regular ExpressionsIntroduction Your mission: read files containing several hundred measurements each of background evil levels (in millivaders) and convert them to a uniform format for further processing

Regular ExpressionsIntroduction Your mission: read files containing several hundred measurements each of residual evil levels and convert them to a uniform format for further processing Each reading has a site name, the date, and the level of background evil (in millivaders)

Regular ExpressionsIntroduction Your mission: read files containing several hundred measurements each of residual evil levels and convert them to a uniform format for further processing Each reading has a site name, the date, and the level of background evil (in millivaders) Some use tabs to separate fields, others use commas

Regular ExpressionsIntroduction Your mission: read files containing several hundred measurements each of residual evil levels and convert them to a uniform format for further processing Each reading has a site name, the date, and the level of background evil (in millivaders) Some use tabs to separate fields, others use commas Dates are written in several different styles

Regular ExpressionsIntroduction Notebook #1 Site Date Evil (millivaders) Baker Baker Baker Baker Baker Baker ⋮ ⋮ ⋮

Regular ExpressionsIntroduction Notebook #1 single tab as separator Site Date Evil (millivaders) Baker Baker Baker Baker Baker Baker ⋮ ⋮ ⋮

Regular ExpressionsIntroduction Notebook #1 spaces in site names Site Date Evil (millivaders) Baker Baker Baker Baker Baker Baker ⋮ ⋮ ⋮

Regular ExpressionsIntroduction Notebook #1 dates in international standard format (YYYY-MM-DD) Site Date Evil (millivaders) Baker Baker Baker Baker Baker Baker ⋮ ⋮ ⋮

Regular ExpressionsIntroduction Notebook #2 Site/Date/Evil Davison/May 22, 2010/ Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/ ⋮ ⋮ ⋮

Regular ExpressionsIntroduction Notebook #2 slashes as separators Site/Date/Evil Davison/May 22, 2010/ Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/ ⋮ ⋮ ⋮

Regular ExpressionsIntroduction Notebook #2 Site/Date/Evil Davison/May 22, 2010/ Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/ ⋮ ⋮ ⋮ site names don't appear to have spaces

Regular ExpressionsIntroduction Notebook #2 Site/Date/Evil Davison/May 22, 2010/ Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/ ⋮ ⋮ ⋮ month names and day numbers of varying length

Regular ExpressionsIntroduction Solution: regular expressions

Regular ExpressionsIntroduction Solution: regular expressions A pattern that strings can match

Regular ExpressionsIntroduction Solution: regular expressions A pattern that strings can match Like '*.txt' matches filenames ending in '.txt'

Regular ExpressionsIntroduction Solution: regular expressions A pattern that strings can match Like '*.txt' matches filenames ending in '.txt' Warning: notation is ugly

Regular ExpressionsIntroduction Solution: regular expressions A pattern that strings can match Like '*.txt' matches filenames ending in '.txt' Warning: notation is ugly Writing patterns for strings as strings...

Regular ExpressionsIntroduction Solution: regular expressions A pattern that strings can match Like '*.txt' matches filenames ending in '.txt' Warning: notation is ugly Writing patterns for strings as strings......using only the symbols on the keyboard (instead of inventing new symbols like mathematicians do)

Regular ExpressionsIntroduction # read the first half-dozen data records from two files readings = [] for filename in ('data-1.txt', 'data-2.txt'): lines = open(filename, 'r').read().strip().split('\n') readings += lines[2:8] for r in readings: print r

Regular ExpressionsIntroduction Baker Baker Baker Baker Baker Baker Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/4981.0

Regular ExpressionsIntroduction # select readings in month '06' for r in readings: if '06' in r: print r Baker

Regular ExpressionsIntroduction # select readings in month '06' or month '07' for r in readings: if ('06' in r) or ('07' in r): print r Baker Baker

Regular ExpressionsIntroduction # what about readings in month '05'? (shouldn't be any) for r in readings: if ('05' in r): print r Baker

Regular ExpressionsIntroduction # what about readings in month '05'? (shouldn't be any) for r in readings: if ('05' in r): print r Baker "in string" is a dangerously blunt tool

Regular ExpressionsIntroduction # try using regular expressions instead import re for r in readings: if re.search('06', r): print r Baker

Regular ExpressionsIntroduction # try using regular expressions instead import re for r in readings: if re.search('06', r): print r Baker not much of an improvement so far...

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker first argument is the pattern to search for

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker first argument is the pattern to search for written as a string

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker second argument is the data to search in

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker second argument is the data to search in reversing these is a common mistake (and hard to track down)

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker vertical bar '|' means OR

Regular ExpressionsIntroduction # find records with '06' or '07' in one search import re for r in readings: if re.search('06|07', r): print r Baker Baker vertical bar '|' means OR match either what's on the left, or what's on the right, in a single search

Regular ExpressionsIntroduction # we're going to be trying out a lot of patterns, # so let's write a function def show_matches(pattern, strings): for s in strings: if re.search(pattern, s): print '**', s else: print ' ', s

Regular ExpressionsIntroduction # test our function right away show_matches('06|07', readings) Baker ** Baker ** Baker Baker Baker Baker Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/4981.0

Regular ExpressionsIntroduction # why doesn't this work? show_matches('06|7', readings) ** Baker ** Baker ** Baker ** Baker Baker ** Baker ** Davison/May 23, 2010/ Pertwee/May 24, 2010/ ** Davison/June 19, 2010/ ** Davison/July 6, 2010/ ** Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/4981.0

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c Multiplication is implied, and has higher precedence

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c Multiplication is implied, and has higher precedence To force the other meaning, write "a(b+c)"

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c Multiplication is implied, and has higher precedence To force the other meaning, write "a(b+c)" In regular expressions, "06|7" means '06' or '7'

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c Multiplication is implied, and has higher precedence To force the other meaning, write "a(b+c)" In regular expressions, "06|7" means '06' or '7'...and there are a lot of 7's in our file

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c Multiplication is implied, and has higher precedence To force the other meaning, write "a(b+c)" In regular expressions, "06|7" means '06' or '7'...and there are a lot of 7's in our file To force the other meaning, parenthesize "0(6|7)"

Regular ExpressionsIntroduction In mathematics, "ab+c" means (a×b) + c Multiplication is implied, and has higher precedence To force the other meaning, write "a(b+c)" In regular expressions, "06|7" means '06' or '7'...and there are a lot of 7's in our file To force the other meaning, parenthesize "0(6|7)" But "06|07" is more readable anyway

Regular ExpressionsIntroduction # still matching days when we want to match months show_matches('05', readings) Baker Baker Baker Baker ** Baker Baker Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/4981.0

Regular ExpressionsIntroduction # could rely on context show_matches('-05-', readings) Baker Baker Baker Baker Baker Baker Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/4981.0

Regular ExpressionsIntroduction # could rely on context show_matches('-05-', readings) Baker Baker Baker Baker Baker Baker Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/ month has '-' before and after

Regular ExpressionsIntroduction # could rely on context show_matches('-05-', readings) Baker Baker Baker Baker Baker Baker Davison/May 23, 2010/ Pertwee/May 24, 2010/ Davison/June 19, 2010/ Davison/July 6, 2010/ Pertwee/Aug 4, 2010/ Pertwee/Sept 3, 2010/ so no matches

Regular ExpressionsIntroduction Matching is enough: we need to extract data

Regular ExpressionsIntroduction Matching is enough: we need to extract data When a regular expression matches, the library remembers what matched against every parenthesized sub-expression

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 pattern to match years is in parentheses

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 first record from data

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 first record from data (remember '\t' is a tab)

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 re.search returns a match object if a match is found

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 re.search returns a match object if a match is found returns None if there is no match

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 match.group(k) returns the text that matched the k th sub-expression in the regular expression

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 match.group(k) returns the text that matched the k th sub-expression in the regular expression k goes from 1 to N (number of matches), not 0 to N-1

Regular ExpressionsIntroduction # extract the year from the match match = re.search('(2009|2010|2011)', 'Baker 1\t \t1223.0') print match.group(1) 2009 match.group(k) returns the text that matched the k th sub-expression in the regular expression k goes from 1 to N (number of matches), not 0 to N-1 because match.group(0) is all the text that was matched

Regular ExpressionsIntroduction Regular expression to match month would be "(01|02|03|04|05|06|07|08|09|10|11|12)"

Regular ExpressionsIntroduction Regular expression to match month would be "(01|02|03|04|05|06|07|08|09|10|11|12)" Expression to match day would be three times longer

Regular ExpressionsIntroduction Regular expression to match month would be "(01|02|03|04|05|06|07|08|09|10|11|12)" Expression to match day would be three times longer Use '.' (period) to match any single character

Regular ExpressionsIntroduction Regular expression to match month would be "(01|02|03|04|05|06|07|08|09|10|11|12)" Expression to match day would be three times longer Use '.' (period) to match any single character So ' ' matches four characters, a dash, two more characters, another dash, and two more characters

Regular ExpressionsIntroduction Regular expression to match month would be "(01|02|03|04|05|06|07|08|09|10|11|12)" Expression to match day would be three times longer Use '.' (period) to match any single character So ' ' matches four characters, a dash, two more characters, another dash, and two more characters And '(....)-(..)-(..)' remembers those matches individually

Regular ExpressionsIntroduction # match and extract YYYY-MM-DD date formats match = re.search('(....)-(..)-(..)', 'Baker 1\t \t1223.0') print match.group(1), match.group(2), match.group(3)

Regular ExpressionsIntroduction # match and extract YYYY-MM-DD date formats match = re.search('(....)-(..)-(..)', 'Baker 1\t \t1223.0') print match.group(1), match.group(2), match.group(3) Try doing that with substring searches...

Regular ExpressionsIntroduction # let's write another function to show match groups def show_groups(pattern, text): m = re.search(pattern, text) if m is None: print 'NO MATCH' return print 'all:', m.group(0) for i in range(1, 1 + len(m.groups())): print '%2d: %s' % (i, m.groups(i)) show_groups('(....)-(..)-(..)', 'Baker 1\t \t1223.0') 01: : 11 03: 17

Regular ExpressionsIntroduction 1. Letters and digits match themselves.

Regular ExpressionsIntroduction 1. Letters and digits match themselves. 2. '|' means OR.

Regular ExpressionsIntroduction 1. Letters and digits match themselves. 2. '|' means OR. 3. '.' matches any single character.

Regular ExpressionsIntroduction 1. Letters and digits match themselves. 2. '|' means OR. 3. '.' matches any single character. 4. Use '()' to enforce grouping.

Regular ExpressionsIntroduction 1. Letters and digits match themselves. 2. '|' means OR. 3. '.' matches any single character. 4. Use '()' to enforce grouping. 5. re.search returns a match object or None.

Regular ExpressionsIntroduction 1. Letters and digits match themselves. 2. '|' means OR. 3. '.' matches any single character. 4. Use '()' to enforce grouping. 5. re.search returns a match object or None. 6. match.group(k) is the text that matched group k.

June 2010 created by Greg Wilson Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See for more information.