1 Python & Pattern Matching with Regular Expressions (REs) OPIM 101 File:PythonREs.ppt.


1 Python & Pattern Matching with Regular Expressions (REs)
OPIM 101
File: PythonREs.ppt

2 Foresight
Pattern matching
–Literal
–With metacharacters
Regular expressions (REs)
Using REs in Python

3 Consider: dir by Itself
D:\athomepc\day\idt>dir
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt
.             <DIR>      01-01-02   8:16a  .
..            <DIR>      01-01-02   8:16a  ..
SPRING~1 PDF    180,072  01-01-02   8:17a  spring02idtfront.pdf
SPRING~2 PDF    241,542  01-01-02   8:19a  spring02idtpartI.pdf
SPRING~3 PDF  1,246,514  01-01-02   8:20a  spring02idtpartII.pdf
SPRING~4 PDF  2,517,343  01-01-02   8:22a  spring02idtpartIII.pdf
SPRING~5 PDF  3,469,138  01-01-02   8:24a  spring02idtpartIV.pdf
CASE1-~1 DOC     35,328  01-01-02   8:42a  case1-python.doc
LECTUR~1 PPT     78,336  01-01-02   9:45a  lecture01fall01.ppt
PYTHON~1 PPT     34,816  01-01-02   9:46a  Python_Intro.ppt
PYTHON~2 PPT     37,376  01-01-02   9:46a  Python_Structures.ppt
LECTUR~2 PPT    154,112  01-01-02  11:51a  lecture01spring02.ppt
PYTHON~3 PPT     34,816  01-01-02  11:52a  PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>

4 Now: dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt
CASE1-~1 DOC     35,328  01-01-02   8:42a  case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>

5 Now: dir with “*”
D:\athomepc\day\idt>dir *.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt
CASE1-~1 DOC     35,328  01-01-02   8:42a  case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>

6 Literal vs. Pattern Searches
dir myfile.doc
–Searches literally, for an exact match with “myfile.doc”
dir my*.doc
–Does a pattern search. Matches any file name beginning with “my”, followed by 0 or more characters of any kind, followed by “.doc”
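The difference between the two kinds of search can be tried out in Python itself. This is a minimal sketch that uses the standard fnmatch module as a stand-in for dir's wildcard matching; the helper names literal_search and pattern_search are made up for illustration, and the file names come from the listing on slide 3.

```python
from fnmatch import fnmatchcase

# A few file names taken from the dir listing on slide 3.
filenames = [
    "spring02idtfront.pdf",
    "case1-python.doc",
    "lecture01fall01.ppt",
    "Python_Intro.ppt",
    "PythonREs.ppt",
]

def literal_search(name, names):
    """Hypothetical helper: return the names equal to `name` exactly."""
    return [n for n in names if n == name]

def pattern_search(pattern, names):
    """Hypothetical helper: return the names matching a dir-style pattern."""
    # fnmatchcase compares case-sensitively on every platform,
    # whereas the real dir command ignores case.
    return [n for n in names if fnmatchcase(n, pattern)]

print(literal_search("case1-python.doc", filenames))  # ['case1-python.doc']
print(pattern_search("*.doc", filenames))             # ['case1-python.doc']
print(pattern_search("Python*.ppt", filenames))       # ['Python_Intro.ppt', 'PythonREs.ppt']
```

The literal search only succeeds on an exact name, while the pattern search lets “*” stand for any run of characters.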

7 MetaCharacters
dir treats “*” as a metacharacter: a character not taken literally, but as an instruction to match a certain kind of pattern (here: anything)
The dir metacharacter scheme is very useful

8 On Beyond *
...and also very primitive and limited
A step up: grep in Unix & Linux; support for RE searches in some text editors, e.g., TextPad (www.textpad.com)
Regular expressions (REs) use a richer language and larger set of metacharacters, giving us a very powerful capability to extract information (patterns) from text

9 Python’s RE Metacharacters
Here’s the complete list: . ^ $ * + ? { } [ ] \ | ( )
No use memorizing. We’ll learn by examples.
A natural question: But what if I want to search for a pattern that contains what Python’s RE counts as metacharacters?
–Be just a little patient

10 Load Python’s re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie',teststring)
>>> match is None
False
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'
>>>

11 Now a Nonliteral Match
>>> match = re.search('Television',teststring)
>>> match is None
False
>>> match = re.search('television',teststring)
>>> match is None
True
>>> match = re.search('[tT]elevision',teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'
>>>
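Because re.search returns None when nothing matches, a script should check the result before calling span(). The sketch below wraps that check in a small function; describe is a made-up name, not part of the re module.

```python
import re

teststring = "Television is public anomie number 1."

def describe(pattern, text):
    """Hypothetical helper: report where `pattern` first matches in `text`."""
    match = re.search(pattern, text)
    if match is None:
        # Calling match.span() here would raise AttributeError.
        return f"{pattern!r}: no match"
    start, end = match.span()
    return f"{pattern!r}: matched {text[start:end]!r} at {start}-{end}"

for pattern in ["Television", "television", "[tT]elevision"]:
    print(describe(pattern, teststring))
```

Only the lowercase literal 'television' fails; the character class [tT] accepts either case of the first letter.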

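12 Square Bracket Notation: [...]
“[tT]” means “any one of the characters ‘t’ or ‘T’.”
[...] is called a character class
Examples:
–[abc], [a-z], [A-Z]
–[^tT] any one character that is not t and not T
–A ^ negates the class only when it comes first. In [^t^T] the second ^ is a literal caret, so that class also rejects ^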
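A quick way to see what a character class accepts is to test it against single characters. This is an illustrative sketch; chars_matching and the samples string are invented for the demonstration.

```python
import re

# Made-up sample characters: lower case, upper case, and a caret.
samples = "tT^abcXYZ"

def chars_matching(char_class, text):
    """Hypothetical helper: keep the characters of `text` the class accepts."""
    return "".join(ch for ch in text if re.fullmatch(char_class, ch))

print(chars_matching(r"[tT]", samples))    # tT
print(chars_matching(r"[abc]", samples))   # abc
print(chars_matching(r"[a-z]", samples))   # tabc
print(chars_matching(r"[A-Z]", samples))   # TXYZ
print(chars_matching(r"[^tT]", samples))   # ^abcXYZ
print(chars_matching(r"[^t^T]", samples))  # abcXYZ
```

The last two lines differ only in the caret: a ^ that is not the first character inside the brackets is an ordinary member of the class.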

13 Not Example ^
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^tT][a-z]+',teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
>>>
Note:
+ means “one or more of the previous”
* means “zero or more”
? means “zero or one”
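The three repetition operators can be compared side by side. In this sketch the patterns and test strings are made up, and full is a hypothetical helper that reports whether the whole string matches.

```python
import re

def full(pattern, text):
    """Hypothetical helper: True when all of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Columns: text, then ab+c (one or more b), ab*c (zero or more), ab?c (zero or one).
for text in ["ac", "abc", "abbbc"]:
    print(text, full(r"ab+c", text), full(r"ab*c", text), full(r"ab?c", text))
# ac False True True
# abc True True True
# abbbc True True False
```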

14 '\s\w+\.' and '\s(\w+)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search(r'\s\w+\.',teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search(r'\s(\w+)\.',teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
>>>
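Here \s matches a whitespace character, \w matches a letter, digit, or underscore, and the parentheses mark group 1. Rather than slicing with span(1), the matched text can be read directly with group(). A small sketch using the same test string:

```python
import re

teststring = "Television is public anomie number 1."

# Raw string (r'...') so the backslashes reach the re module unchanged.
match = re.search(r"\s(\w+)\.", teststring)

whole = match.group(0)   # the entire match, same as match.group()
inner = match.group(1)   # only what the parentheses captured
start, end = match.span(1)

print(repr(whole))                       # ' 1.'
print(repr(inner))                       # '1'
print(teststring[start:end] == inner)    # True
```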

15 [.] == \.
Inside [...] most metacharacters are taken literally
–So, [.] == \.
Note (again): [...] is called a character class
>>> match = re.search(r'\s(\w+)[.]',teststring)
>>> match.span()
(34, 37)
>>>
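This answers the question from slide 9 about searching for text that contains metacharacters. The sketch below compares three ways of matching a literal period; the sample string is made up, and re.escape is the standard-library function that adds the backslashes automatically.

```python
import re

# Made-up sample: "10" comes before "1." in the text.
sample = "10 and 1."

# An unescaped dot matches any character (except a newline), so "1." finds "10".
print(re.search(r"1.", sample).group())   # 10

# Three equivalent ways to demand a literal period after the 1.
patterns = [r"1\.", r"1[.]", re.escape("1.")]
spans = [re.search(p, sample).span() for p in patterns]
print(spans)   # [(7, 9), (7, 9), (7, 9)]
```

re.escape is handy when the text to search for comes from user input and may contain any of the metacharacters.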

16 Avoiding Greed ?
>>> newstring = ' '
>>> newstring = newstring+' '
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+' '
>>> newstring
' (As of 10:55 AM on 12/20/01) '
>>> match = re.search(' ',newstring)
>>> match.span()
(0, 81)
>>> match = re.search(' ',newstring)
>>> match.group()
'
>>>

17 More on Not Being Greedy
>>> match = re.search(r' (.+)</(\1)',newstring)
>>> match.groups()
('d', ' (As of 10:55 AM on 12/20/01) ', 'd')
>>> match = re.search(r' ([^<]+)</(\1)',newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
>>>
\1 is called a backreference. It refers to group 1.
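The markup inside newstring is not legible in this transcript, so the following sketch illustrates greedy matching, non-greedy matching, and backreferences on a hypothetical HTML fragment. The html string and the patterns are assumptions chosen for the demonstration, not the values used on the slides.

```python
import re

# Hypothetical stand-in for the slides' newstring.
html = "<td><i>(As of 10:55 AM on 12/20/01)</i></td>"

# Greedy: .+ grabs as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", html)
print(greedy.span() == (0, len(html)))   # True

# Non-greedy: a ? after + makes it stop at the first '>' that works.
lazy = re.search(r"<.+?>", html)
print(lazy.group())                      # <td>

# Backreference: \1 must repeat whatever group 1 matched (the tag name).
outer = re.search(r"<(.+?)>(.+)</\1>", html)
print(outer.groups())   # ('td', '<i>(As of 10:55 AM on 12/20/01)</i>')

# Excluding '<' from the content forces the innermost tag pair.
inner = re.search(r"<(.+?)>([^<]+)</\1>", html)
print(inner.groups())   # ('i', '(As of 10:55 AM on 12/20/01)')
```

The last pattern reaches the text between the innermost tags because [^<]+ refuses to cross another tag.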

18 Concluding
REs are a very powerful tool, very often very useful
The language notation is compact and a bit hard to read
Practice, study the examples, don’t worry about memorization.

19 Advice on Scripting
Scripting, and programming in general, is a process
Successful scripts don’t spring into existence whole
–Scripts are built in small increments
Attend to:
–Decomposition
–Stories
–Testing

20 Advice on Scripting
Decomposition
–Solve big problems by decomposing them into small problems and solving them
Stories
–Scripting/programming as a form of literature
–Use comments with code to tell a clear story about what the code is or should be doing
Testing
–Everything, whole and part, often, varying inputs
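One way these three habits might look in a short script is sketched below. The task (pulling a time and a date out of a status line) and the function names are invented for illustration; the sample text is the one used on slide 16.

```python
import re

# Story: a status line such as "(As of 10:55 AM on 12/20/01)" carries a
# clock time and a date. We want both, as a pair of strings.

# Decomposition: one small function per small problem.
def find_time(text):
    """Return the first clock time like '10:55 AM', or None."""
    match = re.search(r"\d{1,2}:\d{2} [AP]M", text)
    return match.group() if match else None

def find_date(text):
    """Return the first date like '12/20/01', or None."""
    match = re.search(r"\d{2}/\d{2}/\d{2}", text)
    return match.group() if match else None

def parse_status(text):
    """Combine the two small steps into the whole task."""
    return find_time(text), find_date(text)

# Testing: each part, then the whole, with varied inputs.
assert find_time("(As of 10:55 AM on 12/20/01)") == "10:55 AM"
assert find_time("no time here") is None
assert find_date("(As of 10:55 AM on 12/20/01)") == "12/20/01"
assert parse_status("9:05 PM, 01/02/03") == ("9:05 PM", "01/02/03")
print("all tests passed")
```

Each function is small enough to test alone, and the comments tell the reader what the code is meant to do before they read the patterns.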

21 Readings
IDT book, chapter 8, “Text and Pattern Processing”
Further information (but beyond the scope of 101)
–The Python online documentation on the re module
–“Regular Expression HOWTO” by A.M. Kuchling at http://py-howto.sourceforge.net/ and also at http://py-howto.sourceforge.net/regex/regex.html

