1 13.2 Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code.

1 13.2 Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code Function chr returns the character with the given character code >>> ord('ff') Traceback (most recent call last): File " ", line 1, in ? TypeError: ord() expected a character, but string of length 2 found >>> ord('f') 102 >>> ord('.') 46 >>> chr(46) '.'

2 Characters and Strings Since characters and strings are fundamental in python, there are a lot of useful methods for dealing with them (fig. 13.2).

3 13.2 Fundamentals of Characters and Strings

6 fig13_03.py 1 # Fig. 13.3: fig13_03.py 2 # Simple output formatting example. 3 4 string1 = "Now I am here." 5 6 print string1.center( 50 ) 7 print string1.rjust( 50 ) 8 print string1.ljust( 50 ) Now I am here. Centers calling string in a new string of 50 charactersRight-aligns calling string in new string of 50 charactersLeft-aligns calling string in new string of 50 characters Remember: strings are immutable; a string manipulating function returns a new string >>> aString = 'gacataggt' >>> >>> aString.upper() 'GACATAGGT' >>> >>> aString 'gacataggt'

7 fig13_04.py 1 # Fig. 13.4: fig13_04.py 2 # Stripping whitespace from a string. 3 4 string1 = "\t \n This is a test string. \t\t \n" 5 6 print 'Original string: "%s"\n' % string1 7 print 'Using strip: "%s"\n' % string1.strip() 8 print 'Using left strip: "%s"\n' % string1.lstrip() 9 print "Using right strip: \"%s\"\n" % string1.rstrip() Original string: " This is a test string. " Using strip: "This is a test string." Using left strip: "This is a test string. " Using right strip: " This is a test string." Removes leading whitespace from string Removes trailing whitespace from string Removes leading and trailing whitespace from string

8 13.4 Searching Strings Method find, index, rfind and rindex search for substrings in a calling string Methods startswith and endswith return 1 if a calling string begins with or ends with a given string, respectively Method count returns number of occurrences of a substring in a calling string Method replace substitutes its second argument for its first argument in a calling string

9 s = "actgccgacgatcgcgcatcagcg" index_string= "012345678901234567890123" # length 24 print s print index_string, "\n" print "gc occurs %d times" % s.count( "gc" ) print “(%d times from index 13)\n" % s.count( "gc", 13, len(s) ) # same result as s[13:len(s)].count("gc") print "first occurrence of gc: index %d" % s.find( "gc" ) print "first occurrence of x: index %d\n" % s.find( “x" ) # -1 is a number, program breaks down later if string not found? # index(): as find() but raises exception if string is not found if s.startswith( "AC" ): print "sequence starts with AC" else: print "sequence doesn't start with AC" # case sensitive! print "last occurrence of gc: index %d\n" % s.rfind( "gc" ) print "replacing gc with GC:\n%s\n" %s.replace( "gc", "GC" ) print "replace 2 occurrences max:\n%s" %s.replace( "gc", "GC", 2 ) actgccgacgatcgcgcatcagcg 012345678901234567890123 gc occurs 4 times (3 times from index 13) first occurrence of gc: index 3 first occurrence of x: index -1 sequence doesn't start with AC last occurrence of gc: index 21 replacing 'gc' with GC: actGCcgacgatcGCGCatcaGCg replace 2 occurrences max: actGCcgacgatcGCgcatcagcg Searching Strings

10 13.5 Splitting and Joining Strings Tokenization breaks statements into individual components (or tokens) Delimiters, typically whitespace characters, separate tokens

11 fig13_06.py 1 # Fig. 13.6: fig13_06.py 2 # Token splitting and delimiter joining. 3 4 # splitting strings 5 string1 = "A, B, C, D, E, F" 6 7 print "String is:", string1 8 print "Split string by spaces:", string1.split() 9 print "Split string by commas:", string1.split( "," ) 10 print "Split string by commas, max 2:", string1.split( ",", 2 ) 11 print 12 13 # joining strings 14 list1 = [ "A", "B", "C", "D", "E", "F" ] 15 string2 = "___" 16 17 print "List is:", list1 18 print 'Joining with ___ : %s' % ( string2.join ( list1 ) ) 19 20 print 'Joining with -.- :', "-.-".join( list1 ) String is: A, B, C, D, E, F Split string by spaces: ['A,', 'B,', 'C,', 'D,', 'E,', 'F'] Split string by commas: ['A', ' B', ' C', ' D', ' E', ' F'] Split string by commas, max 2: ['A', ' B', ' C, D, E, F'] List is: ['A', 'B', 'C', 'D', 'E', 'F'] Joining with "___": A___B___C___D___E___F Joining with "-.-": A-.-B-.-C-.-D-.-E-.-F Splits calling string by whitespace characters Return list of tokens split by first two comma delimiters Splits calling string by specified character Joins list elements with calling string as a delimiter to create new string Joins list elements with calling quoted string as delimiter to create new string

12 Intermezzo 1 www.daimi.au.dk/~chili/CSS/Intermezzi /2.10.1.html 1. Copy and run this program: /users/chili/CSS.E03/ExamplePrograms/random_text.py What does it do? 2. Extend the program: search the text string it produces and print out the index of the first occurrence of 11 (you might look at Figure 13.2 at page 438ff to find a suitable string method). Tell the user if there is no '11'. 3. Split the text into a list of substrings using '11' as a delimiter, print out the list.

13 Solution from random import randrange text = "" for i in range(150): next_char = chr( randrange(48, 58) ) text = "".join( [text, next_char] ) print text i = text.find( "11" ) if i>=0: print "'11' found at index", i splittext = text.split( "11" ) print "text split in %d pieces" %len(splittext) for piece in splittext: print piece 12601124846403636181205195266540502045412039533711871515093137304632865380923951473259266 4241323032411475077087579523798182173083754226565772851806864 '11' found at index 4 text split in 4 pieces 1260 248464036361812051952665405020454120395337 87151509313730463286538092395147325926642413230324 475077087579523798182173083754226565772851806864

14 Regular Expressions – Motivation import re text1 = "No Danish email address here bush@whitehouse.org *@$@.hls.29! fj3a" text2 = "But here: chili@daimi.au.dk what a *(.@#$ nice @#*.( email address" regularExpression = "\w+@[\w.]+\.dk" compiledRE = re.compile( regularExpression) SRE_Match1 = compiledRE.search( text1) SRE_Match2 = compiledRE.search( text2) if SRE_Match1: print "Text1 contains this Danish email address:", SRE_Match1.group() else: print "Text1 contains no Danish email address" if SRE_Match2: print "Text2 contains this Danish email address:", SRE_Match2.group() else: print "Text2 contains no Danish email address" Problem: search a text for any Danish email address: @.dk Text1 contains no Danish email address Text2 contains this Danish email address: chili@daimi.au.dk

15 13.6 Regular Expressions Provide more efficient and powerful alternative to string search methods Instead of searching for a specific string we can search for a text pattern –Don’t have to search explicitly for ‘Monday’, ‘Tuesday’, ‘Wednesday’.. : there is a pattern in these search strings. –A regular expression is a text pattern In Python, regular expression processing capabilities provided by module re

16 Example Simple regular expression: regExp = “football” - matches only the string “football” To search a text for regExp, we can use re.search( regExp, text )

17 Compiling Regular Expressions re.search( regExp, text ) 1.Compile regExp to a special format (an SRE_Pattern object) 2.Search for this SRE_Pattern in text 3.Result is an SRE_Match object If we need to search for regExp several times, it is more efficient to compile it once and for all: compiledRE = re.compile( regExp) 1.Now compiledRE is an SRE_Pattern object compiledRE.search( text ) 2.Use search method in this SRE_Pattern to search text 3.Result is same SRE_Match object

18 Searching for ‘football’ import re text1 = "Here are the football results: Bosnia - Denmark 0-7" text2 = "We will now give a complete list of python keywords." regularExpression = "football" compiledRE = re.compile( regularExpression) SRE_Match1 = compiledRE.search( text1 ) SRE_Match2 = compiledRE.search( text2 ) if SRE_Match1: print "Text1 contains the substring ‘football’" if SRE_Match2: print "Text2 contains the substring ‘football’" Text1 contains the substring 'football' Compile regular expression and get the SRE_Pattern object Use the same SRE_Pattern object to search both texts and get two SRE_Match objects (or none if the search was unsuccesful)

19 Building more sophisticated patterns Metacharacters: regular-expression syntax element ? : matches zero or one occurrences of the expression it follows + : matches one or more occurrences of the expression it follows * : matches zero or more occurrences of the expression it follows # search for zero or one t, followed by two a’s: regExp1 = “t?aa“ # search for g followed by one or more c’s followed by a: regExp1 = “gc+a“ #search for ct followed by zero or more g’s followed by a: regExp1 = “ctg*a“

20 Metacharacter example import re text = "gaaagccactgggggggggggggga" regExp1 = "t?aa" compiledRE1 = re.compile( regExp1 ) regExp2 = "gc+a" compiledRE2 = re.compile( regExp2 ) regExp3 = "ctg*a" compiledRE3 = re.compile( regExp3 ) SRE_Match1 = compiledRE1.search( text ) SRE_Match2 = compiledRE2.search( text ) SRE_Match3 = compiledRE3.search( text ) if SRE_Match1: print "Text contains the regular expression", regExp1 if SRE_Match2: print "Text contains the regular expression", regExp2 if SRE_Match3: print "Text contains the regular expression", regExp3 Text contains the regular expression t?aa Text contains the regular expression gc+a Text contains the regular expression ctg*a Compile all three regular expressions into SRE_Pattern objects Use the three SRE_Pattern objects to search the text and get three SRE_Match objects

21 ^ : indicates placement at the beginning of the string $ : indicates placement at the end of the string # search for zero or one t, followed by two a’s # at the beginning of the string: regExp1 = “^t?aa“ # search for g followed by one or more c’s followed by a # at the end of the string: regExp1 = “gc+a$“ # whole string should match ct followed by zero or more # g’s followed by a: regExp1 = “^ctg*a$“ A few more metacharacters

22 Metacharacter example import re text1 = "aactggagcccca" text2 = "ctgga" regExp1 = "^t?aa" regExp2 = "gc+a$" regExp3 = "^ctg*a$" if re.search( regExp1, text1 ): print "Text1 contains the regular expression", regExp1 if re.search( regExp2, text1 ): print "Text1 contains the regular expression", regExp2 if re.search( regExp3, text1 ): print "Text1 contains the regular expression", regExp3 if re.search( regExp3, text2 ): print "Text2 contains the regular expression", regExp3 Text1 contains the regular expression ^t?aa Text1 contains the regular expression gc+a$ Text2 contains the regular expression ^ctg*a$ This time we use re.search() to search the text for the regular expressions directly without compiling them in advance

23 {} : indicate repetition | : match either regular expression to the left or to the right () : indicate a group (a part of a regular expression) # search for four t’s followed by three c’s: regExp1 = “t{4}c{3}“ # search for g followed by 1 to 3 c’s: regExp1 = “gc{1,3}$“ # search for either gg or cc: regExp1 = “gg|cc“ # search for either gg or cc followed by tt: regExp1 = “(gg|cc)tt“ Yet more metacharacters..

24 \ : used to escape (to ‘keep’) a metacharacter # search for x followed by + followed by y: regExp1 = “x\+y“ # search for ( followed by x followed by y: regExp1 = “\(xy“ # search for x followed by ? followed by y: regExp1 = “x\?y“ # search for x followed by at least one ^ followed by 3: regExp1 = “x\^+3“ Escaping metacharacters

25 Intermezzo 2 http://www.daimi.au.dk/~chili/CSS/ Intermezzi/2.10.2.html Copy and run this program: /users/chili/CSS.E03/ExamplePrograms/sequence_searching.py What does it do? Put in more regular expressions in the list to search for these patterns: 1. 6 c's followed by 3 g's 2. cc, followed by at least one g, followed by cc 3. double triplets (e.g. aaa followed by ccc) 4. any number of a's, followed by either cc or gg, followed by c at the end of the string

26 Solution import re # this is a dna sequence in fasta format: seq = """>U03518 Aspergillus awamori\naacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgt ctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccccg ggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgattgaatg caatcagttaaaactttcaacaatggatctcttggttccggc""" regular_expressions = [ "a{4}", "c+(t|g)tt", "g*c$", "(gt){2}", "c{6}g{3}", "ccg+cc", "(aaa|ccc|ggg|ttt){2}", "a*(cc|gg)c$" ] for regExp in regular_expressions: if re.search( regExp, seq ): print "found", regExp 1.6 c's followed by 3 g's 2.cc, followed by at least one g, followed by cc 3.double triplets (e.g. aaa followed by ccc) 4.any number of a's, followed by either cc or gg, followed by c at the end of the string

27 Character Classes A character class matches one of the characters in the class: [abc] matches either a or b or c. d[abc]d matches dad and dbd and dcd [ab]+c matches e.g. ac, abc, bac, bbabaabc,.. Metacharacter ^ at beginning negates character class: [^abc] matches any character other than a, b and c A class can use – to indicate a range of characters: [a-e] is the same as [abcde] Characters except ^ and – are taken literally in a class: [a+b*] matches a or + or b or *

28 Special Sequences Special sequence: shortcut for a common character class regExp1 = “\d\d:\d\d:\d\d [AP]M” # (possibly illegal) time stamps like 04:23:19 PM regExp2 = "\w+@[\w.]+\.dk“ # any Danish email address

29 import re text = "1a2b3c4d5e6f" print re.sub( "\d", "*", text ) # substitute * for any digit (i.e. replace digit with *) print print re.sub( "\d", "*", text, 3 ) # substitute * for any digit, max 3 times print print re.split( "\d", text ) # delimiter: any digit print print re.split( "[a-z]", text ) # delimiter: any lower-case letter print if re.search( "\db", text ): # the RE of search() can appear anywhere in text print "method search found \db" if re.match( "\db", text ): # the RE of match() must appear in beginning of text print "method match found \db“ if re.match( "\da", text ): print "method match found \da" Other regular expression functions *a*b*c*d*e*f *a*b*c4d5e6f ['', 'a', 'b', 'c', 'd', 'e', 'f'] ['1', '2', '3', '4', '5', '6', ''] method search found \db method match found \da

30 Groups We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object: text = "But here: chili@daimi.au.dk what a *(.@#$ nice @#*.( email address“ regExp = "\w+@[\w.]+\.dk“ # match Danish email address compiledRE = re.compile( regExp) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish email address:", SRE_Match.group()

31 13.11 Grouping The substring that matches the whole RE called a group RE can be subdivided into smaller groups (parts) Each group of the matching substring can be extracted Metacharacters ( and ) denote a group text = "But here: chili@daimi.au.dk what a *(.@#$ nice @#*.( email address“ # Match any Danish email address; define two groups: username and domain: regExp = “(\w+)@([\w.]+\.dk)“ compiledRE = re.compile( regExp ) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish email address:", SRE_Match.group() print “Username:”, SRE_Match.group(1), “\nDomain:”, SRE_Match.group(2) Text2 contains this Danish email address: chili@daimi.au.dk Username: chili Domain: daimi.au.dk

32 Greedy vs. non-greedy operators + and * are greedy operators –They attempt to match as many characters as possible even if this is not the desired behavior +? and *? are non-greedy operators –They attempt to match as few characters as possible

33 Greedy vs. non-greedy operators # Task: Find a space-separated list of digits, extract the first number. import re text = "1 2 3 4 5 blah blah" # use greedy operator + regExp = "(\d )+" print "Greedy operator:", re.match( regExp, text ).group() # use non-greedy version instead (by putting a ? after the +) regExp = "(\d )+?" print "Non-greedy operator:", re.match( regExp, text ).group() Greedy operator: 1 2 3 4 5 Non-greedy operator: 1

1 13.2 Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code.

Similar presentations

Presentation on theme: "1 13.2 Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 13.2 Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code.

Similar presentations

Presentation on theme: "1 13.2 Fundamentals of Characters and Strings Characters: fundamental building blocks of Python programs Function ord returns a character’s character code."— Presentation transcript:

Similar presentations

About project

Feedback