1 Programming in Python Michael Schroeder Andreas Henschel {ms,

1 Programming in Python Michael Schroeder Andreas Henschel {ms, ah}@biotec.tu-dresden.de

2 Motivation mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpep mktvrqerlksivrilerskepvsgaqlaeelsvsrqvivqdiaylrslgynivatprgyvlagg kaltarqqevfdlirdhisqtgmpptraeiaqrlgfrspnaaeehlkalarkgvieivsgasrgirllqee mrssakqeelvkafkallkeekfssqgeivaalqeqgfdninqskvsrmltkfgavrtrnakmemvyclpaelgvptt gqrhikireiimsndietqdelvdrlreagfnvtqatvsrdikemqlvkvpmangrykyslpsdqrfnplqklkr kgqrhikireiitsneietqdelvdmlkqdgykvtqatvsrdikelhlvkvptnngsykyslpadqrfnplsklkr dvtgriaqtllnlakqpdamthpdgmqikitrqeigqivgcsretvgrilkmledqnlisahgktivvygt dikqriagffidhanttgrqtqggvivsvdftveeianligssrqttstalnslikegyisrqgrghytipnlvrlkaaa iderdkiileilekdartpfteiakklgisetavrkrvkaleekgiiegytikinpkklg elqaiapevaqslaeffavladpnrlrllsllarselcvgdlaqaigvsesavshqlrslrnlrlvsyrkqgrhvyyqlqdhhivalyqnaldhlqec mntlkkafeildfivknpgdvsvseiaekfnmsvsnaykymvvleekgfvlrkkdkryvpgyklieygsfvlrrf lfneiiplgrlihmvnqkkdrllneylsplditaaqfkvlcsircaacitpvelkkvlsvdlgaltrmldrlvckgwverlpnpndkrgvlvklttggaaiceqchqlvgqdlhqeltknltadevatleyllkkvlp nypvnpdlmpalmavfqhvrtriqseldcqrldltppdvhvlklideqrglnlqdlgrqmcrdkalitrkirelegrnlvrrernpsdqrsfqlfltdeglaihqhaeaimsrvhdelfapltpveqatlvhlldqclaaq tdilreigmiaraldsisniefkelsltrgqylylvrvcenpgiiqekiaelikvdrttaaraikrleeqgfiyrqedasnkkikriyatekgknvypiivrenqhsnqvalqglseveisqladylvrmrknvsedwefvkkg mskindindlvnatfqvkkffrdtkkkfnlnyeeiyilnhilrsesneisskeiakcsefkpyyltkalqklkdlkllskkrslqdertvivyvtdtqkaniqkliseleeyikn aitkindcfellsmvtyadklkslikkefsisfeefavltyisenkekeyylkdiinhlnykqpqvvkavkilsqedyfdkkrnehdertvlilvnaqqrkkiesllsrvnkrit miimeeakkliielfselakihglnksvgavyailylsdkpltisdimeelkiskgnvsmslkkleelgfvrkvwikgerknyyeavdgfssikdiakrkhdliaktyedlkkleekcneeekefikqkikgiermkkisekilealndld aqspagfaeeyiiesiwnnrfppgtilpaerelseligvtrttlrevlqrlardgwltiqhgkptkvnnfwets eekrsstgflvkqraflklymitmteqerlyglkllevlrsefkeigfkpnhtevyrslhellddgilkqikvkkegaklqevvlyqfkdyeaaklykkqlkveldrckkliekalsdnf hmqaeilltlklqqklfadprrisllkhialsgsisqgakdagisyksawdainemnqlsehilveratggkggggavltrygqrliqlydllaqiqqkafdvlsdddalplnsllaaisrfslqts skvtyiikasndvlnektatilitiakkdfitaaevrevhpdlgnavvnsnigvlikkglveksgdgliitgeaqdiisnaatlyaqenapellk sprivqsndlteaayslsrdqkrmlylfvdqirksdgtlqehdgiceihvakyaeifgltsaeaskdirqalksfagkevvfyrpeedagdekgyesfpwfikpahspsrglysvhinpylipffiglq nrftqfrlsetkeitnpyamrlyeslcqyrkpdgsgivslkidwiieryqlpqsyqrmpdfrrrflqvcvneinsrtpmrlsyiekkkgrqtthivfsfrdit lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelarrhl lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelayrhlgypppv egldefdrkilktiieiyrggpvglnalaaslgveadtlsevyepyllqagflartprgrivtekaykhlkyevp iseevliglplheklfllaivrslkishtpyitfgdaeesykivceeygerprvhsqlwsylndlrekgivetrqnkrgegvrgrttlisigtepldtleavitklikeelr kyeltlqrslpfiegmltnlgamklhkihsflkitvpkdwgynritlqqlegylntladegrlkyiangsyeiv pmkteqkqeqetthknieedrklliqaaivrimkmrkvlkhqqllgevltqlssrfkprvpvikkcidiliekeylervdgekdtysyla gspekilaqiiqehregldwqeaatraslsleetrkllqsmaaagqvtllrvendlyaist eryqawwqavtraleefhsryplrpglareelrsryfsrlparvyqalleewsregrlqlaantvalagftps fsetqkkllkdledkyrvsrwqppsfkevagsfnldpseleellhylvregvlvkindefywhr qalgeareviknlastgpfglaeardalgssrkyvlplleyldqvkftrrvgdkrvvvgn vpkrvywemlatnltdkeyvrtrralileilikagslkieqiqdnlkklgfdevietiendikglintgifieikgrfyqlkdhilqfvipnrgvtkqlv irtfgwvqnpgkfenlkrvvqvfdrnskvhnevknikiptlvkeskiqkelvaimnqhdliytykelvgtgtsirseapcdaiiqatiadqgnkkgyidnwssdgflrwahalgfieyinksdsfvitdvglaysksad gsaiekeilieaissyppairiltlledgqhltkfdlgknlgfsgesgftslpegilldtlanampkdkgeirnnwegssdkyarmiggwldklglvkqgkkefiiptlgkpdnkefishafkitgeglkvlrrakgstkftr All these sequences are winged helix DNA binding domains. How can we group them into families?

3 Motivation: Let's rebuild SCOP families Given a SCOP superfamily and its sequences, how can we divide it into families? First, we need dynamic programming to determine the sequence similarity Then we do the following: –For all pairs of sequences, call the sequence similarity algorithm and record the similarity into a distance matrix –Next, run hierarchical clustering to cluster the sequences.

4 Python for Bioinformatics Lecture 1: Datatypes and Loops Slides derived from Ian Holmes Department of Statistics University of Oxford

5 Goals of this course Concepts of computer programming Rudimentary Python (widely-used language) Introduction to Bioinformatics file formats Practical data-handling algorithms Exposure to Bioinformatics software

6 Literature/Material Textbook: Python in a Nutshell, Alex Martelli Textbook: Python Cookbook, Alex Martelli, David Ascher (both published by O'Reilly) Python Course in Bioinformatics, K. Schuerer/C. Letondal, Pasteur University (pdf) a lot of online material (see course homepage http://www.biotec.tu-dresden.de/schroeder/group/ teaching/bioinfo2/python.html )

7 Style of this lecture The color scheme for programs, output and text files: Interaction with the Python shell: very handy for quick tests. Helps beginners to overcome physiological barrier: Go ahead, try things out! The main programThe program output Files are shown in yellow The filename goes here >>> (Python Expression) (immediate Python result) Prompt, (python expects input here) Press Enter

8 General principles of programming Make incremental changes Test everything you do –use the Python shell for testing expressions/functions interactively –the edit-run-revise cycle Write so that others can read it –(when possible, write with others) Think before you write Use a good text editor (emacs)

9 Python/Emacs IDE

10 Python: Motivation Well suited for scripting (better syntax than Perl) However, capable of Object Orientation Hence complex data types and large projects feasible, reuse of code (BioPython) Universal language, Applications in and beyond bioinformatics: Amber, ProHit, PyRat, PyMOL, Gene2EST/Google, CGI, Zope Compatible with most software technologies: GUI, MPI, OpenGL, Corba, RDB Test complicated expressions in python shell

11 Python basics Basic syntax of a Python program: # Elementary Python program print "Hello World" print statement tells Python to print the following stuff to the screen Single or double quotes enclose a "string literal" Lines beginning with " # " are comments, and are ignored by Python Hello World

12 Variables We can tell Python to "remember" a particular value, using the assignment operator " = ": The x is referred to as a "scalar variable". Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_" x = 3 print x 3 x = "ACGCGT" print x ACGCGT Binding site for yeast transcription factor MCB

13 Variables and Objects Everything in Python is an object An object models a real-world entity objects possess methods (also called functions) that are typically applied to the object, possibly parameterized objects can also possess variables, that describe their state e.g. x.upper() is a parameter-less method, that works on the string object x Object. Method or variable

14 Arithmetic operations… Basic operators are + - / * % x = 14 y = 3 print "Sum: ", x + y print "Product: ", x * y print "Remainder: ", x % y Sum: 17 Product: 42 Remainder: 2 x = 5 print "x started as ", x x = x * 2 print "Then x was", x x = x + 1 print "Finally x was",x x started as 5 Then x was 10 Finally x was 11 Could write x *= 2 Could write x += 1

15 … Or interactively >>> x = 14 >>> y = 3 >>> x + y 17 >>> x * y 42 >>> x % y 2 >>> x = 5 >>> print "x started as", x x started as 5 >>> x *= 2 >>> print "Then x was", x Then x was 10 >>> x += 1 >>> print "Finally x was", x Finally x was 11 >>> This way, you can use Python as a calculator Can also use += -= /= *=

16 String operations Concatenation + += Can find the length of a string using the function len(x) a = "pan" b = "cake" a = a + b print a pancake a = "soap" b = "dish" a += b print a soapdish mcb = "ACGCGT" print "Length of %s is "%mcb, len(mcb) Length of ACGCGT is 6

17 String formatting Strings can be formatted with place holders for inserted strings ( %s ) and numbers ( %d for digits and %f for floats) Use Operator % on strings: >>> "aaaa%saaaa%saaa"%("gcgcgc","tttt") 'aaaagcgcgcaaaattttaaa' >>> "A range written like this: (%d - %d)" % (2,5) 'A range written like this: (2 - 5)' >>> "Or with preceeding 0's: (%03d - %04d)" % (2,5) "Or with preceeding 0's: (002 - 0005)" >>> "Rounding floats %.3f" % math.pi 'Rounding floats 3.142' >>> "Space holders: _%-7s_ and _%7s_" %("left", "right") 'Space holders: _left _ and _ right_' Formatted String % Insertion Tuple

18 More string operations x = "A simple sentence" print x print x.upper() print x.lower() xl=list(x) xl.reverse() print "".join(xl) x = x.replace("i", "a") print x print len(x) A simple sentence A SIMPLE SENTENCE a simple sentence ecnetnes elpmis A A sample sentence 17 Convert to upper case Convert to lower case Convert the string to a list Translate "i"'s into "a"'s Calculate the length of the string Reverse the list Join all list members

19 Concatenating DNA fragments dna1 = "accacgt" dna2 = "taggtct" print dna1 + dna2 "Transcribing" DNA to RNA accacguuaggucu dna = "accACgttAGGTct" rna = dna.lower().replace("t", "u") print rna Make it all lower case DNA string is a mixture of upper & lower case Replace "t" with "u" accacgttaggtct

20 Conditional blocks The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Python, this looks like this: x = 149 y = 100 if x > y: print x,"is greater than",y else: print x,"is less than", y 149 is greater than 100 These indentations tell Python which piece of code is contingent on the condition. if condition: action else: alternative Consistent, level-wise indenting important

21 Conditional operators Numeric: > >= < <= != == The same operators work on strings as alphabetic comparisons x = 5 * 4 y = 17 + 3 if x == y: print x, "equals", y 20 equals 20 Note that the test for "x equals y" is x==y, not x=y (x, y) = ("Apple", "Banana") if y > x: print y, "after", x Banana after Apple "does not equal" Shorthand syntax for assigning more than one variable at a time

22 Logical operators Logical operators : and and or The keyword not is used to negate what follows. Thus not x = y The keyword False (or the value zero) is used to represent falsehood, while True (or any non-zero value, e.g. 1) represents truth. Thus: if True: print "True is true" if False: print "False is true" if -99: print "-99 is true" True is true -99 is true x = 222 if x % 2 == 0 and x % 3 == 0: print x, "is an even multiple of 3" 222 is an even multiple of 3

23 x = 0 while x < 10: print x, x+=1 0 1 2 3 4 5 6 7 8 9 The indented code is repeatedly executed as long as the condition x<10 remains true Loops Here's how to print out the numbers 0 to 9: This is a while loop. The code is executed while the condition is true. Equivalent to x = x + 1

24 A common kind of loop Let's dissect the code of the while loop again: Alternatively, the for loop construct iterates through a list x = 0 while x < 10: print x, x+=1 Initialisation Test for completion Continuation for x in range(10): print x, Iteration variable Generates a list [0,1, …,9]

25 For loop features Loops can be used with all iteratable types, ie.: lists, strings, tuples, iterators, sets, file handlers Stepsizes can be specified with the 3. argument of the slice constructor (negative values for iterating backwards) >>> for nucleotide in "actgc":... print nucleotide, a c t g c >>> for number in range(50)[::7]:... print number, 0 7 14 21 28 35 42 49 >>> for nucleotide in "actgc"[::-1]:... print nucleotide, c g t c a

26 Reading Data from Files To read from a file, we can conveniently iterate through it linewise with a for-loop and the open function. Internally a filehandle is maintained during the loop. This code snippet opens a file called "sequence.txt" in the in the current directory, and iterates through it line by line for line in open("sequence.txt"): print line, >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC sequence.txt The comma prevents print 's automatic newline

27 Python for Bioinformatics Lecture 2: Sequences and Lists

28 Summary: scalars and loops Assignment operator Arithmetic operations String operations Conditional tests Logical operators Loops Reading a file x = 5 y = x * 3 if y > 10: print s s = "Concatenating " + "strings" if y > 10 and not s == "": print s for x in range(10): print x for line in open("sequence.txt"): print line,

29 Pattern-matching A very sophisticated kind of logical test is to ask whether a string contains a pattern e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT? name = "YBR007C" dna="TAATAAAAAACGCGTTGTCG" if "ACGCGT" in dna: print name, "has MCB!" 20 bases upstream of the yeast gene YBR007C The membership operator in The pattern for the MCB binding site YBR007C has MCB!

30 FASTA format A format for storing multiple named sequences in a single file This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488 >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT Name of sequence is preceded by > symbol NB sequences can span multiple lines Call this file fly3utr.txt

31 Printing all sequence names in a FASTA database for line in open("fly3utr.txt"): if line.startswith(">"): print line, >CG11604 >CG11455 >CG11488

32 Finding all sequence lengths length=0 name="" for line in open("/home/bioinf/ah/tmp/sequence.txt"): line=line.rstrip() if line.startswith(">"): if name and length: print name, length name=line[1:] length=0 else: length+=len(line) print name, length CG11604 58 CG11455 83 CG11488 69 The rstrip statement trims the white space characters off the right end. Try it without this and see what happens – and if you can work out why >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT

33 Reverse complementing DNA def revcomp(dna): replaced=list(dna.lower(). replace("a","x").replace("t","a"). replace("x", "t").replace("g","x"). replace("c","g").replace("x", "c")) replaced.reverse() return "".join(replaced) print revcomp("accACgttAGgtct ") agacctaacgtggt Start by making string lower case again. This is generally good practice Reverse the list Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a' A common operation due to double-helix symmetry of DNA

34 Lists A list is a list of variables We can think of this as a list with 4 entries nucleotides = ['a', 'c', 'g', 't'] print "Nucleotides: ", nucleotides Nucleotides: ['a', 'c', 'g', 't'] acgt element 0 element 1 element 2 element 3 the list is the set of all four elements Note that the element indices start at zero.

35 List literals There are several, equally valid ways to assign an entire array at once. a = [1,2,3,4,5] print "a = ",a b = ['a','c','g','t'] print "b = ",b c = range(1,6) print "c = ",c d = "a c g t".split() print "d = ", d a = [1,2,3,4,5] b = ['a','c','g','t'] c = [1,2,3,4,5] d = ['a','c','g','t'] This is the most common: a comma- separated list, delimited by squared brackets

36 Accessing lists To access list elements, use square brackets e.g. x[0] means "element zero of list x " Remember, element indices start at zero! Negative indices refer to elements counting from the end e.g. x[-1] means "last element of list x " x = ['a', 'c', 'g', 't'] i=2 print x[0], x[i], x[-1] a g t

37 List operations You can sort and reverse lists... You can read the entire contents of a file into an array (each line of the file becomes an element of the array) x = ['a', 't', 'g', 'c'] print "x =",x x.sort() print "x =",x x.reverse() print "x =",x x = a t g c x = a c g t x = t g c a seqfile = open, "C:/sequence.txt" x =

38 Applying Methods to Objects Instances of lists, strings, etc. are objects with built-in methods Explore available methods using dir : >>> dir("hello") ['__add__', …,'__str__', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'] >>> help("hello".count) (…)Return the number of occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation. >>> "hello".count("l") 2 String objectMethod. (dot) applies method to object List of applicable methods

39 List operations >>> x=[1,0]*5 >>> x [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] >>> while 0 in x: print x.pop(), 0 1 0 1 0 1 0 1 0 >>> x [1] >>> x.append(2) >>> x [1, 2] >>> x+=x >>> x [1, 2, 1, 2] >>> x.remove(2) >>> x [1, 1, 2] >>> x.index(2) 2 pop removes the last element of a list append adds an element to the end of a list Multiplying lists with * concatenating lists with + or += Removing the first occurrence of an element Position of an element

40 for loop revisited Finding the total of a list of numbers: Equivalent to: val = [4, 19, 1, 100, 125, 10] total = 0 for x in val: total += x print total 259 val = [4, 19, 1, 100, 125, 10] total = 0 for i in range(len(val)): total += val[i] print total 259 for statement loops through each entry in a list

41 Modules Additional functionality, that is not part of the core language, is in modules like –sys (system) –re (regular expressions) –math (mathematics) Load modules with import You can write your own modules and import them >>> import math >>> help(math) Help on built-in module math: …

42 The sys.argv list A special list is sys.argv This contains the command-line arguments if the program is invoked at the command line It's a way for the user to pass information into the program, if you don't have a graphical interface with which to do this import sys print sys.argv ah@studipool1> python args.py abc 123 ['args.py', 'abc', '123'] File args.py Output at command line

43 Converting a sequence into a list The underlying programming language C treats all strings as lists >>> dna="acgtcgaga" >>> list(dna) ['a', 'c', 'g', 't', 'c', 'g', 'a', 'g', 'a'] >>> """You can also make use of long strings and the split function""".split() ['You', 'can', 'also', 'make', 'use', 'of', 'long', 'strings', 'and', 'the', 'split', 'function'] Data types can be converted. Here the list function converts a string into a list. Triple quotes allow for strings that stretch over several lines

44 Taking a slice of a list The syntax x[i:j] returns a list containing elements i,i+1,…,j-1 of list x nucleotides = ['a', 'g', 'c', 't'] purines = nucleotides[0:2] # nucleotides[:2] also works pyrimidines = nucleotides[2:4]# nucleotides[2:] also works print "Nucleotides:", nucleotides print "Purines:", purines print "Pyrimidines:", pyrimidines Nucleotides: ['a', 'g', 'c', 't'] Purines: ['a', 'g'] Pyrimidines: ['c', 't']

45 Applying a function to a list The map command applies a function to every element in an array Similar syntax to list: map(EXPR,LIST) applies EXPR to every element in LIST EXPR can be arbitrary function, defined elsewhere or lambda calculus expression Lambda calculus: provides "anonymous" function, constructed with keyword lambda, a set of parameters, and an expression with these Example: multiply every number by 3 >>> map(lambda x: x*3, [1,2,3]) [3, 6, 9]

46 Python for Bioinformatics Lecture 3: Patterns and Functions

47 Review: pattern-matching The following code: prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the string variable "sequence" We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax: We can replace all occurrences by omitting the optional count argument if "ACGCGT" in dna: print "Found MCB binding site!" dna.replace("ACGCGT","_MCB_") dna.replace("ACGCGT","_MCB_", 1) countpatternreplacement

48 Regular expressions Python provides a pattern-matching engine Patterns are called regular expressions They are extremely powerful Often called "regexps" for short import module re

49 Motivation: N-glycosylation motif Common post-translational modification Attachment of a sugar group Occurs at asparagine residues with the consensus sequence NX 1 X 2, where –X 1 can be anything (but proline inhibits) –X 2 is serine or threonine Can we detect potential N-glycosylation sites in a protein sequence?

50 Building regexps In general square brackets denote a set of alternative possibilities Use - to match a range of characters: [A- Z]. matches anything \s matches spaces or tabs \S is anything that's not a space or tab [^X] matches anything but X

51 Using Regular Expressions Compile a regular expression object (pattern) using re.compile pattern has a number of methods (try dir(pattern)), eg. –match (in case of success returns a Match object, otherwise None) –search (scans through a string looking for a match) –findall (returns a list of all matches) >>> import re >>> pattern = re.compile('[ACGT]') >>> if pattern.match("A"): print "Matched" Matched >>> if pattern.match("a"): print "Matched" >>> successful match unsuccessful, returns None by default case sensitive

52 Matching alternative strings /(this|that)/ matches "this" or "that"...and is equivalent to /th(is|at)/ >>> pattern=re.compile("(this|that|other)", re.IGNORECASE) >>> pattern.search("Will match THIS") ## success >>> pattern.search("Will also match THat") ## success >>> pattern.search("Will not match ot-her") ## will return None >>> case unsensitive search pattern Python returns a description of the match object

53 Matching multiple characters x* matches zero or more x's x+ matches one or more x's x{n} matches n x's x{m,n} matches from m to n x's Word and string boundaries ^ matches the start of a string $ matches the end of a string \b matches word boundaries

54 "Escaping" special characters \ is used to "escape" characters that otherwise have meaning in a regexp so \[ matches the character " [ " –if not escaped, " [ " signifies the start of a list of alternative characters, as in [ACGT] All special characters:. ^ $ * + ? { [ ] \ | ( )

55 Substitutions/Match Retrieval regexp methods can be used without compiling (less efficient but easier to use) Example re.sub (substitution): Example re.findall : >>> re.sub("(red|blue|green)", "color", "blue socks and red shoes") 'color socks and color shoes' >>> e,raw,frm,to = re.findall("\d+", \ "E-value: 4, \ Raw Bit Score: 165, \ Match position: 362-419") >>> print e, raw, frm, to 4 165 362 419 \ allows multiple line commands alternatively, construct multi-line strings using triple quotes """ …""" The result, a list of 4 strings, is assigned to 4 variables matches one or more digits

56 N-glycosylation site detector >>> protein=""" MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLI IIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQ HYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMF KHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYM DEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANAL KPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI""".upper().replace("\n","") >>> for match in re.finditer("N[^P][ST]", protein): print match.group(), match.span() NGS (26, 29) NLT (214, 217) NGT (250, 253) multi-line string, upper case, line breaks removed N[^P][ST] - the main regular expression re.finditer provides an iterator over match-objects match.group and match.span print the actual matched string and the position-tuple. Altenatively, you can print gene[match.start():match.end()]

57 PROSITE and Pfam PROSITE – a database of regular expressions for protein families, domains and motifs Pfam – a database of Hidden Markov Models (HMMs) – equivalent to probabilistic regular expressions

58 Another Example: Ferredoxins are a group of iron-sulfur proteins which mediate electron transfer The share the motif C, then two residues, C, then two residues, C, then three residues, C, then either P,E, or G The 4 C's are 4Fe-4S ligands What is the corresponding Python

59 Another Example: Ferredoxins are a group of iron-sulfur proteins which mediate electron transfer The share the motif C, then two residues, C, then two residues, C, then three residues, C, then either P,E, or G The 4 C's are 4Fe-4S ligands What is the corresponding Python code? C.{2}C.{2}C.{3}C[PEG]

60 Another Example: Courtesy of Chris Bystroff

61 Courtesy of Chris Bystroff

62 Another Example: Courtesy of Chris Bystroff

63 Another Example Regular expressions are useful to parse text Example: extract information from Blast output, such as –species name –E value –Score –ID

64 Another Example BLASTP 2.2.6 [Apr-09-2003] RID: 1062117117-16602-2157828.BLASTQ3 Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein). (422 letters) Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF 1,509,571 sequences 486,132,453 total letters Results of PSI-Blast iteration 1 Sequences with E-value BETTER than threshold Score E Sequences producing significant alignments: (bits) Value gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a Paired box h... 781 0.0 gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo... 780 0.0 gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg... 778 0.0 gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus] 776 0.0 gi|7305369|ref|NP_038655.1| paired box gene 6 small eye Dickie's sm... 776 0.0 gi|383296|prf||1902328A PAX6 gene 775 0.0 gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b Paired box h... 775 0.0 gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus] 773 0.0 gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus] 770 0.0 gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis] 768 0.0 …

65 Functions Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program. Code for such a task is called a function Examples of such tasks: –finding the length of a sequence –reverse complementing a sequence –finding the mean of a list of numbers NB: Python provides the function len(x) to do this already

66 Maximum element of a list Function to find the largest entry in a list def find_max(data): max = data.pop() for x in data: if x > max: max = x return max data = [1, 5, 1, 12, 3, 4, 6] print "Data:", data print "Maximum:", find_max(data) Data: [1, 5, 1, 12, 3, 4, 6] Maximum: 12 Function declaration Function result Function body Function call

67 Reverse complement from string import maketrans def revcomp(seq): translation = maketrans("agct", "tcga") comp = seq.translate(translation) rcomp = comp[::-1] # reversing comp return rcomp dna = "cggcgt" rev = revcomp(dna) print "Revcomp of %s is %s"%(dna, rev) Revcomp of cggcgt is acgccg The arguments follow the function name in parantheses (in this case seq, the sequence to be revcomp'd) By default, translation and comp are local variables, ie., they "live" only inside the surrounding function return announces that the return value of this function is whatever's in rcomp string formatted with place holders

68 revcomp goes OO from string import * class DNA: def __init__(self, sequence): self.seq=sequence.lower() def revcomp(self): translation = maketrans("agct", "tcga") comp = self.seq.translate(translation) self.revcomp = comp[::-1] def report(self): print "Revcomp of %s is %s"%\ (self.seq,self.revcomp) dna = DNA("accggcatg")# Creating a DNA object dna.revcomp() dna.report() Class Constructor saves input sequence as object variable in lower case self refers to the current object, gives access to all its variables method calls Useful to structure code : add additional DNA sequence functionality to this class, eg. a function that calculates GC-contents, translation to protein etc.

69 Mean & standard deviation from math import sqrt def mean_sd(data): n = len(data) sum = 0 sqSum = 0 for x in data: sum += x sqSum += x * x mean = sum / n variance = sqSum / n - mean * mean sd = sqrt (variance) return (mean, sd) data = [1, 5, 1, 12, 3, 4, 6] (mean, sd) = mean_sd (data) print "Data:", data print "Mean:", mean print "Standard deviation:", sd Function returns a two-element tuple: (mean,sd) Function takes a list of n numeric arguments Importing square root function from module math

70 Including variables in patterns Function to find number of instances of a given binding site in a sequence def count_matches(pattern, text): pos=text.rfind(pattern) if pos==-1: return 0 else: return count_matches(pattern, text[:pos])+1 print count_matches("ACGCGT", "ACGCGTAAGTCGGCACGCGTACGCGT") 3 finds rightmost position, where pattern matches in text text="ACGCGTAAGTCGGCACGCGTACGCGT" print text.count("ACGCGT") call recursively with text to the left of rightmost match, count up one no match NB: Built-in string method count also does the job

71 Python for Bioinformatics Lecture 4: Dictionaries

72 Data structures Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line: Cyp12a5 Mitochondrion MRG15 Nucleus Cop Golgi bor Cytoplasm Bx42 Nucleus Suppose this file is in "c:/genecomp.txt"

73 Reading a table of data We can split each line into a 2-element list using the split command. This breaks the line at each space: The opposite of split is join, which makes a string from a list of strings genes, comps= [], [] for line in open("C:/genecomp.txt"): gene, comp = line.split() genes.append(gene) comps.append(comp) print "Genes:", " - ".join(genes) print "Compartments:", " ".join(comps) Genes: Cyp12a5 - MRG15 – Cop - bor - Bx42 Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus

74 Finding an entry in a table The following code assumes that we've already read in the table from the file: Example: sys.argv[1] = "Cop" import sys geneToFind = sys.argv[1] for i in range(len(genes)): if genes[i]==geneToFind: print "Gene:", genes[i] print "Compartment:", comps[i] sys.exit() print "Couldn't find gene" Searching for gene Cop Gene: Cop Compartment: Golgi

75 Binary search The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through ½N entries to find the one we want. For the full Drosophila genome, N=12,000. This is painfully slow. An alternative is the Binary Search algorithm: Start with a sorted list. Compare the middle element with the one we want. Pick the half of the list that contains our element. Iterate this procedure to "home in" on the right element. This takes log 2 (N) steps.

76 Dictionaries (hashes) Implementing algorithms like binary search is a common task in languages like C. Conveniently, Python provides a type of array called a dictionary (also called a hash) that does something similar for you. A dictionary is a set of key:value pairs (like our gene:compartment table) comp["Cop"] = "Golgi" Squared brackets [] are used to index a dictionary

77 keys and values keys returns the list of keys in the hash –e.g. names, in the name2seq hash values returns the list of values –e.g. sequences, in the name2seq hash name2seq = read_FASTA ("C:/fly3utr.txt") print "Sequence names: ", " ".join(name2seq.keys()) print "Total length: ", len("".join(name2seq.values())) Sequence names: CG11488 CG11604 CG11455 Total length: 210

78 Getting familiar with hashes >>> tlf={"Michael" : 40062, \ "Bingding" : 40064, "Andreas": 40063 } >>> tlf.keys() ['Bingding', 'Andreas', 'Michael'] >>> tlf.values() [40064, 40063, 40062] >>> tlf["Michael"] 40062 >>> tlf.has_key("Lars") False >>> tlf["Lars"] = 40070 >>> tlf.has_key("Lars") # now its there True >>> for name in tlf.keys():... print name, tlf[name]... Lars 40070 Bingding 40064 Andreas 40063 Michael 40062 Creating an initial phone book Asking for all keys Asking for all values Asking for a value, given a key Checking whether a key is in the list Inserting a single key:value pair Looping through the dictionary

79 Reading a table using hashes import sys comps={} for line in open("C:/genecomp.txt"): gene, comp = line.split() comps[gene] = comp geneToFind=sys.argv[1] print "Gene:", geneToFind print "Compartment:", comp[geneToFind] Gene: Cop Compartment: Golgi...with sys.argv[1] = "Cop" as before:

80 Reading a FASTA file into a hash def read_fasta(filename): name = None name2seq = {} for line in open(filename): if line.startswith(">"): if name: name2seq[name]=seq name=line[1:].rstrip() seq="" else: seq+=line.rstrip() name2seq[name]=seq return name2seq Final entry, after loop if name only evaluates to false, if it is still None (when going over first line) new name is derived from line from second letter on, with new-line character removed

81 Formatted output of sequences def print_seq(name, seq, width=50): print ">"+name i=0 while i<len(seq): print seq[i : i+width] i+=width print_seq("Tata-box1", "TA"*55) print_seq("Tata-box2", "TA"*55, 30) Default values, assigned in parameter line, placed rightmost Here, width default is 50-column output >Tata-box1 TATATATATATATATATATATATATATATATATATATATATATATATATA TATATATATA >Tata-box2 TATATATATATATATATATATATATATATA TATATATATATATATATATA

82 Files of sequence names Easy way to specify a subset of a given FASTA database Each line is the name of a sequence in a given database e.g. CG1167 CG685 CG1041 CG1043

83 Get named sequences Given a FASTA database and a "file of sequence names", print every named sequence: (fasta, fosn) = sys.argv[1:3] name2seq = read_FASTA (fasta) for name in open(fosn): name = name.rstrip() if name2seq.has_key[name]: seq = name2seq[name] print_seq (name, seq) else: print "Can't find sequence: %s."%name, "Known sequences: ", " ".join(name2seq.keys())

84 Common Set operations Two files of sequence names: What is the overlap/difference/union? Find eg. intersection using sets CG1167 CG685 CG1041 CG1043 CG215 CG1041 CG483 CG1167 CG1163 from sets import Set geneSet1 = Set([]) geneSet2 = Set([]) for line in open("C:/fosn1.txt"): geneSet1.add(line.rstrip()) for line in open("C:/fosn2.txt"): geneSet2.add(line.rstrip()) C:/fosn1.txtC:/fosn2.txt >>> geneSet1 Set(['CG1043', 'CG1041', 'CG1167', 'CG685']) >>> geneSet1.intersection(geneSet2) Set(['CG1041', 'CG1167']) >>> geneSet2.difference(geneSet1) Set(['CG483', 'CG215', 'CG1163']) >>> geneSet1.union(geneSet2) Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685', 'CG1163', 'CG215']) A AB A AB A AB difference intersection union

85 More Set operations Since every element in a Set occurs only once, sets can be used to reduce redundancy >>> from sets import Set >>> Set([1,2,3,1,3,3]) Set([1, 2, 3]) >>> pqs=Set("1kim 1dan 1bob".split()) >>> pdb=Set("1bob 3mad 1dan 2bad 1kim".split()) >>> pqs.issubset(pdb) True A is a superset of B when A fully contains B Test: A.issuperset(B) A is a subset of B when A is fully contained in B Test: A.issubset(B)

86 The genetic code as a hash aa = {'ttt':'F', 'tct':'S', 'tat':'Y', 'tgt':'C', 'ttc':'F', 'tcc':'S', 'tac':'Y', 'tgc':'C', 'tta':'L', 'tca':'S', 'taa':'!', 'tga':'!', 'ttg':'L', 'tcg':'S', 'tag':'!', 'tgg':'W', 'ctt':'L', 'cct':'P', 'cat':'H', 'cgt':'R', 'ctc':'L', 'ccc':'P', 'cac':'H', 'cgc':'R', 'cta':'L', 'cca':'P', 'caa':'Q', 'cga':'R', 'ctg':'L', 'ccg':'P', 'cag':'Q', 'cgg':'R', 'att':'I', 'act':'T', 'aat':'N', 'agt':'S', 'atc':'I', 'acc':'T', 'aac':'N', 'agc':'S', 'ata':'I', 'aca':'T', 'aaa':'K', 'aga':'R', 'atg':'M', 'acg':'T', 'aag':'K', 'agg':'R', 'gtt':'V', 'gct':'A', 'gat':'D', 'ggt':'G', 'gtc':'V', 'gcc':'A', 'gac':'D', 'ggc':'G', 'gta':'V', 'gca':'A', 'gaa':'E', 'gga':'G', 'gtg':'V', 'gcg':'A', 'gag':'E', 'ggg':'G' }

87 Translating: DNA to protein def translate(dna): length = len(dna) if len(dna) % 3 != 0: print "Warning: Length is not a multiple of 3!" sys.exit() protein = "" i = 0 while i < length: codon = dna[i:i+3] if not aa.has_key(codon): print "Codon %s is illegal"%codon sys.exit() protein += aa[codon] i+=3 return protein >>> translate("gatgacgaaagttgt") 'DDESC' >>> translate("gatgacgaaagttgta") Warning: Length is not a multiple of 3! … (SystemExit) >>> translate("gatgacgiaagttgt") Codon gia is illegal … (SystemExit)

88 Counting residue frequencies def count_residues(seq): freq={} seq = seq.lower() for i in range(len(seq)): if freq.has_key(seq[i]): freq[seq[i]]+=1 else: freq[seq[i]]=1 return freq freq = count_residues("gatgacgaaagttgt") for residue in freq.keys(): print residue,":", freq[residue] g : 5 a : 5 c : 1 t : 4

89 Counting N-mer frequencies def count_nmers(seq, n): freq={} seq = seq.lower() for i in range(len(seq)-n+1): nmer=seq[i : i+n] if freq.has_key(nmer): freq[nmer]+=1 # incr. according counter else: freq[nmer]=1 # first occurence return freq freq = count_nmers("gatgacgaaagttgt", 2) for residue in freq.keys(): print residue,":", freq[residue] cg: 1 tt: 1 ga: 3 tg: 2 gt: 1 aa: 2 ac: 1 at: 1 ag: 1

N-mer frequencies for a whole file from read_fasta import read_fasta def count_nmers(seq, n, freq): seq = seq.lower() for i in range(len(seq)-n+1): nmer=seq[i : i+n] if freq.has_key(nmer): freq[nmer]+=1 else: freq[nmer]=1 return freq name2seq = read_fasta("z:/tmp/fly3utr.txt") freq = {} ## count for each sequence for seq in name2seq.values(): freq = count_nmers(seq, 2, freq) ## display statistics for residue in freq.keys(): print residue,":", freq[residue] ct : 5 tc : 9 tt : 26 cg : 4 ga : 11 tg : 12 gc : 2 gt : 17 aa : 39 ac : 10 gg : 4 at : 17 ca : 11 ag : 15 ta : 20 cc : 2 Note how we keep passing freq back into the count_nmers function, to get cumulative counts We reuse a function we wrote earlier by import ing it.The first is the filename (without.py), the second the function name

91 Files and filehandles Opening a file: Closing a file: Reading a line: Reading an array: Printing a line: Read-only: Write-only: Test if file exists: fh = open(filename) fh.close() This fh is the filehandle data = fh.readline() data = fh.read() fh.write(line) fh = open(filename, "r") fh = open(filename, "w") import os if os.path.exists(filename): print "filename exists!"

92 Database access from Python # use the database package with all the DB relevant sub-routines import MySQLdb # import class that enables data acquirement as dictionaries from MySQLdb.cursors import DictCursor # Connection to database with access specification conn = MySQLdb.connect(db="scop", # name of database host="myserver", # name of server user="guest", # username passwd="guest") # password # create access pointer that retrieves dictionaries cursor = conn.cursor(DictCursor) # send a query cursor.execute("SELECT * FROM cla LIMIT 10") # retrieve all rows as a list of dictionaries data = cursor.fetchall() # close connection conn.close()

93 Local vs. global variables def foo(): a=3 print a a=6 print a foo() print a def foo(): global a a=3 print a a=6 print a foo() print a does not affect global a does affect global a def foo(a): print a a=6 print a foo(3) print a Parameters are local Function variables and parameters are by default local Unless you declare them to be global 636636 633633 636636

94 References in Python Lists, Dictionaries and other Datatypes are usually referenced, ie. when assigning a variable, no data is copied: [1,2,3,4] "Real copies" with copy module Don't worry about any referencing, Python is doing the job! But be aware when you want to copy objects >>> a = [1,2,3,4] >>> b=a >>> b[2]=7 >>> a [1, 2, 7, 4] >>> from copy import copy >>> b = copy(a) >>> b[2]=3 >>> a [1, 2, 7, 4] ab assigning b=a b points to the same list as a

95 Matrices Easy solution: Lists of lists in core Python Access an element at position (i,j) in a list of lists: selecting from the i'th row the j'th element Disadvantages: Operations like Addition/Multiplications on lists (of lists) would be slow, need to be implemented Luckily: big library already available, fast (since implemented in C), rich functionality >>> m = [[1, 2], [3, 4]] >>> m[1] [3, 4] >>> m[1][1] 4

96 Matrices with numarray Faster, more calculations (reshaping, built-in matrix operations) with external package numarray various matrix creation methods with numarray: –from list of lists –zeros/2 –ones/2 –identity/1 –from a function –etc. Convenient access of multidimensional array elements >>> from numarray import * >>> m1 = array(m);m1 array([[1, 2], [3, 4]]) >>> m1.getshape() (2, 2) >>> zeros((3,5)) array([[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]) >>> m2=array(arange(8)) >>> m2.setshape((2,4)) >>> m2 array([[0, 1, 2, 3], [4, 5, 6, 7]]) >>> m2[1,1] 5

97 Matrices with numarray You can select rows and columns, or even submatrices (same "slicing" as with lists) You can apply a scalar operation like –addition + –multiplication * –sine or cosine to an array >>> m1[:,1] # second column array([2, 4]) >>> #arange produces one-dim. array >>> m = arange(9, shape=(3,3));m array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) >>> m[1:,1:] array([[4, 5], [7, 8]]) >>> m[1] + 3 array([6, 7, 8]) >>> m[1] * 3 array([ 9, 12, 15]) >>> m1[1] * 3 array([ 9, 12]) >>> sin(m1) array([[ 0.84147098, 0.90929743], [ 0.14112001, -0.7568025 ]])

98 More Math Remember the mean and standard deviation from Lecture 3? Reuse of existing packages makes live easier: Or finding the maximum in a list becomes now: numarray also provides functions for dot product, vector calculations etc. >>> data = array([1, 5, 1, 12, 3, 4, 6]) >>> data.mean() 4.5714285714285712 >>> data.stddev() 3.7796447300922722 >>> dot(array([1,2,3]), array([1,2,3])) 14 >>> array([1,2,3]) + array([4,5,6]) array([5, 7, 9]) >>> data[argmax(data)] 12

99 Longest Common Subsequence from numarray import * seq1="ATCTGATC" seq2="TGCATA" len1 = len(seq1) len2 = len(seq2) def max3(a,b,c): return max( max(a,b),c) #Create an array val of length len1+1 times len2+1 val=zeros((len1+1,len2+1)) for i in range(1,len1+1): for j in range(1,len2+1): if seq1[i-1]==seq2[j-1]: val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1) else: val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]) print val lcs = val[len1,len2] print "The longest common subsequence of %s and %s is %d (%f)"% \ (seq1, seq2, lcs, float(lcs) / max(len1,len2))

100 Longest Common Subsequence Output [[0 0 0 0 0] [0 1 1 1 1] [0 1 2 2 2] [0 1 2 3 3] [0 1 2 3 4] [0 1 2 3 4]] The longest common subsequence of ATCTGATC and TGCATA is 4 (0.500000) Result of print val Final Result

101 Classes Define a class to store PDB residues. A residue has: a name, a position in the sequence, and a list of atoms. An atom has a name and coordinates. Define 2 methods: add_residue and add_atom class PDBStructure: def add_residue(self, name, posseq): residue = {'name': resname, 'posseq': posseq, 'atoms': []} self._residues.append(residue) return residue def add_atom(self, residue, name, coord): atom = {'residue': residue, 'name': name, 'coord': coord } residue['atoms'].append(atom) return atom

102 Classes: Usage struct = PDBStructure() residue = struct.add_residue(name = "ILE", posseq = 1 ) struct.add_atom(residue, name = "N", coord = (23.46800041, -8.01799965, -15.26200008)) struct.add_atom(residue, name = "CZ", coord = (125.50499725, 4.50500011, -19.14800072)) residue = struct.add_residue(name = "LYS", posseq = 2 ) struct.add_atom(residue, name = "OE1", coord = (126.12000275, -1.78199995, -15.04199982)) print struct.residues [{'name': 'ILE', 'posseq': 1, 'atoms': [ \ {'name': 'N', 'coord': (23.468000409999998, \ -8.0179996500000001, -15.26200008)}, \ {'name': 'CZ', 'coord': (125.50499725, \ 4.5050001100000001, -19.148000719999999)}]}, \ {'name': 'LYS', 'posseq': 2, 'atoms': [ \ {'name': 'OE1', 'coord': (126.12000275, \ -1.7819999500000001, -15.041999819999999)}]}]

1 Programming in Python Michael Schroeder Andreas Henschel {ms,

Similar presentations

Presentation on theme: "1 Programming in Python Michael Schroeder Andreas Henschel {ms,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Programming in Python Michael Schroeder Andreas Henschel {ms,

Similar presentations

Presentation on theme: "1 Programming in Python Michael Schroeder Andreas Henschel {ms,"— Presentation transcript:

Similar presentations

About project

Feedback