Introduction to Python for Biologists Lecture 2 This Lecture Stuart Brown Associate Professor NYU School of Medicine.


1 Introduction to Python for Biologists Lecture 2 This Lecture Stuart Brown Associate Professor NYU School of Medicine

2 Learning Objectives
– Flow control (if/else) and Operators
– For loops
– Recursion
– Reading and Writing files (File I/O)
– Create custom functions with def
– Dictionaries

3 Flow control
Programs need to make decisions and to loop in a controlled way (repeat operations a specific number of times, or until a condition changes).
Decision keywords: if, elif, else
Looping keywords:
    for x in list:
    while a < 10:
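As a minimal sketch of the keywords on this slide, the example below uses if/elif/else inside a function and then both loop forms. The function name classify_length and its length cutoffs are made up for illustration and are not part of the lecture.

```python
def classify_length(seq):
    """Label a sequence by length; the cutoffs are arbitrary, for illustration only."""
    if len(seq) < 5:
        return "short"
    elif len(seq) < 10:
        return "medium"
    else:
        return "long"

# for loop: the body runs once per element of the list
labels = []
for seq in ["ATG", "ATGCGTA", "ATGCGTACCGTA"]:
    labels.append(classify_length(seq))
print(labels)

# while loop: the body repeats as long as the condition is true
a = 0
while a < 10:
    a += 3
print(a)
```

The for loop stops by itself when the list runs out; the while loop stops only because the body changes a, so a while loop whose condition never becomes false runs forever.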

4 For Loops
For loops iterate (step) through a list one element at a time.
In Python, loops and decisions are set off by a colon and an indent.
Python 'for' syntax is very simple, but the statements inside the loop must be indented correctly.
>>> my_list = ['G', 'A', 'hat', 'cat']
>>> concat = ""    # this is an empty string
>>> for i in my_list:
...     concat = concat + i
...
>>> print(concat)
GAhatcat

5 Loop through a String
For loops work on strings as if they were a list of characters.
>>> my_DNA = 'ATGCGTA'
>>> for i in my_DNA:
...     print(i)
...
A
T
G
C
G
T
A

6 if/else example
>>> my_DNA = "ATGCGTA"
>>> if my_DNA.find("GC") != -1:    # find() returns -1 when there is no match
...     print("GC is found")
... else:
...     print("No GC found")
...
GC is found
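The string method find() returns a position rather than True or False, so its result should not be used directly as a condition. The short sketch below shows the three cases that matter; the variable names are illustrative.

```python
my_DNA = "ATGCGTA"

# find() returns the index of the first match, or -1 if there is none
pos_gc = my_DNA.find("GC")    # match in the middle of the string
pos_atg = my_DNA.find("ATG")  # match at the very start
pos_tt = my_DNA.find("TT")    # no match
print(pos_gc, pos_atg, pos_tt)

# bool(0) is False and bool(-1) is True, so "if my_DNA.find(...):" gives the
# wrong answer in both of the last two cases. The "in" operator avoids this.
has_atg = "ATG" in my_DNA
has_tt = "TT" in my_DNA
print(has_atg, has_tt)
```

When only a yes/no answer is needed, the in operator is usually the clearer choice; find() is useful when the position itself matters.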

7 Operators
Operators include the basic math functions: +, -, /, *, ** (raise to power)
Comparisons: >, <, >=, <=, ==
Boolean operators: and, or, not

8 Example
dna = 'GATCCGGTTACTACGACCTGA'
count_G = 0
count_A = 0
for base in dna:
    if base == 'G':
        count_G += 1
    elif base == 'A':
        count_A += 1
print('G= ' + str(count_G) + ' ' + 'A= ' + str(count_A))

9 Functions
More complex operations are carried out by functions. They can deal with file I/O, more complex math, or other manipulations of data.
Functions use parentheses to act on some data object, and may take additional parameters. Some are built in and called by name; others (methods) belong to an object and are called with a dot.
print(x)
open('filename', 'r')
filehandle.read()
my_list.append(42)
filehandle.write(data)
len(my_dna)

10 Range
range(start, stop, [step]) creates a sequence of integers
– Starts at zero by default
– A range does not include the stop number
– Step is optional
In Python 3, wrap the range in list() to see its values:
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(4, 11, 2))    # from 4 up to (not including) 11 with step of 2
[4, 6, 8, 10]
range() is often used as part of a for loop to step through a list while keeping track of which item number you are working on:
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
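As an alternative to the range(len(a)) pattern, the built-in enumerate() hands back the index and the item together. This is a small sketch, not part of the original slide; the list name numbered is only there to collect the pairs.

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

# enumerate() yields (index, item) pairs, so no range(len(a)) is needed
numbered = []
for i, word in enumerate(a):
    numbered.append((i, word))
    print(i, word)
```

Both forms print the same lines; enumerate() simply avoids indexing back into the list by hand.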

11 List Comprehension
A list comprehension creates a list using an expression and a for loop. An optional if clause can be included.
# create a list of squares < 50
squares = []
for x in range(10):
    if (x**2) < 50:
        squares.append(x**2)
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49]
# create a list of squares < 50 with a list comprehension
squares = [x**2 for x in range(10) if (x**2) < 50]
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49]

12 Custom Functions
In Python, users can create their own functions, which act like subroutines, or use functions within code written by others (known as modules).
def g_count(dna):    # function takes a string as input
    count = 0
    for base in dna:
        if base == 'G':
            count += 1
    return count     # function returns an integer
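Once a function is defined, it can be called like any built-in, including from inside another function. The sketch below repeats g_count from the slide and adds a hypothetical helper, gc_fraction, that is not part of the lecture and is shown only to illustrate one function calling another.

```python
def g_count(dna):
    """Count the G bases in a DNA string (same logic as the slide)."""
    count = 0
    for base in dna:
        if base == 'G':
            count += 1
    return count

def gc_fraction(dna):
    """Hypothetical helper: fraction of bases that are G or C."""
    if len(dna) == 0:
        return 0.0   # avoid dividing by zero for an empty string
    return (g_count(dna) + dna.count('C')) / len(dna)

my_dna = 'GATCCGGTTACTACGACCTGA'
print(g_count(my_dna))
print(gc_fraction(my_dna))
```

For a simple count, the built-in string method count() does the same job as the hand-written loop; the loop version is useful mainly as practice with def, for, and if.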

13 ATG finder
>>> def find_ATG(dna):
...     if dna.find("ATG"):
...         return ("ATG is found")
...     else:
...         return ("No ATG found")
...
>>> my_dna = 'TATGCGTA'
>>> find_ATG(my_dna)
ATG is found
Bonus point if you find and fix some of the bugs in this code

14 Recursion
Now that you can make custom functions …
– what would happen if you wrote a function that called itself?
def countdown(n):
    if n <= 0:
        print("Blastoff!")
    else:
        print(n)
        countdown(n-1)
Of course, you should avoid creating an infinite loop …
def plustwo(n):
    print(n)
    plustwo(n+2)    # be careful running this - get ready to kill it

15 Fibonacci
Computer scientists use recursion often; it is less common in bioinformatics applications.
Rosalind (rosalind.info) has several sections that explore algorithms in computational biology and beyond.
– There is a nice (fairly simple) problem about Fibonacci numbers: http://rosalind.info/problems/fibo/
– Give it a try (in Python, of course).

16
def Fib(x):
    if x == 0:
        return 0
    elif x == 1:
        return 1
    elif x > 1:
        return Fib(x-1) + Fib(x-2)
Why is this program such a bad idea? How can you do it better using a simple list to store the Fib series?
This is also a good introduction to computational complexity. Bioinformatics often deals with large data and complex computations, so the speed of computing for a given task is an important issue.
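One possible answer to the question on this slide, offered as a sketch rather than the intended solution: the recursive version recomputes the same values over and over, so the number of calls grows roughly exponentially with x, while storing the series in a list needs only one addition per term. The function name fib_list is an assumption, and the code assumes n is a non-negative integer.

```python
def fib_list(n):
    """Return the Fibonacci series F(0)..F(n) as a list, computed iteratively."""
    series = [0, 1]
    for i in range(2, n + 1):
        # each new value reuses the two already stored, so nothing is recomputed
        series.append(series[i - 1] + series[i - 2])
    return series[:n + 1]   # the slice handles n == 0, where only F(0) is wanted

print(fib_list(10))
print(fib_list(30)[-1])
```

Computing the 30th term this way takes 29 additions; the recursive Fib(30) makes over a million function calls to reach the same number.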

17 File I/O
Usually your programs will get input data in a text file, and you will want to write output to a file rather than dump it on the screen ("standard output", "stdout").
In Python, a file must be opened before reading or writing. The open file is assigned to a variable called a 'handle', then the program reads from or writes to the handle.
The .read() method captures the whole contents of the file in a single string.
.close() the file when you are done with it.
file1 = open('human_pep.fasta')
Hum_pep = file1.read()
gene_count = Hum_pep.count('>')
file1.close()

18 with open() as f
A nicer way to open a file is to use the with/as keywords and an indented block. This automatically closes the file when the indented block is completed.
>>> with open('human_pep.fasta') as file1:
...     Hum_pep = file1.read()

19 Write output to a file
To create an output file, open a file (give it any name you want) with the 'w' option and assign it to a variable name. Then use the write() method.
write() takes a single string, which you can build with string methods, concatenation, etc. inside the parentheses. Unlike print(), it does not add a newline at the end.
output = open('humpep_count.txt', 'w')
output.write('Gene Count: ' + str(gene_count))
output.close()

20 Read a file line by line with a for loop
readlines() captures a file as a list of lines (rather than all in one big string), then you can loop over the list of lines.
my_file = open('human_dna.fasta')
human_seq = my_file.readlines()
for line in human_seq:
    print(len(line))
Or you can iterate over lines in the file directly with a for loop:
my_file = open('human_dna.fasta')
for line in my_file:
    print(len(line))
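The file I/O slides can be combined into one runnable sketch: write a file with 'w', then read it back line by line inside a with block. The two-record FASTA text and the file name demo_pep.fasta are made-up stand-ins for human_pep.fasta, and a temporary directory is used so the example does not depend on any file already being present.

```python
import os
import tempfile

# Assumption: a tiny made-up FASTA file stands in for human_pep.fasta
fasta_text = ">pep1\nMKTAYIAK\n>pep2\nMLLAV\n"
path = os.path.join(tempfile.mkdtemp(), "demo_pep.fasta")

# 'w' mode creates (or overwrites) the file; the with block closes it
with open(path, "w") as out:
    out.write(fasta_text)

# Iterate over the file line by line; FASTA header lines start with '>'
gene_count = 0
residues = 0
with open(path) as infile:
    for line in infile:
        line = line.rstrip("\n")   # drop the newline so len() counts residues only
        if line.startswith(">"):
            gene_count += 1
        else:
            residues += len(line)

print(gene_count, residues)
```

Each line read from a file still ends with its newline character, which is why len(line) on a raw line is one larger than the number of visible characters.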

21 Dictionaries
Dictionaries contain key-value pairs. (Called a "hash" or "map" in some other programming languages.)
my_dict1 = {'ATT' : 'I', 'CTT' : 'L', 'GTT' : 'V', 'TTT' : 'F'}
Very useful for lookup lists of things like the amino acid codon table or k-mer lists.
Designed to give very fast random-access lookup of the key and return the corresponding value.
Keys must be unique and hashable (strings, numbers, and tuples all work); values can be anything.
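A short sketch of the basic dictionary operations, using the dictionary from this slide. The default value 'X' and the second dictionary, positions, are illustrative assumptions.

```python
my_dict1 = {'ATT': 'I', 'CTT': 'L', 'GTT': 'V', 'TTT': 'F'}

print(my_dict1['GTT'])            # direct lookup by key
print('AAA' in my_dict1)          # membership test checks the keys
print(my_dict1.get('AAA', 'X'))   # get() returns a default instead of raising KeyError

# Keys need not be strings: any hashable value (number, tuple, ...) works
positions = {(1, 'A'): 'start', 42: 'answer'}

my_dict1['ATG'] = 'M'             # assigning to a new key adds a pair
print(len(my_dict1))
```

Looking up a missing key with square brackets raises a KeyError, so get() or an in test is the safer choice when a key may be absent.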

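22 Zip makes a dictionary
Rather than type a dictionary, you can build a dictionary from two lists using zip(), which pairs up their elements; dict() then turns the pairs into key-value entries.
>>> list1 = ['GAT', 'CAT', 'TAT', 'AAT']
>>> list2 = [1, 2, 3, 4]
>>> list(zip(list1, list2))
[('GAT', 1), ('CAT', 2), ('TAT', 3), ('AAT', 4)]
>>> dict(zip(list1, list2))
{'GAT': 1, 'CAT': 2, 'TAT': 3, 'AAT': 4}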

23 Check and add to dictionary
Another useful application of a dictionary is to build a non-redundant list.
– For each item, check if it is in the dictionary; if not, add it to the dictionary.
– You can count occurrences at the same time.
Example: count DNA dimers
DNA = 'GATCCGGTTACTACGACCTGAGAT'
Dimers = {}    # create an empty dictionary
for x in range(len(DNA)):
    di = DNA[x:(x+2)]
    if di in Dimers:
        Dimers[di] += 1    # add one to count for di
    else:
        Dimers[di] = 1     # add di to Dimers dict
print Dimers
Bonus point if you find and fix the bugs in this code

24 Challenge Assignment
– Write a function that translates a DNA string into protein.
– In your function, use a dictionary of triplet codons as keys and amino acids as values.
– Begin translation at the first ATG codon.
– Write a program that uses your translate function to open and translate a file that contains a single DNA sequence as text, and write the output as another text file.

25 Zip a codon table (save yourself some typing)
codons = ['ttt', 'ttc', 'tta', 'ttg', 'tct', 'tcc', 'tca', 'tcg',
          'tat', 'tac', 'taa', 'tag', 'tgt', 'tgc', 'tga', 'tgg',
          'ctt', 'ctc', 'cta', 'ctg', 'cct', 'ccc', 'cca', 'ccg',
          'cat', 'cac', 'caa', 'cag', 'cgt', 'cgc', 'cga', 'cgg',
          'att', 'atc', 'ata', 'atg', 'act', 'acc', 'aca', 'acg',
          'aat', 'aac', 'aaa', 'aag', 'agt', 'agc', 'aga', 'agg',
          'gtt', 'gtc', 'gta', 'gtg', 'gct', 'gcc', 'gca', 'gcg',
          'gat', 'gac', 'gaa', 'gag', 'ggt', 'ggc', 'gga', 'ggg']
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
>>> codon_table = dict(zip(codons, amino_acids))
Very nice Python code by Peter Collingridge: http://www.petercollingridge.co.uk/python-bioinformatics-tools/codon-table
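To save even more typing, the codon list can be generated rather than written out, since the slide lists the codons in a regular t, c, a, g order at each position. The sketch below is one way to do that and then look up a few codons; the variable name bases is an assumption, and the length check is added only as a safeguard.

```python
# Build the 64 codons in the same t, c, a, g order used on the slide
bases = 'tcag'
codons = [a + b + c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# zip() pairs each codon with one character; both must have 64 items to line up
assert len(codons) == len(amino_acids) == 64
codon_table = dict(zip(codons, amino_acids))

print(codon_table['atg'])   # start codon
print(codon_table['taa'])   # '*' marks a stop codon
# The table uses lowercase keys, so lowercase the input before looking it up
print(codon_table['TGG'.lower()])
```

Because zip() stops silently at the end of the shorter input, a stray or missing character in amino_acids would shift or drop entries without raising an error, which is why checking the lengths first is worthwhile.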

26 Re-use Code vs Write New
A little break for a philosophical debate.
When should you find and re-use code written by others and when should you write your own?
In Bioinformatics, many of the problems you will encounter with data have been faced by other people.
– A great deal of code has been written and shared in public repositories.
– Some of this code has been published and cited in the literature.
– Don't try to re-write BLAST (unless you really, really have to).
If you can't find code to do exactly what you want, should you adapt existing code, or write your own?
– There are challenges to figuring out someone else's code.
– New code that depends on programs written by others can be fragile: it may break when those programs change.
– There are challenges to validating your own code when using it to analyze and publish scientific data.
– There is value in building your own repository of code elements from scratch that work and fit together in a way that is intuitive for you.

27 Some Statistics in Python
NumPy has some basic statistics functions that work on arrays.
>>> import numpy as np
>>> squares = [x**2 for x in range(10) if (x**2) < 50]
>>> sq = np.array(squares)
>>> np.mean(sq)
17.5
>>> np.median(sq)
12.5
>>> np.std(sq)
16.680827317612277
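One detail worth knowing about the last value: np.std() by default computes the population standard deviation (it divides by n), not the sample standard deviation (which divides by n - 1). The sketch below cross-checks the slide's numbers with Python's standard statistics module instead of NumPy, so it runs without any extra installation; the variable names pop_sd and sample_sd are illustrative.

```python
import statistics

squares = [x**2 for x in range(10) if (x**2) < 50]

print(statistics.mean(squares))     # 17.5
print(statistics.median(squares))   # 12.5

pop_sd = statistics.pstdev(squares)     # population SD: divides by n
sample_sd = statistics.stdev(squares)   # sample SD: divides by n - 1
print(pop_sd, sample_sd)
```

Here pop_sd matches the np.std() value shown on the slide, while sample_sd is slightly larger. In NumPy the sample version is requested with np.std(sq, ddof=1).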

28 Other NumPy functions
NumPy has:
– linear algebra
– trigonometry
– logarithms
– polynomials
– Fourier transforms
– random sampling
– permutations
– sorting
– distributions (normal, Poisson, hypergeometric, logistic, gamma, negative binomial, etc.)

29 SciPy
SciPy builds on NumPy and provides many more complex mathematical, statistical, and scientific data analysis functions.
>>> import antigravity

30 Summary
– Flow control (if/else) and Operators
– For loops
– Recursion
– Reading and Writing files (File I/O)
– Create custom functions with def
– Dictionaries

31 Next Lecture: Biopython

