Parsing data records


A sequence record in FASTA format

>sp|P31946|1433B_HUMAN protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

seq = ">sp|P31946|1433B_HUMAN protein beta/alpha OS=Homo sapiens \
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\
AGEGEN"
# a for loop over a string visits it one character at a time
for i in seq:
    print i

# a for loop over an open file visits it one line at a time
seq = open("SingleSeq.fasta")
for line in seq:
    print line

seq = open("SingleSeq.fasta")
seq_2 = open("SingleSeq-2.fasta", "w")    # "w" opens the new file for writing
for line in seq:
    seq_2.write(line)
seq_2.close()

Writing different things depending on a condition

Read a sequence in FASTA format and print only the header of the sequence

seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] == '>':
        print line

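Making choices: the if/elif/else statements

if <condition 1>:
    <statements 1>        # executed if condition 1 is True
[elif <condition 2>:
    <statements 2>]       # otherwise, executed if condition 2 is True
[elif <condition 3>:
    pass]                 # etc.; pass does nothing
…
[else:
    <statements N>]       # executed if none of the conditions is True

The parts in square brackets are optional.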

Write different things depending on a condition

>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
>>> print freq_A
>>> if freq_G > freq_A:
...     print "Gly is more frequent than Ala"
... elif freq_G < freq_A:
...     print "Ala is more frequent than Gly"
... else:
...     print "The frequency of Gly and Ala is the same"
...
Ala is more frequent than Gly

The if/elif/else construct produces different effects compared with the use of a series of if conditions
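A minimal sketch of that difference, using two hypothetical helper functions (`classify_elif` and `classify_ifs`) and made-up thresholds that are not part of the slides. In an if/elif/else chain only the first true branch runs; in a series of separate if statements every true condition runs.

```python
def classify_elif(n):
    """Collect the labels chosen by one if/elif/else chain."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    elif n > 5:
        # skipped whenever the first condition was already true
        labels.append("greater than 5")
    else:
        labels.append("5 or less")
    return labels


def classify_ifs(n):
    """Collect the labels chosen by a series of independent if statements."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    if n > 5:
        # tested regardless of the outcome of the previous if
        labels.append("greater than 5")
    if n <= 5:
        labels.append("5 or less")
    return labels


print(classify_elif(12))   # ['greater than 10']
print(classify_ifs(12))    # ['greater than 10', 'greater than 5']
```

With the value 12 both conditions `n > 10` and `n > 5` are true, so the chain reports one label and the separate ifs report two.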

seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] != '>':
        print line

seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] == '>':
        print line

seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] != '>':
        print line

Comparison operators: ==  !=  >  <

Exercises 1, 2, and 3

1) Read a file in FASTA format and write to a new file only the header of the record.
2) Read a file in FASTA format and write to a new file only the sequence (without the header).
3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to a file and the sequence to a different one.

# Exercise 1: header only
fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
for line in fasta:
    if line[0] == '>':
        header.write(line)
header.close()

# Exercise 2: sequence only
fasta = open('SingleSeq.fasta')
seq = open('seq.txt', 'w')
for line in fasta:
    if line[0] != '>':
        seq.write(line)
seq.close()

# Exercise 3: header and sequence to two different files
fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
seq = open('seq.txt', 'w')
for line in fasta:
    if line[0] == '>':
        header.write(line)
    else:
        seq.write(line)
header.close()
seq.close()
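As a possible variation on the solution to exercise 3, the files can be opened in a `with` statement, which closes them automatically even if an error interrupts the loop. This is only a sketch: the function name `split_fasta` and its parameters are hypothetical and do not come from the slides.

```python
def split_fasta(fasta_path, header_path, seq_path):
    """Write header lines to one file and sequence lines to another."""
    # all three files are closed when the with block ends
    with open(fasta_path) as fasta, \
            open(header_path, "w") as header, \
            open(seq_path, "w") as seq:
        for line in fasta:
            if line.startswith(">"):
                header.write(line)
            else:
                seq.write(line)
```

It would be called as `split_fasta('SingleSeq.fasta', 'header.txt', 'seq.txt')`, using the same file names as the exercise.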

Let’s increase the difficulty just a bit…

seq_fasta = open("SingleSeq.fasta")
seq = ''
for line in seq_fasta:
    if line[0] == '>':
        header = line
    else:
        seq = seq + line.strip()    # join the sequence lines into one string
num_cys = seq.count("C")
print header, seq, num_cys

Exercise 4

4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.

seq_fasta = open("SingleSeq.fasta")
seq = ''
header = ''
for line in seq_fasta:
    if line[0] == '>':
        if "Homo sapiens" in line:
            header = line
    else:
        if header:    # header is set only for a Homo sapiens record
            seq = seq + line
if header:
    print header + seq
else:
    print "The record is not from H. sapiens"

In general, you will need to analyse several sequences….

SwissProt-Human.fasta

>sp|P31946|1433B_HUMAN protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
...

Read the records from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')
for line in fasta:
    fasta_2.write(line)    # the argument of write() must be a string
fasta_2.close()

Strings can be concatenated
Strings can be indexed and sliced
String elements cannot be re-assigned

>>> print "ACTGGTA" + "ATGTAACTT"
ACTGGTAATGTAACTT
>>> s = "ACTGGTA"
>>> s[0]
'A'
>>> s[1:3]
'CT'
>>> s[2] = 'Z'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
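Because a string cannot be modified in place, a changed version has to be built as a new string. A small sketch of two common ways to do this, with variable names (`mutated`, `via_list`) chosen only for illustration:

```python
s = "ACTGGTA"

# 1) Concatenate the slices around the position to change
mutated = s[:2] + "Z" + s[3:]
print(mutated)    # ACZGGTA

# 2) Convert to a list (lists are mutable), change it, join it back
chars = list(s)
chars[2] = "Z"
via_list = "".join(chars)
print(via_list)   # ACZGGTA

# the original string is untouched in both cases
print(s)          # ACTGGTA
```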

Read the sequences from a file and write them to a new file
Number the lines starting from 1

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')
n = 0
for line in fasta:
    n = n + 1
    l_n = str(n)    # convert the number to a string before concatenating
    fasta_2.write(l_n + "\t" + line)
fasta_2.close()
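The manual counter can also be replaced by the built-in `enumerate()`, which pairs each line with its number. The sketch below is an alternative, not the slides' solution; the function name `number_lines` and the two example lines are made up for illustration, and it works on any iterable of lines, including an open file.

```python
def number_lines(lines, start=1):
    """Return the lines, each prefixed with its number and a tab."""
    numbered = []
    for n, line in enumerate(lines, start):
        numbered.append(str(n) + "\t" + line)
    return numbered


example = [">sp|P31946|1433B_HUMAN\n", "MTMDKSELVQ\n"]
for out in number_lines(example):
    print(out.rstrip("\n"))
```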


Exercise 5

5) Download a UniProt multiple sequence FASTA file. Write the record headers to a new file.

fasta = open('SwissProt-Human.fasta')
headers = open('headers.txt', 'w')
for line in fasta:
    if line[0] == '>':
        headers.write(line)
headers.close()


Exercise 6

6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line.

fasta = open('SwissProt-Human.fasta')
seqs = open('seqs.txt', 'w')
for line in fasta:
    if line[0] == '>':
        seqs.write('\n')    # a blank line in place of each header
    elif line[0] != '>':
        seqs.write(line)    # alternative: seqs.write(line.strip() + '\n')
seqs.close()

Exercise 7

7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.

fasta = open('sprot_prot.fasta')
output = open('homo_sapiens.fasta', 'w')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        # header of the first record
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        # a new header: the previous record is complete
        if "Homo sapiens" in header:
            output.write(header + seq)
        header = line
        seq = ''
# the last record is not followed by a header, so handle it here
if "Homo sapiens" in header:
    output.write(header + seq)
output.close()

Exercise 8

8) Read FASTA records from a file and count the cysteine residues in each sequence.

fasta = open('sprot_prot.fasta')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line[4:10]    # the accession, e.g. P31946
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        cys_num = seq.count('C')
        print header, ': ', cys_num
        header = line[4:10]
        seq = ''
# count again for the last record before printing it
cys_num = seq.count('C')
print header, ': ', cys_num

Exercises 9, 10, and 11

9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M').
10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W').
11) Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine ('M') and have at least two tryptophans ('W').

outfile = open('SwissProtHuman-Filtered.fasta', 'w')
fasta = open('SwissProtHuman.fasta', 'r')
seq = ''
for line in fasta:
    if line[0:1] == '>' and seq == '':
        header = line
    elif line[0:1] != '>':
        seq = seq + line
    elif line[0:1] == '>' and seq != '':
        TRP_num = seq.count('W')
        if seq[0] == 'M' and TRP_num > 1:
            outfile.write(header + seq)
        seq = ''
        header = line
# check the last record
TRP_num = seq.count('W')
if seq[0] == 'M' and TRP_num > 1:
    outfile.write(header + seq)
outfile.close()
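Exercises 7 to 11 all repeat the same record-collecting loop. One way to avoid the repetition, sketched here under the assumption that every record starts with a '>' line, is a generator that yields one (header, sequence) pair at a time. The names `read_fasta` and `keep_record` and the two short example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                # a new header closes the previous record
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        # the last record is not followed by another header
        yield header, "".join(chunks)


def keep_record(seq):
    """Exercise 11 criterion: starts with 'M' and has at least two 'W'."""
    return seq.startswith("M") and seq.count("W") >= 2


example = [
    ">rec1 hypothetical record\n",
    "MAWW\n",
    "KK\n",
    ">rec2 hypothetical record\n",
    "AWWK\n",
]
selected = [h for h, s in read_fasta(example) if keep_record(s)]
print(selected)   # ['>rec1 hypothetical record']
```

Since an open file is also an iterable of lines, `read_fasta(open('SwissProtHuman.fasta'))` would work the same way; only the test in `keep_record` changes from one exercise to the next.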

In many cases you will need to compare data from different files

Two input files: SwissProt-Human.fasta (the multiple sequence FASTA file) and cancer-expressed.txt (a list of SwissProt ACs, one per line)

1) Read 10 SwissProt ACs from a file
2) Store them into a data structure

cancer_file = open('cancer-expressed.txt')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
print cancer_list

List data structure

A list is a mutable ordered collection of objects

L = [1, [2,3], 4.52, 'DNA']

The elements of a list can be any kind of object: numbers, strings, tuples, lists, dictionaries, function calls, etc.

L = []    # the empty list

>>> L = [1,"hello",12.1,[1,2,"three"],"seq",(1,2)]
>>> L[0]       # indexing
1
>>> L[3]       # indexing
[1, 2, 'three']
>>> L[3][2]    # indexing
'three'
>>> L[-1]      # negative indexing
(1, 2)
>>> L[2:4]     # slicing
[12.1, [1, 2, 'three']]
>>> L[2:]      # slicing shorthand
[12.1, [1, 2, 'three'], 'seq', (1, 2)]
>>>

The elements of a list can be changed/replaced after the list has been defined

l[i] = x
l[i:j] = t
del l[i:j]
del l[i:j:k]
l.append(x)
l.extend(x)

>>> l = [2,3,5,7,8,['a','b'],'a','b','cde']
>>> l[0] = 1
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l[0:3] = 'DNA'
>>> l
['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> del l[0:5]
>>> l
[['a', 'b'], 'a', 'b', 'cde']
>>> l.append('DNA')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA']
>>> l.extend('dna')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a']
>>>

These operations CHANGE the list

The elements of a list can be changed/replaced after the list has been defined

l.count(x)
l.index(x)
l.insert(i, x)
l.pop(i)
l.remove(x)

>>> l = [1,3,5,7,8,['a','b'],'a','b','cde']
>>> l.count('a')
1
>>> l.index(8)
4
>>> l.insert(4, 80)
>>> l
[1, 3, 5, 7, 80, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop(4)
80
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop()
'cde'
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b']
>>> l.remove(8)
>>> l
[1, 3, 5, 7, ['a', 'b'], 'a', 'b']

The elements of a list can be changed/replaced after the list has been defined

l.reverse()
l.sort()
sorted(l)

>>> l = [4, 3, 2, 1, 5, 6, 7, 8]
>>> l.reverse()
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> new = sorted(l)    # returns a new sorted list, l is unchanged
>>> new
[1, 2, 3, 4, 5, 6, 7, 8]
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> l.sort()           # sorts l in place
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]

Putting together lists and loops

range() and xrange() built-in functions

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(0, 30, 5)
[0, 5, 10, 15, 20, 25]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(0, -10, -1)
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
>>> range(0)
[]
>>> range(1, 0)
[]
# the xrange() function is more commonly used in for loops than range()
>>> for i in xrange(5):
...     print i
...
0
1
2
3
4

The xrange() function generates the values one at a time as the loop asks for them, i.e. it does not build the whole list in memory
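The session above is Python 2. As a side note for readers running Python 3: there `xrange()` no longer exists and `range()` itself produces its values lazily, so it has to be wrapped in `list()` to display them. A short sketch (the variable names are only illustrative):

```python
# range() wrapped in list() shows the values it would generate
steps = list(range(0, 30, 5))
print(steps)        # [0, 5, 10, 15, 20, 25]

countdown = list(range(0, -10, -1))
print(countdown)    # [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]

empty = list(range(1, 0))
print(empty)        # []

# in a for loop, range() is used directly
for i in range(5):
    print(i)
```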

Exercise 12

12) Create a list containing UniProt ACs extracted from a FASTA file. Print the list.

InputFile = open("SwissProtHuman.fasta", "r")
AC_list = []
for line in InputFile:
    if line[0] == '>':
        fields = line.split('|')    # e.g. ['>sp', 'P31946', '1433B_HUMAN ...']
        AC_list.append(fields[1])
print AC_list

By the way…

Exercise 13

13) Read a file in FASTA format and copy to a new file the record ACs.

Selectively extract ACs from a FASTA file

human_fasta = open('SwissProt-Human.fasta')
Outfile = open('SwissProt-Human-AC.txt', 'w')
for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        Outfile.write(AC + '\n')
Outfile.close()

Exercise 14

14) Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer-expressed.fasta', 'w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()

We are not writing the whole record but the header line only


Exercise 15

15) Read a multiple sequence file in FASTA format and write to a new file only the records whose UniProt ACs are present in the list created in 12).

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta', 'w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == ">":
        field = line.split("|")
        AC = field[1]
        if AC in cancer_list:
            Outfile.write(line)
    else:
        # a sequence line: AC still holds the accession of its header
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()

The same but with more control…

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta', 'w')
cancer_list = []
seq = ''
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == '>' and seq == '':
        header = line
        AC = line.split('|')[1]
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        # a new header: decide whether to write the previous record
        if AC in cancer_list:
            Outfile.write(header + seq)
        header = line
        AC = line.split('|')[1]
        seq = ''
# the last record
if AC in cancer_list:
    Outfile.write(header + seq)
Outfile.close()
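The test `AC in cancer_list` scans the list from the start every time. A possible variation, which is not part of the slides, stores the ACs in a set, where membership tests stay fast however many ACs there are. The sketch below works on lists of lines so it can be tried without files; `load_acs`, `filter_records`, and the shortened example records are hypothetical, and it assumes every header contains at least two '|' separated fields.

```python
def load_acs(lines):
    """Collect the non-empty accession codes, one per line, into a set."""
    acs = set()
    for line in lines:
        ac = line.strip()
        if ac:
            acs.add(ac)
    return acs


def filter_records(fasta_lines, wanted):
    """Return the lines of the records whose AC is in the set `wanted`."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # assumption: the header looks like '>db|AC|...'
            keep = line.split("|")[1] in wanted
        if keep:
            kept.append(line)
    return kept


cancer_lines = ["P31946\n", "Q04917\n"]
fasta_lines = [
    ">sp|P31946|1433B_HUMAN\n",
    "MTMDKSELVQ\n",
    ">sp|P62258|1433E_HUMAN\n",
    "MDDREDLVYQ\n",
]
kept = filter_records(fasta_lines, load_acs(cancer_lines))
print("".join(kept))
```

Here only the P31946 record is kept, because P62258 is not among the wanted ACs.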

Extract and write to a file the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap gbk)

Try to write it in FASTA format:

>AP
ccactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaa
agtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatcc
atctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaa
cacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaa
gtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga
...

Exercise 16

16) Read a GenBank record and write to a file the nucleotide sequence in FASTA format.

InputFile = open("ap gbk")
OutputFile = open("ap fasta", "w")
flag = 0
for line in InputFile:
    if line[0:9] == 'ACCESSION':
        AC = line.split()[1].strip()
        OutputFile.write('>' + AC + '\n')
    if line[0:6] == 'ORIGIN':
        # the sequence starts on the line after ORIGIN
        flag = 1
        continue
    if flag == 1:
        fields = line.split()
        if fields != []:
            # drop the leading position number and join the sequence blocks
            seq = ''.join(fields[1:])
            OutputFile.write(seq + '\n')
InputFile.close()
OutputFile.close()

Parsing data records

Start by visually inspecting the file you want to parse
Identify the information you want to extract
Identify separators to select your information using if conditions
Use lists if you have to compare data from different files
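As an illustration of the "identify separators" step, the header lines used throughout these slides separate the database, the AC, and the entry name with '|', and the entry name from the description with a space. A minimal sketch, assuming headers always have that layout; the function name `parse_header` and the dictionary keys are made up for this example.

```python
def parse_header(header):
    """Split a header of the form '>db|AC|ID description' into its fields."""
    fields = header.lstrip(">").strip().split("|")
    db = fields[0]
    ac = fields[1]
    # the third field holds the entry name followed by the description
    parts = fields[2].split(" ", 1)
    entry_name = parts[0]
    if len(parts) > 1:
        description = parts[1]
    else:
        description = ""
    return {"db": db, "ac": ac, "id": entry_name, "description": description}


info = parse_header(
    ">sp|P31946|1433B_HUMAN protein beta/alpha OS=Homo sapiens\n"
)
print(info["ac"])            # P31946
print(info["id"])            # 1433B_HUMAN
print(info["description"])   # protein beta/alpha OS=Homo sapiens
```

A header without '|' separators would raise an IndexError here, which is one reason to inspect the file before writing the parser.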

We can use while loops to read files (but usually we won’t do it)

cancer_file = open('cancer-expressed.txt')
cancer_list = []
line = cancer_file.readline()
while line:    # readline() returns an empty string at the end of the file
    AC = line.strip()
    cancer_list.append(AC)
    line = cancer_file.readline()

You can repeat all exercises using ncbi_gene.fasta as input file

Summary

Parsing sequence records in FASTA format
Lists
Making choices: if/elif/else
range() and xrange()