
1 Parsing data records

2
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

3 A sequence record in FASTA format

4
seq = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens \
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\
AGEGEN"
for i in seq:
    print i

5
seq = open("SingleSeq.fasta")
for line in seq:
    print line

6
seq = open("SingleSeq.fasta")
seq_2 = open("SingleSeq-2.fasta", "w")    # open the copy in write mode
for line in seq:
    seq_2.write(line)
seq_2.close()

7 Writing different things depending on a condition
Read a sequence in FASTA format and print only the header of the sequence
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

8
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] == '>':
        print line

9 Making choices: The if/elif/else statements
if <condition 1>:        # if <condition 1> is TRUE,
    <statements 1>       # execute statements 1
[elif <condition 2>:     # else, if <condition 2> is TRUE,
    <statements 2>]      # execute statements 2
[elif <condition 3>:     # etc.
    pass]
…
[else:
    <statements N>]

10 Write different things depending on a condition
>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667

11 Write different things depending on a condition
>>> # s, freq_G and freq_A are defined as on slide 10
>>> if freq_G > freq_A:
...     print "Gly is more frequent than Ala"
... elif freq_G < freq_A:
...     print "Ala is more frequent than Gly"
... else:
...     print "The frequency of Gly and Ala is the same"
...
Ala is more frequent than Gly

12 The if/elif/else construct produces different effects compared with the use of a series of if conditions
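The difference can be sketched with a small example. The function names `describe_chain` and `describe_separate` are made up for illustration, and the code uses Python 3 syntax (the slides use Python 2). In an if/elif/else chain at most one branch runs; in a series of if statements every test is evaluated independently.

```python
def describe_chain(n):
    """Return labels using one if/elif/else chain: at most one branch runs."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    elif n > 5:
        labels.append("greater than 5")
    else:
        labels.append("5 or less")
    return labels


def describe_separate(n):
    """Return labels using independent if statements: every true test runs."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    if n > 5:
        labels.append("greater than 5")
    if n <= 5:
        labels.append("5 or less")
    return labels


print(describe_chain(12))     # ['greater than 10']
print(describe_separate(12))  # ['greater than 10', 'greater than 5']
```

For 12 the chain stops at the first true test, while the separate ifs also report the second one.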

13
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] != '>':
        print line

seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] == '>':
        print line

14
seq = open("SingleSeq.fasta")
for line in seq:
    if line[0] != '>':
        print line
Comparison operators: ==  !=  >  <

15 Exercises 1, 2, and 3
1) Read a file in FASTA format and write to a new file only the header of the record.
2) Read a file in FASTA format and write to a new file only the sequence (without the header).
3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to one file and the sequence to a different one.

16
fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
for line in fasta:
    if line[0] == '>':
        header.write(line)
header.close()

17
fasta = open('SingleSeq.fasta')
seq = open('seq.txt', 'w')
for line in fasta:
    if line[0] != '>':
        seq.write(line)
seq.close()

18
fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
seq = open('seq.txt', 'w')
for line in fasta:
    if line[0] == '>':
        header.write(line)
    else:
        seq.write(line)
header.close()
seq.close()

19 Let’s increase the difficulty just a bit…

20
seq_fasta = open("SingleSeq.fasta")
seq = ''
for line in seq_fasta:
    if line[0] == '>':
        header = line
    else:
        seq = seq + line.strip()
num_cys = seq.count("C")
print header, seq, num_cys

21 Exercise 4 4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.

22
seq_fasta = open("SingleSeq.fasta")
seq = ''
header = ''
for line in seq_fasta:
    if line[0] == '>':
        if "Homo sapiens" in line:
            header = line
    else:
        if header:
            seq = seq + line
if header:
    print header + seq
else:
    print "The record is not from H. sapiens"

23 In general, you will need to analyse several sequences….

24 SwissProt-Human.fasta
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
...

25 Read the records from a file and write them to a new file
fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')
for line in fasta:
    fasta_2.write(line)    # the argument of write() must be a string
fasta_2.close()

26
Strings can be concatenated
Strings can be indexed and sliced
String elements cannot be re-assigned
>>> print "ACTGGTA" + "ATGTAACTT"
ACTGGTAATGTAACTT
>>> s = "ACTGGTA"
>>> s[0]
'A'
>>> s[1:3]
'CT'
>>> s[2] = 'Z'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
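Since string elements cannot be re-assigned, the usual workaround is to build a new string from slices of the old one. A minimal sketch (the variable name `mutated` is illustrative; Python 3 syntax):

```python
s = "ACTGGTA"

# s[2] = 'Z' would raise TypeError, so build a new string from slices instead.
mutated = s[:2] + "Z" + s[3:]
print(mutated)  # ACZGGTA

# The original string is unchanged.
print(s)  # ACTGGTA
```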

27 Read the sequences from a file and write them to a new file
Number the lines starting from 1
fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')
n = 0
for line in fasta:
    n = n + 1
    l_n = str(n)
    fasta_2.write(l_n + "\t" + line)
fasta_2.close()

28 SwissProt-Human.fasta (the three records shown on slide 24)

29 Exercise 5 5) Download a Uniprot multiple sequence FASTA file. Write the record headers to a new file.

30
fasta = open('SwissProt-Human.fasta')
headers = open('headers.txt', 'w')
for line in fasta:
    if line[0] == '>':
        headers.write(line)
headers.close()

31 SwissProt-Human.fasta (the three records shown on slide 24)

32 Exercise 6 6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line

33
fasta = open('SwissProt-Human.fasta')
seqs = open('seqs.txt', 'w')
for line in fasta:
    if line[0] == '>':
        seqs.write('\n')
    elif line[0] != '>':
        seqs.write(line)    # alternative: seqs.write(line.strip() + '\n')
seqs.close()

34 Exercise 7 7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.

35
fasta = open('sprot_prot.fasta')
output = open('homo_sapiens.fasta', 'w')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if "Homo sapiens" in header:
            output.write(header + seq)
        header = line
        seq = ''
if "Homo sapiens" in header:
    output.write(header + seq)
output.close()

36 Exercise 8 8) Read FASTA records from a file and count the cysteine residues in each sequence.

37 Read the records from a file and count the cysteine residues in each sequence
fasta = open('sprot_prot.fasta')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line[4:10]
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        cys_num = seq.count('C')
        print header, ': ', cys_num
        header = line[4:10]
        seq = ''
cys_num = seq.count('C')    # count for the last record, which has no following header
print header, ': ', cys_num
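The record-by-record pattern used in these solutions can also be wrapped in a generator, so that the "last record" case is handled in one place. This is a sketch, not the approach from the slides: the function name `read_fasta` and the toy records are invented for illustration, and the code uses Python 3 syntax.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # last record has no following '>'


# Toy records invented for illustration (not real SwissProt entries).
example = [
    ">sp|X00001|TOY1_HUMAN toy protein 1 OS=Homo sapiens\n",
    "MCCKA\n",
    "WCW\n",
    ">sp|X00002|TOY2_HUMAN toy protein 2 OS=Homo sapiens\n",
    "MAAA\n",
]

cys_counts = {}
for header, seq in read_fasta(example):
    ac = header.split("|")[1]
    cys_counts[ac] = seq.count("C")
    print(ac, cys_counts[ac])
```

An open file can be passed in place of the `example` list, because a file object is also an iterable of lines.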

38 Exercises 9, 10, and 11
9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M').
10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W').
11) Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine ('M') and have at least two tryptophans ('W').

39
outfile = open('SwissProtHuman-Filtered.fasta','w')
fasta = open('SwissProtHuman.fasta','r')
seq = ''
for line in fasta:
    if line[0:1] == '>' and seq == '':
        header = line
    elif line[0:1] != '>':
        seq = seq + line
    elif line[0:1] == '>' and seq != '':
        TRP_num = seq.count('W')
        if seq[0] == 'M' and TRP_num > 1:
            outfile.write(header + seq)
        seq = ''
        header = line
TRP_num = seq.count('W')
if seq[0] == 'M' and TRP_num > 1:
    outfile.write(header + seq)
outfile.close()

40 In many cases you will need to compare data from different files

41 Two input files: SwissProt-Human.fasta (the three records shown on slide 24) and cancer-expressed.txt

42

43
1) Read 10 SwissProt ACs from a file
2) Store them into a data structure
cancer_file = open('cancer-expressed.txt')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
print cancer_list

44 List data structure
A list is a mutable ordered collection of objects
L = [1, [2,3], 4.52, 'DNA']
The elements of a list can be any kind of object:
numbers
strings
tuples
lists
dictionaries
function calls
etc.
L = []    # the empty list

45

46
>>> L = [1, "hello", 12.1, [1, 2, "three"], "seq", (1, 2)]
>>> L[0]       # indexing
1
>>> L[3]       # indexing
[1, 2, 'three']
>>> L[3][2]    # indexing
'three'
>>> L[-1]      # negative indexing
(1, 2)
>>> L[2:4]     # slicing
[12.1, [1, 2, 'three']]
>>> L[2:]      # slicing shorthand
[12.1, [1, 2, 'three'], 'seq', (1, 2)]
>>>

47 The elements of a list can be changed/replaced after the list has been defined
l[i] = x
l[i:j] = t
del l[i:j]
del l[i:j:k]
l.append(x)
l.extend(x)
These operations CHANGE the list
>>> l = [2,3,5,7,8,['a','b'],'a','b','cde']
>>> l[0] = 1
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l[0:3] = 'DNA'
>>> l
['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> del l[0:5]
>>> l
[['a', 'b'], 'a', 'b', 'cde']
>>> l.append('DNA')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA']
>>> l.extend('dna')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a']
>>>

48 The elements of a list can be changed/replaced after the list has been defined
l.count(x)
l.index(x)
l.insert(i, x)
l.pop(i)
l.remove(x)
>>> l = [1,3,5,7,8,['a','b'],'a','b','cde']
>>> l.count('a')
1
>>> l.index(8)
4
>>> l.insert(4, 80)
>>> l
[1, 3, 5, 7, 80, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop(4)
80
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop()
'cde'
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b']
>>> l.remove(8)
>>> l
[1, 3, 5, 7, ['a', 'b'], 'a', 'b']

49 The elements of a list can be changed/replaced after the list has been defined
l.reverse()
l.sort()
sorted(l)
>>> l = [4, 3, 2, 1, 5, 6, 7, 8]
>>> l.reverse()
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> new = sorted(l)
>>> new
[1, 2, 3, 4, 5, 6, 7, 8]
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> l.sort()
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]

50 Putting together lists and loops
range() and xrange() built-in functions
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(0, 30, 5)
[0, 5, 10, 15, 20, 25]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(0, -10, -1)
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
>>> range(0)
[]
>>> range(1, 0)
[]
# the xrange() function is more commonly used in for loops than range()
>>> for i in xrange(5):
...     print i
...
0
1
2
3
4
The xrange() function generates each value on demand, i.e. it does not build and store the whole list in memory
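The slides use Python 2. In Python 3, xrange() no longer exists and range() itself produces its values on demand, so the same idea can be sketched as follows (the variable names are illustrative):

```python
lazy = range(1000000)            # no list of a million integers is built here
print(len(lazy))                 # 1000000
print(lazy[999999])              # 999999

as_list = list(range(0, 30, 5))  # materialise the values only when a list is needed
print(as_list)                   # [0, 5, 10, 15, 20, 25]

total = 0
for i in range(5):               # iterates over 0, 1, 2, 3, 4
    total += i
print(total)                     # 10
```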

51 Exercise 12 12) Create a list containing Uniprot ACs extracted from a FASTA file. Print the list.

52
InputFile = open("SwissProtHuman.fasta","r")
AC_list = []
for line in InputFile:
    if line[0] == '>':
        fields = line.split('|')
        AC_list.append(fields[1])
print AC_list
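To see why `fields[1]` is the accession, it helps to look at what `split('|')` returns for one of the headers shown earlier. A small sketch in Python 3 syntax; the names `entry_name` and `description` are illustrative:

```python
header = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n"

fields = header.strip().split("|")
print(fields[0])  # >sp
print(fields[1])  # P31946

# The third field holds the entry name followed by the free-text description.
entry_name, description = fields[2].split(" ", 1)
print(entry_name)   # 1433B_HUMAN
print(description)  # 14-3-3 protein beta/alpha OS=Homo sapiens
```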

53 By the way…. Exercise 13 13) Read a file in FASTA format and copy to a new file the record ACs.

54 Selectively extract ACs from a FASTA file
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('SwissProt-Human-AC.txt', 'w')    # open the output file in write mode
for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        Outfile.write(AC + '\n')
Outfile.close()

55 Exercise 14 14) Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.

56 Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.
cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer-expressed.fasta','w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()
We are not writing the whole record but the header line only

57 SwissProt-Human.fasta (the three records shown on slide 24)

58 Exercise 15 15) Read a multiple sequence file in FASTA format and write to a new file only the records the Uniprot ACs of which are present in the list created in 12).

59
cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == ">":
        field = line.split("|")
        AC = field[1]
        if AC in cancer_list:
            Outfile.write(line)
    else:
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()

60 The same but with more control…
cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')
cancer_list = []
seq = ''
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)
for line in human_fasta:
    if line[0] == '>' and seq == '':
        header = line
        AC = line.split('|')[1]
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if AC in cancer_list:
            Outfile.write(header+seq)
        header = line
        AC = line.split('|')[1]
        seq = ''
if AC in cancer_list:
    Outfile.write(header+seq)
Outfile.close()
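As a variation on the list-based solutions, the ACs can be stored in a set, which supports the same `in` test and usually gives faster lookups when the list of ACs is long. This is a sketch under stated assumptions rather than the slides' method: the in-memory `io.StringIO` objects stand in for the real files, the accessions and sequences are made up, and the code uses Python 3 syntax.

```python
import io

# In-memory stand-ins for cancer-expressed.txt and SwissProt-Human.fasta;
# the accessions and sequences are made up for illustration.
cancer_file = io.StringIO("X00002\nX00003\n")
human_fasta = io.StringIO(
    ">sp|X00001|TOY1_HUMAN toy 1\nMAAA\n"
    ">sp|X00002|TOY2_HUMAN toy 2\nMCCC\nWW\n"
    ">sp|X00003|TOY3_HUMAN toy 3\nMKKK\n"
)
outfile = io.StringIO()

# A set gives the same `in` test as a list.
cancer_acs = set(line.strip() for line in cancer_file if line.strip())

keep = False
for line in human_fasta:
    if line.startswith(">"):
        # Decide once per record, at the header line.
        keep = line.split("|")[1] in cancer_acs
    if keep:
        outfile.write(line)

result = outfile.getvalue()
print(result)
```

The `keep` flag plays the same role as re-testing `AC in cancer_list` on every sequence line: the decision taken at the header applies to all lines of that record.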

61

62 Extract and write to a file the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap006852.gbk)
Try to write it in FASTA format:
>AP006852
Ccactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaa
agtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatcc
atctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaa
cacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaa
Gtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga...

63 Exercise 16 16) Read a Genbank record and write to a file the nucleotide sequence in FASTA format.

64
InputFile = open("ap006852.gbk")
OutputFile = open("ap006852.fasta","w")
flag = 0
for line in InputFile:
    if line[0:9] == 'ACCESSION':
        AC = line.split()[1].strip()
        OutputFile.write('>'+AC+'\n')
    if line[0:6] == 'ORIGIN':
        flag = 1
        continue
    if flag == 1:
        fields = line.split()
        if fields != []:
            seq = ''.join(fields[1:])
            OutputFile.write(seq +'\n')
InputFile.close()
OutputFile.close()
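The same flag-based logic can be tried without the real file by wrapping it in a function that works on a list of lines. In this sketch the function name `genbank_to_fasta` and the shortened record are made up for illustration, the `//` end-of-record line is skipped explicitly, and the code uses Python 3 syntax.

```python
def genbank_to_fasta(lines):
    """Return FASTA text built from the ACCESSION and ORIGIN parts of a record."""
    out = []
    in_origin = False
    for line in lines:
        if line.startswith("ACCESSION"):
            out.append(">" + line.split()[1])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False  # end-of-record marker
        elif in_origin:
            fields = line.split()
            if fields:
                out.append("".join(fields[1:]))  # drop the leading position number
    return "\n".join(out) + "\n"


# A heavily shortened, made-up record in GenBank layout.
record = [
    "LOCUS       TOY0001   20 bp    DNA\n",
    "ACCESSION   TOY0001\n",
    "ORIGIN\n",
    "        1 ccactgtcca atggctcaac\n",
    "//\n",
]
fasta_text = genbank_to_fasta(record)
print(fasta_text)
```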

65 Parsing data records
Start by visually inspecting the file you want to parse
Identify the information you want to extract
Identify separators to select your information using if conditions
Use lists if you have to compare data from different files

66 We can use while loops to read files (but usually we won't do it)
cancer_file = open('cancer-expressed.txt')
cancer_list = []
line = cancer_file.readline()
while line:
    AC = line.strip()
    cancer_list.append(AC)
    line = cancer_file.readline()

67 You can repeat all exercises using ncbi_gene.fasta as input file

68 Summary
Parsing sequence records in FASTA format
Lists
Making choices: if/elif/else
range() and xrange()

