
1 Python

2 What is Biopython?
Biopython is a Python library of resources for developers of Python-based software for bioinformatics and research.
- It can parse bioinformatics files (FASTA, GenBank, BLAST output, ClustalW, etc.) into local data structures.
- It can access many resources (web databases, NCBI) directly from within a script.
- It works with sequences and records.
- It offers many search algorithms, comparative algorithms and format options.

3 Installing BioPython
Comes with Anaconda, or can be added with conda install biopython (pip install biopython also works). You still need the import statements shown on the following slides. If you use the standard IDLE environment you will need to install Biopython yourself so that it sits on Python's module search path. Bioinformatics has become so important in recent years that almost every programming environment (C++, Perl, etc.) has its own bioinformatics libraries.

4 Sequence objects
Biological sequences are the main point of interest in bioinformatics processing. Biopython provides a special datatype for them, the Seq class. Seq objects are not the same as Python strings: they are strings together with additional information, such as an alphabet, and a variety of methods such as translate(), reverse_complement() and so on.
dna = 'AGTACACTGGT'   # this is a pure string
# Here is how you create a sequence object.
from Bio.Seq import Seq
from Bio.Alphabet import Alphabet
seqdna = Seq('AGTACACTGGT', Alphabet())   # sequence object
Note that seqdna is a sequence object, not just a string.

5 Alphabets - see IUPAC (International Union of Pure and Applied Chemistry)
An alphabet is just the set of allowable characters that may be used in the string.
- IUPAC.unambiguous_dna is the set {A,C,G,T} of nucleotides.
- IUPAC.unambiguous_rna is {A,C,G,U}.
- IUPAC.protein is the 20 standard amino acids {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}.
Other alphabets are available as well. We will use mainly the {A,C,G,T} DNA set. Alphabets are handy for type checking our sequences.
Note: Bio.Alphabet was removed in Biopython 1.78. With that release or later, create the Seq from the string alone and drop the alphabet argument.
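The "type checking" idea can be sketched in plain Python without any Biopython import. The letter strings below are copied from the alphabet dump in this deck; the function name is_valid is a hypothetical helper, not a Biopython call.

```python
# Letter sets as printed by the IUPAC alphabets in this deck.
UNAMBIGUOUS_DNA = "GATC"
AMBIGUOUS_DNA = "GATCRYWSMKHBVDN"


def is_valid(sequence, letters=UNAMBIGUOUS_DNA):
    """Return True if every character of sequence is in the allowed letters."""
    return set(sequence.upper()) <= set(letters)


print(is_valid("GATCG"))                 # only A, C, G, T
print(is_valid("GAUC"))                  # U is not a DNA letter
print(is_valid("GATN", AMBIGUOUS_DNA))   # N is allowed in the ambiguous set
```

A set comparison keeps the check independent of sequence order and length.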

6 Dumping Alphabets
from Bio.Alphabet import IUPAC
print(IUPAC.unambiguous_dna.letters)
print(IUPAC.ambiguous_dna.letters)
print(IUPAC.unambiguous_rna.letters)
print(IUPAC.protein.letters)
OUTPUT
GATC
GATCRYWSMKHBVDN
GAUC
ACDEFGHIKLMNPQRSTVWY

7 Can work with Sequence objects like strings
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqUtils import GC
my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
print(my_seq[0])                 # prints the first letter
print(len(my_seq))               # prints the length of the string in the sequence
print(Seq("AAAA").count("AA"))   # non-overlapping count, i.e. 2
print(GC(my_seq))                # gives the GC % of the sequence
print(my_seq[2:5])               # we can even slice them; returns a Seq
# convert the Seq object to a pure string object
dna_string = str(my_seq)
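For comparison, the same operations on a pure Python string might look like the sketch below. gc_percent is a hypothetical helper written for this illustration; it only counts G and C and is not Biopython's own GC implementation.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a nucleotide string."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100.0 * gc_count / len(upper)


dna_string = "GATCG"
print(dna_string[0])            # first letter
print(len(dna_string))          # length
print("AAAA".count("AA"))       # str.count is non-overlapping as well
print(gc_percent(dna_string))   # 3 of the 5 letters are G or C
print(dna_string[2:5])          # slicing a str returns a str
```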

8 MutableSeq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
Observe what happens if you try to edit the sequence:
>>> my_seq[5] = "G"
Traceback (most recent call last):
...
TypeError: 'Seq' object does not support item assignment
However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it:
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

9 We can modify these
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())

10 Nucleotide sequences and (reverse) complements
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
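To see what these two methods compute, here is a minimal plain-Python sketch for unambiguous DNA only. It uses str.maketrans from the standard library; the function names mirror the Biopython methods but are defined locally and do not handle ambiguity codes.

```python
# Mapping of each unambiguous DNA base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap every base for its partner, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(my_seq))
print(reverse_complement(my_seq))
```

Applying reverse_complement twice returns the original string, which is a quick sanity check for any implementation.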

11 Reversing a Sequence
An easy way to just reverse a Seq object (or a Python string) is to slice it with a -1 step.
# FORWARD
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
# BACKWARD (using a -1 step slice)
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())

12 Double Stranded DNA
DNA coding strand (aka Crick strand, strand +1)
5' ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3'
   |||||||||||||||||||||||||||||||||||||||
3' TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5'
DNA template strand (aka Watson strand, strand −1)

13 Transcription
5' ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3'
   |||||||||||||||||||||||||||||||||||||||
3' TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5'
Transcription
5' AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3'
Single stranded messenger RNA

14 Let's do some Reverse Comp
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
template_dna = coding_dna.reverse_complement()
print(template_dna)
CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT

15 Transcribe (T -> U)
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
messenger_rna = coding_dna.transcribe()
print(messenger_rna)
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
# or start from the template strand and do both steps: reverse complement, then transcribe
template_dna = coding_dna.reverse_complement()
messenger_rna = template_dna.reverse_complement().transcribe()
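The two routes to the messenger RNA can be checked against each other in plain Python. This is a sketch for unambiguous DNA strings; reverse_complement and transcribe are local helper functions, not the Biopython methods themselves.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base and reverse the reading direction."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_dna):
    """Transcription of the coding strand is a plain T -> U substitution."""
    return coding_dna.replace("T", "U")


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template_dna = reverse_complement(coding_dna)

# Route 1: substitute directly on the coding strand.
mrna_from_coding = transcribe(coding_dna)
# Route 2: start from the template strand, recover the coding strand, substitute.
mrna_from_template = transcribe(reverse_complement(template_dna))

print(mrna_from_coding)
print(mrna_from_coding == mrna_from_template)
```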

16 Translate into protein
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
print(messenger_rna)
print(messenger_rna.translate())
# I added the spaces
AUG GCC AUU GUA AUG GGC CGC UGA AAG GGU GCC CGA UAG
MAIVMGR*KGAR*
# the * represents stop codons.
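The spaced-out codon view above can be produced mechanically. The sketch below splits an RNA string into triplets and reports where the stop codons sit; split_codons and RNA_STOP_CODONS are hypothetical names for this illustration, and the stop set assumes the standard genetic code.

```python
# Stop codons of the standard genetic code, written as RNA.
RNA_STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_codons(rna):
    """Cut the sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


messenger_rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
codons = split_codons(messenger_rna)
stop_positions = [i for i, codon in enumerate(codons) if codon in RNA_STOP_CODONS]

print(" ".join(codons))
print(stop_positions)   # codon indices (counting from zero) that encode a stop
```

The two stop positions line up with the two * characters in the translated protein.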

17 Standard Translation Table

18 Printing Tables
from Bio.Data import CodonTable
stdTable = CodonTable.unambiguous_dna_by_id[1]
print(stdTable)
mitoTable = CodonTable.unambiguous_dna_by_id[2]
print(mitoTable)

19
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
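The two tables differ in only a handful of codons, which a small plain-Python sketch makes visible. The dictionaries below are built by hand from the printed tables (start-codon markers are ignored); translate is a local helper, not the Biopython method, and simply looks up each triplet.

```python
BASES = "TCAG"
# Amino acids of Table 1 in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
standard_table = dict(zip(CODONS, STANDARD_AMINO))

# Table 2 (vertebrate mitochondrial) as a set of overrides on Table 1.
mito_table = dict(standard_table)
mito_table.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})

differences = {
    codon: (standard_table[codon], mito_table[codon])
    for codon in CODONS
    if standard_table[codon] != mito_table[codon]
}


def translate(dna, table):
    """Translate complete codons of a DNA string using the given lookup table."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(differences)
print(translate(coding_dna, standard_table))
print(translate(coding_dna, mito_table))
```

With the mitochondrial table the internal TGA codon is read as tryptophan, so the first stop disappears from the protein.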

20 Codon - Amino Acids
Amino Acid      SLC    DNA codons
Isoleucine      I      ATT, ATC, ATA
Leucine         L      CTT, CTC, CTA, CTG, TTA, TTG
Valine          V      GTT, GTC, GTA, GTG
Phenylalanine   F      TTT, TTC
Methionine      M      ATG
Cysteine        C      TGT, TGC
Alanine         A      GCT, GCC, GCA, GCG
Glycine         G      GGT, GGC, GGA, GGG
Proline         P      CCT, CCC, CCA, CCG
Threonine       T      ACT, ACC, ACA, ACG
Serine          S      TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y      TAT, TAC
Tryptophan      W      TGG
Glutamine       Q      CAA, CAG
Asparagine      N      AAT, AAC
Histidine       H      CAT, CAC
Glutamic acid   E      GAA, GAG
Aspartic acid   D      GAT, GAC
Lysine          K      AAA, AAG
Arginine        R      CGT, CGC, CGA, CGG, AGA, AGG
Stop codons     Stop   TAA, TAG, TGA

21 The SeqRecord Object
A SeqRecord is a structure that allows the storage of additional information with a sequence. This includes the usual information found in standard GenBank files. The following is a sample of the fields in SeqRecord:
.seq - The sequence
.id - The primary ID used to identify the sequence (string)
.name - The common name of the sequence
.annotations - A dictionary of additional information about the sequence
.features - A list of SeqFeature objects
and others.

22 Build From Scratch
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
You can pass the id, description etc. to the constructor, or set them afterwards:
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print(simple_seq_r.description)
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
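As a mental model of what a SeqRecord bundles together, here is a small stand-in built with a standard-library dataclass. SimpleRecord is a hypothetical class for illustration only; the empty-string defaults are an assumption of this sketch and the real SeqRecord has more fields and behavior.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleRecord:
    """Toy record: a sequence plus the descriptive fields listed in this deck."""
    seq: str
    id: str = ""            # hypothetical default for this sketch
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)   # a dict, one per record
    features: list = field(default_factory=list)      # a list, one per record


simple_seq_r = SimpleRecord("GATC")
simple_seq_r.id = "AC12345"
simple_seq_r.description = "Made up sequence I wish I could write a paper about"
simple_seq_r.annotations["source"] = "made up"   # hypothetical annotation key

print(simple_seq_r.id)
print(simple_seq_r.annotations)
print(simple_seq_r.features)
```

Using default_factory gives every record its own dictionary and list, so annotating one record never leaks into another.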

23 Fill SeqRecord from FASTA file
>gi| |ref|NC_ | Yersinia pestis biovar Microtus ... pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
>>> from Bio import SeqIO
# Note that SeqIO.read will only read one record.
>>> record = SeqIO.read("NC_ fna", "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi| |ref|NC_ |', name='gi| |ref|NC_ |', description='gi| |ref|NC_ | Yersinia pestis biovar Microtus ... sequence', dbxrefs=[])
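The FASTA layout itself is simple enough to read by hand, which helps in understanding what SeqIO does with it. The sketch below is a minimal stand-in parser, not Biopython's; parse_fasta and the two example records are made up for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up data standing in for a FASTA file on disk.
fasta_text = ">example_1 made-up record\nTGTAACGAAC\nGGTGCAATAG\n>example_2\nGATC\n"
records = list(parse_fasta(io.StringIO(fasta_text)))

for header, sequence in records:
    record_id = header.split()[0]   # first word of the header line
    print(record_id, len(sequence))
```

Taking the first word of the header as the identifier matches what the id and description fields show above: the id is the leading token, the description is the whole line.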

24 Access the fields individually
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())
>>> record.id
'gi| |ref|NC_ |'
>>> record.description
'gi| |ref|NC_ | Yersinia pestis biovar Microtus ... pPCP1, complete sequence'

25 These are missing
A FASTA file carries no annotation, so these fields are empty:
>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]
Note which is a dict and which is a list!

26 Reading GenBank Files
LOCUS NC_ bp DNA circular BCT 21-JUL-2008
DEFINITION Yersinia pestis biovar Microtus str plasmid pPCP1, complete sequence.
ACCESSION NC_
VERSION NC_ GI:
PROJECT GenomeProject:10638

27 Read the File
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_ gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_ ', name='NC_005816', description='Yersinia pestis biovar Microtus str plasmid pPCP1, complete sequence.', dbxrefs=['Project:10638'])

28 Read a record
from Bio import SeqIO
record = SeqIO.read("micoplasmaGen.gb", "genbank")
print(record.description)
# count the features annotated as genes
ct = 0
for f in record.features:
    if f.type == 'gene':
        ct += 1
print(ct)
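The counting loop generalizes to a tally of every feature type at once. The sketch below uses collections.Counter on stand-in objects; Feature is a hypothetical namedtuple with a .type attribute, playing the role of the entries in record.features, and the feature list is invented.

```python
from collections import Counter, namedtuple

# Stand-in for the feature objects of a parsed GenBank record.
Feature = namedtuple("Feature", ["type", "start", "end"])

features = [
    Feature("source", 0, 100),
    Feature("gene", 10, 40),
    Feature("CDS", 10, 40),
    Feature("gene", 50, 90),
    Feature("CDS", 50, 90),
]

type_counts = Counter(f.type for f in features)
gene_count = type_counts["gene"]

print(gene_count)
print(type_counts.most_common())
```

A Counter returns 0 for types that never occur, so asking for a missing type does not raise an error.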

29 Chapter 5 : SeqIO
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
SeqIO.parse returns an iterator over the records in the file.
Or, using a list comprehension:
identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
This builds a list of the identifiers that are in the file.

30 Explicit iteration
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
first_record = next(record_iterator)
print(first_record.id)
print(first_record.description)
second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)

31 Getting List of Records
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))
print("Found %i records" % len(records))
print("The last record")
last_record = records[-1]   # using Python's list tricks
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))
print("The first record")
first_record = records[0]   # remember, Python counts from zero
print(first_record.id)
print(repr(first_record.seq))
print(len(first_record))

32 Parsing Sequences from the net
from Bio import Entrez
from Bio import SeqIO
Entrez. =
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=" ") as handle:
    seq_record = SeqIO.read(handle, "gb")   # using "gb" as an alias for "genbank"
print("%s with %i features" % (seq_record.id, len(seq_record.features)))
AF with 3 features

33 Multiple Sequence Alignment objects
This is an object that holds an alignment of multiple sequences. The alignment contains gap characters, including leading or trailing gaps, so that all the strings have the same length. Bio.AlignIO is used to read and write MultipleSeqAlignment objects. There are two functions for reading in sequence alignments, Bio.AlignIO.read() and Bio.AlignIO.parse(), which, following the convention introduced in Bio.SeqIO, are for files containing one or multiple alignments respectively. Note: these do not perform alignments; they only read, write and operate on them.

34 Example Alignment
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
print(alignment)
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA

35 Each row in the alignment is a record (SeqRecord)!
for record in alignment:
    print("%s - %s" % (record.seq, record.id))
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
Here we can format the alignment as we see fit. Each row is a SeqRecord object.

36 Build one from scratch
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
align1 = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
    SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
    SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
])
align2 = MultipleSeqAlignment([
    SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
    SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
    SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
])
align3 = MultipleSeqAlignment([
    SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
    SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
    SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
])
my_alignments = [align1, align2, align3]   # list of alignments
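Because every row of an alignment has the same length, the rows can be read column by column. The sketch below does this on plain strings copied from align1; it is an illustration with zip, not the MultipleSeqAlignment API, and the variable names are made up.

```python
# The three rows of align1 as plain strings, keyed by their ids.
rows = {
    "Alpha": "ACTGCTAGCTAG",
    "Beta":  "ACT-CTAGCTAG",
    "Gamma": "ACTGCTAGDTAG",
}

# An alignment is only valid if all rows have equal length.
lengths = {len(sequence) for sequence in rows.values()}
assert len(lengths) == 1, "rows of an alignment must have equal length"

# zip(*rows) turns the rows into a list of columns.
columns = list(zip(*rows.values()))

# A column is variable when it holds more than one distinct character.
variable_columns = [i for i, column in enumerate(columns) if len(set(column)) > 1]

print(len(columns))        # number of alignment columns
print(columns[3])          # the column holding the gap
print(variable_columns)    # positions where the rows disagree
```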

37 Alignment tools and process
There are many tools out there for creating alignments, both pairwise and multiple sequence alignments. If a tool has a Python wrapper then it is easy to use within Python scripts. The general method is:
1. Prepare an input file of your unaligned sequences, typically a FASTA file which you might create using Bio.SeqIO (see Chapter 5).
2. Call the command line tool to process this input file, typically via one of Biopython's command line wrappers (which we'll discuss here).
3. Read the output from the tool, i.e. your aligned sequences, typically using Bio.AlignIO (see earlier in this chapter).
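Step 1 of the process above, preparing a FASTA input file, can be sketched without Biopython. write_fasta is a hypothetical helper and the two sequences are invented; in practice Bio.SeqIO would normally do this job, and steps 2 and 3 depend on the alignment tool you choose.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs as FASTA, wrapping sequence lines."""
    for identifier, sequence in records:
        handle.write(">%s\n" % identifier)
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Made-up unaligned sequences; a StringIO stands in for a file opened with open(..., "w").
unaligned = [("seq_1", "ACGT" * 20), ("seq_2", "ACGTTGCA")]
buffer = io.StringIO()
write_fasta(unaligned, buffer)

print(buffer.getvalue())
```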

38 The Zika virus
The Zika virus received a lot of press last year.
- Zika is spread mostly by the bite of an infected Aedes species mosquito (Ae. aegypti and Ae. albopictus). These mosquitoes bite during the day and night.
- Zika can be passed from a pregnant woman to her fetus. Infection during pregnancy can cause certain birth defects.
- There is no vaccine or medicine for Zika.
- Local mosquito-borne Zika virus transmission has been reported in the continental United States.
Here is a specific isolate with its large polyprotein.

39 Zika Study
This virus has undergone mutation, so let's find different examples and study them. Download as many different isolates as you can find. We will attempt to determine their evolutionary history. Once you have these examples, apply the alignment tools process discussed on a previous slide.
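Once the isolates are aligned, one simple first look at how they relate is the proportion of differing sites between each pair (often called the p-distance). The sketch below is one possible starting point under stated assumptions, not the prescribed method for this study: the isolate names and sequences are invented, and positions with a gap in either sequence are skipped.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of compared alignment columns at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue   # ignore columns with a gap in either sequence
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


# Invented aligned sequences standing in for downloaded isolates.
aligned = {
    "isolate_A": "ACTGCTAGCTAG",
    "isolate_B": "ACT-CTAGCTAG",
    "isolate_C": "ACTGCTAGGTAG",
}

distances = {
    (first, second): p_distance(aligned[first], aligned[second])
    for first, second in combinations(sorted(aligned), 2)
}

for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

Pairs with small distances are candidates for close relatives; tree-building methods take a distance matrix like this one as input.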

