1 Biopython. Programming for Engineers in Python

2 Classes
    class <name>:
        statement_1
        ...
        statement_n
The methods of a class get the instance as their first parameter, traditionally named self.
The method __init__ is called upon object construction (if available).

3 Classes
Reminder: type = data representation + behavior. Classes are user-defined types.
    class <name>:
        statement_1
        ...
        statement_n
The class body is like a mini-program: it can hold variables, function definitions, and even arbitrary commands.
Objects of a class are called class instances.

4 Classes – Attributes and Methods
    class Vector2D:
        def __init__(self, x, y):
            # Instance attributes (each instance has its own copy)
            self.x, self.y = x, y
        def size(self):
            # Method
            return (self.x ** 2 + self.y ** 2) ** 0.5

5 Classes – Instantiate and Use
    >>> v = Vector2D(3, 4)   # Make instance.
    >>> v
    >>> v.size()             # Call method on instance.
    5.0

6 Example – Multimap
A multimap is a dictionary with more than one value for each key. We already needed one once or twice and used:
    >>> lst = d.get(key, [])
    >>> lst.append(value)
    >>> d[key] = lst
We will now write a new class that is a wrapper around a dict. The class will have methods that allow us to keep multiple values for each key.

7 Multimap: partial code
    class Multimap:
        def __init__(self):
            '''Create an empty Multimap'''
            self.inner = {}
        def get(self, key):
            '''Return list of values associated with key'''
            return self.inner.get(key, [])
        def put(self, key, value):
            '''Adds value to the list of values associated with key'''
            value_list = self.get(key)
            if value not in value_list:
                value_list.append(value)
                self.inner[key] = value_list

8 Multimap: put_all and remove
        def put_all(self, key, values):
            for v in values:
                self.put(key, v)
        def remove(self, key, value):
            value_list = self.get(key)
            if value in value_list:
                value_list.remove(value)
                self.inner[key] = value_list
                return True
            return False

9 Multimap: partial code
        def __len__(self):
            '''Returns the number of keys in the map'''
            return len(self.inner)
        def __str__(self):
            '''Converts the map to a string'''
            return str(self.inner)
        def __cmp__(self, other):
            '''Compares the map with another map'''
            return cmp(self.inner, other.inner)
        def __contains__(self, key):
            '''Returns True if key exists in the map'''
            return key in self.inner

10 Multimap
Use case: a dictionary of countries and their cities.
    >>> m = Multimap()
    >>> m.put('Israel', 'Tel-Aviv')
    >>> m.put('Israel', 'Jerusalem')
    >>> m.put('France', 'Paris')
    >>> m.put_all('England', ('London', 'Manchester', 'Moscow'))
    >>> m.remove('England', 'Moscow')
    >>> print m.get('Israel')
    ['Tel-Aviv', 'Jerusalem']
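The Multimap pieces from the previous slides can be collected into one runnable class. The following is a minimal sketch, assuming Python 3 (where __cmp__ and dict.has_key no longer exist, so __eq__ and the in operator are used instead). It is one possible consolidation, not the course's reference solution.

```python
class Multimap:
    """A dict wrapper that keeps a list of distinct values per key."""

    def __init__(self):
        self.inner = {}

    def get(self, key):
        # Return a copy so callers cannot change the stored list by accident.
        return list(self.inner.get(key, []))

    def put(self, key, value):
        values = self.inner.setdefault(key, [])
        if value not in values:
            values.append(value)

    def put_all(self, key, values):
        for v in values:
            self.put(key, v)

    def remove(self, key, value):
        values = self.inner.get(key, [])
        if value in values:
            values.remove(value)
            return True
        return False

    def __len__(self):
        return len(self.inner)

    def __contains__(self, key):
        return key in self.inner

    def __eq__(self, other):
        return isinstance(other, Multimap) and self.inner == other.inner


m = Multimap()
m.put('Israel', 'Tel-Aviv')
m.put('Israel', 'Jerusalem')
m.put_all('England', ('London', 'Manchester', 'Moscow'))
m.remove('England', 'Moscow')
print(m.get('Israel'))
```

Using setdefault in put avoids the get/append/store sequence, and returning a copy from get keeps the internal lists private to the class.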

12 Biopython
An international association of developers of freely available Python tools for computational molecular biology.
Provides tools for: parsing files (FASTA, ClustalW, GenBank, …); interfaces to common software; operations on sequences; simple machine learning applications; BLAST; and many more.

13 Installing Biopython
Go to the Biopython download page and pick the Windows or Unix package. Select Python 2.7. NumPy is required.

14 SeqIO
The standard Sequence Input/Output interface for Biopython. It provides a simple, uniform interface to input and output assorted sequence file formats, and deals with sequences as SeqRecord objects.
There is a sister interface, Bio.AlignIO, for working directly with sequence alignment files as Alignment objects.

15 Parsing a FASTA file
    # Parse a simple fasta file
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)
Why repr and not str?

17 GenBank files
    # genbank files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record
        # added to print just one record example
        break

18 GenBank files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

19 Sequence objects
Support similar methods to standard strings. Provide additional methods: translate, reverse complement. Support different alphabets: AGTAGTTAAA can be DNA or protein.

20 Sequences and alphabets Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions For example: Adding ambiguous symbols Adding special new characters 20

21 Example – generic alphabet
    >>> from Bio.Seq import Seq
    >>> my_seq = Seq("AGTACACTGGT")
    >>> my_seq
    Seq('AGTACACTGGT', Alphabet())
    >>> my_seq.alphabet
    Alphabet()
Alphabet() is the non-specific alphabet.

22 Example – specific sequences
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACUnambiguousDNA())
    >>> my_seq.alphabet
    IUPACUnambiguousDNA()
    >>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
    >>> my_prot
    Seq('AGTACACTGGT', IUPACProtein())
    >>> my_prot.alphabet
    IUPACProtein()

23 Sequences act like strings
Access elements:
    >>> print my_seq[0]    # first letter
    G
    >>> print my_seq[2]    # third letter
    T
    >>> print my_seq[-1]   # last letter
    G
Count without overlaps:
    >>> from Bio.Seq import Seq
    >>> "AAAA".count("AA")
    2
    >>> Seq("AAAA").count("AA")
    2
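"Count without overlaps" means that matches are counted left to right and never share characters. The sketch below contrasts this with an overlapping count in plain Python (no Biopython needed). The helper name count_overlapping is hypothetical and introduced only for illustration.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Restart the search one character after the previous match start.
        start = text.find(sub, start + 1)
    return count


non_overlapping = "AAAA".count("AA")          # matches at positions 0 and 2
overlapping = count_overlapping("AAAA", "AA")  # matches at positions 0, 1 and 2
print(non_overlapping, overlapping)
```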

24 Calculate GC content
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.SeqUtils import GC
    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
    >>> GC(my_seq)
    46.875
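For intuition, the GC percentage is simply the share of G and C letters in the sequence. A minimal plain-Python sketch follows; gc_percent is a hypothetical helper, and unlike a full library routine it ignores ambiguity codes.

```python
def gc_percent(seq):
    """Percentage of G and C letters in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Same sequence as on the slide: 15 of its 32 letters are G or C.
print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
```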

25 Slicing
Simple slicing, and slicing with start, stop, stride:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    >>> my_seq[4:12]
    Seq('GATGGGCC', IUPACUnambiguousDNA())
    >>> my_seq[0::3]
    Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
    >>> my_seq[1::3]
    Seq('AGGCATGCATC', IUPACUnambiguousDNA())
    >>> my_seq[2::3]
    Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

26 Concatenation
Simple addition, as in Python. But the alphabets must fit:
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.Seq import Seq
    >>> protein_seq = Seq("EVRNAK", IUPAC.protein)
    >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> protein_seq + dna_seq
    Traceback (most recent call last):
    …

27 Changing case
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import generic_dna
    >>> dna_seq = Seq("acgtACGT", generic_dna)
    >>> dna_seq
    Seq('acgtACGT', DNAAlphabet())
    >>> dna_seq.upper()
    Seq('ACGTACGT', DNAAlphabet())
    >>> dna_seq.lower()
    Seq('acgtacgt', DNAAlphabet())

28 Changing case
Case is important for matching:
    >>> "GTAC" in dna_seq
    False
    >>> "GTAC" in dna_seq.upper()
    True
IUPAC names are upper case:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> dna_seq
    Seq('ACGT', IUPACUnambiguousDNA())
    >>> dna_seq.lower()
    Seq('acgt', DNAAlphabet())

29 Reverse complement
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    >>> my_seq.complement()
    Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
    >>> my_seq.reverse_complement()
    Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
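What these two methods compute can be sketched in a few lines of plain Python. This is an illustration for unambiguous upper-case DNA only, not Biopython's implementation; the names COMPLEMENT, complement and reverse_complement are chosen here for the example.

```python
# Watson-Crick pairing for unambiguous DNA.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Replace every base by its pairing partner, keeping the order."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    """Complement the sequence and read it in the opposite direction."""
    return complement(seq)[::-1]


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(my_seq))
print(reverse_complement(my_seq))
```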

30 Transcription
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> template_dna = coding_dna.reverse_complement()
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
As you can see, all this does is switch T → U and adjust the alphabet.

31 Translation: simple example
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
    >>> messenger_rna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
The '*' characters mark stop codons.

32 Translation from the DNA
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

33 Using different translation tables
In several cases we may want to use a different translation table. Translation tables are given IDs in GenBank (standard = 1); Vertebrate Mitochondrial is table 2.

34 Using different translation tables
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table="Vertebrate Mitochondrial")
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table=2)
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

35 Translation tables in biopython 35

36 Translate up to the first stop in frame
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(to_stop=True)
    Seq('MAIVMGR', IUPACProtein())
    >>> coding_dna.translate(table=2)
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table=2, to_stop=True)
    Seq('MAIVMGRWKGAR', IUPACProtein())
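The transcription and translation steps of the last few slides can be mimicked in plain Python to see what the table and to_stop options do. This is a rough sketch, not Biopython's code: the function names and the TABLES dictionary are made up for the example, only tables 1 and 2 are included, and only unambiguous upper-case DNA is handled.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, for codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(overrides=None):
    """Build a codon -> amino acid dict from the standard code plus overrides."""
    table = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD)}
    table.update(overrides or {})
    return table


TABLES = {
    1: codon_table(),  # standard code
    # Vertebrate mitochondrial code: four codons differ from the standard one.
    2: codon_table({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}),
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: switch T to U."""
    return dna.replace("T", "U")


def translate(dna, table=1, to_stop=False):
    """Translate DNA codon by codon; optionally stop at the first stop codon."""
    codons = TABLES[table]
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = codons[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(transcribe(coding_dna))
print(translate(coding_dna))
print(translate(coding_dna, table=2, to_stop=True))
```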

37 Comparing sequences
The standard "==" comparison is done by comparing the references (!), hence:
    >>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> seq1==seq2
    Warning (from warnings module): …
    FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception)… please use str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
    False
    >>> seq1==seq1
    True

38 Mutable vs. Immutable
Like strings, standard Seq objects are immutable. If you want a mutable object, create it in one of two ways: use the tomutable() method, or use the mutable constructor:
    mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

39 Unknown sequences example
In many biological cases we deal with unknown sequences.
    >>> from Bio.Seq import Seq, UnknownSeq
    >>> from Bio.Alphabet import IUPAC
    >>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
    >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
    >>> unk_dna+my_seq
    Seq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())

40 MSA

41 Read MSA
Use AlignIO.read(file, format). file is the file path. Supported formats include "stockholm", "fasta", "clustal", … Use help(AlignIO) for details.

42 Example We want to parse this file from PFAM 42

43 Example
    from Bio import AlignIO
    alignment = AlignIO.read("PF05371.sth", "stockholm")
    print alignment

44 Alignment object example
    >>> from Bio import AlignIO
    >>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
    >>> print alignment[1]
    ID: Q9T0Q8_BPIKE/1-52
    Name: Q9T0Q8_BPIKE
    Description: Q9T0Q8_BPIKE/1-52
    Number of features: 0
    /start=1
    /end=52
    /accession=Q9T0Q8.1
    Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())


46 Cross-references example
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure?
    >>> for record in alignment:
    ...     if record.dbxrefs:
    ...         print record.id, record.dbxrefs
    COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
    COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
    Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
    COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

47 Comments
Remember that almost all MSA formats are supported. When you have more than one MSA in a file, use AlignIO.parse(). A common example is PHYLIP's output: use AlignIO.parse("resampled.phy", "phylip"). The result is an iterator object that yields all the MSAs.

48 Write alignment to file
    from Bio.Alphabet import generic_dna
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align import MultipleSeqAlignment
    align1 = MultipleSeqAlignment([
        SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
        SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
        SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
    ])
    from Bio import AlignIO
    AlignIO.write(align1, "my_example.phy", "phylip")
PHYLIP file contents shown on the slide (three alignments):
     3 12
    Alpha      ACTGCTAGCT AG
    Beta       ACT-CTAGCT AG
    Gamma      ACTGCTAGDT AG
     3 9
    Delta      GTCAGC-AG
    Epislon    GACAGCTAG
    Zeta       GTCAGCTAG
     3 13
    Eta        ACTAGTACAG CTG
    Theta      ACTAGTACAG CT-
    Iota       -CTACTACAG GTG

49 Slicing
Alignments work like numpy matrices.
    >>> print alignment[2,6]
    T
    # You can pull out a single column as a string like this:
    >>> print alignment[:,6]
    TTT---T
    >>> print alignment[3:6,:6]
    SingleLetterAlphabet() alignment with 3 rows and 6 columns
    AEGDDP COATB_BPM13/24-72
    AEGDDP COATB_BPZJ2/1-49
    AEGDDP Q9T0Q9_BPFD/1-49
    >>> print alignment[:,:6]
    SingleLetterAlphabet() alignment with 7 rows and 6 columns
    AEPNAA COATB_BPIKE/30-81
    AEPNAA Q9T0Q8_BPIKE/1-52
    DGTSTA COATB_BPI22/32-83
    AEGDDP COATB_BPM13/24-72
    AEGDDP COATB_BPZJ2/1-49
    AEGDDP Q9T0Q9_BPFD/1-49
    FAADDA COATB_BPIF1/22-73

50 External applications
How do we call MSA algorithms on an unaligned set of sequences? Biopython provides wrappers. The idea: create a command line object with the algorithm options, then invoke the command (Python uses subprocesses).
The Bio.Align.Applications module:
    >>> import Bio.Align.Applications
    >>> dir(Bio.Align.Applications)
    ['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline']

51 ClustalW example
First step: download ClustalW. Second step: install it. Third step: look for the ClustalW exe files. Now you can run ClustalW from your Python code.

52 Run example
    >>> import os
    >>> from Bio.Align.Applications import ClustalwCommandline
    >>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
    >>> clustalw_cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")
    >>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
    >>> stdout, stderr = clustalw_cline()
The command line object is actually a function we can run!


54 ClustalW - tree
In case you are interested, the opuntia.dnd file ClustalW creates is just a standard Newick tree file, and Bio.Phylo can parse these:
    >>> from Bio import Phylo
    >>> tree = Phylo.read("opuntia.dnd", "newick")
    >>> Phylo.draw_ascii(tree)

55 BLAST

56 Running BLAST over the internet
We use the function qblast() in the Bio.Blast.NCBIWWW module. It has three non-optional arguments:
The BLAST program to use for the search, as a lower-case string: blastn, blastp, blastx, tblastn or tblastx.
The database to search against. The options are listed on the NCBI web pages.
A string containing your query sequence. This can be the sequence itself, the sequence in FASTA format, or an identifier such as a GI number.

57 qblast additional parameters qblast can receive other parameters, analogous to the parameters of the actual server Important examples: format_type: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser (see next examples) expect sets the expectation or e-value threshold. 57

58 Step 1: call BLAST
    >>> from Bio.Blast import NCBIWWW
    # Option 1: use a GI ID
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
    # Option 2: read a fasta file
    >>> fasta_string = open("m_cold.fasta").read()
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
    # Option 3: parse the file into a record and pass its sequence
    >>> from Bio import SeqIO
    >>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

59 Step 2: parse the results
    >>> from Bio.Blast import NCBIXML
    >>> blast_record = NCBIXML.read(result_handle)
The handle can be read only once! The blast_record object keeps the actual results.

60 Remarks
Biopython also supports reading BLAST results from HTML and text files, but these methods are not stable and sometimes fail because the servers change the format. XML is stable. You can save XML files either on the server or from result_handle objects (next slide).

61 Save results as XML
The handle can be read only once!
    >>> save_file = open("my_blast.xml", "w")
    >>> save_file.write(result_handle.read())
    >>> save_file.close()
    >>> result_handle.close()

62 BLAST records
A BLAST Record contains everything you might ever want to extract from the BLAST output. Example:
    >>> E_VALUE_THRESH = 0.04
    >>> for alignment in blast_record.alignments:
    ...     for hsp in alignment.hsps:
    ...         if hsp.expect < E_VALUE_THRESH:
    ...             print '****Alignment****'
    ...             print 'sequence:', alignment.title
    ...             print 'length:', alignment.length
    ...             print 'e value:', hsp.expect
    ...             print hsp.query[0:75] + '...'
    ...             print hsp.match[0:75] + '...'
    ...             print hsp.sbjct[0:75] + '...'

63 BLAST records 63

64 More functions
We cover only very basic functions here. To get more details use:
    >>> import Bio.Blast.Record
    >>> help(Bio.Blast.Record)
    Help on module Bio.Blast.Record in Bio.Blast:
    NAME
        Bio.Blast.Record - Record classes to hold BLAST output.
    FILE
        d:\python27\lib\site-packages\bio\blast\
    DESCRIPTION
        Classes:
        Blast              Holds all the information from a blast search.
        PSIBlast           Holds all the information from a psi-blast search.
        Header             Holds information from the header.
        Description        Holds information about one hit description.
        Alignment          Holds information about one alignment hit.
        HSP                Holds information about one HSP.
        MultipleAlignment  Holds information about a multiple alignment.
        DatabaseReport     Holds information from the database report.
        Parameters         Holds information from the parameters.

65 Accessing NCBI’s Entrez Databases

66 Bio.Entrez
A module for programmatic access to Entrez. Example: search PubMed or download GenBank records from within a Python script.
It makes use of the Entrez Programming Utilities. It makes sure that the correct URL is used for the queries, and that no more than one request is made every three seconds, as required by NCBI.
Note! If the NCBI finds you are abusing their systems, they can and will ban your access!

67 ESearch example
    >>> from Bio import Entrez
    >>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
    >>> record = Entrez.read(handle)
    # Each of the IDs is a GenBank identifier.
    >>> print (record["IdList"])
    ['126789333', '442591189', '442591187', '442591185', '442591183', '442591181', '442591179', '442591177', '442591175', '442591173', '442591171', '442591169', '442591167', '442591165', '442591163', '442591161', '442591159', '442591157', '442591155', '442591153']

68 Entrez.read(): explanation
Entrez.read() transforms the actual results (retrieved as XML) into a usable object of type Bio.Entrez.Parser.DictionaryElement.
    >>> record
    {u'Count': '158',
     u'RetMax': '20',
     u'IdList': ['126789333', '442591189', '442591187', '442591185', '442591183', '442591181', '442591179', '442591177', '442591175', '442591173', '442591171', '442591169', '442591167', '442591165', '442591163', '442591161', '442591159', '442591157', '442591155', '442591153'],
     u'TranslationStack': [{u'Count': '2482', u'Field': 'Organism', u'Term': '"Cypripedioideae"[Organism]', u'Explode': 'Y'}, {u'Count': '71514', u'Field': 'Gene', u'Term': 'matK[Gene]', u'Explode': 'N'}, 'AND'],
     u'TranslationSet': [{u'To': '"Cypripedioideae"[Organism]', u'From': 'Cypripedioideae[Orgn]'}],
     u'RetStart': '0',
     u'QueryTranslation': '"Cypripedioideae"[Organism] AND matK[Gene]'}

69 Database options 69 'pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest', 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap', 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists'

70 Download a full record
    >>> from Bio import Entrez
    # Always tell NCBI who you are
    >>> Entrez.email = "your.name@example.com"   # placeholder: use your own address
    # rettype: get a GenBank record
    >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
    >>> print handle.read()

72 Change ‘gb’ to ‘fasta’ 72

73 Read directly into a SeqRecord with SeqIO
    >>> from Bio import Entrez, SeqIO
    >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
    >>> record = SeqIO.read(handle, "genbank")
    >>> handle.close()
    >>> print record
    ID: EU490707.1
    Name: EU490707
    Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
    Number of features: 3
    ...
    Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())

74 Download directly from a URL
Suppose we know what the database URLs look like. Example: GEO (Gene Expression Omnibus):
    " c=GSE6609&format=file"

75 Use the urllib2 module
    >>> import urllib2
    >>> u = urllib2.urlopen(' nload/?acc=GSE6609&format=file')
    # The archive is binary, so open the local file in binary mode.
    >>> localFile = open('gse6609_raw.tar', 'wb')
    >>> for x in u:
    ...     localFile.write(x)
    >>> localFile.close()

76 More details
We covered only a few concepts. For more details on Biopython options, including dealing with specialized parsers, see ml#sec:parsing-blast and Chapter 9. Look at the urllib2 manual.

77 Sequence Motifs

78 Gene expression regulation Transcription is regulated mainly by transcription factors (TFs) - proteins that bind to DNA subsequences, called binding sites (BSs) TFBSs are located mainly in the gene’s promoter – the DNA sequence upstream the gene’s transcription start site (TSS) TFs can promote or repress transcription Other regulators: micro-RNAs (miRNAs)

79 Ab-initio motif discovery You are given a set of strings You want to find a motif that is significantly represented in the strings For example: TF\miRNA binding site 79

80 TFBS models
The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
A degenerate string, e.g. GGWATB (W = {A,T}, B = {C,G,T}).
A PWM (position weight matrix): one column per motif position (six in the slide's example) and one row per base.
[Figure: a 6-column PWM and example promoter sequences with hits scored against a cutoff of 0.009 (scores shown: 0.06, 0.06, 0.01); the full matrix did not survive extraction.]

81 Motif discovery: the typical two-step pipeline
Step 1: obtain a co-regulated gene set, e.g. by clustering gene expression microarrays (Cluster I, Cluster II, Cluster III), by location analysis (ChIP-chip, …), or from a functional group (e.g., GO term).
Step 2: run motif discovery on the promoter/3’UTR sequences of that gene set.

82 Motif discovery: Goals and challenges Goal: Reverse-engineer the transcriptional regulatory network Challenges: BSs are short and degenerate (non-specific) Promoters are long + complex (hard to model) Search space is huge (motif and sequence) Data is noisy What to look for? (enriched?, localized?, conserved?) Problem is still considered very difficult despite extensive research

83 Biopython motif objects
    from Bio import motifs
    from Bio.Seq import Seq
    instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"), Seq("AACCC"), Seq("AATGC"), Seq("AATGC")]
    m = motifs.create(instances)
    print m
Output:
    TACAA
    TACGC
    TACAC
    TACCC
    AACCC
    AATGC
    AATGC

84 Biopython motif objects
    >>> print m.counts
            0      1      2      3      4
    A:   3.00   7.00   0.00   2.00   1.00
    C:   0.00   0.00   5.00   2.00   6.00
    G:   0.00   0.00   0.00   3.00   0.00
    T:   4.00   0.00   2.00   0.00   0.00

85 Biopython motif objects
    >>> m.consensus
    Seq('TACGC', IUPACUnambiguousDNA())
    # The anticonsensus sequence, corresponding to the smallest values in the columns of the .counts matrix:
    >>> m.anticonsensus
    Seq('GGGTG', IUPACUnambiguousDNA())
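The counts matrix and the consensus can be reproduced by hand, which makes clear what the motif object stores. The following is a plain-Python sketch; count_matrix and consensus are hypothetical helpers, and ties are broken by alphabet order, which is an assumption made here rather than a statement about Biopython.

```python
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]


def count_matrix(seqs, alphabet="ACGT"):
    """For every base, count how often it appears at each motif position."""
    length = len(seqs[0])
    counts = {base: [0] * length for base in alphabet}
    for seq in seqs:
        for pos, base in enumerate(seq):
            counts[base][pos] += 1
    return counts


def consensus(counts):
    """Pick the most frequent base in every column."""
    length = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda base: counts[base][pos]) for pos in range(length)
    )


counts = count_matrix(instances)
for base in "ACGT":
    print(base, counts[base])
print(consensus(counts))
```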

86 Motif database

91 Read records
    from Bio import motifs
    arnt = motifs.read(open("Arnt.sites"), "sites")
    print arnt.counts
Output:
            0      1      2      3      4      5
    A:   4.00  19.00   0.00   0.00   0.00   0.00
    C:  16.00   0.00  20.00   0.00   0.00   0.00
    G:   0.00   1.00   0.00  20.00   0.00  20.00
    T:   0.00   0.00   0.00   0.00  20.00   0.00

92 MEME MEME is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs. 92

93 Assumptions The number of motifs is known Assume this number is 1 The size of the motif is known Biologically, we have estimates for the size for TFs and miRNA Missing information PWM of the motif PWM of the background Motif locations 93

95 Assumptions
Given a sequence X and a PWM Y of the same length, we can calculate P(X|Y), assuming independence of the motif positions.
Given a PWM, we can now calculate, for each position K in each sequence J, the probability that the motif starts at K in sequence J.
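Under the independence assumption, P(X|Y) is the product of one PWM entry per position, and the start-position probabilities follow by normalizing over all windows of a sequence. The sketch below illustrates both steps in plain Python. The toy PWM values, the uniform background and the function names are assumptions made for the example, and each sequence is assumed to contain exactly one motif occurrence.

```python
def window_probability(window, pwm):
    """P(window | pwm): product of per-position probabilities."""
    p = 1.0
    for pos, base in enumerate(window):
        p *= pwm[pos][base]
    return p


def start_posteriors(seq, pwm, background):
    """Probability that the single motif occurrence starts at each position."""
    width = len(pwm)
    ratios = []
    for k in range(len(seq) - width + 1):
        # Motif likelihood of the window at k divided by its background likelihood.
        # Background factors outside the window are identical for every k and cancel.
        ratio = 1.0
        for pos, base in enumerate(seq[k:k + width]):
            ratio *= pwm[pos][base] / background[base]
        ratios.append(ratio)
    total = sum(ratios)
    return [r / total for r in ratios]


# Hypothetical 3-column PWM that prefers the word ACG.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

posteriors = start_posteriors("TTACGTT", pwm, background)
print(window_probability("ACG", pwm))
print(posteriors)
```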

96 Expectation Maximization (EM) Algorithm
Start with an initial guess for the PWMs. The EM algorithm consists of two steps, which are repeated in turn.
Step 1, the expectation step: estimate the probability of finding the site at any position in each of the sequences. These probabilities are used to provide new information on the expected base or aa distribution for each column in the site.
Step 2, the maximization step: the new counts of bases or aa for each position in the site, found in step 1, are substituted for the previous set.

97 [Figure: aligned sequences of the form OOOOOOOOXXXXOOOOOOOO, where XXXX marks the motif columns and O the remaining positions.]
Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of aa in each motif column. Columns not in the motif provide background frequencies.
    Bases   Background   Site column 1   Site column 2   ...
    G       0.27         0.4             0.1             ...
    C       0.25         0.4             0.1             ...
    A       0.25         0.2             0.1             ...
    T       0.23         0.2             0.7             ...
    Total   1.00         ...

98 Expectation Maximization (EM) Algorithm
Use the previous estimates of aa or nucleotide frequencies for each column in the motif to calculate the probability of the motif in this position, and multiply by the background frequencies in the remaining positions.
The resulting score gives the likelihood that the motif matches position A, B or another position in seq 1. Repeat for all other positions and find the most likely locator. Then repeat for the remaining seqs.
[Figure: the motif window XXXX placed at position A and at position B of sequence 1, with the remaining positions O scored by the background.]

99 EM Algorithm, 2nd optimisation step: calculations
The site probabilities for each seq calculated at the 1st step are then used to create a new table of expected values for base counts for each of the site positions, using the site probabilities as weights.
Suppose that P(site 1 in seq 1) = P_{site1,seq1} / (P_{site1,seq1} + P_{site2,seq1} + … + P_{site78,seq1}) = 0.01 and P(site 2 in seq 1) = 0.02. Then these values are added to the previous table as shown in the table below. This procedure is repeated for every other possible first column in seq 1, and then the process continues for all other sequences, resulting in a new version of the table.
The expectation and maximization steps are repeated until the estimates of base frequencies do not change.
    Bases            Background   Site column 1   Site column 2   ...
    G                0.27 + …     0.4 + …         0.1 + …         ...
    C                0.25 + …     0.4 + …         0.1 + …         ...
    A                0.25 + …     0.2 + 0.01      0.1 + …         ...
    T                0.23 + …     0.2 + …         0.7 + 0.02      ...
    Total/weighted   1.00         ...
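The two EM steps can be put together in a short loop. The following is a simplified sketch and not MEME itself: it assumes exactly one motif occurrence per sequence, keeps the background fixed and uniform, rebuilds the table from small pseudocounts in every round instead of adding to the previous table, and runs a fixed number of iterations. The function name em_motif and all parameter names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def em_motif(seqs, width, iterations=50, pseudocount=0.1, seed=0):
    """Estimate a PWM of the given width from DNA strings with a basic EM loop."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}
    # Initial guess: a random PWM whose columns each sum to 1.
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.5 for _ in ALPHABET]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(ALPHABET, weights)})

    for _ in range(iterations):
        # Expected base counts per motif column, starting from pseudocounts.
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        for seq in seqs:
            # Expectation step: relative likelihood of every start position.
            ratios = []
            for k in range(len(seq) - width + 1):
                r = 1.0
                for pos in range(width):
                    base = seq[k + pos]
                    r *= pwm[pos][base] / background[base]
                ratios.append(r)
            total = sum(ratios)
            # Add each window's letters to the table, weighted by its probability.
            for k, r in enumerate(ratios):
                weight = r / total
                for pos in range(width):
                    counts[pos][seq[k + pos]] += weight
        # Maximization step: normalized expected counts become the new PWM.
        pwm = [
            {b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts
        ]
    return pwm


seqs = ["GTCATACGTGA", "CCATACGTTAG", "TATACGTCCGA", "AGGCATACGTT"]
pwm = em_motif(seqs, width=4)
print("".join(max(col, key=col.get) for col in pwm))
```

EM only finds a local optimum, so the printed consensus depends on the initial guess; in practice the loop is restarted from several starting points.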

100 Run MEME

101 Results 101

102 Parse results
    >>> from Bio import motifs
    >>> handle = open("meme.dna.oops.txt")
    >>> record = motifs.parse(handle, "meme")
    >>> handle.close()
    >>> len(record)
    2
    >>> motif = record[0]
    >>> print motif.consensus
    TTCACATGCCGC
    >>> print motif.degenerate_consensus
    TTCACATGSCNC

103 Motif attributes
    >>> motif.num_occurrences
    7
    >>> motif.length
    12
    >>> evalue = motif.evalue
    >>> print "%3.1g" % evalue
    0.2
    >>> motif.name
    'Motif 1'

104 Where the motif was found
    >>> motif = record['Motif 1']
    # Each motif has an attribute .instances with the sequence instances in which the motif was found, providing some information on each instance
    >>> len(motif.instances)
    7
    >>> motif.instances[0]
    Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
    >>> motif.instances[0].start
    620
    >>> motif.instances[0].strand
    '-'
    >>> motif.instances[0].length
    12
    >>> pvalue = motif.instances[0].pvalue
    >>> print "%5.3g" % pvalue
    1.85e-08

105 Amadeus 105 Advanced algorithms improve upon MEME This is an algorithm for motif finding Appears to be one of the top algorithms in many tests Java based tool Easy to use GUI Supports analysis of TFs and miRNAs Developed here in TAU

106 Amadeus A Motif Algorithm for Detecting Enrichment in mUltiple Species Supports diverse motif discovery tasks: 1.Finding over-represented motifs in one or more given sets of genes. 2.Identifying motifs with global spatial features given only the genomic sequences. 3.Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets. How? A general pipeline architecture for enumerating motifs. Different statistical scoring schemes of motifs for different motif discovery tasks.

107 Input: ~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. ’02] CHR NF-Y (CCAAT-box) Pairs analysis

108 Clustering analysis

109 Clustering - reminder Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other. Bio.Cluster module Kmeans SOM Hierarchical clustering PCA 109

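110 K-means clustering (MacQueen, 65)
Input: a set of observations (x_1, x_2, …, x_n). For example, each observation is a gene, and x is its vector of values.
Goal: partition the observations into k clusters S = {S_1, S_2, …, S_k}.
Objective function: \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2, where \mu_i is the mean of the observations in S_i.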

111 K-means clustering (MacQueen, 65)
Initialize an arbitrary partition P into k clusters C_1, …, C_k.
For cluster C_j and element i \notin C_j, E_P(i, C_j) = cost of the solution if i is moved to cluster C_j.
Pick a move E_P(r, C_s) and apply it if the new partition is better. Repeat until no improvement is possible.
Requires knowledge of k.

112 K-means variations
Compute a centroid c_p for each cluster C_p, e.g., gravity center = average vector.
Solution cost: \sum_{clusters p} \sum_{i in cluster p} d(v_i, c_p).
Parallel version: move each element to the cluster with the closest centroid simultaneously.
Sequential version: one at a time.
This is the “moving centers” approach. Objective = homogeneity only (k fixed).
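The parallel "moving centers" variant fits in a few lines of plain Python. This sketch uses squared Euclidean distance and, as a simplifying assumption, takes the first k points as the initial centroids; the function name kmeans and the toy data are made up for the example.

```python
def kmeans(points, k, iterations=100):
    """Parallel k-means on a list of equal-length numeric tuples."""
    centroids = [points[i] for i in range(k)]  # naive initialization
    assignment = None
    for _ in range(iterations):
        # Move every point to the cluster with the closest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Recompute each centroid as the average vector of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```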

115 Data representation The data to be clustered are represented by a n × m Numerical Python array data. Within the context of gene expression data clustering, typically the rows correspond to different genes whereas the columns correspond to different experimental conditions. The clustering algorithms in Bio.Cluster can be applied both to rows (genes) and to columns (experiments). 115

116 Distance\Similarity functions
    'e': Euclidean distance
    'c': Pearson correlation coefficient
    'a': Absolute value of the Pearson correlation coefficient
    'u': Cosine of the angle between two data vectors
    'x': Absolute uncentered Pearson correlation
    's': Spearman’s rank correlation

117 Calculating distance matrices
    >>> from Bio.Cluster import distancematrix
    >>> matrix = distancematrix(data)
data is required. Additional options:
transpose (default: 0): determines whether the distances are calculated between the rows of data (transpose==0) or between the columns of data (transpose==1).
dist (default: 'e', Euclidean distance).

118 Distancematrix
To save space, Biopython keeps only the lower\upper triangle of the matrix.
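Because a distance matrix is symmetric with a zero diagonal, storing one triangle is enough. The sketch below shows the idea with a ragged list of rows in plain Python; the layout and the helper names are assumptions made for illustration, not a description of the exact object Bio.Cluster returns.

```python
def lower_triangle_distances(rows):
    """Euclidean distances as a ragged lower triangle: result[i][j] for j < i."""
    result = []
    for i, a in enumerate(rows):
        result.append([
            sum((x - y) ** 2 for x, y in zip(a, rows[j])) ** 0.5
            for j in range(i)
        ])
    return result


def distance(tri, i, j):
    """Look up d(i, j) in the triangle, using symmetry and the zero diagonal."""
    if i == j:
        return 0.0
    return tri[i][j] if j < i else tri[j][i]


rows = [(0, 0), (3, 4), (6, 8)]
tri = lower_triangle_distances(rows)
print(tri)
print(distance(tri, 0, 2))
```

For n rows this stores n(n-1)/2 numbers instead of n squared.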

119 Partitioning algorithms Algorithms that receive the number of clusters K as an argument Kmeans Kmedians Often referred to as EM variations 119

120 Analysis example 120

121 Analysis example
    # Read the data
    import csv
    file = open('ge_data_example.txt', 'rb')
    data = csv.reader(file, delimiter='\t')
    table = [row for row in data]
    >>> len(table)
    100
    >>> table[1][1]
    '9.412'
    >>> table[0][0]
    'sample'
    >>> len(table[1])
    17

122 Analysis example
    # Transform the data to a numpy matrix, skipping the header row and the label column
    from numpy import *
    mat = matrix([row[1:] for row in table[1:]], dtype='float')
    print len(mat)
    # Create the distance matrix
    from Bio.Cluster import distancematrix
    dist_matrix = distancematrix(mat)
    # Cluster
    from Bio.Cluster import kcluster
    clusterid, error, nfound = kcluster(mat)

123 Analysis example
    # Cluster
    from Bio.Cluster import kcluster
    clusterid, error, nfound = kcluster(mat)
clusterid: array with the cluster assignments.
error: the within-cluster sum of distances.
nfound: the number of times the returned solution was found.

124 Analysis example 124 >>> clusterid array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) >>> error 15988.118370804612 >>> nfound 1

125 Kcluster: other options
nclusters (default: 2): the number of clusters k.
transpose (default: 0): determines whether rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1): the number of times the k-means/-medians clustering algorithm is performed.
method (default: 'a'): describes how the center of a cluster is found. method=='a': arithmetic mean (k-means clustering); method=='m': median (k-medians clustering).
dist (default: 'e', Euclidean distance).
initialid (default: None): specifies the initial clustering to be used for the algorithm.

126 Hierarchical clustering
    from Bio.Cluster import treecluster
    tree1 = treecluster(mat)
    # Can be applied to a precalculated distance matrix
    tree2 = treecluster(distancematrix=dist_matrix)
    # Get the cluster assignments
    clusterid = tree1.cut(3)

127 Hierarchical clustering using SciPy
Better visualizations!
    import numpy
    # Create a distance matrix (sum of absolute differences between rows)
    x = numpy.asarray(mat)
    D = numpy.zeros([len(x), len(x)])
    for i in range(len(x)):
        for j in range(len(x)):
            D[i,j] = sum(abs(x[i] - x[j]))

128 Hierarchical clustering using SciPy
    import pylab
    import scipy.cluster.hierarchy as sch
    # Compute and plot first dendrogram.
    fig = pylab.figure(figsize=(8,8))
    # Add an axes at position rect [left, bottom, width, height], where all quantities are in fractions of figure width and height.
    ax1 = fig.add_axes([0.09,0.1,0.2,0.6])
    # Clustering analysis
    Y = sch.linkage(D, method='centroid')
    Z1 = sch.dendrogram(Y, orientation='right')
    ax1.set_xticks([])
    ax1.set_yticks([])

129 Hierarchical clustering using SciPy
    # Plot distance matrix, with rows reordered to match the dendrogram leaves.
    axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
    idx1 = Z1['leaves']
    D = D[idx1,:]
    im = axmatrix.matshow(D, aspect='auto', origin='lower')
    axmatrix.set_xticks([])
    axmatrix.set_yticks([])

130 Hierarchical clustering using SciPy
    # Plot colorbar.
    axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
    pylab.colorbar(im, cax=axcolor)

131 Phylogenetic trees

132 Remember the Newick format?
Simple example without branch lengths:
    (((A,B),(C,D)),(E,F,G))
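The nesting of parentheses maps directly onto nested lists, which a small recursive parser can build. This is a minimal sketch for the simple case shown here (leaf names only, no branch lengths or internal labels, well-formed input); parse_newick and count_leaves are hypothetical helpers and no substitute for Bio.Phylo.

```python
def parse_newick(text):
    """Parse a Newick string with leaf names only into nested lists."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
            return children
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def count_leaves(node):
    """Number of named leaves below a node."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child in node)


tree = parse_newick("(((A,B),(C,D)),(E,F,G))")
print(tree)
print(count_leaves(tree))
```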

133 Visualizing trees
    >>> from Bio import Phylo
    >>> tree = Phylo.read("simple.dnd", "newick")
    >>> print tree
    Tree(weight=1.0, rooted=False)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0)
                    Clade(branch_length=1.0, name='A')
                    Clade(branch_length=1.0, name='B')
                Clade(branch_length=1.0)
                    Clade(branch_length=1.0, name='C')
                    Clade(branch_length=1.0, name='D')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='E')
                Clade(branch_length=1.0, name='F')
                Clade(branch_length=1.0, name='G')

134 Visualizing trees 134

135 Use matplotlib
    >>> import matplotlib
    >>> tree.rooted = True
    >>> Phylo.draw(tree)

136 Phylo IO
Phylo.read() reads a file with exactly one tree. If you have many trees, loop over the object returned by Phylo.parse(). Write to a file using Phylo.write(treeObj, file, format). Popular formats: "newick", "phyloxml". Convert tree formats using Phylo.convert:
    Phylo.convert("tree1.xml", "phyloxml", "tree1.dnd", "newick")
