Published by Joy McGee. Modified about 1 year ago.

1
Bioinformatics: Motif Detection (revised 27/10/06)

2
Overview
Introduction
Multiple Alignments
Multiple alignment based on HMM
Motif Finding
–Motif representation
–Algorithm
–Search Space
–Word counting methods
–Probabilistic methods
Profile Searches
–Introduction
Exercises
http://www.esat.kuleuven.ac.be/~kmarchal/

3
Introduction
Global multiple alignment (ClustalW)
–Proteins, nucleotides
–Long stretches of conservation essential
–Identification of protein family profiles
–Score gaps
Local multiple alignments (motif detection)
–Proteins, nucleotides
–Short stretches of conservation (12 NT, 6 AA)
–Identification of regulatory motifs (DNA, protein)
–No explicit gap scoring
–Explicit use of a profile

5
HMM

7
Transcriptional regulation
[Figure labels: cell signal, sigma, motif, Gene 1, Gene 2, Gene 3, Gene 4, chromosome, gene, transcription, mRNA, translation, protein]

9
Motif Representation
Consensus sequence:
–Reductionist representation of a motif
–Most frequent instance is used as a representative
–Loss of information
Regular expression:
–More complex representation allowing motif degeneracy
Position-specific scoring matrix (PSSM):
–Probabilistic representation

10
Consensus: CTTAATATTAACTTAAT
Regular expression: CTTAAKRTTMAYTTAAT
PSSM (motif logo): [figure]
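The three representations can be derived from the same set of aligned motif instances. The sketch below is illustrative only: the function names and the pseudocount of 1 are assumptions, not part of the slides, and the degenerate string uses the standard IUPAC nucleotide codes (K = G/T, R = A/G, M = A/C, Y = C/T, as in the example above).

```python
from collections import Counter

# IUPAC degenerate nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CG"): "S", frozenset("AT"): "W", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def column_counts(sites):
    """Count the bases observed at each aligned position."""
    return [Counter(col) for col in zip(*sites)]


def consensus(sites):
    """Most frequent base per column (ties broken alphabetically)."""
    return "".join(min(c, key=lambda b: (-c[b], b)) for c in column_counts(sites))


def degenerate(sites):
    """One IUPAC symbol per column, covering every base seen in that column."""
    return "".join(IUPAC[frozenset(c)] for c in column_counts(sites))


def pssm(sites, pseudocount=1.0):
    """Per-column base probabilities; the pseudocount (an assumption) avoids zeros."""
    n = len(sites)
    return [
        {b: (c[b] + pseudocount) / (n + 4 * pseudocount) for b in "ACGT"}
        for c in column_counts(sites)
    ]
```

The consensus keeps one base per position, the degenerate string keeps the set of bases, and only the matrix keeps the frequencies, which is the loss of information the slides refer to.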

11
Motif Representation

12
Overview Algorithms
Search for motifs that are present more frequently in a set of related (e.g. coregulated) sequences than in a set of unrelated sequences
Methods based on word counting (regular expression)
–NP-hard problem in general: heuristic methods and clever algorithms are needed
–Motif of width w=8: 4^8 = 65,536 possible words
–Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
Probabilistic methods (weight matrix)
–Multiple alignment by locally aligning small conserved regions in a set of unaligned sequences
–Motif model represented by a probability matrix
–EM, Gibbs sampler (optimization algorithms)
–AlignACE: http://atlas.med.harvard.edu/
–BioProspector: http://bioprospector.stanford.edu/
–Motif Sampler: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html

13
Search space
When are motifs statistically overrepresented?
Set of coexpressed (coregulated) sequences
–Literature searches
–Microarrays, expression profiling
Set of orthologous sequences (phylogenetic footprinting)
–Comparative genomics
–Orthologous sequences share an ancestral origin => similar mechanism of transcriptional regulation

14
Search space
[Figure labels: coexpression, cDNA arrays, preprocessing of the data, clustering, EMBL, BLAST, upstream regions, Gibbs sampling, motif finding]

15
Search space: Phylogenetic footprinting
PhoPQ is a ubiquitous system: Salmonella, Escherichia, Yersinia, Vibrio, Pseudomonas, Providencia, Pectobacterium
PhoPQ is autoregulated

16
Search space

17
Overview Algorithms
Methods based on word counting
–NP-hard problem in general: heuristic methods and clever algorithms are needed
–Motif of width w=8: 4^8 = 65,536 possible words
–Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
Probabilistic methods
–Optimisation problems, self learning, AI
–Motif model represented by a probability matrix
–Bayesian, Gibbs sampler
–AlignACE: http://atlas.med.harvard.edu/
–BioProspector: http://bioprospector.stanford.edu/
–Motif Sampler: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html

18
Word Counting
Monad frequencies: single word counts (RSA tools; J. van Helden et al., 1998, J. Mol. Biol.)
–Enumerate all oligonucleotides of the selected size
–Count the number of occurrences of each oligonucleotide in a set of coregulated genes
–Compare the number of occurrences with the expected value in the background
http://bio.cigb.edu.cu/jvanheld/rsa-tools/RSA_home.shtml
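The counting steps can be sketched as follows. This is a minimal illustration, not the RSA tools implementation: the function names are hypothetical, overlapping occurrences are counted, only one strand is scanned, and the expected count is assumed to be the word's frequency in a non-empty background set multiplied by the number of word positions in the target set.

```python
from collections import Counter


def count_words(sequences, k):
    """Count every (overlapping) k-mer in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def over_representation(target, background, k):
    """Return (word, observed, expected) rows, most over-represented first.

    Assumes the background set contains at least one k-mer.
    """
    observed = count_words(target, k)
    bg = count_words(background, k)
    bg_total = sum(bg.values())
    # Number of k-mer start positions available in the target set.
    positions = sum(max(len(s) - k + 1, 0) for s in target)
    rows = []
    for word, n_obs in observed.items():
        expected = bg.get(word, 0) / bg_total * positions
        rows.append((word, n_obs, expected))
    return sorted(rows, key=lambda r: r[1] - r[2], reverse=True)
```

Ranking by the raw difference between observed and expected counts is a simplification; a significance measure is normally used instead.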

19
Word Counting
Relevance of the motifs detected
–p-value and Sig score (string-based methods)
–Expected number of occurrences in background
–Statistical significance
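One way to turn an observed count into a p-value and a Sig score is sketched below. The definitions are assumptions, since the slide gives none: the p-value is taken as a binomial upper tail, and the Sig score as -log10 of the E-value (p-value multiplied by the number of words tested). The function names are hypothetical.

```python
import math


def binomial_tail(n_obs, n_positions, p_word):
    """P(X >= n_obs) for X ~ Binomial(n_positions, p_word)."""
    return sum(
        math.comb(n_positions, k) * p_word ** k * (1 - p_word) ** (n_positions - k)
        for k in range(n_obs, n_positions + 1)
    )


def sig_score(p_value, n_words_tested):
    """-log10 of the E-value; requires p_value > 0."""
    e_value = p_value * n_words_tested
    return -math.log10(e_value)
```

Under this definition a Sig score above 0 means that fewer than one word with that p-value is expected by chance among all words tested.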

21
Probabilistic Algorithms
Find common motifs, representing regulatory elements, in the regions upstream of the translation start of a set of co-expressed genes
Motifs are hidden in background sequence

22
Probabilistic Algorithms
Motif representation: probability matrix (PSSM)
Background model
–Single nucleotide frequencies
–Described by an m-th order Markov process, which can be represented by a transition matrix

23
Probabilistic Algorithms
Step 1: Initialization of alignment vector A (predictive update)
[Figure: sequences i = 1..n with site positions j; sequence letters shown: G A A T T C A T G T C A C T T C A T T G]
Step 2: Calculate motif model for all sequences except one

24
Probabilistic Algorithms
Step 3 (expectation): Select the remaining sequence
For each window (site), calculate the probability that the sequence in the window is generated by the motif model versus the probability that it is generated by the background model
Example sequence: GAATTATCGTGAATGCGTGGT
P(S|M) = 0.0098 x 0.0097 x 0.495 x 0.0098 x 0.245
P(S|B) =
Assign a weight based on this score to this site
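The window scoring of step 3 can be sketched as below. The assumptions are labeled here and in the comments: the background is a zero-order model (single nucleotide frequencies), the score is the ratio P(S|M) / P(S|B), and the weights are the ratios normalized to sum to 1. The function name is hypothetical.

```python
def window_weights(seq, pssm, background):
    """Weight every window of seq by P(window | motif) / P(window | background).

    pssm: list of dicts, one per motif position, mapping base -> probability.
    background: dict mapping base -> probability (zero-order model, an assumption).
    """
    w = len(pssm)
    ratios = []
    for i in range(len(seq) - w + 1):
        p_motif = 1.0
        p_background = 1.0
        for j, base in enumerate(seq[i:i + w]):
            p_motif *= pssm[j][base]
            p_background *= background[base]
        ratios.append(p_motif / p_background)
    total = sum(ratios)
    # Normalizing turns the scores into a probability distribution over positions.
    return [r / total for r in ratios]
```

For long motifs the products become very small, so a practical version would work with sums of logarithms instead.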

25
Step 4 (Maximization):
–Re-estimate new positions based on the weights calculated in step 3
Go to step 1
Re-iterate until stable motifs are found
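Steps 1 to 4 can be combined into a small leave-one-out Gibbs sampler. This is a toy sketch, not the Motif Sampler, AlignACE or BioProspector implementation. It assumes exactly one site per sequence, a zero-order background estimated from the input itself, a pseudocount of 1, and a fixed number of iterations instead of a convergence test. All names are hypothetical.

```python
import random

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0):
    """Column-wise base probabilities estimated from the current sites."""
    pssm = []
    for j in range(len(sites[0])):
        col = [s[j] for s in sites]
        pssm.append({
            b: (col.count(b) + pseudocount) / (len(sites) + 4 * pseudocount)
            for b in BASES
        })
    return pssm


def site_ratio(site, pssm, background):
    """P(site | motif) / P(site | background) for one window."""
    ratio = 1.0
    for j, base in enumerate(site):
        ratio *= pssm[j][base] / background[base]
    return ratio


def gibbs_motif_search(seqs, w, n_iter=200, seed=0):
    """Return one start position per sequence; needs >= 2 sequences of length >= w."""
    rng = random.Random(seed)
    joined = "".join(seqs)
    background = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in BASES}
    # Step 1: random initialization of the alignment vector.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Step 2: motif model from all sequences except sequence i.
            sites = [s[p:p + w] for k, (s, p) in enumerate(zip(seqs, pos)) if k != i]
            pssm = build_pssm(sites)
            # Step 3: weight every window of the left-out sequence.
            weights = [
                site_ratio(seq[x:x + w], pssm, background)
                for x in range(len(seq) - w + 1)
            ]
            # Step 4: sample a new position in proportion to the weights.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos
```

Because positions are sampled rather than maximized, different seeds can give different alignments; running several seeds and comparing the results is one way to judge whether a motif is stable.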

26
Probabilistic Algorithms
Local optima
EM
–Update alignment vector: select positions with the highest score
–Deterministic output, but the local optimum found is not necessarily the global optimum
Gibbs sampling
–Select positions according to the probability distribution
–Stochastic output, i.e. the result differs each time the algorithm runs
–Allows detection of stable motifs
–Statistical analysis describes the quality of the motif detected
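The difference between the two update rules comes down to how a position is chosen from the window weights. The sketch below follows the description on this slide (highest score versus sampling); the function names are hypothetical.

```python
import random


def pick_em(weights):
    """EM-style update as described above: always take the best-scoring position."""
    return max(range(len(weights)), key=weights.__getitem__)


def pick_gibbs(weights, rng):
    """Gibbs-style update: draw a position with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights)[0]
```

With weights such as [0.1, 0.7, 0.2], `pick_em` always returns position 1, while `pick_gibbs` returns position 1 in roughly 70% of the draws, which lets the sampler leave a local optimum.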

27
Probabilistic Algorithms
Influence of the background model, e.g. for a second-order Markov model:
p(ATCGT|Bm) = p(AT) p(C|AT) p(G|TC) p(T|CG)
–Compensates for motifs that occur frequently because of the general background composition
–Makes the outcome of the algorithm more robust
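The factorization above can be sketched for an arbitrary order m. This is an illustration under stated assumptions: counts are taken from a set of background sequences, a pseudocount of 1 smooths unseen words, and the probability of the first m bases is estimated from m-mer frequencies. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_background(seqs, order):
    """Collect m-mer counts and (context -> next base) counts from background sequences."""
    prefix = Counter()
    transitions = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - order + 1):
            prefix[s[i:i + order]] += 1
        for i in range(order, len(s)):
            transitions[s[i - order:i]][s[i]] += 1
    return prefix, transitions


def log_prob(seq, model, order, pseudocount=1.0):
    """log p(seq | Bm): first m bases, then one conditional term per remaining base.

    Requires len(seq) >= order.
    """
    prefix, transitions = model
    n_prefix = sum(prefix.values())
    logp = math.log(
        (prefix[seq[:order]] + pseudocount) / (n_prefix + pseudocount * 4 ** order)
    )
    for i in range(order, len(seq)):
        context = transitions[seq[i - order:i]]
        logp += math.log(
            (context[seq[i]] + pseudocount) / (sum(context.values()) + 4 * pseudocount)
        )
    return logp
```

With `order=2`, `log_prob("ATCGT", model, 2)` adds the logarithms of exactly the four factors p(AT), p(C|AT), p(G|TC) and p(T|CG).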

28
Probabilistic Algorithms
[Figure panels: two organisms with similar background model; two organisms with different background model]

29
Probabilistic Algorithms
Motif scores for probabilistic motif finding algorithms
–Information content (Consensus score)
–Log likelihood
–Relative entropy (Information content)
–Entropy
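The formulas for these scores did not survive in the transcript. The sketch below uses common textbook definitions, which may differ in detail from the ones on the original slide: entropy in bits per column, information content as 2 minus the entropy (which assumes a uniform background), relative entropy against an explicit background, and a log-likelihood ratio summed over all sites. The function names are hypothetical.

```python
import math


def column_entropy(col):
    """Shannon entropy (bits) of one PSSM column given as base -> probability."""
    return -sum(p * math.log2(p) for p in col.values() if p > 0)


def information_content(pssm):
    """Sum over columns of 2 - entropy; assumes a uniform background."""
    return sum(2.0 - column_entropy(col) for col in pssm)


def relative_entropy(pssm, background):
    """Sum over columns of sum_b p(b) * log2(p(b) / background(b))."""
    return sum(
        p * math.log2(p / background[b])
        for col in pssm
        for b, p in col.items()
        if p > 0
    )


def log_likelihood_ratio(sites, pssm, background):
    """Sum over sites of log(P(site | motif) / P(site | background)).

    Requires strictly positive PSSM entries (e.g. built with pseudocounts).
    """
    return sum(
        math.log(pssm[j][b] / background[b])
        for site in sites
        for j, b in enumerate(site)
    )
```

With a uniform background the information content and the relative entropy coincide. Only the log-likelihood ratio grows with the number of sites.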

30
Probabilistic Algorithms
Result: bacterial O2-responsive element FNR
–Takes into account the background model
–Only takes into account the degree of conservation
–Tradeoff between the degree of conservation and the number of occurrences

32
Profile Search

33
Profile Search
[Figure labels: EXPERIMENTAL (high throughput measurements, literature); GENOMICS (genomic sequence data); 1. Microarray Datamining (preprocessing, clustering); 2. Sequence Datamining (motif detection); 3. Comparative Genomics (genomewide screening, phylogenetic footprinting); clusters of coexpressed genes; summarized information; target identification; novel targets; novel conditions]
