
1  Finding Regulatory Motifs

2  Copyright notice
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
Many slides in this presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!

3  Regulation of Transcription
[Diagram: transcription factors (TFs) bound to their binding sites (BSs), the transcription machinery, and the gene start.]

4  Transcription Factors
Proteins involved in the regulation of gene expression; they bind to the promoter region upstream of the transcription initiation site.
Composed of two essential functional regions: a DNA-binding domain and an activation domain.

5  Transcription factors
[Diagram: sequence-specific DNA-binding TFs (TF1, TF2) and non-DNA-binding factors (TF3, TF4, adapters, co-activators, HAT) assembled on the DNA in three layers (Layer I, II, III).]

6  BS Models
(a) Exact string(s)
Example: BS = TACACC, TACGGC
CAATGCAGGATACACCGATCGGTA
GGAGTACGGCAAGTCCCCATGTGA
AGGCTGGACCAGACTCTACACCTA

7  BS Models (II)
(b) String with mismatches
Example: BS = TACACC + 1 mismatch
CAATGCAGGATTCACCGATCGGTA
GGAGTACAGCAAGTCCCCATGTGA
AGGCTGGACCAGACTCTACACCTA

8  BS Models (III)
(c) Degenerate string
Example: BS = TASDAC (S = {C,G}, D = {A,G,T})
CAATGCAGGATACAACGATCGGTA
GGAGTAGTACAAGTCCCCATGTGA
AGGCTGGACCAGACTCTACGACTA
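A minimal sketch (not from the slides) of how a degenerate-string model can be scanned against sequences with Python's standard re module; the degeneracy map for S and D follows the slide, and the function name degenerate_to_regex is just illustrative.

```python
import re

# Degeneracy codes used on the slide (a subset of the IUPAC codes).
DEGENERATE = {"S": "[CG]", "D": "[AGT]", "A": "A", "C": "C", "G": "G", "T": "T"}

def degenerate_to_regex(pattern):
    """Translate a degenerate string such as TASDAC into a regular expression."""
    return "".join(DEGENERATE[ch] for ch in pattern)

sequences = [
    "CAATGCAGGATACAACGATCGGTA",
    "GGAGTAGTACAAGTCCCCATGTGA",
    "AGGCTGGACCAGACTCTACGACTA",
]

motif_re = re.compile(degenerate_to_regex("TASDAC"))
for seq in sequences:
    for match in motif_re.finditer(seq):
        print(match.start(), match.group())   # one hit per example sequence
```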

9  BS Models (IV)
(d) Position Weight Matrix (PWM)
Example: BS = [4 x 6 matrix of per-position base probabilities for A, C, G, T]; each sequence is scored by the probability of its best-matching window:
ATGCAGGATACACCGATCGGTA  0.0605
GGAGTAGAGCAAGTCCCGTGA   0.0605
AAGACTCTACAATTATGGCGT   0.0151

10  Position Weight Matrix (PWM)
Frequency matrix (counts per position 1-5):
     1   2   3   4   5
a   12   1   0   1   0
c    1   1  10   9   0
g    2   5   5   2  14
t    0   7   0   2   1
Weight matrix (log-odds scores per position 1-5):
      1     2     3     4     5
a   5.1  -5.7  -8.7  -5.7  -8.7
c  -5.7  -5.7   4.2   3.8  -8.7
g  -2.7   1.2   1.2  -2.7   5.7
t  -8.7   2.7  -8.7  -2.7  -5.7
NOTE: use pseudo-counts for zero frequencies; weights are taken relative to background frequencies.
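A small sketch of how a frequency matrix is turned into a log-odds weight matrix. Assumptions not stated on the slide: log2 scoring against a uniform 0.25 background and a pseudo-count of 0.5, so the resulting numbers will not exactly match the weight matrix shown above.

```python
import math

# Frequency matrix from the slide: counts per base at positions 1-5.
counts = {
    "a": [12, 1, 0, 1, 0],
    "c": [1, 1, 10, 9, 0],
    "g": [2, 5, 5, 2, 14],
    "t": [0, 7, 0, 2, 1],
}

PSEUDO = 0.5          # assumed pseudo-count for zero frequencies
BACKGROUND = 0.25     # assumed uniform background frequency

def weight_matrix(counts, pseudo=PSEUDO, background=BACKGROUND):
    """Convert per-position counts into log2-odds weights."""
    width = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] + pseudo for b in counts) for i in range(width)]
    return {
        base: [math.log2((col[i] + pseudo) / totals[i] / background) for i in range(width)]
        for base, col in counts.items()
    }

for base, row in weight_matrix(counts).items():
    print(base, [round(w, 2) for w in row])
```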

11  Predicting Motif Occurrences: Sequence Scoring
Slide the weight matrix along the sequence and sum the per-position scores of each window.
      1     2     3     4     5
a   5.1  -5.7  -8.7  -5.7  -8.7
c  -5.7  -5.7   4.2   3.8  -8.7
g  -2.7   1.2   1.2  -2.7   5.7
t  -8.7   2.7  -8.7  -2.7  -5.7
Sequence: a g c g g t a
Window "agcgg": sum = 13.5
Window "gcggt": sum = -15.6
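A sketch of this sliding-window scoring, hard-coding the weight matrix from slide 10 as reconstructed above; the function name score_windows is illustrative. It reproduces the two sums on the slide (13.5 for "agcgg" and -15.6 for "gcggt").

```python
# Weight matrix from slide 10 (rows: base, columns: motif positions 1-5).
WEIGHTS = {
    "a": [5.1, -5.7, -8.7, -5.7, -8.7],
    "c": [-5.7, -5.7, 4.2, 3.8, -8.7],
    "g": [-2.7, 1.2, 1.2, -2.7, 5.7],
    "t": [-8.7, 2.7, -8.7, -2.7, -5.7],
}
WIDTH = 5

def score_windows(sequence):
    """Slide the weight matrix along the sequence and sum position scores."""
    for start in range(len(sequence) - WIDTH + 1):
        window = sequence[start:start + WIDTH]
        score = sum(WEIGHTS[base][i] for i, base in enumerate(window))
        yield window, round(score, 1)

for window, score in score_windows("agcggta"):
    print(window, score)   # agcgg 13.5, gcggt -15.6, cggta -14.7
```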

12  BS Models (V)
(e) More complex models
– PWM with spacers (e.g., for p53)
– Markov model (dependency between adjacent columns of the PWM)
– Hybrid models, e.g., a mixture of two PWMs
– …
And we also need to model the non-BS sequences in the promoters…

13  Motif Representations
Aligned binding sites:
CGGCGCACTCTCGCCCG
CGGGGCAGACTATTCCG
CGGCGGCTTCTAATCCG
...
CGGGGCAGACTATTCCG
1. Consensus: CGGNGCACANTCNTCCG
2. Frequency matrix
3. Logo

14  Logos
Graphical representation of nucleotide base (or amino acid) conservation in a motif (or alignment), based on information theory.
The height of each letter reflects the relative frequency of that base at that position.
http://weblogo.berkeley.edu/

15  Regulatory Motif Discovery
Find motifs (common subsequences) within the upstream DNA of a group of co-regulated genes.

16  How to find novel motifs
Degenerate string:
– YMF (Sinha & Tompa ’02)
String with mismatches:
– WINNOWER (Pevzner & Sze ’00)
– Random Projections (Buhler & Tompa ’02)
– MULTIPROFILER (Keich & Pevzner ’02)
PWM:
– MEME (Bailey & Elkan ’95)
– AlignACE (Hughes et al. ’98)
– CONSENSUS (Hertz & Stormo ’99)

17  How to find TF modules
BioProspector (Liu et al. ’01)
Co-Bind (GuhaThakurta & Stormo ’01)
MITRA (Eskin & Pevzner ’02)
CREME (Sharan et al. ’03)
MCAST (Bailey & Noble ’03)

18  Characteristics of Regulatory Motifs
Tiny
Highly variable
~Constant size
– because a constant-size transcription factor binds
Often repeated
Low-complexity

19  Problem Definition
Given a collection of promoter sequences s_1, …, s_N of genes with common expression:
Probabilistic motif: M_ij, 1 ≤ i ≤ W, 1 ≤ j ≤ 4, where M_ij = Prob[letter j at position i]. Find the best M and positions p_1, …, p_N in the sequences.
Combinatorial motif: M = m_1…m_W, with some of the m_i blank. Find M that occurs in all s_i with ≤ k differences.

20  Discrete Approaches to Motif Finding

21  Discrete Formulations
Given sequences S = {x_1, …, x_n}
A motif W is a consensus string w_1…w_K
Find the motif W* with the “best” match to x_1, …, x_n
Definition of “best”:
d(W, x_i) = minimum Hamming distance between W and any K-long word in x_i
d(W, S) = Σ_i d(W, x_i)
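A short sketch of the two distance definitions above; the function names are illustrative.

```python
def hamming(u, v):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

S = ["CAATGCAGGATACACCGATCGGTA",
     "GGAGTACGGCAAGTCCCCATGTGA",
     "AGGCTGGACCAGACTCTACACCTA"]
print(d_total("TACACC", S))   # small, since near-occurrences exist in every sequence
```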

22  Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
  Find d(W, S)
Report W* = argmin_W d(W, S)
Running time: O(K N 4^K), where N = Σ_i |x_i|
Advantage: finds the provably “best” motif W
Disadvantage: time
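A brute-force sketch of the pattern-driven search, only practical for small K; it enumerates all 4^K consensus strings and keeps the one minimizing d(W, S). The helper names are illustrative and the block is self-contained.

```python
from itertools import product

def best_pattern(S, K):
    """Pattern-driven exhaustive search: try every K-long string over {A,C,G,T}."""
    def d_total(W):
        # d(W, S) = sum over sequences of the best (minimum-mismatch) window.
        return sum(
            min(sum(a != b for a, b in zip(W, x[i:i + K]))
                for i in range(len(x) - K + 1))
            for x in S)

    candidates = ("".join(p) for p in product("ACGT", repeat=K))
    return min(candidates, key=d_total)

S = ["CAATGCAGG", "GGAGTACGG", "AGGCTGGAC"]
print(best_pattern(S, 3))   # AGG: exact in two of the toy sequences, one mismatch in the third
```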

23  Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some x_i
  Find d(W, S)
Report W* = argmin_W d(W, S), or report a local improvement of W*
Running time: O(K N^2)
Advantage: time
Disadvantage: if the true motif is weak and never occurs exactly in the data, a random motif may score better than any instance of the true motif
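The same idea restricted to words that actually occur in the data, plus the local-improvement step mentioned above; a sketch with illustrative names, using single-letter substitutions as the (assumed) local moves.

```python
def sample_driven(S, K):
    """Sample-driven search over words occurring in S, with greedy local improvement."""
    def d_total(W):
        return sum(
            min(sum(a != b for a, b in zip(W, x[i:i + K]))
                for i in range(len(x) - K + 1))
            for x in S)

    # Candidates: every K-long word that actually occurs in some sequence.
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    best = min(candidates, key=d_total)

    # Local improvement: greedily try single-letter substitutions of the best word.
    improved = True
    while improved:
        improved = False
        for i in range(K):
            for letter in "ACGT":
                W = best[:i] + letter + best[i + 1:]
                if d_total(W) < d_total(best):
                    best, improved = W, True
    return best

S = ["CAATGCAGG", "GGAGTACGG", "AGGCTGGAC"]
print(sample_driven(S, 3))
```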

24  MULTIPROFILER
Extended sample-driven approach
Given a K-long word W, define:
N_α(W) = { words W’ in S s.t. d(W, W’) ≤ α }
Idea: assume W is an occurrence of the true motif W*; use N_α(W) to correct the “errors” in W

25  MULTIPROFILER
Assume W differs from the true motif W* in at most L positions
Define: a wordlet G of W is an L-long pattern with blanks, differing from W at its non-blank positions
– L is smaller than the word length K
Example: K = 7, L = 3
W = ACGTTGA
G = --A--CG

26  MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to L_max:
    1. Find the α-neighbors of W in S → N_α(W)
    2. Find all “strong” L-long wordlets G in N_α(W)
    3. For each wordlet G:
       – modify W by the wordlet G → W’
       – compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above is a smaller motif-finding problem; use exhaustive search

27  CONSENSUS
Algorithm, cycle 1:
For each word W in S (of fixed length!)
  For each word W’ in S
    Create a gap-free alignment of W, W’
Keep the C_1 best alignments A_1, …, A_C1
Example alignments: ACGGTTG, CGAACTT, GGGCTCT … / ACGCCTG, AGAACTA, GGGGTGT …

28  CONSENSUS
Algorithm, cycle t:
For each word W in S
  For each alignment A_j from cycle t-1
    Create a gap-free alignment of W, A_j
Keep the C_t best alignments A_1, …, A_Ct
Example alignments: ACGGTTG, CGAACTT, GGGCTCT … / ACGCCTG, AGAACTA, GGGGTGT … / ACGGCTC, AGATCTT, GGCGTCT …
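A simplified sketch of this greedy idea. Assumptions not taken from the slides: sequences are processed in their given order, cycle 1 pairs words from the first two sequences only, alignments are scored with a simple information-content score against a uniform background, and a single constant C is kept per cycle; the real CONSENSUS program differs in these details.

```python
import math

def info_content(words):
    """Information-content score of a gap-free alignment (uniform 0.25 background)."""
    width, n = len(words[0]), len(words)
    score = 0.0
    for i in range(width):
        for base in "ACGT":
            f = sum(w[i] == base for w in words) / n
            if f > 0:
                score += f * math.log2(f / 0.25)
    return score

def kmers(x, K):
    return [x[i:i + K] for i in range(len(x) - K + 1)]

def consensus(S, K, C=10):
    """Greedy CONSENSUS-like search: extend partial alignments one sequence at a time."""
    # Cycle 1: pair every word of the first sequence with every word of the second,
    # keep the C highest-scoring gap-free alignments.
    alignments = [[w1, w2] for w1 in kmers(S[0], K) for w2 in kmers(S[1], K)]
    alignments = sorted(alignments, key=info_content, reverse=True)[:C]
    # Cycle t: extend each kept alignment with one word from the next sequence.
    for x in S[2:]:
        extended = [a + [w] for a in alignments for w in kmers(x, K)]
        alignments = sorted(extended, key=info_content, reverse=True)[:C]
    return alignments[0]

S = ["CAATGCAGGATACACC", "GGAGTACGGCAAGTCC", "AGGCTGGACTACACCT"]
print(consensus(S, 6))
```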

29  CONSENSUS
C_1, …, C_n are user-defined heuristic constants
– N is the sum of sequence lengths
– n is the number of sequences
Running time: O(N^2) + O(N C_1) + O(N C_2) + … + O(N C_n) = O(N^2 + N C_total),
where C_total = Σ_i C_i, typically O(nC) for some big constant C

30  Expectation Maximization in Motif Finding

31  Expectation Maximization
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest motif model, background model, and classification of words into motif or background

32  Expectation Maximization
Given sequences x_1, …, x_N, find all K-long words X_1, …, X_n
Define the motif model: M = (M_1, …, M_K), with M_i = (M_i1, …, M_i4) (assume the alphabet {A, C, G, T}),
where M_ij = Prob[letter j occurs in motif position i]
Define the background model: B = (B_1, …, B_4),
where B_j = Prob[letter j in a background sequence]

33  Expectation Maximization
Let λ be the prior probability that a word comes from the motif model.
Define Z_i1 = 1 if X_i is a motif occurrence, 0 otherwise; Z_i2 = 1 - Z_i1
Given a word X_i = x[s]…x[s+K-1]:
P[X_i, Z_i1 = 1] = λ · M_1,x[s] · … · M_K,x[s+K-1]
P[X_i, Z_i2 = 1] = (1 - λ) · B_x[s] · … · B_x[s+K-1]
Let λ_1 = λ; λ_2 = 1 - λ

34  Expectation Maximization
Define the parameter space θ = (M, B); θ_1: motif, θ_2: background
Objective: maximize the log likelihood of the model:
log P(X, Z | θ, λ) = Σ_i Σ_{j=1,2} Z_ij ( log λ_j + log P(X_i | θ_j) )

35  Expectation Maximization
Maximize the expected likelihood by iterating two steps:
Expectation: compute the expected value of the log likelihood over the Z, given the current parameters
Maximization: maximize this expected value over θ and λ

36  Expectation Maximization: E-step
Expectation: find the expected value of the log likelihood, where the expected values of Z are computed as follows:
Z*_i1 = λ P(X_i | θ_1) / ( λ P(X_i | θ_1) + (1 - λ) P(X_i | θ_2) ),   Z*_i2 = 1 - Z*_i1

37  Expectation Maximization: M-step
Maximization: maximize the expected value over θ and λ independently
For λ, this has the following solution (we won’t prove it):
λ_NEW = (1/n) Σ_i Z*_i1
Effectively, λ_NEW is the expected number of motifs per position, given our current parameters

38  Expectation Maximization: M-step
For θ = (M, B), define:
c_jk = E[# times letter k appears in motif position j]
c_0k = E[# times letter k appears in the background]
The c_jk values are calculated easily from the Z* values.
It then follows:
M_jk = c_jk / Σ_{k'} c_jk',   B_k = c_0k / Σ_{k'} c_0k'
To not allow any 0’s, add pseudocounts.

39  Initial Parameters Matter!
Consider the following artificial example of 6-mers X_1, …, X_n (n = 2000):
– 990 words “AAAAAA”
– 990 words “CCCCCC”
– 20 words “ACACAC”
Some local maxima:
– λ = 49.5%; B = 100/101 C, 1/101 A; M = 100% AAAAAA
– λ = 1%; B = 50% C, 50% A; M = 100% ACACAC

40  Overview of EM Algorithm
1. Initialize parameters θ = (M, B) and λ
– try different values of λ, from N^(-1/2) up to 1/(2K)
2. Repeat:
a. Expectation
b. Maximization
3. Until the change in θ = (M, B) and λ falls below ε
4. Report results for several “good” λ
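A compact sketch of the two-component mixture EM described on the previous slides. Assumptions beyond the slides: a near-uniform random initialization of M, a fixed pseudo-count, a simple convergence test on λ, and toy input data; MEME itself is considerably more elaborate.

```python
import random

BASES = "ACGT"
random.seed(0)  # for reproducibility of the sketch

def em_motif(words, K, n_iter=200, seed_lambda=0.05, pseudo=0.5):
    """EM for a mixture of one motif model M (K x 4) and a background model B."""
    lam = seed_lambda
    # Initialize M near uniform with a small perturbation; B from overall base frequencies.
    M = [[0.25 + random.uniform(-0.05, 0.05) for _ in BASES] for _ in range(K)]
    M = [[v / sum(row) for v in row] for row in M]
    counts = [sum(w.count(b) for w in words) for b in BASES]
    B = [c / sum(counts) for c in counts]

    for _ in range(n_iter):
        # E-step: Z[i] = posterior probability that word i is a motif occurrence.
        Z = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, ch in enumerate(w):
                k = BASES.index(ch)
                pm *= M[j][k]
                pb *= B[k]
            Z.append(pm / (pm + pb))

        # M-step: re-estimate lambda, M and B from the expected counts (with pseudocounts).
        new_lam = sum(Z) / len(words)
        M = []
        for j in range(K):
            col = [pseudo] * 4
            for w, z in zip(words, Z):
                col[BASES.index(w[j])] += z
            M.append([c / sum(col) for c in col])
        bg = [pseudo] * 4
        for w, z in zip(words, Z):
            for ch in w:
                bg[BASES.index(ch)] += 1.0 - z
        B = [c / sum(bg) for c in bg]

        if abs(new_lam - lam) < 1e-6:
            lam = new_lam
            break
        lam = new_lam
    return lam, M, B

# Toy input: all K-long words from a few promoter-like sequences.
seqs = ["TTGACATACACCGTT", "CCGTACACCTTAGGA", "GGTACACCATTTTCC"]
K = 6
words = [s[i:i + K] for s in seqs for i in range(len(s) - K + 1)]
lam, M, B = em_motif(words, K)
print(round(lam, 3), "".join(BASES[row.index(max(row))] for row in M))
```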

41  Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy (Lawrence et al. 1993)

42  Notations
Set of symbols (the sequence alphabet)
Sequences: S = {S_1, S_2, …, S_N}
Starting positions of motifs: A = {a_1, a_2, …, a_N}
Motif model (q): q_ij = P(symbol at the i-th motif position = j)
Background model: p_j = P(symbol = j)
Counts of symbols in each column: c_ij = count of symbol j in the i-th column of the aligned region

43  Probability of data given model

44  Scoring Function
Maximize the log-odds ratio:
F = Σ_{i=1..W} Σ_j c_ij log( q_ij / p_j )
F is greater than zero if the data is a better match to the motif model than to the background model.

45  Scoring function
A particular alignment A gives us the counts c_ij. In the scoring function F, use motif frequencies q_ij estimated from these counts (with pseudocounts).

46  Scoring function
Thus, given an alignment A, we can calculate the scoring function F.
We need to find the A that maximizes this scoring function, which is a log-odds score.

47  Optimization and Sampling
To maximize a function f(x):
– Brute-force method: try all possible x
– Sampling method: sample x from a probability distribution p(x) ∝ f(x)
– Idea: if x_max is the argmax of f(x), it is also the argmax of p(x), so we have a high probability of selecting x_max

48  Markov Chain Sampling
To sample from a probability distribution p(x), we set up a Markov chain such that each state represents a value of x and, for any two states x and y, the transition probabilities satisfy detailed balance:
p(x) T(x → y) = p(y) T(y → x)
This then implies that p(x) is the stationary distribution of the chain, so states are visited with long-run frequency p(x).

49  Gibbs sampling to maximize F
Gibbs sampling is a special type of Markov chain sampling algorithm.
Our goal is to find the optimal A = (a_1, …, a_N).
The Markov chain we construct will only have transitions from A to alignments A’ that differ from A in only one of the a_i.
In round-robin order, pick one of the a_i to replace.
Consider all A’ formed by replacing a_i with some other starting position a_i’ in sequence S_i.
Move to one of these A’ probabilistically.
Iterate the last three steps.

50  Algorithm
Randomly initialize A_0.
Repeat:
(1) Randomly choose a sequence z from S; set A* = A_t \ {a_z}; compute θ_t from A*.
(2) Sample a_z according to P(a_z = x), which is proportional to Q_x / P_x; update A_{t+1} = A* ∪ {x}.
Select the A_t that maximizes F.
Q_x: the probability of generating x according to θ_t; P_x: the probability of generating x according to the background model.
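A sketch of this site-sampling loop. Assumptions beyond the slide: the background p is estimated once from all sequences with add-one smoothing, pseudo-counts of 0.5 are used for q, the sequence z is chosen at random rather than in round-robin order, and the loop runs for a fixed number of iterations; the original Lawrence et al. sampler adds phase shifts and convergence checks. Names such as gibbs_sampler are illustrative.

```python
import random

BASES = "ACGT"

def gibbs_sampler(S, W, n_iter=500, pseudo=0.5, rng=None):
    """Site sampler: find one length-W motif start position per sequence."""
    rng = rng or random.Random(0)
    # Background frequencies p_j estimated once from all sequences (a simplification).
    total = sum(len(s) for s in S)
    p = [(sum(s.count(b) for s in S) + 1) / (total + 4) for b in BASES]

    # Random initial alignment A = (a_1, ..., a_N).
    A = [rng.randrange(len(s) - W + 1) for s in S]

    for _ in range(n_iter):
        z = rng.randrange(len(S))                 # (1) choose a sequence z and hold it out
        held_out = S[z]
        # Build motif frequencies q from the aligned words of the other sequences.
        q = [[pseudo] * 4 for _ in range(W)]
        for idx, (s, a) in enumerate(zip(S, A)):
            if idx == z:
                continue
            for i, ch in enumerate(s[a:a + W]):
                q[i][BASES.index(ch)] += 1
        q = [[c / sum(col) for c in col] for col in q]

        # (2) sample a new start a_z with probability proportional to Q_x / P_x.
        weights = []
        for x in range(len(held_out) - W + 1):
            ratio = 1.0
            for i, ch in enumerate(held_out[x:x + W]):
                k = BASES.index(ch)
                ratio *= q[i][k] / p[k]
            weights.append(ratio)
        A[z] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return A

seqs = ["TTGACATACACCGTT", "CCGTACACCTTAGGA", "GGTACACCATTTTCC"]
A = gibbs_sampler(seqs, 6)
print([s[a:a + 6] for s, a in zip(seqs, A)])   # often recovers similar words in this toy example
```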

51  Algorithm
Current solution A_t

52  Algorithm
Choose one a_z to replace

53  Algorithm
For each candidate site x in sequence z, calculate Q_x and P_x: the probabilities of sampling x from the motif model and the background model, respectively

54  Algorithm
Among all possible candidates, choose one (say x) with probability proportional to Q_x / P_x

55  Algorithm
Set A_{t+1} = A* ∪ {x}

56  Algorithm
Repeat

57  Local optima
The algorithm may not find the “global” (true) maximum of the scoring function.
Once A_t contains many similar substrings, other substrings matching these will be chosen with higher probability.
The algorithm can “get locked” into a “local optimum”: all neighbors have poorer scores, hence there is a low chance of moving out of this solution.

58  Phase shifts
After every M iterations, compare the current A_t with the alignments obtained by shifting every aligned substring a_i by some amount, either to the left or to the right.

59  Phase shift

60  Phase shift

61  Pattern Width
The algorithm described so far requires the pattern width (W) to be given as input.
We can modify the algorithm so that it executes for a range of plausible widths.
The function F is not immediately useful for this purpose, as its optimal value always increases with increasing W.

62  Pattern Width
Another function, based on the incomplete-data log-probability ratio G, can be used.
Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produces a statistic useful for choosing the pattern width.
This quantity can be called the information per parameter.

63  Time complexity analysis
For a typical protein sequence, it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence.
L·W multiplications are performed in step 2 of the algorithm (L = sequence length).
Total multiplications to execute the algorithm: T·N·L_avg·W.
Linear time complexity has been observed in applications.

64  Motif finding
The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences.
– Protein motifs represent common sequence patterns in proteins that are related to particular structures and functions of the protein.
Gibbs sampling is now extensively used to find motifs in DNA sequences, e.g., transcription factor binding sites.

65  Advantages / Disadvantages
Very similar to EM
Advantages:
– easier to implement
– less dependent on initial parameters
– more versatile; easier to enhance with heuristics
Disadvantages:
– more dependent on all sequences exhibiting the motif
– less systematic search of the initial parameter space

66  Repeats, and a Better Background Model
Repeat DNA can be mistaken for a motif
– especially low-complexity repeats: CACACA…, AAAAA…, etc.
Solution: a more elaborate background model
– 0th order: B = {p_A, p_C, p_G, p_T}
– 1st order: B = {P(A|A), P(A|C), …, P(T|T)}
– …
– Kth order: B = {P(X | b_1…b_K); X, b_i ∈ {A,C,G,T}}
This has been applied to both EM and Gibbs sampling (up to 3rd order).
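A sketch of a k-th order Markov background model. Assumptions: maximum-likelihood estimates with a small pseudo-count and log-probability scoring; the function names are illustrative.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_background(seqs, k, pseudo=0.5):
    """Estimate P(X | previous k bases) from the given sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + 4 * pseudo
        model[context] = {b: (nxt.get(b, 0.0) + pseudo) / total for b in BASES}
    return model

def log_prob(word, model, k):
    """Log-probability of a word under the k-th order background (unseen contexts skipped)."""
    return sum(math.log(model[word[i - k:i]][word[i]])
               for i in range(k, len(word))
               if word[i - k:i] in model)

seqs = ["CACACACACACATTTTTTTTGGCGCGCGATATAT"]
bg = train_background(seqs, 2)
print(round(log_prob("CACACA", bg, 2), 2))   # high (less negative): a low-complexity repeat
print(round(log_prob("CTGAAG", bg, 2), 2))   # lower, even though unseen contexts are skipped
```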

67  Limits of Motif Finders
Given upstream regions of co-regulated genes:
– Increasing their length makes motif finding harder: random motifs clutter the true ones
– Decreasing their length makes motif finding harder: the true motif may be missing in some sequences

68  Example Application: Motifs in Yeast
Group: Tavazoie et al. 1999 (G. Church’s lab, Harvard)
Data: microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.); 15 time points across two cell cycles
1. Cluster genes according to common expression
– K-means clustering -> 30 clusters, 50-190 genes/cluster
– clusters correlate well with known function
2. AlignACE motif finding
– 600-long upstream regions

69  Motifs in Periodic Clusters

70  Motifs in Non-periodic Clusters

