Published by Regina Limon. Modified over 2 years ago.

1
Gene Regulation and Microarrays

2
Finding Regulatory Motifs
Given a collection of genes with common expression, find the TF-binding motif they have in common.

3
Characteristics of Regulatory Motifs
Tiny
Highly variable
~Constant size, because a constant-size transcription factor binds
Often repeated
Low-complexity-ish

4
Sequence Logos
Information at position i (Shannon entropy, in bits): $H(i) = -\sum_{\text{letter } x} \mathrm{freq}(x, i) \log_2 \mathrm{freq}(x, i)$
Height of letter x at position i: $L(x, i) = \mathrm{freq}(x, i)\,(2 - H(i))$
Examples:
freq(A, i) = 1: H(i) = 0; L(A, i) = 2
A: ½; C: ¼; G: ¼: H(i) = 1.5; L(A, i) = ¼; L(C, i) = L(G, i) = ⅛
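The two logo formulas can be checked numerically. The following is a minimal sketch for a single DNA column; the function names `column_entropy` and `logo_heights` are illustrative and not part of any logo tool.

```python
import math

def column_entropy(freqs):
    """Shannon entropy H(i) in bits; freqs maps letter -> frequency at one position."""
    # Letters with frequency 0 contribute nothing (0 * log 0 is taken as 0).
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)

def logo_heights(freqs):
    """Letter heights L(x, i) = freq(x, i) * (2 - H(i)) for a 4-letter alphabet."""
    h = column_entropy(freqs)
    return {x: f * (2 - h) for x, f in freqs.items()}

# The two examples from the slide.
print(logo_heights({"A": 1.0}))
print(logo_heights({"A": 0.5, "C": 0.25, "G": 0.25}))
```

The constant 2 is the maximum entropy of a four-letter alphabet in bits, so a fully conserved column has total height 2 and a uniform column has height 0.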

5
Problem Definition
Given a collection of promoter sequences $s_1, \ldots, s_N$ of genes with common expression:
Probabilistic motif: $M_{ij}$, $1 \le i \le W$, $1 \le j \le 4$
  $M_{ij}$ = Prob[letter j, position i]
  Find the best M, and positions $p_1, \ldots, p_N$ in the sequences
Combinatorial motif M: $m_1 \ldots m_W$, some of the $m_i$’s blank
  Find M that occurs in all $s_i$ with $\le k$ differences
  Or, find M with the smallest total Hamming distance

6
Essentially a Multiple Local Alignment
Find the “best” multiple local alignment
The alignment score is defined differently in the probabilistic and combinatorial cases

7
Algorithms
Combinatorial: CONSENSUS, TEIRESIAS, SP-STAR, others
Probabilistic:
1. Expectation Maximization: MEME
2. Gibbs Sampling: AlignACE, BioProspector

8
Combinatorial Approaches to Motif Finding

9
Discrete Formulations
Given sequences $S = \{x_1, \ldots, x_n\}$
A motif W is a consensus string $w_1 \ldots w_K$
Find the motif $W^*$ with the “best” match to $x_1, \ldots, x_n$
Definition of “best”:
$d(W, x_i)$ = minimum Hamming distance between W and any word in $x_i$
$d(W, S) = \sum_i d(W, x_i)$

10
Approaches
Exhaustive searches
CONSENSUS
MULTIPROFILER
TEIRESIAS, SP-STAR, WINNOWER

11
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T ($4^K$ possibilities)
  Find d(W, S)
Report $W^* = \arg\min_W d(W, S)$
Running time: $O(K N 4^K)$, where $N = \sum_i |x_i|$
Advantage: finds the provably “best” motif W
Disadvantage: time
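The pattern-driven search can be sketched directly from the definitions of d(W, x) and d(W, S). This is a minimal illustration that assumes every sequence is at least K letters long; the function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def dist_to_sequence(w, x):
    """d(W, x): minimum Hamming distance between W and any len(W)-long word of x."""
    k = len(w)
    return min(hamming(w, x[p:p + k]) for p in range(len(x) - k + 1))

def total_distance(w, seqs):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(dist_to_sequence(w, x) for x in seqs)

def pattern_driven_search(seqs, k, alphabet="ACGT"):
    """Try all 4^k candidate words and return (best word, its total distance)."""
    candidates = ("".join(t) for t in product(alphabet, repeat=k))
    best = min(candidates, key=lambda w: total_distance(w, seqs))
    return best, total_distance(best, seqs)

print(pattern_driven_search(["TTACGTTT", "GGACGTGG", "CACGTCCC"], 4))
```

Each of the $4^K$ candidates is compared against roughly N positions at cost K each, which is where the $O(K N 4^K)$ bound comes from.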

12
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some $x_i$
  Find d(W, S)
Report $W^* = \arg\min_W d(W, S)$, or report a local improvement of $W^*$
Running time: $O(K N^2)$
Advantage: time
Disadvantage: if the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
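The sample-driven variant differs only in where the candidates come from. A minimal sketch, again assuming every sequence has at least K letters and using hypothetical function names:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def total_distance(w, seqs):
    """d(W, S): for each sequence take the closest len(W)-long word, then sum."""
    k = len(w)
    return sum(
        min(hamming(w, x[p:p + k]) for p in range(len(x) - k + 1))
        for x in seqs
    )

def sample_driven_search(seqs, k):
    """Score only the k-long words that actually occur in the data."""
    candidates = {x[p:p + k] for x in seqs for p in range(len(x) - k + 1)}
    # Sorting makes tie-breaking deterministic.
    best = min(sorted(candidates), key=lambda w: total_distance(w, seqs))
    return best, total_distance(best, seqs)

print(sample_driven_search(["TTACGTTT", "GGACGTGG", "CACGTCCC"], 4))
```

There are at most N candidate words instead of $4^K$, which gives the $O(K N^2)$ running time, but the reported word is always one that occurs in the input.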

13
CONSENSUS
Algorithm:
Cycle 1:
For each word W in S (of fixed length!)
  For each word W’ in S
    Create a gap-free alignment of W, W’
Keep the $C_1$ best alignments, $A_1, \ldots, A_{C_1}$
ACGGTTG, CGAACTT, GGGCTCT …
ACGCCTG, AGAACTA, GGGGTGT …

14
CONSENSUS
Algorithm:
Cycle t:
For each word W in S
  For each alignment $A_j$ from cycle t-1
    Create a gap-free alignment of W, $A_j$
Keep the $C_t$ best alignments, $A_1, \ldots, A_{C_t}$
ACGGTTG, CGAACTT, GGGCTCT …
ACGCCTG, AGAACTA, GGGGTGT …
………
ACGGCTC, AGATCTT, GGCGTCT …

15
CONSENSUS
$C_1, \ldots, C_n$ are user-defined heuristic constants
N is the sum of the sequence lengths
n is the number of sequences
Running time: $O(N^2) + O(N C_1) + O(N C_2) + \cdots + O(N C_n) = O(N^2 + N C_{\text{total}})$
where $C_{\text{total}} = \sum_i C_i$, typically O(nC), where C is a big constant

16
MULTIPROFILER
Extended sample-driven approach
Given a K-long word W, define:
$N_\alpha(W)$ = words W’ in S such that $d(W, W') \le \alpha$
Idea: assume W is an occurrence of the true motif $W^*$; use $N_\alpha(W)$ to correct “errors” in W
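The neighborhood $N_\alpha(W)$ is simple to compute by scanning every word in S. A minimal sketch, with `alpha_neighbors` as a hypothetical name and Hamming distance assumed for d(W, W’):

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def alpha_neighbors(w, seqs, alpha):
    """All len(w)-long words in the sequences within Hamming distance alpha of w."""
    k = len(w)
    return [
        x[p:p + k]
        for x in seqs
        for p in range(len(x) - k + 1)
        if hamming(w, x[p:p + k]) <= alpha
    ]

print(alpha_neighbors("ACGT", ["TTACGTTT", "GGACGAGG"], 1))
```

If W is a mutated occurrence of the true motif, other occurrences tend to fall inside this neighborhood, and the positions where they agree with each other but not with W are candidates for correction.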

17
MULTIPROFILER
Assume W differs from the true motif $W^*$ in at most L positions
Define: a wordlet G of W is an L-long pattern with blanks, differing from W
L is smaller than the word length K
Example: K = 7; L = 3
W = ACGTTGA
G = --A--CG

18
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to $L_{\max}$:
    1. Find the α-neighbors of W in S → $N_\alpha(W)$
    2. Find all “strong” L-long wordlets G in $N_\alpha(W)$
    3. For each wordlet G:
       1. Modify W by the wordlet G → W’
       2. Compute d(W’, S)
Report $W^* = \arg\min d(W', S)$
Step 1 above: smaller motif-finding problem; use exhaustive search

19
Expectation Maximization in Motif Finding

20
Expectation Maximization
The MM algorithm, part of the MEME package, uses Expectation Maximization
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is motif or background
3. Find the likeliest
   Motif Model
   Background Model
   classification of words into either Motif or Background

21
Expectation Maximization
Given sequences $x_1, \ldots, x_N$,
Find all K-long words $X_1, \ldots, X_n$
Define the motif model: $M = (M_1, \ldots, M_K)$, $M_i = (M_{i1}, \ldots, M_{i4})$ (assume {A, C, G, T}),
where $M_{ij}$ = Prob[letter j occurs in motif position i]
Define the background model: $B = (B_1, \ldots, B_4)$,
where $B_j$ = Prob[letter j in background sequence]

22
Expectation Maximization
Define
$Z_{i1}$ = 1 if $X_i$ is motif; 0 otherwise
$Z_{i2}$ = 0 if $X_i$ is motif; 1 otherwise
Given a word $X_i = x[1] \ldots x[K]$,
$P[X_i, Z_{i1} = 1] = \lambda \, M_{1x[1]} \cdots M_{Kx[K]}$
$P[X_i, Z_{i2} = 1] = (1 - \lambda) \, B_{x[1]} \cdots B_{x[K]}$
Let $\lambda_1 = \lambda$; $\lambda_2 = 1 - \lambda$
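The two joint probabilities are plain products and can be written out directly. A minimal sketch, assuming M is stored as a list of K rows of four probabilities in the order A, C, G, T, B as one such row, and `lam` as the prior probability that a word is a motif occurrence; these names and the storage layout are illustrative.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def joint_motif(word, M, lam):
    """P[X, Z1 = 1] = lam * product over positions i of M[i][letter at i]."""
    p = lam
    for i, ch in enumerate(word):
        p *= M[i][IDX[ch]]
    return p

def joint_background(word, B, lam):
    """P[X, Z2 = 1] = (1 - lam) * product over positions of B[letter]."""
    p = 1 - lam
    for ch in word:
        p *= B[IDX[ch]]
    return p

M = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # motif "AC", fully conserved
B = [0.25, 0.25, 0.25, 0.25]                      # uniform background
print(joint_motif("AC", M, 0.1), joint_background("AC", B, 0.1))
```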

23
Expectation Maximization
Define:
Parameter space $\theta = (M, B)$
$\theta_1$: Motif; $\theta_2$: Background
Objective: maximize the log likelihood of the model
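The objective formula on this slide was an image and is not in the transcript. Assuming the joint probabilities defined on slide 22, and writing $P(X_i \mid \theta_1)$ for the motif product and $P(X_i \mid \theta_2)$ for the background product, the complete-data log likelihood would take the following form. This is a reconstruction from those definitions, not the slide's own rendering.

```latex
\log P(X, Z \mid \theta, \lambda)
  = \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij}
    \log\bigl( \lambda_j \, P(X_i \mid \theta_j) \bigr)
```

Because exactly one of $Z_{i1}$, $Z_{i2}$ equals 1 for each word, each word contributes the log of whichever joint probability matches its label.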

24
Expectation Maximization
Maximize the expected likelihood by iterating two steps:
Expectation: find the expected value of the log likelihood
Maximization: maximize the expected value over $\theta$, $\lambda$

25
Expectation Maximization: E-step
Expectation: find the expected value of the log likelihood, where the expected values of Z are computed under the current parameters
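The E-step formulas did not survive in the transcript. Under the two-component model of slide 22 they would follow from Bayes' rule as below; the symbols match the earlier definitions, and the block is a reconstruction rather than a copy of the slide.

```latex
E\bigl[\log P(X, Z \mid \theta, \lambda)\bigr]
  = \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}]
    \log\bigl( \lambda_j \, P(X_i \mid \theta_j) \bigr)

E[Z_{ij}]
  = \frac{\lambda_j \, P(X_i \mid \theta_j)}
         {\lambda_1 \, P(X_i \mid \theta_1) + \lambda_2 \, P(X_i \mid \theta_2)}
```

$E[Z_{i1}]$ is the posterior probability that word $X_i$ is a motif occurrence, and $E[Z_{i1}] + E[Z_{i2}] = 1$.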

26
Expectation Maximization: M-step
Maximization: maximize the expected value over $\theta$ and $\lambda$ independently
For $\lambda$, this is easy

27
Expectation Maximization: M-step
For $\theta = (M, B)$, define
$c_{jk}$ = E[# times letter k appears in motif position j]
$c_{0k}$ = E[# times letter k appears in background]
The $c_{jk}$ values are calculated easily from the E[Z] values
The estimates of M and B follow easily from these counts; to not allow any 0’s, add pseudocounts
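The update formulas themselves are missing from the transcript. Assuming the expected counts defined on this slide, the standard maximum-likelihood updates would look as follows. The pseudocounts $\beta_k$ are a symbol introduced here for illustration, and $[\cdot]$ is 1 when its condition holds and 0 otherwise.

```latex
\lambda_j^{\text{new}} = \frac{1}{n} \sum_{i=1}^{n} E[Z_{ij}]

c_{jk} = \sum_{i=1}^{n} E[Z_{i1}] \, [\, X_i \text{ has letter } k \text{ at position } j \,]
\qquad
c_{0k} = \sum_{i=1}^{n} E[Z_{i2}] \cdot (\#\text{ of letter } k \text{ in } X_i)

M_{jk}^{\text{new}} = \frac{c_{jk} + \beta_k}{\sum_{k'=1}^{4} \bigl( c_{jk'} + \beta_{k'} \bigr)}
\qquad
B_{k}^{\text{new}} = \frac{c_{0k} + \beta_k}{\sum_{k'=1}^{4} \bigl( c_{0k'} + \beta_{k'} \bigr)}
```

With all $\beta_k > 0$, no entry of M or B can become exactly zero, so no word is ever assigned probability zero under either model.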

28
Initial Parameters Matter!
Consider the following “artificial” example: $x_1, \ldots, x_N$ contain:
$2^{12}$ patterns on {A, T}: A…A, A…AT, ……, T…T
$2^{12}$ patterns on {C, G}: C…C, C…CG, ……, G…G
$D \ll 2^{12}$ occurrences of the 12-mer ACTGACTGACTG
Some local maxima:
$\lambda \approx ½$; B = ½C, ½G; $M_i$ = ½A, ½T, i = 1, …, 12
$\lambda \approx D/2^{k+1}$; B = ¼A, ¼C, ¼G, ¼T; $M_1$ = 100% A, $M_2$ = 100% C, $M_3$ = 100% T, etc.

29
Overview of EM Algorithm
1. Initialize parameters $\theta = (M, B)$, $\lambda$: try different values of $\lambda$ from $N^{-1/2}$ up to 1/(2K)
2. Repeat:
   a. Expectation
   b. Maximization
3. Until the change in $\theta = (M, B)$, $\lambda$ falls below $\epsilon$
4. Report results for several “good” $\lambda$
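The loop above can be sketched end to end for the two-component word model. This is a minimal illustration, not the MEME implementation: it assumes a single starting point supplied by the caller, strictly positive initial M and B, a uniform pseudocount `pseudo`, and a convergence threshold `eps`, all of which are choices made for this example.

```python
import math

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def em_motif(seqs, k, M, B, lam, pseudo=0.1, eps=1e-6, max_iter=200):
    """Run EM on all k-long words; returns the fitted (M, B, lam)."""
    words = [x[p:p + k] for x in seqs for p in range(len(x) - k + 1)]
    n = len(words)
    for _ in range(max_iter):
        # E-step: posterior probability that each word is a motif occurrence.
        z = []
        for w in words:
            pm = lam * math.prod(M[i][IDX[c]] for i, c in enumerate(w))
            pb = (1 - lam) * math.prod(B[IDX[c]] for c in w)
            z.append(pm / (pm + pb))
        # M-step: lam is the mean posterior; M and B come from expected counts.
        new_lam = sum(z) / n
        c = [[pseudo] * 4 for _ in range(k)]   # motif counts, with pseudocounts
        c0 = [pseudo] * 4                      # background counts, with pseudocounts
        for w, zi in zip(words, z):
            for i, ch in enumerate(w):
                c[i][IDX[ch]] += zi
                c0[IDX[ch]] += 1 - zi
        new_M = [[v / sum(row) for v in row] for row in c]
        new_B = [v / sum(c0) for v in c0]
        # Largest absolute parameter change in this iteration.
        change = max(
            abs(new_lam - lam),
            max(abs(a - b) for r1, r2 in zip(new_M, M) for a, b in zip(r1, r2)),
            max(abs(a - b) for a, b in zip(new_B, B)),
        )
        M, B, lam = new_M, new_B, new_lam
        if change < eps:
            break
    return M, B, lam

# Start biased toward ACGT; a different start may reach a different local maximum.
M0 = [[0.7 if j == i else 0.1 for j in range(4)] for i in range(4)]
M, B, lam = em_motif(["TTACGTTT", "GGACGTGG", "CACGTCCC"], 4, M0, [0.25] * 4, 0.2)
print(lam)
```

In practice the loop would be wrapped in an outer loop over several starting values of M and $\lambda$, keeping the runs with the highest likelihood.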

30
Overview of EM Algorithm
One iteration running time: O(NK)
Usually need < N iterations for convergence, and < N starting points
Overall complexity: unclear; typically $O(N^2 K)$ to $O(N^3 K)$
EM is a local optimization method
Initial parameters matter
MEME: Bailey and Elkan, ISMB 1994.
