

1
Gene Regulation and Microarrays

2
Finding Regulatory Motifs
Given a collection of genes with common expression, find the TF-binding motif they have in common.

3
Characteristics of Regulatory Motifs
- Tiny
- Highly variable
- ~Constant size, because a constant-size transcription factor binds
- Often repeated
- Low-complexity-ish

4
Sequence Logos
Information at position i: H(i) = - Σ_{x in {A,C,G,T}} freq(x, i) log2 freq(x, i)
Height of letter x at position i: L(x, i) = freq(x, i) (2 - H(i))
Examples:
freq(A, i) = 1: H(i) = 0, L(A, i) = 2
A: ½, C: ¼, G: ¼: H(i) = 1.5, L(A, i) = ¼, L(C, i) = L(G, i) = ⅛
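The two formulas above translate directly into code. A minimal sketch (the function name is illustrative):

```python
import math

def column_heights(freqs):
    """Given letter frequencies at one motif position, return sequence-logo
    letter heights: H(i) = -sum_x freq(x,i) log2 freq(x,i) is the column
    entropy, and L(x,i) = freq(x,i) * (2 - H(i))."""
    h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return {x: f * (2 - h) for x, f in freqs.items()}

# freq(A,i) = 1  ->  H(i) = 0, so A gets the full 2 bits of height
# A: 1/2, C: 1/4, G: 1/4  ->  H(i) = 1.5, so L(A,i) = 0.25
```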

5
Problem Definition
Given a collection of promoter sequences s_1, …, s_N of genes with common expression:
Probabilistic motif: M_ij, 1 ≤ i ≤ W, 1 ≤ j ≤ 4, where M_ij = Prob[letter j at position i]. Find the best M and positions p_1, …, p_N in the sequences.
Combinatorial motif: M = m_1 … m_W, with some of the m_i blank. Find an M that occurs in all s_i with at most k differences, or find the M with the smallest total Hamming distance.

6
Essentially a Multiple Local Alignment
Find the "best" multiple local alignment; the alignment score is defined differently in the probabilistic and combinatorial cases.

7
Algorithms
Combinatorial: CONSENSUS, TEIRESIAS, SP-STAR, others
Probabilistic:
1. Expectation Maximization: MEME
2. Gibbs sampling: AlignACE, BioProspector

8
Combinatorial Approaches to Motif Finding

9
Discrete Formulations
Given sequences S = {x_1, …, x_n}, a motif W is a consensus string w_1 … w_K.
Find the motif W* with the "best" match to x_1, …, x_n.
Definition of "best":
d(W, x_i) = minimum Hamming distance between W and any word in x_i
d(W, S) = Σ_i d(W, x_i)
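As a sketch, d(W, x_i) and d(W, S) are straightforward to compute (the helper names here are illustrative):

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_seq(W, x):
    """d(W, x) = min Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[p:p + K]) for p in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S) = sum over sequences of the best-match distance."""
    return sum(d_seq(W, x) for x in S)
```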

10
Approaches
- Exhaustive searches
- CONSENSUS
- MULTIPROFILER
- TEIRESIAS, SP-STAR, WINNOWER

11
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities), find d(W, S)
Report W* = argmin_W d(W, S)
Running time: O(K N 4^K), where N = Σ_i |x_i|
Advantage: finds the provably "best" motif W
Disadvantage: time
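The pattern-driven search enumerates all 4^K candidate consensus words; a sketch (impractical beyond small K, as noted above):

```python
from itertools import product

def pattern_driven(S, K):
    """Enumerate all 4^K words, return the one minimizing d(W, S).
    Provably optimal, but the 4^K factor makes it slow for large K."""
    def d_seq(W, x):
        return min(sum(a != b for a, b in zip(W, x[p:p + K]))
                   for p in range(len(x) - K + 1))
    return min(("".join(w) for w in product("ACGT", repeat=K)),
               key=lambda W: sum(d_seq(W, x) for x in S))

# e.g. pattern_driven(["TTACGT", "GACGTA", "CCACGT"], 4) recovers "ACGT",
# the only 4-mer occurring exactly in all three sequences
```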

12
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some x_i, find d(W, S)
Report W* = argmin_W d(W, S), or report a local improvement of W*
Running time: O(K N²)
Advantage: time
Disadvantage: if the true motif is weak and does not occur exactly in the data, a random motif may score better than any instance of the true motif

13
CONSENSUS
Algorithm, cycle 1:
For each word W in S (of fixed length!)
For each word W' in S
Create a gap-free alignment of W, W'
Keep the C_1 best alignments A_1, …, A_{C_1}
ACGGTTG, CGAACTT, GGGCTCT …
ACGCCTG, AGAACTA, GGGGTGT …

14
CONSENSUS
Algorithm, cycle t:
For each word W in S
For each alignment A_j from cycle t-1
Create a gap-free alignment of W, A_j
Keep the C_t best alignments A_1, …, A_{C_t}
ACGGTTG, CGAACTT, GGGCTCT …
ACGCCTG, AGAACTA, GGGGTGT …
………
ACGGCTC, AGATCTT, GGCGTCT …
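The two cycles above amount to a greedy beam search. A simplified sketch (it extends alignments one sequence at a time rather than over all words in S, and scores columns by majority agreement; both are illustrative simplifications):

```python
def kmers(x, K):
    """All K-long words in sequence x."""
    return [x[p:p + K] for p in range(len(x) - K + 1)]

def score(alignment):
    """Column-wise agreement with the majority letter (illustrative score)."""
    return sum(max(col.count(c) for c in "ACGT") for col in zip(*alignment))

def consensus_beam(S, K, C=10):
    """Greedy CONSENSUS-style beam search: extend each kept gap-free
    alignment by every word of the next sequence, keep the C best."""
    beams = [[w] for w in kmers(S[0], K)]
    for x in S[1:]:
        candidates = [a + [w] for a in beams for w in kmers(x, K)]
        candidates.sort(key=score, reverse=True)
        beams = candidates[:C]
    return beams[0]
```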

15
CONSENSUS
C_1, …, C_n are user-defined heuristic constants.
N is the sum of the sequence lengths; n is the number of sequences.
Running time:
O(N²) + O(N C_1) + O(N C_2) + … + O(N C_n) = O(N² + N C_total)
where C_total = Σ_i C_i, typically O(nC) for some big constant C.

16
MULTIPROFILER
Extended sample-driven approach.
Given a K-long word W, define:
N_α(W) = words W' in S such that d(W, W') ≤ α
Idea: assume W is an occurrence of the true motif W*.
Use N_α(W) to correct "errors" in W.

17
MULTIPROFILER
Assume W differs from the true motif W* in at most L positions.
Define: a wordlet G of W is an L-long pattern with blanks, differing from W.
L is smaller than the word length K.
Example: K = 7, L = 3
W = ACGTTGA
G = --A--CG

18
MULTIPROFILER
Algorithm:
For each W in S:
For L = 1 to L_max:
1. Find the α-neighbors of W in S: N_α(W)
2. Find all "strong" L-long wordlets G in N_α(W)
3. For each wordlet G:
a. Modify W by the wordlet G, giving W'
b. Compute d(W', S)
Report W* = argmin d(W', S)
Step 2 above is a smaller motif-finding problem; use exhaustive search.

19
Expectation Maximization in Motif Finding

20
Expectation Maximization
The MM algorithm, part of the MEME package, uses Expectation Maximization.
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest motif model, background model, and classification of words into motif or background

21
Expectation Maximization
Given sequences x_1, …, x_N, find all K-long words X_1, …, X_n.
Define the motif model:
M = (M_1, …, M_K), with M_i = (M_i1, …, M_i4) (assume alphabet {A, C, G, T})
where M_ij = Prob[letter j occurs at motif position i]
Define the background model:
B = (B_1, …, B_4), where B_j = Prob[letter j in background sequence]

22
Expectation Maximization
Define:
Z_i1 = 1 if X_i is motif, 0 otherwise
Z_i2 = 1 - Z_i1
Let λ = Prob[a word is motif]. Given a word X_i = x[1] … x[K]:
P[X_i, Z_i1 = 1] = λ M_{1,x[1]} … M_{K,x[K]}
P[X_i, Z_i2 = 1] = (1 - λ) B_{x[1]} … B_{x[K]}
Let λ_1 = λ; λ_2 = 1 - λ.
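Under this model, the two joint probabilities can be computed as follows (a sketch; here M is a list of per-position letter distributions and B a single letter distribution):

```python
def motif_prob(word, M, lam):
    """P[X_i, Z_i1 = 1] = lambda * M[1][x[1]] * ... * M[K][x[K]]."""
    p = lam
    for i, c in enumerate(word):
        p *= M[i][c]   # position-specific letter probability
    return p

def background_prob(word, B, lam):
    """P[X_i, Z_i2 = 1] = (1 - lambda) * B[x[1]] * ... * B[x[K]]."""
    p = 1.0 - lam
    for c in word:
        p *= B[c]      # one shared background distribution
    return p
```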

23
Expectation Maximization
Define the parameter space θ = (M, B), with component 1 the motif and component 2 the background.
Objective: maximize the log likelihood of the model:
log P(X, Z | θ, λ) = Σ_{i=1..n} Σ_{j=1,2} Z_ij ( log λ_j + log P(X_i | θ_j) )

24
Expectation Maximization
Maximize the expected likelihood by iterating two steps:
Expectation: find the expected value of the log likelihood over the Z_ij, given current parameters
Maximization: maximize that expected value over θ and λ

25
Expectation Maximization: E-step
Expectation: find the expected value of the log likelihood:
E[log P(X, Z | θ, λ)] = Σ_i Σ_j E[Z_ij] ( log λ_j + log P(X_i | θ_j) )
where the expected values of Z can be computed as follows:
E[Z_i1] = λ P(X_i | θ_1) / ( λ P(X_i | θ_1) + (1 - λ) P(X_i | θ_2) ), and E[Z_i2] = 1 - E[Z_i1]
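Concretely, the E-step reduces to computing each word's posterior probability of being a motif occurrence; a sketch:

```python
def e_step(words, M, B, lam):
    """E[Z_i1] = lam*P(X_i|motif) / (lam*P(X_i|motif) + (1-lam)*P(X_i|bg))."""
    ez = []
    for w in words:
        pm, pb = lam, 1.0 - lam
        for j, c in enumerate(w):
            pm *= M[j][c]   # motif component: position-specific frequencies
            pb *= B[c]      # background component: one shared distribution
        ez.append(pm / (pm + pb))
    return ez
```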

26
Expectation Maximization: M-step
Maximization: maximize the expected value over θ and λ independently.
For λ, this is easy:
λ* = (1/n) Σ_i E[Z_i1]

27
Expectation Maximization: M-step
For θ = (M, B), define
c_jk = E[# times letter k appears at motif position j]
c_0k = E[# times letter k appears in the background]
The c_jk values are calculated easily from the E[Z] values.
It easily follows that:
M_jk = c_jk / Σ_{k'} c_jk'    B_k = c_0k / Σ_{k'} c_0k'
To not allow any 0's, add pseudocounts.
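The M-step then just normalizes the expected counts, with pseudocounts to avoid zeros; a sketch:

```python
def m_step(words, ez, pseudo=1.0):
    """Turn expected letter counts into new model parameters:
    c[j][k] = E[# letter k at motif position j], c0[k] = background counts;
    M_jk and B_k are the normalized (pseudocount-smoothed) frequencies."""
    K = len(words[0])
    c = [{a: pseudo for a in "ACGT"} for _ in range(K)]
    c0 = {a: pseudo for a in "ACGT"}
    for w, z in zip(words, ez):
        for j, a in enumerate(w):
            c[j][a] += z          # weight by prob. the word is motif
            c0[a] += 1.0 - z      # remainder goes to the background
    M = [{a: cj[a] / sum(cj.values()) for a in "ACGT"} for cj in c]
    B = {a: c0[a] / sum(c0.values()) for a in "ACGT"}
    return M, B
```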

28
Initial Parameters Matter!
Consider the following "artificial" example. x_1, …, x_N contain:
- 2^12 patterns on {A, T}: A…A, A…AT, …, T…T
- 2^12 patterns on {C, G}: C…C, C…CG, …, G…G
- D << 2^12 occurrences of the 12-mer ACTGACTGACTG
Some local maxima:
- λ ≈ ½; B = ½C, ½G; M_i = ½A, ½T for i = 1, …, 12
- λ ≈ D/2^(k+1); B = ¼A, ¼C, ¼G, ¼T; M_1 = 100% A, M_2 = 100% C, M_3 = 100% T, etc.

29
Overview of EM Algorithm
1. Initialize parameters θ = (M, B), λ: try different values of λ from N^(-1/2) up to 1/(2K)
2. Repeat:
a. Expectation
b. Maximization
3. Until the change in θ = (M, B), λ falls below ε
4. Report results for several "good" λ
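Putting the steps together, a self-contained simplified loop over fixed K-long words might look like the following (an illustrative sketch, not MEME itself; initialization and stopping are cruder than the scheme above):

```python
import random

def em_motif(words, K, lam=0.1, iters=100, pseudo=0.5, seed=0):
    """Simplified EM for the two-component motif/background mixture."""
    rng = random.Random(seed)
    # random-ish initial motif model, uniform background
    M = []
    for _ in range(K):
        fav = rng.choice("ACGT")
        M.append({a: (0.7 if a == fav else 0.1) for a in "ACGT"})
    B = {a: 0.25 for a in "ACGT"}
    for _ in range(iters):
        # E-step: posterior that each word is a motif occurrence
        ez = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, c in enumerate(w):
                pm *= M[j][c]
                pb *= B[c]
            ez.append(pm / (pm + pb))
        # M-step: re-estimate M, B, lambda from expected counts
        cm = [{a: pseudo for a in "ACGT"} for _ in range(K)]
        cb = {a: pseudo for a in "ACGT"}
        for w, z in zip(words, ez):
            for j, c in enumerate(w):
                cm[j][c] += z
                cb[c] += 1.0 - z
        M = [{a: col[a] / sum(col.values()) for a in "ACGT"} for col in cm]
        B = {a: cb[a] / sum(cb.values()) for a in "ACGT"}
        lam = sum(ez) / len(ez)
    return M, B, lam
```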

30
Overview of EM Algorithm
One iteration running time: O(NK)
Usually need < N iterations for convergence, and < N starting points.
Overall complexity: unclear – typically O(N²K) to O(N³K)
EM is a local optimization method; initial parameters matter.
MEME: Bailey and Elkan, ISMB 1994.
