Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002.


1 Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

2 Outline
Background biology
Motif-finding methods
– Word enumeration
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer

4 Regulation of Gene Expression
– Chromatin structure
– Transcription initiation
– Transcript processing and modification
– RNA transport
– Transcript stability
– Translation initiation
– Post-translational modification
– Protein transport
– Control of protein stability

5 Typical Structure of a Eukaryotic mRNA Gene

6 Control of Transcription Initiation

7 Motif
A conserved pattern that is found in two or more sequences.
Can be found in:
– DNA (e.g., transcription factor binding sites)
– Protein
– RNA

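8 Models for Representing Motifs
Regular expression
– Consensus: TGACGCA
– Degenerate: WGACRCA (W = A or T, R = A or G)
Position-specific matrix, built from aligned sites:
TGACGCA
AGACGCA
TGACACA
AGACGCA
[Matrix with rows A, T, G, C and one column per position; values not preserved in this transcript.]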
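The matrix values from this slide did not survive, but a position-specific matrix can be recomputed from the four aligned sites. The following is a minimal sketch, assuming the matrix holds plain per-column base frequencies with no pseudocounts; the function name `position_frequency_matrix` is illustrative and not from the lecture.

```python
from collections import Counter

def position_frequency_matrix(sites):
    """Return {base: [frequency at each position]} for equal-length sites."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must be aligned"
    matrix = {base: [] for base in "ATGC"}
    for column in zip(*sites):            # iterate over alignment columns
        counts = Counter(column)
        for base in "ATGC":
            matrix[base].append(counts[base] / len(sites))
    return matrix

# The four sites listed on the slide.
sites = ["TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
pfm = position_frequency_matrix(sites)
for base in "ATGC":
    print(base, " ".join(f"{x:.2f}" for x in pfm[base]))
```

Positions 1 and 5 are the only variable columns, which is where the degenerate symbols W and R sit in WGACRCA.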

9 Where to look for motifs?
Gene families: sets of genes controlled by a common transcription factor or a common environmental stimulus.
How do you construct gene families?
– Microarray experiments

10 Microarrays
[Figure labels: known DNA sequences; glass slide; isolate mRNA; cells of interest; reference sample; resulting data (genes, experiments).]

11 Motif-finding Methods
Goal: look for motifs (5-15 bp) in the data set.
Methods:
– Word enumeration
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer

12 Word Enumeration
For every word w, calculate:
– Expected frequency, based on the entire upstream region of the yeast genome. E.g., P(ATTGA) = (0.4)^4 (0.1)^1, given P(A) = P(T) = 0.4 and P(G) = P(C) = 0.1. Expected number of occurrences of ATTGA: E = n * P(ATTGA)
– Observed frequency O in the data set
– Statistical significance of enrichment: Z = (O - E) / sqrt[np(1 - p)] ~ N(0, 1), where p = P(w)
Disadvantage: only exact words are considered. E.g., the degenerate motif YCTGCA is counted as two separate words, TCTGCA and CCTGCA.
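The enrichment calculation on this slide can be sketched in a few lines. The base frequencies and the word ATTGA come from the slide; the number of word positions `n_positions` and the observed count `observed` are hypothetical values chosen only to illustrate the arithmetic.

```python
import math

def word_probability(word, base_probs):
    """Probability of an exact word under independent base frequencies."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def enrichment_z(observed, n, p):
    """Z = (O - E) / sqrt(n p (1 - p)), with E = n p."""
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

base_probs = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
p_word = word_probability("ATTGA", base_probs)   # (0.4)^4 * (0.1)^1

n_positions = 10_000   # hypothetical number of word positions in the data set
observed = 40          # hypothetical observed count of ATTGA
z = enrichment_z(observed, n_positions, p_word)
print(f"P(ATTGA) = {p_word:.5f}, Z = {z:.2f}")
```

A degenerate motif such as YCTGCA would need the counts of TCTGCA and CCTGCA pooled before computing Z, which is the limitation the slide points out.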

13 Gibbs Sampling
Matrix to capture a motif.
Goal: find the best start positions a_1, ..., a_k to maximize the difference between the motif and the background base distribution.
[Figure: sequences with motif start positions a_1, a_2, a_3, a_4, ..., a_k marked.]
Liu, X

14 Gibbs Sampling (Lawrence et al., 1993)
Step 1: Pick random start positions, compute the current motif matrix
Step 2: Iterative update
– Take one sequence out, update the motif matrix
– Calculate the fitness score of each position in the held-out sequence
– Pick a start position in the held-out sequence, weighted by Ax
– Take out another sequence, ..., until convergence
Step 3: Reset starting positions
Liu, X

15 Gibbs Sampling Initialization
Pick random start positions, compute the motif matrix.
[Figure: initial start positions a_1, ..., a_k replaced by random positions a_1', ..., a_k'.]
Liu, X

16 Gibbs Sampling Iteration Steps
1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif.
[Figure: one sequence held out, its start position unknown; the others keep positions a_2', a_3', a_4', ..., a_k'.]
Liu, X

17 Fitness Score
Ax = Qx / Px
– Qx: probability of generating subsequence x from the current motif
– Px: probability of generating subsequence x from the background
Current motif: [3-column matrix with rows A, T, G, C; values not preserved in this transcript]
Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
x = GGA: Q? P?
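The exercise on this slide can be worked through in code. The background frequencies and x = GGA are from the slide; the 3-column motif matrix below is a hypothetical stand-in, because the slide's own matrix values were lost, so the resulting Q and A are illustrative only. The background term P does not depend on the matrix.

```python
def background_probability(x, background):
    """Px: probability of x under independent background base frequencies."""
    p = 1.0
    for base in x:
        p *= background[base]
    return p

def motif_probability(x, motif):
    """Qx: probability of x under the motif matrix, column by column."""
    q = 1.0
    for i, base in enumerate(x):
        q *= motif[base][i]
    return q

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# Hypothetical motif matrix (each column sums to 1); NOT the slide's values.
motif = {
    "A": [0.1, 0.1, 0.7],
    "T": [0.1, 0.1, 0.1],
    "G": [0.7, 0.6, 0.1],
    "C": [0.1, 0.2, 0.1],
}

x = "GGA"
q = motif_probability(x, motif)             # 0.7 * 0.6 * 0.7
p = background_probability(x, background)   # 0.1 * 0.1 * 0.4
print(f"Q = {q:.3f}, P = {p:.3f}, A = Q/P = {q / p:.1f}")
```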

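18 Gibbs Sampling Iteration Steps
2) Pick a new start position by sampling in proportion to the fitness score.
[Figure: held-out sequence with new start position a_1''; the others keep positions a_2', a_3', a_4', ..., a_k'.]
Liu, X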
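The initialization and iteration steps can be put together as a small site sampler. This is a rough sketch, not the implementation from Lawrence et al.: it assumes exactly one site per sequence, adds a pseudocount of 1 per base (an assumption, so that no weight is zero), runs a fixed number of sweeps instead of testing convergence, and omits Step 3 (resetting start positions). All function names and the example sequences are hypothetical.

```python
import random

BASES = "ACGT"

def build_matrix(sites, pseudocount=1.0):
    """Per-column base probabilities from aligned sites, with pseudocounts."""
    matrix = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix

def fitness(x, matrix, background):
    """Ax = Qx / Px for a candidate subsequence x."""
    q = p = 1.0
    for i, base in enumerate(x):
        q *= matrix[i][base]
        p *= background[base]
    return q / p

def gibbs_sample(seqs, width, background, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Step 1: random start positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out in range(len(seqs)):
            # Step 2a: motif matrix from all sequences except the held-out one.
            sites = [s[a:a + width]
                     for j, (s, a) in enumerate(zip(seqs, starts))
                     if j != held_out]
            matrix = build_matrix(sites)
            # Step 2b: fitness score of every position in the held-out sequence.
            seq = seqs[held_out]
            weights = [fitness(seq[a:a + width], matrix, background)
                       for a in range(len(seq) - width + 1)]
            # Step 2c: sample a new start position in proportion to Ax.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical toy input with TGACGCA planted in each sequence.
seqs = ["ACGTACGTTGACGCAACGT", "TTTTGACGCATTTTTTTT",
        "GGGGGGTGACGCAGGGG", "CATGACGCACATCATCAT"]
uniform = {b: 0.25 for b in BASES}
print(gibbs_sample(seqs, 7, uniform))
```

Because the sampler is stochastic, different seeds can settle on different positions; in practice several restarts are compared.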

19 Recent Development Random Projection Phylogenetic Footprinting Reducer

20 Random Projection (Buhler, 2002)
(l, d)-motif problem:
– M is an (unknown) motif of length l
– Each occurrence of M is corrupted by exactly d point substitutions in random positions
No known biological motif fits the (l, d)-motif model exactly.
CCcaAG CCcgAG CCgcAG CCtaAG CCtgAG CtATgG CCctAc tCtTAG CaAcAG CCAgAa

21 Random Projection Algorithm
Guiding principle: some instances of a motif agree on a subset of positions. Use information from multiple motif instances to construct a model.
Example: (7,2) motif M = ATGCGTC, with instances (uppercase) in sequences x(1), x(2), x(5), x(8):
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...
Buhler, J

22 k-Projections
Choose k positions in a string of length l. Concatenate the nucleotides at the chosen k positions to form a k-tuple. In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
Example: with l = 15, k = 7 and P = (2, 4, 5, 7, 11, 12, 13), ATGGCATTCAGATTC projects to TGCTGAT.
Buhler, J
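A k-projection is simply a selection of characters at fixed positions. The sketch below reproduces the slide's example; positions are 1-based, as on the slide, and the function name `project` is illustrative.

```python
def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))
```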

23 Random Projection Algorithm
Choose a projection by selecting k positions uniformly at random. For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions. Recover the motif from a bucket containing multiple l-tuples.
Example: in input sequence x(i) = ...TCAATGCACCTAT..., the l-tuple TGCACCT hashes to bucket TGCT.
Buhler, J

24 Example
l = 7 (motif size), k = 4 (projection size). Choose projection (1,2,5,7).
Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Buckets: ATCCGAC hashes to bucket ATGC; GCCTTAC hashes to bucket GCTC.
Buhler, J

25 Hashing and Buckets
Hash function h(x) obtained from the k positions of the projection. Buckets are labeled by values of h(x). Enriched buckets: contain more than s l-tuples, for some parameter s.
[Figure: buckets labeled ATTC, CATC, GCTC, ATGC.]
Buhler, J
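Hashing every l-tuple into a bucket and picking out enriched buckets can be sketched as follows. The projection (1,2,5,7), l = 7, and the four input fragments are taken from the slides (uppercased); using a fixed projection rather than a random one, and the function names, are choices made here for illustration.

```python
from collections import defaultdict

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def hash_into_buckets(seqs, l, positions):
    """Map each bucket label h(x) to the list of l-tuples x that hash to it."""
    buckets = defaultdict(list)
    for seq in seqs:
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            buckets[project(lmer, positions)].append(lmer)
    return buckets

def enriched_buckets(buckets, s):
    """Buckets holding more than s l-tuples."""
    return {label: lmers for label, lmers in buckets.items() if len(lmers) > s}

# Sequence fragments from the (7,2)-motif example, uppercased.
seqs = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
buckets = hash_into_buckets(seqs, l=7, positions=(1, 2, 5, 7))
for label, lmers in enriched_buckets(buckets, s=3).items():
    print(label, lmers)
```

With this projection all four motif instances agree on the projected positions, so they land in bucket ATGC together; a different random projection might split them, which is why the algorithm repeats over many projections.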

26 Motif Refinement
How do we recover the motif from the sequences in the enriched buckets? The k nucleotides at the projected positions are known from the hash value of the bucket. Use the information in the other l-k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler.
Example: bucket ATGC holds ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC; the local refinement algorithm yields the candidate motif ATGCGTC.
Buhler, J

27 Parameter Selection
Projection size k:
– Choose k small so that several motif instances hash to the same bucket (k < l - d)
– Choose k large to avoid contamination by spurious l-mers (4^k > t(n - l + 1), for t input sequences of length n)
Bucket threshold s: (s = 3, s = 4)
Buhler, J
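The two constraints on k can be checked mechanically. This sketch lists every k satisfying both k < l - d and 4^k > t(n - l + 1); the specific values l = 15, d = 4, t = 20, n = 600 are illustrative inputs, not values stated on the slide.

```python
def valid_projection_sizes(l, d, t, n):
    """All k with k < l - d and 4**k > t * (n - l + 1)."""
    spurious_lmers = t * (n - l + 1)   # total number of l-mers in the input
    return [k for k in range(1, l - d) if 4 ** k > spurious_lmers]

# Illustrative inputs: a (15,4)-motif in 20 sequences of length 600.
print(valid_projection_sizes(l=15, d=4, t=20, n=600))
```

When the list is empty, no projection size satisfies both constraints for that input, and one of them has to be relaxed.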

28 Recent Development Random Projection Phylogenetic Footprinting Reducer

29 Conservation of Regulatory Elements Upstream of the ApoAI Gene
[Figure: aligned upstream sequences from mouse, rabbit, human, and chicken, with the TATA box, hepatic site, and CCCAAT box marked.]

30 [Figure labels: AAGCA, ACGCA, AAGCA.]

31 Substring Parsimony Problem
Given:
– orthologous upstream sequences S_1, ..., S_n
– phylogenetic tree T of the n species
– size k of the motif, threshold d
Problem: find all sets of substrings s_1, ..., s_n of S_1, ..., S_n, each of size k, such that the parsimony score of s_1, ..., s_n on T is at most d.
Blanchette, M

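32 Parsimony Score
Parsimony score of s_1, ..., s_n on T = minimum, over all possible labelings of the internal nodes, of the sum over edges (u, v) of T of d(l(u), l(v))
– l(v): label of node v
– d(l_1, l_2): Hamming distance
[Figure: tree T with leaves s_1, ..., s_6.]
Blanchette, M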

33 String Parsimony Problem
S1: AAAGCATTC
S2: TACGCACCC
S3: GAAGCAGGG
With k = 5 and d = 1, a solution is AAGCA (from S1), ACGCA (from S2), AAGCA (from S3).
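For an input this small, the substring parsimony problem can be solved by brute force. The sketch below is not Blanchette's algorithm: it enumerates every combination of k-mers and scores each with Fitch's small-parsimony procedure, column by column, which is valid because Hamming distance adds up over positions. The rooted topology ((S1, S2), S3) is an assumption, since the slide's tree was not preserved.

```python
from itertools import product

def fitch_column(node, leaves, i):
    """Fitch's algorithm for column i: return (candidate bases, min changes)."""
    if isinstance(node, int):                    # leaf: index into `leaves`
        return {leaves[node][i]}, 0
    left_set, left_cost = fitch_column(node[0], leaves, i)
    right_set, right_cost = fitch_column(node[1], leaves, i)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, leaves):
    """Minimum total Hamming distance over all internal-node labelings."""
    k = len(leaves[0])
    return sum(fitch_column(tree, leaves, i)[1] for i in range(k))

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def substring_parsimony(seqs, tree, k, d):
    """All k-mer combinations (one per sequence) with parsimony score <= d."""
    return [combo for combo in product(*(kmers(s, k) for s in seqs))
            if parsimony_score(tree, combo) <= d]

seqs = ["AAAGCATTC", "TACGCACCC", "GAAGCAGGG"]
tree = ((0, 1), 2)          # assumed topology: ((S1, S2), S3)
for combo in substring_parsimony(seqs, tree, k=5, d=1):
    print(combo, parsimony_score(tree, combo))
```

The enumeration grows as the product of the sequence lengths, which is the cost the table-based algorithms on the following slides are designed to avoid.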

34 Algorithm: version I
Root the tree at an arbitrary internal node r.
Compute a table W_u of size 4^k for each node u, where W_u[s] is the best parsimony score for the subtree rooted at u when u is labeled with s:
W_u[s] = sum over children v of u of min over t of (W_v[t] + d(s, t))
Direct implementation of this recursion gives O(nk(4^(2k) + l)), where l is the average sequence length.
Blanchette, M

35 Algorithm: version II
Define X_(u,v)[s]: the best parsimony score for the subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s.
[Figure: node u labeled s, with children v and w.]
Blanchette, M

36 Algorithm: version II (continued)
Update X_(u,v) in phases: in phase p, maintain the set B_p of sequences t such that X_(u,v)[t] = p.
Define:
– R_a = {s : W_v[s] = a}
– N(s) = {t in Σ^k : d(s, t) = 1}
Start in phase m and let B_m = R_m.
Update: [rule for the next phase not preserved in this transcript]
Computation of X_(u,v) takes O(k 4^k).
Blanchette, M
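One way to realize a phase-by-phase computation of X_(u,v) is a breadth-first expansion over Hamming neighbors. This sketch assumes that X_(u,v)[s] equals the minimum over t of W_v[t] + d(s, t), that the starting phase m is the smallest score in W_v, and that each new phase collects R_p together with the not-yet-scored neighbors of the previous phase; these are assumptions made here, not the update rule from the slide. `W` holds only the k-mers with a finite score.

```python
def neighbors(s, alphabet="ACGT"):
    """N(s): all strings at Hamming distance exactly 1 from s."""
    for i, c in enumerate(s):
        for b in alphabet:
            if b != c:
                yield s[:i] + b + s[i + 1:]

def edge_table(W, k, alphabet="ACGT"):
    """X[s] = min over t of W[t] + Hamming(s, t), filled in one phase at a time."""
    by_score = {}                                  # R_a = {s : W[s] = a}
    for t, score in W.items():
        by_score.setdefault(score, set()).add(t)
    X = {}
    total = len(alphabet) ** k
    phase = min(by_score)                          # start in phase m
    previous = set()                               # B_(p-1)
    while len(X) < total:
        candidates = by_score.get(phase, set()) | {
            t for s in previous for t in neighbors(s, alphabet)}
        current = {s for s in candidates if s not in X}   # B_p
        for s in current:
            X[s] = phase
        previous = current
        phase += 1
    return X

# Hypothetical child table for k = 2: two k-mers with finite scores.
W = {"AA": 0, "CG": 2}
X = edge_table(W, k=2)
print(X["AA"], X["AC"], X["CG"], X["TT"])
```

Each k-mer is scored once and generates 3k neighbors, which is consistent with the O(k 4^k) bound quoted on the slide up to the cost of handling each neighbor.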

37 Improvements
Reduce the size of B_p when sequences contribute to X_(u,v) greater than the threshold d:
– In phase p, only care for sequence X_(u,v)[s] if [condition not preserved in this transcript]
– Leads to significant reductions in stages d/2 ... d
Reduce the number of substrings inserted in W at the leaves:
– For a substring s of S_i, if its best match against any S_j has Hamming distance at least d, s can be discarded
Blanchette, M

38 Results
– Practical limit on k: 10
– There appeared to be a threshold d_0 with very few solutions below and many above
– The algorithm found ~80% of known binding sites
– Performed better than ClustalW, MEME, Consensus
Blanchette, M

39 Recent Development Random Projection Phylogenetic Footprinting Reducer

40 Reducer (Bussemaker et al., 2001)
Links motif finding to expression level:
A_g = C + Σ_{u=1..M} F_u N_ug
– A_g: gene expression level (logarithm of expression ratio)
– M: number of significant motifs
– N_ug: number of occurrences of motif u in gene g
– C: baseline expression level (same for all genes)
– F_u: increase/decrease of expression level caused by presence of motif u

41 Reducer (Contd)
Expression vector: log ratio of expression levels, one entry per gene (Gene1, Gene2, Gene3, Gene4, ..., GeneN).
Motif vector: number of times that the motif occurs in the upstream region of each gene.
Motif   Gene1  Gene2  Gene3  Gene4  ...  GeneN
AAAAA   2      0      5      3      ...  0
AAAAT   5      3      2      1      ...  5
...
Liu, X

42 Reducer (Contd)
Normalize the expression (A) and motif (n) vectors.
Linear regression between the A vector and every n vector to find the n that best fits A.
Step-wise regression to combine the effects of motifs:
– Subtract the effect of one motif
– Find the next best motif
Liu, X
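The regression steps can be sketched in plain Python. This is a simplified illustration, not the published procedure: vectors are mean-centered only (no variance scaling), the best motif is the one explaining the largest fraction of the remaining variance, and no significance test is applied. The motif counts follow the rows shown on the previous slide, read as 2, 0, 5, 3, 0 and 5, 3, 2, 1, 5 (that reading of the flattened row is an assumption); the expression values are hypothetical, constructed so that AAAAA explains them exactly.

```python
def center(v):
    """Subtract the mean so the baseline C drops out of the fit."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def best_motif(residual, counts):
    """Return (motif, slope F, fraction of residual variance explained)."""
    r = center(residual)
    rr = sum(x * x for x in r)
    best = None
    for motif, n in counts.items():
        c = center(n)
        cc = sum(x * x for x in c)
        if rr == 0 or cc == 0:        # nothing left to explain, or flat counts
            continue
        rc = sum(x * y for x, y in zip(r, c))
        explained = rc * rc / (rr * cc)
        if best is None or explained > best[2]:
            best = (motif, rc / cc, explained)
    return best

def stepwise(expression, counts, max_motifs):
    """Greedily pick motifs, subtracting each fitted effect before the next."""
    residual = center(expression)
    remaining = dict(counts)
    chosen = []
    for _ in range(max_motifs):
        found = best_motif(residual, remaining)
        if found is None:
            break
        motif, slope, _ = found
        c = center(remaining.pop(motif))
        residual = [r - slope * x for r, x in zip(residual, c)]
        chosen.append((motif, slope))
    return chosen

counts = {"AAAAA": [2, 0, 5, 3, 0], "AAAAT": [5, 3, 2, 1, 5]}
expression = [1.0, 0.0, 2.5, 1.5, 0.0]    # hypothetical log ratios
print(stepwise(expression, counts, max_motifs=2))
```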

43 Acknowledgement
People from whom I borrowed slides:
– Xiaole Liu (Reducer)
– Olga Troyanskaya (Microarray)
– Jeremy Buhler (Random projections)
– Mathieu Blanchette (Phylogenetic footprinting)
– Various web sources

45 [Figure: microarray workflow. cDNA clones (probes): PCR product amplification, purification, printing (0.1 nl/spot) onto the microarray. Hybridise the mRNA target to the microarray. Scanning: excitation with laser 1 and laser 2, emission. Analysis: overlay images and normalise.]

46 Information Content of Motifs
Uncertainty: H = -Σ_b p_b log2 p_b
Information = H_before - H_after
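Information content can be computed per alignment column as the drop in uncertainty. The sketch below assumes Shannon entropy in bits and a uniform background over four bases, so that H_before = 2 bits; neither assumption is stated on the slide. The example sites are the four from the position-specific matrix slide.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """H_before - H_after for one alignment column, in bits."""
    h_before = math.log2(alphabet_size)            # uniform background
    n = len(column)
    h_after = -sum((c / n) * math.log2(c / n)
                   for c in Counter(column).values())
    return h_before - h_after

sites = ["TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
info = [column_information(col) for col in zip(*sites)]
print(" ".join(f"{x:.2f}" for x in info))
print(f"total = {sum(info):.2f} bits")
```

Fully conserved columns contribute the maximum of 2 bits each, and the two variable columns contribute less; these per-position values are what a sequence logo draws as letter heights.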

47 Improvement on Original Gibbs sampler
– 0 to n copies of sites in each sequence
– Iterative masking to find multiple motifs
– Use higher-order Markov models to improve motif specificity

48 Clinical Importance of Defects in Regulatory Elements
Burkitt's lymphoma

49 Statistical Methods
Expectation Maximization (EM)
– MEME
Gibbs sampling
– BioProspector
– AlignACE

50 Motifs are not limited to DNA
RNA motifs
– RNA-RNA interaction motifs, e.g., intron-exon splice sites
– RNA-protein interaction motifs, e.g., binding of proteins to the RNA polyA tail
Protein motifs
– E.g., helix-turn-helix motif

51 Sequence Logo

52 Why is this Problem Hard?
– Motif information content is low
– Hamming distance between motif instances is high

