1 Regulatory Motif Finding
Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002)
Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofere, Lawrence

2 “Regulatory Motif Finding”
- What is being regulated?
- What is a “motif”?
- Why do we want to find them?

3 Central Dogma of Genetics (picture by Andrew Hughes, Rice University)
- It’s “TRUE,” right?!
- Yes, but…

4 Every Protein in Every Cell?
- Clearly, there are complicated mechanisms at work
- Rhodopsin
- But, we have the same DNA in all cells…

5 Transcriptional Regulation
- It is transcription (DNA → RNA) that is being regulated.
- RNA Polymerase II, aided by Transcription Factors (TFs)
- Where do TFs bind?

6 Promoter Regions (picture by Andrew Hughes, Rice University)
- TATA box: usually ~30 bp upstream of the gene
- But there are others... Where? What sequence?

7 Promoter Sequence
- Many different possible locations, sometimes extremely far from the start of transcription!
- What sequence? THAT is the $64k (or $1B) question…

8 Motifs
- Many different promoter sequences have been found
- Basal: TATA box (-20), CCAAT box (-100); positions are approximate and relative to the transcription start
- Additional transcriptional regulatory domains
- Activators and inhibitors use these domains

9 Motifs (2)
- Not exact sequences (that would be too easy)
- Strength of binding affects the level of promotion/inhibition (C/G vs A/T)
- Described either probabilistically with motif logos or with extended single-letter nucleotide codes
- Often palindromic, i.e. identical to their own reverse complement (GATATC)
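The palindrome property can be checked mechanically: a DNA site is palindromic when it equals its own reverse complement. The following is a minimal sketch; the helper names are illustrative and not from the slides, and only plain A/C/G/T input is handled.

```python
# Hypothetical helper names; only A/C/G/T input is supported.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def is_palindromic(seq):
    """True if the site reads the same 5'-to-3' on both strands."""
    return seq.upper() == reverse_complement(seq)

print(is_palindromic("GATATC"))  # the slide's example; prints True
```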

10 Extended Single-Letter Codes
- Letters represent possible bases in each position:

Symbol    Meaning
A         Adenine
G         Guanine
C         Cytosine
T         Thymine
U         Uracil
Y         pYrimidine (C or T)
R         puRine (A or G)
W         "Weak" (A or T)
S         "Strong" (C or G)
K         "Keto" (T or G)
M         "aMino" (C or A)
B         not A (C or G or T)
D         not C (A or G or T)
H         not G (A or C or T)
V         not T (A or C or G)
X, N, ?   unknown (A or C or G or T)

- TGASTMA: promoter sequence for several oncogenes
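One way to put the single-letter codes to work is to expand a motif such as TGASTMA into a regular expression and scan a sequence for it. This is a sketch only: the function names are made up for illustration, U is treated as T on the assumption that DNA is being scanned, and the test sequence is invented.

```python
import re

# Single-letter codes from the table, mapped to the DNA bases they allow.
# Assumption: U is treated as T because we scan DNA, not RNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
}

def iupac_to_regex(motif):
    """Translate a motif in single-letter codes into a regex string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_sites(motif, seq):
    """Return every start index where the motif matches, overlaps included."""
    # A zero-width lookahead reports overlapping occurrences as well.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]

print(iupac_to_regex("TGASTMA"))              # TGA[CG]T[AC]A
print(find_sites("TGASTMA", "CCTGACTCAGG"))   # [2]
```

This scans one strand only; a fuller search would also scan the reverse complement.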

11 Motif Logos
- The height of each letter reflects how probable that base is at that position in the motif
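In the common sequence-logo convention, the total height of a column is its information content in bits, and each letter's share of that height is proportional to its frequency. The sketch below computes both from a set of aligned sites. The function names and the toy sites are illustrative, and the small-sample correction that logo tools usually apply is left out.

```python
import math
from collections import Counter

def column_frequencies(sites):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts[b] / len(sites) for b in "ACGT"})
    return freqs

def information_content(column):
    """Bits of information in one column: 2 minus its Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def letter_heights(column):
    """Height of each letter: its frequency times the column's information."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}

# Toy aligned sites (made up): position 0 is fully conserved, position 1 is not.
sites = ["AC", "AG", "AT", "AA"]
for column in column_frequencies(sites):
    print(round(information_content(column), 3), letter_heights(column))
```

A fully conserved position reaches the 2-bit maximum, while a position where all four bases are equally likely has height zero.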

12 Why do we care?
- Gene regulation, and transcriptional regulation in particular
- Can teach us about our complex signaling pathways
- Drugs and money

13 So… Finding Regulatory Motifs
- Statistical Models paper (Liu et al.)
- Assumes: we have located genes that we expect to be co-regulated (microarrays, co-expression)

14 So… Finding Regulatory Motifs
- Experimental methods of determining TF binding sites (gel shift assay, DNA protection assay)
- Statistical models

15 Single-Site Model
- Assumes:
  - Each sequence contains 1 motif
  - Sequences are generated by random draws from {A,C,G,T} with given prior probabilities
  - The motif is described by a frequency matrix, with base frequencies for each position
- Use the Gibbs site sampler: this is a missing-data problem. Randomly choose motif locations, then move the motif locations based on P(a_k).

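16 Gibbs Sampling
Sampling: for every k-long word $x_j, \ldots, x_{j+k-1}$ in $x$:

$$Q_j = \Pr[\text{word} \mid \text{motif}] = M(1, x_j) \times \cdots \times M(k, x_{j+k-1})$$
$$P_j = \Pr[\text{word} \mid \text{background}] = B(x_j) \times \cdots \times B(x_{j+k-1})$$

Let
$$A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$$

Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-k+1}$.

[Figure: probability of each start position, plotted over positions 0 to |x|]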
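The sampling step can be sketched in a few lines. The sketch assumes that each start position j is weighted by the likelihood ratio Q_j / P_j, normalized so the weights sum to 1. The names M and B follow the slide, while the function names, the toy motif matrix and the toy sequence are illustrative.

```python
import random

def site_probabilities(x, M, B):
    """Normalized weight A_j for every possible motif start j in sequence x.

    M is a list of k dicts (base -> probability at that motif position);
    B is a dict of background base probabilities.
    """
    k = len(M)
    ratios = []
    for j in range(len(x) - k + 1):
        q = 1.0  # Q_j: probability of the word under the motif model
        p = 1.0  # P_j: probability of the word under the background
        for i in range(k):
            q *= M[i][x[j + i]]
            p *= B[x[j + i]]
        ratios.append(q / p)
    total = sum(ratios)
    return [r / total for r in ratios]

def sample_position(x, M, B, rng=random):
    """Draw a new motif start for x according to the weights A_j."""
    weights = site_probabilities(x, M, B)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def strong(base):
    # Toy motif column (made up) that strongly prefers one base.
    return {b: (0.85 if b == base else 0.05) for b in "ACGT"}

M = [strong(b) for b in "ACGT"]
B = {b: 0.25 for b in "ACGT"}
print(site_probabilities("TTACGTTT", M, B))
print(sample_position("TTACGTTT", M, B))
```

In a full sampler this step alternates with re-estimating M from the currently chosen sites in all the other sequences; that outer loop is omitted here.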

17 Repetitive Block-Motif Model
- View the K sequences as one long sequence of length n. Model the probability of a motif starting at each position i.
- Problems:
  - Loses the evolutionary relationship between sequences
  - Allows multiple copies of the motif in each sequence
  - Total number of occurrences is unknown

18 The Rest of the Statistical Models Paper…
- Much math:
  - Scoring motif candidates
  - Using potential motif dictionaries
  - Bayesian prior probabilities
  - Finding motifs with insertions in them (“gapped” motifs)
- On to: Phylogenetic Footprinting

19 Phylogenetic Footprinting
- Most of the paper is spent describing background and results
- Methods are brief, not too deep

20 Let Evolution Be Your Guide
- Phylogenetic footprinting: “Identifying regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species”

21 Orthologs and Paralogs
- Paralog: a gene duplicated within a species
- Ortholog: the same gene in species that share a common ancestor

22 Advantages
- Doesn’t depend on reliably identifying co-regulated genes (the single-genome approach, which is non-trivial!)
- Can be used to find regulatory elements specific to a single gene (caveat: they must be conserved across species)

23 Standard Methods
- Usually start with a multiple sequence alignment, MSA (ProbCons, ClustalW)
  - But this can lose signal (short regulatory elements ~20 bp, long promoter regions ~1000 bp)
  - Also, if species are evolutionarily close, nonfunctional regions may also be well conserved
- Can start with general motif discovery algorithms (MEME, Consensus, AlignACE, DIALIGN …)
  - But these don’t take into account the phylogenetic relationships among the sequences, so they weight closely related sequences too highly

24 The PF Algorithm
Given:
- phylogenetic tree T
- set of orthologous sequences at the leaves of T
- length k of the motif
- threshold d
Problem: find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

25 Small Example (merci, CS262)
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4

26 Solution
Parsimony score: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure labels: ACGG, ACGT]

27 An Exhaustive Algorithm
$W_u[s]$ = best parsimony score for the subtree rooted at node u, if u is labeled with string s.

[Figure: tree over the leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG, with a table of $4^k$ entries at each node. Sample entries shown: (ACGG: 2, ACGT: 1), (ACGG: 0, ACGT: 2), (ACGG: 1, ACGT: 1), (ACGG: +∞, ACGT: 0), (ACGG: 1, ACGT: 0), (ACGG: 0, ACGT: +∞), (ACGG: +∞, ACGT: 0)]

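28 Simple Recurrence
$$W_u[s] = \sum_{v \,:\, \text{child of } u} \min_{t} \bigl( W_v[t] + h(s, t) \bigr)$$
In words: the score of a k-mer s at a node is the sum, over the node’s children, of each child’s best parsimony score for s. Here h(s, t) counts the positions at which s and t differ.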
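The recurrence can be run directly on the small example from slide 25. The sketch below makes several assumptions: the tree topology ((Human, Chimp), Rabbit, (Mouse, Rat)) is a guess, since the figure did not survive; only the sequence prefixes shown on the slide are used; leaves score 0 for k-mers they contain and infinity otherwise; and h is taken to be the number of mismatching positions. Function and variable names are illustrative.

```python
from itertools import product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_table(node, k, kmers):
    """Return W_u as a dict: k-mer s -> best score of the subtree if u is labeled s.

    A leaf is a sequence string; an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else float("inf")) for s in kmers}
    child_tables = [parsimony_table(child, k, kmers) for child in node]
    # W_u[s] = sum over children v of min over t of (W_v[t] + h(s, t))
    return {
        s: sum(min(Wv[t] + hamming(s, t) for t in kmers) for Wv in child_tables)
        for s in kmers
    }

k = 4
kmers = ["".join(p) for p in product("ACGT", repeat=k)]  # all 4^k labels

# Assumed topology; sequences are the prefixes shown on the slide.
human, chimp = "AGTCGTACGTGAC", "AGTAGACGTGCCG"
rabbit = "ACGTGAGATACGT"
mouse, rat = "GAACGGAGTACGT", "TCGTGACGGTGAT"
tree = ((human, chimp), rabbit, (mouse, rat))

W_root = parsimony_table(tree, k, kmers)
best = min(W_root.values())
print(best, sorted(s for s in kmers if W_root[s] == best))
```

This computes only the scores. Recovering the actual k-mer sets with score at most d needs a traceback from the root, which is not shown.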

29 Running Time
$$W_u[s] = \sum_{v \,:\, \text{child of } u} \min_{t} \bigl( W_v[t] + h(s, t) \bigr)$$
- $O(k \cdot 4^{2k})$ time per node
- Total time: $O(n \, k \, (4^{2k} + l))$, where n = number of species, l = average sequence length, k = motif length
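To get a feel for the per-node term, the sketch below evaluates k · 4^(2k) for a few motif lengths. It is a rough operation count with constant factors ignored, and the function name is illustrative.

```python
def table_update_cost(k):
    """Rough per-node work: 4^k labels s times 4^k labels t, k steps per comparison."""
    return k * 4 ** (2 * k)

for k in (4, 6, 8, 10):
    print(k, table_update_cost(k))
```

For k = 4 this is 262,144 steps, while for k = 10 it already exceeds 10^13, so the naive table update is only practical for short motifs.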

30 FootPrinter
http://bio.cs.washington.edu/software.html
- Avoids the pitfalls of using an MSA or general-purpose motif-finding algorithms
- Identifies all DNA motifs that appear to have evolved more slowly than the surrounding sequence
- Allows motifs to not appear in all sequences (LexA in gram +/- bacteria)

31 FootPrinter (2)
- “Given n orthologous input sequences and the phylogenetic tree T relating them, [FootPrinter] is guaranteed to produce every set of k-mers, one from each input sequence, that have a parsimony score at most d with respect to T, where k and d are parameters specified by the user.”

32 Parameters
- Can set a minimum threshold on the fraction of the phylogeny that must be spanned by motifs with each parsimony score s.

33 Results
- Examined 9 sets of orthologous or paralogous sequences (the method also works for duplicated genes that have since diverged).
- Found: many previously known motifs, plus some highly conserved motifs of unknown function (time for the experimentalists!)

34 One Example: Metallothionein Gene Family
- Good test family:
  - Large number of promoter sequences
  - Wide variety of species
  - Large number of regulatory elements experimentally verified in several species
- Most binding sites are within 300 bp of the start codon (ATG)

35
- Input sequences: 590 bp upstream of the start codon
- Most motifs found were present in multiple isoform families; accuracy improved by considering the paralogs, not just the orthologs

36 But, FootPrinter Isn’t Perfect
- Some known regulatory binding sites were missed. Why?
- Ultimately, it must be because the motifs were not conserved well enough to be detected (but we can discuss more…)

37 FootPrinter Error (1)
- Some binding sites are not well matched in other species. Example: the thyroid hormone receptor (T3R) site is conserved within rodents, but not beyond. Many closely related species would be needed to detect this motif.

38 FootPrinter Error (2-5)
- Some motifs are well conserved, but too short
- Indels in the middle of a motif: could allow them, but would get many false positives
- Some barely fail to meet the statistical thresholds (close but no cigar)
- Dimeric TFs often bind two conserved regions separated by a variable internal sequence

