Intro to Comp Genomics Lecture 9: Motif finding
Sequence specific transcription factors
Sequence-specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery. TFs include a DNA-binding domain that specifically recognizes “regulatory elements” in the genome. The TF-DNA duplex is then used to target larger transcriptional structures to the genomic locus.
Sequence specificity is represented using consensus sequences or weight matrices
The specificity of TF binding is central to understanding the regulatory relations it can form. We are therefore interested in defining the DNA motifs that can be recognized by each TF.
A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing sets of nucleotides, for example W=[A|T], S=[C|G]).
A more flexible representation uses weight matrices (PWM/PSSM). For example, the aligned sites ACGCGT, ACGCGA, ACGCAT, TCGCGA, TAGCGT give:

     1     2     3     4     5     6
A    60%   20%   0     0     20%   40%
C    0     80%   0     100%  0     0
G    0     0     100%  0     80%   0
T    40%   0     0     0     0     60%

PWMs are frequently plotted as motif logos, in which the height of each character corresponds to its probability, scaled by the information content of the position (2 bits minus the position entropy).
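The counting behind such a matrix can be sketched in a few lines of Python. This is only an illustration: the function names `build_pwm` and `information_content` and the optional `pseudocount` argument are assumptions of this sketch, not part of the lecture.

```python
from collections import Counter
from math import log2

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=0.0):
    """Column-wise nucleotide frequencies of equal-length aligned sites."""
    k = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(k):
        counts = Counter(site[i] for site in sites)
        pwm.append({c: (counts[c] + pseudocount) / total for c in ALPHABET})
    return pwm

def information_content(column):
    """2 bits minus the Shannon entropy of one PWM column (logo column height)."""
    entropy = -sum(p * log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

# The five aligned sites from the slide.
sites = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
pwm = build_pwm(sites)
for i, column in enumerate(pwm, start=1):
    print(i, column, round(information_content(column), 2))
```

Positions 3 and 4 are fully conserved in these sites, so they carry the full 2 bits, while the mixed positions carry less.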
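TF binding energy is approximated by weight matrices
We can interpret weight matrices as energy functions in which each position of the site contributes independently (additively) to the binding energy. This linear approximation is reasonable for most TFs.
[Figure: Leu3 data (Liu and Clarke, JMB 2002)]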
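One way to write the linear energy model, as a sketch: the symbols w_i (per-position weights) and theta_i (PWM probabilities) are notation assumed here, and the second line is just one possible choice of weights, not necessarily the one used for the Leu3 data.

```latex
% Additive (linear) energy of a k-mer s = s_1 ... s_k
E(s) = \sum_{i=1}^{k} w_i[s_i]

% One possible choice of weights (an assumption): negative log PWM probabilities,
% so that high-probability sites correspond to low energy
w_i[c] = -\log \theta_i[c]
```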
TF binding affinity is kinetically important, with possible functional implications
[Figures: Kalir et al., Science 2001; Ume6 ChIP ranges against average PWM energy, with axes annotated “stronger binding” and “stronger prediction” (Tanay, Genome Res 2006)]
TFs are present at only a fraction of their optimal sequence targets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications
[Figure: Lee et al., Nat Gen 2007; panels labeled “Re TSS” and “Re ATG”]
[Figure: Barski et al., Cell 2007; panels labeled “Active” and “Inactive”]
TFBSs are clustered in promoters or in “sequence modules”
The distribution of binding sites in the genome is non-uniform. In small genomes, most sites are in promoters, and there is a bias toward the nucleosome-free region near the TSS. In larger genomes (e.g., fly) we observe CRMs (cis-regulatory modules), which are frequently far from the TSS. These represent enhancers. A single binding site, without the context of other co-occurring sites, is unlikely to represent a functional locus.
Constructing a weight matrix from aligned TFBSs is trivial
This is done by counting (or “voting”). Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from sets of curated and validated binding sites. Validated sites are usually confirmed by “promoter bashing”: testing reporter constructs with and without the putative site. TRANSFAC 7.0 and 11.3 have 400 and 830 different PWMs, respectively, based on more than 11,000 papers. However, there are not really 830 distinct matrices out there; the real binding repertoire in nature is still somewhat unclear.
Probabilistic interpretation of weight matrices and a generative model
One can think of a weight matrix as a probabilistic model for binding sites, in which each position of the site emits a nucleotide independently of the others. This is the site-independent model, defining a probability space over k-mers. Given a set of aligned k-mers, we know that the ML motif model is derived by voting (a set of independent multinomial variables, as in the dice case).
Now assume we are given a set of sequences that are each supposed to include one binding site, but we do not know where the binding sites are. In other words, the position of the binding site is a hidden variable h. We introduce a background model P_b that describes the sequence outside of the binding site (usually a d-order Markov model). Given complete data (the sequence s together with the site position), the likelihood of s is the product of the background probability of the sequence outside the site and the motif probability of the k-mer at the site.
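The model described above can be written out as follows. This is a reconstruction under stated assumptions: theta_i denotes the PWM column at position i, l denotes the site start (the hidden variable), P(l) is a prior over locations, and the last line additionally assumes a zero-order background p_b rather than a general d-order Markov model.

```latex
% Site-independent motif model over a k-mer x = x_1 ... x_k
P(x \mid \theta) = \prod_{i=1}^{k} \theta_i[x_i]

% Complete-data likelihood of a sequence s = s_1 ... s_n with the site starting at l
P(s, l \mid \theta) = P(l)\, P_b(s_1 \ldots s_{l-1})
    \prod_{i=1}^{k} \theta_i[s_{l+i-1}] \; P_b(s_{l+k} \ldots s_n)

% With a zero-order background p_b (assumption), this factors as
P(s, l \mid \theta) = P(l)\, P_b(s) \prod_{i=1}^{k} \frac{\theta_i[s_{l+i-1}]}{p_b[s_{l+i-1}]}
```

In the last form P_b(s) does not depend on l, so comparing candidate locations requires only the k ratio factors at each location.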
Using EM to discover PWMs de-novo
Inference of the binding site location posterior: note that only k factors need to be computed for each location (P_b(s) is constant).
Starting with an initial motif model, we can apply a standard EM:
E: compute the posterior of the binding site location in each sequence under the current PWM.
M: re-estimate the PWM by voting, weighting each candidate location by its posterior.
As always with EM, initializing to a reasonable PWM is critical.
Following Bailey and Elkan, MEME 1995
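A minimal sketch of this EM loop for the one-site-per-sequence case is given below. It is not MEME itself: it assumes a uniform zero-order background, a uniform prior over locations, sequences at least k long, and a seed k-mer for initialization; the names `site_posteriors`, `em_step`, `seed_pwm`, `run_em` and the toy sequences are all hypothetical.

```python
ALPHABET = "ACGT"

def site_posteriors(seq, pwm, bg):
    """E-step: posterior over site starts in one sequence (uniform location prior)."""
    k = len(pwm)
    ratios = []
    for l in range(len(seq) - k + 1):
        r = 1.0
        for i in range(k):  # only k factors per location
            c = seq[l + i]
            r *= pwm[i][c] / bg[c]
        ratios.append(r)
    z = sum(ratios)
    return [r / z for r in ratios]

def em_step(seqs, pwm, bg, pseudocount=0.1):
    """One E+M iteration: posterior-weighted voting over all candidate sites."""
    k = len(pwm)
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(k)]
    for seq in seqs:
        for l, w in enumerate(site_posteriors(seq, pwm, bg)):
            for i in range(k):
                counts[i][seq[l + i]] += w
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def seed_pwm(kmer, match=0.7):
    """Initial PWM: probability `match` for the seed character, the rest shared."""
    other = (1.0 - match) / 3
    return [{c: (match if c == ch else other) for c in ALPHABET} for ch in kmer]

def run_em(seqs, seed, n_iter=50):
    bg = {c: 0.25 for c in ALPHABET}  # assumed uniform background
    pwm = seed_pwm(seed)
    for _ in range(n_iter):
        pwm = em_step(seqs, pwm, bg)
    return pwm

# Toy input: each sequence contains one copy of ACGCGT; the seed is off by one base.
seqs = ["TTGACGCGTAAT", "CCACGCGTTGCA", "GATTACGCGTCC", "ACGCGTGGATTA"]
pwm = run_em(seqs, "ACGCGA")
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus)
```

The fixed iteration count keeps the sketch short; a real implementation would monitor the likelihood and stop on convergence.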
Allowing false positive sequences
If we assume some of the sequences may lack a binding site, this should be incorporated into the model. This is sometimes called the ZOOPS model (zero or one occurrence per sequence).
[Graphical model with variables: hit, l, s]
In Bayesian terms:
–Probability of a sequence hit: P(hit | S)
–Probability of a hit at position l: Pr(l | S)
We can consider the PWM parameters as variables in the model. Learning the parameters is then equivalent to inference.
Using Gibbs sampling to discover PWMs de-novo
We can use Gibbs sampling to sample the hidden sites and estimate the PWM. This is done by estimating the PWM from all locations except for the one we sample, and computing the hit probabilities as shown before. Note that the sampling step uses the MAP (maximum a-posteriori) estimate of the PWM, which can be shown to approximate the exact sampling distribution.
[Graphical model with variables: hit, l, s]
Gibbs: Lawrence et al., Science 1993
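The leave-one-out sampling scheme described above can be sketched as follows. This is a simplified illustration rather than the published sampler: it assumes exactly one site per sequence (no “hit” variable), a uniform background, and a fixed pseudocount, and the names `pwm_from_sites` and `gibbs_motif` are hypothetical.

```python
import random

ALPHABET = "ACGT"

def pwm_from_sites(seqs, starts, k, skip, pseudocount=0.5):
    """Vote a PWM from the current sites of all sequences except index `skip`."""
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(k)]
    for j, (seq, l) in enumerate(zip(seqs, starts)):
        if j == skip:
            continue
        for i in range(k):
            counts[i][seq[l + i]] += 1
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def gibbs_motif(seqs, k, n_sweeps=200, seed=0):
    """Return one sampled site start per sequence after `n_sweeps` sweeps."""
    rng = random.Random(seed)
    bg = 0.25  # assumed uniform background probability per nucleotide
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(n_sweeps):
        for j, seq in enumerate(seqs):
            pwm = pwm_from_sites(seqs, starts, k, skip=j)
            weights = []
            for l in range(len(seq) - k + 1):
                w = 1.0
                for i in range(k):
                    w *= pwm[i][seq[l + i]] / bg
                weights.append(w)
            # Sample the new location in proportion to its likelihood ratio.
            starts[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["TTGACGCGTAAT", "CCACGCGTTGCA", "GATTACGCGTCC", "ACGCGTGGATTA"]
starts = gibbs_motif(seqs, k=6)
print([seq[l:l + 6] for seq, l in zip(seqs, starts)])
```

Because the procedure is stochastic, different seeds can give different site sets; in practice one would run several chains and keep the highest-scoring result.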
Generalizing PWMs to allow site dependencies: mixture of PWMs and Trees
Barash et al., RECOMB 2003
[Figure panels: Mixture of PWMs; Tree motif]
We only change the motif component of the likelihood model. Learning the model can become more difficult, because computing the ML model parameters from complete data may be challenging.
Discriminative scores for motifs
So far we used a generative probabilistic model to learn PWMs. The model was designed to generate the data from parameters. We assumed that TFBSs are distributed differently from some fixed background model. If our background model is wrong, we will get the wrong motifs.
A different scoring approach tries to maximize the discriminative power of the motif model. We will not go into the details of discriminative vs. generative models here, but we shall exemplify the discriminative approach for PWMs.
[Figure panels: Lousy discriminator; High specificity discriminator; High sensitivity discriminator]
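Hypergeometric scores and thresholding PWMs
For a discriminative score, we need to decide on both the PWM model and the threshold. The hypergeometric probability gives the chance of observing exactly k true positives among the sequences that pass the PWM score threshold; the sum for j>=k is the hg p-value.
[Figure: number of sequences against PWM score, with the threshold, the positive set and the true positives marked]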
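The hypergeometric score can be computed directly from binomial coefficients. In this sketch the argument names are assumptions: N is the total number of sequences, K the number of positive (annotated) sequences, n the number of sequences passing the threshold, and k the number of true positives among them.

```python
from math import comb

def hypergeom_pmf(j, N, K, n):
    """P(exactly j positives among n selected), with K positives in N sequences."""
    return comb(K, j) * comb(N - K, n - j) / comb(N, n)

def hypergeom_pvalue(k, N, K, n):
    """hg p-value: P(at least k positives among n selected), the sum for j >= k."""
    return sum(hypergeom_pmf(j, N, K, n) for j in range(k, min(K, n) + 1))

# Example: 10 sequences, 3 positives, 4 pass the threshold, 2 of those are positive.
print(hypergeom_pvalue(2, N=10, K=3, n=4))
```

For genome-scale inputs the same quantity is usually computed in log space or with a statistics library to avoid very large intermediate integers.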
Exhaustive k-mer search
A very common strategy for motif finding is to do an exhaustive k-mer search. Given a set of hits and a set of non-hits, we compute the number of occurrences of each k-mer in the two sets and report all cases that have a discriminative score higher than some threshold. Since k-mers either match or do not match, there is no issue with the threshold. For DNA, we will typically scan k=5-8.
This can be done efficiently using a map/hash:
–Iterate on short sequence windows (of the desired k length)
–For each window, mark the appearance of the k-mer in a table
–Avoid double counting using a second map
It is easy to generalize such exhaustive approaches to include gaps or other types of degeneracy.
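The map-based scan in the list above might look like this in Python. The function name `kmer_sequence_counts` is hypothetical, and the sketch counts each k-mer at most once per sequence on the given strand only (reverse complements are not handled).

```python
from collections import Counter

def kmer_sequence_counts(seqs, k):
    """Number of sequences containing each k-mer (each sequence counted once)."""
    counts = Counter()
    for seq in seqs:
        seen = set()  # second table: k-mers already counted in this sequence
        for l in range(len(seq) - k + 1):
            kmer = seq[l:l + k]
            if kmer not in seen:
                seen.add(kmer)
                counts[kmer] += 1
    return counts

# Count separately in the hit and non-hit sets, then compare per k-mer.
hits = ["ACGACG", "TTACG"]
non_hits = ["TTTTTT", "GGACGG"]
hit_counts = kmer_sequence_counts(hits, 3)
non_hit_counts = kmer_sequence_counts(non_hits, 3)
for kmer in sorted(hit_counts):
    print(kmer, hit_counts[kmer], non_hit_counts[kmer])
```

The two counts for each k-mer, together with the sizes of the two sets, are exactly the inputs a discriminative score such as the hypergeometric p-value needs.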
Refining k-mers to PWMs using heuristic “EM”
A k-mer scan is an excellent initial step for finding refined weight matrices. For example, we can use the k-mers to initialize an EM.
If we want to find a weight matrix, but want to stick to the discriminative setting, we can heuristically use an “EM-like” algorithm:
–Start with a k-mer seed
–Add a uniform prior to generate a PWM
–Compute the optimal PWM threshold (maximal hyper-geometric score)
–Re-estimate the PWM by voting from all PWM true positives
   Consider additional PWM positions
   Bound the position entropies to avoid over-fitting
–Repeat the last two steps until the score fails to improve
There are of course no guarantees for improving the scores, but empirically this approach works very well.
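The threshold-selection step of this loop can be sketched as below. It assumes each sequence has already been reduced to its best PWM score, that a sequence is a “hit” when its score is at or above the threshold, and that labels are 0/1; the names `hg_pvalue` and `best_threshold` and the toy scores are hypothetical.

```python
from math import comb

def hg_pvalue(k, N, K, n):
    """P(at least k positives among n selected), with K positives in N sequences."""
    tail = sum(comb(K, j) * comb(N - K, n - j) for j in range(k, min(K, n) + 1))
    return tail / comb(N, n)

def best_threshold(scores, is_positive):
    """Try every observed score as a threshold; return (threshold, min p-value)."""
    N = len(scores)
    K = sum(is_positive)
    order = sorted(range(N), key=lambda i: scores[i], reverse=True)
    best = (None, 1.0)
    k = 0  # positives among the top-n sequences so far
    for n, i in enumerate(order, start=1):
        k += is_positive[i]
        # Evaluate only at boundaries between distinct score values.
        if n < N and scores[order[n]] == scores[i]:
            continue
        p = hg_pvalue(k, N, K, n)
        if p < best[1]:
            best = (scores[i], p)
    return best

# Toy example: the three positives have the three highest scores.
scores = [5.0, 4.0, 3.0, 1.0, 0.5, 0.2]
labels = [1, 1, 1, 0, 0, 0]
threshold, p_value = best_threshold(scores, labels)
print(threshold, p_value)
```

Sequences scoring at or above the returned threshold would then be the “true positives” used to re-estimate the PWM in the next iteration.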
High density arrays quantify TF binding preferences and identify binding sites in high throughput
Harbison et al., Nature 2004
Using microarrays (high resolution tiling arrays) we can now map binding sites genome-wide for any genome. The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them.
If only biology was that simple…
[Figure: discrete and deterministic “binding sites” in yeast as identified by Young, Fraenkel and colleagues]
In fact, binding is rarely deterministic and discrete, and simple wiring is something you should treat with extreme caution.
PWM regression exploits variable levels of binding affinity to robustly recover binding preferences
Correlation between PWM-predicted binding and ChIP experiments spans high, medium and low affinity sites. Motif regression optimizes the PWM for the overall correlation between the predicted binding energies and the measured ChIP values.
[Figure: ChIP log(binding ratio) against PWM sequence energy for ABF1, GCN4 and MBP1; Tanay, GR 2006]
Direct measurements of the in-vitro binding affinity between 8-mers and DNA-binding domains (here just a library of homeodomains, from Berger et al. 2008)
Your Task
1. Download the promoters of the yeast genome from SGD (1000 bp upstream of annotated TSSs).
2. Get the yeast GO gene annotations.
3. Implement the discriminative k-mer scanner described above:
   –enumerate over all 6-mers (with one gap of up to 6 characters)
   –compute the hyper-geometric p-value for discriminating using the motif
   –refine the k-mer into a PWM by:
      1) build a PWM from the motif seed using a uniform prior (i.e., position i has 97% probability to be equal to the motif character at position i and 1% probability for each other character).
      2) compute the optimal PWM likelihood threshold:
         -for each sequence find the position with maximum PWM likelihood
         -for each threshold on PWM likelihood divide the genome into two sets
         -compute the hg p-value according to the intersection with your annotation set
         -select the threshold with minimal p-value
      3) retrain your PWM using the “hits” that got a score above the likelihood threshold (just count the number of nucleotides at each position)
      4) continue iterating until convergence.
4. Search for motifs in selected annotations: cell cycle, ribosome biogenesis, RNA processing, amino acid metabolism, sulfur metabolism, meiosis, stress response, heat shock.
5. To control for your results, shuffle the promoters between the genes and rerun your motif finder while recording your p-values. Determine an empirical p-value threshold and compare it to the expected p-value given just the multiple testing effect.
6. Report the annotations and motifs and the random p-values/likelihoods you got.