Intro to Comp Genomics Lecture 9: Motif finding


Sequence specific transcription factors
Sequence-specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery.
TFs include a DNA binding domain that specifically recognizes “regulatory elements” in the genome.
The TF-DNA complex is then used to target larger transcriptional structures to the genomic locus.

Sequence specificity is represented using consensus sequences or weight matrices
The specificity of TF binding is central to understanding the regulatory relations it can form. We are therefore interested in defining the DNA motifs that can be recognized by each TF.
A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus.
Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing sets of nucleotides, for example W=[A|T], S=[C|G]).
A more flexible representation uses weight matrices (PWM/PSSM). For example, the aligned sites
ACGCGT
ACGCGA
ACGCAT
TCGCGA
TAGCGT
give the following matrix (rows are nucleotides, columns are site positions):
     1     2     3     4     5     6
A   60%   20%    0     0    20%   40%
C    0    80%    0   100%    0     0
G    0     0   100%    0    80%    0
T   40%    0     0     0     0    60%
PWMs are frequently plotted using motif logos, in which the height of each character corresponds to its probability, scaled by the information content of the position (2 bits minus the position entropy).
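As a minimal sketch of consensus matching with IUPAC degeneracy, the snippet below translates a consensus string into a regular expression and reports all (possibly overlapping) matches. The function name `find_sites` and the example consensus `WCGCRW` are illustrative choices, not part of the lecture; the code table is the standard IUPAC nucleotide alphabet.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_sites(consensus, sequence):
    """Return 0-based start offsets of all matches, including overlapping ones."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    # A zero-width lookahead lets finditer report overlapping occurrences.
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical degenerate consensus covering four of the five example sites.
print(find_sites("WCGCRW", "GGACGCGTCC"))
```

A consensus either matches or does not, so a site such as TAGCGT, which differs from the others at a single position, is missed entirely; this is the rigidity that weight matrices relax.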

TF binding energy is approximated by weight matrices
Leu3 data (Liu and Clarke, JMB 2002)
We can interpret weight matrices as energy functions: the binding energy of a site is the sum of independent per-position contributions.
This linear approximation is reasonable for most TFs.
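One common way to write the energy interpretation is sketched below. The symbols are assumptions introduced here for illustration: k is the motif length, w_i(c) is the energy contribution of nucleotide c at position i, θ_i(c) is the PWM probability, and P_b(c) is a zero-order background frequency. With this sign convention, lower energy means stronger predicted binding.

```latex
E(s) = \sum_{i=1}^{k} w_i(s_i),
\qquad
w_i(c) = -\log \frac{\theta_i(c)}{P_b(c)}
```

The sum over positions is what makes the model linear: each position contributes independently of its neighbours.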

TF binding affinity is kinetically important, with possible functional implications
[Figure: Kalir et al., Science 2001]
[Figure: Ume6 ChIP ranges against average PWM energy; axes annotated "Stronger binding" and "Stronger prediction". Tanay, Genome Res 2006]

TFs are present at only a fraction of their optimal sequence targets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications
[Figure: panels labeled "Re TSS" and "Re ATG". Lee et al., Nat Gen 2007]

TFs are present at only a fraction of their optimal sequence targets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications
[Figure: panels labeled "Active" and "Inactive". Barski et al., Cell 2007]

TFBSs are clustered in promoters or in “sequence modules”
The distribution of binding sites in the genome is non-uniform.
In small genomes, most sites are in promoters, and there is a bias toward the nucleosome-free region near the TSS.
In larger genomes (fly) we observe CRMs (cis-regulatory modules), which are frequently away from the TSS. These represent enhancers.
A single binding site, without the context of other co-sites, is unlikely to represent a functional locus.

Constructing a weight matrix from aligned TFBSs is trivial
This is done by counting (or “voting”).
Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from sets of curated and validated binding sites.
Validated site: usually using “promoter bashing”, i.e. testing reporter constructs with and without the putative site.
Transfac 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers.
However, there are not really 830 distinct matrices out there; the real binding repertoire in nature is still somewhat unclear.
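The counting step can be sketched in a few lines. This is an illustrative implementation, not code from the lecture: the function name `build_pwm` and the optional `pseudocount` parameter are assumptions, and the example sites are the five aligned sequences from the consensus slide.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=0.0):
    """Estimate a PWM by voting: one dict (nucleotide -> probability) per position."""
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(k):
        counts = Counter(s[i] for s in sites)  # votes for each nucleotide at position i
        pwm.append({c: (counts[c] + pseudocount) / total for c in ALPHABET})
    return pwm

sites = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
for i, column in enumerate(build_pwm(sites), start=1):
    print(i, column)
```

With `pseudocount=0` this is the maximum-likelihood estimate; a small positive pseudocount avoids zero probabilities when only a few sites are available.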

Probabilistic interpretation of weight matrices and a generative model
One can think of a weight matrix as a probabilistic model for binding sites, in which the probability of a k-mer is the product of the per-position probabilities of its letters. This is the site-independent model, defining a probability space over k-mers.
Given a set of aligned k-mers, we know that the ML motif model is derived by voting (a set of independent multinomial variables, like the dice case).
Now assume we are given a set of sequences that are supposed to include binding sites (one for each), but that we don’t know where the binding sites are. In other words, the position of the binding site is a hidden variable h.
We introduce a background model P_b that describes the sequence outside of the binding site (usually a d-order Markov model).
Given complete data (the sequence s and the site position h), we can write down the likelihood of s as the product of the background probability of the sequence outside the site and the motif probability of the k-mer at position h.
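The two probabilities described above can be written out as follows. This is a sketch under a simplifying assumption: the background is taken to be zero-order (independent positions), so that P_b(s) factorizes and can be pulled out of the product; with a d-order Markov background the ratio terms would be conditioned on the preceding d letters. The symbols θ_i (PWM column i), k (motif length) and P(l) (prior over site positions) are notation introduced here.

```latex
P(s_1 \ldots s_k \mid \theta) = \prod_{i=1}^{k} \theta_i(s_i)

P(s, h = l \mid \theta) = P(l) \, P_b(s) \prod_{i=1}^{k}
    \frac{\theta_i(s_{l+i-1})}{P_b(s_{l+i-1})}
```

Writing the likelihood as P_b(s) times a product of k ratios makes explicit that only the k positions covered by the site distinguish one choice of l from another.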

Using EM to discover PWMs de-novo
Inference of the binding site location posterior: note that only k factors need to be computed for each location (P_b(s) is constant).
Starting with an initial motif model, we can apply a standard EM:
- E: compute the posterior of the binding site location in each sequence under the current PWM
- M: re-estimate the PWM by voting over all locations, weighted by their posteriors
As always with EM, initializing to a reasonable PWM is critical.
Following Bailey and Elkan, MEME 1995
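The EM iteration for the one-site-per-sequence model can be sketched as below. This is a simplified illustration, not MEME itself: it assumes a uniform prior over site positions, a zero-order background, a single strand, and a small pseudocount in the M-step; the function names and the 0.7 seed probability are hypothetical choices.

```python
ALPHABET = "ACGT"

def seed_pwm(kmer, p_match=0.7):
    """Initial PWM that favours the seed k-mer at each position."""
    other = (1.0 - p_match) / 3
    return [{c: (p_match if c == ch else other) for c in ALPHABET} for ch in kmer]

def site_posteriors(seq, pwm, bg):
    """E-step for one sequence: P(site starts at l | seq, pwm), uniform prior on l."""
    k = len(pwm)
    ratios = []
    for l in range(len(seq) - k + 1):
        r = 1.0
        for i in range(k):  # only k factors per location; P_b(seq) cancels
            c = seq[l + i]
            r *= pwm[i][c] / bg[c]
        ratios.append(r)
    z = sum(ratios)
    return [r / z for r in ratios]

def em_step(seqs, pwm, bg, pseudocount=0.01):
    """One EM iteration: posterior-weighted voting over all candidate sites."""
    k = len(pwm)
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(k)]
    for seq in seqs:
        for l, p in enumerate(site_posteriors(seq, pwm, bg)):
            for i in range(k):
                counts[i][seq[l + i]] += p
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def run_em(seqs, init_pwm, bg=None, n_iter=50):
    bg = bg or {c: 0.25 for c in ALPHABET}
    pwm = init_pwm
    for _ in range(n_iter):
        pwm = em_step(seqs, pwm, bg)
    return pwm

seqs = ["TTTTACGCGTTTTT", "GGACGCGTGGGG", "CACGCGTCCCAA", "ATATACGCGTAT"]
pwm = run_em(seqs, seed_pwm("ACGCGT"), n_iter=20)
print("".join(max(col, key=col.get) for col in pwm))
```

Because EM only climbs to a local optimum, the seed matters: in practice one would try several seeds (for example, the top-scoring k-mers from an exhaustive scan) and keep the best final likelihood.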

Allowing false positive sequences
If we assume some of the sequences may lack a binding site, this should be incorporated into the model.
[Figure: graphical model over the variables hit, l, s and the PWM parameters]
This is sometimes called the ZOOPS model (zero or one occurrence per sequence).
In Bayesian terms:
- Probability of sequence hit: P(hit | S)
- Probability of hit at position l: Pr(l | S)
We can consider the PWM parameters as variables in the model. Learning the parameters is then equivalent to inference.
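The two posteriors listed above can be sketched explicitly under some assumptions that are introduced here for illustration: q is the prior probability that a sequence contains a site, the prior over the L = n − k + 1 possible positions is uniform, the background is zero-order, and r_l denotes the motif-to-background likelihood ratio of the window starting at l.

```latex
r_l = \prod_{i=1}^{k} \frac{\theta_i(s_{l+i-1})}{P_b(s_{l+i-1})}

P(\mathrm{hit} \mid s) =
  \frac{\frac{q}{L} \sum_{l} r_l}{(1 - q) + \frac{q}{L} \sum_{l} r_l}
\qquad
P(l \mid s) =
  \frac{\frac{q}{L} \, r_l}{(1 - q) + \frac{q}{L} \sum_{l'} r_{l'}}
```

Setting q = 1 recovers the one-site-per-sequence posterior, in which the r_l are simply normalized to sum to one.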

Using Gibbs sampling to discover PWMs de-novo
[Figure: graphical model over the variables hit, l, s and the PWM parameters]
We can use Gibbs sampling to sample the hidden sites and estimate the PWM.
This is done by estimating the PWM from all locations except for the one we sample, and computing the hit probabilities as shown before.
Note that we are working with the MAP (maximum a-posteriori) estimate of the PWM parameters to do the sampling: [equation missing from transcript]
But this can be shown to approximate: [equation missing from transcript]
Gibbs: Lawrence et al. Science 1993
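A minimal site sampler along these lines is sketched below. It is an illustration under stated assumptions rather than the published algorithm: exactly one site per sequence, a uniform zero-order background, a fixed pseudocount of 0.5 when estimating the PWM, and no phase-shift or strand handling. The function names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def pwm_from_sites(sites, pseudocount=0.5):
    """Estimate a PWM by counting, with pseudocounts so no probability is zero."""
    pwm = []
    for i in range(len(sites[0])):
        col = {c: pseudocount for c in ALPHABET}
        for s in sites:
            col[s[i]] += 1
        total = sum(col.values())
        pwm.append({c: v / total for c, v in col.items()})
    return pwm

def gibbs_motif(seqs, k, n_sweeps=200, bg=None, seed=0):
    """Return one sampled site start per sequence after n_sweeps Gibbs sweeps."""
    rng = random.Random(seed)
    bg = bg or {c: 0.25 for c in ALPHABET}
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]  # random initial sites
    for _ in range(n_sweeps):
        for j, seq in enumerate(seqs):
            # Estimate the PWM from every current site except the one in sequence j.
            others = [seqs[m][pos[m]:pos[m] + k] for m in range(len(seqs)) if m != j]
            pwm = pwm_from_sites(others)
            # Weight each candidate location by its motif/background likelihood ratio.
            weights = []
            for l in range(len(seq) - k + 1):
                w = 1.0
                for i in range(k):
                    w *= pwm[i][seq[l + i]] / bg[seq[l + i]]
                weights.append(w)
            # Sample the new site for sequence j in proportion to those weights.
            pos[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["TTTTACGCGTTTTT", "GGACGCGTGGGG", "CACGCGTCCCAA", "ATATACGCGTAT"]
positions = gibbs_motif(seqs, k=6)
print([s[p:p + 6] for s, p in zip(seqs, positions)])
```

Unlike EM, the sampler draws a location rather than averaging over all of them, which lets it escape some poor starting points; the result is stochastic, so several runs with different seeds are usually compared.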

Generalizing PWMs to allow site dependencies: mixture of PWMs and Trees
Barash et al., RECOMB 2003
[Figure: panels labeled "Mixture of PWMs" and "Tree motif"]
We only change the motif component of the likelihood model.
Learning the model can become more difficult. This is because computing the ML model parameters from complete data may be challenging.

Discriminative scores for motifs
So far we used a generative probabilistic model to learn PWMs. The model was designed to generate the data from parameters.
We assumed that TFBSs are distributed differently than some fixed background model. If our background model is wrong, we will get the wrong motifs.
A different scoring approach tries to maximize the discriminative power of the motif model.
We will not go here into the details of discriminative vs. generative models, but we shall exemplify the discriminative approach for PWMs.
[Figure: panels labeled "Lousy discriminator", "High specificity discriminator" and "High sensitivity discriminator"]

Hypergeometric scores and thresholding PWMs
[Figure: number of sequences as a function of PWM score, with the threshold, the positive set and the true positives marked]
For a discriminative score, we need to decide on both the PWM model and the threshold.
Hypergeometric probability (the sum for j>=k is the hg p-value)
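The hypergeometric tail can be computed directly from binomial coefficients. In this sketch the argument names are assumptions: N is the total number of sequences, K the number of positives (annotated sequences), n the number of sequences scoring above the threshold, and k the number of true positives among them.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N total, K positives, n drawn)."""
    upper = min(K, n)  # X cannot exceed either the positives or the draws
    tail = sum(comb(K, j) * comb(N - K, n - j) for j in range(k, upper + 1))
    return tail / comb(N, n)

# 5 of 10 sequences are positives; all 5 sequences above threshold are positives.
print(hypergeom_pvalue(N=10, K=5, n=5, k=5))
```

Exact integer arithmetic avoids rounding problems in the tail; for genome-scale N a log-space implementation or a library routine would be faster.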

Exhaustive k-mer search
A very common strategy for motif finding is to do an exhaustive k-mer search.
Given a set of hits and a set of non-hits, we compute the number of occurrences of each k-mer in the two sets and report all cases that have a discriminative score higher than some threshold.
Since k-mers either match or do not match, there is no issue with the threshold.
For DNA, we will typically scan k=5-8. This can be done efficiently using a map/hash:
- Iterate on short sequence windows (of the desired k length)
- For each window, mark the appearance of the k-mer in a table
- Avoid double counting using a second map
It is easy to generalize such exhaustive approaches to include gaps or other types of degeneracy.
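The scan described above can be sketched with a hash table per set and a per-sequence set to avoid double counting. This is an illustrative version that scores each k-mer with a hypergeometric p-value (lower is more discriminative); the function names are hypothetical, and reverse complements and gaps are ignored.

```python
from collections import Counter
from math import comb

def kmer_sequence_counts(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seen = set()  # the "second map": each k-mer counts once per sequence
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
        counts.update(seen)
    return counts

def hg_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N total, K positives, n drawn)."""
    return sum(comb(K, j) * comb(N - K, n - j)
               for j in range(k, min(K, n) + 1)) / comb(N, n)

def score_kmers(hits, non_hits, k):
    """Map each k-mer seen in the hit set to its hypergeometric p-value."""
    pos = kmer_sequence_counts(hits, k)
    neg = kmer_sequence_counts(non_hits, k)
    N, K = len(hits) + len(non_hits), len(hits)
    # n = sequences containing the k-mer; tp = those among the hits.
    return {kmer: hg_pvalue(N, K, tp + neg[kmer], tp) for kmer, tp in pos.items()}

hits = ["AACGCGTA", "TTACGCGT"]
non_hits = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
scores = score_kmers(hits, non_hits, k=6)
print(min(scores, key=scores.get))
```

Only k-mers that occur in the hit set are scored, since a k-mer absent from the hits cannot be enriched there.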

Refining k-mers to PWMs using heuristic “EM”
A k-mer scan is an excellent initial step for finding refined weight matrices. For example, we can use the k-mers to initialize an EM.
If we want to find a weight matrix, but want to stick to the discriminative setting, we can heuristically use an “EM-like” algorithm:
- Start with a k-mer seed
- Add a uniform prior to generate a PWM
- Compute the optimal PWM threshold (maximal hyper-geometric score)
- Re-estimate the PWM by voting from all PWM true positives
  - Consider additional PWM positions
  - Bound the position entropies to avoid over-fitting
- Repeat the last two steps until the score fails to improve
There are of course no guarantees for improving the scores, but empirically this approach works very well.
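The refinement loop can be sketched as follows. This is a simplified illustration with several assumptions: each sequence is scored by its best window log-likelihood, labels are 0/1 integers, the seed puts probability 0.97 on the seed letter (the value used in the exercise at the end of this lecture), re-estimation uses a pseudocount of 0.5 instead of an explicit entropy bound, and extending the PWM with additional positions is omitted. All function names are hypothetical.

```python
import math
from math import comb

ALPHABET = "ACGT"

def seed_pwm(kmer, p_match=0.97):
    other = (1.0 - p_match) / 3
    return [{c: (p_match if c == ch else other) for c in ALPHABET} for ch in kmer]

def pwm_from_sites(sites, pseudocount=0.5):
    pwm = []
    for i in range(len(sites[0])):
        col = {c: pseudocount for c in ALPHABET}
        for s in sites:
            col[s[i]] += 1
        total = sum(col.values())
        pwm.append({c: v / total for c, v in col.items()})
    return pwm

def best_window(seq, pwm):
    """Return (max log-likelihood, start) over all windows of seq."""
    k = len(pwm)
    return max((sum(math.log(pwm[i][seq[l + i]]) for i in range(k)), l)
               for l in range(len(seq) - k + 1))

def hg_pvalue(N, K, n, k):
    return sum(comb(K, j) * comb(N - K, n - j)
               for j in range(k, min(K, n) + 1)) / comb(N, n)

def best_threshold(scores, labels):
    """Scan all score cut-offs; return (minimal hg p-value, threshold)."""
    N, K = len(scores), sum(labels)
    order = sorted(range(N), key=lambda i: -scores[i])
    best, tp = (1.0, math.inf), 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        if rank < N and scores[order[rank]] == scores[i]:
            continue  # do not split sequences with identical scores
        p = hg_pvalue(N, K, rank, tp)
        if p < best[0]:
            best = (p, scores[i])
    return best

def refine(kmer, seqs, labels, n_rounds=10):
    """Alternate threshold selection and voting until the p-value gets worse."""
    pwm, best_pwm, best_p = seed_pwm(kmer), None, None
    for _ in range(n_rounds):
        hits = [best_window(s, pwm) for s in seqs]
        p, thr = best_threshold([h[0] for h in hits], labels)
        if best_p is not None and p > best_p:
            break
        best_pwm, best_p = pwm, p
        # Vote only from true positives: labeled sequences scoring above threshold.
        sites = [s[l:l + len(pwm)] for s, (sc, l), y in zip(seqs, hits, labels)
                 if y and sc >= thr]
        if not sites:
            break
        pwm = pwm_from_sites(sites)
    return best_pwm, best_p
```

A call such as `refine("ACGCGT", seqs, labels)` returns the last PWM that did not worsen the score together with its p-value; ties are accepted so that the seed can be replaced by a PWM estimated from data.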

High density arrays quantify TF binding preferences and identify binding sites in high throughput
Harbison et al., Nature 2004
Using microarrays (high resolution tiling arrays) we can now map binding sites in a genome-wide fashion for any genome.
The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them.

If only biology was that simple…
Discrete and deterministic “binding sites” in yeast, as identified by Young, Fraenkel and colleagues.
In fact, binding is rarely deterministic and discrete, and simple wiring is something you should treat with extreme caution.

PWM regression exploits variable levels of binding affinity to robustly recover binding preferences.
[Figure: scatter plots of ChIP log(binding ratio) against PWM sequence energy for ABF1, GCN4 and MBP1, each panel annotated with its correlation coefficient. Tanay, Genome Res 2006]
Correlation between PWM-predicted binding and ChIP experiments spans high, medium and low affinity sites.
Motif regression optimizes the PWM given the overall correlation of the predicted binding energies and the measured ChIP values.

Direct measurements of the in-vitro binding affinity of 8-mers and DNA binding domains (here just a library of homeodomains, from Berger et al. 2008)

Your Task
1. Download the promoters of the yeast genome from SGD (1000 bp upstream of annotated TSSs).
2. Get the yeast GO gene annotations.
3. Implement the discriminative k-mer scanner described above:
   - enumerate over all 6-mers (with one gap of up to 6 characters)
   - compute the hyper-geometric p-value for discriminating using the motif
   - refine the k-mer into a PWM by:
     1) build a PWM from the motif seed using a uniform prior (i.e., position i has 97% probability to be equal to the motif character at position i and 1% probability for each of the other characters).
     2) compute the optimal PWM likelihood threshold:
        - for each sequence find the position with maximum PWM likelihood
        - for each threshold on PWM likelihood divide the genome into two sets
        - compute the hg p-value according to the intersection with your annotation set
        - select the threshold with minimal p-value
     3) retrain your PWM using the “hits” that got a score above the likelihood threshold (just count the number of nucleotides at each position).
     4) continue iterating until convergence.
4. Search for motifs in selected annotations: cell cycle, ribosome biogenesis, RNA processing, amino acid metabolism, sulfur metabolism, meiosis, stress response, heat shock.
5. To control for your results, shuffle the promoters between the genes and rerun your motif finder while recording your p-values. Determine an empirical p-value threshold, and compare it to the expected p-value given just the multiple testing effect.
6. Report the annotations and motifs and the random p-values/likelihoods you got.