Genome Evolution. Amos Tanay 2012 Genome evolution: Lecture 12: Evolution of regulatory sequences.



Beyond protein-coding sequences
Non-coding fraction of the genome: E. coli 12%, yeast 27%, fly 76%, human 97.6%.
How can the biological functions of non-coding sequence be defined?

Sequence-specific transcription factors
Sequence-specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery.
TFs include a DNA-binding domain that specifically recognizes “regulatory elements” in the genome.
The TF-DNA complex is then used to target larger transcriptional structures to the genomic locus.
[Figure: lactose repressor.]

Sequence specificity is represented using consensus sequences or weight matrices
The specificity of TF binding is central to understanding the regulatory relations it can form. We are therefore interested in defining the DNA motifs that can be recognized by each TF.
A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing sets of nucleotides, for example W=[A|T], S=[C|G]).
A more flexible representation uses weight matrices (PWM/PSSM). PWMs are frequently plotted as motif logos, in which the height of each character corresponds to its probability, scaled by the information content of the position (2 bits minus the position entropy).
Example sites: ACGCGT, ACGCGA, ACGCAT, TCGCGA, TAGCGT
Position     1      2      3      4      5      6
A           60%    20%     0      0     20%    40%
C            0     80%     0    100%     0      0
G            0      0    100%     0     80%     0
T           40%     0      0      0      0     60%
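The counting behind a weight matrix can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function names `build_pwm` and `consensus` are hypothetical, and the input is the five example sites listed on the slide.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(sites):
    """Column-wise nucleotide frequencies for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        # Count which nucleotide each site carries at position i.
        counts = Counter(site[i] for site in sites)
        pwm.append({base: counts[base] / len(sites) for base in ALPHABET})
    return pwm

def consensus(pwm):
    """Most frequent nucleotide at each position."""
    return "".join(max(column, key=column.get) for column in pwm)

sites = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
pwm = build_pwm(sites)
for position, column in enumerate(pwm, start=1):
    print(position, column)
print(consensus(pwm))
```

In practice a pseudocount is usually added to each cell so that nucleotides never observed at a position do not get probability zero; that step is left out here to keep the counts identical to the table.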

In vitro TF binding energy is approximated by weight matrices
We can interpret weight matrices as energy functions, with each position contributing additively to the binding energy. This linear approximation is reasonable for most TFs.
[Figure: yeast Leu3 data (Liu and Clarke, JMB 2002).]
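One common way to read a weight matrix as an energy function is to take the negative log-odds of each position against a background and sum over positions. The sketch below assumes that convention (lower energy means stronger predicted binding), a uniform background of 0.25 and a small pseudocount; none of these choices is taken from the slides, and `toy_pwm`, `energy_matrix` and `site_energy` are hypothetical names.

```python
import math

def energy_matrix(pwm, background=0.25, pseudocount=0.01):
    """Per-position energies: -log(frequency / background), with a pseudocount."""
    energies = []
    for column in pwm:
        total = sum(column.values()) + pseudocount * len(column)
        energies.append({
            base: -math.log(((freq + pseudocount) / total) / background)
            for base, freq in column.items()
        })
    return energies

def site_energy(energies, site):
    """Linear approximation: the site energy is the sum of position energies."""
    if len(site) != len(energies):
        raise ValueError("site length must match the matrix length")
    return sum(column[base] for column, base in zip(energies, site))

# Hypothetical three-position matrix, used only for illustration.
toy_pwm = [
    {"A": 0.6, "C": 0.0, "G": 0.0, "T": 0.4},
    {"A": 0.2, "C": 0.8, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
energies = energy_matrix(toy_pwm)
strong = site_energy(energies, "ACG")  # matches the consensus
weak = site_energy(energies, "TAT")    # mismatches at two positions
print(strong, weak)
```

Because the energy is a plain sum, the effect of any single-nucleotide change can be read off one matrix column, which is what makes the linear approximation convenient for evolutionary models.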

In vivo TF binding affinity is approximated by weight matrices
Chromatin immunoprecipitation (ChIP): cross-link and shear the chromatin, then immunoprecipitate.
[Figure: average PWM energy across Ume6 ChIP ranges; stronger binding goes with stronger prediction (Tanay, Genome Res 2006).]

TF binding affinity is kinetically important, with possible functional implications
(Kalir et al., Science 2001)

TFs are present at only a fraction of their optimal sequence targets
Binding is regulated by co-factors, nucleosomes and histone modifications (Heintzman et al., Nature Genetics 2007).

Specific proteins identify enhancers
Shown here are studies of p300 binding in the developing mouse brain (Visel et al., Nature 2009).

TFBSs are clustered in promoters or in “sequence modules”
The distribution of binding sites in the genome is non-uniform.
In small genomes, most sites are in promoters, with a bias toward the nucleosome-free region near the TSS.
In larger genomes (e.g. fly) we observe cis-regulatory modules (CRMs), which are frequently far from the TSS. These represent enhancers.
A single binding site, without the context of other co-occurring sites, is unlikely to represent a functional locus.

Discriminative scores for motifs
So far we used a generative probabilistic model to learn PWMs: the model was designed to generate the data from parameters, and we assumed that TFBSs are distributed differently from some fixed background model. If the background model is wrong, we will get the wrong motifs.
A different scoring approach tries to maximize the discriminative power of the motif model. We will not go into the details of discriminative vs. generative models here, but we shall exemplify the discriminative approach for PWMs.
[Figure panels: lousy discriminator; high-specificity discriminator; high-sensitivity discriminator.]

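Hypergeometric scores and thresholding PWMs
For a discriminative score, we need to decide on both the PWM model and the threshold.
The hypergeometric probability gives the chance of seeing exactly k true positives among the sequences that pass the threshold; its sum over j >= k is the hypergeometric p-value.
[Figure: number of sequences against PWM score, with the score threshold, the positive set and the true positives marked.]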
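The hypergeometric p-value and a simple threshold scan can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's procedure: `N` is the total number of sequences, `K` the number of positives, `n` the number of sequences scoring at or above the threshold, and `k` the positives among those. The function names and the toy scores are hypothetical, and no multiple-testing correction is applied to the scan.

```python
from math import comb

def hypergeom_pmf(j, N, K, n):
    """P(exactly j positives among n drawn from N sequences, K of them positive)."""
    return comb(K, j) * comb(N - K, n - j) / comb(N, n)

def hypergeom_pvalue(k, N, K, n):
    """P(at least k positives among the n drawn): the upper tail of the pmf."""
    return sum(hypergeom_pmf(j, N, K, n) for j in range(k, min(K, n) + 1))

def best_threshold(scores, labels):
    """Scan every observed score as a threshold; return (p-value, threshold)."""
    N, K = len(scores), sum(labels)
    best = (1.0, None)
    for threshold in sorted(set(scores)):
        passed = [lab for s, lab in zip(scores, labels) if s >= threshold]
        p = hypergeom_pvalue(sum(passed), N, K, len(passed))
        if p < best[0]:
            best = (p, threshold)
    return best

# Toy data: PWM scores for six sequences, label 1 marks the positive set.
scores = [5, 4, 3, 2, 1, 0]
labels = [1, 1, 1, 0, 0, 0]
p_value, threshold = best_threshold(scores, labels)
print(p_value, threshold)
```

Because the scan tries many thresholds, the minimal p-value is optimistic; a real analysis would correct for the number of thresholds (and motifs) tested.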

Constructing a weight matrix from aligned TFBSs is trivial
This is done by counting (or “voting”).
Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from sets of curated and validated binding sites.
Validated sites are usually established by “promoter bashing”: testing reporter constructs with and without the putative site.
TRANSFAC 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers.
However, there are not really 830 distinct matrices out there; the real binding repertoire in nature is still somewhat unclear.

High-density arrays quantify TF binding preferences and identify binding sites in high throughput
Using microarrays (high-resolution tiling arrays) we can now map binding sites genome-wide for any genome.
The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them (Harbison et al., Nature 2004).

Direct measurements of the in vitro binding affinity between 8-mers and DNA-binding domains (here just a library of homeodomains, from Berger et al. 2008)

Profiling binding affinity across the entire k-mer spectrum provides direct quantification of in vitro affinity (Badis et al., 2009)
[Figure: heatmap of a 2D hierarchical agglomerative clustering analysis of 4740 ungapped 8-mers over 104 nonredundant TFs, with both 8-mers and proteins clustered using the averaged E-score from the two different array designs. Axes: 8-mers, 104 TFs.]

What kind of biological function is naturally selected?
[Figure: discrete and deterministic “binding sites” in yeast, as identified by Young, Fraenkel and colleagues.]
In fact, binding is rarely deterministic and discrete, and simple wiring diagrams should be treated with extreme caution.

The Halpern-Bruno model for selection on affinity
According to Kimura's theory, an allele with fitness s in a homogeneous population reaches fixation with a probability determined by s and the population size.
Assuming a slow mutation rate (which allows us to assume a homogeneous population) and motifs a and b with relative fitness s, we can write the fixation probabilities (the chance of fixation given that the mutation occurred) for the change from a to b and for the change from b to a.
If p represents the mutation probability and π the stationary distribution, and if we assume the process as a whole is reversible, then π(a) p(a→b) f(a→b) = π(b) p(b→a) f(b→a), where f denotes the fixation probability.
We work on deriving the substitution rate at each position of the binding site, given its observed stationary frequency.
We are assuming that the fitness of the site is defined by multiplying the fitness values of all loci. This makes (log) fitness effectively linear in the binding energy (Halpern and Bruno, MBE 1998).
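The equations on this slide did not survive in the transcript. The sketch below uses the substitution-rate form commonly attributed to Halpern and Bruno (1998), in which the rate from a to b is the mutation probability multiplied by ln(x) / (1 - 1/x), with x = π(b) p(b→a) / (π(a) p(a→b)). Treat this form, the omitted overall scaling constant, and the function name `hb_rate` as assumptions of the example rather than as the slide's content.

```python
import math

def hb_rate(p_ab, p_ba, pi_a, pi_b):
    """Substitution rate a -> b (up to a constant) from mutation and stationary terms."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0):
        # Neutral limit: ln(x) / (1 - 1/x) -> 1, so the rate equals the mutation rate.
        return p_ab
    return p_ab * math.log(x) / (1.0 - 1.0 / x)

# Symmetric mutation; b is four times more frequent than a at stationarity.
p, pi_a, pi_b = 0.01, 0.2, 0.8
toward_b = hb_rate(p, p, pi_a, pi_b)   # favoured substitution, faster than neutral
toward_a = hb_rate(p, p, pi_b, pi_a)   # disfavoured substitution, slower than neutral
print(toward_b, toward_a)

# Reversibility check: the stationary flux a -> b equals the flux b -> a.
print(pi_a * toward_b, pi_b * toward_a)
```

The point of the construction is that rates are inferred from observed stationary frequencies (for example PWM columns) and a mutation model, without having to specify fitness values directly.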

The Halpern-Bruno model for selection on affinity
The HB model is limited for the study of general sequences. When the analysis is restricted to relatively specific sites, HB is not completely off (Moses et al., 2003).

Testing the general binding energy-fitness correspondence
While E(S) is approximated by a PWM, F(E) is unlikely to be linear.
Assume that the background probability of a motif a is P_0(a). In detailed balance, and assuming the fitness of a at functional sites is F(a), the stationary distribution at functional sites, Q(a), can be derived from P_0(a) and F(a) (Mustonen and Lässig, PNAS 2005).
If we collapse all sites with binding energy E (and hence the same F(a) = F(E(a))), the same relation holds between Q(E), P_0(E) and F(E).
The entire genome should behave like a mixture of background sequence and functional loci, so the empirical energy distribution W(E) is a mixture of P_0(E) and Q(E). We can therefore try to recover Q(E), and hence F(E), from the maximum likelihood parameters fitted to an empirical W(E).
[Figure: expected and observed energy distributions in E. coli CRP sites (left) and background (right); the inferred F(E) is shown in orange. Comparison of CRP energies in E. coli and S. typhimurium (Hwa and Gerland, 2000-).]
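The mixture argument can be made concrete with binned energy histograms. The sketch below assumes the form usually quoted for this model, Q(E) proportional to P_0(E) exp(2N F(E)), and a genome-wide distribution W(E) = (1 - λ) P_0(E) + λ Q(E); both formulas are assumptions of this example, since the slide's equations are missing from the transcript. The histograms, the mixing weight `lam` and the function names are hypothetical, and `lam` is taken as known here, whereas a real analysis would fit it by maximum likelihood.

```python
import math

def mixture(p0, q, lam):
    """Genome-wide energy distribution: background mixed with functional sites."""
    return [(1 - lam) * b + lam * f for b, f in zip(p0, q)]

def recover_q(w, p0, lam):
    """Invert the mixture for Q(E), given W(E), P0(E) and the mixing weight."""
    return [(wi - (1 - lam) * b) / lam for wi, b in zip(w, p0)]

def scaled_fitness(p0, q):
    """2N * F(E) up to an additive constant, assuming Q(E) ~ P0(E) * exp(2N F(E))."""
    return [math.log(f / b) if f > 0 and b > 0 else None for b, f in zip(p0, q)]

# Hypothetical histograms over four energy bins, lowest energy (strongest binding) first.
p0 = [0.10, 0.20, 0.40, 0.30]   # background
q = [0.50, 0.30, 0.15, 0.05]    # functional sites
lam = 0.1

w = mixture(p0, q, lam)
q_hat = recover_q(w, p0, lam)
fitness = scaled_fitness(p0, q_hat)
print(q_hat)
print(fitness)
```

Nothing in this inversion forces F(E) to be linear in E, which is the reason for estimating it from data rather than assuming it.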

TFBS evolution: purifying selection and conservation
[Figure: possible fates of a substitution in a TF binding site (TF1, TF2). Similar function: neutral evolution. Disrupted function: low rate, purifying selection. Altered function: low rate, purifying selection. Altered affinity: rate? selection? Example site pairs: CACGCGTA/CACGCGTT, CACGAGTT/CACGCGTT, CACACGTT/CACGCGTT.]

Binding site conservation (Kellis et al., 2003)

Binding site conservation: heuristic motif identification (Kellis et al., 2003)

Analyzing k-mer evolutionary dynamics
Instead of trying to identify conserved motifs, try to infer the evolutionary rate of substitution between pairs of k-mers.
Start from a multiple alignment and reconstruct ancestral sequences (assuming site independence, or even using maximum parsimony).
Estimate the number of substitutions between pairs of 8-mers, and compare it to the number expected under the background model.
Do this over a lot of sequence, so that statistics on the difference between observed and expected substitutions can be derived.
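The counting step can be sketched for a single ancestor-descendant pair of aligned sequences. This is a simplified illustration: ancestral reconstruction is assumed to be already done, only windows differing by exactly one nucleotide are counted, windows containing alignment gaps are skipped, and the expected counts are taken as a given input rather than derived from a background model. The function names and the toy sequences are hypothetical.

```python
from collections import Counter

def kmer_substitutions(ancestor, descendant, k=8):
    """Count (ancestral k-mer, derived k-mer) pairs that differ by one nucleotide."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for i in range(len(ancestor) - k + 1):
        a, d = ancestor[i:i + k], descendant[i:i + k]
        if "-" in a or "-" in d:
            continue  # skip windows that overlap an alignment gap
        if sum(x != y for x, y in zip(a, d)) == 1:
            counts[(a, d)] += 1
    return counts

def observed_over_expected(observed, expected):
    """Ratio of observed to expected counts for pairs with a positive expectation."""
    return {pair: observed[pair] / expected[pair]
            for pair in observed if expected.get(pair, 0) > 0}

# Toy example with k=3 so the windows are easy to follow by eye.
observed = kmer_substitutions("ACGTA", "ACTTA", k=3)
print(observed)
```

Note that one nucleotide change is seen by up to k overlapping windows, so the counts for different k-mer pairs are not independent; the background model has to be evaluated on the same overlapping windows for the comparison to be fair.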

Saccharomyces TFBS selection network
[Figure legend: nodes are octamers, marked by conservation (2 SD, 3 SD, or otherwise); arcs are 1-nt substitutions, marked by rate and selection (normal rate: neutral; low rate: negative), or as lacking enough statistics.]
Inter-island organization in the Reb1 cluster: selection hints toward multimodality of Reb1 (Tanay et al., 2004).

Leu3 selection network
[Figure legend: high affinity (K_d < 60); medium affinity (400 > K_d > 60); high-rate substitutions; substitutions changing high-affinity to high-affinity motifs; substitutions changing high-affinity to low-affinity motifs. Axes: log delta affinity, substitution rate.]

A simple transcriptional code and its evolutionary implications
[Figure: hexamers grouped by the factor that recognizes them (TF 1 to TF 5), plus all the rest. Hexamers shown: AAATTT, AATTTT, AAAATT; GATGAG, GATGCG, GATGAT; CACGTG, CACTTG; ACGCGT, TCGCGT, ACGCGT; TGACTG, TGAGTG, TGACTT.]

The Halpern-Bruno model for selection on affinity
The basic notion here is the relation between sequence, binding and function/fitness: sequence -> binding energy -> function.
We argued that E(S) can be approximated by a PWM. F(E) is a completely different story, for example:
Is there any function at all to low-affinity binding sites?
Is there a difference between very high-affinity and plain strong binding sites?
Are all appearances of the site subject to the same fitness landscape?

More tests for possible conservation of low binding energy sites
[Figure: distributions of ΔE between S. cerevisiae and S. mikatae for high-affinity and low-affinity sites, compared with a neutral, context-aware simulation using KS statistics.]

More tests for possible conservation of low binding energy sites
[Figure: binding site conservation and conservation of total energy; conservation score against binding energy percentile for Reb1, Ume6, Cbf1, Gcn4 and Mbp1 (Tanay, GR 2006).]

Evolutionary dynamics of transcription factor binding (mammals)
Shared binding loci: 4% (Schmidt et al., Science 2010).

Evolutionary dynamics of CTCF binding (mammals)
Shared binding loci: 24% (Schmidt et al., Cell 2012).

Evolutionary dynamics of transcription factor binding (flies): correlates with the sequence (Bradley et al., PLoS Biology 2010)