Genome-wide computational prediction of transcriptional regulatory modules reveal new insights into human gene expression Mathieu Blanchette et al. Presented.

Slides:



Advertisements
Similar presentations
Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome ECS289A.
Advertisements

Periodic clusters. Non periodic clusters That was only the beginning…
PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.
A Genomic Code for Nucleosome Positioning Authors: Segal E., Fondufe-Mittendorfe Y., Chen L., Thastrom A., Field Y., Moore I. K., Wang J.-P. Z., Widom.
Computational discovery of gene modules and regulatory networks Ziv Bar-Joseph et al (2003) Presented By: Dan Baluta.
Predicting Enhancers in Co-Expressed Genes Harshit Maheshwari Prabhat Pandey.
Computational detection of cis-regulatory modules Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor Katholieke Universiteit Leuven, Belgium.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
CSE Fall. Summary Goal: infer models of transcriptional regulation with annotated molecular interaction graphs The attributes in the model.
Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Finding Transcription Factor Binding Sites BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
How many transcripts does it take to reconstruct the splice graph? Introduction Alternative splicing is the process by which a single gene may be used.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Comparative Motif Finding
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
The Model To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countable infinite number of Gaussian components.
The Hardwiring of development: organization and function of genomic regulatory systems Maria I. Arnone and Eric H. Davidson.
1 Predicting Gene Expression from Sequence Michael A. Beer and Saeed Tavazoie Cell 117, (16 April 2004)
Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.
Bryan Heck Tong Ihn Lee et al Transcriptional Regulatory Networks in Saccharomyces cerevisiae.
Lecture 12 Splicing and gene prediction in eukaryotes
Modeling Regulatory Motifs 3/26/2013. Transcriptional Regulation  Transcription is controlled by the interaction of tran-acting elements called transcription.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Regulation of eukaryotic gene sequence expression
Inferring Cellular Networks Using Probabilistic Graphical Models Jianlin Cheng, PhD University of Missouri 2009.
MicroRNA Targets Prediction and Analysis. Small RNAs play important roles The Nobel Prize in Physiology or Medicine for 2006 Andrew Z. Fire and Craig.
A systems biology approach to the identification and analysis of transcriptional regulatory networks in osteocytes Angela K. Dean, Stephen E. Harris, Jianhua.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Kristen Horstmann, Tessa Morris, and Lucia Ramirez Loyola Marymount University March 24, 2015 BIOL398-04: Biomathematical Modeling Lee, T. I., Rinaldi,
Proliferation cluster (G12) Figure S1 A The proliferation cluster is a stable one. A dendrogram depicting results of cluster analysis of all varying genes.
발표자 석사 2 년 김태형 Vol. 11, Issue 3, , March 2001 Comparative DNA Sequence Analysis of Mouse and Human Protocadherin Gene Clusters 인간과 마우스의 PCDH 유전자.
* only 17% of SNPs implicated in freshwater adaptation map to coding sequences Many, many mapping studies find prevalent noncoding QTLs.
ChIP-on-Chip and Differential Location Analysis Junguk Hur School of Informatics October 4, 2005.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Vidyadhar Karmarkar Genomics and Bioinformatics 414 Life Sciences Building, Huck Institute of Life Sciences.
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
Statistical Analysis for Word counting in Drosophila Core Promoters Yogita Mantri April Bioinformatics Capstone presentation.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
PreDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Department.
Comparative genomics analysis of NtcA regulons in cyanobacteria: Regulation of nitrogen assimilation and its coupling to photosynthesis Wen-Ting Huang.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Localising regulatory elements using statistical analysis and shortest unique substrings of DNA Nora Pierstorff 1, Rodrigo Nunes de Fonseca 2, Thomas Wiehe.
Introduction to Gene Expression
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
A B IL-4(+) IL-4(-) IL-4(+) IL-4(-) ChIP-Seq (STAT6) Ramos IL-4 (+) P-value Ramos IL-4 (-) P-value BEAS2B IL-4 (+) P-value BEASB IL-4 (-) P-value fold.
Log 2 (expression) H3K4me2 score A SLAMF6 log 2 (expression) Supplementary Fig. 1. H3K4me2 profiles vary significantly between loci of genes expressed.
Cis-regulatory Modules and Module Discovery
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Finding genes in the genome
Enhancers and 3D genomics Noam Bar RESEARCH METHODS IN COMPUTATIONAL BIOLOGY.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Regulation of Gene Expression
bacteria and eukaryotes
The Transcriptional Landscape of the Mammalian Genome
What is a Hidden Markov Model?
Learning Sequence Motif Models Using Expectation Maximization (EM)
Structure of proximal and distant regulatory elements in the human genome Ivan Ovcharenko Computational Biology Branch National Center for Biotechnology.
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
Volume 33, Issue 4, Pages (February 2009)
Presented by, Jeremy Logue.
Nora Pierstorff Dept. of Genetics University of Cologne
Presented by, Jeremy Logue.
Presentation transcript:

Genome-wide computational prediction of transcriptional regulatory modules reveal new insights into human gene expression Mathieu Blanchette et al. Presented By: Manish Agrawal

Outline Introduction Cis regulatory module (CRM) prediction algorithm In silico validation of predicted modules Experimental validation of predicted modules Location of CRMs relative to genes Conclusions

Gene Regulation Chromosomal activation/deactivation Transcriptional regulation Splicing regulation mRNA degradation mRNA transport regulation Control of translation initiation Post-translational modification Source: Lecture Notes by Prof. Saurabh Sinha, UIUC

GENE ACAGTGA TRANSCRIPTION FACTOR PROTEIN Transcriptional regulation Source: Lecture Notes by Prof. Saurabh Sinha, UIUC

GENE ACAGTGA TRANSCRIPTION FACTOR PROTEIN Transcriptional regulation Source: Lecture Notes by Prof. Saurabh Sinha, UIUC

Transcription Factors(TFs) They generally have affinity for short, degenerate DNA sequences (5-15 bp). Experiments have enabled identification of consensus-binding motifs for hundreds of TFs. The binding motifs are generally represented by position-weight matrices (PWM).

Binding site sequence alignment Source:

Alignment matrix for a binding site Source:

Position weighted matrice (PWM) Source:

Position weighted matrice (PWM) To transform elements of the alignment matrix to the weight matrix we used the following formula: weighti,j = ln (ni,j+pi)/(N+1) ~ ln (fi.j /pi) pi N - total number of sequences (15 in this example) ni,j - number of times nucleotide i was observed in position j of the alignment. fi,j = ni,j/N - frequency of letter i at position j pi - a priori probability of letter I In this example pT,A is equal to 0.3 and pC,G is equal to 0.2 (overall frequency of the letters within Drosophila melanogaster genome) Source:

Position weighted matrice (PWM) Weight matrix can be used to evaluate the resemblance of any L bp DNA sequence to the training set of binding sites. The score for this sequence is calculated as the sum of the values that each base of the sequence has in the weight matrix. Any sequence with score that is higher then the predefined cut-off is a potential new binding site. Source:

Complications in indentification of TF-binding sites (TFBSs) The binding of a TF also depends on other factors like the chromatin environment and the cooperation or competition with other DNA binding proteins. In higher eukaryotes, TFs rarely operate by themselves, but a combination of TFs act together to achieve the desired gene expression. The DNA footprint of this set of factors is called cis- regulatory module (CRM).

Cis-regulatory module

Features of CRMs CRMs generally consist of several binding sites for a TF. CRMs, and in particular the binding sites they contain, are generally more evolutionarily conserved than their flanking intergenic regions Genes regulated by a common set of TFs tend to be co-expressed.

Outline Introduction CRM prediction algorithm In silico validation of predicted modules Experimental validation of predicated modules Location of CRMs relative to genes Conclusions

Predicting CRMs Different combinations of these features (of CRMs) have been used, often with PWM information, to predict regulatory elements for specific TFs. However, very few existing methods are designed to be applied on a genome-wide scale without prior knowledge about sets of interacting TFs or sets of co-regulated genes. Previous works had generally taken 5-10 PWMs and they looked for the clusters of these PWMs in the genome. Such studies have been reported for embryo development in Drosophila.

Goals and challenges The goal of this study is to do a genome-wide analysis and identify CRMs in human genome without any prior knowledge about interaction of TFs. The new algorithm only uses the features of CRMs (mentioned earlier) for its prediction. Although, CRMs predicted like this may contain a significant number of false positives, the whole genome approach provides sufficient statistical power to formulate specific biological hypotheses.

Mathieu Blanchette et al. Genome Res. 2006; 16: CRM prediction algorithm (Overview)

CRM prediction algorithm 1.A set of 481 vertebrate PWMs frm Transfac 7.2 was used for the analysis. PWMs were grouped into 229 families. 2.The genome-wide multiple alignment was done for the human, mouse and rat genomes by the MULTIZ program. Only the regions within MULTIZ alignment were considered in the later part of the study. These regions cover 34% of the human genome 3. For each of the 481 PWMs, individual binding sites were first predicted. The human, mouse and rat genomes were scanned separately on both strands, and a log likelihood score is computed in the standard way.

CRM prediction algorithm 4.For each species and each PWM, a hit score was computed. Later, a weighted average of the human, mouse and rat scores was used to define a “hit score” for each alignment column p and PWM m, 5.hitScore aln (m,p) = hitScore Hum (m,p) + 1/2 max(0, hitScore Mou (m,p) + hitScore Rat (m,p)) 6.The human hit scores has been given higher weight to allow prediction of human-specific binding sites, provided that they are very good matches to the PWM considered. 7.Only positions with hitScore aln (m,p) > 10 are retained to construct modules. This threshold is somewhat arbitrary but results in total number of bases predicted in pCRMs to be ~2.88% of the genome.

CRM prediction algorithm: Computation of module score We need moduleScore(p 1 …p 2 ) for the alignment region going from position p 1 to p 2 of human. Define TotalScore(m, p 1.p 2 ) to be the sum of the hitScores aln of all non-overlapping hits for m in the region p 1.p 2. The optimization problem of choosing the best set of non- overlapping hits is solved heuristically using a greedy algorithm. This greedy algorithm iteratively selects the hit with the maximal score that does not overlap with the other hits previously chosen. For each matrix and each region, a P-value is assigned.

CRM prediction algorithm: Computaion of module score The score for a module is computed based on one to five PWMs called tags. The first tag for region p 1.p 2 is the matrix with the most significant TotalScore, i.e., tag 1 = argmin m  PWMs pValue(TotalScore(m,p 1.p 2 )). The regions belonging to tag 1 are then masked out and the TotalScores for each matrix are recomputed, excluding hits overlapping those of tag 1. The second tag is then the matrix with most significant TotalScore. The process is repeated until five tags are selected if possible.

CRM prediction algorithm: Computation of module score Finally, we define totalModuleScore as a function of the P- values of individual tags. So, a module can consist of one to five tags, depending on which number of tags yields the highest statistical significance. The above algorithm was used to search for modules of maximal length 100, 200, 500, 1000 and 2000 bp.

Mathieu Blanchette et al. Genome Res. 2006; 16: CRM prediction algorithm (Overview)

Results The algorithm could identify about 118,000 putative CRMs covering 2.88% of the genome. This constitutes one of the first genome-wide, non-promoter centric set of human cis-regulatory modules. The biological relevance of pCRMs were evaluated by measuring the extent they overlap known regulatory elements in databases such as TRRD, Transfac and GALA.

Outline Introduction CRM prediction algorithm In silico validation of predicted modules Experimental validation of predicated modules Location of CRMs relative to genes Conclusions

Mathieu Blanchette et al. Genome Res. 2006; 16: In silico validation of predicted modules

Comparison to other genome- wide predictions The ability of the algorithm to take advantage of interspecies TFBS conservation contributes in good part to its accuracy. The 34% of the human genome that lies within an alignment block with the mouse and rat genome contains 90% of bases within Transfac sites, 67% of those within TRRD modules, and 87% of those within GALA regulatory regions.

Outline Introduction CRM prediction algorithm In silico validation of predicted modules Experimental validation of predicted modules Location of CRMs relative to genes Conclusions

Experimental validation of predicted modules Experimentally verified the data by Chip- chip analysis. This method allows for the large scale identification of protein-DNA interactions as they occur in vivo.

Chip-chip Analysis Buck et al. Genome Biol. 2005; 6(11): R97

Experimental validation of predicted modules They selected modules predicted to be bound by the estrogen receptor (ER), the E2F transcription factor (E2F4), STAT3 and HIFI to print a DNA microarray. The microarray contains 758, 1370, 860 and 1882 modules predicted to be bound by ER, E2F4, STAT3, and HIFI respectively. In the current study, the microarray was then probed by ChIP-chip for ER and E2F4, respectively. Approx. 3% of the 758 ER-predicted pCRMs on the microarray actually proved to be bound by ER, while 17% of the 1370 E2F4-predicted pCRMs on the microarray were bound by E2F4.

Experimental validation of predicted modules These numbers need to be considered as an underestimation of the actual specificity of the algorithm, since the protein-DNA interactions were tested in a single cell type, while TFs are known to regulate different sets of genes in different cell types, physiological conditions, and time in development. In addition, the experiment was conducted under a single set of conditions (concentration of estradiol, time of treatment, etc. ). For all of these reasons, it is difficult to determine the real accuracy of the algorithm.

Experimental validation of predicted modules As the microarray contained, predicted modules for four different TFs, the data can be used to assess the specificity of TFBS predictions. Among the 55 modules bound by ER, 44% were indeed selected for their ER-binding sites and among the 433 modules bound by E2F4, 54% were selected for that factor. In addition to false positive ChIP-chip signals or the failure of the algorithm to detect some binding sites, it is likely that binding of TFs through alternative mechanisms such as protein-protein interactions contributes to this result. The present algorithm can only predict the binding of TF through direct DNA-binding interactions.

Outline Introduction CRM prediction algorithm In silico validation of predicted modules Experimental validation of predicated modules Location of CRMs relative to genes Conclusions

Mathieu Blanchette et al. Genome Res. 2006; 16: Distribution of pCRMs along a region of chromosome 11

Global view of the gene regulatory landscape The module density varies widely across the genome, with an average of four modules per 100 kb and a maximum of 44 modules per 100-kb window, covering from 0% to 55% of such a region. As illustrated in the previous figure, some regions are rich in modules, but relatively poor in genes. In some cases, this could reflect the presence of many unknown protein- coding genes, or at least of many alternative TSSs. Another possible explanation is that some of these modules may be regulating the transcription of noncoding transcripts. Finally, this observation may be due to the presence of long-range enhancers, which may affect transcription of genes upto several hundreds of kilobases away.

Regulatory modules are preferentially located in specific regions relative to genes The position of pCRMs with respect to their closest gene was studied. The genome was divided into several types of noncoding regions, i.e., upstream of a gene, 5' UTR, 1st intron, internal introns, last intron, 3' UTR, and downstream region. Within each type of region, they computed the fraction of bases included in a pCRM as a function of the distance to a reference point for each type of region.

Mathieu Blanchette et al. Genome Res. 2006; 16: Distribution of pCRMs relative to specific regions of the genes

Observations 1.Regions immediately surrounding TSSs are highly enriched for predicted modules. This was expected as this region often contains the promoter of the genes. Surprisingly, there are modules immediately downstream of TSSs. These may represent alternative promoters for initiation downstream from the annotated transcripts. 2.Regions surrounding the sites of termination are also enriched for modules. 3' UTRs are essentially as enriched as 5' UTRs for pCRMs. Two reasons may explain this. First, these may represent enhancer type of regulatory elements that activate the upstream gene via a DNA-looping mechanism. Second, these may represent promoter elements driving noncoding transcript, antisense relative to the coding gene. Such antisense transcripts may regulate gene expression by a post-transcriptional mechanism

Mathieu Blanchette et al. Genome Res. 2006; 16: Distribution of pCRMs relative to specific regions of the genes

Observations Another surprising observation is that the density of modules is the lowest in regions located 10–50 kb upstream of the TSS and, symmetrically, 10–30 kb downstream of the end of transcription. This is unexpected, as one would expect that these regions (at least those upstream of the TSS) would be prime estate for transcriptional regulation. However, this is confirmed by the density of interspecies conserved elements, which is also at its lowest in those regions. Being close to the TSS, regulatory elements in these regions may be allowed to contain fewer binding sites (or binding sites with less affinity), making them difficult to detect using the current method.

Observations Alternatively, these regions(10-50 kb upstream) may actually be depleted for regulatory elements. This could be due to constraints imposed by the chromatin structure of the nuclear architecture, making it more difficult for the DNA of these regions to come in physical proximity to the TSS. Another notable observation is that the density of predicted modules in intronic regions is very low in the close vicinity of exons (except the first and the last one), but increases with the distance to the closest exon.

Mathieu Blanchette et al. Genome Res. 2006; 16: TFs target different regions relative to their target genes. RED => Highly enriched for TFBSs, BLUE => Depleted in TFBSs

TFs target different regions relative to their target genes. The previous figure shows that more than 70 of the 229 TFs families considered exhibit a significant enrichment for one or more types of genomic regions. A number of TFs show preference for distal positions, mostly those located more than 100 kb upstream of the TSS, and are also enriched within introns. This set of TFs is enriched for factors containing homeo domains or basic helix-loop-helix domains and are often involved in regulating development.

TFs target different regions relative to their target genes. A second set of TFs preferentially binds within 1 kb of the TSSs. This set is enriched for leucine zipper TF and factors from Ets family. Notably, most of these factors, contrary to what is observed for those binding distal sites, are involved in basic cellular functions.

Outline Introduction CRM prediction algorithm In silico validation of predicted modules Experimental validation of predicated modules Location of CRMs relative to genes Conclusions

Blanchett et al have identified a set of rules describing the architecture of DNA regulatory elements and used them to build an algorithm allowing them to explore the regulatory potential of the human genome. Although the false positive rate in CRM prediction is likely to be high, the statistical power obtained through a large- scale, genome-wide approach revealed new insights about transcriptional regulation. It was noted that a significant number of TFs have a strong bias for regulating genes either from a great distance or from promoter-proximal binding sites.

Conclusions Noteworthy is the fact that most TFs that preferentially work from a large distance are involved in development, while those predicted to work from promoter-proximal sites tend to regulate genes involved in basic cellular processes. It is expected that the database containing the modules presented in this study may speed up the discovery and experimental validation of CRMs

THANK YOU