Computational analysis of promoters

Gene regulation
Genomes usually contain several thousand different genes. Some gene products are required by the cell under all growth conditions; the genes that encode them are called housekeeping genes (e.g. genes for DNA polymerase, RNA polymerase, rRNA, tRNA). Many other gene products are required only under specific growth conditions, e.g. enzymes responding to a specific environmental condition such as DNA damage.
- Today, TSSs (transcriptional start sites) are mapped at once for a whole genome with high-throughput technologies such as 5′ SAGE (DOI: /nbt998) or CAGE (DOI: /science ).

Gene regulation
Housekeeping genes must be expressed at some level all of the time. Frequently, as the cell grows faster, more of the housekeeping gene products are needed. The gene products required for specific growth conditions are not needed all of the time: these genes are frequently expressed at extremely low levels, or not at all, when they are not needed, and are switched on when they are needed. Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or external conditions.

Gene regulation
Gene regulation basically occurs at three different levels:
- transcriptional regulation: transcription of the gene is regulated; control of transcription initiation is the most important control mechanism
- translational regulation: translation of the gene is regulated; how often the mRNA is translated influences the amount of gene product that is made
- post-transcriptional/post-translational regulation: regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)
The most characteristic and biologically far-reaching purpose of gene control in multicellular organisms is execution of the genetic program that underlies embryological development.

Transcriptional regulation
Transcription control has two key features:
- protein-binding regulatory DNA sequences (control elements) are associated with genes
- specific proteins that bind to the regulatory sequences determine where transcription will start and either activate or repress it
The DNA sequence specifying where RNA polymerase binds and initiates transcription of a gene is called a promoter. Transcription from a particular promoter is controlled by DNA-binding proteins termed transcription factors. The DNA control elements that bind transcription factors may be located very far from the promoter they regulate.

Three different polymerases
As a result of this arrangement, transcription from a single promoter may be regulated by binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression. RNA polymerase I synthesizes rRNA. RNA polymerase II synthesizes mRNA. RNA polymerase III synthesizes small RNAs and tRNA.

[Figures, source: Molecular Biology of the Cell, 4th edition, Alberts B. Left: initiation of transcription of a eukaryotic gene by RNA polymerase II. Top right: the gene control region of a typical eukaryotic gene. Bottom right: activation of transcription initiation in eukaryotes by recruitment of the eukaryotic RNA polymerase II holoenzyme complex.]

Three parts of promoter
- core promoter: responsible for the actual binding of the transcription apparatus; lies very close upstream of the TSS (~35 bp) and may also extend downstream (see later)
- proximal promoter: contains several regulatory elements; a few hundred bases upstream of the transcriptional start site (TSS)
- distal promoter: contains enhancers (upstream/downstream) and silencers
These elements are cis-acting: a cis-element regulates a gene on the same DNA molecule. Cis-acting sequences are bound by trans-acting regulatory proteins (i.e. acting from a different molecule). However, the distinction between proximal elements and enhancers/silencers is not very clear.

Core promoter
Eukaryotic RNAPII is not itself capable of transcriptional initiation in vitro. It needs to be supplemented by general (basal) transcription factors (GTFs). The factors are named TFIIX, where X is a letter, e.g. TFIIA, TFIIB. RNAPII and the GTFs form the pre-initiation complex (PIC); only then can transcription commence. The minimal (core) promoter is the DNA sequence sufficient for assembly of the pre-initiation complex. Transcription initiated by the core promoter is called basal transcription.
Though an isolated core promoter does not exist in vivo, it directs transcription initiation in vitro. Thus, it has been particularly useful in defining the minimum set of factors necessary to form the transcriptional complex (also called the transcriptional machinery). These factors, called the basal or general transcription factors (GTFs), are necessary and sufficient (in addition to RNA Pol II) to form the transcriptional machinery at the core promoter and direct accurate initiation in vitro.

Core promoter elements
The core promoter is usually located proximal to or overlapping the TSS. It contains several sequence motifs, and TFs interact with them in a sequence-specific manner. The combination of TF-binding motifs varies depending on the gene.

Core promoter elements
- TATA box: ~30 bp upstream, consensus TATA(A/T)A(A/T)
- Initiator (Inr): found in some eukaryotic (TATA-less) genes instead of a TATA box; surrounds the TSS; extremely degenerate consensus YYAN(T/A)YY (A = TSS, N = any nucleotide, Y = C or T). Promoters with both TATA and Inr also exist.
- DPE (downstream promoter element): present in some TATA-, Inr+ promoters, 30 bp downstream; consensus RGWCGTG (R = A or G, W = A or T)
- Other minor elements also exist: BRE (TFIIB recognition element), MTE (motif ten element), DCE (downstream core element)
Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002; 16 (20):
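The consensus sequences above are written with IUPAC degenerate codes, so scanning a sequence for them amounts to a pattern match. A minimal sketch, assuming a plain regular-expression search is acceptable (real tools usually use position weight matrices instead); the function names `consensus_to_regex` and `find_motif` are made up for this illustration:

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regex."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))

def find_motif(consensus, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A lookahead group lets the scan report overlapping occurrences.
    lookahead = re.compile("(?=(" + consensus_to_regex(consensus).pattern + "))")
    return [m.start() for m in lookahead.finditer(sequence.upper())]

TATA_BOX = "TATAWAW"   # TATA(A/T)A(A/T)
DPE = "RGWCGTG"        # downstream promoter element

tata_hits = find_motif(TATA_BOX, "GGTATAAAAGG")
dpe_hits = find_motif(DPE, "CCAGACGTGCC")
```

Because the motifs are short and degenerate, such a scan reports many hits that have nothing to do with promoters; it only shows what "matching a consensus" means.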

Promoter proximal elements
Found within 100 to 200 bp of the TSS.
- CAAT (CCAAT, CAT) box: consensus GGCCAATCT
- GC box: consensus (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T), a GC-rich segment. A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.

A hypothetic mammalian promoter region
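[Figure: schematic of a hypothetical mammalian promoter region. Labels: enhancers at -10 to -50 kb and +10 to +50 kb, proximal element (-200), TATA (-30), TSS (+1), exon, intron with enhancer.]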

CpG island
Transcription of genes with TATA/Inr promoters begins at a well-defined site. However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long. As a result, such genes give rise to mRNAs with multiple alternative 5′ ends. These are housekeeping genes; they do not contain TATA or Inr elements. Most genes of this type contain a CG-rich stretch of several hundred nucleotides, a CpG island, within ≈100 base pairs upstream of the TSS. CpG islands are typical for vertebrates (including human). They are not common in lower eukaryotes.
The CG dinucleotide is underrepresented in the genome. The reason is that C is easily methylated (at the 5 position) and is then mutated to thymine by deamination. This is not repaired by the DNA repair machinery. However, longer stretches of CG are conserved (they are not mutated to TG), as they have an important functional role. When a CpG island remains unmethylated, a TF-binding site can be recognized by its TF. In contrast, when it is methylated, the presence of 5-methylcytosine interferes with the binding of TFs and thus suppresses transcription.
The precise mechanism of the core promoter function of CpG islands is not well understood. They may contain multiple weak promoters instead of one strong core promoter. One common property of CpG islands is the presence of multiple Sp1 binding sites (Sp1 binds to the GC box motif) [Brandeis et al., Sp1 elements protect a CpG island from de novo methylation. Nature, 1994, 371, ]. Sp1 protects the CpG island from methylation and may work in concert with the general transcription machinery to support nucleation of the PIC.

CpG island
[Figure: a CpG island (~100 bp) upstream of a gene whose mRNA has multiple 5′ start sites.]
Computational analysis is based on the CG dinucleotide imbalance. Two common sets of criteria:
- length ≥ 200 bp, C+G content min 50%, $\frac{\mathrm{CpG}_{observed}}{\mathrm{CpG}_{expected}} = \frac{p(CG)}{p(C)\,p(G)} > 0.60$ (M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, )
- length ≥ 500 bp, C+G content min 55%, $\frac{\mathrm{CpG}_{observed}}{\mathrm{CpG}_{expected}} > 0.65$ (D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 99, )
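The two sets of thresholds can be checked directly from base counts. A minimal sketch for a single candidate window, assuming p(C), p(G) and p(CG) are estimated as counts divided by the window length; the function names are made up here, and real island finders additionally slide the window along the sequence and merge neighbouring hits:

```python
def cpg_stats(seq):
    """Return (C+G content, CpG observed/expected ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    gc_content = (c + g) / n
    # p(CG) / (p(C) * p(G)) with p(x) = count / n simplifies to cg * n / (c * g)
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp

def is_cpg_island(seq, min_len=200, min_gc=0.50, min_oe=0.60):
    """Defaults follow the Gardiner-Garden and Frommer thresholds quoted above."""
    if len(seq) < min_len:
        return False
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content >= min_gc and obs_exp > min_oe

def is_cpg_island_takai_jones(seq):
    """Stricter Takai and Jones thresholds quoted above."""
    return is_cpg_island(seq, min_len=500, min_gc=0.55, min_oe=0.65)

toy_island = "CG" * 100      # artificial 200 bp sequence, not real data
toy_background = "AT" * 100
```

On the artificial `toy_island` the first test passes while the stricter one fails on length alone.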

CpG island
Example: len=51, #C=76, #G=101, #CG=30, p(C)= , p(G)= , p(CG)= , CG content = p(C) + p(G) = 0.71, CpG o/e = 0.98
Simple methods based on the frequency of CG perform remarkably well at correctly predicting regions containing TSSs.
Tools: EMBOSS CpGPlot/CpGReport; CpG Island Searcher (IE only)

Promoter regions in human genes
Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11(5):
Potential promoter regions were identified for 1031 human genes. Fraction of promoters containing each element:
- TATA: 32%
- Inr: 85%
- GC box: 97%
- CAAT box: 64%
- located in a CpG island: 48%
Combinations of TATA and Inr:
- TATA+ Inr+: 28%
- TATA+ Inr-: 4%
- TATA- Inr+: 56%
- TATA- Inr-: 12%

Computational analysis of promoters

Introduction
Regulatory regions typically contain several transcription factor binding sites strung out over a large region. Which factor is used depends not only on the binding site but also on which factors are available for binding in a given cell type at a given time. Any given gene will typically have its very own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is transcribed only in the proper cell type(s) and at the proper time during development.

Introduction
Transcription factors themselves are also subject to similar transcriptional regulation, thereby forming transcriptional cascades and feedback control loops. While all this is very nice and interesting from a biologist's point of view, it spells big trouble for promoter prediction.

Computational difficulties
- There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.
- Any given sequence element might be recognized by different factors in different cell types.
- Core promoter regulatory elements are short and not completely conserved ⟹ similar elements will be found purely by chance all over the genome.
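The last point can be made concrete with a back-of-the-envelope calculation. A minimal sketch, assuming a random sequence with independent, equally frequent bases (real genomes are not like this, so the numbers are only an order-of-magnitude illustration); the function names and the round genome length of 3 × 10^9 bp are assumptions for the example:

```python
# Number of concrete bases each IUPAC code stands for.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

def match_probability(consensus):
    """Probability that a random position matches the consensus by chance."""
    p = 1.0
    for base in consensus.upper():
        p *= DEGENERACY[base] / 4
    return p

def expected_chance_matches(consensus, seq_length):
    """Expected number of chance matches on one strand of a random sequence."""
    positions = max(seq_length - len(consensus) + 1, 0)
    return positions * match_probability(consensus)

# TATA(A/T)A(A/T): five fixed positions and two two-fold degenerate ones,
# so the per-position match probability is (1/4)**5 * (1/2)**2 = 1/4096.
p_tata = match_probability("TATAWAW")
n_tata = expected_chance_matches("TATAWAW", 3_000_000_000)
```

Under these assumptions `n_tata` comes out at roughly 7 × 10^5 chance matches per strand, far more than the number of genes, which is why a consensus hit alone is weak evidence for a promoter.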

What do promoter prediction methods actually predict?
- TSS (transcription start site): the first nucleotide copied at the 5′ end of the corresponding mRNA
- core promoter: the region around the TSS is often referred to as the core promoter
Owing to the strong link between TSS and core promoter, these terms are often used interchangeably.
Three distinct types of features are used in promoter prediction:
- signal features
- context features
- structure features

Evaluating predictions
- sensitivity (Se), recall, TPR: proportion of correct predictions of TSSs relative to all experimental TSSs, $\mathrm{Se} = \frac{TP}{TP+FN}$
- positive predictive value (PPV), precision: proportion of correct predictions of TSSs out of all counted positive predictions, $\mathrm{PPV} = \frac{TP}{TP+FP}$
Se and PPV are the measures most often used.

- specificity (Sp): $\mathrm{Sp} = \frac{TN}{TN+FP}$
- false positive rate (FPR): $\mathrm{FPR} = \frac{FP}{FP+TN}$
- correlation coefficient (CC): $\mathrm{CC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
It is not clear to me how TN is obtained. According to Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform. 2010; 7(3): PubMed PMID: : "If there are no predictions in the [+2001, EndOfGene], it represents TN." However, I don't understand this statement.
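Once the four counts are available, the measures defined on these two slides are simple ratios. A minimal sketch; the function name and the example counts are invented for illustration, and undefined ratios (zero denominator) are returned as NaN by choice:

```python
import math

def prediction_metrics(tp, fp, tn, fn):
    """Compute Se, PPV, Sp, FPR and CC from the four confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    cc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Se": ratio(tp, tp + fn),    # sensitivity, recall, TPR
        "PPV": ratio(tp, tp + fp),   # positive predictive value, precision
        "Sp": ratio(tn, tn + fp),    # specificity
        "FPR": ratio(fp, fp + tn),   # false positive rate, equals 1 - Sp
        "CC": ratio(tp * tn - fp * fn, cc_denominator),
    }

# Invented counts, purely to exercise the formulas.
example = prediction_metrics(tp=80, fp=20, tn=880, fn=20)
```

With these invented counts Se and PPV are both 0.8 while CC is 7/9, which shows how strongly CC depends on the (often ill-defined) TN count.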

And how are FP, FN and TP obtained? You have a gene sequence for which you know the TSS location, and you make your prediction. If it falls within the region [-2000, +2000] relative to the annotated TSS, it is a TP. Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs. If you predict no promoter for this gene sequence, you have an FN. This procedure is described in Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol Nov; 22(11): PMID:
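The counting rules can be written down for a single gene as follows. This is only one possible reading of the procedure, with several assumptions that the slide does not settle: coordinates are on the gene's own strand, a gene contributes at most one TP however many predictions hit the window, a gene with no prediction inside the window counts as FN, predictions upstream of -2000 are ignored, and TN follows the quoted SCS rule. The function name `score_gene` is hypothetical:

```python
def score_gene(tss, gene_end, predictions, window=2000):
    """Score predicted TSS positions for one gene (forward-strand coordinates)."""
    relative = [p - tss for p in predictions]
    # TP: at least one prediction inside [-window, +window] around the annotated TSS.
    hit = any(-window <= r <= window for r in relative)
    # FP: every prediction inside [+window + 1, EndOfGene].
    downstream = sum(1 for r in relative if window < r <= gene_end - tss)
    return {
        "TP": int(hit),
        "FN": int(not hit),
        "FP": downstream,
        "TN": int(downstream == 0),  # no prediction in [+2001, EndOfGene]
    }

# Invented coordinates: one prediction near the TSS, one deep inside the gene.
with_predictions = score_gene(tss=10_000, gene_end=50_000, predictions=[9_500, 30_000])
without_predictions = score_gene(tss=10_000, gene_end=50_000, predictions=[])
```

Summing these per-gene dictionaries over all genes gives the genome-wide counts that enter the evaluation measures.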

Signal features
These methods recognize "conserved" signals such as the TATA box, Inr, DPE, BRE, etc. Such motifs are highly variable and degenerate, which leads to a high false positive rate. Methods based on core promoter elements and other specific TFBSs (e.g. the CAAT box) are far from accurate. A much more reliable signal is the CpG island. However, only ≈50% of human genes contain CpG islands. As a consequence, CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.

Context features
Extracted from the genomic context of promoters. Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples. n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG). The n-mer representation encodes contextual information of promoters and has the following advantages:
- contextual information is independent of any biological signals
- the distribution of n-mers may have biological significance (TFBS, CpG)
- n-mers may reveal details of yet unknown promoter regions
- n-mers reduce FPR while maintaining relatively high TPR (i.e. Se)
6-mers keep a good balance between discriminative power and computational complexity: Wu S, Xie X, Liew AW, Yan H. Eukaryotic promoter prediction based on relative entropy and positional information. Phys Rev E Stat Nonlin Soft Matter Phys (4 Pt 1):
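To make "n-mer statistics estimated from training samples" concrete, here is a minimal sketch that counts n-mers in promoter and background training sets and scores a new sequence by a summed log-odds ratio. This is a generic illustration and not the relative-entropy method of the cited paper; the function names, the pseudocount and the toy training sequences are all assumptions:

```python
import math
from collections import Counter

def kmer_counts(sequences, k=6):
    """Count all overlapping k-mers made only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def log_odds_score(seq, promoter_counts, background_counts, k=6, pseudo=1.0):
    """Sum over k-mers of log(P(kmer | promoter) / P(kmer | background))."""
    # A pseudocount keeps unseen k-mers from producing log(0).
    promoter_total = sum(promoter_counts.values()) + pseudo * 4 ** k
    background_total = sum(background_counts.values()) + pseudo * 4 ** k
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not set(kmer) <= set("ACGT"):
            continue
        p = (promoter_counts[kmer] + pseudo) / promoter_total
        q = (background_counts[kmer] + pseudo) / background_total
        score += math.log(p / q)
    return score

# Toy training data with k=2 so the example stays readable; not real promoters.
promoter_model = kmer_counts(["CGCGCGCG"], k=2)
background_model = kmer_counts(["ATATATAT"], k=2)
gc_rich_score = log_odds_score("CGCG", promoter_model, background_model, k=2)
at_rich_score = log_odds_score("ATAT", promoter_model, background_model, k=2)
```

A positive score means the sequence looks more like the promoter training set than the background; with 6-mers the count tables have 4^6 = 4096 entries each, which is the complexity trade-off mentioned above.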

Structure features
They originate from the DNA 3D structures that characterize proximal promoters. DNA actually encodes in its sequence at least two independent levels of functional information:
- the DNA sequence, which encodes proteins and their regulatory elements
- the physical and structural properties of the DNA itself
Examples: dinucleotide properties (stacking energy, propeller twist), trinucleotide properties (bendability, nucleosome position preference). They have long-range interactions (up to 10 kbp), so they can exhibit properties not visible in the sequence.
For more information about structure features see Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13): PMID:
Long-range character of structure features: Faiger H, Ivanchenko M, Cohen I, Haran TE. TBP flanking sequences: asymmetry of binding, long-range effects and consensus sequences. Nucleic Acids Res. 2006; 34(1): PMID:
Are structure features different from sequence features? After all, structure features are calculated from the sequence (di- or trinucleotides). But it has been shown that sequence scales and structure scales differ: Baldi P, Chauvin Y, Brunak S, Gorodkin J, Pedersen AG. Computational applications of DNA structural scales. Proc Int Conf Intell Syst Mol Biol 1998; 6: PMID: Moreover, Florquin et al. (2005) clustered promoters based on structural properties, and the genes associated with each promoter in a cluster varied greatly for the different properties, so structural properties contain complementary information.
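A structural profile is obtained by replacing each di- or trinucleotide step with a tabulated value and smoothing along the sequence. A minimal sketch of that conversion; `TOY_SCALE` is a placeholder with made-up values (1.0 for A/T-only steps, 0.0 otherwise), not a published stacking-energy or twist table, and the function name is hypothetical:

```python
# Placeholder dinucleotide scale with invented values; substitute a published
# table (e.g. stacking energy or propeller twist) for real use.
TOY_SCALE = {
    a + b: (1.0 if a in "AT" and b in "AT" else 0.0)
    for a in "ACGT"
    for b in "ACGT"
}

def structural_profile(seq, scale, window=10):
    """Sliding-window mean of a di- or trinucleotide scale along a sequence.

    The sequence must contain only A, C, G, T; other symbols raise KeyError.
    """
    seq = seq.upper()
    step = len(next(iter(scale)))  # 2 for dinucleotide, 3 for trinucleotide scales
    values = [scale[seq[i:i + step]] for i in range(len(seq) - step + 1)]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

profile = structural_profile("AACC", TOY_SCALE, window=2)
```

A threshold applied to such a profile is the simplest possible predictor built on structure features.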

[Figure: model for cooperative assembly of an activated transcription-initiation complex.] This figure clearly shows why structural features such as flexibility are important. Sources: Molecular Cell Biology. 4th edition. Lodish H, Berk A, Zipursky SL, et al. New York: W. H. Freeman; 2000. Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003; 17(10):

Software
Signal features (two leading CpG predictors)
- FirstEF: different quadratic discriminant functions for CpG and non-CpG promoters; slightly improves performance by concentrating on regions around the first exon
- Eponine: TATA box and G+C-rich domain, Relevance Vector Machine
Context features
- PromoterInspector: IUPAC word groups with wildcards
Structure features
- McPromoter: DNA sequence, bending, DNA twist, ANN
- EP3: features from Florquin et al. (2005); prediction based just on a threshold imposed on the structural profile
Fickett & Hatzigeorgiou reviewed the field in 1997. At that time programs used mainly signals, and their performance was very poor. The most notable advance was PromoterInspector.
References:
- FirstEF: Davuluri, R.V., Grosse, I. & Zhang, M.Q. Computational identification of promoters and first exons in the human genome. Nat. Genet. 29, 412–417 (2001).
- CpGProD: Ponger, L. & Mouchiroud, D. CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 18, 631–633 (2002).
- Eponine: Down, T.A. & Hubbard, T.J. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 12, 458–461 (2002).
- PromoterInspector (now commercial): Scherf M, Klingenhoff A, Werner T. Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J Mol Biol Mar 31;297(3): PubMed PMID:
- Fickett JW, Hatzigeorgiou AG. Eukaryotic promoter recognition. Genome Res. 1997; 7(9): PubMed PMID:
- McPromoter: Ohler U, Niemann H, Liao Gc, Rubin GM. Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics. 2001;17 S PubMed PMID:
- Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255

Integrated approaches
Combine sequence, context and structural features.
- ARTS: SVM with sophisticated kernels; combines n-mers with structure features (e.g. twist angle, stacking energies); does not distinguish CpG-related promoters from unrelated ones, and it is not clear how it performs on non-CpG promoters
- SCS: sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models; their outcomes are combined by a decision tree
- CoreBoost: boosting technique with stumps; integrates core promoter signals, DNA flexibility, n-mer frequency, …
- CoreBoost_HM: adds experimental histone modification data
References:
- ARTS: Sonnenburg S, Zien A, Rätsch G. ARTS: accurate recognition of transcription starts in human. Bioinformatics. 2006; 22(14):e PMID:
- CoreBoost: Zhao X, Xuan Z, Zhang MQ. Boosting with stumps for predicting transcription start sites. Genome Biol. 2007; 8(2):R17. PubMed PMID:
- CoreBoost_HM: Wang X, Xuan Z, Zhao X, Li Y, Zhang MQ. High-resolution human core-promoter prediction with CoreBoost_HM. Genome Res. 2009; 19(2): PMID.
- SCS (available as Supplemental material): Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform. 2010; 7(3): PubMed PMID:

Boosting, stumps
Boosting
- Belongs to the ensemble methods: it produces a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate rules (weak learners, WLs) that are just a bit better than random guessing.
- Weak classifiers are learned iteratively and added to a final strong classifier.
- When a WL is added, it is weighted according to its accuracy.
- After a WL is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight. Thus, future WLs focus more on the examples that previous WLs misclassified.
Stump
- A one-level decision tree (i.e. it has one root and two terminal nodes).
This is just a general description of boosting. Many different algorithms exist (AdaBoost being the most popular); they differ in their weighting schemes etc. CoreBoost_HM uses the LogitBoost algorithm. Source: Wikipedia.
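The reweighting loop described above can be sketched in a few lines. The example below uses plain AdaBoost with decision stumps, chosen because it is the simplest variant; it is not the LogitBoost procedure used by CoreBoost_HM, and all function names and the toy data are invented for illustration:

```python
import math

def train_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] > thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] > thr else -polarity

def adaboost(X, y, rounds=10):
    """Train an ensemble of weighted stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)     # more accurate stumps get more say
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D problem that no single stump can solve: negatives sit in the middle.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
fitted = [predict(model, row) for row in X]
```

On this toy problem the best single stump misclassifies two of the six points, while three boosted stumps reproduce all labels.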

Databases
- EPD (Eukaryotic Promoter Database): manually annotated, non-redundant collection of eukaryotic Pol II promoters
- DBTSS
Putative core promoter: e.g. bp … +50 bp, -250 bp … +50 bp, -200 … +200 bp
References for EPD:
- Schmid, C.D., Perier, R., Praz, V. and Bucher, P. (2006) EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res, 34, D82-85.
- Schmid CD, Praz V, Delorenzi M, Périer R, Bucher P. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucleic Acids Res Jan 1;32(Database issue):D82-5.

Current state of promoter prediction
CpG island promoters are easier to predict than non-CpG promoters. CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So if a biologist wants to up- or down-regulate expression and you tell them they have a CpG island promoter, they are usually not happy. Non-CpG promoters correspond to tissue-specific expression, and they are the bottleneck in accurate promoter prediction.
The best way to find them is to use transcription data. Alignment of the 5′ ends of ESTs or full cDNAs can be indicative of the promoter sequence. However, conventional cDNAs are frequently truncated at the 5′ end and so lack part of the 5′ UTR. This is overcome by new mRNA cap cloning techniques (DBTSS).

Future directions
False positives are still the main problem. This is because information about chromatin structure is missing from prediction models. Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and leads to more FPs (and FNs because of the extra noise). Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA. This slide was prepared based on Zhang MQ. Computational analyses of eukaryotic promoters. BMC Bioinformatics. 2007;8 Suppl 6:S3. PubMed PMID:

PPP evaluation and comparison
Articles: whole genome and update

Motif discovery
So far we have discussed only one of the problems in computational promoter analysis: localization of the core promoter (TSS prediction). Another, related problem is the identification of cis-regulatory elements: motif discovery.
- More in Computational analyses of eukaryotic promoters, M. Q. Zhang, BMC Bioinformatics 2007, 8, and in Recent advances in computational promoter analysis in understanding …, Ping Qiu

References
Biology of transcriptional regulation: Pedersen AG, Baldi P, Chauvin Y, Brunak S. The biology of eukaryotic promoter prediction-a review. Comput. Chem Jun 15;23(3-4):
A comprehensive list of features and references to their models may be found in Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13): PMID:

Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform. 2010; 7(3): PubMed PMID: The Introduction contains a very nice overview of sequence, context and structure features and lists promoter prediction software using these features.

Large-scale software comparison
Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol Nov; 22(11): PMID:

IUPAC words
