Genomic organization and functional characterization of regulatory elements in higher eukaryotes Boris Lenhard Computational Biology Unit Bergen Center for Computational Science University of Bergen, Norway
Genome comparison reveals unknown functional elements. Figure: actin gene compared between human and mouse (y-axis: % identity).
Ultraconserved non-coding regions (UCR) in vertebrate genomes a.k.a. Conserved non-coding elements (CNE) a.k.a. Conserved non-genic sequences (CNG) a.k.a. Highly conserved non-coding regions (HCNR)
There exist unusually highly conserved noncoding elements in vertebrate genomes
Ultraconserved regions (UCR) in vertebrate genomes. Definition of UCR: >= 50 bp; human:mouse identity >95%; no coding potential. 3583 human:mouse UCRs have detectable conservation in Fugu. A few [dozen] characterized, all as long-range enhancers. Many UCRs occur in clusters spanning hundreds of kilobases:
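The length and identity criteria in the definition above can be sketched as a sliding-window scan. This is a minimal illustration, not the procedure used in the talk: it assumes the human and mouse sequences are already aligned to equal length (gaps as "-"), the function name `find_conserved_windows` is hypothetical, and the "no coding potential" filter is not implemented.

```python
def find_conserved_windows(human, mouse, min_len=50, min_identity=0.95):
    """Return merged (start, end) half-open intervals covered by windows of
    min_len alignment columns whose identity exceeds min_identity.

    Assumption: `human` and `mouse` are pre-aligned strings of equal length.
    Merged intervals are unions of qualifying windows; the union itself is
    not re-checked against the identity threshold.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    n = len(human)
    if n < min_len:
        return []
    # 1 where the aligned bases match; gap columns never count as matches
    match = [int(a == b and a != "-")
             for a, b in zip(human.upper(), mouse.upper())]
    window_matches = sum(match[:min_len])
    regions = []
    for start in range(n - min_len + 1):
        if start > 0:
            # slide the window one column to the right
            window_matches += match[start + min_len - 1] - match[start - 1]
        if window_matches / min_len > min_identity:
            end = start + min_len
            if regions and start <= regions[-1][1]:
                # overlapping or adjacent to the previous interval: extend it
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

With the default thresholds, a 50-column window qualifies only if it contains at most two mismatching columns (48/50 = 96%).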
What genes are UCRs associated with?
Nr. | Nr. UCRs | Gene symbol | Description | InterPro domains
1 | 84 | MEIS2 | Meis1, myeloid ecotropic viral integration site 1 homolog 2 (mouse) | Homeobox
2 | 81 | ZFHX1B | zinc finger homeobox 1b | Homeobox; Zn-finger, C2H2 type
3 | 80 | KIAA0390 | KIAA0390 gene product | Znf_C2H2; NLS_BP
4 | 79 | EBF-3 | COE3_HUMAN, Transcription factor COE3 (Early B-cell factor 3) (EBF-3) | COE
5 | 77 | ZNF503 | zinc finger protein 503 | Znf_PHD; Znf_C2H2; Eggshell
6 | 64 | IRX-3, IRX-5, IRX-6 | Iroquois-class protein IRX-3, Iroquois-class protein IRX-5, Iroquois-class protein IRX-6 | Homeobox
7 | 62 | PBX3 | pre-B-cell leukemia transcription factor 3 | PBX; Homeobox
8 | 62 | NR2F1 | nuclear receptor subfamily 2, group F, member 1 | Hormone_rec_lig; Stdhrmn_receptor; Str_ncl_receptor; Znf_C4steroid
9 | 60 | FOXP2 / TFEC | forkhead box P2 (immune tolerance development) / Similar to transcription factor EC | Involucrin_rpt; TF_Fork_head; Znf_C2H2 / HLH_basic
10 | 52 | DACH | dachshund homolog (Drosophila) | Transform_Ski
What genes are UCRs associated with? 10 top UCR clusters: Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99.
What genes are UCRs associated with? Out of the 150 most prominent UCR clusters, at least 144 coincide with one or more genes for DNA binding proteins (generally transcription factors). Among them are most key regulators of animal development: HOX clusters, Iroquois genes, GSH1, GSH2, PPAR , LMO1… Many are associated with malignancies and recurring chromosomal breakpoints/rearrangement sites: MEIS2, PBX3, BCL11A, MEIS1, LMO4, BCL11B, EVI1...
Quantitative evidence I: Categories of genes in the vicinity of UCRs. Over-representation of protein domains in genes flanking UCRs. Bonferroni-corrected and uncorrected Fisher Exact Test p-values are shown for the 16 most over-represented INTERPRO domains. Typical transcription factor domains are in bold. Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99. 50% of Homeobox-containing genes, 20% of forkheads, 20% of nuclear receptors and 8% of zinc finger proteins are within 200 kb of a UCR. Only 3% of random genes are within 200 kb of a UCR.
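The kind of test named on this slide can be sketched as follows. This is an illustrative one-sided Fisher exact test (hypergeometric upper tail) with a Bonferroni correction, written with the standard library only; the function names and all counts in the usage lines are hypothetical and are not taken from the paper.

```python
from math import comb


def fisher_exact_greater(k, n_near, K, N):
    """One-sided Fisher exact p-value for over-representation.

    k      : genes near a UCR that carry the domain
    n_near : all genes near a UCR
    K      : all genes carrying the domain
    N      : all genes considered
    Returns P(X >= k) under the hypergeometric null.
    """
    upper = min(K, n_near)
    total = comb(N, n_near)
    tail = sum(comb(K, i) * comb(N - K, n_near - i)
               for i in range(k, upper + 1))
    return tail / total


def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts, for illustration only (not from the source)
p_raw = fisher_exact_greater(k=100, n_near=900, K=200, N=20000)
p_corrected = bonferroni(p_raw, n_tests=1000)
print(p_raw, p_corrected)
```

Here `n_tests` stands for the number of domains tested, which the slide does not state; it is left as a parameter.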
What is the function of UCRs (cont’d)? Most known ones: enhancers. A very small fraction: pre-microRNA genes, which can be easily distinguished from putative enhancer elements by a distinct conservation pattern between mammals and fish and a different binding site pattern composition than most other UCRs. Figure label: Pre-miRNA gene.
Putative conserved regulatory elements show distinct motif compositions
Mean number of sites (per 400 bp) detected in conserved noncoding elements, by motif:
Motif | Elements associated with embryonic development genes | Elements containing known pre-miRNA genes | Is the difference significant (T-test)?
pax6 | 0.50 | 0.32 | No (P=0.07)
nkx6.1 | 5.83 | 2.43 | Yes (P=1.1e-08)
nkx2.2 | 1.05 | 1.11 | No (P=0.83)
gsh2 | 5.13 | 1.80 | Yes (P=7.8e-15)
homeobox domain | 9.56 | 4.06 | Yes (P=1.8e-09)
oct family | 4.66 | 2.54 | Yes (P=2.3e-06)
sox-5 | 6.12 | 4.46 | Yes (P=0.002)
MOST UCRs CONTAIN A HIGH DENSITY OF BINDING SITES FOR KEY DEVELOPMENTAL TRANSCRIPTION FACTORS.
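A comparison of mean site counts between two sets of elements can be sketched as below. The slide says only "T-test"; Welch's unequal-variance variant is assumed here, the function name `welch_t` is hypothetical, and the inputs are per-element site counts that the slide does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom.

    Each sample is a list of per-element site counts (sites per 400 bp),
    with at least two values and non-zero variance in at least one sample.
    """
    na, nb = len(sample_a), len(sample_b)
    # squared standard errors of the two sample means
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

The statistic and degrees of freedom can then be converted to a p-value with any t-distribution implementation, for example the one in SciPy.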
Can we recognize the “neural” ultraconserved enhancers? Most UCRs show a high overrepresentation of a number of putative transcription factor binding site motifs: general homeobox motifs, Sox (SRY) and Oct (POU). Sox2 and Oct3/4 are highly expressed in mouse ES cells (Nagano K et al. (2005) Proteomics 5:1346-61). Oct and Sox transcription factors control many different aspects of neural development and embryogenesis, often binding to adjacent sites on DNA (Williams, D. C. et al. (2004) J. Biol. Chem. 279:1449-1457).
The SPH (Sox-Oct-Homeobox) model: A simple screen to select UCRs governing neural expression. The model measures the combined probability of occurrence of Sox, Oct (POU) and core homeobox motifs in 400 bp regions centered on UCRs.
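The slide does not say how the combined probability is computed, so the following is only one possible reading, sketched under loud assumptions: motifs are reduced to simplified consensus strings, background counts are modelled as Poisson with uniform base frequencies, and the three motif classes are treated as independent so that their tail probabilities multiply. All names (`MOTIFS`, `sph_score`, and so on) are hypothetical.

```python
import re
from math import exp, lgamma, log, log10

# Simplified, illustrative consensus strings (assumptions, not the
# matrices used in the talk). None of them is its own reverse complement,
# so scanning both strands does not double-count a single site.
MOTIFS = {"sox": "AACAAT", "oct": "ATGCAAAT", "homeobox": "TAAT"}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_sites(seq, motif):
    """Count (possibly overlapping) exact motif matches on both strands."""
    seq = seq.upper()
    pattern = re.compile("(?=%s)" % motif)
    return len(pattern.findall(seq)) + len(pattern.findall(revcomp(seq)))


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    term = exp(-lam + k * log(lam) - lgamma(k + 1))
    total = 0.0
    for i in range(k, k + 200):  # 200 terms is ample for small lam
        total += term
        term *= lam / (i + 1)
    return min(total, 1.0)


def sph_score(window):
    """-log10 of the product of per-motif tail probabilities.

    Assumes independence between motif classes and a uniform-base
    background; higher scores mean stronger joint enrichment.
    """
    if len(window) < max(len(m) for m in MOTIFS.values()):
        raise ValueError("window shorter than the longest motif")
    score = 0.0
    for motif in MOTIFS.values():
        positions = len(window) - len(motif) + 1
        lam = 2 * positions * 0.25 ** len(motif)  # both strands
        p = poisson_tail(count_sites(window, motif), lam)
        score -= log10(max(p, 1e-300))  # floor avoids log10(0)
    return score
```

In use, `window` would be the 400 bp region centered on a UCR, and regions would be ranked by score; any cutoff for calling a region "SPH-enriched" would have to be chosen separately.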
SPH-enriched UCRs around genes coding for known neural patterning regulators
SPH-model detects genomic regions with neural expression
UCRs: common to all metazoan genomes? Figure panels: Vertebrates; Drosophila. Protein domain labels: ETS, TIR, Homeobox, Paired, Cfc4, NHR ligand, von Willebrand factor type C domain, Laminin g, Immunoglobulin, Fibronectin type III, Cadherin, Cyclic nucleotide-binding domain, Neurotransmitter-gated ion-channel transmembrane region, Ligand-gated ion channel, Neurotransmitter-gated ion-channel ligand binding domain, BTB/POZ domain.
Large-scale mapping of transcription start sites using CAGE (Cap Analysis of Gene Expression). Like SAGE, but 5’ ends of cDNAs (using RIKEN 5’ GTP cap trapping technology). Large-scale sequencing of 5’ ends (CAGE tags of 20-22 nucleotides) of mRNAs: ~6.5 million mouse and ~4 million human CAGE tags uniquely mapped to genome.
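Because each CAGE tag marks the 5’ end of a transcript, mapped tags can be turned into per-position start site counts. A minimal sketch, assuming tags are given as `(chrom, start, end, strand)` tuples in 0-based half-open coordinates; this input format and the function name `tss_counts` are assumptions for illustration.

```python
from collections import Counter


def tss_counts(tags):
    """Count mapped tags per transcription start position.

    tags: iterable of (chrom, start, end, strand), 0-based half-open.
    The 5' end of a plus-strand tag is its first base; for a minus-strand
    tag it is the last base of the mapped interval.
    Returns a Counter keyed by (chrom, strand, position).
    """
    counts = Counter()
    for chrom, start, end, strand in tags:
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, strand, five_prime)] += 1
    return counts
```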
Single-peak (SP) vs. broad (BR) core promoters: “shape classes” of core promoters
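The two-way distinction between single-peak and broad promoters can be illustrated with a simple rule on the tag distribution. The criterion below (share of tags within a few base pairs of the dominant start position) and its thresholds are hypothetical choices made for this sketch, not the classification used in the talk, and other shape classes are not modelled.

```python
def shape_class(position_counts, half_width=2, min_fraction=0.5):
    """Label a promoter "SP" (single peak) or "BR" (broad).

    position_counts: dict mapping genomic position -> CAGE tag count
    for one promoter region. `half_width` and `min_fraction` are
    illustrative thresholds.
    """
    total = sum(position_counts.values())
    if total == 0:
        raise ValueError("no tags in region")
    # dominant start position = position with the most tags
    peak = max(position_counts, key=position_counts.get)
    near_peak = sum(count for pos, count in position_counts.items()
                    if abs(pos - peak) <= half_width)
    return "SP" if near_peak / total >= min_fraction else "BR"
```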
Association of shape classes with different core promoter elements
Overrepresentation and underrepresentation of core promoter elements in different shape classes.
A
Element | SP | BR | PB | MU
TATA (all) | 3.1e-73 | 1.9e-16 | 1.8e-10 | 2.4e-09
CCAAT (all) | 0.04 | 0.42 | 0.37 | 0.49
GC (all) | 1e-4 | 0.20 | 0.40 | 0.33
CpG (all) | 1.0e-137 | 1.4e-65 | 8.7e-06 | 0.02
B
Element | SP | BR | PB | MU
TATA (no CpG) | 2.6e-77 | 1.6e-16 | 2.8e-16 | 1.0e-09
CCAAT (no CpG) | 6.8e-23 | 9.2e-16 | 0.11 | 0.42
GC (no CpG) | 7.8e-25 | 5.9e-18 | 0.48 | 0.35
CpG (no TATA, CCAAT or GC) | 4.8e-45 | 4.7e-17 | 3.4e-05 | 0.87
SP (single peak) promoters: strongly associated with TATA boxes.
BR (broad) promoters: strongly associated with CpG islands and absence of TATA box.
Conclusions. Key vertebrate (and most likely invertebrate) transcription factor genes are controlled by arrays of highly conserved regulatory elements; the arrays often span more than a megabase around their target genes. Highly conserved regulatory elements contain clusters of putative transcription factor binding sites indicative of their function, enabling the building of predictive models. There are fundamentally different classes of vertebrate core promoters, differing in mechanism of transcriptional initiation and choice of TSS, tissue specificity, evolutionary dynamics and responsiveness to long-range enhancers.
Acknowledgements. Lenhard Group at CGB, Karolinska Institutet (now at Bergen Center for Computational Science, University of Bergen): Pär Engström (PhD student); Ying Sheng (PhD student); Albin Sandelin (Postdoc) – now at RIKEN GSC; Sara Bruce (Project student) – now at Dept. of Bioscience, Karolinska Institutet. Collaborators: RIKEN Genome Science Center: Piero Carninci and the members of FANTOM3 Consortium. Wyeth Wasserman group (University of British Columbia): Shannan Ho Sui, David Arenillas. Johan Ericson group (CMB, Karolinska Institutet): Peter Bailey, Joanna Klos.