Genomic organization and functional characterization of regulatory elements in higher eukaryotes Boris Lenhard Computational Biology Unit Bergen Center for Computational Science University of Bergen, Norway
Genome comparison reveals unknown functional elements [Figure: % identity plot of the actin gene compared between human and mouse]
Ultraconserved non-coding regions (UCR) in vertebrate genomes a.k.a. Conserved non-coding elements (CNE) a.k.a. Conserved non-genic sequences (CNG) a.k.a. Highly conserved non-coding regions (HCNR)
There exist unusually highly conserved noncoding elements in vertebrate genomes
Ultraconserved regions (UCR) in vertebrate genomes. Definition of UCR: >= 50 bp; human:mouse identity > 95%; no coding potential. 3583 human:mouse UCRs have detectable conservation in Fugu. A few [dozen] characterized, all as long-range enhancers. Many UCRs occur in clusters spanning hundreds of kilobases.
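The length and identity criteria in the UCR definition can be illustrated with a scan over a gapped pairwise alignment. This is a minimal sketch, not the pipeline used in the presentation: the function name `find_conserved_windows`, the fixed-window scheme and the merging of overlapping windows are assumptions made for illustration, and the "no coding potential" filter is not implemented.

```python
def find_conserved_windows(human, mouse, min_len=50, min_identity=0.95):
    """Return merged (start, end) alignment intervals (end exclusive) covered by
    windows of min_len columns whose identity exceeds min_identity.

    `human` and `mouse` are two rows of one pairwise alignment, so they must
    have equal length; '-' marks a gap and never counts as a match.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    match = [int(h == m and h != "-") for h, m in zip(human.upper(), mouse.upper())]
    regions = []
    if len(match) < min_len:
        return regions
    window_matches = sum(match[:min_len])
    for start in range(len(match) - min_len + 1):
        if start > 0:
            # Slide the window one column to the right.
            window_matches += match[start + min_len - 1] - match[start - 1]
        if window_matches > min_identity * min_len:
            end = start + min_len
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent to the previous hit: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Because a window passes as a whole, a merged region can include a few mismatched columns at its edges; a real screen would also need to map alignment columns back to genome coordinates.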
What genes are UCRs associated with?
Nr. | Nr. of UCRs | Gene symbol | Description | InterPro domains
1 | 84 | MEIS2 | Meis1, myeloid ecotropic viral integration site 1 homolog 2 (mouse) | Homeobox
2 | 81 | ZFHX1B | zinc finger homeobox 1b | Homeobox; Zn-finger, C2H2 type
3 | 80 | KIAA0390 | KIAA0390 gene product | Znf_C2H2; NLS_BP
4 | 79 | EBF-3 | COE3_HUMAN, Transcription factor COE3 (Early B-cell factor 3) (EBF-3) | COE
5 | 77 | ZNF503 | zinc finger protein 503 | Znf_PHD; Znf_C2H2; Eggshell
6 | 64 | IRX-3, IRX-5, IRX-6 | Iroquois-class protein IRX-3; Iroquois-class protein IRX-5; Iroquois-class protein IRX-6 | Homeobox
7 | 62 | PBX3 | pre-B-cell leukemia transcription factor 3 | PBX; Homeobox
8 | 62 | NR2F1 | nuclear receptor subfamily 2, group F, member 1 | Hormone_rec_lig; Stdhrmn_receptor; Str_ncl_receptor; Znf_C4steroid
9 | 60 | FOXP, TFEC | forkhead box P2 (immune tolerance development); Similar to transcription factor EC | Involucrin_rpt; TF_Fork_head; Znf_C2H; HLH_basic
10 | 52 | DACH | dachshund homolog (Drosophila) | Transform_Ski
What genes are UCRs associated with? 10 top UCR clusters: Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99.
Nr. | Nr. of UCRs | Gene symbol | Description | InterPro domains
11 | 52 | PAX2 | paired box gene 2 (kidney, differentiation, eyes, CNS) | Paired_box; Homeobox
12 | 52 | FOXP1 | forkhead box P1 (specification and differentiation of lung epithelium) | TF_Fork_head; Znf_C2H
[missing] | [missing] | BCL11A | B-cell lymphoma/leukemia 11A (B-cell CLL/lymphoma 11A) (COUP-TF interacting protein 1) (Ecotropic viral integration site 9 protein) (EVI-9) | Znf_C2H
[missing] | [missing] | IRX-4, IRX-2, IRX-1 | IRX-4; IRX-2; IRX-1 | Homeobox
15 | 46 | ATF, EVX, HOX-D* | activating transcription factor 2 (brain); HOMEOBOX EVEN-SKIPPED HOMOLOG PROTEIN 2 (EVX-2); HOX-D cluster | Znf_C2H2; TF_bZIP; Homeobox; Antifreeze_1; HTH_lambrepressr; CytC_heme_bind; Homeobox; HTH_lambrepressr
16 | 41 | NR4A2 | nuclear receptor subfamily 4, group A, member 2 (brain) | Znf_C4steroid; hormone_rec_lig
17 | 39 | FOXD3 | forkhead, box D3, at chr1: | TF_Fork_head
18 | 38 | LMO, KIAA1221 | LIM domain only; KIAA1221 (brain) | LIM; Znf_C2H
[missing] | [missing] | ZNF407 | zinc finger protein 407 | Znf_C2H
[missing] | [missing] | MEIS1 | Meis1, myeloid ecotropic viral integration site 1 homolog (mouse) | Homeobox
What genes are UCRs associated with? Out of the 150 most prominent UCR clusters, at least 144 coincide with one or more genes for DNA-binding proteins (generally transcription factors). Among them are most key regulators of animal development: HOX clusters, Iroquois genes, GSH1, GSH2, PPAR , LMO1… Many are associated with malignancies and recurring chromosomal breakpoints/rearrangement sites: MEIS2, PBX3, BCL11A, MEIS1, LMO4, BCL11B, EVI1...
Quantitative evidence I: Categories of genes in the vicinity of UCRs. Over-representation of protein domains in genes flanking UCRs. Bonferroni-corrected and uncorrected Fisher exact test p-values are shown for the 16 most over-represented INTERPRO domains. Typical transcription factor domains are in bold. Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99. 50% of homeobox-containing genes, 20% of forkheads, 20% of nuclear receptors and 8% of zinc finger proteins are within 200 kb of a UCR. Only 3% of random genes are within 200 kb of a UCR.
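The domain over-representation test named on this slide can be sketched with a one-sided Fisher exact test followed by a Bonferroni correction. This is an illustrative sketch only: the function names and the 2x2 table layout are assumptions, and no gene counts from the study are reproduced here.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P-value for over-representation.

    Assumed 2x2 table layout:
        a = genes near a UCR with the domain     b = genes near a UCR without it
        c = other genes with the domain          d = other genes without it
    Returns P(X >= a) under the hypergeometric null.
    """
    near = a + b          # genes flanking UCRs
    with_domain = a + c   # genes carrying the domain
    total = a + b + c + d
    upper = min(near, with_domain)
    favourable = sum(
        comb(with_domain, k) * comb(total - with_domain, near - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, near)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

With many domains tested at once, each raw P-value is multiplied by the number of domains tested, which is why the slide reports both corrected and uncorrected values.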
What is the function of UCRs (cont’d)? Most known ones: enhancers. A very small fraction: pre-microRNA genes. These can be easily distinguished from putative enhancer elements: a distinct conservation pattern between mammals and fish; a different binding site pattern composition than most other UCRs. [Figure label: Pre-miRNA gene]
Putative conserved regulatory elements show distinct motif compositions
[Figure: mean number of sites (per 400 bp) detected in conserved noncoding elements associated with embryonic development genes vs. mean number of sites (per 400 bp) detected in conserved noncoding elements containing known pre-miRNA genes]
Motif | Is the difference significant (t-test)?
pax6 | No (P=0.07)
nkx6.1 | Yes (P=1.1e-08)
nkx2.2 | No (P=0.83)
gsh2 | Yes (P=7.8e-15)
homeobox domain | Yes (P=1.8e-09)
oct family | Yes (P=2.3e-06)
sox-5 | Yes (P=0.002)
MOST UCRs CONTAIN A HIGH DENSITY OF BINDING SITES FOR KEY DEVELOPMENTAL TRANSCRIPTION FACTORS.
Can we recognize the “neural” ultraconserved enhancers? Most UCRs show a high overrepresentation of a number of putative transcription factor binding site motifs: general homeobox motifs, Sox (SRY) and Oct (POU). Sox2 and Oct3/4 are highly expressed in mouse ES cells (Nagano K et al. (2005) Proteomics 5: ). Oct and Sox transcription factors control many different aspects of neural development and embryogenesis, often binding to adjacent sites on DNA (Williams, D. C. et al. (2004) J. Biol. Chem. 279: ).
The SPH (Sox-Oct-Homeobox) model: A simple screen to select UCRs governing neural expression. The model measures the combined probability of occurrence of Sox, Oct (POU) and core homeobox motifs in 400 bp regions centered on UCRs.
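One way to picture a combined-probability screen of this kind is sketched below. It is not the SPH model itself, whose motif matrices and scoring are not given in the transcript. The assumptions here are: simple consensus patterns stand in for the Sox, Oct and homeobox motifs, background hits follow a Poisson distribution under uniform base composition, and the three motif classes are treated as independent so their tail probabilities multiply. All names (`MOTIFS`, `sph_score` and the helpers) are hypothetical.

```python
import math
import re

# Illustrative consensus patterns (assumptions, not the talk's motif models).
MOTIFS = {
    "sox": "[AT][AT]CAA[AT]G",
    "oct": "ATGCAAAT",
    "homeobox": "TAAT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_sites(seq, pattern):
    """Count matches, including overlapping ones, on both strands."""
    regex = re.compile("(?=%s)" % pattern)
    reverse = seq.translate(COMPLEMENT)[::-1]
    return sum(1 for strand in (seq, reverse) for _ in regex.finditer(strand))


def expected_sites(pattern, length):
    """Expected matches on both strands, assuming uniform base composition."""
    tokens = re.findall(r"\[[ACGT]+\]|[ACGT]", pattern)
    per_position = math.prod(len(tok.strip("[]")) / 4 for tok in tokens)
    return 2 * max(length - len(tokens) + 1, 0) * per_position


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly so small tails stay accurate."""
    if k == 0:
        return 1.0
    total = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k, k + 200)
    )
    return min(1.0, max(total, 1e-300))


def sph_score(seq):
    """-log10 of the product of per-motif tail probabilities (independence assumed)."""
    seq = seq.upper()
    if len(seq) < 8:
        raise ValueError("sequence shorter than the longest motif")
    score = 0.0
    for pattern in MOTIFS.values():
        tail = poisson_tail(count_sites(seq, pattern), expected_sites(pattern, len(seq)))
        score -= math.log10(tail)
    return score
```

Applied to 400 bp windows centered on each UCR, a higher score means the window holds more Sox, Oct and homeobox-like sites than the background model expects; palindromic matches are counted once per strand in this sketch.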
SPH-enriched UCRs around genes coding for known neural patterning regulators
SPH-model detects genomic regions with neural expression
UCRs: common to all metazoan genomes? Vertebrates / Drosophila: ETS; TIR; Homeobox; Paired; Cfc4; NHR ligand; von Willebrand factor type C domain; Laminin g; Immunoglobulin; Fibronectin type III; Cadherin; Cyclic nucleotide-binding domain; Neurotransmitter-gated ion-channel transmembrane region; Ligand-gated ion channel; Neurotransmitter-gated ion-channel ligand binding domain; BTB/POZ domain
UCRs in Drosophila twist locus
Core promoters and responsiveness to long-range enhancers
A textbook-type core promoter [Figure labels: TATA, GC-box, CAAT]
Large-scale mapping of transcription start sites using CAGE (Cap Analysis of Gene Expression). Like SAGE, but 5’ ends of cDNAs (using RIKEN 5’ GTP cap trapping technology). Large-scale sequencing of 5’ ends (CAGE tags of nucleotides) of mRNAs: ~6.5 million mouse and ~4 million human CAGE tags uniquely mapped to genome.
Single-peak (SP) vs. broad (BR) core promoters: “shape classes” of core promoters
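The single-peak versus broad distinction can be illustrated by measuring how tightly the CAGE tags of one promoter cluster around a position. The transcript does not give the classification rule, so the sketch below rests on assumptions: the spread is measured as the distance between the 25th and 75th percentiles of tag positions, the 4 bp cutoff is a hypothetical threshold, and only SP and BR are separated (the PB and MU classes are not modelled).

```python
def quantile_position(tag_counts, q):
    """Smallest position at which the cumulative tag count reaches fraction q."""
    total = sum(tag_counts.values())
    running = 0
    for pos in sorted(tag_counts):
        running += tag_counts[pos]
        if running >= q * total:
            return pos
    raise ValueError("tag_counts must contain at least one tag")


def classify_shape(tag_counts, max_sp_width=4):
    """Label a tag cluster 'SP' or 'BR' from the spread of its tag start positions.

    tag_counts maps a genomic position to the number of CAGE tags starting there.
    max_sp_width is an assumed cutoff in bp, not a value from the presentation.
    """
    width = quantile_position(tag_counts, 0.75) - quantile_position(tag_counts, 0.25)
    return "SP" if width < max_sp_width else "BR"
```

For example, a cluster with nearly all tags at one position has an interquartile width of 0 and is labelled SP, while tags spread evenly over 60 bp give a width of about 30 bp and are labelled BR.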
Association of shape classes with different core promoter elements. Overrepresentation and underrepresentation of core promoter elements in different shape classes.
A
Element | SP | BR | PB | MU
TATA (all) | 3.1e-73 | 1.9e-16 | 1.8e-10 | 2.4e-09
CCAAT (all) |
GC (all) | 1e
CpG (all) | 1.0e e-658.7e
B
Element | SP | BR | PB | MU
TATA (no CpG) | 2.6e-77 | 1.6e-16 | 2.8e-16 | 1.0e-09
CCAAT (no CpG) | 6.8e-23 | 9.2e
GC (no CpG) | 7.8e-25 | 5.9e
CpG (no TATA, CCAAT or GC) | 4.8e-45 | 4.7e-17 | 3.4e
SP (single peak) promoters: strongly associated with TATA boxes. BR (broad) promoters: strongly associated with CpG islands and absence of TATA box.
Association of shape classes with tissue specificity
Tissue | SP, BR, PB, MU
adipose | 1.98 P= P= P= P=0.47
cns | 1.02 P= P= P= P=0.10
embryo | 4.11 P=1.21e P=6.22e P= P=8.096e-05
liver | 2.15 P=3.56e P=1.14e P= P=0.56
lung | 2.41 P=1.37e P=1.42e P= P=0.049
macrophage | 1.39 P= P= P= P=0.14
other | 3.59 P=3.87e P=4.029e P= P=0.016
testis | 4.36 P=7.70e P= P=0.21
[Color scale: overrepresented / underrepresented]
SP (single peak) promoters (and, by association, TATA-box promoters): strongly associated with tissue-specific genes (except brain). BR (broad) promoters (and, by association, CpG island overlapping and TATA-less promoters): strongly associated with housekeeping genes (and developmental regulatory genes).
Conclusions: Key vertebrate (and most likely invertebrate) transcription factor genes are controlled by arrays of highly conserved regulatory elements; the arrays often span more than a megabase around their target genes. Highly conserved regulatory elements contain clusters of putative transcription factor binding sites indicative of their function, enabling the building of predictive models. There are fundamentally different classes of vertebrate core promoters, differing in mechanism of transcriptional initiation and choice of TSS, tissue specificity, evolutionary dynamics and responsiveness to long-range enhancers.
Acknowledgements. Lenhard Group at CGB, Karolinska Institutet (now at Bergen Center for Computational Science, University of Bergen): Pär Engström (PhD student); Ying Sheng (PhD student); Albin Sandelin (Postdoc), now at RIKEN GSC; Sara Bruce (Project student), now at Dept. of Bioscience, Karolinska Institutet. Collaborators: RIKEN Genome Science Center: Piero Carninci and the members of FANTOM3 Consortium. Wyeth Wasserman group (University of British Columbia): Shannan Ho Sui, David Arenillas. Johan Ericson group (CMB, Karolinska Institutet): Peter Bailey, Joanna Klos.