Computational Analysis of Tissue Specificity: Decoding Promoters Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania.

Computational Analysis of Tissue Specificity: Decoding Promoters Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania Nov. 17, 2004 Department of Physiology Seminar Series University of Kentucky

What is the code for determining where (and when) a gene is expressed? http://molbio.info.nih.gov/molbio/gcode.html
Figure labels: TFBS1, TFBS4, TFBS3, TFBS4, TFBS2, TFBS1; Expression
TFBS = transcription factor binding site

Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression From Wasserman & Sandelin, NRG 2004

A Genomics Unified Schema approach to understanding gene expression Dave Barkan, Jonathan Crabtree, Shailesh Date, Steve Fischer, Bindu Gajria, Thomas Gan, Greg Grant, Hongxian He, John Iodice, Li Li, Junmin Liu, Matt Mailman, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics

GUS: Stem Cell Gene Anatomy Project; Beta Cell Biology Consortium; Plasmodium Genome Resource; Allgenes (human and mouse DoTS)

GUS is an open source project
Core, SRES, TESS, RAD, DoTS; Oracle RDBMS; Object Layer for Data Loading; Java Servlets
Sanger Institute, U. Georgia, Flora Centromere Database, U. Chicago, U. Penn, U. Toronto, Phytophthora sojae genome, Virginia Bioinformatics Institute

GUS (Genomics Unified Schema) http://www.gusdb.org
Namespace | Domain | Features
Core | Data Provenance | Documentation
Sres | Shared Resources | Ontologies
DoTS | Sequence and annotation | EST clusters and gene models
RAD | Gene Expression | MIAME/MAGE-OM
TESS | Gene Regulation | TFBS organization

Figure labels: RAD; EST clustering and assembly; DoTS; Genomic alignment and comparative sequence analysis; Identify shared TF binding sites; TESS; BioMaterial annotation; SRES

DoTS integrates sequence annotation including where expressed

Sorbs1: sorbin and SH3 domain containing 1
GO molecular function: actin binding and protein kinase binding
GO cellular component: actin cytoskeletal stress fibers
Tissue lists shown on the slide:
- kidney, mammary gland, brain, liver, colon, lung, retina, spinal cord, rhabdomyosarcoma cell line
- brain, liver, kidney, lung, melanocyte
- embryo, fetus, kidney, limb, retina, salivary gland
- brain, rhabdomyosarcoma cell line, kidney

RAD Contains Detailed Expression Experiments Including Tissue Surveys

TESS Allows You to Find Potential TFBS. But there are too many potential sites!

Promoter Features Related to Tissue Specificity as Measured by Shannon Entropy
Jonathan Schug (1), Winfried-Paul Schuller (2), Claudia Kappen (2), J. Michael Salbaum (2), Maja Bucan (3), Christian J. Stoeckert Jr. (1)
1. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA
2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, 68198, USA
3. Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA

What is a Liver-Specific Gene? * http://expression.gnf.org/

Assessing Tissue Specificity of Genes Using Shannon Entropy
Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue specificity.
To measure specificity to a particular tissue, we combine the entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for every tissue under uniform expression.
(a) Very specific liver expression: H = 1.6 and Qliver = 2.2, 98612_at cytochrome P450
(b) Near-uniform expression: H = 4.3 and Qliver = 10.2, 104391_s_at Clcn7 chloride channel 7
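
A minimal sketch of the two scores follows. The slide does not give the formula for Q, so the code assumes Q = H - lg(p_t), where p_t is the gene's relative expression in tissue t. That assumption reproduces the stated endpoint Q = 0 for a gene expressed in a single tissue, and gives Q = 2 lg T under uniform expression. The function names are illustrative, not from the talk.

```python
import math

def relative_expression(levels):
    """Convert raw expression levels across tissues into a probability distribution."""
    total = sum(levels)
    return [x / total for x in levels]

def shannon_entropy(p):
    """H in bits: 0 for a single-tissue gene, lg T for uniform expression in T tissues."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def q_score(p, tissue_index):
    """Assumed definition: Q = H - lg(p_t). Lower Q means more specific to that tissue."""
    p_t = p[tissue_index]
    if p_t == 0:
        return math.inf  # gene not expressed in this tissue at all
    return shannon_entropy(p) - math.log2(p_t)

# A gene expressed only in tissue 2, and a gene expressed uniformly in 8 tissues
single = relative_expression([0, 0, 5, 0])
uniform = relative_expression([1] * 8)
print(shannon_entropy(single), q_score(single, 2))
print(shannon_entropy(uniform), q_score(uniform, 0))
```

Ranking genes by increasing Q for one tissue then puts the most tissue-specific genes first.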

Agreement between Microarrays and ESTs on Tissue Specificity

Specificity Characteristics of Tissues

CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >=0.5
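
The CpG island criteria on this slide can be checked with a few lines of code. The sketch below applies the slide's thresholds to a whole sequence rather than scanning with a sliding window, and it assumes the usual observed/expected ratio, (number of CG dinucleotides x length) / (number of C x number of G). Function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a DNA sequence."""
    s = seq.upper()
    n = len(s)
    c, g, cg = s.count("C"), s.count("G"), s.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed formula: observed CpG count over the count expected from base composition
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def is_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's thresholds: length >= 200 bp, C+G > 0.6, obs./expect. >= 0.5."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp

print(is_cpg_island("CG" * 100))  # CpG-rich toy sequence
print(is_cpg_island("AT" * 100))  # no C or G at all
```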

Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions
Figure panels: CpG+ and CpG-; Multi-Tissue (H >= 4.4) and Tissue Specific (H <= 3.5)

TATA Boxes are Associated with Tissue-Specific Genes

Functional relationships of promoter classes based on over-represented GO terms (EASE)

First Clues: TATA Box indicates Tissue Specific; CpG indicates Wide-Spread Expression
Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

Pattern Analysis of Pancreas Gene Promoters Guang (Gary) Chen, Jonathan Schug

Identifying TFBMs – Method Pipeline
Starting with a gene expression tissue survey, pancreas-specific genes with common TFBS and biological processes are identified.
Pipeline figure boxes: GNF Gene Expression Atlas; Shannon Entropy; Gene Lists with Tissue Specificity; DBTSS Sequences around Transcription Start Sites; Teiresias; Patterns; Pattern Clustering; Pattern Clusters (PWM); Comparative Genome Analysis; Represent Seqs with PWMs; Gene Clusters; Gene Ontology (GO); GO Category Analysis; Tissue Specific Regulatory Modules Associated with GO Biological Process

Methods & Resources (Cont.)
– DBTSS: Database of Transcriptional Start Sites. Based on 400,225 human and 580,209 mouse full-length cDNA sequences, DBTSS contains the genomic positions of the transcriptional start sites and the adjacent promoters for 8,793 human and 6,875 mouse genes. http://dbtss.hgc.jp/ Yutaka Suzuki, Riu Yamashita, Kenta Nakai and Sumio Sugano (2002). DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30: 328-331.
– Pancreas genes are chosen based on efforts to understand pancreatic development and function (EPConDB). 500 bp upstream for preliminary study. 159 human (mouse) pancreas-specific genes (Qislet < 7, positive (p)) and 159 human (mouse) ubiquitous genes (Qislet > 10, negative (n)).
– This approach can be applied to any tissue to study the tissue specificity of transcription factor binding motifs (TFBMs) and modules.

Method - Pattern Discovery - Teiresias
Teiresias patterns: a pattern P is an (L,W) pattern (with L ≤ W) if P contains at least L residues and every subpattern of P containing L residues is at most W symbols in length.
Pattern: ACTGGCA. C. GT
*Rigoutsos, I. and A. Floratos, Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics, 14(1), January 1998.
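
The density constraint in this definition is easy to test directly. The sketch below is only a checker for a single pattern, not the TEIRESIAS discovery algorithm itself. It assumes patterns are written with "." as the wildcard and that a subpattern starts and ends on a residue; the function name is hypothetical.

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Return True if `pattern` has at least L residues and every subpattern
    containing exactly L residues spans at most W symbols."""
    if L > W:
        raise ValueError("L must not exceed W")
    # A pattern is assumed to begin and end with a residue, not a wildcard
    if not pattern or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(residues) < L:
        return False
    # Slide a window over each run of L consecutive residues and measure its span
    for k in range(len(residues) - L + 1):
        span = residues[k + L - 1] - residues[k] + 1
        if span > W:
            return False
    return True

print(is_lw_pattern("A.CG", L=3, W=4))    # three residues within four symbols
print(is_lw_pattern("A..C.G", L=3, W=5))  # three residues need six symbols
```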

Identifying TFBMs - Pattern Distribution
With 117 human pancreas-specific genes (Q pancreas [...] 10, negative (n)), roughly 90,000 patterns were discovered in the 1kb+/200bp- promoter region.
Each point represents a pattern, with its occurrence in the positive data set on the y-axis and in the negative data set on the x-axis.
For each pattern (x-axis), the occurrence difference ∆ p-n (y-axis) between positive (Q [...] 10) data set.
Patterns with ∆ p-n > 20 (in blue box) are more likely to be pancreas specific.
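
The occurrence difference ∆ p-n can be sketched as below. The slide does not say whether occurrence counts sequences or individual matches, so this version assumes it is the number of sequences in which a pattern matches at least once. Patterns use "." as a wildcard, which happens to coincide with regular-expression syntax. Function names and the toy sequences are illustrative only.

```python
import re

def support(pattern, sequences):
    """Number of sequences containing at least one match of the pattern."""
    rx = re.compile(pattern)  # '.' in the pattern matches any single base
    return sum(1 for s in sequences if rx.search(s))

def delta_p_n(pattern, positive, negative):
    """Occurrence difference between the positive and negative promoter sets."""
    return support(pattern, positive) - support(pattern, negative)

# Toy data standing in for pancreas-specific (p) and ubiquitous (n) promoters
positive = ["AACGTA", "TTACGT", "GGGG"]
negative = ["ACGT", "CCCC"]
print(delta_p_n("ACGT", positive, negative))
```

With real promoter sets, patterns would then be kept if their difference exceeds the slide's cutoff of 20.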

Method - Pattern Clustering
Figure boxes: Patterns; Smith-Waterman; Distance of pattern pair; Hierarchical; K-Median; Num of Cluster; Pattern Clusters (PWM)
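
To make the "distance of pattern pair" step concrete, here is a small Smith-Waterman local alignment score turned into a distance. The scoring values (match 2, mismatch -1, linear gap -1) and the normalization by the best possible score of the shorter pattern are assumptions for illustration; the slide does not state the parameters used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def pattern_distance(a, b, match=2, mismatch=-1, gap=-1):
    """Assumed normalization: 0 when the shorter pattern aligns perfectly, 1 when nothing aligns.
    Both patterns must be non-empty."""
    score = smith_waterman_score(a, b, match, mismatch, gap)
    return 1.0 - score / (match * min(len(a), len(b)))

print(pattern_distance("ACGT", "TTACGTT"))
print(pattern_distance("AAAA", "TTTT"))
```

A matrix of such pairwise distances is the kind of input that hierarchical or k-median clustering would take.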

Results - Pattern Clustering Clustering Results (human, ∆ p-n >20, 72 patterns)

Identifying TFBMs
72 patterns (human, ∆ p-n > 20) were clustered into 18 pattern clusters, and 6 of them were identified as known binding sites by searching TRANSFAC.
Identified known binding sites associated with human pancreas genes: AP2ALPHA, MEF2, SRY, NKX62, CAP_01, HOXA3

Identifying TFBMs
Comparative genomic analysis shows that some discovered TFBMs are conserved between human and mouse pancreas orthologs.
Figure labels: AP2ALPHA, MEF2, NKX62, CAP_01, HOXA3

Gene Clustering - Based on TFBMs
Pancreas-specific genes can be clustered according to the presence or absence of conserved promoter motifs.
Upstream sequences can be characterized by pattern occurrences, which can then be used to calculate pairwise similarities between sequences. For simplicity, we used a boolean model that records only whether each of 7 conserved patterns appears. Centered Pearson correlation was used to calculate similarity, and 117 pancreas-specific genes (Q<6.5) were grouped into 10 clusters with hierarchical clustering.
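
The boolean model and the similarity measure can be sketched as follows. Each upstream sequence becomes a 0/1 vector over the conserved patterns, and pairs of vectors are compared with centered Pearson correlation. The patterns and sequence in the example are toy values, not the 7 patterns from the study, and returning 0.0 for a constant vector is a convention chosen here.

```python
import math
import re

def boolean_profile(sequence, patterns):
    """1 if the pattern occurs anywhere in the sequence, else 0, for each pattern."""
    return [1 if re.search(p, sequence) else 0 for p in patterns]

def centered_pearson(x, y):
    """Pearson correlation after subtracting each vector's own mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0  # undefined for a constant vector; 0.0 is an assumed convention
    return sum(a * b for a, b in zip(dx, dy)) / denom

def similarity_matrix(profiles):
    """Pairwise similarities, ready to hand to a hierarchical clustering routine."""
    return [[centered_pearson(p, q) for q in profiles] for p in profiles]

patterns = ["ACG", "C.T", "GGG"]
print(boolean_profile("TTACGTGG", patterns))
```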

Gene Clustering – GO Category
Assign Gene Clusters to GO Category
To interpret clustering results, we used EASE to find the significant biological features of a gene cluster of interest through the GO Biological Process.

More Clues:
- Known and novel TFBS found associated with genes expressed in the pancreas
- See conservation of sites between human and mouse
- Associated with digestion, catabolism, and response to stimulus GO biological processes

Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores Elisabetta Manduchi, Jonathan Schug

If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes?
Figure labels: Tissue; Biological Process; Genes

For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q. To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set. The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
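
A running-sum score of this kind can be sketched as below. The step sizes follow the form commonly attributed to Mootha et al. (2003): for N ranked genes and a set with G members, add sqrt((N-G)/G) at a member and subtract sqrt(G/(N-G)) otherwise, and take the maximum of the running sum. Whether the talk used exactly these weights is an assumption, and the function name is hypothetical.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum over a list ranked by increasing Q (most specific first)."""
    n = len(ranked_genes)
    members = set(gene_set) & set(ranked_genes)
    g = len(members)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit = math.sqrt((n - g) / g)    # step up when the gene belongs to the set
    miss = -math.sqrt(g / (n - g))  # step down otherwise; the walk ends at zero
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit if gene in members else miss
        best = max(best, running)
    return best

ranked = ["a", "b", "c", "d"]
print(enrichment_score(ranked, {"a", "b"}))  # set members ranked at the top
print(enrichment_score(ranked, {"c", "d"}))  # set members ranked at the bottom
```

Computing this score for one GO BP gene set in every tissue's ranked list yields that gene set's tissue-specificity profile.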

Application to a Human Tissue Survey
The following results refer to the application of the methods described above to the GeneNote tissue survey:
– 12 tissues in duplicate on the HGU95 Affymetrix chip set (Av2, B-E).
We looked at the 2316 GO BPs that we could map to probe sets (using version 1.5.1 of the Bioconductor GO and hgu95XXX metadata R packages).

GO BPs having significantly specific profiles for each tissue can be identified
Figure panels: significant in liver; significant in heart and skeletal muscle

Excerpt of cluster of GO BPs based on their tissue-specificity profiles (up in spinal cord/brain)

Focusing on steroid metabolism
A. After mapping probe sets to RefSeqs and retrieving from DBTSS their upstream sequences, we assembled a set of 63 promoter sequences, which was our positive set.
B. We generated 5 negative sets, each consisting of 315 sequences, by randomly scrambling each of the positive set sequences.
C. We ranked each of 666 Transcription Factor Binding Sites (TFBSs) from TRANSFAC, represented by position matrices, in terms of their ability (measured by average ROC area) to discriminate between the positive set and the negative sets.

Focusing on steroid metabolism
D. We then selected high-ranking TFBSs from (C) and high-ranking TFBSs from an independent study focusing on liver specificity and formed all possible pairs between these two sets.
E. These pairs were ranked according to their discriminative ability and on the basis of the distance between their components in the positive hits. Optimal parameters (distance and individual TFBS match scores) were selected for each pair scoring at the top.
F. By assessing the performance over a test set composed of mouse promoter sequences, we found 2 candidate CRMs (involving 3 and 4 TFBSs, respectively) with an over-representation of steroid metabolism genes.
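
Steps B and C can be illustrated with a short sketch: scramble sequences to build negative sets, then score how well a site separates positives from negatives by ROC area. The per-sequence scores would come from matching a position matrix, which is not shown here; the numbers in the example are made up. The ROC area is computed through its rank interpretation (the fraction of positive/negative pairs ordered correctly, ties counting one half).

```python
import random

def scramble(sequence, rng):
    """Shuffle the bases of a sequence, preserving its base composition."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)

def roc_area(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def average_roc_area(pos_scores, neg_score_sets):
    """Mean ROC area over several negative sets, as in step C."""
    areas = [roc_area(pos_scores, neg) for neg in neg_score_sets]
    return sum(areas) / len(areas)

rng = random.Random(0)  # fixed seed so the scrambling is reproducible
print(scramble("ACGTACGTAA", rng))
print(average_roc_area([3.0, 4.0], [[1.0, 2.0], [3.0, 4.0]]))
```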

Example of production hits to steroid metabolism mouse promoter sequences
No. mouse promoter sequences: 6875. Of these, 50 belong to genes mapping to steroid metabolism.
No. production hits: 257. Of these, 8 belong to genes mapping to steroid metabolism.
Production TFBSs: {FOXD3_01, GKLF_01, HFH1_01, MADSA_Q2}
Parameters: max distance = 130; FOXD3_01 min score = 9.934705; GKLF_01 min score = 10.815614; HFH1_01 min score = 9.442617; MADSA_Q2 min score = 8.246301
Figure legend: TSS marked; green = forward strand; red = reverse strand; shading indicates strength
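
One way to judge the over-representation in these counts is a one-sided hypergeometric test. The slide does not say which test was used, so this is an illustration under that assumption, using only the four counts given above. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement from a
    population that contains `successes` items of interest."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Counts from the slide: 6875 promoters, 50 steroid metabolism, 257 hits, 8 overlapping
expected = 257 * 50 / 6875
p_value = hypergeom_tail(6875, 50, 257, 8)
print(f"expected by chance: {expected:.2f}, observed: 8, p = {p_value:.2e}")
```

About 1.9 steroid metabolism promoters would be expected among 257 hits by chance, against the 8 observed.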

More Clues:
- We can identify candidate CRMs from top-ranking GO Biological Processes for tissues
- Identified a candidate CRM for steroid metabolism

Summary
- GUS is a functional genomics database system used by a growing number of sites for genome and expression projects.
- Using expression data in GUS and entropy-based metrics, we can rank genes according to their tissue specificity, learn promoter properties, and associate functional roles.
- In addition to general properties of tissue-specific promoters, we are beginning to identify combinations of motifs (i.e., regulatory modules) associated with expression in specific tissues.

Future Directions
- Refine analysis from genes to transcripts
- Refine analysis from organs to cells
- Apply approach to splicing
- Apply approach to developmental stage and differentiation state
Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".

http://www.cbil.upenn.edu
