
Functional annotation and identification of candidate disease genes by computational analysis of normal tissue gene expression data

L. Miozzi 1, U. Ala 1, R. Piro 2, F. Rosa 3, F. Di Cunto 1 and P. Provero 1

1 Dipartimento di Genetica, Biologia e Biochimica, Università di Torino, Torino, Italy; 2 INFN, Sezione di Torino, Torino, Italy; 3 ISI Foundation, Torino, Italy

Introduction

Among the open problems of molecular biology in the post-genomic era, the functional annotation of the human genome and the identification of genes involved in genetic diseases are especially important. Expression data on a genomic scale have been available for several years thanks to a set of new experimental techniques, and are widely believed to contain much information relevant to solving such problems. Here we present the results of a computational analysis of publicly available expression data on human normal tissues, based on the integration of data obtained with the two most important experimental platforms (microarrays and SAGE) and of different measures of dissimilarity between expression profiles. The building blocks of the procedure are the Gene Expression Neighborhoods (GEN), small sets of tightly coexpressed genes that are analyzed in terms of functional annotation and relevance to human diseases. This analysis provides putative functional annotations for many genes, and identifies promising candidate disease genes for experimental verification.

The "guilt by association" principle: The presented work is based on the following principle: "since there is a strong correlation between coexpression and functional relatedness, a gene found to be coexpressed with several others involved in the same biological process can be putatively given the same functional annotation" (Brazma A. and Vilo J., 2000, FEBS Lett. 480:17-24).

Method

In this work we analyze publicly available expression data on human normal tissues obtained with Affymetrix microarrays (http://symatlas.gnf.org/SymAtlas/) and with SAGE (Serial Analysis of Gene Expression; http://cgap.nci.nih.gov/). We considered 158 experiments concerning 12109 genes for Affymetrix and 62 experiments concerning 11741 genes for SAGE.

Different measures of dissimilarity between expression profiles have been defined and integrated: Euclidean distance and Pearson linear dissimilarity for the microarray data; Euclidean distance and a dissimilarity measure based on the Poisson distribution (developed in a different context in Van Helden J., 2004, Bioinformatics 20(3):399-406) for the SAGE data.

The unit of functional analysis, named Gene Expression Neighborhood (GEN), has been defined as a gene plus its k nearest expression neighbors, with k typically a rather small number (the results we report were obtained with k=6, so each GEN contains 7 genes). For each dataset and each choice of dissimilarity measure we identified a number of GENs equal to the number of genes represented in the dataset.

A GEN was considered functionally characterized if there was at least one Gene Ontology term (http://www.geneontology.org/) shared by the majority (K) of its genes (K=4 genes in the results presented). To avoid overly generic GO terms, the analysis has been limited to terms shared by no more than a given maximum number M of genes in the whole experimental dataset under investigation (M=300 in the results presented).
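
As a rough illustration of the GEN construction and the majority rule described above, the following pure-Python sketch builds neighborhoods from toy expression profiles and reports the GO terms that characterize one of them. It is not the authors' implementation: the function names, the toy profiles and the term labels (`GO:A`, `GO:B`) are hypothetical, the Pearson dissimilarity is assumed to be one minus the correlation coefficient (the poster does not give its formula), and the Poisson-based measure used for SAGE is omitted.

```python
from collections import Counter
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_dissimilarity(x, y):
    """1 - Pearson correlation (assumed form of the 'Pearson linear dissimilarity')."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # a flat profile is treated as uncorrelated (assumption)
    return 1.0 - cov / (sx * sy)


def build_gens(profiles, dissimilarity, k=6):
    """Map each gene to its GEN: the gene itself plus its k nearest neighbors."""
    gens = {}
    for gene, profile in profiles.items():
        others = sorted(
            (h for h in profiles if h != gene),
            key=lambda h: (dissimilarity(profile, profiles[h]), h),  # name breaks ties
        )
        gens[gene] = [gene] + others[:k]
    return gens


def characterizing_terms(gen, annotations, K=4, M=300):
    """GO terms shared by at least K genes of the GEN and by at most M genes overall."""
    term_size = Counter(t for terms in annotations.values() for t in terms)
    in_gen = Counter(t for g in gen for t in annotations.get(g, ()))
    return {t for t, c in in_gen.items() if c >= K and term_size[t] <= M}


# Toy data (made up): five genes measured in three tissues.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [0.9, 2.0, 2.9],
    "g4": [9.0, 1.0, 5.0],
    "g5": [8.5, 1.2, 5.1],
}
annotations = {
    "g1": {"GO:A"},
    "g2": {"GO:A"},
    "g3": set(),
    "g4": {"GO:B"},
    "g5": {"GO:B"},
}

gens = build_gens(profiles, euclidean, k=2)  # the poster uses k=6
print(gens["g1"])
print(characterizing_terms(gens["g1"], annotations, K=2, M=300))  # the poster uses K=4
```

In this toy setting the unannotated gene `g3` falls in a neighborhood characterized by `GO:A`, which is the situation in which the method would propose a putative annotation for it.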
This limit ensures that the majority rule used to define functionally characterized GENs automatically implies statistically significant overrepresentation of the GO term involved.

The false discovery rate for the functionally characterized GENs has been estimated: random GENs have been generated by reshuffling the gene names in the whole dataset (thus preserving the characteristics of the actual GENs, such as their degree of self-overlapping) and subjected to the same functional analysis. A leave-one-out analysis has been performed to estimate how many known annotations the method can recover.

Characterized GENs have been used to determine putative new functional annotations: for each functionally characterized GEN and for each GO term associated with it (shared by the majority of its genes), the same GO term has been putatively attributed to the genes in the GEN not yet associated with it.

Finally, we looked for functionally characterized GENs containing at least 3 genes associated with a genetic disease in the OMIM database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM).
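
The statement that the majority rule with M=300 implies significant overrepresentation can be checked with a back-of-the-envelope calculation. The sketch below assumes a hypergeometric null model, in which the 7 genes of a GEN are drawn at random without replacement from the 12109 genes of the microarray dataset; the poster does not say which test the authors had in mind, so this choice and the function name `hypergeom_tail` are assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(N, M, n, K):
    """P(X >= K) when n genes are drawn without replacement from N genes,
    M of which carry a given GO term (hypergeometric null, assumed here)."""
    total = comb(N, n)
    favorable = sum(
        comb(M, x) * comb(N - M, n - x) for x in range(K, min(n, M) + 1)
    )
    return favorable / total


# Worst case allowed by the method: a term annotating exactly M=300 genes.
# A GEN has k+1 = 7 genes, and at least K=4 of them must share the term.
p = hypergeom_tail(N=12109, M=300, n=7, K=4)
print(f"P(at least 4 of 7 genes share a term with 300 annotated genes) = {p:.2e}")
```

Under these assumptions the probability for a single GEN and a single term is of the order of 1e-5, and it is smaller for any less frequent term. The figure is not corrected for the number of GENs and terms tested, which is the question the reshuffling-based false discovery rate estimate addresses.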
When the relevant OMIM entries were related to each other, the genes in the GEN not associated to OMIM entries have been considered as interesting candidates to be involved in similar pathologies.

[Figure: workflow diagram. Box labels: Publicly available expression data (Microarrays, SAGE); integration of different quantitative measures of dissimilarity between expression profiles; Identification of Gene Expression Neighborhoods (GEN); GEN functional analysis using the controlled annotation vocabulary Gene Ontology; Estimation of false discovery rate; Leave-one-out; Putative new GO functional annotations; Integration with OMIM data; Potential new disease genes (OMIM)]

Conclusion

We have developed a useful approach to analyze and integrate information obtained with different experimental techniques and different definitions of dissimilarity measures able to explore several aspects of coexpression. The results demonstrate that this integration increases the amount of useful information obtained.

Results

The leave-one-out analysis showed that 1026 correct GO annotations involving 644 genes and 94 GO terms would have been correctly identified by the method (see table 1).

Table 1 - Leave-one-out analysis results showing the number of GO annotations (a) and annotated genes (b) correctly identified.

a)
                  Euclidean  Pearson  Poisson  Euclidean+Pearson  Euclidean+Poisson
Microarray        428        788      /        958                428
SAGE              50         /        51       50                 92
Microarray+SAGE   468        788      51       992                504

b)
                  Euclidean  Pearson  Poisson    Euclidean+Pearson  Euclidean+Poisson
Microarray        318        546      /          598                318
SAGE              48         /        [missing]  [missing]          82
Microarray+SAGE   353        546      48         625                376

Table 2 - Number of obtained putative new functional GO annotations (c) and new annotated genes (d).

c)
                  Euclidean  Pearson  Poisson  Euclidean+Pearson  Euclidean+Poisson
Microarray        688        1215     /        1731               688
SAGE              188        /        230      188                407
Microarray+SAGE   866        1215     230      1906               1081

d)
                  Euclidean  Pearson  Poisson  Euclidean+Pearson  Euclidean+Poisson
Microarray        569        950      /        1240               569
SAGE              173        /        216      173                362
Microarray+SAGE   720        950      216      1378               892

Different dissimilarity measures capture different aspects of coexpression, which correlate with different kinds of functional annotation (see tables 1 and 2): only a small fraction of GO annotations is predicted by more than one combination of dataset and dissimilarity measure. The distribution of GO terms among the three Gene Ontology branches changes significantly among the combinations of experimental dataset and dissimilarity measure, showing that different combinations are able to capture different aspects of coexpression. We have obtained 2113 putative new GO annotations involving 1540 genes and 194 GO terms (see table 2).

Fig. 1 - Distribution of the correctly identified GO annotations among the three GO branches (Biological process; Molecular function; Cellular component).

The integration of our functional annotation results with the OMIM database allowed us to identify at least 59 interesting candidate genes potentially involved in human genetic diseases (see table 3).

Table 3 - List of candidate genes potentially involved in human genetic diseases.

Dataset | Disease | Gene
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000069482
Microarray+Pearson | AORTIC ANEURYSM, FAMILIAL THORACIC 1 | ENSG00000149591
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000107796
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2G; CMT2G | ENSG00000166986
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, DOMINANT INTERMEDIATE A | ENSG00000166197
Microarray+Pearson | CONVULSIONS, BENIGN FAMILIAL INFANTILE, 2 | ENSG00000087258
Microarray+Pearson | CONVULSIONS, FAMILIAL INFANTILE, WITH PAROXYSMAL CHOREOATHETOSIS; ICCA | ENSG00000087258
Microarray+Pearson | DEAFNESS, NEUROSENSORY, AUTOSOMAL RECESSIVE 46; DFNB46 | ENSG00000101608
Microarray+Pearson | EPILEPSY, IDIOPATHIC GENERALIZED, SUSCEPTIBILITY TO, 3; EIG3 | ENSG00000078725
Microarray+Pearson | EPILEPSY, PARTIAL, WITH VARIABLE FOCI | ENSG00000100095
Microarray+Pearson | FACIOSCAPULOHUMERAL MUSCULAR DYSTROPHY 1A; FSHMD1A | ENSG00000154553
Microarray+Pearson | MUSCULAR DYSTROPHY, LIMB-GIRDLE, TYPE 1F; LGMD1F | ENSG00000128595
Microarray+Pearson | PARKINSON DISEASE 3, AUTOSOMAL DOMINANT LEWY BODY; PARK3 | ENSG00000075340
Microarray+Pearson | POLYDACTYLY, PREAXIAL II; PPD2 | ENSG00000106538
Microarray+Pearson | ROSSELLI-GULIENETTI SYNDROME | ENSG00000137699
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000139329
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG00000077009
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG00000099800
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000131808
Microarray+Pearson | BREAST CANCER, 11-22 TRANSLOCATION ASSOCIATED | ENSG00000137713
Microarray+Pearson | BREAST CANCER, DUCTAL, 1; BRCD1 | ENSG00000139618
Microarray+Pearson | ELECTROENCEPHALOGRAM, LOW-VOLTAGE | ENSG00000075043
Microarray+Pearson | EOSINOPHILIA, FAMILIAL | ENSG00000113721
Microarray+Pearson | MICROCEPHALY, PRIMARY AUTOSOMAL RECESSIVE, 4; MCPH4 | ENSG00000156970
Microarray+Pearson | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray+Pearson | TRIPHALANGEAL THUMB-POLYSYNDACTYLY SYNDROME | ENSG00000106538
Microarray+Pearson | TUMOR SUPPRESSOR GENE ON CHROMOSOME 11 | ENSG00000137713
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1F; CMD1F | ENSG00000118523
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1Q; CMD1Q | ENSG00000091136
Microarray+Pearson | DEAFNESS, AUTOSOMAL RECESSIVE 51; DFNB51 | ENSG00000026508
Microarray+Pearson | MYOPATHY, LIMB-GIRDLE, WITH BONE FRAGILITY | ENSG00000147872
Microarray+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
Microarray+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
Microarray+Euclidean | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray+Euclidean | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE+Euclidean | ANEURYSM, INTRACRANIAL BERRY, 3 | ENSG00000158747
SAGE+Euclidean | MYOPIA 5 | ENSG00000108821
SAGE+Euclidean | MYOPIA 6 | ENSG00000100122
SAGE+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG00000167971
SAGE+Euclidean | EXFOLIATIVE ICHTHYOSIS, AUTOSOMAL RECESSIVE, ICHTHYOSIS BULLOSA OF SIEMENS-LIKE | ENSG00000186081
SAGE+Euclidean | MACULAR DYSTROPHY, RETINAL, 2, BULL'S EYE | ENSG00000007062
SAGE+Euclidean | CATARACT, CONGENITAL NUCLEAR, AUTOSOMAL RECESSIVE 1; CATCN1 | ENSG00000105370
SAGE+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG00000129535
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG00000139988
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000109047
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000179036
SAGE+Euclidean | POSTERIOR COLUMN ATAXIA WITH RETINITIS PIGMENTOSA; AXPC1 | ENSG00000116703
SAGE+Euclidean | MYOPIA 6 | ENSG00000196431
SAGE+Euclidean | GLAUCOMA 3, PRIMARY INFANTILE, B; GLC3B | ENSG00000158747
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG00000197253
SAGE+Euclidean | DUPUYTREN CONTRACTURE | ENSG00000087245
SAGE+Euclidean | CORNEAL DYSTROPHY, CRYSTALLINE, OF SCHNYDER | ENSG00000158747
SAGE+Euclidean | CATARACT, AUTOSOMAL RECESSIVE, EARLY-ONSET, PULVERULENT | ENSG00000172014
SAGE+Euclidean | CATARACT, POSTERIOR POLAR 3 | ENSG00000125864

