Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions. László Patthy, Institute of Enzymology, Budapest. Darwin Day, Collegium Budapest, 2009. February 9.
In the last decade the genomes of numerous organisms have been sequenced; however, converting raw genome sequence data into biological knowledge remains a difficult task. Genome annotation, the process that maps biological knowledge onto the relevant genome elements, requires defining the positions of all protein-coding (and non-coding) genes along the genome sequence and identifying their coding regions, regulatory sequences, promoters, etc.
Although a large number of programs have been developed for computational gene identification, correct prediction of the structure of all protein-coding genes of higher eukaryotes is still an elusive goal. The uncertainties associated with gene finding may be illustrated by the fact that - eight years after the publication of the draft genome sequence (2001) - the exact number of protein-coding genes in the human genome is still unknown.
Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931-45.
Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.
Since direct evidence of protein existence is generally absent, the criterion often employed to annotate a transcript as protein-coding is the existence of an Open Reading Frame (ORF). However, this criterion has recently been called into question by a number of methods developed to assess the quality of protein-coding gene annotations.
The rationale of the method of Clamp et al. is that functional protein-coding genes are subject to purifying selection and are therefore expected to show evolutionary conservation. The authors used two measures to assess the evolutionary conservation of predicted human genes: reading frame conservation (RFC, based on the observation that indels do not significantly affect the size of functional proteins) and codon substitution frequency (CSF, based on the observation that the pattern of nucleotide substitution in functional protein-coding genes differs from that observed in random DNA). Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.
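The idea behind reading frame conservation can be illustrated with a small sketch. This is not the scoring procedure of Clamp et al.; it is a simplified check, written under the assumption that a gapped pairwise alignment is available and that both sequences start at the same codon position. The function name and the example sequences are hypothetical.

```python
def frame_conserved_fraction(ref_aln: str, other_aln: str) -> float:
    """Fraction of reference nucleotides aligned to the other sequence in the same frame.

    Both arguments are rows of one pairwise alignment, with '-' marking gaps.
    A column counts as frame-conserved when the ungapped coordinates of the
    two sequences differ by a multiple of three.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = other_pos = 0      # ungapped coordinates in each sequence
    ref_total = conserved = 0
    for r, o in zip(ref_aln, other_aln):
        if r != "-":
            ref_total += 1
            if o != "-" and (ref_pos - other_pos) % 3 == 0:
                conserved += 1
            ref_pos += 1
        if o != "-":
            other_pos += 1
    return conserved / ref_total if ref_total else 0.0


# A 3-nucleotide deletion keeps the downstream frame; a 1-nucleotide deletion breaks it.
in_frame_indel = frame_conserved_fraction("ATGGCCAAATAA", "ATG---AAATAA")
frameshift = frame_conserved_fraction("ATGGCCAAATAA", "ATG-CCAAATAA")
```

In this toy case the in-frame deletion leaves most of the reference in a conserved frame, whereas the single-nucleotide deletion shifts everything downstream of it out of frame.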
In their analysis of a number of human gene reference sets, Clamp et al. identified ~1200 human “orphans”: ORFs that lack homology to known genes. Both RFC and CSF analyses revealed that many of these human orphans behave in a way that is essentially indistinguishable from matched random controls and very different from that observed in nonorphan protein-coding genes. From this, the authors concluded that overall about 15% of the entries in the gene catalogues investigated are not valid protein-coding genes. Clamp et al., Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.
While the quality control method of Clamp et al. can distinguish protein-coding genes from noncoding sequences, it is less suitable for identifying gene predictions that are only partially correct. If an annotated gene is missing one or more exons, or a fraction of one exon, it may still exhibit the expected evolutionary characteristics of protein-coding genes.
Indeed, in addition to the uncertainty about the number of protein-coding genes, a very serious problem is that the structure of a significant proportion of human genes is incorrectly predicted. According to recent analyses, the predicted genomic structure is estimated to be correct for only about half of the predicted human genes. Obviously, erroneous prediction of the structure of protein-coding genes leads to serious problems in predicting the structure and function of the proteins they encode and hinders the identification of elements that regulate their expression.
A recent study has systematically compared the performance of various computational methods for predicting human protein-coding genes. A set of well-annotated ENCODE sequences was blind-analyzed with the different gene finding programs, and the predictions obtained were compared with the annotations. Predictions were analyzed at the nucleotide, exon, transcript and gene levels to evaluate how well they reproduce the annotation. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7 Suppl 1:S2.1-31. Epub 2006 Aug 7. Review.
The computational methods compared were classified as:
1) EST-, mRNA-, and protein-based methods (AUGUSTUS-EST, PARAGON+NSCAN_EST, ACEVIEW, ENSEMBL, EXOGEAN, EXONHUNTER, ACEMBLY, ECGene, MGCGene)
2) single-genome ab initio methods (AUGUSTUSabinit, GENEMARKhmm, GENEZILLA, GENEID, GENESCAN)
3) dual- or multiple-genome-based comparative genomic methods (AUGUSTUS-dual, ACESCAN, DOGFISH-C, NSCAN, SAGA, MARS, SGP2, TWINSCAN)
4) complex methods using any type of available information (AUGUSTUSany, FGENESH++, JIGSAW, PARAGONany, CCDSGene, KNOWNGene, REFSEQ)
At all levels, two basic measures were computed:
- sensitivity: the proportion of annotated features (nucleotide, exon, gene) that have been correctly predicted
- specificity: the proportion of predicted features that is correct.
The average of sensitivity and specificity ((Sn + Sp)/2) was also calculated for each program.
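The two measures defined above can be sketched in a few lines. This is only an illustration, not the EGASP evaluation code: it assumes that features are represented as hashable items (here, hypothetical exon coordinate pairs) and that a feature counts as correct only when it matches an annotated feature exactly.

```python
def sensitivity_specificity(annotated: set, predicted: set) -> tuple[float, float, float]:
    """Return (Sn, Sp, (Sn + Sp) / 2) for two sets of features."""
    correct = annotated & predicted                      # exactly matching features
    sn = len(correct) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


# Hypothetical exon coordinates (start, end), for illustration only.
annotated_exons = {(100, 200), (300, 400), (500, 650), (800, 900)}
predicted_exons = {(100, 200), (300, 400), (520, 650)}

sn, sp, avg = sensitivity_specificity(annotated_exons, predicted_exons)
print(f"Sn={sn:.2f} Sp={sp:.2f} (Sn+Sp)/2={avg:.2f}")
```

Here two of the four annotated exons are recovered (Sn = 0.5) and two of the three predicted exons are correct (Sp = 2/3). The same function applies at other levels if, for example, a transcript is represented as a tuple of its coding exons.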
Gene feature projection for evaluation of the accuracy of predictions. [Figure: predicted gene structure (PREDICTION) compared with the annotated structure (KNOWN); labels: missing exons, wrong exons.] Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.
Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31. Gene prediction accuracy at the transcript level. Boxplots of the average sensitivity and specificity ((Sn + Sp)/2) for each program. A transcript is accurately predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites for the coding exons are correct.
Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31. These studies have revealed that:
i) none of the strategies produced perfect predictions
ii) prediction methods that rely on mRNA and protein sequences, and those that use combined information (including expressed sequence information), were generally the most accurate
iii) the dual- or multiple-genome methods were more accurate than the single-genome ab initio prediction methods
iv) at the transcript level (the most stringent criterion), no prediction method correctly identified more than 45% of the coding transcripts.
The MisPred project
The implicit question is: are there signs that could indicate that the predicted structure of a protein-coding gene may be incorrect? The rationale of our MisPred project is that a protein-coding gene is suspected to be mispredicted if some of its features conflict with our current knowledge about protein-coding genes and proteins. Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008 Aug 27;9:353. MisPred: Database of mispredicted and abnormal proteins; http://mispred.enzim.hu/
Several quality control tools of MisPred address whether the predicted protein is able to reach the cellular compartment where it can fold properly and be stable and functional. The rationale of these tools is that protein domains have adapted to different subcellular compartments during evolution and are usually misfolded, unstable and non-functional if mislocalized. Nagy et al., BMC Bioinformatics. 2008 Aug 27;9:353.
Domain co-occurrence network of metazoan multidomain proteins. [Figure legend: extracellular domain, cytoplasmic signalling domain, nuclear domain.] Tordai H, Nagy A, Farkas K, Bányai L, Patthy L. Modules, multidomain proteins and organismic complexity. FEBS J. 2005 Oct;272(19):5064-78.
As a corollary, domain types do not co-occur at random in multidomain proteins:
- in extracellular proteins, domains adapted to the extracellular milieu are used
- extracellular and intracellular domains can co-occur only in transmembrane proteins
- nuclear and extracellular domains do not co-occur in a single protein, etc.
Some mislocalization-based MisPred tools used for the identification of abnormal or mispredicted proteins:
- Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals.
- Conflict between the presence of extracellular and intracellular signaling domains and the absence of transmembrane domains.
- Co-occurrence of extracellular and nuclear domains.
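The three checks listed above can be sketched as simple rules. This is not the MisPred implementation; it is a minimal sketch assuming that domain classes, signal peptide status and the number of transmembrane segments have already been predicted by upstream tools. All class, field and protein names are hypothetical, and the second rule is simplified to test only for the presence of a transmembrane segment, not its position between the domains.

```python
from dataclasses import dataclass, field

EXTRACELLULAR = "extracellular"
CYTOPLASMIC = "cytoplasmic_signalling"
NUCLEAR = "nuclear"


@dataclass
class PredictedProtein:
    name: str
    domain_classes: set = field(default_factory=set)  # compartment classes of detected domains
    has_signal_peptide: bool = False
    n_transmembrane: int = 0


def localization_conflicts(p: PredictedProtein) -> list:
    """Return the list of localization rules that the predicted protein violates."""
    conflicts = []
    has_ext = EXTRACELLULAR in p.domain_classes
    # Rule 1: extracellular domains need a signal peptide and/or transmembrane segments.
    if has_ext and not p.has_signal_peptide and p.n_transmembrane == 0:
        conflicts.append("extracellular domain without signal peptide or transmembrane segment")
    # Rule 2: extracellular and cytoplasmic signalling domains need a transmembrane segment.
    if has_ext and CYTOPLASMIC in p.domain_classes and p.n_transmembrane == 0:
        conflicts.append("extracellular and cytoplasmic signalling domains without transmembrane segment")
    # Rule 3: extracellular and nuclear domains should not co-occur.
    if has_ext and NUCLEAR in p.domain_classes:
        conflicts.append("extracellular and nuclear domains co-occur")
    return conflicts


# Hypothetical examples, for illustration only.
receptor = PredictedProtein("example_receptor", {EXTRACELLULAR, CYTOPLASMIC}, True, 1)
no_tm = PredictedProtein("example_no_tm", {EXTRACELLULAR, CYTOPLASMIC}, True, 0)
chimera = PredictedProtein("example_chimera", {EXTRACELLULAR, NUCLEAR}, False, 0)

for protein in (receptor, no_tm, chimera):
    print(protein.name, localization_conflicts(protein))
```

A protein may violate more than one rule at once, so the function returns every conflict found, and an empty list means no conflict was detected.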
Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals.
Rationale: proteins containing domains that occur exclusively in the extracellular space (e.g. in secreted extracellular proteins, in the extracellular part of type I, type II or type III single-pass transmembrane proteins, or in multispanning transmembrane proteins) have a cleavable signal peptide at the N-terminal end and/or transmembrane segments. Accordingly, proteins that contain extracellular domains but lack both a signal peptide and transmembrane segments are considered abnormal.
[Figure: example proteins with signal peptide (SP) and transmembrane (TM) segments marked: latrophilin-2, complement factor masp-3, leukocyte activation antigen m6, receptor tyrosine kinase-like orphan receptor 2, killer cell lectin-like receptor.]
Conflict between the presence of extracellular and intracellular signalling domains and the absence of transmembrane domains.
Rationale: extracellular domains and intracellular signalling domains can co-occur in multidomain proteins only if transmembrane segments separate these two types of domains. Accordingly, proteins that contain extracellular and intracellular signalling domains but lack a transmembrane segment separating them are considered abnormal.
[Figure: domain co-occurrence network of metazoan multidomain proteins; legend: extracellular module, cytoplasmic signalling module, nuclear module. Tordai et al., FEBS J. 2005; 272(19):5064-78.]
[Figure: receptor tyrosine kinase-like orphan receptor 2, with labels SP, KR and TM.]
ENSXETP00000040601 (Xenopus tropicalis) is erroneous since it lacks a transmembrane segment although it contains both extracellular and cytoplasmic signaling domains.
Co-occurrence of extracellular and nuclear domains.
Rationale: nuclear domains do not co-occur with extracellular domains in multidomain proteins. Accordingly, proteins that contain both extracellular and nuclear domains are considered abnormal.
[Figure: domain co-occurrence network of metazoan multidomain proteins; legend: extracellular module, cytoplasmic signalling module, nuclear module. Tordai et al., FEBS J. 2005; 272(19):5064-78.]
YL15_CAEEL: MISPREDICTED. Hypothetical homeobox protein C02F12.5 in chromosome X.
Mispredicted protein (Swiss-Prot entry) containing nuclear and extracellular domains (Homeobox, Kunitz_BPTI). Exons belonging to tandem genes on C. elegans chromosome X have been incorrectly joined.

              1                                                    50
q619j1_caebr  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
yl15_caeel    MTSKTNMTSN KFAYDFFPWS NDTNSSQQIK NIKPPPKRSN RPTKRTTFTS

              51                                                  100
q619j1_caebr  ~~~~~~~~~~ ~~~~~~~~~~ MFVWSAAVLI FSSVVPTFAQ YGCI....SE
yl15_caeel    EQVTLLELEF AKNEYICKDR RGELAQTIEL TECQVKTWFQ NRRTKKRSSE

              101                                                 150
q619j1_caebr  LTFGKACPQN KTSTKWFFDA KLSFCYPYQF LGCDEGSNSF ESSDICLESC
yl15_caeel    LKFGTACSEN KTSTKWYYDS KLLFCYPYKY LGCGEGSNSF ESNENCLESC

              151                                                 200
q619j1_caebr  KPADQFSCGG NTDADGICFS PSDSGCKKGT DCVMGGNIGF CCNKATQDEW
yl15_caeel    KPADQFSCGG NTGPDGVCFA HGDQGCKKGT VCVMGGMVGF CCDKKIQDEW

              201                                                 250
q619j1_caebr  NKEHSPTCSK GSVVQFKQWF GMTPLIGRNC AHKFCPAGST CIQGKWTAHC
yl15_caeel    NKENSPKCLK GQVVQFKQWF GMTPLIGRSC SHNFCPEKST CVQGKWTAYC

              251
q619j1_caebr  CQ
yl15_caeel    CQ
Another MisPred tool detects errors in gene prediction based on ‘Domain size deviation’. The rationale of this tool is that the highly cooperative, rapid folding of protein domains is the result of natural selection, therefore insertion/deletion of larger segments into/from protein domains may yield macromolecules that are unable to rapidly adopt a correctly folded, viable and stable three-dimensional structure. Accordingly, proteins containing domains that consist of a significantly larger or smaller number of residues than closely related members of the same family may be suspected to be unable to fold efficiently into a correctly folded, viable and stable domain/protein.
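A domain size deviation check of this kind might be sketched as follows. This is an illustration, not the MisPred implementation: the cutoff of two standard deviations from the family mean is an assumption chosen for the example, and the function name and domain lengths are hypothetical.

```python
from statistics import mean, stdev


def domain_size_deviates(length: int, family_lengths: list, max_sd: float = 2.0) -> bool:
    """Flag a domain whose length lies far from the lengths seen in its family.

    max_sd is an assumed cutoff (in sample standard deviations), not a value
    taken from the source.
    """
    if len(family_lengths) < 2:
        raise ValueError("need at least two family members to estimate the spread")
    mu = mean(family_lengths)
    sd = stdev(family_lengths)
    if sd == 0:
        return length != mu          # all family members have identical length
    return abs(length - mu) > max_sd * sd


# Hypothetical lengths (in residues) of one domain type in closely related proteins.
family = [58, 60, 59, 61, 60, 62, 59]

print(domain_size_deviates(60, family))   # typical length
print(domain_size_deviates(35, family))   # truncated domain, e.g. from a missing exon
```

A truncated domain, such as one produced when an exon is missing from the predicted transcript, falls well outside the family range and is flagged, while a domain of typical length is not.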
Structure of human carnitine acetyltransferase. [Figure: three-dimensional structure of human carnitine O-acetyltransferase (1NM8.pdb), with His 343 labelled.] The region highlighted in yellow is missing from transcript RP11-247A12.5-001. This region also contains the catalytic residue His-343.
Conclusions from MisPred analyses of various databases
The number of UniProtKB/Swiss-Prot entries identified by MisPred as erroneous is very low, attesting to both the high quality of this manually curated database and the reliability of the MisPred approach. In UniProtKB/TrEMBL, MisPred identified a large proportion of entries as erroneous, the majority of which were missing signal peptides or suffered from domain size deviation. This is due primarily to the fact that these TrEMBL entries are translated in silico from non-full-length cDNAs. Among the EnsEMBL- and NCBI/GNOMON-predicted sequences, MisPred identified ~3-4% of human sequences as erroneous. Here too, the majority of errors were identified on the basis of missing signal peptides and domain size deviation, probably reflecting the influence of non-full-length or abnormal cDNAs on gene predictions. Nagy et al., BMC Bioinformatics. 2008 Aug 27;9:353.
Application of the MisPred tools to GENCODE peptides revealed that many of the potential alternative gene products encode proteins that are likely to be mislocalized and/or misfolded, suggesting that they do not have a role as functional proteins. Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, López G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Størling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramírez F, Schlicker A, Denoeud F, Jones P, Kerrien S, Orchard S, Antonarakis SE, Reymond A, Birney E, Brunak S, Casadio R, Guigo R, Harrow J, Hermjakob H, Jones DT, Lengauer T, Orengo CA, Patthy L, Thornton JM, Tramontano A, Valencia A. The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500. Epub 2007 Mar 19.
Tress et al., Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500.
Although large-scale whole-genome analyses have shown that mammalian transcriptomes are made of a swarming mass of different overlapping transcripts, little evidence exists that the majority of this transcript complexity leads to protein complexity. The average of 5.7 transcripts per coding locus annotated in GENCODE translates to only 1.7 proteins per locus (since a large fraction of transcript variation corresponds to non-coding transcripts or accumulates in the UTRs of coding transcripts). Moreover, if the GENCODE proteins flagged as problematic by protein assessment methods such as MisPred are ignored, there are barely 1.3 annotated proteins per locus. The discrepancy between a complex, variable and largely unexplored population of RNA molecules and a relatively small, stable and well-defined population of proteins constitutes one of the challenges that molecular biology needs to address to fully elucidate cellular function. Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE and Guigó R. Identifying protein-coding genes in genomic sequences. Genome Biology 2009, 10:201
László Bányai, Krisztina Farkas, Hédi Hegyi, Evelin Kozma, Alinda Nagy, Hedvig Tordai. This work was carried out as part of the BioSapiens project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265. The authors acknowledge the partial support of the National Office for Research and Technology under grant no. eScience RET14/2005.