Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions
László Patthy, Institute of Enzymology, Budapest
Darwin Day, Collegium Budapest, February 9, 2009

In the last decade the genomes of numerous organisms have been sequenced; however, converting raw genome sequence data into biological knowledge remains a difficult task. Genome annotation, the process that maps biological knowledge onto the relevant genome elements, requires defining the positions of all protein-coding (and non-coding) genes along the genome sequence and identifying their coding regions, regulatory sequences, promoters, etc.

Although a large number of programs have been developed for computational gene identification, correct prediction of the structure of all protein-coding genes of higher eukaryotes is still an elusive goal. The uncertainties associated with gene finding may be illustrated by the fact that - eight years after the publication of the draft genome sequence (2001) - the exact number of protein-coding genes in the human genome is still unknown.

Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931-45.

Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.

Since direct evidence of protein existence is generally absent, the criterion often employed to annotate a transcript as protein-coding is the existence of an Open Reading Frame (ORF). However, this criterion has been recently questioned by a number of methods developed to assess the quality of protein-coding gene annotations.

The rationale of the method of Clamp et al. is that functional protein-coding genes are subject to purifying selection and are therefore expected to show evolutionary conservation. The authors used two measures to assess the evolutionary conservation of predicted human genes: reading frame conservation (RFC, based on the observation that indels in functional protein-coding genes tend to preserve the reading frame and so do not significantly affect the size of the protein) and codon substitution frequency (CSF, based on the observation that the pattern of nucleotide substitution in functional protein-coding genes differs from that observed in random DNA). Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.
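The idea behind a reading-frame measure can be illustrated with a toy calculation. The sketch below is a simplified illustration only, not the procedure of Clamp et al.: it assumes a gapped pairwise nucleotide alignment is already available, and the function name and the "best of three frame offsets" scoring are assumptions made for this example.

```python
def reading_frame_conservation(aln_a: str, aln_b: str) -> float:
    """Fraction of aligned (ungapped) columns that share one frame offset.

    Simplified, assumed scoring: for every column where both sequences have
    a nucleotide, record the difference of their ungapped positions modulo 3,
    then report the share of columns belonging to the most common offset.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    pos_a = pos_b = 0          # ungapped positions in each sequence
    offset_counts = [0, 0, 0]  # columns seen for each relative frame offset
    aligned = 0

    for a, b in zip(aln_a, aln_b):
        a_gap, b_gap = a == "-", b == "-"
        if not a_gap and not b_gap:
            offset_counts[(pos_a - pos_b) % 3] += 1
            aligned += 1
        if not a_gap:
            pos_a += 1
        if not b_gap:
            pos_b += 1

    return max(offset_counts) / aligned if aligned else 0.0


# A 3-nt deletion keeps the frame; a 1-nt deletion shifts it.
in_frame = reading_frame_conservation("ATGAAACCCGGG", "ATG---CCCGGG")
shifted = reading_frame_conservation("ATGAAACCCGGG", "ATGA-ACCCGGG")
print(in_frame, shifted)
```

In this toy case the in-frame deletion leaves the score at its maximum, while the single-nucleotide deletion lowers it, which is the behaviour a frame-based conservation measure is meant to capture.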

In their analysis of a number of human gene reference sets, Clamp et al. identified ~1200 human “orphans”: ORFs that lack homology to known genes. Both RFC and CSF analysis revealed that many of these human orphans behave in a way that is essentially indistinguishable from matched random controls and very different from that observed in nonorphan protein-coding genes. From these results, the authors concluded that overall about 15% of the entries in the gene catalogues investigated are not valid protein-coding genes. Clamp et al., Proc Natl Acad Sci U S A. 2007;104(49):19428-33.

While the quality control method of Clamp et al. can distinguish protein-coding genes from noncoding sequences, it is less suited to identifying gene predictions that are only partially correct. If an annotated gene lacks one or more exons, or a fraction of one exon, it may still exhibit the expected evolutionary characteristics of protein-coding genes.

Indeed, in addition to the uncertainty about the number of protein-coding genes, a very serious problem is that the structure of a significant proportion of human genes is incorrectly predicted. According to recent analyses, the predicted genomic structure is estimated to be correct for only about half of the predicted human genes. Erroneous prediction of the structure of protein-coding genes leads to serious problems in predicting the structure and function of the proteins they encode and hinders the identification of elements that regulate their expression.

A recent study has systematically compared the performance of various computational methods to predict human protein-coding genes. A set of well-annotated ENCODE sequences was blind-analyzed with the different gene-finding programs, and the predictions obtained were compared with the annotations. Predictions were analyzed at the nucleotide, exon, transcript and gene levels to evaluate how well they reproduce the annotation. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7 Suppl 1:S2.1-31. Epub 2006 Aug 7. Review.

The computational methods compared were classified as:
1) EST-, mRNA-, and protein-based methods (AUGUSTUS-EST, PARAGON+NSCAN_EST, ACEVIEW, ENSEMBL, EXOGEAN, EXONHUNTER, ACEMBLY, ECGene, MGCGene)
2) single-genome ab initio methods (AUGUSTUSabinit, GENEMARKhmm, GENEZILLA, GENEID, GENESCAN)
3) dual- or multiple-genome based comparative genomic methods (AUGUSTUS-dual, ACESCAN, DOGFISH-C, NSCAN, SAGA, MARS, SGP2, TWINSCAN)
4) complex methods using any type of available information (AUGUSTUSany, FGENESH++, JIGSAW, PARAGONany, CCDSGene, KNOWNGene, REFSEQ)

At all levels, two basic measures were computed:
- sensitivity (Sn): the proportion of annotated features (nucleotide, exon, gene) that have been correctly predicted
- specificity (Sp): the proportion of predicted features that are correct
The average of sensitivity and specificity, (Sn + Sp)/2, was also calculated for each program.
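These two measures can be computed directly from sets of features. The following is a minimal sketch at the exon level, assuming that each exon is represented as a (start, end) coordinate pair and that a predicted exon counts as correct only if both boundaries match an annotated exon; the function name and the example coordinates are hypothetical.

```python
def sensitivity_specificity(annotated: set, predicted: set):
    """Return (Sn, Sp, (Sn + Sp) / 2) for two sets of features.

    Sn = correctly predicted features / annotated features
    Sp = correctly predicted features / predicted features
    """
    correct = annotated & predicted
    sn = len(correct) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


# Hypothetical exon coordinates: two exons are reproduced exactly,
# one predicted exon has a wrong boundary, one annotated exon is missed.
annotated_exons = {(100, 200), (300, 400), (500, 600), (700, 800)}
predicted_exons = {(100, 200), (300, 400), (550, 600)}

sn, sp, average = sensitivity_specificity(annotated_exons, predicted_exons)
print(f"Sn={sn:.2f} Sp={sp:.2f} average={average:.2f}")
```

Note that "specificity" as defined here is the fraction of predictions that are correct, which is the convention used in the gene prediction assessment described above.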

[Figure: Gene feature projection for evaluation of the accuracy of predictions. The KNOWN annotation is compared with the PREDICTION; missing exons and wrong exons are marked. Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.]

[Figure: Gene prediction accuracy at the transcript level. Boxplots of the average sensitivity and specificity ((Sn + Sp)/2) for each program. Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.]
A transcript is accurately predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.
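The transcript-level criterion is much stricter than the exon-level one, because a single wrong boundary invalidates the whole transcript. A minimal sketch of this check, assuming each transcript is given as a list of coding-exon (start, end) pairs so that exact equality of the lists implies correct translation start, translation end and splice sites; the function names and coordinates are hypothetical.

```python
def transcript_matches(annotated_cds, predicted_cds) -> bool:
    """True only if every coding-exon boundary agrees between the two."""
    return sorted(annotated_cds) == sorted(predicted_cds)


def transcript_sensitivity(annotated, predicted) -> float:
    """Fraction of annotated transcripts reproduced exactly by some prediction."""
    if not annotated:
        return 0.0
    hits = sum(
        any(transcript_matches(a, p) for p in predicted) for a in annotated
    )
    return hits / len(annotated)


# Hypothetical example: the first transcript is reproduced exactly,
# the second has a wrong end of translation (1090 instead of 1100).
annotated_transcripts = [[(100, 200), (300, 400)], [(1000, 1100)]]
predicted_transcripts = [[(100, 200), (300, 400)], [(1000, 1090)]]
print(transcript_sensitivity(annotated_transcripts, predicted_transcripts))
```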

These studies have revealed that:
i) none of the strategies produced perfect predictions;
ii) prediction methods that rely on mRNA and protein sequences, and those that used combined information (including expressed sequence information), were generally the most accurate;
iii) the dual- or multiple-genome methods were more accurate than the single-genome ab initio prediction methods;
iv) at the transcript level (the most stringent criterion), no prediction method correctly identified more than 45% of the coding transcripts.
Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.

The MisPred project
The implicit question is: are there signs that could indicate that the predicted structure of a protein-coding gene may be incorrect? The rationale of our MisPred project is that a protein-coding gene is suspected to be mispredicted if some of its features conflict with our current knowledge about protein-coding genes and proteins.
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008 Aug 27;9:353.
MisPred: Database of mispredicted and abnormal proteins; http://mispred.enzim.hu/

Several quality control tools of MisPred address the question of whether the predicted protein is able to reach the cellular compartment where it can be properly folded, stable and functional. The rationale of these tools is that protein domains have adapted to different subcellular compartments during evolution and are usually misfolded, unstable and non-functional if mislocalized. Nagy et al., BMC Bioinformatics. 2008;9:353.

Domain co-occurrence network of metazoan multidomain proteins
[Figure: domain co-occurrence network, with extracellular domains, cytoplasmic signalling domains and nuclear domains marked.] Tordai H, Nagy A, Farkas K, Bányai L, Patthy L. Modules, multidomain proteins and organismic complexity. FEBS J. 2005 Oct;272(19):5064-78.
As a corollary, in multidomain proteins domain types do not co-occur at random:
- in extracellular proteins, domains adapted to the extracellular milieu are used
- extracellular and intracellular domains can co-occur only in transmembrane proteins
- nuclear and extracellular domains do not co-occur in a single protein, etc.

Some mislocalization-based MisPred tools used for the identification of abnormal or mispredicted proteins:
- Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals.
- Conflict between the presence of extracellular and intracellular signalling domains and the absence of transmembrane domains.
- Co-occurrence of extracellular and nuclear domains.
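The three conflicts listed above can be expressed as simple rules over a protein's predicted features. The sketch below is only an illustration and not the MisPred implementation: the domain category tables, the class and the function name are assumptions made for this example, and the second rule is simplified to test for the absence of any transmembrane segment rather than checking that a segment actually separates the two types of domains.

```python
from dataclasses import dataclass

# Illustrative (assumed) category tables; a real analysis would need
# curated lists of extracellular, cytoplasmic and nuclear domain families.
EXTRACELLULAR = {"Kunitz_BPTI", "Sushi", "Cadherin"}
CYTOPLASMIC_SIGNALLING = {"Pkinase_Tyr", "SH2"}
NUCLEAR = {"Homeobox"}


@dataclass
class PredictedProtein:
    name: str
    domains: tuple
    has_signal_peptide: bool = False
    n_transmembrane: int = 0


def mislocalization_conflicts(protein: PredictedProtein) -> list:
    """Return the list of rule violations found for one predicted protein."""
    domains = set(protein.domains)
    extracellular = bool(domains & EXTRACELLULAR)
    cytoplasmic = bool(domains & CYTOPLASMIC_SIGNALLING)
    nuclear = bool(domains & NUCLEAR)
    no_tm = protein.n_transmembrane == 0

    conflicts = []
    if extracellular and not protein.has_signal_peptide and no_tm:
        conflicts.append("extracellular domain without sequence signals")
    if extracellular and cytoplasmic and no_tm:
        conflicts.append("extracellular and cytoplasmic domains without TM")
    if extracellular and nuclear:
        conflicts.append("extracellular and nuclear domains co-occur")
    return conflicts


# Hypothetical prediction joining a nuclear and an extracellular domain.
joined = PredictedProtein("joined_prediction", ("Homeobox", "Kunitz_BPTI"))
for conflict in mislocalization_conflicts(joined):
    print(joined.name, "->", conflict)
```

A protein that triggers any rule is only a candidate for misprediction; as the examples in the following slides show, confirmation requires comparison with orthologs or with the genomic sequence.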

Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals.
Rationale: proteins containing domains that occur exclusively in the extracellular space (e.g. in secreted extracellular proteins, in the extracellular part of type I, type II or type III single-pass transmembrane proteins, or in multispanning transmembrane proteins) have a cleavable signal peptide at the N-terminal end and/or transmembrane segments. Accordingly, proteins that contain extracellular domains but have neither a signal peptide nor a transmembrane segment are considered abnormal.
[Figure: domain architectures of example proteins, with signal peptides (SP) and transmembrane segments (TM) marked: latrophilin-2, complement factor masp-3, leukocyte activation antigen m6, receptor tyrosine kinase-like orphan receptor 2, killer cell lectin-like receptor.]

Example: Complement factor H, isoform b.
[Alignment (FASTA output): predicted protein enst00000359637.1.pep (ENSP00000352658.1) against UNI_TREMBL:Q8N708 (HF1 protein, 449 aa). Smith-Waterman score 3167; 99.5% identity in 430 aa overlap (1-430:20-449); Z-score 3657.9; E() 1.1e-195. The prediction starts at residue 20 of Q8N708 (DCNELPPRRNTEILTGSW) and therefore lacks its N-terminal 19 residues, MRLLAKIICLMLWAICVAE.]
Q8N708: signal peptide (SP) present, CORRECT. ENSP00000352658.1: MISPREDICTED.

Example: Interleukin-2 receptor alpha chain precursor.
[Alignment (FASTA output): predicted protein enst00000256876.3.pep (ENSP00000256876.3) against UNI_SPROT:IL2A_HUMAN (P01589, 272 aa). Smith-Waterman score 637; 100.0% identity in 92 aa overlap (1-92:31-122); Z-score 759.2; E() 3.2e-34. The prediction lacks the N-terminal 30 residues of IL2A_HUMAN (MDSYLLMWGLLTFIMVPGCQAELCDDDPPE) and diverges from IL2A_HUMAN after the 92-residue overlap, ending in DFQIQTEMAATMETSI.]
P01589: signal peptide and transmembrane segment (SP, TM) present, CORRECT. ENSP00000256876.3: MISPREDICTED.

Example: CD209 antigen-like protein 1.
[Alignment (FASTA output): predicted protein enst00000248228.2.pep (ENSP00000248228.2) against UNI_SPROT:209L_HUMAN (Q9H2X3, 399 aa). Smith-Waterman score 2034; 82.2% identity in 399 aa overlap (1-332:1-399); Z-score 1784.9; E() 2.4e-91. Relative to 209L_HUMAN, the prediction is largely missing or divergent around residues 44-71 (GCLGHGALVLQLLSFMLLAGVLVAILVQ) and has a second gap at residues 216-261; the rest of the two sequences is identical.]
Q9H2X3: transmembrane segment (TM) present, CORRECT. ENSP00000248228.2: MISPREDICTED.

Example: Cadherin-2 precursor.
[Alignment: CADH2_HUMAN [906 residues] against AC110015.1_002 [191 residues]. AC110015.1_002 covers only alignment positions 55-245; apart from its first residues (MCNTQRM) it matches CADH2_HUMAN over this stretch, but it has no counterpart for the N-terminal region or for anything C-terminal of position 245.]
CADH2_HUMAN: signal peptide and transmembrane segment (SP, TM) present, CORRECT. AC110015.1_002: ABNORMAL.

Conflict between the presence of extracellular and intracellular signalling domains and the absence of transmembrane domains.
Rationale: extracellular domains and intracellular signalling domains can co-occur in multidomain proteins only if transmembrane segments separate these two types of domains. Accordingly, proteins that contain extracellular and intracellular signalling domains but lack a transmembrane segment separating them are considered abnormal.
[Figure: domain co-occurrence network of metazoan multidomain proteins, with extracellular modules, cytoplasmic signalling modules and nuclear modules marked. Tordai et al., FEBS J. 2005; 272(19):5064-78.]
[Figure: domain architecture of receptor tyrosine kinase-like orphan receptor 2, with SP, KR and TM marked.]

ENSXETP00000040601 (Xenopus tropicalis) is erroneous since it lacks a transmembrane segment although it contains both extracellular and cytoplasmic signaling domains.

[Alignment (BLAST output): Query, ENSXETP00000040601 (Xenopus tropicalis), residues 181-818, against Sbjct, EPHA7_CHICK, residues 361-991. The two sequences are closely similar throughout, except in the following region:
Query 361 YYIFA-CSYCIAYIMGSQSSLLLCLQIALQLLINSSSLYYTAALCDLNYNKSLKMHFPSG 419
Sbjct 541 TAVSSEQNPVIIIAVVAVAGTIILVFMVFGFIIGRRHCGYSKA--DQEGDEELYFHF--- 595]
The chicken ortholog of ENSXETP00000040601 (Xenopus tropicalis), EPHA7_CHICK, Ephrin type-A receptor 7 (np_990414), does contain a transmembrane segment. ENSXETP00000040601 deviates most significantly from EPHA7_CHICK in this region.
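The region where a prediction deviates most from its ortholog can be located automatically once a pairwise alignment is available. The following is a minimal sketch using a sliding window over the alignment columns; the function name, the window size and the toy sequences are assumptions for illustration and are not part of MisPred.

```python
def lowest_identity_window(aln_a: str, aln_b: str, window: int = 20):
    """Return (start, identity) of the alignment window with fewest matches.

    Both inputs are rows of the same gapped alignment. A column counts as
    a match only if both rows carry the same residue (gaps never match).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    if len(aln_a) < window:
        raise ValueError("alignment is shorter than the window")

    matches = [1 if a == b and a != "-" else 0 for a, b in zip(aln_a, aln_b)]

    current = sum(matches[:window])
    best_start, best = 0, current
    for start in range(1, len(matches) - window + 1):
        # Slide by one column: add the new column, drop the old one.
        current += matches[start + window - 1] - matches[start - 1]
        if current < best:
            best_start, best = start, current
    return best_start, best / window


# Toy alignment: identical flanks around a ten-column divergent stretch.
row_a = "A" * 30 + "C" * 10 + "A" * 30
row_b = "A" * 30 + "G" * 10 + "A" * 30
print(lowest_identity_window(row_a, row_b, window=10))
```

A low-identity window flanked by near-identical sequence, as in the Xenopus example, is the pattern that suggests wrongly chosen exons rather than genuine divergence.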

The erroneous part of ENSXETP00000040601 (Xenopus tropicalis) could be corrected by identifying the exons encoding the ‘missing’ transmembrane segment.
                             451                                                500
ensxetp00000040601_corrected KERVLQRAVD LSWQEPEHPN GVITEYEIKY YEKDQRERTY STLKTKSTSV
np_990414                    KERVLQRSVE LSWQEPEHPN GVITEYEIKY YEKDQRERTY STVKTKSTSA
ensxetp00000040601           KERVLQRAVD LSWQEPEHPN GVITEYEIKY YEKDQRERTY STLKTKSTSV
                             501                                                550
ensxetp00000040601_corrected SINNLRPGTA YIFQIRAFTA AGYGMYSPRL DVSTLEEATA TAVSTEQNPV
np_990414                    SINNLKPGTV YVFQIRAFTA AGYGNYSPRL DVATLEEATA TAVSSEQNPV
ensxetp00000040601           SINNLRPGTA YIFQIRAFTA AGYGMYSPRL DVSTLEEATV YYIFACSYCI
                             551                                                600
ensxetp00000040601_corrected IIIAVVAVAG TIILVFMVFG FIIGRRHCGY SKA..DQEGD EELYFHC...
np_990414                    IIIAVVAVAG TIILVFMVFG FIIGRRHCGY SKA..DQEGD EELYFHF...
ensxetp00000040601           AYI.MGSQSS LLLCLQIALQ LLINSSSLYY TAALCDLNYN KSLKMHFPSG
                             601                                                650
ensxetp00000040601_corrected ......TKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV IGAGEFGEVC
np_990414                    ..KFPGTKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV IGAGEFGEVC
ensxetp00000040601           LVKFPGTKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV IGAGEFGEVC
                             651                                                700
ensxetp00000040601_corrected SGRLKLPGKR DVPVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE
np_990414                    SGRLKLPGKR DVAVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE
ensxetp00000040601           SGRLKLPGKR DVPVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE

Co-occurrence of extracellular and nuclear domains.
Rationale: nuclear domains do not co-occur with extracellular domains in multidomain proteins. Accordingly, proteins that contain both extracellular and nuclear domains are considered abnormal.
[Figure: domain co-occurrence network of metazoan multidomain proteins, with extracellular modules, cytoplasmic signalling modules and nuclear modules marked. Tordai et al., FEBS J. 2005; 272(19):5064-78.]

Example: YL15_CAEEL, hypothetical homeobox protein C02F12.5 in chromosome X (MISPREDICTED).
[Alignment: q619j1_caebr against yl15_caeel, alignment positions 1-252. q619j1_caebr aligns only with the C-terminal part of yl15_caeel (from alignment position 71 onward), which carries the Kunitz_BPTI domain; the N-terminal part of yl15_caeel carries the Homeobox domain.]
Mispredicted protein (Swiss-Prot entry) containing nuclear and extracellular domains. Exons belonging to tandem genes on C. elegans chromosome X have been incorrectly joined.

[Alignment 1 (Homeobox): yl15_caeel_corr against hm07_caeel and q7qbz2_anoga, alignment positions 1-115; yl15_caeel_corr ends at position 101. Alignment 2 (Kunitz_BPTI): q619j1_caebr against yl15_caeel_corr1, alignment positions 1-178.]
Corrected predictions for the distinct constituent proteins containing the nuclear homeobox and extracellular KUNITZ_BPTI domains (CORRECT).

Another MisPred tool detects errors in gene prediction based on ‘Domain size deviation’. The rationale of this tool is that the highly cooperative, rapid folding of protein domains is the result of natural selection; insertion or deletion of larger segments into or from protein domains may therefore yield macromolecules that are unable to rapidly adopt a correctly folded, viable and stable three-dimensional structure. Accordingly, proteins containing domains that consist of a significantly larger or smaller number of residues than closely related members of the same family are suspected to be unable to fold efficiently into a viable and stable domain/protein.
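A size-deviation check of this kind can be sketched as an outlier test on domain length. The example below is an illustration under stated assumptions, not the MisPred criterion: it uses a z-score against the lengths of related family members, and both the cutoff of 2.0 and the family lengths are hypothetical values chosen for the example.

```python
import math
from statistics import mean, stdev


def size_z_score(domain_length: int, family_lengths: list) -> float:
    """Z-score of a domain's length relative to related family members."""
    if len(family_lengths) < 2:
        raise ValueError("need at least two family members")
    mu = mean(family_lengths)
    sd = stdev(family_lengths)
    if sd == 0:
        # All family members have the same length.
        if domain_length == mu:
            return 0.0
        return math.copysign(math.inf, domain_length - mu)
    return (domain_length - mu) / sd


def deviates_in_size(domain_length, family_lengths, z_cutoff=2.0) -> bool:
    """Flag a domain whose length is far from that of its relatives.

    z_cutoff is an assumed threshold for this sketch.
    """
    return abs(size_z_score(domain_length, family_lengths)) > z_cutoff


# Hypothetical domain lengths of closely related family members.
family = [590, 595, 600, 605, 610]
print(deviates_in_size(518, family))  # internally deleted domain
print(deviates_in_size(602, family))  # normal-sized domain
```

Restricting the comparison to closely related members, as the rationale above specifies, matters in practice: distant members of a family can legitimately differ in length.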

Example: RP11-247A12.5-001 [544 aa] against CACP_HUMAN, Carnitine O-acetyltransferase [626 residues].
[Alignment: cacp_human against rp11-247a12, positions 1-626. RP11-247A12.5-001 has no counterpart for residues 281-362 of CACP_HUMAN (82 residues, beginning KSIFTVCLDATMPRVSEDVY); the rest of the two sequences aligns over its full length.]
CACP_HUMAN: CORRECT. RP11-247A12.5-001: ABNORMAL; it encodes an internally deleted Carn_acyltransf domain.

Structure of human carnitine acetyltransferase
[Figure: three-dimensional structure of human carnitine O-acetyltransferase (1NM8.pdb), with His 343 marked.] The region highlighted in yellow is missing from transcript RP11-247A12.5-001. This region also contains the catalytic residue His-343.

1-50
epha5_human  MRGSGPRGAG HRRPP..SGG GDTPITPASL AGCYSAPRRA PLWTCLLLCA
epha5_rat    MRGSGPRGAG RRRTQGRGGG GDTPRVPASL AGCYSAPLKG PLWTCLLLCA
epha5_chick  M...GLRGGG .....GRAGG ......PA.. .......... PGWTCLLLCA
epha5_mouse  MRGSGPRGAG HRRTQGRGGG DDTPRVPASL AGCYSAPLKG PLWTCLLLCA

51-100
epha5_human  ALRTLLASPS NEVNLLDSRT VMGDLGWIAF PKNGWEEIGE VDENYAPIHT
epha5_rat    ALRTLLASPS NEVNLLDSRT VLGDLGWIAF PKNGWEEIGE VDENYAPIHT
epha5_chick  ALRSLLASPG SEVNLLDSRT VMGDLGWIAY PKNGWEEIGE VDENYAPIHT
epha5_mouse  ALRTLLASPS NEVNLLDSRT VMGDLGWIAF PKNGWEEIGE VDENYAPIHT

101-150
epha5_human  YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN SLPGGLGTCK
epha5_rat    YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN SLPGGLGTCK
epha5_chick  YQVCKVMEQN QNNWLLTSWI SNEGRPASSF ELKFTLRDCN SLPGGLGTCK
epha5_mouse  YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN SLPGGLGTCK

151-200
epha5_human  ETFNMYYFES DDQNGRNIKE NQYIKIDTIA ADESFTELDL GDRVMKLNTE
epha5_rat    ETFNMYYFES DDENGRNIKD NQYIKIDTIA ADESFTELDL GDRVMKLNTE
epha5_chick  ETFNMYYFES DDEDGRNIRE NQYIKIDTIA ADESFTELDL GDRVMKLNTE
epha5_mouse  ETFNMYYFES DDENGRSIKE NQYIKIDTIA ADESFTELDL GDRVMKLNTE

201-250
epha5_human  VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR HLAVFPDTIT
epha5_rat    VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR HLAVFPDTIT
epha5_chick  VRDVGPLTKK GFYLAFQDVG ACIALVSVRV YYKKCPSVIR NLARFPDTIT
epha5_mouse  VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR HLAIFPDTIT

251-300
epha5_human  GADSSQLLEV SGSCVNHSVT DEPPKMHCSA EGEWLVPIGK CMCKAGYEEK
epha5_rat    GADSSQLLEV SGSCVNHSVT DDPPKMHCSA EGEWLVPIGK CMCKAGYEEK
epha5_chick  GADSSQLLEV SGVCVNHSVT DEAPKMHCSA EGEWLVPIGK CLCKAGYEEK
epha5_mouse  GADSSQLLEV SGSCVNHSVT DDPPKMHCSA EGEWLVPIGK CMCKAGYEEK

301-350
epha5_human  NGTCQVCRPG FFKASPHIQS CGKCPPHSYT HEEASTSCVC EKDYFRRESD
epha5_rat    NGTCQVCRPG FFKASPHSQT CSKCPPHSYT HEEASTSCVC EKDYFRRESD
epha5_chick  NNTCQVCRPG FFKASPHSPS CSKCPPHSYT LDEASTSCLC EEHYFRRESD
epha5_mouse  NGTCQ..... .......... .......... .......... ..........

351-400
epha5_human  PPTMACTRPP SAPRNAISNV NETSVFLEWI PPADTGGRKD VSYYIACKKC
epha5_rat    PPTMACTRPP SAPRNAISNV NETSVFLEWI PPADTGGGKD VSYYILCKKC
epha5_chick  PPTMACTRPP SAPRSAISNV NETSVFLEWI PPADTGGRKD VSYYIACKKC
epha5_mouse  .......... .......... .......... .......... ..........

401-450
epha5_human  NSHAGVCEEC GGHVRYLPRQ SGLKNTSVMM VDLLAHTNYT FEIEAVNGVS
epha5_rat    NSHAGVCEEC GGHVRYLPQQ IGLKNTSVMM ADPLAHTNYT FEIEAVNGVS
epha5_chick  NSHSGLCEAC GSHVRYLPQQ TGLKNTSVMM VDLLAHTNYT FEIEAVNGVS
epha5_mouse  .......... .......... .......... .......... ..........

451-500
epha5_human  DLSPGARQYV SVNVTTNQAA PSPVTNVKKG KIAKNSISLS WQEPDRPNGI
epha5_rat    DLSPGTRQYV SVNVTTNQAA PSPVTNVKKG KIAKNSISLS WQEPDRPNGI
epha5_chick  DQNPGARQFV SVNVTTNQAA PSPVSSVKKG KITKNSISLS WQEPDRPNGI
epha5_mouse  .......... .........A PSPVTNVKKG KIAKNSISLS WQEPDRPNGI

501-550
epha5_human  ILEYEIKHFE KDQETSYTII KSKETTITAE GLKPASVYVF QIRARTAAGY
epha5_rat    ILEYEIKYFE KDQETSYTII KSKETTITAE GLKPASVYVF QIRARTAAGY
epha5_chick  ILEYEIKYFE KDQETSYTII KSKETAITAD GLKPGSAYVF QIRARTAAGY
epha5_mouse  ILEYEIKYFE KDQETSYTII KSKETSITAE GLKPASVYVF QIRARTAAGY

551-600
epha5_human  GVFSRRFEFE TTPV.FAASS DQSQIPVIAV SVTVGVILLA VVIGVLLSGS
epha5_rat    GVFSRRFEFE TTPV.FGASN DQSQIPIIGV SVTVGVILLA VMIGFLLSGS
epha5_chick  GGFSRRFEFE TSPV.LAASS DQSQIPIIVV SVTVGVILLA VVIGFLLSGS
epha5_mouse  GVFSRRFEFE TTPVSVAASN DQSQIPIIAV SVTVGVILLA VMIGFLLSGS

601-650
epha5_human  CCECGCGRAS SLCAVAHPIL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV
epha5_rat    CCECGCGRAS SLCAVAHPSL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV
epha5_chick  CCDHGCGWAS SLRAVAYPSL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV
epha5_mouse  CCDCGCGRAS SLCAVAHPSL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV

651-700
epha5_human  RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGRLKLP
epha5_rat    RTYIDPHTYE DPTQAVHEFG KEIEASCITI ERVIGAGEFG EVCSGRLKLP
epha5_chick  RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGRLKLQ
epha5_mouse  RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGCLKLP

701-750
epha5_human  GKRELPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK
epha5_rat    GKRELPVATK TLKVGYTEKQ RRDFLSEASI MGQFDHPNII HLEGVVTKSK
epha5_chick  GKREFPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK
epha5_mouse  GKRELPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK

751-800
epha5_human  PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGISAG MKYLSDMGYV
epha5_rat    PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIAAG MKYLSDMGYV
epha5_chick  PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIASG MKYLSDMGYV
epha5_mouse  PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIAAG MKYLSDMGYV

801-850
epha5_human  HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP
epha5_rat    HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP
epha5_chick  HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP
epha5_mouse  HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP

851-900
epha5_human  EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_rat    EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_chick  EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL
epha5_mouse  EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

901-950
epha5_human  PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN
epha5_rat    PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_chick  PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN
epha5_mouse  PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

951-1000
epha5_human  ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV
epha5_rat    ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_chick  ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV
epha5_mouse  ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

1001-1041
epha5_human  AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L
epha5_rat    AQVTLE.... .......... .......... .......... .
epha5_chick  AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L
epha5_mouse  AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V

EPHA5_RAT: ephrin type-A receptor 5 precursor [1005 residues]
EPHA5_HUMAN: ephrin type-A receptor 5 precursor [1037 residues]

EPHA5_RAT contains a C-terminally truncated SAM_1 domain, although it is not annotated as a fragment in Swiss-Prot. Notably, the orthologs from mouse, human and chicken contain an intact SAM_1 domain.
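The truncation noted above can be read off the last alignment block. The following is a minimal sketch of that reading, not the MisPred implementation: it counts the trailing gap columns of each row and flags a sequence whose gap run is much longer than that of every ortholog. The function names and the `min_missing` threshold are illustrative assumptions; the sequences are the final block (columns 1001-1041) of the alignment.

```python
# Final alignment block (columns 1001-1041) from the slide; '.' marks a gap.
LAST_BLOCK = {
    "epha5_human": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L",
    "epha5_rat":   "AQVTLE.... .......... .......... .......... .",
    "epha5_chick": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L",
    "epha5_mouse": "AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V",
}


def trailing_gap_columns(row, gap_chars=".~-"):
    """Number of gap columns at the C-terminal end of an aligned row."""
    cols = row.replace(" ", "")
    return len(cols) - len(cols.rstrip(gap_chars))


def flag_c_terminal_truncation(block, min_missing=10):
    """Names whose trailing gap run exceeds that of every other row
    by at least min_missing columns (hypothetical threshold)."""
    gaps = {name: trailing_gap_columns(row) for name, row in block.items()}
    flagged = []
    for name, own in gaps.items():
        others = [g for other, g in gaps.items() if other != name]
        if others and own - max(others) >= min_missing:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    print(flag_c_terminal_truncation(LAST_BLOCK))
```

Comparing against all orthologs rather than a single reference keeps one unusual sequence from deciding the outcome; a real check would also have to consider genuine lineage-specific differences.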

851-900
epha5_rat_corrected  EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_rat            EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_human          EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_chick          EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL
epha5_mouse          EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

901-950
epha5_rat_corrected  PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_rat            PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_human          PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN
epha5_chick          PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN
epha5_mouse          PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

951-1000
epha5_rat_corrected  ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_rat            ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_human          ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV
epha5_chick          ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV
epha5_mouse          ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

1001-1042
epha5_rat_corrected  AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP V*
epha5_rat            AQVTLE~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~
epha5_human          AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~
epha5_chick          AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~
epha5_mouse          AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V~
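One way to sanity-check a corrected sequence is to see how well its restored C-terminus agrees with the orthologs. The sketch below is an illustration only: it counts identical residues over the columns where both aligned rows carry a residue. The `identity` helper and the choice to skip the gap and stop symbols (`.`, `~`, `*`) are assumptions made for this example; the rows are the final block (columns 1001-1042) of the alignment above.

```python
# Final block (columns 1001-1042) of the alignment with the corrected rat sequence.
CORRECTED = "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP V*"
ORTHOLOGS = {
    "epha5_rat":   "AQVTLE~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~",
    "epha5_human": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~",
    "epha5_chick": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~",
    "epha5_mouse": "AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V~",
}

# Symbols that do not represent a residue: gaps, padding, stop codon.
NON_RESIDUE = set(".~*")


def identity(row_a, row_b):
    """Return (identical, compared) over columns where both rows have a residue."""
    a = row_a.replace(" ", "")
    b = row_b.replace(" ", "")
    identical = compared = 0
    for x, y in zip(a, b):
        if x in NON_RESIDUE or y in NON_RESIDUE:
            continue
        compared += 1
        if x == y:
            identical += 1
    return identical, compared


if __name__ == "__main__":
    for name, row in ORTHOLOGS.items():
        same, total = identity(CORRECTED, row)
        print(f"{name}: {same}/{total} identical")
```

The original rat entry can be compared over only six columns in this block, whereas the corrected sequence is comparable to human, chicken and mouse over forty.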

Conclusions from MisPred analyses of various databases
The number of UniProtKB/Swiss-Prot entries identified by MisPred as erroneous is very low, attesting to both the high quality of this manually curated database and the reliability of the MisPred approach.
In the case of UniProtKB/TrEMBL, MisPred identified a large proportion of entries as erroneous, the majority of which were missing signal peptides or suffered from domain size deviation. This is due primarily to the fact that these TrEMBL entries are translated in silico from non-full-length cDNAs.
In the case of the EnsEMBL- and NCBI/GNOMON-predicted sequences, MisPred identified ~3-4% of human sequences as erroneous. Here too, the majority of errors were identified on the basis of missing signal peptides and domain size deviation, probably reflecting the influence of non-full-length or abnormal cDNAs on gene predictions.
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008 Aug 27;9:353.
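The two error classes named above, missing signal peptides and domain size deviation, can be pictured as simple rule checks. The sketch below is a hedged illustration and not the MisPred code: the `DomainHit` record, the rule that an extracellular domain requires a signal peptide or a transmembrane segment, and the `min_fraction` cut-off are all assumptions introduced for this example, as are the lengths used in the demonstration.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    family: str           # domain family name, e.g. "SAM_1"
    observed_length: int  # residues matched in the predicted protein
    typical_length: int   # typical length of the family (hypothetical input)
    extracellular: bool   # whether the family is assumed to be extracellular


def domain_size_deviates(hit, min_fraction=0.6):
    """Flag a domain much shorter than is typical for its family.
    min_fraction is a hypothetical cut-off chosen for illustration."""
    return hit.observed_length < min_fraction * hit.typical_length


def check_protein(has_signal_peptide, has_tm_segment, hits):
    """Return a list of suspected problems for one predicted protein."""
    problems = []
    needs_secretion = any(h.extracellular for h in hits)
    if needs_secretion and not (has_signal_peptide or has_tm_segment):
        problems.append("extracellular domain without secretory signal")
    for h in hits:
        if domain_size_deviates(h):
            problems.append(f"domain size deviation: {h.family}")
    return problems


if __name__ == "__main__":
    # Illustrative values only; the lengths are not taken from the source.
    truncated = [DomainHit("SAM_1", 30, 60, False)]
    print(check_protein(False, False, truncated))
```

Both checks look only at the predicted protein and prior knowledge about domain families, which is why they can be applied to database entries without experimental evidence.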

Conclusions from MisPred analyses of various databases
Application of the MisPred tools to GENCODE peptides revealed that many of the potential alternative gene products encode proteins that are likely to be mislocalized and/or misfolded, suggesting that they do not have a role as functional proteins.
Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, López G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Størling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramírez F, Schlicker A, Denoeud F, Jones P, Kerrien S, Orchard S, Antonarakis SE, Reymond A, Birney E, Brunak S, Casadio R, Guigo R, Harrow J, Hermjakob H, Jones DT, Lengauer T, Orengo CA, Patthy L, Thornton JM, Tramontano A, Valencia A. The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500. Epub 2007 Mar 19.

Tress et al., Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500.

Although large-scale whole-genome analyses have shown that mammalian transcriptomes are made of a swarming mass of different overlapping transcripts, there is little evidence that most of this transcript complexity leads to protein complexity. The average of 5.7 transcripts per coding locus annotated in GENCODE translates into only 1.7 proteins per locus (since a large fraction of transcript variation corresponds to non-coding transcripts or accumulates in the UTRs of coding transcripts). Moreover, if the GENCODE proteins flagged as problematic by protein assessment methods such as MisPred are ignored, there are barely 1.3 annotated proteins per locus. The discrepancy between a complex, variable and largely unexplored population of RNA molecules and a relatively small, stable and well-defined population of proteins constitutes one of the challenges that Molecular Biology needs to address to fully elucidate cellular function.
Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE and Guigó R. Identifying protein-coding genes in genomic sequences. Genome Biology 2009, 10:201.
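The figures quoted above can be put side by side with a few lines of arithmetic. This is only a worked illustration of the three per-locus averages given in the text; the variable names are invented here, and reading the drop from 1.7 to 1.3 as a share of flagged proteins assumes that the number of loci stays the same.

```python
# Per-locus averages quoted in the text (GENCODE annotation).
transcripts_per_locus = 5.7
proteins_per_locus = 1.7
proteins_after_filtering = 1.3  # after ignoring proteins flagged as problematic

# Distinct proteins per annotated transcript.
coding_ratio = proteins_per_locus / transcripts_per_locus

# Relative drop in proteins per locus once flagged proteins are ignored
# (assumes the set of loci is unchanged by the filtering).
flagged_share = 1 - proteins_after_filtering / proteins_per_locus

if __name__ == "__main__":
    print(f"proteins per transcript: {coding_ratio:.2f}")
    print(f"share removed by filtering: {flagged_share:.2f}")
```

On these numbers, roughly three in ten annotated transcripts correspond to a distinct protein, and about a quarter of those proteins are set aside by the quality filters.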

László Bányai, Krisztina Farkas, Hédi Hegyi, Evelin Kozma, Alinda Nagy, Hedvig Tordai
This work was carried out as part of the BioSapiens project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265. The authors also acknowledge partial support from the National Office for Research and Technology under grant no. eScience RET14/2005.
