Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions László Patthy Institute of Enzymology, Budapest Darwin Day Collegium.


1 Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions László Patthy Institute of Enzymology, Budapest Darwin Day Collegium Budapest February 9.

2 In the last decade the genomes of numerous organisms have been sequenced; however, converting raw genome sequence data into biological knowledge remains a difficult task. Genome annotation, the process that maps biological knowledge onto the relevant genome elements, requires defining the positions of all protein-coding (and non-coding) genes along the genome sequence and identifying their coding regions, regulatory sequences, promoters, etc.

3 Although a large number of programs have been developed for computational gene identification, correct prediction of the structure of all protein-coding genes of higher eukaryotes is still an elusive goal. The uncertainties associated with gene finding are illustrated by the fact that, eight years after the publication of the draft genome sequence (2001), the exact number of protein-coding genes in the human genome is still unknown.

4 Finishing the euchromatic sequence of the human genome. Nature Oct 21;431(7011):

5 Proc Natl Acad Sci U S A Dec 4;104(49):

6 Since direct evidence of protein existence is generally absent, the criterion often employed to annotate a transcript as protein-coding is the existence of an Open Reading Frame (ORF). However, this criterion has recently been called into question by several methods developed to assess the quality of protein-coding gene annotations.
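
The ORF criterion mentioned above can be illustrated with a minimal sketch. This is not any annotation pipeline's method: the function name `longest_orf` is hypothetical, only the forward strand of an already spliced transcript is scanned, and an ORF is assumed to run from the first ATG in a frame to the next in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG...stop ORF found in the three forward frames.

    Illustrative assumption: the input is a spliced transcript (no introns),
    and only the forward strand is considered.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best
```

As the slide notes, finding such an ORF does not by itself show that a transcript encodes a functional protein; random sequence also contains ORFs.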

7 The rationale of the method of Clamp et al. is that functional protein-coding genes are subject to purifying selection and are therefore expected to show evolutionary conservation. The authors used two measures to assess the evolutionary conservation of predicted human genes: reading frame conservation (RFC, based on the observation that indels do not significantly affect the size of functional proteins) and codon substitution frequency (CSF, based on the observation that the patterns of nucleotide substitution in functional protein-coding genes are different from those observed in random DNA). Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A Dec 4;104(49):
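
The idea behind reading frame conservation can be sketched as follows. This is a simplified reading of the concept, not the implementation of Clamp et al.: the function name and the exact scoring (fraction of ungapped alignment columns whose two nucleotides occupy the same codon position, maximised over the three frame offsets) are assumptions made for illustration.

```python
def reading_frame_conservation(aln_a: str, aln_b: str) -> float:
    """Fraction of ungapped alignment columns in which both nucleotides
    occupy the same codon position, maximised over the three possible
    frame offsets of sequence b. Gaps are written as '-'."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    best = 0.0
    for offset in range(3):
        pos_a = pos_b = 0  # ungapped positions consumed so far
        same = total = 0
        for a, b in zip(aln_a, aln_b):
            if a != "-" and b != "-":
                total += 1
                if pos_a % 3 == (pos_b + offset) % 3:
                    same += 1
            if a != "-":
                pos_a += 1
            if b != "-":
                pos_b += 1
        if total:
            best = max(best, same / total)
    return best
```

Under this scoring, an indel whose length is a multiple of three leaves the frame intact (score 1.0), whereas a single-nucleotide indel shifts the frame downstream and lowers the score.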

8 In their analysis of a number of human gene reference sets, Clamp et al. identified ~1200 human “orphans”: ORFs that lack homology to known genes. Both RFC and CSF analyses revealed that many of these human orphans behave in a way that is essentially indistinguishable from matched random controls and very different from that observed for nonorphan protein-coding genes. From this, the authors concluded that overall about 15% of the entries in the gene catalogues investigated are not valid protein-coding genes. Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A Dec 4;104(49):

9 While the quality control method of Clamp et al. can distinguish protein-coding genes from noncoding sequences, it is less suited to identifying gene predictions that are only partially correct. If an annotated gene misses one or more exons, or a fraction of one exon, it may still exhibit the expected evolutionary characteristics of protein-coding genes.

10 Indeed, in addition to uncertainties about the number of protein-coding genes, a very serious problem is that the structure of a significant proportion of human genes is incorrectly predicted. According to recent analyses, the predicted genomic structure is estimated to be correct for only about half of the predicted human genes. Obviously, erroneous prediction of the structure of protein-coding genes leads to serious problems in predicting the structure and function of the proteins they encode and hinders the identification of the elements that regulate their expression.

11 A recent study has systematically compared the performance of various computational methods for predicting human protein-coding genes. A set of well-annotated ENCODE sequences was blind-analyzed with the different gene-finding programs, and the predictions obtained were compared with the annotations. Predictions were analyzed at the nucleotide, exon, transcript and gene levels to evaluate how well they reproduce the annotation. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7 Suppl 1:S Epub 2006 Aug 7. Review.

12 The computational methods compared were classified as: 1) EST-, mRNA- and protein-based methods (AUGUSTUS-EST, PARAGON+NSCAN_EST, ACEVIEW, ENSEMBL, EXOGEAN, EXONHUNTER, ACEMBLY, ECGene, MGCGene); 2) single-genome ab initio methods (AUGUSTUSabinit, GENEMARKhmm, GENEZILLA, GENEID, GENESCAN); 3) dual- or multiple-genome-based comparative genomic methods (AUGUST-dual, ACESCAN, DOGFISH-C, NSCAN, SAGA, MARS, SGP2, TWINSCAN); 4) complex methods using any type of available information (AUGUSTUSany, FGENESH++, JIGSAW, PARAGONany, CCDSGene, KNOWNGene, REFSEQ).

13 At all levels, two basic measures were computed: (i) sensitivity (Sn), the proportion of annotated features (nucleotide, exon, gene) that have been correctly predicted; (ii) specificity (Sp), the proportion of predicted features that are correct. The average of sensitivity and specificity, (Sn + Sp)/2, was also calculated for each program.
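
The two measures defined above can be sketched in a few lines. The function name and the representation of features as `(start, end)` tuples are assumptions made for illustration; note that specificity as defined on this slide (correct predictions divided by all predictions) is what is often called precision elsewhere.

```python
def sensitivity_specificity(annotated: set, predicted: set) -> tuple[float, float, float]:
    """Return (Sn, Sp, (Sn + Sp) / 2) for two sets of features.

    A feature (here, for example, an exon given as a (start, end) tuple)
    counts as correctly predicted only if it appears in both sets.
    """
    correct = annotated & predicted
    sn = len(correct) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


# Hypothetical exon coordinates, invented for the example.
annotated_exons = {(100, 200), (300, 400), (500, 600), (700, 800)}
predicted_exons = {(100, 200), (300, 400), (450, 600)}
sn, sp, avg = sensitivity_specificity(annotated_exons, predicted_exons)
```

In this invented example two of the four annotated exons are recovered (Sn = 0.5) and two of the three predicted exons are correct (Sp = 2/3); the third prediction has a wrong 5' boundary and therefore counts as wrong at the exon level.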

14 Guigo et al., Genome Biol. 2006;7 Suppl 1:S [Figure: gene feature projection for evaluation of the accuracy of predictions. A PREDICTION track is compared with a KNOWN (annotated) track; missing exons and wrong exons are marked.]

15 Guigo et al., Genome Biol. 2006;7 Suppl 1:S Gene prediction accuracy at the transcript level. [Figure: boxplots of the average of sensitivity and specificity, (Sn + Sp)/2, for each program.] A transcript is accurately predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.
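
The transcript-level criterion in the caption above amounts to requiring that every coding-exon boundary matches. A minimal sketch, assuming each transcript is represented as a list of `(start, end)` coordinates of its coding exon segments (the function name and representation are illustrative, not taken from EGASP):

```python
def transcript_is_correct(predicted_cds: list, annotated_cds: list) -> bool:
    """True only if all coding-exon boundaries agree.

    The start of the first segment is the translation start, the end of the
    last segment is the translation end, and all internal boundaries are
    splice sites, so exact equality of the segment lists covers every
    condition in the criterion.
    """
    return sorted(predicted_cds) == sorted(annotated_cds)


# Hypothetical coordinates: the second prediction misplaces one splice site by one base.
annotated = [(100, 200), (300, 400), (500, 600)]
exact = [(100, 200), (300, 400), (500, 600)]
shifted = [(100, 200), (301, 400), (500, 600)]
```

This illustrates why the transcript level is the most stringent: a single misplaced splice site makes the whole transcript count as incorrect, even if nucleotide-level accuracy is nearly perfect.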

16 Guigo et al., Genome Biol. 2006;7 Suppl 1:S These studies have revealed that: i) none of the strategies produced perfect predictions; ii) prediction methods that rely on mRNA and protein sequences, and those that use combined information (including expressed sequence information), were generally the most accurate; iii) the dual- or multiple-genome methods were more accurate than the single-genome ab initio prediction methods; iv) at the transcript level (the most stringent criterion), no prediction method correctly identified more than 45% of the coding transcripts.

17 The MisPred project The implicit question is: are there signs that could indicate that the predicted structure of a protein-coding gene may be incorrect? The rationale of our MisPred project is that a protein-coding gene is suspected to be mispredicted if some of its features conflict with our current knowledge about protein-coding genes and proteins. Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics Aug 27;9:353. MisPred: Database of mispredicted and abnormal proteins;

18 Several quality control tools of MisPred address the question of whether the predicted protein is able to reach the cellular compartment where it can be properly folded, stable and functional. The rationale of these tools is that protein domains have adapted to different subcellular compartments during evolution and are usually misfolded, unstable and non-functional if mislocalized. Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics Aug 27;9:353.

19 As a corollary, in multidomain proteins domain types do not co-occur at random: (i) extracellular proteins use domains adapted to the extracellular milieu; (ii) extracellular and intracellular domains can co-occur only in transmembrane proteins; (iii) nuclear and extracellular domains do not co-occur in a single protein, etc. [Figure: domain co-occurrence network of metazoan multidomain proteins; node classes: extracellular domain, cytoplasmic signalling domain, nuclear domain.] Tordai H, Nagy A, Farkas K, Bányai L, Patthy L. Modules, multidomain proteins and organismic complexity. FEBS J Oct;272(19):

20 Some mislocalization-based MisPred tools used for the identification of abnormal or mispredicted proteins: (i) conflict between the presence of extracellular domains and the absence of the appropriate sequence signals; (ii) conflict between the presence of extracellular and intracellular signalling domains and the absence of transmembrane domains; (iii) co-occurrence of extracellular and nuclear domains.
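
The three conflict types listed above can be expressed as simple rules over a protein's predicted features. The sketch below is not the MisPred implementation: the class, field and label names are hypothetical, and it assumes that signal peptides, transmembrane segments and domain classes have already been predicted by other tools. "Appropriate sequence signals" is interpreted here as a signal peptide or a transmembrane segment.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinAnnotation:
    """Hypothetical container for features predicted by external tools."""
    has_signal_peptide: bool = False
    has_transmembrane_segment: bool = False
    # Assumed labels: "extracellular", "cytoplasmic_signalling", "nuclear"
    domain_classes: set = field(default_factory=set)


def mislocalization_conflicts(p: ProteinAnnotation) -> list[str]:
    """Return the rule violations that mark a protein as suspect."""
    conflicts = []
    d = p.domain_classes
    # (i) extracellular domains need a secretory signal or a membrane anchor
    if "extracellular" in d and not (p.has_signal_peptide or p.has_transmembrane_segment):
        conflicts.append("extracellular domain without signal peptide or TM segment")
    # (ii) extracellular and cytoplasmic signalling domains need a TM segment
    if {"extracellular", "cytoplasmic_signalling"} <= d and not p.has_transmembrane_segment:
        conflicts.append("extracellular and cytoplasmic domains without TM segment")
    # (iii) extracellular and nuclear domains should not co-occur
    if {"extracellular", "nuclear"} <= d:
        conflicts.append("extracellular and nuclear domains co-occur")
    return conflicts
```

A protein with an empty result is not thereby shown to be correct; the rules only flag predictions that contradict current knowledge about protein localization.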

21 Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals. Rationale: proteins containing domains that occur exclusively in the extracellular space (e.g. in secreted extracellular proteins, in the extracellular part of type I, type II and type III single-pass transmembrane proteins, or in multispanning transmembrane proteins) have a cleavable signal peptide at the N-terminal end and/or transmembrane segments. Accordingly, proteins that contain extracellular domains but lack both a signal peptide and transmembrane segments are considered abnormal. [Figure: domain architectures of example proteins, with signal peptide (SP) and transmembrane (TM) labels: latrophilin-2, complement factor masp-3, leukocyte activation antigen m6, receptor tyrosine kinase-like orphan receptor 2, killer cell lectin-like receptor.]

22 [Figure: FASTA alignment of the EnsEMBL-predicted peptide (enst pep) with UNI_TREMBL:Q8N708 (HF1 protein, 449 aa). Smith-Waterman score: 3167; 99.5% identity in a 430 aa overlap (1-430:20-449); E(): 1.1e-195. The predicted peptide starts at residue 20 of Q8N708, so the N-terminal 19 residues of Q8N708 (MRLLAKIICLMLWAICVAE) are absent from the prediction.] Q8N708 (SP): CORRECT. ENSP: MISPREDICTED. Complement factor H, isoform b.

23 [Figure: FASTA alignment of the EnsEMBL-predicted peptide (enst pep) with UNI_SPROT:IL2A_HUMAN (P01589, interleukin-2 receptor alpha chain precursor, 272 aa). Smith-Waterman score: 637; 100.0% identity in a 92 aa overlap (1-92:31-122); E(): 3.2e-34. The predicted peptide starts at residue 31 of IL2A_HUMAN and diverges from it after the overlap.] P01589 (SP, TM): CORRECT. ENSP: MISPREDICTED. Interleukin-2 receptor alpha chain precursor.

24 [Figure: FASTA alignment of the EnsEMBL-predicted peptide (enst pep) with UNI_SPROT:209L_HUMAN (Q9H2X3, 399 aa). Smith-Waterman score: 2034; 82.2% identity in a 399 aa overlap (1-332:1-399); E(): 2.4e-91. The 332-residue prediction aligns with gaps: several segments of 209L_HUMAN, the first beginning after residue 43, have no counterpart in the predicted peptide.] Q9H2X3 (TM): CORRECT. ENSP: MISPREDICTED. CD209 antigen-like protein 1.

25 [Figure: alignment of CADH2_HUMAN [906 residues] with AC _002 [191 residues], shown in blocks of 50 residues. The shorter entry matches only a short stretch of CADH2_HUMAN: the match begins within the second 50-residue block and ends within the fifth, so both the N-terminal region and the entire C-terminal remainder of CADH2_HUMAN are absent from it.] CADH2_HUMAN (SP, TM): CORRECT. AC _002: ABNORMAL. Cadherin-2 precursor.

26 Conflict between the presence of extracellular and intracellular signalling domains and the absence of transmembrane domains. Rationale: extracellular domains and intracellular signalling domains can co-occur in multidomain proteins only if transmembrane segments separate these two types of domains. Accordingly, proteins that contain extracellular and intracellular signalling domains but lack a transmembrane segment separating them are considered abnormal. [Figure: domain co-occurrence network of metazoan multidomain proteins (extracellular, cytoplasmic signalling and nuclear modules); Tordai et al., FEBS J. 2005; 272(19):. Example architecture: receptor tyrosine kinase-like orphan receptor 2, with SP, TM and KR labels.]

27 ENSXETP (Xenopus tropicalis) is erroneous since it lacks a transmembrane segment although it contains both extracellular and cytoplasmic signaling domains.

28 [Figure: BLAST alignment of ENSXETP (Xenopus tropicalis; Query, residues 181-818) with EPHA7_CHICK (Sbjct, residues 361-991). The two sequences are highly similar throughout, except for Query residues 361-419, which align poorly with the corresponding region of the chicken sequence beginning at Sbjct residue 541.] The chicken ortholog of ENSXETP (Xenopus tropicalis), EPHA7_CHICK Ephrin type-A receptor 7 (np_990414), does contain a transmembrane segment. ENSXETP (Xenopus tropicalis) deviates most significantly from EPHA7_CHICK in this region.

29 The erroneous part of ENSXETP (Xenopus tropicalis) could be corrected by identifying the exons encoding the ‘missing’ transmembrane segment. [Figure: alignment of the original prediction (ensxetp), the corrected prediction (ensxetp _corrected) and the chicken sequence (np_). In the corrected sequence, the divergent region of the original prediction is replaced by a segment that closely matches the chicken sequence (IIIAVVAVAG TIILVFMVFG FIIGRRHCGY ...).]

30 Co-occurrence of extracellular and nuclear domains. Rationale: nuclear domains do not co-occur with extracellular domains in multidomain proteins. Accordingly, proteins that contain both extracellular and nuclear domains are considered abnormal. [Figure: domain co-occurrence network of metazoan multidomain proteins (extracellular, cytoplasmic signalling and nuclear modules); Tordai et al., FEBS J. 2005; 272(19):]

31 Mispredicted protein (Swiss-Prot entry) containing nuclear and extracellular domains: YL15_CAEEL, hypothetical homeobox protein C02F12.5 in chromosome X (MISPREDICTED). Exons belonging to tandem genes on C. elegans chromosome X have been incorrectly joined. [Figure: alignment of yl15_caeel with q619j1_caebr, with the Homeobox and Kunitz_BPTI domain regions marked. The N-terminal part of yl15_caeel has no counterpart in q619j1_caebr, whereas the remainder aligns closely with q619j1_caebr.]

32 Corrected predictions for the distinct constituent proteins containing the nuclear homeobox and extracellular KUNITZ_BPTI domains (CORRECT). [Figure: two alignments. Homeobox: yl15_caeel_corr aligned with hm07_caeel and q7qbz2_anoga. Kunitz_BPTI: yl15_caeel_corr1 aligned with q619j1_caebr.]

33 Another MisPred tool detects errors in gene prediction based on ‘Domain size deviation’. The rationale of this tool is that the highly cooperative, rapid folding of protein domains is the result of natural selection, therefore insertion/deletion of larger segments into/from protein domains may yield macromolecules that are unable to rapidly adopt a correctly folded, viable and stable three-dimensional structure. Accordingly, proteins containing domains that consist of a significantly larger or smaller number of residues than closely related members of the same family may be suspected to be unable to fold efficiently into a correctly folded, viable and stable domain/protein.
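
The domain size deviation idea can be sketched as an outlier test on domain length. The slide does not state the cutoff MisPred uses, so the function name, the use of mean and standard deviation, and the threshold `max_sd` are all assumptions made for illustration.

```python
from statistics import mean, stdev


def domain_size_deviates(length: int, family_lengths: list, max_sd: float = 2.0) -> bool:
    """Flag a domain whose length differs from the family mean by more than
    max_sd sample standard deviations (illustrative threshold).

    family_lengths: lengths of the same domain in closely related family
    members; at least two values are required by statistics.stdev.
    """
    mu = mean(family_lengths)
    sd = stdev(family_lengths)
    if sd == 0:
        return length != mu
    return abs(length - mu) > max_sd * sd


# Hypothetical lengths (in residues) of one domain family, invented for the example.
family = [110, 112, 108, 111, 109]
```

With these invented values, a 60-residue instance of the domain would be flagged as a candidate for a truncated or internally deleted prediction, while a 111-residue instance would pass.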

34 RP11-247A [544 aa] encodes an internally deleted Carn_acyltransf domain. [Figure: alignment of rp11-247a12 with CACP_HUMAN, Carnitine O-acetyltransferase [626 residues]. The two sequences are nearly identical over their shared length, but a contiguous internal region of CACP_HUMAN is missing from RP11-247A.] CACP_HUMAN: CORRECT. RP11-247A: ABNORMAL.

35 Structure of human carnitine acetyltransferase. [Figure: three-dimensional structure of human carnitine O-acetyltransferase (1NM8.pdb), with His 343 labelled.] The region highlighted in yellow is missing from transcript RP11-247A. This region also contains the catalytic residue His-343.

36 [Figure: multiple alignment of the ephrin type-A receptor 5 sequences epha5_human, epha5_rat, epha5_chick and epha5_mouse. The epha5_rat sequence ends at ...AQVTLE, whereas the human, chicken and mouse sequences continue for more than 30 further residues.] EPHA5_RAT: ephrin type-a receptor 5 precursor [1005 residues]. EPHA5_HUMAN: ephrin type-a receptor 5 precursor [1037 residues]. EPHA5_RAT contains a C-terminally truncated SAM_1 domain, although it is not annotated as a fragment by SwissProt. It is noteworthy that orthologs from mouse, human and chicken contain an intact SAM_1 domain.

37
epha5_rat_corrected EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_rat           EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_human         EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_chick         EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL
epha5_mouse         EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

epha5_rat_corrected PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_rat           PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_human         PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN
epha5_chick         PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN
epha5_mouse         PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

epha5_rat_corrected ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_rat           ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_human         ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV
epha5_chick         ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV
epha5_mouse         ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

epha5_rat_corrected AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP V*
epha5_rat           AQVTLE~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~
epha5_human         AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~
epha5_chick         AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~
epha5_mouse         AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V~

38 Conclusions from MisPred analyses of various databases
The number of UniProtKB/Swiss-Prot entries identified by MisPred as erroneous is very low, attesting to both the high quality of this manually curated database and the reliability of the MisPred approach.
In the case of UniProtKB/TrEMBL, MisPred identified a large proportion of entries as erroneous, the majority of which were missing signal peptides or suffered from domain size deviation. This is due primarily to the fact that these TrEMBL entries are translated in silico from non-full-length cDNAs.
In the case of the EnsEMBL- and NCBI/GNOMON-predicted sequences, MisPred identified ~3-4% of human sequences as erroneous. Here too, the majority of errors were identified on the basis of missing signal peptides and domain size deviation, probably reflecting the influence of non-full-length or abnormal cDNAs on gene predictions.
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics Aug 27;9:353.
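The domain size deviation criterion mentioned above can be illustrated with a small sketch. This is not MisPred's actual implementation: it simply assumes that a domain instance is suspicious when its length lies more than a chosen number of standard deviations from the mean length of known members of its family. The function name, the `max_sd` cutoff and the family lengths in the usage lines are hypothetical values chosen only for illustration.

```python
from statistics import mean, stdev


def deviates_in_size(domain_length, family_lengths, max_sd=2.0):
    """Return True if a domain's length is atypical for its family.

    family_lengths: lengths of trusted members of the same domain family
    (needs at least two values). max_sd is an assumed cutoff, not a value
    taken from MisPred.
    """
    mu = mean(family_lengths)
    sd = stdev(family_lengths)
    if sd == 0:
        # All reference domains have identical length: any difference deviates.
        return domain_length != mu
    return abs(domain_length - mu) > max_sd * sd


if __name__ == "__main__":
    # Hypothetical reference lengths for one domain family.
    reference = [64, 66, 65, 63, 67]
    print(deviates_in_size(65, reference))  # typical length
    print(deviates_in_size(30, reference))  # truncated domain
```

A truncated domain such as the one translated from a non-full-length cDNA would fall well below the reference lengths and be flagged by a test of this shape.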

39 Conclusions from MisPred analyses of various databases
Application of the MisPred tools to GENCODE peptides revealed that many of the potential alternative gene products encode proteins that are likely to be mislocalized and/or misfolded, suggesting that they do not have a role as functional proteins.
Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, López G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Størling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramírez F, Schlicker A, Denoeud F, Jones P, Kerrien S, Orchard S, Antonarakis SE, Reymond A, Birney E, Brunak S, Casadio R, Guigo R, Harrow J, Hermjakob H, Jones DT, Lengauer T, Orengo CA, Patthy L, Thornton JM, Tramontano A, Valencia A. The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci U S A Mar 27;104(13): Epub 2007 Mar 19.

40 Tress et al., Proc Natl Acad Sci U S A Mar 27;104(13):

41 Although large-scale whole-genome analyses have shown that mammalian transcriptomes are made of a swarming mass of different overlapping transcripts, little evidence exists that the majority of this transcript complexity leads to protein complexity. The average of 5.7 transcripts per coding locus annotated in GENCODE translates into only 1.7 proteins per locus (since a large fraction of transcript variation corresponds to non-coding transcripts or accumulates in the UTRs of coding transcripts). Moreover, if the GENCODE proteins flagged as problematic by protein assessment methods such as MisPred are ignored, there are barely 1.3 annotated proteins per locus. The discrepancy between a complex, variable and largely unexplored population of RNA molecules and a relatively small, stable and well-defined population of proteins constitutes one of the challenges that molecular biology needs to address to fully elucidate cellular function.
Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE and Guigó R. Identifying protein-coding genes in genomic sequences. Genome Biology 2009, 10:201
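The per-locus figures quoted above imply two simple ratios, worked out in the short sketch below. The three input numbers come from the slide; the derived quantities and the variable names are only this example's reading of them, under the assumption that the per-locus averages can be compared directly.

```python
# Per-locus averages quoted on the slide (GENCODE annotation).
transcripts_per_locus = 5.7
proteins_per_locus = 1.7
unflagged_proteins_per_locus = 1.3  # after ignoring proteins flagged as problematic

# Average number of annotated transcripts behind each distinct protein.
transcripts_per_protein = transcripts_per_locus / proteins_per_locus

# Share of annotated proteins per locus that drops out once flagged ones are ignored.
flagged_fraction = (
    proteins_per_locus - unflagged_proteins_per_locus
) / proteins_per_locus

if __name__ == "__main__":
    print(f"transcripts per distinct protein: {transcripts_per_protein:.2f}")
    print(f"fraction of proteins flagged: {flagged_fraction:.1%}")
```

On these numbers, roughly three or more transcripts stand behind each annotated protein, and close to a quarter of the annotated proteins per locus are set aside as problematic.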

42 László Bányai, Krisztina Farkas, Hédi Hegyi, Evelin Kozma, Alinda Nagy, Hedvig Tordai
This work was carried out as part of the BioSapiens project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT
The authors acknowledge partial support from the National Office for Research and Technology under grant no. eScience RET14/2005.


