Benchmarking Orthology in Eukaryotes Nijmegen Tim Hulsen
Summary (1) An introduction to orthology (2) Orthology determination methods (3) Benchmarking: –co-expression –conservation of co-expression –SwissProt name (4) Conclusions
An introduction to orthology (from
Orthology determination methods Orthology databases/methods: COG/KOG Inparanoid OrthoMCL Inclusiveness: one-to-one/one-to-many/many-to-many organisms Best bidirectional hit/Phylogenetic trees
Benchmarking orthology Quality of orthology difficult to test; no gold standard Orthologs should have highly similar functions Measuring conservation of function: –functional annotation –co-expression –domain structure
Benchmarked orthology determination methods BBH: Best Bidirectional Hit KOG: euKaryotic Orthologous Groups INP: INPARANOID MCL: OrthoMCL Z1H: All pairs with Z >= 100 COM: Comics Phylogenetic Tree Method EQN: Equal SwissProt Names
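Data set used ‘Protein World’: all proteins in all available (SPTREMBL) proteomes compared to each other Smith-Waterman with Z-value statistics: 100 randomized shuffles to test significance of SW score Example: original sequence (O.) MFTGQEYHSV; shuffle 1. GQHMSVFTEY; shuffle 2. YMSHQFTVGE; etc. (Figure: number of sequences vs. SW score for the randomized (rnd) sequences; an original (ori) score lying 5*SD above the mean of the randomized scores gives Z = 5)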
Data set used Z-value compensates for: –bias in amino acid composition –sequence length Proteomes used: –Human: 28,508 proteins –Mouse: 20,877 proteins 595,161,516 pairs
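The Z-value idea on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not the Protein World implementation: the scoring scheme (match/mismatch/linear gap), the choice to shuffle only the first sequence, and the names `sw_score` and `z_value` are all assumptions made for the example.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a toy scoring scheme
    (assumed here; a real run would use a substitution matrix)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def z_value(query, target, n_shuffles=100, seed=0):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores.
    Assumption: only the query is shuffled, composition and length kept."""
    rng = random.Random(seed)
    original = sw_score(query, target)
    residues = list(query)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(sw_score("".join(residues), target))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        return 0.0 if original == mean else float("inf")
    return (original - mean) / sd


if __name__ == "__main__":
    print(z_value("MFTGQEYHSV", "MFTGQEYHSV"))
```

Because each shuffle keeps the amino acid composition and the length of the original sequence, the shuffled scores form a background that already contains both effects, which is why the Z-value compensates for them.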
BBH method Easiest method: ‘best bidirectional hit’ Human protein (1) SW best hit in mouse (2) Mouse protein (2) SW best hit in human (3) If 3 equals 1, the human and mouse protein are considered to be orthologs 12,817 human-mouse orthologous pairs (12,817 human, 12,817 mouse proteins)
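The three steps above can be sketched as follows. The score dictionaries and protein identifiers are hypothetical toy data, and the function names are chosen for this example only; ties between equally good hits are not handled.

```python
def best_hits(scores):
    """Map each query to its best-scoring target.
    scores: {(query, target): similarity score}, higher is better."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_bidirectional_hits(human_vs_mouse, mouse_vs_human):
    """Keep (human, mouse) pairs where each is the other's best hit."""
    h2m = best_hits(human_vs_mouse)
    m2h = best_hits(mouse_vs_human)
    return sorted((h, m) for h, m in h2m.items() if m2h.get(m) == h)


if __name__ == "__main__":
    # Toy scores (assumed): h2's best hit is m1, but m1's best hit is h1.
    human_vs_mouse = {("h1", "m1"): 90, ("h1", "m2"): 40,
                      ("h2", "m1"): 70, ("h2", "m2"): 30}
    mouse_vs_human = {("m1", "h1"): 90, ("m1", "h2"): 70,
                      ("m2", "h1"): 40, ("m2", "h2"): 30}
    print(best_bidirectional_hits(human_vs_mouse, mouse_vs_human))
```

Each protein can appear in at most one pair, which matches the equal human and mouse protein counts reported on the slide.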
KOG method KOG: euKaryotic Orthologous Groups Eukaryotic version of COG, Clusters of Orthologous Groups COG method: –All-vs-all seq. comparison (BLAST) –Detect and collapse obvious paralogs: within-species hits (Sp1-Sp1, Sp2-Sp2) that are better than the between-species hit (Sp1-Sp2), e.g. E(Hs-Hs) < E(BBH) or E(Mm-Mm) < E(BBH), etc. for other species –Determine BBHs
KOG method –Detect triangles of best hits –Merge triangles with a common side to form COGs –Case-by-case ‘manual’ analysis, examination of large COGs (might be split up)
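The triangle steps can be illustrated with a small sketch. It covers only triangle detection and merging; paralog collapsing and the manual analysis are left out. The representation of a gene as a `(species, id)` tuple, the input format, and the function names are assumptions made for this example.

```python
from itertools import combinations


def find_triangles(bbh_pairs):
    """Find sets of three genes from three different species that are
    all best bidirectional hits of each other.
    bbh_pairs: iterable of (gene_a, gene_b); a gene is (species, id)."""
    neighbours = {}
    for a, b in bbh_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for a, nbrs in neighbours.items():
        for b, c in combinations(sorted(nbrs), 2):
            if c in neighbours[b] and len({a[0], b[0], c[0]}) == 3:
                triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles sharing a side (two genes) into groups (union-find)."""
    tris = list(triangles)
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    side_owner = {}
    for i, tri in enumerate(tris):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i
    groups = {}
    for i, tri in enumerate(tris):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    pairs = [(("Hs", "h1"), ("Mm", "m1")), (("Mm", "m1"), ("Dm", "d1")),
             (("Hs", "h1"), ("Dm", "d1")), (("Hs", "h1"), ("Ce", "c1")),
             (("Mm", "m1"), ("Ce", "c1"))]
    print(merge_triangles(find_triangles(pairs)))
```

In the toy input the two triangles share the Hs-Mm side, so they end up in a single group of four genes.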
KOG method KOG method mainly the same as COG method; special attention for eukaryotic multidomain structure Group orthologies: many-to-many Cognitor: assign a KOG to each protein (mouse not yet in KOG) 810,697 human-mouse orthologous pairs (20,478 human, 15,640 mouse proteins) Tatusov et al., “The COG database: an updated version includes eukaryotes”, BMC Bioinformatics Sep 11;4(1):41
INP method All-vs-all followed by a number of extra steps to add ‘in-paralogs’ many-to-many relations possible 54,553 human-mouse orthologous pairs (19,504 human, 17,030 mouse proteins) Remm et al., “Automatic clustering of orthologs and in-paralogs from pairwise species comparisons”, J Mol Biol Dec 14; 314(5):
MCL method All-vs-all BLASTP determine orthologs + ‘recent’ paralogs use Markov clustering to determine ortholog groups 7,322 human-mouse orthologous pairs (human 6,332, mouse 6,115 proteins) Li et al., “OrthoMCL: identification of ortholog groups for eukaryotic genomes”, Genome Res Sep;13(9):
Z1H method All human-mouse pairs with Z >= 100 in Protein World set are considered to be orthologs 290,176 human-mouse orthologous pairs (19,055 human, 16,149 mouse proteins)
COM method Pipeline: Human PROTEOME vs. all 9 eukaryotic PROTEOMES in Protein World -> SELECTION OF HOMOLOGS (Z>20, RH>0.5*QL): 24,263 groups -> ALIGNMENTS AND TREES -> PHYLOME -> TREE SCANNING -> LIST: Hs-Mm: 85,848 pairs, Hs-Dm: 55,934 pairs, etc.
COM method Example: BMP6 (Bone Morphogenetic Protein 6) 5 Hs-Mm orthologous relations defined
EQN method Consider all Hs-Mm pairs with equal SwissProt names to be orthologous e.g. ANDR_HUMAN ANDR_MOUSE Used as benchmark later on 5,214 Hs-Mm orthologous pairs (5,214 human, 5,214 mouse proteins)
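A minimal sketch of the equal-name rule, assuming SwissProt entry names of the form NAME_SPECIES as in the ANDR example. The function name and the toy entry lists are hypothetical.

```python
def equal_name_pairs(human_entries, mouse_entries):
    """Pair entries whose names are equal before the species suffix,
    e.g. ANDR_HUMAN with ANDR_MOUSE."""
    human = {e.rsplit("_", 1)[0] for e in human_entries if e.endswith("_HUMAN")}
    mouse = {e.rsplit("_", 1)[0] for e in mouse_entries if e.endswith("_MOUSE")}
    return sorted((f"{name}_HUMAN", f"{name}_MOUSE") for name in human & mouse)


if __name__ == "__main__":
    print(equal_name_pairs(["ANDR_HUMAN", "BMP6_HUMAN"],
                           ["ANDR_MOUSE", "P53_MOUSE"]))
```

Since a name occurs at most once per species, the resulting relations are strictly one-to-one, consistent with the equal protein counts on the slide.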
Benchmarking through co-expression Comparison of expression profiles of each orthologous gene pair Using GeneLogic Expressor data set (table columns: organism, samples, fragments, tissue categories, SNOMED tissue categories; rows: human, mouse)
Expression tissue categories
HUMAN                      | MOUSE
 1 Blood vessel            |  1 Blood vessel
 2 Cardiovascular system   |  2 Cardiovascular system
 3 Digestive organs        |  3 Digestive organs
 4 Digestive system        |  4 Digestive system
 5 Endocrine gland         |  -
 6 Female genital system   |  5 Female genital system
 7 Hematopoietic system    |  6 Hematopoietic system
 8 Integumentary system    |  7 Integumentary system
 9 Male genital system     |  8 Male genital system
10 Musculoskeletal system  |  9 Musculoskeletal system
11 Nervous system          | 10 Nervous system
12 Product of conception   |  -
13 Respiratory system      | 11 Respiratory system
14 Topographic region      |  -
15 Urinary tract           | 12 Urinary tract
Co-expression calculation Calculation of the correlation coefficient:
$$ r = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{\left(N \sum x^2 - (\sum x)^2\right)\left(N \sum y^2 - (\sum y)^2\right)}} $$
Measured over the 12 corresponding SNOMED tissue categories
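The correlation coefficient on this slide is the Pearson correlation written in its sums-of-products form. A minimal sketch follows; the function name is an assumption, and x and y stand for the expression values of the two orthologs over the corresponding tissue categories.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation via sums: N*sum(xy) - sum(x)*sum(y) over
    sqrt((N*sum(x^2) - sum(x)^2) * (N*sum(y^2) - sum(y)^2))."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if denom == 0:
        return float("nan")  # a constant profile has no defined correlation
    return (n * sxy - sx * sy) / denom


if __name__ == "__main__":
    print(pearson_r([1, 2, 3], [2, 4, 6]))
    print(pearson_r([1, 2, 3], [6, 4, 2]))
```

For an orthologous pair, x and y would each hold 12 values, one per shared tissue category.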
Co-expression example #1 High correlation:
Co-expression example #2 Low correlation:
Benchmarking through co-expression (results figure)
Benchmarking through conservation of co-expression Human: Gene A, Gene B; co-expression = Cab (-1<=corr.<=1) Mouse: Gene A’, Gene B’; co-expression = Ca’b’ Ca’b’ >= Cab: increases probability that A and B are involved in the same process (Co-expression calculated over 115 tissues in human, 25 in mouse) All-vs-all: Human: 40,678 chip fragments Mouse: 29,910 chip fragments
Benchmarking through conservation of co-expression Gene Ontology (GO) database: hierarchical system of function and location descriptions Orthologs are in same functional category when they are in the same 4th level GO Biological Process class
Benchmarking through conservation of co-expression
Benchmarking through SwissProt name How many of the predicted orthologous relations have equal SwissProt names? (EQN set in other benchmarks) + reliable because checked by hand - assumes only one-to-one relationships are possible
Benchmarking through SwissProt name (ALL: the result if all possible human-mouse pairs (or a random fraction of them) were considered orthologs)
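One way to express this benchmark in code is sketched below. The slides do not state the exact measure, so both quantities here (the fraction of predicted pairs with equal names, and the fraction of the EQN set that a method recovers) are assumptions, as are the function name and the toy identifiers.

```python
def swissprot_name_benchmark(predicted_pairs, eqn_pairs):
    """Return (fraction of predicted pairs that have equal names,
    fraction of EQN pairs recovered by the prediction)."""
    predicted = set(predicted_pairs)
    eqn = set(eqn_pairs)
    shared = predicted & eqn
    fraction_equal = len(shared) / len(predicted) if predicted else 0.0
    eqn_recovered = len(shared) / len(eqn) if eqn else 0.0
    return fraction_equal, eqn_recovered


if __name__ == "__main__":
    predicted = [("A_HUMAN", "A_MOUSE"), ("A_HUMAN", "B_MOUSE"),
                 ("C_HUMAN", "C_MOUSE"), ("D_HUMAN", "E_MOUSE")]
    eqn = [("A_HUMAN", "A_MOUSE"), ("C_HUMAN", "C_MOUSE"),
           ("F_HUMAN", "F_MOUSE")]
    print(swissprot_name_benchmark(predicted, eqn))
```

A many-to-many method is penalized on the first number by construction, since every extra partner of a protein necessarily has a different name; this is the one-to-one limitation noted above.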
Conclusions Hard to point out the ‘best’ orthology determination method In most cases: fewer predicted pairs = better quality, more predicted pairs = worse quality Method that should be used depends on research question: do you need few reliable orthologies or many less reliable orthologies? Future directions: look at conservation of domain structure as a benchmark
Credits Martijn Huynen Peter Groenen Comics Group Gert Vriend Rest of CMBI Organon Bioinf. Group