Benchmarking Orthology in Eukaryotes 12-01-2004 Nijmegen Tim Hulsen.

Presentation transcript:

Benchmarking Orthology in Eukaryotes Nijmegen Tim Hulsen

Summary
(1) An introduction to orthology
(2) Orthology determination methods
(3) Benchmarking:
– co-expression
– conservation of co-expression
– SwissProt name
(4) Conclusions

An introduction to orthology (from

Orthology determination methods
– Orthology databases/methods: COG/KOG, Inparanoid, OrthoMCL
– Inclusiveness: one-to-one/one-to-many/many-to-many
– Organisms
– Best bidirectional hit/Phylogenetic trees

Benchmarking orthology
– Quality of orthology is difficult to test; there is no gold standard
– Orthologs should have highly similar functions
– Measuring conservation of function:
  – functional annotation
  – co-expression
  – domain structure

Benchmarked orthology determination methods
BBH: Best Bidirectional Hit
KOG: euKaryotic Orthologous Groups
INP: INPARANOID
MCL: OrthoMCL
Z1H: All pairs with Z >= 100
COM: Comics Phylogenetic Tree Method
EQN: Equal SwissProt Names

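Data set used
‘Protein World’: all proteins in all available (SPTREMBL) proteomes compared to each other
Smith-Waterman with Z-value statistics: 100 randomized shuffles to test significance of SW score
Original (O): MFTGQEYHSV
Shuffle 1: GQHMSVFTEY
Shuffle 2: YMSHQFTVGE
etc.
[Figure: histogram of number of sequences (# seqs) against SW score for the randomized shuffles (rnd), with the original score (ori) marked at a distance of 5*SD → Z = 5]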
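The shuffle-based Z-value on this slide can be sketched as follows. This is a minimal illustration, not the Protein World implementation: the scoring scheme (match, mismatch, linear gap), the choice to shuffle only the second sequence, and the function names `sw_score` and `z_value` are all assumptions made for the example.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score.

    The toy scoring scheme (match/mismatch/linear gap) is an assumption;
    a real comparison would use a substitution matrix and affine gaps.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(query, subject, n_shuffles=100, seed=0):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    original = sw_score(query, subject)
    letters = list(subject)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition and length, destroys order
        random_scores.append(sw_score(query, "".join(letters)))
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        return 0.0  # Z is undefined here; reported as 0.0 in this sketch
    return (original - statistics.mean(random_scores)) / sd
```

Because the shuffled sequences keep the same residues and length as the original, the resulting Z-value is insensitive to amino acid composition and sequence length, which is the property the next slide refers to.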

Data set used
Z-value compensates for:
– bias in amino acid composition
– sequence length
Proteomes used:
– Human: 28,508 proteins
– Mouse: 20,877 proteins
→ 595,161,516 pairs

BBH method
Easiest method: ‘best bidirectional hit’
Human protein (1) → SW → best hit in mouse (2)
Mouse protein (2) → SW → best hit in human (3)
If (3) equals (1), the human and mouse protein are considered to be orthologs
12,817 human-mouse orthologous pairs (12,817 human, 12,817 mouse proteins)
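The best bidirectional hit rule can be sketched as below, assuming the pairwise scores are already available as dictionaries keyed by (query, subject). The function names and the protein identifiers in the usage example are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict {(query, subject): score}; higher score = better hit.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(human_vs_mouse, mouse_vs_human):
    """Return (human, mouse) pairs that are each other's best hit."""
    forward = best_hits(human_vs_mouse)   # human protein -> best mouse hit
    reverse = best_hits(mouse_vs_human)   # mouse protein -> best human hit
    return sorted((h, m) for h, m in forward.items() if reverse.get(m) == h)


# Hypothetical toy scores: hA and hB both prefer mA, but mA prefers hA.
human_vs_mouse = {("hA", "mA"): 90, ("hA", "mB"): 40,
                  ("hB", "mA"): 60, ("hB", "mB"): 30}
mouse_vs_human = {("mA", "hA"): 90, ("mA", "hB"): 60,
                  ("mB", "hA"): 40, ("mB", "hB"): 30}
pairs = bidirectional_best_hits(human_vs_mouse, mouse_vs_human)
```

Each protein can appear in at most one pair, which is why the method yields equal numbers of human and mouse proteins (one-to-one relations only).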

KOG method
KOG: euKaryotic Orthologous Groups
Eukaryotic version of COG, Clusters of Orthologous Groups
COG method:
– All-vs-all seq. comparison (BLAST)
– Detect and collapse obvious paralogs
[Diagram: comparisons Sp1-Sp1, Sp2-Sp2, Sp1-Sp2]
E(Hs-Hs) < E(BBH) → paralogs
E(Mm-Mm) < E(BBH) → paralogs
etc. for other species → determine BBHs

KOG method
– Detect triangles of best hits
– Merge triangles with a common side to form COGs
– Case-by-case ‘manual’ analysis, examination of large COGs (might be split up)
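The first two steps above (triangles of best hits, merged when they share a side) can be sketched as follows. This is an illustrative simplification, not the COG pipeline itself: it ignores which species each protein comes from, skips the paralog-collapsing and manual curation steps, and uses hypothetical function names.

```python
from itertools import combinations


def find_triangles(best_hit_pairs):
    """Return every set of three proteins in which all three pairs are best hits."""
    edges = {frozenset(pair) for pair in best_hit_pairs}
    nodes = sorted({node for edge in edges for node in edge})
    return [frozenset(trio) for trio in combinations(nodes, 3)
            if all(frozenset(side) in edges for side in combinations(trio, 2))]


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into groups."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    side_owner = {}  # side -> index of the first triangle seen with that side
    for i, triangle in enumerate(triangles):
        for side in combinations(sorted(triangle), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())
```

Two triangles that share only a single protein are deliberately not merged; the slide states that a common side is required.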

KOG method
KOG method is mainly the same as the COG method; special attention for eukaryotic multidomain structure
Group orthologies: many-to-many
Cognitor: assign a KOG to each protein (mouse not yet in KOG)
810,697 human-mouse orthologous pairs (20,478 human, 15,640 mouse proteins)
Tatusov et al., “The COG database: an updated version includes eukaryotes”, BMC Bioinformatics Sep 11;4(1):41

INP method
All-vs-all followed by a number of extra steps to add ‘in-paralogs’ → many-to-many relations possible
54,553 human-mouse orthologous pairs (19,504 human, 17,030 mouse proteins)
Remm et al., “Automatic clustering of orthologs and in-paralogs from pairwise species comparisons”, J Mol Biol Dec 14; 314(5):

MCL method
All-vs-all BLASTP → determine orthologs + ‘recent’ paralogs → use Markov clustering to determine ortholog groups
7,322 human-mouse orthologous pairs (6,332 human, 6,115 mouse proteins)
Li et al., “OrthoMCL: identification of ortholog groups for eukaryotic genomes”, Genome Res Sep;13(9):

Z1H method
All human-mouse pairs with Z >= 100 in the Protein World set are considered to be orthologs
290,176 human-mouse orthologous pairs (19,055 human, 16,149 mouse proteins)

COM method
[Flowchart labels: Human PROTEOME; all 9 eukaryotic PROTEOMES in Protein World; SELECTION OF HOMOLOGS (Z>20, RH>0.5*QL); 24,263 groups; ALIGNMENTS AND TREES; PHYLOME; TREE SCANNING; LIST]
Hs-Mm: 85,848 pairs
Hs-Dm: 55,934 pairs
etc.

COM method
Example: BMP6 (Bone Morphogenetic Protein 6) → 5 Hs-Mm orthologous relations defined

EQN method
Consider all Hs-Mm pairs with equal SwissProt names to be orthologous, e.g. ANDR_HUMAN and ANDR_MOUSE
Used as benchmark later on
5,214 Hs-Mm orthologous pairs (5,214 human, 5,214 mouse proteins)

Benchmarking through co-expression
Comparison of expression profiles of each orthologous gene pair
Using GeneLogic Expressor data set:
[Table, values not preserved: columns organism, samples, fragments, tissue categories, SNOMED tissue categories; rows human, mouse]

Expression tissue categories
HUMAN                        MOUSE
 1 Blood vessel               1 Blood vessel
 2 Cardiovascular system      2 Cardiovascular system
 3 Digestive organs           3 Digestive organs
 4 Digestive system           4 Digestive system
 5 Endocrine gland            -
 6 Female genital system      5 Female genital system
 7 Hematopoietic system       6 Hematopoietic system
 8 Integumentary system       7 Integumentary system
 9 Male genital system        8 Male genital system
10 Musculoskeletal system     9 Musculoskeletal system
11 Nervous system            10 Nervous system
12 Product of conception      -
13 Respiratory system        11 Respiratory system
14 Topographic region         -
15 Urinary tract             12 Urinary tract

Co-expression calculation
Calculation of the correlation coefficient:
$$ r = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{\left(N \sum x^2 - (\sum x)^2\right)\left(N \sum y^2 - (\sum y)^2\right)}} $$
Measured over the 12 corresponding SNOMED tissue categories
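A direct transcription of the correlation formula above might look like this. The function name `pearson_r` and the error handling are assumptions; x and y stand for the expression values of a human gene and its predicted mouse ortholog over the 12 shared tissue categories.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient, written as on the slide:

    r = (N*sum(xy) - sum(x)*sum(y))
        / sqrt((N*sum(x^2) - sum(x)^2) * (N*sum(y^2) - sum(y)^2))
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return (n * sum_xy - sum_x * sum_y) / denominator
```

A value near 1 means the two genes are expressed in the same tissues; a value near 0 or below means the expression patterns differ, which counts against the predicted orthology in this benchmark.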

Co-expression example #1 High correlation:

Co-expression example #2 Low correlation:

Benchmarking through co-expression
[Figure not preserved; markers - and +]

Benchmarking through conservation of co-expression
Human: Gene A, Gene B
Mouse: Gene A’, Gene B’
Co-expression = Cab (-1 <= corr. <= 1)
Ca’b’ >= Cab → Increases probability that A and B are involved in the same process
(Co-expression calculated over 115 tissues in human, 25 in mouse)
All-vs-all:
– Human: 40,678 chip fragments
– Mouse: 29,910 chip fragments
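One way to compute the two quantities compared on this slide, Cab in human and Ca’b’ in mouse, is sketched below. The data layout (dictionaries of expression profiles), the function names and the gene labels are assumptions for illustration; the human and mouse profiles may have different lengths because each correlation is computed within one species.

```python
from itertools import combinations
from math import sqrt


def correlation(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    numerator = n * sum(a * b for a, b in zip(x, y)) - sum_x * sum_y
    denominator = sqrt((n * sum(a * a for a in x) - sum_x ** 2)
                       * (n * sum(b * b for b in y) - sum_y ** 2))
    return numerator / denominator


def conserved_coexpression(human_expr, mouse_expr, orthologs):
    """For every two orthologous pairs (A, A') and (B, B'), return
    (A, B, C_ab in human, C_a'b' in mouse)."""
    results = []
    for (a, a_prime), (b, b_prime) in combinations(orthologs, 2):
        c_ab = correlation(human_expr[a], human_expr[b])
        c_apbp = correlation(mouse_expr[a_prime], mouse_expr[b_prime])
        results.append((a, b, c_ab, c_apbp))
    return results
```

For a given orthology method, the returned pairs of correlations can then be compared to see how often co-expression in human is retained in mouse.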

Benchmarking through conservation of co-expression
Gene Ontology (GO) database: hierarchical system of function and location descriptions
Orthologs are in the same functional category when they are in the same 4th level GO Biological Process class

Benchmarking through conservation of co-expression

Benchmarking through SwissProt name
How many of the predicted orthologous relations have equal SwissProt names? (EQN set in other benchmarks)
+ reliable because checked by hand
- assumes only one-to-one relationships are possible
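The SwissProt name benchmark can be sketched as a simple fraction, assuming entry names of the form NAME_SPECIES as in the ANDR_HUMAN example. The function names and the second pair in the usage example are illustrative only.

```python
def swissprot_name(entry_name):
    """Strip the species suffix: 'ANDR_HUMAN' -> 'ANDR'."""
    return entry_name.rsplit("_", 1)[0]


def fraction_equal_names(predicted_pairs):
    """Fraction of predicted (human, mouse) pairs whose SwissProt names match."""
    if not predicted_pairs:
        return 0.0
    equal = sum(1 for human, mouse in predicted_pairs
                if swissprot_name(human) == swissprot_name(mouse))
    return equal / len(predicted_pairs)


# Illustrative input: one pair with equal names, one without.
example_pairs = [("ANDR_HUMAN", "ANDR_MOUSE"), ("BMP6_HUMAN", "BMP7_MOUSE")]
example_fraction = fraction_equal_names(example_pairs)
```

Methods that allow many-to-many relations are penalized by this measure, since at most one mouse protein can share a name with a given human protein.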

Benchmarking through SwissProt name
(ALL: the result if all possible human-mouse pairs, or a random fraction of them, were considered orthologs)

Conclusions
– Hard to point out the ‘best’ orthology determination method
– In most cases: less=better, more=worse (methods that predict fewer pairs tend to score better in the benchmarks)
– Method that should be used depends on research question: do you need few reliable orthologies or many less reliable orthologies?
– Future directions: look at conservation of domain structure as a benchmark

Credits
Martijn Huynen
Peter Groenen
Comics Group
Gert Vriend
Rest of CMBI
Organon Bioinf. Group