Gene3D, Orthology and Homology-Based Inheritance of Protein-Protein Interactions Corin Yeats


The Gene3D Protein Family and Annotation Resource:
(1) Identify sequence homologues of CATH domains.
 - HMMs & hit resolution protocol DomainFinder.
 - UniProt, RefSeq, Ensembl (with generous help of SIMAP at MIPS).
(2) Integrate with sequence annotation resources.
 - Pfam, GO, KEGG, UniProt annotation, IntAct, STRING.
 - Flexible cross-resource comparisons, including CATH PDB domains.
(3) Import sequence families.
 - In-house OrthoFams, HAMAP, SIMAP clusters.

Defining Orthology:
[Diagram: {A} at the Last Common Ancestor; descendant genes A in Species 1 and A’ in Species 2.]
W.M. Fitch (1970) Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113.

Defining Paralogy:
[Diagram: {A} and {a} at the Last Common Ancestor; leaf genes A, a, A’ and a’ across Species 1 and Species 2.]
W.M. Fitch (1970) Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113.

Co-orthology:
[Diagram: {A} at the Last Common Ancestor; leaf genes A, A’, A’’ and A’’’ across Species 1 and Species 2, with the co-orthologues marked.]

Updating The Terminology:
InParalogues:
–“paralogs in a given lineage that all evolved by gene duplications that happened after the radiation (speciation) event that separated the given lineage from the other lineage under consideration”
OutParalogues:
–“paralogs in the given lineage that evolved by gene duplications that happened before the radiation (speciation) event”
* E.L.L. Sonnhammer & E.V. Koonin (2002) Orthology, paralogy and proposed classification for paralog subtypes. TiG 18:

Defining “Ortholog Families”:
Strict Definition:
–Families split at every duplication event.
–Many small families.
Normal Definition:
–Set root at appropriate level of interest.
–Accept inparalogues.
–More useful for function prediction.

Some Example Resources:
Name | # Fams | # Prots | Automated? | Description
HAMAP | … | …,000 | M | Manually curated prokaryotic families.
eggNOG | 43,582 | 1,241,751 | A | Update and extension to COGs, with fine-grained subsets.
TreeFam | 1,400 / 15,… | …,000 | M & A | Animal orthologue families and gene trees.
CluSTr | 12.6 mill | 6,000,000 | A | Single-linkage high similarity clusters.
InParanoid | ? | 600,000 | A | Specific for pair-wise comparisons.
OrthoFam | 300,00 | 4,600,000 | A | Large-scale affinity propagation clustering.

Making the OrthoFams:
Get similarity matrix from SIMAP.
Create 85% non-redundant sequence DB (CD-HIT).
Cluster sequences using Affinity Propagation Clustering (APC; Frey & Dueck, 2007).
Add back in highly similar sequences.
Sub-cluster families at 10 levels of sequence identity.
–“S-levels”
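The redundancy-reduction and add-back steps can be sketched as follows. This is a simplified stand-in, not the CD-HIT program or the Gene3D pipeline: it uses a greedy longest-first pass in the spirit of CD-HIT, with a toy ungapped identity measure in place of real alignment, and all function names are hypothetical. The affinity propagation step itself is not shown; its output is assumed to be a mapping from family id to representative sequences.

```python
def toy_identity(a, b):
    """Fraction of identical positions over the shorter sequence.

    Ungapped and position-by-position: a placeholder for real alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_nonredundant(seqs, threshold=0.85):
    """Map each representative to the sequences it absorbs (itself included).

    Sequences are visited longest first; each one joins the first
    representative it matches at >= threshold, else it becomes a new one.
    """
    members = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in members:
            if toy_identity(seqs[name], seqs[rep]) >= threshold:
                members[rep].append(name)
                break
        else:
            members[name] = [name]
    return members


def add_back(rep_families, members):
    """Expand families of representatives to include the absorbed sequences."""
    return {fam: sorted(m for rep in reps for m in members[rep])
            for fam, reps in rep_families.items()}


SEQS = {
    "protA": "MKTAYIAKQRQISFVKSHFS",
    "protA2": "MKTAYIAKQRQISFVKSHFA",  # one mismatch against protA
    "protC": "GSHMLEDPVRRAAELLKQAG",
}
members = greedy_nonredundant(SEQS)
# Assumed output of the (omitted) clustering of representatives:
rep_families = {"fam1": ["protA"], "fam2": ["protC"]}
print(add_back(rep_families, members))
```

Clustering only the representatives keeps the similarity matrix small, and the add-back step restores full family membership afterwards.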

Creating the OrthoFams:
[Figure: pipeline diagram with panels for UniProt & RefSeq sequences, the SIMAP protein similarity matrix (Prot A to Prot D), and the CD-HIT reduced set (Prot A, Prot C, …).]

A Simple Test of the OrthoFams:
99.9% of OrthoFams map to one HAMAP family in bacteria.
Each HAMAP family tends to map to several OrthoFams => Too conservative?
>80% map to a single KEGG Orthologue term.
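A comparison of this kind can be computed from two protein-to-family mappings. The sketch below shows one plausible way to do it; the function name, the input dictionaries and the toy identifiers are assumptions for illustration and do not reproduce the figures on the slide.

```python
from collections import defaultdict


def family_mapping_stats(orthofam_of, reference_of):
    """Compare OrthoFams against a reference family set such as HAMAP.

    Returns (fraction of OrthoFams whose annotated members fall in a single
    reference family, mean number of OrthoFams per reference family).
    Proteins without a reference annotation are ignored.
    """
    refs_per_ofam = defaultdict(set)
    ofams_per_ref = defaultdict(set)
    for protein, ofam in orthofam_of.items():
        ref = reference_of.get(protein)
        if ref is None:
            continue
        refs_per_ofam[ofam].add(ref)
        ofams_per_ref[ref].add(ofam)
    if not refs_per_ofam:
        return 0.0, 0.0
    single = sum(1 for refs in refs_per_ofam.values() if len(refs) == 1)
    fraction_single = single / len(refs_per_ofam)
    mean_split = (sum(len(o) for o in ofams_per_ref.values())
                  / len(ofams_per_ref))
    return fraction_single, mean_split


# Toy data with made-up identifiers.
orthofam_of = {"p1": "O1", "p2": "O1", "p3": "O2", "p4": "O3", "p5": "O3"}
hamap_of = {"p1": "H1", "p2": "H1", "p3": "H1", "p4": "H2", "p5": "H3"}
print(family_mapping_stats(orthofam_of, hamap_of))
```

A high first value with a second value well above 1 is the pattern the slide describes: OrthoFams rarely mix reference families, but one reference family is spread over several OrthoFams.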

Inheriting Protein-Protein Interactions:
Protein-protein interactions (including mechanism) can be conserved after gene duplication and speciation events.
Some interactions are ancient and well conserved, many are not.
Interactions between homologues are better conserved within a species than between species.
Interactions are not binary, but are based on affinity.
Not all detectable interactions are biologically relevant.
Refs: Mika & Rost 2006, Shoemaker & Panchenko 2007

Interaction Inheritance Approaches:
Homology-based approaches have struggled…
–Mika & Rost, 2007
Problems:
–High coverage or high quality input, not both.
–Interaction networks re-arrange rapidly.
–No simple, universal, accurate sequence identity threshold can be found.
Need to separate the interactions that can be inherited reliably from those that can’t.

The hiPPI Idea: homology inferred Protein-Protein Interactions
(1) Assume OrthoFams provide more reliable functional groupings than simple similarity measures.
(2) Assume high affinity ~= high conservation ~= low experimental false positive rate.
(3) Require more than one piece of supporting evidence.

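[Figure: hiPPI scoring example. Ofam A and Ofam B each contain Hs, Ce and Mm proteins sub-clustered from S30 to S100; a score table has columns iLevel, cLevel, icLevel, Species Mod, Exp Mod and Score; possible interactions are listed as Poss A (13.3, Yes) and Poss B (7.3, No).]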

Interactions derived from MIPS, IntAct and MINT. GO Term semantic similarity calculated with the Lord method (Lord et al, 2003).

Links and References “Gene3D: comprehensive structural and functional annotation of genomes” Corin Yeats, Jonathan Lees, Adam Reid, Paul Kellam, Nigel Martin, Xinhui Liu, and Christine Orengo NAR (2008) 36:D414–D418.

The Algorithm:
For a query protein, at each Ofam S-level (starting at 100%):
Identify homologues with interactions.
In the interacting Ofams are there any proteins from the same species as the query protein?
If so, score the potential interactions.
Each piece of supporting evidence is included in the score.
Sum the scores for each potential interaction.
–Since each interaction may be predicted through multiple homologues.
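The loop described above can be sketched roughly as follows. This is not the hiPPI implementation: the data structures, the function name, the min_evidence parameter and especially the per-evidence weight (S-level divided by 100) are hypothetical placeholders, since the slides do not give the actual scoring scheme.

```python
from collections import defaultdict


def predict_interactions(query, species, ofam, s_clusters, interactions,
                         min_evidence=2):
    """Score candidate partners of `query` inferred through its homologues.

    species:      protein -> species code
    ofam:         protein -> OrthoFam id
    s_clusters:   S-level (percent identity) -> {protein -> sub-cluster id}
    interactions: iterable of (protein, protein) experimental pairs
    """
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    members = defaultdict(list)
    for protein, family in ofam.items():
        members[family].append(protein)

    scores = defaultdict(float)
    evidence = defaultdict(set)
    for level in sorted(s_clusters, reverse=True):  # strictest level first
        cluster_of = s_clusters[level]
        if query not in cluster_of:
            continue
        for homologue, cluster in cluster_of.items():
            if homologue == query or cluster != cluster_of[query]:
                continue
            for partner in partners[homologue]:
                for candidate in members[ofam.get(partner)]:
                    if species[candidate] != species[query]:
                        continue
                    known = frozenset((homologue, partner))
                    if known in evidence[candidate]:
                        continue  # already counted at a stricter S-level
                    evidence[candidate].add(known)
                    scores[candidate] += level / 100.0  # hypothetical weight
    # Keep only predictions backed by enough independent evidence.
    return {c: s for c, s in scores.items()
            if len(evidence[c]) >= min_evidence}


# Toy data: Ofam A holds the query and its homologues, Ofam B the partners.
species = {"A_hs": "hs", "A_mm": "mm", "A_ce": "ce",
           "B_hs": "hs", "B_mm": "mm", "B_ce": "ce"}
ofam = {p: "OfamA" if p.startswith("A") else "OfamB" for p in species}
s_clusters = {
    100: {"A_hs": 1, "A_mm": 2, "A_ce": 3},
    60: {"A_hs": 1, "A_mm": 1, "A_ce": 2},
    30: {"A_hs": 1, "A_mm": 1, "A_ce": 1},
}
interactions = [("A_mm", "B_mm"), ("A_ce", "B_ce")]
print(predict_interactions("A_hs", species, ofam, s_clusters, interactions))
```

Here B_hs is predicted as a partner of A_hs because two homologues, reached at different S-levels, each have a known interaction with a member of Ofam B.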