MCSG Site Visit, Argonne, January 30, 2003. Genome Analysis to Select Targets which Probe Fold and Function Space

Presentation transcript:

MCSG Site Visit, Argonne, January 30, 2003. Genome Analysis to Select Targets which Probe Fold and Function Space. How many protein superfamilies and families can we identify in the proteomes? How many structures are needed to cover a high fraction of prokaryotic and eukaryotic families? Targeting Universal Recurrent Superfamilies (SCOP/CATH/Pfam) to optimise coverage of fold and function space. Russell Marsden, Alastair Grant, David Lee, Annabel Todd, Janet Thornton, Andrzej Joachim. Midwest Consortium

Protein Families in Complete Genomes with Structural/Functional Annotations. Gene3D: 800,000 protein sequences from 120 completed genomes: 14 eukaryotic genomes (including human, mouse, rat, plant, fly, worm, fugu), 92 bacterial genomes and 14 archaeal genomes. Buchan, Thornton, Orengo, Genome Research (2002)

Clustering Sequences into Protein Superfamilies of Known Domain Composition. PFscape (Protein Family Landscape): BLAST all the sequences from 120 completed genomes against each other and cluster them into protein families; for each sequence, identify CATH and Pfam domains. TRIBE-MCL: Markov clustering (Enright & Ouzounis, Genome Research, 2002). SAM-T99: sequence mapping of CATH & Pfam (Karplus et al., NAR, 2000).

Clustering ~800,000 genes from 120 complete genomes with PFscape. [Figure: sequences grouped into Gene Superfamilies 1 to 4.] Result: ~50,000 gene superfamilies of 2 or more sequences, plus 150,000 singletons.
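
The grouping of all-against-all hits into superfamilies and singletons can be illustrated with a small sketch. This is not TRIBE-MCL: it swaps in plain single-linkage clustering over a union-find structure, purely to show the shape of the computation. The function name `cluster_hits` and the input format (pairs of sequence ids that passed some significance cutoff) are assumptions made for the example.

```python
from collections import defaultdict


def cluster_hits(sequence_ids, hits):
    """Group sequences into families by single-linkage over pairwise hits.

    sequence_ids: iterable of sequence identifiers.
    hits: iterable of (query_id, subject_id) pairs that already passed a
          significance cutoff (hypothetical input; every id must also
          appear in sequence_ids).
    Returns (families, singletons): a list of sets with two or more
    members, and a list of ids that matched nothing.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Any significant hit joins the two sequences' groups.
    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for s in parent:
        groups[find(s)].add(s)

    families = [g for g in groups.values() if len(g) >= 2]
    singletons = [next(iter(g)) for g in groups.values() if len(g) == 1]
    return families, singletons
```

Single linkage tends to chain unrelated multi-domain proteins together, which is one reason a Markov clustering step is used in practice; the sketch only shows how families of 2 or more sequences and singletons fall out of a hit list.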

Mapping CATH and Pfam Domains onto Genome Sequences. A library of HMMs is built for representative sequences from each CATH and Pfam domain superfamily. Protein sequences from the genomes are scanned against the CATH & Pfam SAM-T99 HMM library, and domains are assigned to CATH and Pfam superfamilies.

Performance of Sequence Mapping Method: 1D-HMM (SAM-T99). [Plot: percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99, against error rate.] The library of 1D-HMM models detects ~80% of remote homologues.

Use HMMs to annotate Gene Superfamilies with CATH and Pfam domains. [Figure: the 50,000 Gene Superfamilies (Gene Superfamilies 1 to 4 shown) annotated as CATH, Pfam or NewFam.]

Merge superfamilies with the same domain combinations. [Figure: Gene Superfamilies 1, 2 and 3.] Gene3D: 50,000 -> 36,000 Superfamilies.
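
A minimal sketch of the merge step, assuming each gene superfamily has already been annotated with an ordered tuple of domain identifiers. The function name, the input layout, and the choice to treat domain order as significant are all assumptions for illustration; clusters with no domain annotation are kept apart rather than merged with one another.

```python
from collections import defaultdict


def merge_by_domain_combination(clusters):
    """Merge clusters that share the same domain combination.

    clusters: iterable of (members, domains) pairs, where members is a
              set of sequence ids and domains is an ordered tuple of
              domain superfamily ids (hypothetical layout).
    Returns (merged, unannotated): a dict mapping each domain tuple to
    the union of its members, and a list of clusters without domains.
    """
    merged = defaultdict(set)
    unannotated = []
    for members, domains in clusters:
        if domains:
            merged[tuple(domains)] |= set(members)
        else:
            # No domain assignment: nothing to merge on, keep as-is.
            unannotated.append(set(members))
    return dict(merged), unannotated
```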

Superfamilies Further Classified into Families. [Figure: a superfamily divided into families (35% ID).] Multi-linkage clustering: relatives in each sequence family have 35% or more sequence identity. For good homology models, one structure is needed for each family within a superfamily.
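
The family step can be sketched as follows, under the assumption that "multi-linkage" means a sequence joins a family only if it reaches the identity threshold against every existing member. The greedy, order-dependent procedure and the `identity` callback are illustrative stand-ins, not the clustering code used for Gene3D.

```python
def cluster_families(sequence_ids, identity, threshold=35.0):
    """Greedy clustering in which every pair within a family meets the threshold.

    sequence_ids: ordered iterable of sequence identifiers.
    identity: function (a, b) -> percent sequence identity (hypothetical;
              in practice this would come from pairwise alignments).
    threshold: minimum percent identity required against all members.
    Returns a list of families, each a list of sequence ids.
    """
    families = []
    for seq in sequence_ids:
        for family in families:
            # Join the first family where seq matches every member.
            if all(identity(seq, member) >= threshold for member in family):
                family.append(seq)
                break
        else:
            # No existing family accepted the sequence: start a new one.
            families.append([seq])
    return families
```

Because every member is within the threshold of every other member, a single solved structure per family is a plausible template for the rest, which is the point the slide makes about homology modelling.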

Number of domain superfamilies and families with no close structural homologue. Families: CATH (60,360) + Pfam (53,907) + NewFam (56,973) = 171,240 Families. Superfamilies: CATH (1400) + Pfam (4100) + NewFam (46,384) = 51,844 Superfamilies. [Chart: percentage of sequence families with and without close structural homologues (>35% identity) for NewFam, CATH and Pfam; legend: no close PDB homologue.]

Preferentially Target Largest Superfamilies. [Plots for CATH, Pfam and NewFam: number of superfamilies containing a given number of non-identical relatives, as a percentage of the total, against number of non-identical relatives.] Fitted power laws (with gradients): CATH (-0.4), Pfam (-1.0), NewFam (-1.9).
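
A power-law gradient of the kind quoted above is the slope of a straight line fitted on log-log axes. The sketch below shows one common way to obtain it, by ordinary least squares on log10 values; how the original fits were actually made is not stated, so the method and the function name are assumptions.

```python
import math


def fit_power_law_gradient(sizes, counts):
    """Least-squares fit of log10(count) against log10(size).

    sizes: numbers of non-identical relatives (all > 0).
    counts: superfamily counts or percentages for each size (all > 0).
    Returns (gradient, intercept) of the fitted line in log-log space.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept
```

A steeper (more negative) gradient means the distribution is dominated by small superfamilies, which is why NewFam at -1.9 contributes far fewer large targets than CATH at -0.4.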

Proteome Coverage by Superfamilies. [Plot: percentage of proteomes (number of non-identical proteins in 120 completed genomes) against superfamilies ordered by size.] ~70% of proteomes are contained in the < 2500 largest CATH + Pfam + NewFam target superfamilies.
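
The coverage curve described here is a cumulative sum over superfamilies sorted by size. The following sketch, with hypothetical function names, counts each protein once in a single superfamily, which is a simplification for multi-domain proteins.

```python
def coverage_curve(superfamily_sizes):
    """Cumulative fraction of proteins covered, largest superfamily first.

    superfamily_sizes: iterable of member counts, one per superfamily.
    Returns a list whose i-th entry is the fraction of all proteins that
    fall in the i + 1 largest superfamilies.
    """
    ordered = sorted(superfamily_sizes, reverse=True)
    total = sum(ordered)
    covered = 0
    curve = []
    for size in ordered:
        covered += size
        curve.append(covered / total)
    return curve


def targets_needed(superfamily_sizes, fraction):
    """Smallest number of largest superfamilies that reach the given coverage."""
    for n, cov in enumerate(coverage_curve(superfamily_sizes), start=1):
        if cov >= fraction:
            return n
    return None  # Only reached for an empty input.
```

With toy sizes of 50, 30, 10, 5 and 5 proteins, `targets_needed(sizes, 0.7)` returns 2, mirroring on a small scale how a few thousand large superfamilies can account for most of a proteome.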

Proteome Coverage by Superfamilies. [Plot: percentage of proteomes (120 completed genomes) against superfamilies ordered by size, with separate curves for CATH (superfamilies of known fold), Pfam and NewFam.]

What Fraction of the Proteomes is Covered by Bacterial Family Targets? [Plot: percentage of proteomes (120 completed genomes) against number of target families, with curves for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes.] ~100,000 prokaryotic targets cover nearly 60% of proteomes.

How many family targets cover a significant proportion of the eukaryotes and/or prokaryotes? [Plot: percentage of kingdom proteomes (120 completed genomes) against number of target families, with curves for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes; marked values: 25,000, 45,000 and 30,000.] 25, ,000 family targets cover 70% of proteomes (< 2500 largest superfamily targets).

Target Selection Strategy. The largest < 2500 superfamily targets give 70% of proteomes; this corresponds to 25, ,000 family targets. Accurate homology models are not needed for all families. Target families of biological interest or containing human homologues with disease association. Target families from functionally diverse superfamilies to understand how changes in the structure can modify function. For example, Universal, Highly Recurrent Superfamilies are an interesting biological subset with diverse functions.

Universal CATH Domain Superfamilies. [Plot: proportion of CATH domain annotations for 30 representative eukaryotic and prokaryotic organisms.] ~60-70% of CATH domain annotations within each organism are from < 200 universal CATH superfamilies common to all kingdoms of life, some of which are very extensively duplicated.

Domain Recurrences in the Genomes. [Plot: number of superfamilies against occurrences, highlighting the Highly Recurrent, Extensively Duplicated Superfamilies.]

56 Universal and Highly Recurrent Superfamilies. Analysis in bacterial genomes showed that 56 Universal Superfamilies recurred in proportion to the genome size and accounted for 45% of the CATH domain annotations. [Chart: COG functional annotation (25 functional categories, labelled by single letters) grouped as Metabolism, Information storage and processing, Cellular processes and signalling, and Poorly characterised. Labelled categories: E (Amino acid metabolism), J (Translation and protein biosynthesis), K (Transcription), T (Signal transduction).] 15,000 bacterial family targets.

In Functionally Diverse Superfamilies Select More Targets. Select the relative with most neighbours for which a homology model can be built or function assigned. For >95% confidence when inheriting functional properties, homologues should have at least 60% identity (Todd, Valencia, Rost).
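
One way to read "relative with most neighbours" is as a greedy cover: pick the member with the most relatives at or above an identity threshold, mark those relatives as covered, and repeat. The sketch below follows that reading. The function names, the `identity` callback and the use of 60% as the default threshold (taken from the functional-inheritance figure above) are assumptions for illustration, not the selection procedure actually used.

```python
def pick_target(members, identity, threshold=60.0):
    """Return the member with the most neighbours at or above threshold.

    members: list of sequence ids.
    identity: function (a, b) -> percent sequence identity (hypothetical).
    Returns (target, neighbours); ties go to the earliest member.
    """
    best, best_neighbours = None, []
    for candidate in members:
        neighbours = [
            m for m in members
            if m != candidate and identity(candidate, m) >= threshold
        ]
        if best is None or len(neighbours) > len(best_neighbours):
            best, best_neighbours = candidate, neighbours
    return best, best_neighbours


def pick_targets(members, identity, threshold=60.0):
    """Greedily pick targets until every member is a target or a neighbour of one."""
    remaining = list(members)
    targets = []
    while remaining:
        target, neighbours = pick_target(remaining, identity, threshold)
        targets.append(target)
        covered = set(neighbours) | {target}
        remaining = [m for m in remaining if m not in covered]
    return targets
```

In a functionally diverse superfamily few members reach the threshold against each other, so the loop runs more times and returns more targets, which matches the slide's recommendation.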

Representative Structures for Superfamilies will help identify Functional Families. [Diagram: a superfamily divided into functional clusters S60_1 to S60_5.] Functional clusters are identified by sequence conservation; annotations (GO, KEGG, Pfam, EC, COGs, SWISS-PROT) are stored in Gene3D.

Target Selection Strategy. Targeting the 2500 largest superfamilies will cover a significant proportion (70%) of the proteomes. For good homology models, between 25, ,000 family targets are needed. Preferentially select targets from medically important and/or structurally and functionally diverse superfamilies. For example, targeting Universal and Recurrent superfamilies which exhibit significant structural and functional divergence will help to improve function prediction methods.