Proteomics and Protein Bioinformatics: Functional Analysis of Protein Sequences Anastasia Nikolskaya Assistant Professor (Research) Protein Information.

Slides:



Advertisements
Similar presentations
Phylogenetic Trees Understand the history and diversity of life. Systematics. –Study of biological diversity in evolutionary context. –Phylogeny is evolutionary.
Advertisements

Pfam(Protein families )
Basics of Comparative Genomics Dr G. P. S. Raghava.
EBI is an Outstation of the European Molecular Biology Laboratory. Alex Mitchell InterPro team Using InterPro for functional analysis.
Research Methodology of Biotechnology: Protein-Protein Interactions Yao-Te Huang Aug 16, 2011.
©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
Systems Biology Existing and future genome sequencing projects and the follow-on structural and functional analysis of complete genomes will produce an.
FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES:
Structural bioinformatics
InterPro/prosite UCSC Genome Browser Exercise 3. Turning information into knowledge  The outcome of a sequencing project is masses of raw data  The.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Readings for this week Gogarten et al Horizontal gene transfer….. Francke et al. Reconstructing metabolic networks….. Sign up for meeting next week for.
Protein Modules An Introduction to Bioinformatics.
Protein Structure and Function Prediction. Predicting 3D Structure –Comparative modeling (homology) –Fold recognition (threading) Outstanding difficult.
Biological Data Integration July 22, 2003 GTL Data and Tools Workshop Gaithersburg, MD Cathy H. Wu, Ph.D. Professor of Biochemistry & Molecular Biology.
Proteomics and Protein Bioinformatics: Functional Analysis of Protein Sequences Anastasia Nikolskaya Assistant Professor (Research) Protein Information.
Subsystem Approach to Genome Annotation National Microbial Pathogen Data Resource Claudia Reich NCSA, University of Illinois, Urbana.
Protein Sequence Analysis - Overview Raja Mazumder Senior Protein Scientist, PIR Assistant Professor, Department of Biochemistry and Molecular Biology.
Predicting Function (& location & post-tln modifications) from Protein Sequences June 15, 2015.
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD
Genome projects and model organisms Level 3 Molecular Evolution and Bioinformatics Jim Provan.
A number of slides taken/modified from:
Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004.
Protein Bioinformatics Course
Functional Linkages between Proteins. Introduction Piles of Information Flakes of Knowledge AGCATCCGACTAGCATCAGCTAGCAGCAGA CTCACGATGTGACTGCATGCGTCATTATCTA.
Anastasia Nikolskaya PIR (Protein Information Resource) Georgetown University Medical Center
Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.
Identification of Protein Domains. Orthologs and Paralogs Describing evolutionary relationships among genes (proteins): Two major ways of creating homologous.
Introduction to Bioinformatics Spring 2002 Adapted from Irit Orr Course at WIS.
1 Orthology and paralogy A practical approach Searching the primaries Searching the secondaries Significance of database matches DB Web addresses Software.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Multiple Alignments Motifs/Profiles What is multiple alignment? HOW does one do this? WHY does one do this? What do we mean by a motif or profile? BIO520.
Function first: a powerful approach to post-genomic drug discovery Stephen F. Betz, Susan M. Baxter and Jacquelyn S. Fetrow GeneFormatics Presented by.
Anastasia Nikolskaya Lai-Su Yeh Protein Information Resource Georgetown University Medical Center Washington, DC PIR: a comprehensive resource for functional.
Ch. 21 Genomes and their Evolution. New approaches have accelerated the pace of genome sequencing The human genome project began in 1990, using a three-stage.
BLOCKS Multiply aligned ungapped segments corresponding to most highly conserved regions of proteins- represented in profile.
Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
Biological Signal Detection for Protein Function Prediction Investigators: Yang Dai Prime Grant Support: NSF Problem Statement and Motivation Technical.
Protein and RNA Families
1 The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation Darren A. Natale, Ph.D. Project Manager and Senior Scientist,
Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES: ANNOTATION AND.
Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.
Sequencing the World of Possibilities for Energy & Environment MGM workshop. 19 Oct 2010 Information Sources for Genomics Konstantinos Mavrommatis Genome.
Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center PIRSF PROTEIN CLASSIFICATION SYSTEM AND SEQUENCE ANNOTATION.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Bioinformatics and Computational Biology
PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.
Genome Biology and Biotechnology The next frontier: Systems biology Prof. M. Zabeau Department of Plant Systems Biology Flanders Interuniversity Institute.
1 From Mendel to Genomics Historically –Identify or create mutations, follow inheritance –Determine linkage, create maps Now: Genomics –Not just a gene,
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
March 28, 2002 NIH Proteomics Workshop Bethesda, MD Lai-Su Yeh, Ph.D. Protein Scientist, National Biomedical Research Foundation Demo: Protein Information.
Big Data Bioinformatics By: Khalifeh Al-Jadda. Is there any thing useful?!
Protein domain/family db Secondary databases are the fruit of analyses of the sequences found in the primary sequence db Either manually curated (i.e.
InterPro Sandra Orchard.
Protein databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen and from CSC bio-opas
 What is MSA (Multiple Sequence Alignment)? What is it good for? How do I use it?  Software and algorithms The programs How they work? Which to use?
Protein families, domains and motifs in functional prediction May 31, 2016.
Bio/Chem-informatics
FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES:
Demo: Protein Information Resource
Basics of Comparative Genomics
Sequence based searches:
Genome Annotation Continued
Predicting Active Site Residue Annotations in the Pfam Database
Basics of Comparative Genomics
Presentation transcript:

Proteomics and Protein Bioinformatics: Functional Analysis of Protein Sequences Anastasia Nikolskaya Assistant Professor (Research) Protein Information Resource Department of Biochemistry and Molecular Biology Georgetown University Medical Center

Overview Role of bioinformatics/computational biology in proteomics research Functional annotation of proteins = assigning correct name, describing function or predicting function for a sequence Classification of proteins = grouping them into families of related sequences Annotating a family helps the annotation of its members Sequence function

Bioinformatics as related to proteins 1. Sequence analysis Genome projects -> Gene prediction Protein sequence analysis Comparative genomics Protein sequence and family databases (annotation and classification) 2. Structural genomics 3. Data analysis and integration for: Large scale gene expression analysis Protein-protein interaction Intracellular protein localization 4. Integration of all data on proteins to reconstruct pathways and cellular systems, make predictions and discover new knowledge

Functional Genomics and Proteomics Proteomics studies biological systems based on global knowledge of protein sets (proteomes). Functional genomics studies biological functions of proteins, complexes, pathways based on the analysis of genome sequences. Includes functional assignments for protein sequences. GenomeTranscriptome ProteomeMetabolome

Proteomics Bioinformatics Analysis and integration of these data Data: Gene expression profiling Genome-wide analyses of gene expression (DNA Microarrays/Chips ) Data: Protein-protein interaction (Yeast Two-Hybrid Systems) Data: Structural genomics Determine 3D structures of all protein families Data: Genome projects (Sequencing)

Bioinformatics and Genomics/Proteomics Sequence, Other Data Pathways and Regulatory Circuits Hypothetical Cell Unknown Genes Putative Functional Groups

What’s in it for me? When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about this protein and its possible function from available data Especially important for poorly characterized or uncharacterized (“hypothetical”) proteins More challenging for large sets of sequences generated by large-scale proteomics experiments The quality of this assessment is often critical for interpreting experimental results and making hypothesis for future experiments Sequence function

Bioinformatics as related to proteins 1. Sequence analysis Genome projects -> Gene prediction Protein sequence analysis Comparative genomics Protein sequence and family databases (annotation and classification) 2. Structural genomics 3. Data analysis and integration for: Large scale gene expression analysis Protein-protein interaction Intracellular protein localization 4. Integration of all data on proteins to reconstruct pathways and cellular systems, make predictions and discover new knowledge

Genomic DNA Sequence 5' UTRPromoter Exon1 IntronExon2 Intron Exon33' UTR A G G T A G Gene Recognition Exon2Exon1Exon3 C A C A C A A T T A T A Protein Sequence A T G A A T A A A Structure Determination Protein Structure Function Analysis Gene Network Metabolic Pathway Protein Family Molecular Evolution Family Classification G T Gene DNA Sequence Gene Protein Sequence Function Work with protein sequence, not DNA sequence

Most new proteins come from genome sequencing projects Mycoplasma genitalium proteins Escherichia coli - 4,288 proteins S. cerevisiae (yeast) - 5,932 proteins C. elegans (worm) ~ 19,000 proteins Homo sapiens ~ 30,000 proteins... and have unknown functions

Advantages of knowing the complete genome sequence All encoded proteins can be predicted and identified The missing functions can be identified and analyzed Peculiarities and novelties in each organism can be studied Predictions can be made and verified

The changing face of protein science 20 th century Few well-studied proteins Mostly globular with enzymatic activity Biased protein set 21 st century Many “hypothetical” proteins Various, often with no enzymatic activity Natural protein set Credit: Dr. M. Galperin, NCBI

Properties of the natural protein set Unexpected diversity of even common enzymes (analogous, paralogous enzymes) Conservation of the reaction chemistry, but not the substrate specificity Functional diversity in closely related proteins Abundance of new structures Credit: Dr. M. Galperin, NCBI

E. coli M. jannaschii S. cerevisiae H. sapiens Characterized experimentally Characterized by similarity Unknown, conserved Unknown, no similarity from Koonin and Galperin, 2003, with modifications

Protein Sequence Function From new genomes Automatic assignment based on sequence similarity: gene name, protein name, function To avoid mistakes, need human intervention (manual annotation) Functional annotation of proteins (protein sequence databases) Best annotated protein database: UniProt (unites Swiss-Prot, TrEMBL, and PIR)

Experimentally characterized –Up-to-date information, manually annotated (curated database!) Problems: avoid misinterpreting experimental results (e.g. suppressors, cofactors) “Knowns” = Characterized by similarity (closely related to experimentally characterized) – Make sure the assignment is plausible Function can be predicted – Extract maximum possible information – Avoid errors and overpredictions – Fill the gaps in metabolic pathways “Unknowns” (conserved or unique) – Rank by importance Objectives of functional analysis for different groups of proteins

Problems in functional assignments for “knowns” Previous low quality annotations lead to propagation of mistakes Biologically senseless annotations Arabidopsis: separation anxiety protein-like Helicobacter: brute force protein Methanococcus: centromere-binding protein Plasmodium: frameshift Propagated mistakes of sequence comparison

Problems in functional assignments for “knowns” Multi-domain organization of proteins ACT domain Chorismate mutase domainACT domain New sequence Chorismate mutase BLAST In BLAST output, top hits are to chorismate mutases -> The name “chorismate mutase” is automatically assigned to new sequence. ERROR ! Can be propagated, protein gets erroneous EC number, assigned to erroneous pathway, etc

Enzyme evolution: -Divergence in sequence and function -Non-orthologous gene displacement: Convergent evolution Low sequence complexity (coiled-coil, non-globular regions) Problems in functional assignments for “knowns”

Experimentally characterized “Knowns” = Characterized by similarity (closely related to experimentally characterized) – Make sure the assignment is plausible Function can be predicted (non-obvious predictions) – Extract maximum possible information – Avoid errors and overpredictions – Fill the gaps in metabolic pathways “Unknowns” (conserved or unique) – Rank by importance Objectives of functional analysis for different groups of proteins

Dealing with “hypothetical” proteins Computational analysis –Sequence and genome context analysis Structural analysis –Determination of the 3D structure Mutational analysis Functional analysis –Expression profiling –Tracking of cellular localization Possible Functional Prediction Experimental

Cluster analysis of protein families (family databases) Use of sophisticated database searches (PSI-BLAST, HMM) Detailed manual analysis of sequence similarities Functional prediction: sequence analysis in non-obvious cases

Proteins (domains) with different 3D folds are not homologous (unrelated by origin) Those amino acids that are conserved in divergent proteins within a (super)family are likely to be functionally important (catalytic or binding sites, ect). Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition Using sequence analysis: Hints

Prediction of 3D fold (if distant homologs have known structures!) and of general biochemical function is much easier than prediction of exact biological function Sequence analysis complements structural comparisons and can greatly benefit from them Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise Using sequence analysis: Hints Credit: Dr. M. Galperin, NCBI

Sequence analysis: often, only general function can be predicted Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases) Helix-turn-helix motif proteins (predicted transcriptional regulators) Membrane transporters

Phylogenetic distribution (comparative genomics) – Wide - most likely essential – Narrow - probably clade-specific – Patchy - most intriguing –(e.g., niche-specific, pathway type Domain association – “Rosetta Stone” (for multidomain proteins) Gene neighborhood (operon organization) Functional prediction: computational analysis beyond sequence Clues: specific to niche, pathway type

Using genome context for functional prediction Leucine biosynthesis

Functional Prediction: Role of Structural Genomics Protein Structure Initiative: Determine 3D Structures of All Proteins –Family Classification: Organize Protein Sequences into Families, collect families without known structures –Target Selection: Select Family Representatives as Targets –Structure Determination: X-Ray Crystallography or NMR Spectroscopy –Homology Modeling: Build Models for Other Proteins by Homology –Functional prediction based on structure

Structural Genomics: Structure-Based Functional Assignments Methanococcus jannaschii MJ0577 (Hypothetical Protein) Contains bound ATP => ATPase or ATP-Mediated Molecular Switch Confirmed by biochemical experiments

Crystal structure is not a function! Credit: Dr. M. Galperin, NCBI

Functional prediction: problem areas Identification of protein-coding regions Delineation of potential function(s) for distant paralogs Identification of domains in the absence of close homologs Analysis of proteins with low sequence complexity

Experimentally characterized –Up-to-date information, manually annotated “Knowns” = Characterized by similarity (closely related to experimentally characterized) – Make sure the assignment is plausible Function can be predicted – Extract maximum possible information – Avoid errors and overpredictions – Fill the gaps in metabolic pathways “Unknowns” (conserved or unique) – Rank by importance (e.g., using phylogenetic deistribution) Objectives of functional analysis for different groups of proteins

Functional annotation and protein classification Protein families are real and reflect evolutionary relationships Function often follows along the family lines Therefore, matching a new protein sequence to well-annotated and curated family provides information about this new protein and helps predicting its function. This is more accurate than comparing the new sequence to individual proteins in a database (search classification database vs search protein database) For large-scale accurate anotation, need “natural” protein classification Functional annotation needs to be both accurate and large-scale (efficient) Can protein classification help?

Protein Evolution Tree of Life & Evolution of Protein Families (Dayhoff, 1978) Can build a tree representing evolution of a protein family, based on sequences Orthologous Gene Family: Organismal and Sequence Trees Match Well

Protein Evolution Homolog –Common Ancestors –Common 3D Structure –Usually at least some sequence similarity (sequence motifs or more close similarity) Ortholog –Derived from Speciation Paralog –Derived from Duplication A ancestor Ax Az paralogs Ax 1 Az 1 Ax 2 Az 2 Species 1 Species 2

Levels of protein classification LevelExampleSimilarityEvolution FoldTIM-BarrelTopology of folded backbone Possible monophyly Domain Superfamily AldolaseRecognizable sequence similarity (motifs); basic biochemistry Monophyletic origin Class I AldolaseHigh sequence similarity (alignments); biochemical properties Evolution by ancient duplications Orthologous group 2-keto-3-deoxy-6- phosphogluconate aldolase Orthology for a given set of species; biochemical activity; biological function Origin traceable to a single gene in LCA Lineage- specific expansion (LSE) PA3131 and PA3181 Paralogy within a lineageEvolution by recent duplication and loss

Protein Family vs Domain Domain: Evolutionary/Functional/Structural Unit A protein can consist of a single domain or multiple domains. Proteins have modular structure. Recent domain shuffling: CM (AroQ type) SF SF SF SF SF PDH CM (AroQ type) PDT ACT PDH ACT PDT ACT SF001424

Protein Evolution: Sequence Change vs. Domain Shuffling If enough similarity remains, one can trace the path to the common origin What about these?

Practical classification of proteins: setting realistic goals We strive to reconstruct the natural classification of proteins to the fullest possible extent BUT Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity) THUS The further we extend the classification, the finer is the domain structure we need to consider SO We need to compromise between the depth of analysis and protein integrity OR … Credit: Dr. Y. Wolf, NCBI

Complementary approaches Classify domains Allows to build a hierarchy and trace evolution all the way to the deepest possible level, the last point of traceable homology and common origin Can usually annotate only generic biochemical function Classify whole proteins Does not allow to build a hierarchy deep along the evolutionary tree because of domain shuffling Can usually annotate specific biological function (value for the user and for the automatic individual protein annotation)  Can map domains onto proteins  Can classify proteins when some of the domains are not defined

Levels of protein classification LevelExampleSimilarityEvolution FoldTIM-BarrelTopology of folded backbone Possible monophyly Domain Superfamily AldolaseRecognizable sequence similarity (motifs); basic biochemistry Monophyletic origin Class I AldolaseHigh sequence similarity (alignments); biochemical properties Evolution by ancient duplications Orthologous group 2-keto-3-deoxy-6- phosphogluconate aldolase Orthology for a given set of species; biochemical activity; biological function Origin traceable to a single gene in LCA Lineage- specific expansion (LSE) PA3131 and PA3181 Paralogy within a lineageEvolution by recent duplication and loss

Whole protein classification PIRSF Domain classification Pfam SMART CDD Mixed TIGRFAMS COGs Based on structural fold SCOP Protein classification databases

CM PDT ACT Protein family – domain – site (motif) InterPro is an integrated resource for protein families, domains and sites. Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF, PANTHER SF Bifunctional chorismate mutase/prephenate dehydratase InterPro Entry Type defines the entry as a Family, Domain, Repeat, or Site Family = protein family. “Contains” field lists domains within this protein “Found in” field: for domain entries, lists families which contain this domain

Whole protein functional annotation is best done using annotated whole protein families [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme PIRSF Acylphosphatase – Znf x2 – YrdC - On the basis of domain composition alone, can not predict biological function related to Peptidase M22

PIRSF protein classification system  Basic concept: A network classification system based on evolutionary relationship of whole proteins A network classification system based on evolutionary relationship of whole proteins From Superfamilies to Subfamilies, flexible number of levels  Basic unit = PIRSF Family Homeomorphic (end-to-end similarity with common domain architecture) Homeomorphic (end-to-end similarity with common domain architecture) Homologous (Common Ancestry): Inferred by sequence similarity  Domains and motifs are mapped onto PIRSF  Advantages Annotation of both generic biochemical and specific biological functions Accurate propagation of annotation and development of standardized protein nomenclature and ontology

Levels of protein classification LevelExampleSimilarityEvolution FoldTIM-BarrelTopology of folded backbone Possible monophyly Domain Superfamily AldolaseRecognizable sequence similarity (motifs); basic biochemistry Monophyletic origin Class I AldolaseHigh sequence similarity (alignments); biochemical properties Evolution by ancient duplications Orthologous group 2-keto-3-deoxy-6- phosphogluconate aldolase Orthology for a given set of species; biochemical activity; biological function Origin traceable to a single gene in LCA Lineage- specific expansion (LSE) PA3131 and PA3181 Paralogy within a lineageEvolution by recent duplication and loss

PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.

PIRSF DAG Viewer (coming soon) PIRSF family hierarchy based on evolutionary relationships Standardized PIRSF family names Network structure for PIRSF family classification system

PIRSF curation and annotation Curated family name Description of family Sequence analysis tools Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF Phylogenetic tree and alignment view allows further sequence analysis Uncurated ( Computationally generated clusters) Preliminary curation (preliminary membership, domains) Full curation: membership, family name, bibliography, description (opt.) …Scroll for other database cross-refs, etc

Protein classification systems can be used to: Provide accurate large-scale automatic annotation for new sequences Detect and correct genome annotation errors systematically Drive other annotations (active site etc) Improve sensitivity of protein identification, simplify detection of non-obvious relationships Provide basis for evolutionary and comparative research Discovery of new knowledge by using information embedded within families of homologous sequences and their structures

PIRSF-Based Protein Annotation in UniProt Rule-Based annotation system using curated PIRSFs Name Rules (PIRNR): transfer name from PIRSF to individual proteins –Define a subgroup if necessary –Protein Name (may differ from family name), synonyms, acronyms –EC, Misnomers, GO Terms, Function, etc Site Rules (PIRSR): Position-Specific Site Features (active sites, binding sites, modified sites, other functional sites)

Systematic correction of annotation errors: Chorismate mutase Chorismate Mutase (CM), AroQ class –PIRSF – CM (Prokaryotic type) [PF01817] –PIRSF – TyrA bifunctional enzyme (Prok) [PF01817-PF02153] –PIRSF – PheA bifunctional enzyme (Prok) [PF01817-PF00800] –PIRSF – CM (Eukaryotic type) [Regulatory Dom-PF01817] Chorismate Mutase, AroH class –PIRSF – CM [PF01817] AroH AroQ Euk AroQ Prok

Systematic correction of annotation errors: IMPDH IMPDH Misnomer in Methanococcus jannaschii IMPDH Misnomers in Archaeoglobus fulgidus

Impact of protein bioinformatics and genomics Single protein level –Discovery of new enzymes and superfamilies –Prediction of active sites and 3D structures Pathway level –Identification of “missing” enzymes –Prediction of alternative enzyme forms –Identification of potential drug targets Cellular metabolism level –Multisubunit protein systems –Membrane energy transducers –Cellular signaling systems