FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES:

1 FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES:
ANNOTATION AND FAMILY CLASSIFICATION
Anastasia Nikolskaya, PIR (Protein Information Resource), Georgetown University Medical Center

2 Overview
Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on the best BLAST hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:
- Homology-based (sequence analysis, structure analysis)
- Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families
PIRSF Protein Classification System:
- Full-length protein family classification based on evolution
- Highly annotated, optimized for annotation propagation
- Functional predictions for uncharacterized proteins
- Used to facilitate and standardize annotations in UniProt

3 Proteomics and Bioinformatics
Data:
- Gene expression profiling: genome-wide analysis of gene expression
- Protein-protein interaction
- Structural genomics: 3D structures of all protein families
- Genome projects (sequencing)
- …
Bioinformatics:
- Computational analysis and integration of these data
- Making predictions (function, etc.), reconstructing pathways

4 What’s In It For Me? Sequence function
When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about the protein and its possible function from the available data. This is especially important for poorly characterized or uncharacterized (“hypothetical”) proteins, and more challenging for large sets of sequences generated by large-scale proteomics experiments. The quality of this assessment is often critical for interpreting experimental results and forming hypotheses for future experiments.

5 Family Classification
Work with protein, not DNA sequence.
Diagram labels: genomic DNA sequence (5' UTR, promoter, exon 1, intron, exon 2, exon 3, 3' UTR), gene recognition, gene, protein sequence, structure determination, protein structure, function analysis, function, family classification, protein family, molecular evolution, gene network, metabolic pathway.

6 The Changing Face of Protein Science
20th century: few well-studied proteins; mostly globular with enzymatic activity; biased protein set.
21st century: many “hypothetical” proteins (most new proteins come from genome sequencing projects, many have unknown functions); various, often with no enzymatic activity; natural protein set.
Credit: Dr. M. Galperin, NCBI

7 Knowing the Complete Genome Sequence
Advantages:
- All encoded proteins can be predicted and identified
- The missing functions can be identified and analyzed
- Peculiarities and novelties in each organism can be studied
- Predictions can be made and verified
Challenge:
- Accurate assignment of known or predicted functions (functional annotation)

8 E. coli M. jannaschii S. cerevisiae H. sapiens
Figure legend: characterized experimentally; characterized by similarity; unknown, conserved; unknown, no similarity. From Koonin and Galperin, 2003, with modifications.

9 Functional Annotation for Different Groups of Proteins
- Experimentally characterized: find up-to-date information, accurate interpretation
- Characterized by similarity (“knowns”), i.e. closely related to experimentally characterized proteins: avoid propagation of errors
- Function can be predicted (no close sequence similarity; may have distant similarity to characterized proteins): extract maximum possible information, avoid errors and overpredictions. Most value-added (fill the gaps in metabolic pathways, etc.)
- “Unknowns” (conserved or unique): rank by importance

10 How are Protein Sequences Annotated?
The “regular approach”, from protein sequence to function: automatic assignment of gene name, protein name and function based on sequence similarity (best BLAST hit). Large-scale functional annotation of sequences based simply on the best BLAST hit has pitfalls; results are far from perfect. To avoid mistakes, human intervention (manual annotation) is needed. Quality vs. quantity.

12 Problems in Functional Assignments for “Knowns”
- Misinterpreted experimental results (e.g. suppressors, cofactors)
- Biologically senseless annotations:
  Arabidopsis: separation anxiety protein-like
  Helicobacter: brute force protein
  Methanococcus: centromere-binding protein
  Plasmodium: frameshift
- “Goofy” mistakes of sequence comparison (e.g. abc1/ABC)
- Multi-domain organization of proteins
- Low sequence complexity (coiled-coil, transmembrane, non-globular regions)
- Enzyme evolution: divergence in sequence and function (minor mutation in active site)
- Non-orthologous gene displacement: convergent evolution

13 Problems in Functional Assignments for “Knowns”: multi-domain organization of proteins
Diagram: a new sequence containing an ACT domain is compared by BLAST with a chorismate mutase that contains a chorismate mutase domain and an ACT domain. In the BLAST output, the top hits are to chorismate mutases, so the name “chorismate mutase” is automatically assigned to the new sequence. ERROR! The protein gets an erroneous name and EC number, is assigned to an erroneous pathway, etc.
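One common guard against this kind of error is to require that the alignment span most of both the query and the hit before a name is transferred. The sketch below illustrates that idea only; it is not the procedure used by PIR or by BLAST itself. The `Hit` record, the 0.8 threshold and all coordinates are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment between the query and a database protein (hypothetical record)."""
    subject_name: str
    query_start: int      # 1-based, inclusive
    query_end: int
    subject_start: int
    subject_end: int
    subject_length: int


def covered_fraction(start: int, end: int, length: int) -> float:
    """Fraction of a sequence of the given length spanned by [start, end]."""
    return (end - start + 1) / length


def can_transfer_name(hit: Hit, query_length: int, min_coverage: float = 0.8) -> bool:
    """Allow name transfer only if the alignment spans most of BOTH proteins."""
    query_cov = covered_fraction(hit.query_start, hit.query_end, query_length)
    subject_cov = covered_fraction(hit.subject_start, hit.subject_end, hit.subject_length)
    return query_cov >= min_coverage and subject_cov >= min_coverage


# Invented coordinates: the alignment covers only a shared C-terminal ACT domain.
act_only_hit = Hit("chorismate mutase", 60, 140, 280, 360, 380)
# Invented coordinates: the alignment covers both proteins end to end.
full_length_hit = Hit("ACT domain protein", 1, 150, 1, 155, 155)

print(can_transfer_name(act_only_hit, query_length=150))     # False
print(can_transfer_name(full_length_hit, query_length=150))  # True
```

A coverage filter of this kind only flags partial matches; deciding what the protein should be called still requires looking at its full domain organization.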

14 Problems in Functional Assignments for “Knowns”
Previous low quality annotations lead to propagation of mistakes

16 Functional Prediction: I. Sequence and Structure Analysis (homology-based methods)
In non-obvious cases:
- Sophisticated database searches (PSI-BLAST, HMM)
- Detailed manual analysis of sequence similarities
- Structure-guided alignments and structure analysis
Often, only general function can be predicted:
- Enzyme activity can be predicted, but the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
- Helix-turn-helix motif proteins (predicted transcriptional regulators)
- Membrane transporters

17 Using Sequence Analysis:
Hints: Proteins (domains) with different 3D folds are not homologous (unrelated by origin). Proteins with similar 3D folds are usually (but not always) homologous. Those amino acids that are conserved in divergent proteins within a (super)family are likely to be functionally important (catalytic or binding sites, etc.). Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition.

18 Using Sequence Analysis:
Hints: Prediction of 3D fold (if distant homologs have known structures!) and of general biochemical function is much easier than prediction of exact biological function. Sequence analysis complements structural comparisons and can greatly benefit from them. Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise. Credit: Dr. M. Galperin, NCBI

19 Structural Genomics: Structure-Based Functional Predictions
Protein Structure Initiative: determine 3D structures of all protein families. Structural genomics is the systematic determination of 3-dimensional structures of proteins representative of the range of protein structure and function found in nature. The aim, ultimately, is to build a body of structural information that will facilitate prediction of a reasonable structure and potential function for almost any protein from knowledge of its coding sequence. Such information will be essential for understanding the functioning of the human proteome, the ensemble of tens of thousands of proteins specified by the human genome.
Example: Methanococcus jannaschii MJ0577 (hypothetical protein) contains bound ATP => ATPase or ATP-mediated molecular switch. Confirmed by biochemical experiments.

20 Crystal Structure is Not a Function!
Credit: Dr. M. Galperin, NCBI

21 Functional Prediction: II. Computational Analysis Beyond Homology
Phylogenetic distribution (comparative genomics):
- Wide: most likely essential
- Narrow: probably clade-specific
- Patchy: most intriguing
Domain association: “Rosetta Stone”
Genome context (gene neighborhood, operon organization)
Clues: specific to niche, pathway type
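The wide / narrow / patchy distinction can be sketched as a simple rule over a presence/absence profile. The slide gives no numeric definitions, so the 0.8 cutoff, the "exactly one clade" rule for narrow, and the clade names and profiles below are all assumptions made for illustration.

```python
def classify_distribution(presence, wide_fraction=0.8):
    """Label a phyletic profile as 'wide', 'narrow', 'patchy' or 'absent'.

    presence maps a clade name to a list of booleans, one per sequenced
    genome in that clade (True = a family member is present).
    The thresholds are illustrative assumptions, not published cutoffs.
    """
    total = sum(len(genomes) for genomes in presence.values())
    hits = sum(sum(genomes) for genomes in presence.values())
    if hits == 0:
        return "absent"
    clades_with_hits = [clade for clade, genomes in presence.items() if any(genomes)]
    if hits / total >= wide_fraction:
        return "wide"      # present in nearly all genomes
    if len(clades_with_hits) == 1:
        return "narrow"    # confined to a single clade
    return "patchy"        # scattered across several clades


# Hypothetical profiles over three clades.
wide_profile = {"Proteobacteria": [True, True, True], "Firmicutes": [True, True], "Archaea": [True, False]}
narrow_profile = {"Proteobacteria": [True, True, False], "Firmicutes": [False, False], "Archaea": [False]}
patchy_profile = {"Proteobacteria": [True, False, False], "Firmicutes": [False, True], "Archaea": [True, False]}

for profile in (wide_profile, narrow_profile, patchy_profile):
    print(classify_distribution(profile))
```

In practice the label is only a starting point: a patchy profile, for example, still has to be interpreted against the organisms' niches and pathways.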

22 Using Genome Context for Functional Prediction
SEED analysis tool (by FIG) Embden-Meyerhof and Gluconeogenesis pathway: 6-phosphofructokinase (EC )

23 Functional Prediction: Problem Areas
Identification of protein-coding regions Delineation of potential function(s) for distant paralogs Identification of domains in the absence of close homologs Analysis of proteins with low sequence complexity

24 What to do with a new protein sequence
Basic:
- Domain analysis (SMART = most sensitive; Pfam = most complete; CDD)
- BLAST
- Curated protein family databases (PIRSF, InterPro, COGs)
- Literature (PubMed), via links from individual entries in the BLAST output (look for SwissProt entries first)
If not sufficient:
- PSI-BLAST
- Refined PubMed search using gene/protein names, synonyms, function and other terms you found
- Genome neighborhood (prokaryotes)
Advanced:
- Multiple sequence alignments (manual)
- Structure-guided alignments and structure analysis
- Phylogenetic tree reconstruction

25 Case Study: Prediction Verified: GGDEF domain
Proteins containing this domain:
- Caulobacter crescentus PleD controls the swarmer cell to stalk cell transition (Hecht and Newton, 1995)
- In Rhizobium leguminosarum and Acetobacter xylinum, required for cellulose biosynthesis (regulation)
Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc.)
In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). A multidomain protein with a GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998)
Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001)
Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)

26 The Need for Classification
Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on the best BLAST hit has pitfalls; results are far from perfect
- Manual annotation of individual proteins is not efficient
Solution: Highly curated and annotated protein classification system
Facilitates:
- Automatic annotation of sequences based on protein families (good quality and large-scale)
- Systematic correction of annotation errors
- Protein name standardization
- Functional predictions for uncharacterized proteins
This all works only if the system is optimized for annotation.

27 Levels of Protein Classification
Level | Example | Similarity | Evolution
Class | α/β | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication
Domain classification is easiest to handle, but insufficient for annotation. Example on next slide.

28 With enough similarity, one can trace back to a common origin
Protein Evolution. Domain: evolutionary/functional/structural unit. Sequence changes. Domain shuffling. This would be easy if proteins all evolved in a gradual manner, but… With enough similarity, one can trace back to a common origin. What about these?

29 Consequences of Domain Shuffling
Legend: CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain.
Diagram: domain architectures built from CM (AroQ type), PDH, PDT and ACT, shown for the families PIRSF001501, PIRSF006786, PIRSF001499, PIRSF005547, PIRSF001424 and PIRSF001500, with candidate annotations marked by question marks (CM? CM/PDH? PDH? PDT? CM/PDT?).

30 Whole Protein = Sum of its Parts?
PIRSF006256
Domains: Peptidase M22, Acylphosphatase, ZnF, YrdC
On the basis of domain composition alone, biological function was predicted to be:
● RNA-binding translation factor
● maturation protease
Actual function:
● [NiFe]-hydrogenase maturation factor, carbamoyltransferase
Despite the problem caused by domain shuffling, the previous slide showed a relatively easy case for annotation based on domain fusions: the whole equals the sum of its parts. But this is not always the case. Full-length protein functional annotation is best done using annotated full-length protein families.

31 Practical classification of proteins: setting realistic goals
We strive to reconstruct the natural classification of proteins to the fullest possible extent, BUT domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity). THUS the further we extend the classification, the finer the domain structure we need to consider, SO we need to compromise between the depth of analysis and protein integrity, OR … Credit: Dr. Y. Wolf, NCBI

32 Complementary Approaches
Domain classification:
- Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin
- Can usually annotate only general biochemical function
Full-length protein classification:
- Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling
- Can usually annotate specific biological function (preferred for annotating individual proteins)
- Can map domains onto proteins
- Can classify proteins even when domains are not defined

34 Protein Classification Databases
Domain classification: Pfam, SMART, CDD
Full-length protein classification: PIRSF
Mixed: TIGRFAMs, COGs
Based on structural fold: SCOP
InterPro: integrates various types of classification databases

35 InterPro
Integrated resource for protein families, domains and sites. Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF.
Example shown: SF001500, bifunctional chorismate mutase/prephenate dehydratase (CM, PDT, ACT)

36 The Ideal System…
- Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence
- Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology
- Allows for simultaneous use of the full-length protein and domain information (domains mapped onto proteins)
- Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families
- Expertly curated membership, family name, function, background, etc.
- Evidence attribution (experimental vs. predicted)

37 PIRSF Classification System
PIRSF:
- Reflects evolutionary relationships of full-length proteins
- A network structure from superfamilies to subfamilies
- Computer-assisted manual curation
Definitions:
- Homeomorphic family: the basic unit
- Homologous: common ancestry, inferred by sequence similarity
- Homeomorphic: full-length similarity and common domain architecture
- Hierarchy: flexible number of levels with varying degrees of sequence conservation
- Network structure: allows multiple parents
Advantages:
- Annotates both general biochemical and specific biological functions
- Accurate propagation of annotation and development of standardized protein nomenclature and ontology

38 PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
Example: 6 subfamilies (IGFBP-1 through 6) correspond to the 6 types known in the literature (based on conserved intron/exon organization, sequence similarity, high binding affinity to IGFs, and function).
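The two constraints stated on this slide (one homeomorphic family per protein, any number of parents per family) can be modelled as a small directed acyclic graph. The sketch below is a toy data structure, not PIR's actual schema: the `Node` and `Classification` classes, the level names and the protein identifier are hypothetical. The family PIRSF001500 and its CM, PDT and ACT domains are taken from the InterPro slide; the domain nodes are named by abbreviation only because no superfamily identifiers are given.

```python
class Node:
    """A node in the classification network (hypothetical structure)."""

    def __init__(self, node_id, level):
        self.node_id = node_id
        self.level = level        # "domain_superfamily", "homeomorphic_family" or "subfamily"
        self.parents = []         # a node may have several parents
        self.children = []


def link(parent, child):
    """Record a parent-child edge in both directions."""
    parent.children.append(child)
    child.parents.append(parent)


class Classification:
    """Enforces that each protein belongs to exactly one homeomorphic family."""

    def __init__(self):
        self._family_of = {}

    def assign(self, protein_id, family):
        if family.level != "homeomorphic_family":
            raise ValueError("proteins are assigned at the homeomorphic family level")
        current = self._family_of.get(protein_id)
        if current is not None and current is not family:
            raise ValueError(f"{protein_id} already belongs to {current.node_id}")
        self._family_of[protein_id] = family

    def family_of(self, protein_id):
        return self._family_of[protein_id]


# One family with three domain superfamily parents, one per domain of its members.
family = Node("PIRSF001500", "homeomorphic_family")
for domain_name in ("CM", "PDT", "ACT"):
    link(Node(domain_name, "domain_superfamily"), family)

classification = Classification()
classification.assign("protein_1", family)   # hypothetical protein identifier
print([parent.node_id for parent in family.parents])   # ['CM', 'PDT', 'ACT']
```

Allowing several parents is what makes this a network rather than a strict tree, while the single-family rule keeps annotation propagation unambiguous for each protein.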

39 PIRSF Family Report: Curated Protein Family Information
Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF Alpha-crystallin is exclusively found in metazoans Phylogenetic tree and alignment view allows further sequence analysis

40 PIRSF Hierarchy and Network: DAG Viewer
Domain level Homeomorphic Family level Subfamily level

41 Alpha-Crystallin and Related Proteins
Alpha crystallin beta chain HSPs Alpha crystallin alpha chain

43 PIRSF Family Report (II)
Integrated value added information from other databases Mapping to other protein classification databases

44 PIRSF Protein Classification: Platform for Protein Analysis and Annotation
- Matching a protein sequence to a curated protein family rather than searching against a protein database
- All information in one place
- Provides value-added information by expert curators, e.g., annotation of uncharacterized hypothetical proteins (functional predictions)
- Improves automatic annotation quality
- Serves as a protein analysis platform for a broad range of users

45 Family-Driven Protein Annotation
Objective: optimize for protein annotation
PIRSF classification name:
- Reflects the function when possible
- Indicates the maximum specificity that still describes the entire group
- Standardized format
- Name tags: validated, tentative, predicted, functionally heterogeneous
Hierarchy:
- Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)
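"Maximum specificity that still describes the entire group" can be read as taking the deepest term shared by every member along a general-to-specific path. The sketch below illustrates that reading only; PIRSF names are assigned by curators, not by this function. The function name and the member paths are assumptions; only the kinase -> sugar kinase -> hexokinase chain comes from the slide.

```python
def most_specific_shared_name(member_paths):
    """Return the deepest term common to all members, or None if they share nothing.

    Each path lists functional terms from most general to most specific,
    e.g. ("kinase", "sugar kinase", "hexokinase").
    """
    if not member_paths:
        raise ValueError("need at least one member")
    shared = []
    for terms in zip(*member_paths):   # compare the paths level by level
        if len(set(terms)) != 1:
            break                      # members diverge at this level
        shared.append(terms[0])
    return shared[-1] if shared else None


# Illustrative members; the fructokinase path is an invented example.
family_members = [
    ("kinase", "sugar kinase", "hexokinase"),
    ("kinase", "sugar kinase", "fructokinase"),
]
subfamily_members = [
    ("kinase", "sugar kinase", "hexokinase"),
    ("kinase", "sugar kinase", "hexokinase"),
]

print(most_specific_shared_name(family_members))     # sugar kinase
print(most_specific_shared_name(subfamily_members))  # hexokinase
```

The second call shows why subfamilies increase specificity: a narrower membership lets a more specific name describe the whole group.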

46 Family-Driven Protein Annotation: Site Rules and Name Rules
Goal: automatic annotation of sequences based on protein families, to address the quality versus quantity problem
- Define conditions under which family properties propagate to individual proteins
- Propagate protein name, function, functional sites, EC, GO terms, pathway
- Enable further specificity based on taxonomy or motifs
- Account for functional variations within one PIRSF, including:
  - Lack of active site residues necessary for enzymatic activity
  - Certain activities relevant only to one part of the taxonomic tree
  - Evolutionarily related proteins whose biochemical activities are known to differ
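The conditions listed above can be sketched as a rule that hands out a specific name only when a taxonomic condition and an active-site check both pass, and otherwise falls back to a more general name. This is a minimal illustration, not the actual PIR rule format: the `FamilyRule` fields, the family identifier, both names, the residue positions and the sequences are hypothetical. It also assumes site positions are already expressed in the protein's own coordinates, whereas a real system would have to map them through an alignment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FamilyRule:
    """Conditions under which a family's name propagates to a member (hypothetical format)."""
    family_id: str
    specific_name: str
    general_name: str
    required_taxon: Optional[str] = None
    # 1-based position -> residues allowed there (assumed to be in protein coordinates)
    active_site: Dict[int, str] = field(default_factory=dict)


def propagate_name(rule: FamilyRule, family_id: str, lineage: List[str], sequence: str) -> Optional[str]:
    """Return the name to assign, or None if the rule does not apply to this family."""
    if family_id != rule.family_id:
        return None
    if rule.required_taxon is not None and rule.required_taxon not in lineage:
        return rule.general_name      # activity relevant only to part of the taxonomic tree
    for position, allowed in rule.active_site.items():
        if position > len(sequence) or sequence[position - 1] not in allowed:
            return rule.general_name  # active site residue missing: do not claim the activity
    return rule.specific_name


rule = FamilyRule(
    family_id="PIRSF_EXAMPLE",                      # placeholder, not a real PIRSF identifier
    specific_name="example hydrolase",
    general_name="example hydrolase-like protein",
    required_taxon="Bacteria",
    active_site={4: "S", 5: "H"},
)

print(propagate_name(rule, "PIRSF_EXAMPLE", ["Bacteria", "Proteobacteria"], "MKTSHD"))
print(propagate_name(rule, "PIRSF_EXAMPLE", ["Bacteria", "Proteobacteria"], "MKTAHD"))
```

The first call passes both checks and receives the specific name; the second lacks the residue required at position 4 and receives the general name instead.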

47 Overview
Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on the best BLAST hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:
- Homology-based (sequence analysis, structure analysis)
- Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:
- Highly curated and annotated protein classification system
Facilitates:
- Automatic annotation of sequences based on protein families
- Systematic correction of annotation errors
- Name standardization in UniProt
- Functional predictions for uncharacterized proteins

48 Impact of Protein Bioinformatics and Genomics
Single protein level:
- Discovery of new enzymes and superfamilies
- Prediction of active sites and 3D structures
Pathway level:
- Identification of “missing” enzymes
- Prediction of alternative enzyme forms
- Identification of potential drug targets
Cellular metabolism level:
- Multisubunit protein systems
- Membrane energy transducers
- Cellular signaling systems

