PROTEOMICS 3D Structure Prediction. Contents Protein 3D structure. –Basics –PDB –Prediction approaches Protein classification.


Protein Structures:
Primary: amino acid sequence.
Secondary: alpha helices, beta sheets, and loops.
Tertiary: packing of secondary elements.
Quaternary: packing of several polypeptide chains.

How Does a Protein Fold? The classical nucleation-propagation model: –The first event (fast) is hydrophobic collapse, accompanied by the formation of secondary structures. Domains are formed in this step. –The second step (slow) is the precise ordering of the secondary elements: packing of the hydrophobic core, domain arrangement, etc. The 3D structure is assumed to be the most stable structure, i.e. the one with minimal free energy. –Is that a local minimum or the global minimum?

Prions Proteins found in mammals. Responsible for mad cow disease. A normal prion and an abnormal prion have the same sequence; the difference lies in the 3D structure. The disease is assumed to propagate when an abnormal prion is introduced and converts normal prions to the abnormal conformation. Conclusion: a single protein can have several stable conformations.

PDB - Protein Data Bank Contains proteins whose structure has been solved. Number of solved proteins: 19,225. Ratio of solved structures to known protein sequences: 1/7 (SwissProt) to 1/40 (TrEMBL). The entry for each protein consists of the x, y, z coordinates of every atom.
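As an illustration of what "the x, y, z coordinates of every atom" looks like in practice, here is a minimal sketch that reads ATOM records from PDB-format text using the format's fixed column positions. The `Atom` type, the `parse_atoms` function, and the two sample records are illustrative assumptions, not part of the original slides, and the sample values are not taken from a real entry.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str      # atom name, e.g. "CA"
    residue: str   # three-letter residue name, e.g. "MET"
    chain: str     # chain identifier
    res_seq: int   # residue sequence number
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract ATOM records using the fixed column layout of the PDB format."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM, TER, etc.
        atoms.append(Atom(
            name=line[12:16].strip(),
            residue=line[17:20].strip(),
            chain=line[21],
            res_seq=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms


# Two illustrative records (hypothetical values, not from a real PDB entry).
SAMPLE = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38\n"
)

for atom in parse_atoms(SAMPLE):
    print(atom.name, atom.residue, atom.res_seq, (atom.x, atom.y, atom.z))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real PDB files can run together without a separating space.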

Prion Protein Domain from Mouse – Entry 1AG2: Ribbons Vs. Cylinders

Broad View of the Protein World I Estimation: ~ ,000 protein families composed of members that share detectable sequence similarity. –A new sequence is expected to be similar to other sequences in the database, and can be expected to share structural features with these proteins. Structure prediction: –>50% sequence identity implies similar structure. –>30% sequence identity implies common structural elements.
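The two identity cut-offs above can be applied to a pairwise alignment as in the following sketch. The function names are hypothetical, and the choice to count identity over columns where neither sequence has a gap is an assumption; other conventions for the denominator exist and give different percentages.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def structural_expectation(identity: float) -> str:
    """Map an identity percentage to the rule of thumb quoted on the slide."""
    if identity > 50:
        return "similar structure expected"
    if identity > 30:
        return "common structural elements expected"
    return "no conclusion from sequence identity alone"


# Toy alignment: 7 identical residues out of 9 gap-free columns.
identity = percent_identity("ACDEFGHIK-", "ACDQFGHLKM")
print(round(identity, 1), structural_expectation(identity))
```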

Broad View of the Protein World II There is a limited number of different 3D structures. –When newly determined structures are compared with previously found ones, the new structure often folds into alpha and beta elements in the same order and in the same spatial configuration as an already known structure. Often there is no sequence similarity: totally different sequences can fold into similar structures.

Three Main Approaches for Structural Prediction: Ab-Initio. Comparative Modeling. Fold Recognition. Example: A pathway for folding a 2-domain protein.

The Ab-Initio Method The structural prediction problem: “Given a protein sequence, compute its structure”. Computation is based on energy calculations stemming from the position of each atom in space and its physical-chemical relations with other atoms. Theoretically possible. Astronomical, highly under-constrained search space. The biophysics is complex and incompletely understood. Practically, next to impossible.

Comparative (Homology) Modeling Evolutionarily related (homologous) proteins usually have similar structures. The similarity of structures is very high in core regions (helices & sheets). However, loops may vary even between homologous structures with a high degree of sequence similarity. Figure legend: thick backbone - known structure; thin lines - modeled structure. Some side-chains are not positioned correctly, but some look good.

Modeling Performance Structure similarity predicted from sequence similarity: Sander & Schneider (1991) aligned all the sequences in PDB and developed a formula for structure similarity based on sequence similarity. The sequence identity needed to infer structure similarity depends on the length of the protein.

Modeling Performance - Examples A protein of 10 amino acids requires 80% identity for a similar structure. A protein of length > 80 requires: –~30% identity for common sub-structures. –~50% identity for a similar structure. –~80% identity for a similar structure at very good resolution.
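The length dependence can be made concrete with the threshold curve usually attributed to Sander & Schneider (1991). The slides do not state the formula; the constants below (290.15, -0.562, and a 24.8% plateau above 80 residues) are the commonly quoted form and should be checked against the original paper before being relied on. The function name is hypothetical.

```python
def identity_threshold(length: int) -> float:
    """Approximate percent identity above which structural similarity is implied.

    Commonly quoted form of the Sander & Schneider (1991) curve; the constants
    are an assumption here, not taken from the slides.
    """
    if length < 10:
        raise ValueError("the curve is not defined for alignments shorter than 10")
    if length > 80:
        return 24.8
    return 290.15 * length ** -0.562


for n in (10, 40, 80, 200):
    print(n, round(identity_threshold(n), 1))
```

At length 10 the curve gives just under 80%, matching the first example above. The plateau for long alignments is somewhat lower than the rounder ~30% figure quoted on the slide.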

Fold Recognition Approaches Fold - a combination of secondary structural units in the same configuration. Protein structural classification uses fold as a basic level of classification.

Fold Family Relations Estimation 1: There are ,000 protein families, based on homology. Each family contains ~ one fold. Estimation 2: There are protein folds. Conclusions: 1. Many protein families share the same fold. 2. Different sequences are folded similarly. The common fold approach to structure prediction: Use the collection of determined structures to predict the structure of a protein.

How Condensed is a Fold? How many different sequences can result in the same fold for an average domain of 150 amino acids? –There are ~ different sequences –about are less than 20% identical. –Assume that only 1 in a million has a stable fold –Expected number of different folds is –About different sequences fold similarly.
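The numeric values on this slide were lost in the transcript. Only the first one can be recovered from first principles: with 20 amino acids and a domain of 150 residues, the number of possible sequences is 20^150. The sketch below computes its order of magnitude; it does not attempt to reconstruct the other missing figures.

```python
import math

DOMAIN_LENGTH = 150   # average domain length quoted on the slide
ALPHABET_SIZE = 20    # standard amino acids

# Python integers are arbitrary precision, so the exact value is available.
total_sequences = ALPHABET_SIZE ** DOMAIN_LENGTH

# Order of magnitude: log10(20^150) = 150 * log10(20).
exponent = DOMAIN_LENGTH * math.log10(ALPHABET_SIZE)

print(f"20^150 has {len(str(total_sequences))} digits, i.e. about 10^{exponent:.1f}")
```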

Fold Recognition A fold is shared by family members, both close and distant (distance is related to sequence similarity). –Example: the globin fold. For a query protein, if its family members are identified and their fold is known, we can assign it the same fold. Method 1: Which alignment algorithm detects both close and distant relatives? PSI-BLAST.

Fold Recognition - Threading Threading allows structure similarity to be identified without sequence similarity. The amino acid (aa) sequence of a query protein is examined for compatibility with the structural core of a known protein. It asks the inverse question: “Given a protein structure, what sequences fold into it?”

Threading The protein core is a very compact environment composed of alpha and beta secondary structures. It is very hydrophobic, with no room for water molecules, extra aa, or aa with chemically different side chains. Side chains have many contacts with neighboring aa for stability. Threading matches the aa of the query with the aa of a known structure: –If threading gives a good score, then the core of the query is assumed to fold similarly.

Threading Two main methods: –Contact potential method. –Structural profile (environmental template). Contact potential method: –The number of contact points and the proximity between aa are analyzed for every known structure. –The query is checked against all the interactions in the core and their contribution to the stability of the structure. –The fold that results in the most energetically stable structure is chosen.
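The contact potential idea can be sketched as follows: each known fold is reduced to a list of residue-position pairs that are in contact, the query is mounted on the fold, and a pairwise potential is summed over the contacts. Everything here is a toy assumption for illustration: the two-class hydrophobic/polar potential, the energy values, the fold names, and the ungapped mounting (query position i placed on structure position i). Real methods use statistically derived potentials and allow gaps.

```python
from typing import Dict, List, Tuple

HYDROPHOBIC = set("AVLIMFWC")  # coarse, hypothetical classification

Contacts = List[Tuple[int, int]]  # pairs of positions in contact in a known fold


def toy_contact_energy(a: str, b: str) -> float:
    """Hypothetical two-class potential; lower energy is more favourable."""
    a_hydrophobic = a in HYDROPHOBIC
    b_hydrophobic = b in HYDROPHOBIC
    if a_hydrophobic and b_hydrophobic:
        return -1.0   # buried hydrophobic pair: stabilising
    if a_hydrophobic != b_hydrophobic:
        return 0.5    # hydrophobic next to polar: penalised
    return 0.0        # polar pair: neutral in this toy model


def threading_energy(sequence: str, contacts: Contacts) -> float:
    """Sum the potential over all contacts of a fold, ungapped mounting."""
    return sum(toy_contact_energy(sequence[i], sequence[j]) for i, j in contacts)


def best_fold(sequence: str, folds: Dict[str, Contacts]) -> str:
    """Return the fold name with the lowest (most stable) total energy."""
    return min(folds, key=lambda name: threading_energy(sequence, folds[name]))


folds = {
    "fold_a": [(0, 2), (2, 4), (0, 4)],  # brings positions 0, 2, 4 together
    "fold_b": [(0, 1), (2, 3), (4, 5)],  # pairs sequence neighbours instead
}
query = "LKVEAD"
print(best_fold(query, folds))
```

For the query above, `fold_a` packs the three hydrophobic residues against each other and scores -3.0, while `fold_b` pairs each of them with a polar residue and scores 1.5.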

Threading - Structural Profile The environment of every aa in known structures is determined, including: –the secondary structure, the area of the side chain that is buried by closeness to other atoms, the types of nearby side chains, etc. Each position is classified into one of 18 types: –6 classes representing increasing levels of residue burial and fraction of surface covered by polar atoms, –combined with 3 classes of secondary structure. Each aa is assessed for its ability to fit into that type of site in the structure. –A buried site is matched well by a hydrophobic aa.
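The 18 environment types are simply the combinations of 6 burial/polarity classes with 3 secondary-structure classes. The sketch below enumerates them. The class labels are placeholders chosen for illustration; the slides do not name the classes or give the thresholds that separate them.

```python
from itertools import product

# Placeholder labels (assumptions): six burial/polarity classes, three SS classes.
BURIAL_CLASSES = ["B1", "B2", "B3", "P1", "P2", "E"]
SECONDARY_STRUCTURES = ["helix", "sheet", "other"]

ENVIRONMENTS = [f"{b}/{s}" for b, s in product(BURIAL_CLASSES, SECONDARY_STRUCTURES)]


def environment_index(burial: str, secondary: str) -> int:
    """Row index (0..17) of an environment type in a structural profile."""
    return (BURIAL_CLASSES.index(burial) * len(SECONDARY_STRUCTURES)
            + SECONDARY_STRUCTURES.index(secondary))


print(len(ENVIRONMENTS), ENVIRONMENTS[environment_index("P1", "sheet")])
```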

Structural Profile Profile rows are the residue positions of the structure, each labelled with one of the 18 environment types. Profile columns are the 20 aa + insertion + deletion. –If a residue is inside a loop, many substitutions are allowed, as well as insertions and deletions. The score for a given aa at a position estimates how well that aa fits the position's environment type. How shall we find the best-fitting region?

Structural Profile A dynamic programming algorithm finds the best match of a query sequence to a specific fold. –Statistical significance can be computed by doing the same for all sequences in the database. The same analysis is repeated for each fold. The fold with the best statistically significant score is chosen.
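A minimal sketch of the two steps described above: a local dynamic programming alignment of a sequence against a position-specific profile, followed by a z-score against the scores of other database sequences. It simplifies the slide's description in one labeled way: the profile's per-position insertion and deletion columns are replaced by a single linear gap penalty. The profile values, the `default` score for unlisted residues, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

Profile = List[Dict[str, float]]  # one {amino acid: score} mapping per position


def align_to_profile(seq: str, profile: Profile,
                     gap: float = -2.0, default: float = -1.0) -> float:
    """Best local alignment score of seq against profile (Smith-Waterman style)."""
    prev = [0.0] * (len(seq) + 1)
    best = 0.0
    for position in profile:
        cur = [0.0] * (len(seq) + 1)
        for j in range(1, len(seq) + 1):
            match = prev[j - 1] + position.get(seq[j - 1], default)
            # A cell may start a new local alignment (0), extend a match,
            # or open a gap in either the sequence or the profile.
            cur[j] = max(0.0, match, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def z_score(query_score: float, background_scores: Sequence[float]) -> float:
    """How many standard deviations the query lies above the background mean."""
    spread = pstdev(background_scores)
    if spread == 0:
        raise ValueError("background scores have zero variance")
    return (query_score - mean(background_scores)) / spread


# Toy three-position profile (hypothetical scores).
toy_profile = [{"L": 3, "V": 2}, {"K": 3}, {"L": 3, "I": 2}]
score = align_to_profile("ALKLA", toy_profile)
print(score, round(z_score(score, [1.0, 2.0, 3.0]), 2))
```

Repeating `align_to_profile` over the profiles of all known folds and keeping the fold with the highest significant z-score corresponds to the selection step on the slide.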

Threading - Pros and Cons Pros: Good results. Environmental properties may be more accurate than amino acid similarity matrices. Can lead to effective and fast implementations. Able to discover structural similarities that are impossible to detect by sequence searching methods. Cons: Requires the existence of already known proteins with similar structure.

CASP - Critical Assessment of Structure Prediction Competition among groups to predict the 3D structure of proteins whose structures are about to be solved experimentally. Current state - only fragments are “solved”: Ab initio - the worst, but greatly improved in recent years. Comparative modeling - performs very well when homologous sequences with known structures exist. Fold recognition - PSI-BLAST is used for training the threading procedures. Performs well.

A Clickable Structure Prediction Flowchart :

Protein Classification Proteins are classified to reflect both structural and evolutionary relatedness. The principal levels are:
1. Family: Clear evolutionary relationship. In general, > 30% pairwise residue identity between the proteins.
2. Superfamily: Probable common evolutionary origin. Combines families whose member proteins have low sequence identities, but whose structural and functional features suggest a common evolutionary origin. Structurally, superfamily members share a common fold.

SCOP - Structural Classification of Proteins Hierarchical classification of all proteins with known structures. Classification levels, from broadest to most specific: Class - all alpha, all beta, alpha/beta (a/b), alpha + beta (a + b). Fold - the major structural similarity unit. Superfamily. Family. PDB entry for a protein.

CATH - Class Architecture Topology Homologous Superfamily Another protein structure classification database. Classification: Class - mainly alpha, mainly beta, mixed alpha-beta, few secondary structures. Architecture - gross orientation of secondary structures, independent of connectivity. Topology - clusters structures according to their topological connections and numbers of secondary structures. Homologous superfamily - clusters proteins with highly similar structures and functions.
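CATH labels each homologous superfamily with a four-number code, one number per level (C.A.T.H). As a sketch of how the hierarchy can be used, the code below parses such codes and reports the deepest level two entries share. The type and function names are hypothetical, and the example codes are used only to illustrate the format.

```python
from typing import NamedTuple

LEVELS = ("class", "architecture", "topology", "homologous superfamily")


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted C.A.T.H code such as '1.10.490.10' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> str:
    """Name of the most specific level at which two codes still agree."""
    shared = "none"
    for level, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = level
    return shared


first = parse_cath_code("1.10.490.10")
second = parse_cath_code("1.10.530.40")
print(deepest_shared_level(first, second))
```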

PFAM - Protein Families Database that contains a large collection of multiple sequence alignments and profile hidden Markov models (profile HMMs). A profile HMM is a probabilistic model that describes a set of sequences; it is widely used to describe related sequences. Pfam defines domains - areas of homology that have a 3D structure independent of the rest of the protein.
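To give a feel for how a profile describes a set of sequences probabilistically, the sketch below estimates per-column residue probabilities from a small ungapped alignment and scores a sequence by its log-odds against a uniform background. This covers only the match-state emission part of a profile HMM; insert and delete states and transition probabilities are omitted. The alignment, the pseudocount of 1, the uniform background, and the function names are assumptions, and this is not how Pfam itself builds its models.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(alignment: List[str],
                         pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column residue probabilities with additive pseudocounts."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def log_odds(seq: str, profile: List[Dict[str, float]],
             background: float = 1 / 20) -> float:
    """Sum of log2(P(residue | column) / background) over an ungapped match."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(math.log2(profile[i][aa] / background) for i, aa in enumerate(seq))


toy_alignment = ["LKV", "LRV", "IKV"]  # hypothetical family members
toy_model = column_probabilities(toy_alignment)
print(round(log_odds("LKV", toy_model), 2), round(log_odds("GGG", toy_model), 2))
```

A sequence that resembles the family scores above zero, while an unrelated one scores below zero, which is the basis for assigning sequences to families.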

Classification of all the proteins in the SWISSPROT and TrEMBL databases, into groups of related proteins.