PROTEOMICS 3D Structure Prediction
Contents Protein 3D structure. –Basics –PDB –Prediction approaches Protein classification.
Protein Structures: Primary - amino acid sequence. Secondary - alpha helices & beta sheets, loops. Tertiary - packing of secondary elements. Quaternary - packing of several polypeptide chains.
How Does a Protein Fold? The classical nucleation-propagation model: –the first event (fast) is hydrophobic collapse accompanied by the formation of secondary structures. In this step domains are formed. –the second step (slow) is the precise ordering of the secondary elements: packing of the hydrophobic core, domain arrangement, etc. The 3D structure is assumed to be the most stable structure, the one with minimal free energy. –Local minimum or global minimum?
Prions Proteins found in mammals. Responsible for mad cow disease. There is no difference between the sequence of a normal prion and that of an abnormal prion. The difference lies in the 3D structure. The disease is assumed to be propagated by the introduction of an abnormal prion that is capable of changing the configuration of a normal prion into the abnormal one. Conclusion: there are several stable configurations for a single protein.
PDB - Protein Data Bank Contains proteins whose structure has been solved. Number of solved proteins: 19,225. Ratio of solved structures / proteins: 1/7 (SwissProt) - 1/40 (TrEMBL). The entry for each protein consists of the x, y, z coordinates of every atom.
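As a rough illustration of what "the x, y, z coordinates of every atom" looks like in practice, the sketch below reads the fixed-width ATOM/HETATM records of the legacy PDB text format. The names `Atom`, `format_atom` and `parse_atoms` are hypothetical helpers introduced here, and the two sample atoms use made-up coordinates rather than values from any real entry.

```python
import math
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str       # atom name, e.g. "CA"
    res_name: str   # three-letter residue name, e.g. "GLY"
    chain: str      # chain identifier
    res_seq: int    # residue sequence number
    x: float
    y: float
    z: float


def format_atom(serial, name, res_name, chain, res_seq, x, y, z) -> str:
    """Build a simplified ATOM record (hypothetical helper, coordinates only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atoms from ATOM/HETATM lines using the fixed PDB column layout."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")) and len(line) >= 54:
            atoms.append(Atom(
                name=line[12:16].strip(),      # columns 13-16
                res_name=line[17:20].strip(),  # columns 18-20
                chain=line[21],                # column 22
                res_seq=int(line[22:26]),      # columns 23-26
                x=float(line[30:38]),          # columns 31-38
                y=float(line[38:46]),          # columns 39-46
                z=float(line[46:54]),          # columns 47-54
            ))
    return atoms


# Made-up sample: a backbone N and CA atom of one glycine residue.
SAMPLE = "\n".join([
    format_atom(1, "N", "GLY", "A", 124, 11.104, 6.134, -6.504),
    format_atom(2, "CA", "GLY", "A", 124, 11.639, 6.071, -5.147),
])

if __name__ == "__main__":
    n_atom, ca_atom = parse_atoms(SAMPLE)
    bond = math.dist((n_atom.x, n_atom.y, n_atom.z), (ca_atom.x, ca_atom.y, ca_atom.z))
    print(f"{n_atom.name}-{ca_atom.name} distance: {bond:.2f} A")
```

Real entries carry more fields (occupancy, temperature factor, element) and other record types; this sketch ignores them.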
Prion Protein Domain from Mouse – Entry 1AG2: Ribbons Vs. Cylinders
Broad View of the protein world I Estimation: ~ ,000 protein families composed of members that share detectable sequence similarity. –A new sequence is expected to be similar to other sequences in the database, and can be expected to share structural features with these proteins. Structure prediction: –>50% sequence identity implies a similar structure. –>30% sequence identity implies common structural elements.
Broad View of the protein world II There is a limited number of different 3D structures. –When newly determined structures are compared with previously found ones, the new structures often fold into alpha & beta elements in the same order and in the same spatial configuration as already known structures. Often there is no sequence similarity. Totally different sequences can fold into similar structures.
Three Main Approaches for Structural Prediction: Ab-Initio. Comparative Modeling. Fold Recognition. Example: A pathway for folding a 2-domain protein.
The Ab-Initio Method The Structural Prediction Problem: “Given a protein sequence, compute its structure”. Computation is based on energy calculation stemming from the position of each atom in space and its physical-chemical relations with other atoms. Theoretically possible. Astronomical, highly under-constrained search space. The biophysics is complex and incompletely understood. Practically, next to impossible.
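To give a feel for why the search space is called astronomical, here is a back-of-the-envelope sketch. The numbers are assumptions chosen for illustration only, not figures from the slides: 3 backbone states per residue, a 100-residue chain, and 10^13 conformations examined per second.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def conformation_count(n_residues: int, states_per_residue: int = 3) -> int:
    """Number of conformations if every residue independently picks one state."""
    return states_per_residue ** n_residues


def log10_years_to_enumerate(n_residues: int,
                             states_per_residue: int = 3,
                             rate_per_second: float = 1e13) -> float:
    """log10 of the years needed to visit every conformation once.

    Works in log space so that long chains do not overflow a float.
    """
    log10_count = n_residues * math.log10(states_per_residue)
    return log10_count - math.log10(rate_per_second) - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    n = 100  # assumed chain length
    print(f"conformations: about 10^{math.log10(conformation_count(n)):.0f}")
    print(f"exhaustive search: about 10^{log10_years_to_enumerate(n):.0f} years")
```

Whatever per-residue count is assumed, the total grows exponentially with chain length, which is the point the slide makes.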
Comparative (Homology) Modeling Evolutionarily related (homologous) proteins usually have similar structures. The similarity of structures is very high in core regions (helices & sheets). However, loops may vary even between homologous structures with a high degree of sequence similarity. Thick backbone - known structure. Thin lines - modeled structure. Some side-chains are not positioned correctly, but some look good.
Modeling Performance Structure similarity predicted from sequence similarity: Sander & Schneider (1991) aligned all the sequences in PDB and developed a formula for structure similarity based on sequence similarity. The sequence identity needed for structure similarity depends on the length of the protein.
Modeling Performance - Examples A protein of 10 amino acids requires 80% identity for a similar structure. A protein of length > 80 requires: –~30% identity for common sub-structures. –~50% identity for a similar structure. –~80% identity for a similar structure at very good resolution.
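The rule-of-thumb thresholds above can be written as a small lookup. This is a minimal sketch, not the Sander & Schneider formula itself: it only covers proteins longer than 80 residues, treats the approximate cut-offs as hard "greater or equal" limits, and computes identity over aligned positions where neither sequence has a gap. All three choices are assumptions made here, as are the function names.

```python
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned, gap-free columns ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def expected_similarity(identity: float, length: int) -> Optional[str]:
    """Map identity to the slide's categories for proteins longer than 80 aa.

    Returns None for shorter proteins, where higher, length-dependent
    thresholds apply that this sketch does not model.
    """
    if length <= 80:
        return None
    if identity >= 80:
        return "similar structure at very good resolution"
    if identity >= 50:
        return "similar structure"
    if identity >= 30:
        return "common sub-structures"
    return "no reliable prediction from identity alone"


if __name__ == "__main__":
    identity = percent_identity("ACDEFGHIKL", "ACDEFGHLKM")
    print(identity, expected_similarity(identity, length=120))
```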
Fold Recognition Approaches Fold - a combination of secondary structural units in the same configuration. Protein structural classification uses fold as a basic level of classification.
Fold Family Relations Estimation 1: There are ,000 protein families, based on homology. Each family contains ~ one fold. Estimation 2: There are protein folds. Conclusions: 1. Many protein families share the same fold. 2. Different sequences are folded similarly. The common fold approach to structure prediction: Use the collection of determined structures to predict the structure of a protein.
How Condensed is a Fold? How many different sequences can result in the same fold for an average domain of 150 amino acids? –There are ~ different sequences –about are less than 20% identical. –Assume that only 1 in a million has a stable fold –Expected number of different folds is –About different sequences fold similarly.
Fold Recognition A fold is shared by family members, both close and distant (distance is related to sequence similarity) –the globin fold For a query protein - if its family members are identified, and their fold is known, we could assign it the same fold. Method 1: Which alignment algorithm detects close and distant relatives? PSI-BLAST
Fold Recognition - Threading Threading allows for identification of structure similarity without sequence similarity. The amino acid (aa) sequence of a query protein is examined for compatibility with the structural core of a known protein. “Given a protein structure, what sequences fold into it?”
Threading The protein core is a very compact environment composed of alpha and beta secondary structures. Very hydrophobic, no place for water molecules, other aa, or aa with chemically different side chains. Side chains have many contacts with neighboring aa for stability. Threading matches the aa of the query with aa of a known structure: –If threading gives a good score, then the core of the query is assumed to fold similarly.
Threading Two main methods: –Contact potential method. –Structural profile (Environmental template). Contact potential method –the number of contact points and proximity between aa is analyzed for every known structure. –The query is checked against all the interactions in the core and their contribution to the stability of the structure. –The fold that results in the most energetically stable structure is chosen.
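The contact potential idea can be sketched with a toy model. Everything concrete below is an assumption for illustration: each known fold is reduced to a list of residue index pairs that are in contact, the query is threaded onto it without gaps, and the pair energies are made-up values that simply reward hydrophobic-hydrophobic contacts. Real contact potentials are derived statistically from known structures.

```python
from typing import Dict, List, Tuple

# Toy hydrophobic alphabet (an assumption, not a standard scale).
HYDROPHOBIC = set("AVLIMFWC")

Contacts = List[Tuple[int, int]]


def pair_energy(a: str, b: str) -> float:
    """Made-up contact energy: lower means more stable."""
    a_hydrophobic = a in HYDROPHOBIC
    b_hydrophobic = b in HYDROPHOBIC
    if a_hydrophobic and b_hydrophobic:
        return -1.0   # favourable buried hydrophobic contact
    if a_hydrophobic != b_hydrophobic:
        return 0.5    # unfavourable mixed contact
    return 0.0        # polar-polar: neutral in this toy model


def threading_energy(sequence: str, contacts: Contacts) -> float:
    """Sum pair energies over all contacts of the fold (gapless threading)."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)


def best_fold(sequence: str, folds: Dict[str, Contacts]) -> str:
    """Pick the fold whose contacts give the lowest total energy."""
    return min(folds, key=lambda name: threading_energy(sequence, folds[name]))


if __name__ == "__main__":
    # Hypothetical contact maps for two folds of length 6.
    folds = {
        "fold_A": [(0, 2), (2, 4), (0, 5)],
        "fold_B": [(1, 3), (0, 1), (3, 4)],
    }
    query = "LKVEIL"
    for name, contacts in folds.items():
        print(name, threading_energy(query, contacts))
    print("chosen:", best_fold(query, folds))
```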
Threading - Structural Profile The environment of every aa in known structures is determined, including –the secondary structure, the area of the side-chain that is buried by closeness to other atoms, types of nearby chains, etc. Each position is classified into one of 18 types –6 representing increasing levels of residue burial and fraction of surface covered by polar atoms –combined with three classes of secondary structures. Each aa is assessed for its ability to fit into that type of site in the structure. –Buried group is matched well with hydrophobic aa.
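The 18 environment types are the product of 6 burial/polarity levels and 3 secondary-structure classes. A minimal indexing sketch follows; the labels "helix", "sheet" and "other" are an assumption about the three classes, and how a residue is assigned to one of the 6 burial levels is left to the caller because the slides do not give the cut-offs.

```python
# Assumed names for the three secondary-structure classes.
SECONDARY_CLASSES = ("helix", "sheet", "other")
N_BURIAL_LEVELS = 6


def environment_class(burial_level: int, secondary: str) -> int:
    """Map (burial level 0-5, secondary-structure class) to an index 0-17."""
    if not 0 <= burial_level < N_BURIAL_LEVELS:
        raise ValueError("burial_level must be in 0..5")
    if secondary not in SECONDARY_CLASSES:
        raise ValueError(f"secondary must be one of {SECONDARY_CLASSES}")
    return burial_level * len(SECONDARY_CLASSES) + SECONDARY_CLASSES.index(secondary)


if __name__ == "__main__":
    all_classes = [environment_class(b, s)
                   for b in range(N_BURIAL_LEVELS)
                   for s in SECONDARY_CLASSES]
    print(len(set(all_classes)), "distinct environment classes")
```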
Structural Profile Profile rows are the residues in the structure, each classified into one of the 18 environment types. Profile columns are the 20 aa + insertion + deletion. –If a residue is inside a loop, many substitutions are allowed, as well as insertions and deletions. The score for a given aa at a residue position estimates how well the aa fits that residue's environment type. How shall we find the best fitting region?
Structural Profile Dynamic programming algorithm finds the best match of a query sequence to a specific fold. –Statistical significance can be computed by doing the above for all sequences in the database. The same analysis will be repeated for each fold. The fold with the best statistically significant score is chosen.
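A minimal sketch of the dynamic programming step, under several simplifying assumptions: the profile is a toy one with only two environments (buried "B" and exposed "E") instead of 18, scores are +1/-1, a single uniform gap penalty replaces position-specific insertion and deletion columns, and a local (Smith-Waterman style) recurrence is used to find the best fitting region. The z-score is one common way to express significance against background scores and is an assumption here, not something stated on the slides.

```python
import statistics
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVLIMFWC")  # toy definition

Profile = List[Dict[str, float]]


def toy_profile(environment: str) -> Profile:
    """One row per structure position: 'B' rewards hydrophobic aa, 'E' rewards polar aa."""
    profile = []
    for env in environment:
        row = {}
        for aa in AMINO_ACIDS:
            fits = (env == "B") == (aa in HYDROPHOBIC)
            row[aa] = 1.0 if fits else -1.0
        profile.append(row)
    return profile


def local_alignment_score(query: str, profile: Profile, gap: float = -2.0) -> float:
    """Best local alignment score of the query against the profile."""
    previous = [0.0] * (len(query) + 1)
    best = 0.0
    for row in profile:
        current = [0.0] * (len(query) + 1)
        for j in range(1, len(query) + 1):
            match = previous[j - 1] + row.get(query[j - 1], -1.0)
            current[j] = max(0.0, match, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def z_score(score: float, background_scores: List[float]) -> float:
    """How many standard deviations the score lies above the background mean."""
    mean = statistics.mean(background_scores)
    spread = statistics.pstdev(background_scores)
    return (score - mean) / spread if spread > 0 else 0.0


if __name__ == "__main__":
    profile = toy_profile("BBEEBB")
    database = ["KKKKKK", "LKLKLK", "DDEEKK"]  # hypothetical background sequences
    background = [local_alignment_score(s, profile) for s in database]
    query_score = local_alignment_score("LVKDIL", profile)
    print(query_score, z_score(query_score, background))
```

Repeating this for every fold in the library and keeping the fold with the most significant score mirrors the procedure described above.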
Threading - Pros and Cons: Pros: Good results. Environmental properties may be more accurate than amino acid similarity matrices. Can lead to effective and fast implementations. Able to discover structural similarities impossible to detect by sequence searching methods. Cons: Requires the existence of already known proteins with similar structure.
CASP - Critical Assessment of Structure Prediction A competition among different groups to predict the 3D structure of proteins that are about to be solved experimentally. Current state - only fragments are “solved”: Ab initio - the worst, but greatly improved in recent years. Comparative modeling - performs very well when homologous sequences with known structures exist. Fold recognition - PSI-BLAST is used for training the threading procedures. Performs well.
A Clickable Structure Prediction Flowchart :
Protein Classification Proteins are classified to reflect both structural and evolutionary relatedness. The principal levels are: 1. Family: Clear evolutionary relationship. In general, > 30% pairwise residue identity between the proteins. 2. Superfamily: Probable common evolutionary origin. Combines families whose member proteins have low sequence identities, but whose structural and functional features suggest a common evolutionary origin. Structurally, superfamily members share a common fold.
SCOP - Structural Classification of Proteins Hierarchical classification of all proteins with known structures. Classification: Class - all alpha, all beta, alpha & beta (a/b), alpha + beta (a + b). Fold - the major structural similarity unit. Superfamily. Family. PDB entry for a protein.
CATH - Class, Architecture, Topology, Homologous Superfamily Another protein structure classification database. Classification: Class - mainly alpha, mainly beta, alpha-beta, few secondary structures. Architecture - gross orientation of secondary structures, independent of connectivity. Topology - clusters structures according to their topological connections and numbers of secondary structures. Homologous superfamilies - clusters proteins with highly similar structures and functions.
PFAM - Protein Families A database that contains a large collection of multiple sequence alignments and profile hidden Markov models (profile HMMs). A profile HMM is a probabilistic model that describes a set of sequences and is widely used to describe related sequences. Defines domains - areas of homology that have a 3D structure independent of the rest of the protein.
Classification of all the proteins in the SWISSPROT and TrEMBL databases into groups of related proteins.