Protein Structure.

What We Are Going to Cover
Secondary structure prediction
  Chou-Fasman rules for individual amino acids
  Nearest neighbor approaches
  Machine learning techniques
  Neural networks
  Physical properties vs. statistics
  Measuring prediction quality
Tertiary structure prediction
  Force fields
  Threading
  Homology modeling

Background
In the 1950s Christian Anfinsen showed that pancreatic RNase could refold itself into its active configuration after denaturation, without any external guidance. This, and many confirming experiments on other proteins, has led to the general belief that the amino acid sequence of a protein contains all the information needed to fold itself properly, without any additional energy input. For most proteins this process is very rapid, on the order of milliseconds. The primary driving forces are the need for hydrophobic side chains to minimize their interactions with water, and the formation of hydrogen bonds.
Chaperone proteins assist in the folding of some proteins, and in the re-folding of mis-folded or aggregated proteins. Misfolding and aggregation are often a consequence of elevated temperatures, and chaperone proteins were first discovered because they were highly expressed during heat shock in Drosophila.
Prions are normal cellular proteins (PrPC) that can fold into an abnormal configuration (PrPSc) that causes other normal PrPC proteins to refold into the bad configuration, causing large aggregates of tightly packed beta sheets to form. If refolding is spontaneous, is the bad form a lower energy configuration than the good form?
Figure: PrPC (left) and PrPSc (right).

Secondary Structure
Linus Pauling defined the two main protein secondary structures in the 1950s: alpha helix and beta sheet. The alpha helix has hydrogen bonds between the backbone N-H group of one residue and the backbone C=O group of the residue 4 positions earlier in the chain (i+4->i). There are other related structures: the 3₁₀ helix (i+3->i) and the π helix (i+5->i). The Dictionary of Protein Secondary Structure (DSSP) defines 8 states: the 4 mentioned above plus several forms of turn, plus “everything else”, called random coil. Real proteins have these structures bent and distorted, not laid out in theoretical perfection. Most prediction programs classify every amino acid in a chain into just 3 states: alpha helix, beta sheet, or random coil. These predictions can then be compared to X-ray crystallography results.
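As a rough illustration of the reduction from 8 states to 3, the sketch below collapses a string of DSSP-style one-letter codes. The grouping used here (H, G, I to helix; E, B to strand; everything else to coil) is one commonly used convention and is an assumption, not something the slides specify. The input string is made up.

```python
# Assumed mapping from DSSP one-letter codes to 3 states (H = helix,
# E = strand, C = coil). Other reductions are in use; this is one convention.
DSSP_TO_3 = {
    "H": "H",  # alpha helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi helix
    "E": "E",  # extended strand in a beta sheet
    "B": "E",  # isolated beta bridge
}


def reduce_to_three_states(dssp_string):
    """Collapse an 8-state string to 3 states; unlisted codes become coil."""
    return "".join(DSSP_TO_3.get(code, "C") for code in dssp_string)


# Made-up 8-state assignment, for illustration only.
print(reduce_to_three_states("-HHHHGGGTT-EEEE-S"))  # CHHHHHHHCCCEEEECC
```

Because different programs and different reductions disagree mostly at segment ends, the choice of mapping can shift accuracy scores slightly.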

Assessing Prediction Accuracy
The first problem is classification: given a solved X-ray structure of a protein, how do you classify its amino acids into alpha-helix, beta-sheet, etc.? Three separate programs, DSSP, STRIDE, and DEFINE, all use different definitions, and thus come up with somewhat different results, mostly at the ends of the structures. The definitions draw on hydrogen-bonding patterns, dihedral angles (the φ and ψ angles of the backbone at each alpha carbon), and interatomic distances compared with ideal structures.
A second issue is the source of protein structures to use for training. You want them to be unrelated to each other and to the sequences you will use for testing: a structural “nr” data set. The Protein Data Bank (PDB) has a set called PDB_SELECT that does this. It holds about 4000 chains (late 2008), all with less than 25% sequence identity to each other and with high-quality X-ray crystallography resolution (< 3.5 Angstroms).
CASP (Critical Assessment of protein Structure Prediction) is a series of contests where the sequences of a group of proteins whose structures have recently been determined but not published are released to anyone who wants to try predicting their structures. After a few months, they have a meeting and score everyone’s results. Currently, the CASP8 contest has just ended.

More on Prediction Accuracy
Q3 is the proportion of amino acids in a test dataset whose predicted state (alpha-helix, beta-strand, or random coil) matches its actual state. Note that you can’t use sequences from the training dataset for this. Given equal distributions of the 3 states, you would expect random guessing to give a Q3 score of 33%. A test using random guesses with an actual dataset gave 38%. Also, given the variations in automatically assigning types to known structures, scores above 85% are unlikely.
Sov is a measure of how individual segments match that tries to avoid variation in end-of-segment predictions. It measures the percentage of times that there is an overlap between observed segments and predicted segments. It works on the individual protein or fold level, as opposed to Q3, which measures performance over a whole database.
Similar Q3 scores are not always meaningful. In the example below, both predictions match the observed state at 7 of 12 positions, so they have the same Q3, but PRED2 captures the helix as a single segment while PRED1 breaks it into fragments.
OBS:   CHHHHHHHHHHC
PRED1: CHCHCHCHCHCC
PRED2: CCCHHHHHCCCC
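The Q3 definition can be sketched in a few lines, applied here to the OBS, PRED1 and PRED2 strings from the slide. The `segments` helper is not the Sov measure itself; it only lists the predicted helix runs, which is enough to show why two predictions with the same Q3 can differ in quality. Function names are illustrative.

```python
def q3(observed, predicted):
    """Fraction of positions whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


def segments(states, target="H"):
    """Return (start, end) index pairs for each run of `target` in `states`."""
    runs, start = [], None
    for i, s in enumerate(states):
        if s == target and start is None:
            start = i
        elif s != target and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(states) - 1))
    return runs


OBS = "CHHHHHHHHHHC"
PRED1 = "CHCHCHCHCHCC"
PRED2 = "CCCHHHHHCCCC"

for name, pred in (("PRED1", PRED1), ("PRED2", PRED2)):
    # Both predictions score 7/12, but PRED1 has five one-residue helices
    # while PRED2 has a single five-residue helix.
    print(name, round(q3(OBS, pred), 3), segments(pred))
```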

Chou-Fasman Rules
The Chou-Fasman rules give the propensity of each amino acid to form or break alpha helices and beta strands. They were originally developed in the 1970s from a very small set of proteins (15!). Originally they were just a qualitative measure: “helix forming”, “indifferent”, “helix-breaking”, etc. The method has since been made quantitative and extended to 14 structures, which involved some fairly large changes in parameter values. This method is just for individual amino acids in isolation, ignoring the neighborhood.
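A quantitative propensity of this kind is usually computed as the fraction of a given amino acid found in a state, divided by the fraction of all residues found in that state, so values above 1 indicate a "former" and values below 1 a "breaker". The sketch below assumes that definition; the two sequences and their structure strings are toy data invented for illustration, not real proteins.

```python
from collections import Counter


def propensities(sequences, structures, state="H"):
    """Propensity of each amino acid for `state`:
    (fraction of that amino acid found in `state`) /
    (fraction of all residues found in `state`)."""
    total = Counter()
    in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never occurs in the data")
    background = n_state / n_all
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}


# Toy data, invented purely for illustration (not real structures).
toy_seqs = ["AAGP", "AEGP"]
toy_ss = ["HHCC", "HHCC"]
helix_propensity = propensities(toy_seqs, toy_ss)
print(helix_propensity)
```

With only a handful of proteins, as in the original study, these ratios are very noisy, which is one reason the parameter values shifted when more structures became available.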

Extensions to Chou-Fasman
The GOR method uses a sliding window of 17 amino acids (8 before and 8 after the residue being predicted) with rules based on known sequences to predict each amino acid’s state. The prediction is based on self-information (each individual residue’s propensity for each type of secondary structure: approximately what the Chou-Fasman rules are), directional-information (how each other amino acid in the window affects the current residue regardless of what the current residue is), and pair-information (how each other amino acid in the window affects the current residue considering what type of amino acid it is). Several improved versions consider pairs and triplets of amino acids, and increase the number of sequences analyzed to produce the statistics.
Figure: Directional-information. Effects of the alpha-helix breaker proline (top) and the non-breaker methionine (bottom) 5 residues downstream from residue j, whose type is not specified.
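One common way of writing the self- and directional-information terms is as log-odds ratios summed over the window. The notation below is an assumption for illustration: S_j is the state of residue j, R_{j+m} is the amino acid at offset m, and pair-information terms are left out.

```latex
% Information that the residue at offset m carries about the state at position j
I(S_j; R_{j+m}) = \log \frac{P(S_j \mid R_{j+m})}{P(S_j)}

% Approximate total over the 17-residue window (m = 0 is the self-information term)
I(S_j; R_{j-8}, \dots, R_{j+8}) \approx \sum_{m=-8}^{8} I(S_j; R_{j+m})
```

The state with the largest summed information is then taken as the prediction for residue j.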

Use of Multiple Alignments
It seems obvious that homologous proteins will have slightly different sequences with approximately the same secondary structure. This allows a better understanding of which residues are important for the different structures. The predictions for each homologous region can be averaged. Alpha helices and beta strands are less likely to tolerate insertions and deletions than random coil.
The Zpred program is a modification of the GOR program that uses multiple alignment information at each position in the aligned sequences to improve prediction. Start by predicting each sequence separately as in GOR, then average the predictions. Zpred also encodes the amino acid properties Venn diagram we have seen before. The Zvelebil conservation number is 1.0 if all homologous residues at a given position are identical, 0.9 if they are not identical but all within the same set in the Venn diagram, and lesser values for amino acids in different groups. A value of 0 is given if any sequence has a gap.
A modification of this, the nearest neighbor approach, finds the best set of matching sequences (homologous or not) for just the region of the sliding window. The idea is that similar sequences will share the same fold even if the rest of the protein is different.
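The "predict each sequence separately, then average" step can be sketched as below. This is a minimal illustration, not the Zpred implementation: it assumes each sequence already has a (helix, strand, coil) score triple per alignment column, uses `None` where a sequence has a gap, and simply skips gaps when averaging. The scores are made up.

```python
STATES = "HEC"  # helix, strand, coil


def average_predictions(per_sequence):
    """Average the (helix, strand, coil) scores column by column.
    `per_sequence` holds one list per aligned sequence; a gap is None."""
    n_columns = len(per_sequence[0])
    averaged = []
    for col in range(n_columns):
        present = [p[col] for p in per_sequence if p[col] is not None]
        if not present:
            averaged.append(None)  # every sequence has a gap here
            continue
        averaged.append(
            tuple(sum(v[k] for v in present) / len(present) for k in range(3))
        )
    return averaged


def call_states(averaged):
    """Pick the highest-scoring state per column ('-' for all-gap columns)."""
    return "".join(
        "-" if v is None else STATES[v.index(max(v))] for v in averaged
    )


# Made-up per-sequence scores for a two-column alignment.
seq1 = [(0.8, 0.1, 0.1), (0.4, 0.5, 0.1)]
seq2 = [(0.6, 0.2, 0.2), None]
consensus = call_states(average_predictions([seq1, seq2]))
print(consensus)  # HE
```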

Neural Networks
Neural networks are a common bioinformatics tool for machine learning and optimization. They are based on neurobiology observations, primarily of the retina of the eye, and are often used in structure prediction, among other applications. Machine learning means automatic procedures to fit model parameters to a training dataset.
A network consists of layers of nodes, with each node connected to all the nodes in the next layer. The top layer is the input, a model of the sequence being analyzed. The bottom layer is the output: generally 3 nodes, one for alpha-helix, one for beta-strand, and one for random coil. A model with just an input and an output layer is called a perceptron. Usually there are one or more (and sometimes lots more, like 80) hidden layers that allow interactions between the inputs. Some networks allow feedback between layers.

More Neural Network
Inputs. The most common way to code neural network input is to have one node for each type of amino acid (and often an additional one for a gap), multiplied by a node for each position in the sliding window. Thus, for a 13 residue window, the net would have 21 x 13 = 273 input nodes. There are also a few extra inputs encoding things like sequence length and distance from the N- and C-termini. Each input node “fires” either a 0 or a 1 output.
Signal processing. Each node sends out the same signal to all the nodes in the next layer. The receiving node weights all of its inputs differently. The receiving node also can add in a bias factor to affect how it will respond. The node then adds the weighted input signals plus the bias and runs them through a response function (or transfer function) to determine its output. Usually all nodes (at least at a given layer) have the same transfer function. The output can be 0/1 or some numerical value.
Figure: Weights applied to the possible outputs from the input layer at a hidden node. Black is positive, red is negative for alpha helix.
Figure: Some possible transfer functions.
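The input coding and the per-node computation can be sketched as follows. This is a minimal illustration under stated assumptions: window positions that fall off the end of the sequence are coded as gaps, the sigmoid is just one possible transfer function, and the sequence string is arbitrary. The extra inputs for sequence length and terminus distance are omitted.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap symbol
WINDOW = 13


def encode_window(sequence, center):
    """One-hot encode the 13-residue window centred on `center`.
    Positions past either end of the sequence are encoded as gaps
    (an assumption made for this sketch)."""
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "-"
        inputs.extend(1 if symbol == a else 0 for a in ALPHABET)
    return inputs


def sigmoid(x):
    """One possible transfer function, squashing any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, run through the transfer function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


# Arbitrary example sequence, used only to exercise the encoding.
x = encode_window("MKTAYIAKQRQISFVKSHFSRQ", center=10)
print(len(x), sum(x))  # 273 input nodes, exactly 13 of them firing a 1
print(node_output(x, [0.0] * len(x), 0.0))  # all-zero weights give sigmoid(0) = 0.5
```

Each window position contributes exactly one 1 among its 21 nodes, which is why the sum of the inputs equals the window length.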

Still More Neural Network
Parameterization of a neural network is a big issue. Essentially it comes down to assigning weights for the inputs to each node and assigning bias factors. This assumes a constant transfer function. Most of the weighting comes from the input node array. It’s a matter of using a training set where you know the input sequence and the secondary structure at each residue. You start with randomly assigned parameters and adjust them using an optimization procedure until the outputs match the known results or until no further improvement happens. It is quite common to feed the output of one neural network into another network. This allows predictions for one residue to influence the prediction for neighboring residues. Best done with numerical outputs as opposed to 0/1. For example, the predictions for alpha helix, beta strand and random coil are better reported as (0.76, 0.55, 0.37) instead of (1, 0, 0). It is also useful for allowing homologous sequences to influence each other: first run the predictions for each sequence separately, then use the second neural net to combine them. The first net is a sequence-to-structure net, and the second one is a structure-to-structure net.
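The "start from random parameters and adjust until no further improvement" idea can be sketched for a single sigmoid node. The slides do not say which optimization procedure is meant; plain gradient descent on squared error is assumed here, and the training set is a toy invented for illustration. Real networks apply the same kind of update through their hidden layers.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_node(samples, epochs=200, rate=0.5, seed=0):
    """Fit the weights and bias of one sigmoid node by gradient descent.
    `samples` is a list of (inputs, target) pairs with targets of 0 or 1."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n)]  # random starting point
    bias = 0.0
    errors = []  # total squared error per pass over the training set
    for _ in range(epochs):
        squared_error = 0.0
        for inputs, target in samples:
            out = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            err = out - target
            squared_error += err * err
            grad = err * out * (1.0 - out)  # gradient of err**2 / 2 w.r.t. the sum
            weights = [w - rate * grad * x for w, x in zip(weights, inputs)]
            bias -= rate * grad
        errors.append(squared_error)
    return weights, bias, errors


# Toy training set, invented for illustration: the target equals the first input.
toy = [([1, 0], 1), ([1, 1], 1), ([0, 1], 0), ([0, 0], 0)]
weights, bias, errors = train_node(toy)
print(errors[0], errors[-1])
```

Training stops here after a fixed number of passes; a practical setup would instead watch the error on held-out sequences and stop when it no longer improves.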

Tertiary Structure
Are we going to have time to say anything meaningful about this subject? It is the cutting edge, the hardest unsolved bioinformatics problem. It is the basis of rational drug design.
Ab initio (from first principles) structure prediction: just using the sequence and known physical and chemical properties, predict the protein’s final 3-D structure. This is very difficult and not wildly successful. More commonly used are homology modeling and threading. In homology modeling, an unknown protein’s structure is modeled in comparison to the known structure of a homologous protein. For threading, no homolog is used, but instead the protein’s sequence is matched to a library of known folds.
Figure: Plots of potential energy (y-axis) and root-mean-square deviation of atomic coordinates from their actual position for a large number of ab initio structure prediction runs. The red arrow points to the best prediction. The left one, E. coli RecA protein, worked well, but the right one, human Fyn tyrosine kinase, did not.
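The root-mean-square deviation used to compare a predicted structure with the actual one can be sketched as below. The function assumes the two structures have already been superimposed and list the same atoms in the same order; finding the best superposition is a separate step not shown. The coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates, assuming the structures are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates for illustration: every atom is displaced by 1 unit in z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
actual = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, actual))  # 1.0
```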

Force Fields
Based on the Anfinsen RNase experiments and many others like them, it is thought that proteins fold into the lowest free energy state. The problem is, the energy landscape is like a mountain range: there are vast numbers of possible folding states with many local minima separated by high peaks.
“Free energy” has both an entropy term and an enthalpy term. Enthalpy is the energy stored in covalent bonds and non-covalent atomic interactions. Entropy is hard to calculate, so all this work is done using enthalpy, or potential energy.
Various forces affect protein folding: covalent bonds, hydrogen bonds, polar and ionic interactions, van der Waals forces, solvent interaction. These forces can be described by a set of equations, collectively known as the “force field”. Solvent interactions are troublesome because they involve large numbers of independently-moving water molecules. Sometimes these are dealt with by using a statistical distribution rather than trying to model individual solvent molecules.
Covalent bonds are most easily described using internal coordinates: bond lengths and angles. The potential energy of the bonds is easily described in these terms. On the other hand, non-covalent forces are more easily described using external coordinates: (x,y,z) numbers. This is because the forces involved vary with the distance separating the molecules. Thus the final force field needs to combine the two coordinate systems.
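To show how the two coordinate systems combine, the sketch below adds a bonded term written in internal coordinates (bond length) to a non-bonded term computed from (x, y, z) positions. The functional forms, a harmonic bond stretch and a 12-6 Lennard-Jones potential, are common in molecular mechanics but are an assumption here; the slides do not name specific equations. All parameter values are placeholders, not taken from any real force field.

```python
import math


def bond_energy(length, ideal_length, k):
    """Harmonic bond-stretch term in internal coordinates."""
    return 0.5 * k * (length - ideal_length) ** 2


def lennard_jones(distance, epsilon, sigma):
    """Van der Waals term written as a 12-6 Lennard-Jones potential."""
    ratio6 = (sigma / distance) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def total_energy(coords, bonds, nonbonded):
    """Sum bonded and non-bonded terms for atoms at (x, y, z) `coords`.
    `bonds` holds (i, j, ideal_length, k); `nonbonded` holds (i, j, epsilon, sigma)."""
    energy = 0.0
    for i, j, ideal_length, k in bonds:
        # Internal coordinate (bond length) derived from the Cartesian positions.
        energy += bond_energy(math.dist(coords[i], coords[j]), ideal_length, k)
    for i, j, epsilon, sigma in nonbonded:
        energy += lennard_jones(math.dist(coords[i], coords[j]), epsilon, sigma)
    return energy


# Placeholder geometry and parameters, invented for illustration.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 4.0, 0.0)]
bonds = [(0, 1, 1.5, 300.0)]      # bond exactly at its ideal length
nonbonded = [(1, 2, 0.1, 4.0)]    # pair separated by exactly sigma
print(total_energy(coords, bonds, nonbonded))  # 0.0 for this geometry
```

Angle, torsion, electrostatic, and solvent terms would be added to the same sum in a fuller force field.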

Threading
When there is no homolog that has a known structure, the protein sequence can be compared to a library of protein folds. This is based on the theory that there are a limited number of possible folds, perhaps around [...]. The 35,000 protein structures in PDB (when the book was written) can be described in terms of folds.
One type of threading method uses position-specific scoring matrices (from PSI-BLAST) to describe the folds. The matrices are asymmetric, since you are comparing a known to an unknown, rather than comparing two unknowns. For example, the penalty for substituting Lys for Arg is different from that for an Arg to Lys substitution. This can also be done by using a set of environment-specific scoring matrices for the various positions, similar to BLOSUM matrices.
The other common threading method is to explicitly calculate the potential energy of the sequence when forced into the fold. Dynamic programming methods make it possible to ignore wildly improbable configurations.
Figure: These structures are the SH3 domain of dihydrofolate reductase and a kinase. They have only 14.5% identical amino acids.
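Scoring a sequence against a library of position-specific matrices can be sketched as below. This is a deliberately simplified illustration: the sequence is laid onto each fold position by position with no gaps, whereas real threading programs align with gaps using dynamic programming. The fold names and score values are hypothetical.

```python
def pssm_score(sequence, pssm):
    """Sum the position-specific scores for placing `sequence` on a fold,
    one residue per position, with no gaps."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence and profile must have the same length")
    return sum(column[residue] for residue, column in zip(sequence, pssm))


def best_fold(sequence, library):
    """Return the (name, score) of the highest-scoring fold profile."""
    scores = {name: pssm_score(sequence, pssm) for name, pssm in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical two-position profiles over a tiny alphabet, invented for
# illustration. Each position has its own scores, so a given residue is
# rewarded at one position and penalized at another.
library = {
    "fold_A": [{"K": 3, "R": 1, "L": -2}, {"K": -1, "R": -1, "L": 4}],
    "fold_B": [{"K": -2, "R": -2, "L": 3}, {"K": 2, "R": 3, "L": -1}],
}
print(best_fold("KL", library))  # ('fold_A', 7)
```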

Homology Modeling
When your sequence has a homologue with a known structure, it is possible to build a model using the known structure as a template. Originally done with wires and plastic balls! The more similar your protein is to the known structure, the better it works.
Figure: Sperm whale myoglobin model.
Figure: Dependence of modeling accuracy on sequence identity. Red ones are ab initio predictions.
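Since model quality depends on sequence identity to the template, it helps to be explicit about how identity is counted. The sketch below takes two already-aligned sequences and counts identical residues over the columns where neither has a gap. That denominator is an assumption; other conventions (for example dividing by the shorter sequence length) give different numbers. The sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)


# Made-up aligned sequences: 4 gap-free columns, 3 of them identical.
print(percent_identity("ACDEF", "AC-EY"))  # 75.0
```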

More Homology Modeling
Start by finding homologues with structures in PDB, using a BLAST search. Then do a careful hand-refined alignment. Then fit the highly conserved regions. Model insertions as loops. Non-identical amino acids are predicted with rotamer libraries: side chains can generally only take on a few configurations, based on “exhaustive” searches of known structures. After the preliminary model is built, refinements can be made to minimize energy and fit things together as well as possible.
Figure: Rotamer libraries for tyrosine (left) and phenylalanine (right). Each has 2 main positions with variants.
