Computational Molecular Biology

Protein Structure: Introduction and Prediction

Protein Folding
One of the most important problems in molecular biology
Given the one-dimensional amino-acid sequence that specifies the protein, what is the protein's fold in three dimensions?
My T. Thai

Overview
Understand protein structures
  Primary, secondary, tertiary
Why study protein folding:
  Structure can reveal functional information that cannot be found from the sequence
  Misfolded proteins can cause diseases, e.g. mad cow disease
  Structures are used in drug design

Overview of Protein Structure
Proteins make up about 50% of the dry mass of the average human
Play a vital role in keeping our bodies functioning properly
Biopolymers made up of amino acids
The order of the amino acids in a protein and the properties of their side chains determine the three-dimensional structure and function of the protein

Amino Acid C R C H N O OH Building blocks of proteins Consist of:
An amino group (-NH2) Carboxyl group (-COOH) Hydrogen (-H) A side chain group (-R) attached to the central α-carbon There are 20 amino acids Primary protein structure is a sequence of a chain of amino acids Side chain C R C H N O OH Amino group Carboxyl group My T. Thai

Side chains (Amino Acids)
The 20 amino acids have side chains that vary in structure, size, hydrogen bonding ability, and charge.
R gives the amino acid its identity
R can be as simple as a hydrogen (glycine) or more complex, such as an aromatic ring (tryptophan)

Chemical Structure of Amino Acids

How Amino Acids Become Proteins
Peptide bonds

Polypeptide
More than fifty amino acids in a chain are called a polypeptide.
A protein is usually composed of 50 to 400+ amino acids.
We call the units of a protein amino acid residues.
[Figure: peptide backbone with the amide nitrogen and carbonyl carbon labeled]

Side chain properties
Carbon does not make hydrogen bonds with water easily: hydrophobic
  These "water fearing" side chains tend to sequester themselves in the interior of the protein
O and N are generally more likely than C to hydrogen-bond to water: hydrophilic
  Tend to turn outward to the exterior of the protein

Primary Structure
Primary structure: linear string of amino acids
[Figure: side chains attached to the backbone] ... ALA PHE LEU ILE LEU ARG ...
Each amino acid within a protein is referred to as a residue
Each different protein has a unique sequence of amino acid residues; this is its primary structure

Secondary Structure
Refers to the spatial arrangement of contiguous amino acid residues
Regularly repeating local structures stabilized by hydrogen bonds
  A hydrogen bond involves a hydrogen atom attached to a relatively electronegative atom
Examples of secondary structure are the α-helix and the β-pleated sheet

Alpha-Helix
Amino acids adopt the form of a right-handed spiral
The polypeptide backbone forms the inner part of the spiral
The side chains project outward
Every backbone N-H group donates a hydrogen bond to the backbone C=O group of the residue four positions earlier

Beta-Pleated-Sheet
Consists of long polypeptide chains called beta-strands, aligned adjacent to each other in parallel or anti-parallel orientation
Hydrogen bonding between the strands keeps them together, forming the sheet
Hydrogen bonding occurs between backbone amino (N-H) and carbonyl (C=O) groups of different strands

Parallel Beta Sheets

Anti-Parallel Beta Sheets

Mixed Beta Sheets

Tertiary Structure
The full three-dimensional structure, describing the overall shape of a protein
Also known as its fold

Quaternary Structure
Some proteins are made up of multiple polypeptide chains, each called a subunit
The spatial arrangement of these subunits is referred to as the quaternary structure
Sometimes distinct proteins must combine in order to form the correct three-dimensional structure for a particular protein to function properly.
Example: the protein hemoglobin, which carries oxygen in blood. Hemoglobin is made of four similar proteins that combine to form its quaternary structure.

Other Units of Structure
Motifs (super-secondary structure):
  Frequently occurring combinations of secondary structure units
  A pattern of alpha-helices and beta-strands
Domains:
  A protein chain often consists of different regions, or domains
  Domains within a protein often perform different functions
  Can have completely different structures and folds
  Typically 100 to 400 residues long

What Determines Structure
What causes a protein to fold in a particular way?
At a fundamental level, chemical interactions between all the amino acids in the sequence contribute to a protein's final conformation
There are four fundamental chemical forces:
  Hydrogen bonds
  Hydrophobic effect
  Van der Waals forces
  Electrostatic forces

Hydrogen Bonds
Occur when a pair of nucleophilic atoms, such as oxygen and nitrogen, share a hydrogen between them
The pattern of hydrogen bonding is essential in stabilizing basic secondary structures

Van der Waals Forces
Interactions between immediately adjacent atoms
Result from the attraction between an atom's nucleus and its neighbor's electrons

Electrostatic Forces
Oppositely charged side chains can form salt bridges, which pull chains together

Experimental Determination
Centralized database for depositing protein structures: the Protein Data Bank (PDB), accessible online
Two main techniques are used to determine/verify the structure of a given protein:
  X-ray crystallography
  Nuclear Magnetic Resonance (NMR)
Both are slow, labor intensive, and expensive (a single structure can sometimes take longer than a year!)

X-ray Crystallography
A technique that can reveal the precise three-dimensional positions of most of the atoms in a protein molecule
The protein is first isolated to yield a high-concentration solution of the protein
This solution is then used to grow crystals
The resulting crystal is then exposed to an X-ray beam

Disadvantages
Not all proteins can be crystallized
The crystalline structure of a protein may be different from its native structure
Multiple maps may be needed to get a consensus

NMR
The spinning of certain atomic nuclei generates a magnetic moment
NMR measures the energy levels of such magnetic nuclei (radio frequency)
These levels are sensitive to the environment of the atom:
  What they are bonded to, which atoms they are close to spatially, what the distances between different atoms are, ...
Thus, by careful measurement, the structure of the protein can be constructed

Disadvantages
Constraint on the size of the protein: an upper bound is about 200 residues
Protein structure is very sensitive to pH.

Computational Methods
Because the experimental methods are long and painful, we need computational approaches to predict the structure of a protein from its sequence.

Functional Region Prediction

Protein Secondary Structure

Tertiary Structure Prediction

More Details on X-ray Crystallography

Overview

Crystal
A crystal can be defined as an arrangement of building blocks which is periodic in three dimensions

Crystallize a Protein
Have to find the right combination of all the different influences to get the protein to crystallize
This can take a couple hundred or even a thousand experiments
Most popular way to conduct these experiments: the hanging-drop method

Hanging drop method
The reservoir (typically 1 ml) contains a precipitant concentration twice as high as the protein solution (typically 2-10 µl)
The protein solution is made up of 50% stock protein solution and 50% reservoir solution
Over time, water diffuses from the protein drop into the reservoir
Both the protein concentration and the precipitant concentration increase
If the conditions are chosen well, crystals appear after days, weeks, or months

The hanging-drop setup is popular because the crystals are usually heavier than the solution. Once crystals have formed, they sink to the bottom of the drop and do not stick to the surface of the cover slip, so they can be harvested easily.

Properties of protein crystal
Very soft
Mechanically fragile
Large solvent areas (30-70%)

A Schematic Diffraction Experiment
For a diffraction experiment we basically need an X-ray source, our crystal, and a suitable detector.

X-ray sources are either so-called sealed tubes, rotating anodes, or synchrotrons. Tubes and anodes can be operated in a home laboratory. They consist of a metal anode onto which electrons are shot; upon impact of the electrons, X-rays are produced. Synchrotrons are large rings in which charged particles are kept on a circular path. Whenever the particles are forced to change direction, X-rays are produced and emitted tangentially to the path. One of these machines is the DESY synchrotron in Hamburg, Germany, but there are many others around the world.

The crystal has to be put into the X-ray beam, and for this it has to be harvested from the crystallization drop. This can be done by placing the crystal into a small glass capillary or by picking it up with a small loop and shock-cooling it to 100 K in a gaseous nitrogen stream. This is necessary to prevent the crystal from drying out during the experiment and to limit radiation damage. Remember that X-radiation is ionizing and can damage biological material. The crystal has to be rotated throughout the experiment, because it is a three-dimensional object that gives rise to a three-dimensional diffraction pattern, and we want to record the complete pattern.

The detector can be anything that is sensitive to X-rays. In the simplest case this is photographic film, but nowadays electronic detectors or so-called imaging plates are used most often.

The two most important questions we should ask here are: Why do we need X-rays? and What do we need crystals for? X-rays are electromagnetic waves with a wavelength close to the distance between atoms in the protein molecules. If we want to get information about where the atoms are, we need to be able to resolve them, and therefore we need radiation that can do that.

We need crystals for three reasons. Firstly, a single molecule could never be oriented and handled properly for a diffraction experiment. Secondly, in a crystal we have approximately 10^15 molecules in the same orientation, so we get a tremendous amplification of the diffraction. Last but not least, crystals produce much simpler diffraction patterns than single molecules. We will see an example of that on the next slide.

Why do we need Crystals
A single molecule could never be oriented and handled properly for a diffraction experiment
In a crystal, we have about 10^15 molecules in the same orientation, so we get a tremendous amplification of the diffraction
Crystals produce much simpler diffraction patterns than single molecules

Why do we need X-rays
X-rays are electromagnetic waves with a wavelength close to the distance between atoms in the protein molecules
To get information about where the atoms are, we need to resolve them, so we need radiation with a wavelength on that scale

A Diffraction Pattern
My T. Thai, mythai@cise.ufl.edu
Here you can see a diffraction image of a thermolysin crystal taken at the synchrotron ELETTRA in Trieste, Italy. The whole image is shown on the left, and two close-ups on the right. One image constitutes just a small slice of the complete diffraction pattern, which often consists of dozens if not hundreds of such images. These diffraction images are not only beautiful, they are also very simple. We get something at certain spots (these are the X-ray reflections) and we get nothing in between. This is the result of the periodic arrangement of our molecules in the crystal. The images have to be processed, which means that the intensity of every single reflection has to be measured. In the bottom right panel you can see that the whole image consists of many single measuring points, which are also called pixels. For every pixel we can read the number of X-ray photons that have arrived at the detector. The intensity of one reflection can then be determined by adding up all photons that belong to the reflection.

Resolution
The primary measure of crystal order and of the quality of the model
Ranges of resolution:
  Low resolution (> 3-5 Å): side chains are difficult to see; only the overall structural fold is visible
  Medium resolution (2.5-3 Å)
  High resolution (2.0 Å)

Some Crystallographic Terms
h,k,l: Miller indices (like a name of the reflection)
I(h,k,l): intensity
2θ: angle between the incident X-ray beam and the reflected beam

Diffraction by a Molecule in a Crystal
Now we are slowly getting to the heart of the method. When X-rays hit our sample, the electric vector of the X-ray wave forces the electrons in our sample to oscillate with the same wavelength as the incoming wave. Oscillating charges in turn become emitters of electromagnetic radiation: they emit X-rays of the same wavelength in all directions in space. This is called elastic scattering or Thomson scattering. The nuclei also oscillate, but they are so much heavier than the electrons that their contribution to the scattering is negligible.

In some directions in space the emitted waves of ALL the electrons (atoms) in our sample add up to the described X-ray reflections, and in other directions they cancel each other out. This means (and this is very important) that each electron (atom) in our sample contributes to every single reflection. In order to understand how a set of electrons is related to its diffraction pattern, we first have to think about how to describe waves and how to add them up.

Description of Waves

Structure Factor Equation
f_j: proportional to the number of electrons atom j has
One of the two fundamental equations in X-ray crystallography

The magnitude of the structure factor describes the amplitude of the X-ray reflection, and its phase describes the phase of the reflection. With this equation at hand, we can calculate from any given crystal structure what the corresponding diffraction pattern will look like. The small f_j in the equation describes the contribution of atom j to the total scattering. It is approximately proportional to the number of electrons this atom has.
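For reference, a standard textbook form of the structure factor equation is sketched below. The symbols x_j, y_j, z_j (fractional coordinates of atom j in the unit cell) and N (number of atoms in the unit cell) are notation assumed here rather than taken from the slide, and temperature factors are left out for simplicity.

```latex
F(hkl) = \sum_{j=1}^{N} f_j \, \exp\bigl[\, 2\pi i \,( h x_j + k y_j + l z_j ) \,\bigr]
```

The modulus |F(hkl)| is the amplitude of reflection (hkl), and the argument of the complex number F(hkl) is its phase.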

The Phase
From the measurement, we can only obtain the intensity I(hkl) of any given reflection (hkl)
The phase α(hkl) cannot be measured
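The loss of the phase can be illustrated with a small sketch. It assumes the usual textbook form of the structure factor (a sum of f_j * exp(2*pi*i*(h*x_j + k*y_j + l*z_j)) over atoms with fractional coordinates) and takes the intensity as |F|^2 up to a scale factor; the atom list and the function names are hypothetical.

```python
import cmath
import math


def structure_factor(hkl, atoms):
    """Sum the scattered waves of all atoms for one reflection (h, k, l).

    `atoms` is a list of (f_j, (x, y, z)) pairs with fractional coordinates.
    """
    h, k, l = hkl
    total = 0j
    for f_j, (x, y, z) in atoms:
        total += f_j * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
    return total


def intensity(F):
    """Measured quantity: proportional to |F|^2, so the phase is discarded."""
    return abs(F) ** 2


# Two hypothetical atoms with 6 and 8 electrons.
atoms = [(6.0, (0.0, 0.0, 0.0)), (8.0, (0.5, 0.0, 0.0))]
F = structure_factor((1, 0, 0), atoms)
print(abs(F), cmath.phase(F), intensity(F))
```

Any structure factor with the same modulus but a different phase would give the same intensity, which is why the phases have to be recovered by other means.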

How to Determine the Phase
Small changes are introduced into the crystal of the protein of interest:
  E.g. soaking the crystal in a solution containing a heavy atom compound
A second diffraction data set needs to be collected
The two data sets are compared to determine the phases (this also localizes the heavy atoms)

Traditionally, the most important method to determine the phases of the X-ray reflections is the isomorphous replacement method. Small changes are introduced into the crystal of the protein of interest, for instance by soaking the crystal in a solution containing a heavy atom compound or by cocrystallizing the protein with the heavy atom compound (the subscript P stands for the underivatized protein crystal, and the subscript PH for the crystal containing protein plus heavy atom). The assumption (and the prerequisite of the method) is that the heavy atom compound binds to the protein at a few defined places only and in high yield, and that it leaves the overall protein structure unchanged. A second diffraction data set is then collected from the derivatized crystal. The direct comparison of the two data sets, reflection by reflection, leads first to the localization of the heavy atoms and then to the elucidation of the phase of each reflection. Heavy atoms are used because they contain a large number of electrons, so only a few atoms lead to measurable changes in the diffraction pattern.

Other Phase Determination Methods
This is a list of the phase determination methods that are applicable in X-ray crystallography of biological macromolecules.
1. Single Isomorphous Replacement, Single Isomorphous Replacement with Anomalous Scattering, Multiple Isomorphous Replacement, Multiple Isomorphous Replacement with Anomalous Scattering: these are the isomorphous replacement techniques discussed on the previous page.
2. Multiple Wavelength Anomalous Dispersion: here, the wavelength dependence of the scattering of some atoms in the structure is exploited. This method is most often used in combination with selenomethionine-containing proteins. It requires a tunable X-ray source, i.e. a synchrotron.
3. Single Wavelength Anomalous Scattering (sometimes also incorrectly called Single Wavelength Anomalous Dispersion).
4. Molecular Replacement: this method requires a similar structure to be known, which can then be used to calculate starting phases.
5. Direct Methods: this is the method of choice in small-molecule crystallography. In protein crystallography it is less important, because it requires very high resolution data, which are not routinely available from protein crystals.

Electron Density Map
Once we know the complete diffraction pattern (amplitudes and phases), we need to calculate an image of the structure
The electron density equation returns the electron density (a map of where the electrons are and how concentrated they are)

This is the second fundamental equation in X-ray crystallography. It describes how to calculate an image of our structure once we know the complete diffraction pattern, with both structure factor amplitudes and phases. The resulting quantity is called electron density, which is essentially an electron concentration. Remember that the scattering was caused by the electrons of our sample, so what we see is an image of where the electrons are. And where the electrons are, there are the atoms. It is worth noting that each single X-ray reflection contributes to the electron density at each point in the unit cell, just as each electron of our structure contributes to each X-ray reflection.
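For reference, a standard textbook form of the electron density equation is sketched below. V (the unit cell volume) and the fractional coordinates x, y, z are notation assumed here rather than taken from the slide; |F(hkl)| and α(hkl) are the structure factor amplitude and phase of reflection (hkl).

```latex
\rho(x, y, z) = \frac{1}{V} \sum_{h} \sum_{k} \sum_{l}
    \lvert F(hkl) \rvert \, \exp\bigl[\, i\,\alpha(hkl) \,\bigr] \,
    \exp\bigl[ -2\pi i \,( h x + k y + l z ) \,\bigr]
```

The sum runs over all reflections, which is the formal statement that every reflection contributes to the density at every point of the unit cell.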

Interpretation of Electron Density
Now the electron density has to be interpreted in terms of atom identities and positions.
(1): the packing of whole molecules in the crystal is shown
(2): a chain of seven amino acids is shown with the resulting structure superimposed
(3): the electron density of a tryptophan side chain is shown

The electron density is usually displayed on a computer graphics display in the so-called chicken-wire representation. This means that outside the chicken wire the electron density is lower than a certain value, and inside it is higher. The slide shows three different magnifications. In the left image, the packing of whole molecules in the crystal is shown. Note how clearly the protein molecules (black) can be separated from the solvent areas (white). In the central picture, a chain of seven amino acids is shown with the resulting structure superimposed, and in the right image the electron density of a tryptophan side chain is shown. The next step for the crystallographer is to interpret the electron density in terms of atomic positions.

Refinement and the R-Factor
Once a model of the whole molecule has been built into the observed electron density, we can use the structure factor equation to calculate what the diffraction pattern of this model would look like. We can then compare this theoretical diffraction pattern with the observed one and see how different they are. This information is given by the R-factor. From here on we can try to improve the fit between our model and our experimental data by refinement, as well as by changing the model on a graphics computer. At the end of this model building and refinement stage we will hopefully end up with a structure described by an R-factor of less than 20%. If all else is in order, we have finally determined our structure.
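As a minimal sketch of that comparison, the function below uses the conventional definition R = sum(| |F_obs| - |F_calc| |) / sum(|F_obs|) over all reflections. This formula is the usual textbook one and is assumed here, since the slide does not spell it out; the amplitudes in the example are made up.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R-factor from observed and calculated amplitudes.

    Both arguments are sequences of structure factor amplitudes listed in
    the same reflection order.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(o) for o in f_obs)
    if denominator == 0:
        raise ValueError("observed amplitudes sum to zero")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / denominator


# Hypothetical amplitudes for three reflections.
print(r_factor([10.0, 20.0, 30.0], [9.0, 22.0, 27.0]))
```

Multiplied by 100, this value can be compared with the 20% mark mentioned above.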

Nuclear Magnetic Resonance
Concentrated protein solution (very purified)
Magnetic field
The effect of radio frequencies on the resonance of different atoms is measured.

NMR
The behavior of any atom is influenced by neighboring atoms
  More closely spaced residues are more perturbed than distant residues
  Distances can be calculated based on the perturbation

NMR spectrum of a protein

Protein Structure: Secondary Prediction

Primary Structure: Symbolic Definition
A = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}: set of symbols denoting all amino acids
A*: set of all finite sequences formed out of elements of A, called protein sequences
Elements of A* are denoted by x, y, z, ..., i.e. we write x ∈ A*, y ∈ A*, z ∈ A*, etc.
PROTEIN PRIMARY STRUCTURE: any x ∈ A* is also called a protein sequence or protein sub-unit
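The definition can be made concrete with a small membership check. This is a sketch that assumes the standard 20-letter one-letter amino acid alphabet; the names `AMINO_ACIDS` and `is_protein_sequence` are illustrative.

```python
# The alphabet A: one-letter codes of the 20 standard amino acids.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_protein_sequence(x):
    """Return True if the string x is an element of A* (finite string over A)."""
    return all(symbol in AMINO_ACIDS for symbol in x)


print(is_protein_sequence("KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP"))
print(is_protein_sequence("KELXB"))
```

Note that the empty string is also an element of A*, so the check accepts it.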

Protein Secondary Structure (PSS)
Secondary structure: the arrangement of the peptide backbone in space. It is produced by hydrogen bonding between amino acids
PROTEIN SECONDARY STRUCTURE consists of: a protein sequence and its hydrogen bonding patterns, called SS categories

Databases of protein sequences are expanding rapidly
The number of determined protein structures (PSS, protein secondary structures) is still limited compared with the number of known protein sequences
PSSP (Protein Secondary Structure Prediction) research is trying to bridge this gap.

The most commonly observed conformations in secondary structure are:
  Alpha helix
  Beta sheets/strands
  Loops/turns

Turns and Loops
Secondary structure elements are connected by regions of turns and loops
Turns: short regions of non-α, non-β conformation
Loops: larger stretches with no secondary structure.

Three secondary structure states
Prediction methods are normally assessed for 3 states:
  H (helix)
  E (strands)
  L (others: loop or turn)

Secondary Structure
8 different categories:
  H: α-helix
  G: 3_10-helix
  I: π-helix (extremely rare)
  E: β-strand
  B: β-bridge
  T: β-turn
  S: bend
  L: the rest

Three SS states: Reduction methods
Method 1, used by the DSSP program:
  H (helix) = {G (3_10-helix), H (α-helix)}
  E (strands) = {E (β-strand), B (β-bridge)}
  L = all the rest
  In short: E,B => E; G,H => H; rest => L
Method 2, used by the STRIDE program:
  H as in Method 1
  E = {E (β-strand), b (isolated β-bridge)}
  L = all the rest

Three SS states: Reduction methods
Method 3, used by the DEFINE program:
  H (helix) as in Method 1
  E (strands) = {E (β-strand)}
  L = all the rest
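The three reduction methods amount to small lookup tables. The sketch below encodes them as dictionaries and maps every unlisted category to L; the keys "method1" to "method3" and the function name are illustrative labels, not names used by the DSSP, STRIDE, or DEFINE programs.

```python
# 8-state to 3-state lookup tables; any symbol not listed maps to "L".
REDUCTIONS = {
    "method1": {"H": "H", "G": "H", "E": "E", "B": "E"},  # DSSP-style
    "method2": {"H": "H", "G": "H", "E": "E", "b": "E"},  # STRIDE-style
    "method3": {"H": "H", "G": "H", "E": "E"},            # DEFINE-style
}


def reduce_states(ss8, method="method1"):
    """Reduce an 8-state secondary structure string to the states H, E, L."""
    table = REDUCTIONS[method]
    return "".join(table.get(symbol, "L") for symbol in ss8)


print(reduce_states("HGIEBTSL"))             # B is counted as strand
print(reduce_states("HGIEBTSL", "method3"))  # B falls into "the rest"
```

Because the choice of table changes which residues count as strand, accuracies reported under different reduction methods are not directly comparable.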

Example of typical PSS Data
Sequence:    KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP
Observed SS: HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEEEELLLHHH

PSS: Symbolic Definition
Given A = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}, the set of symbols denoting amino acids, and a protein sequence x ∈ A*
Let S = {H, E, L} be the set of symbols of the 3 states: H (helix), E (strands), and L (loop), and let S* be the set of all finite sequences of elements of S.
We denote elements of S* by e, e ∈ S*

PSS: Symbolic Definition
Any one-to-one function f : A* → S*, i.e. f ⊆ A* × S*, is called a protein secondary structure (PSS) identification function
An element (x, e) ∈ f is called a protein secondary structure (of the protein sequence x)
The element e ∈ S* (of (x, e) ∈ f) is called the secondary structure.

PSSP
If a protein sequence shows clear similarity to a protein of known three-dimensional structure, then the most accurate method of predicting the secondary structure is to align the sequences by standard dynamic programming algorithms
Why? Homology modelling is much more accurate than secondary structure prediction for high levels of sequence identity.

PSSP
Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable.
It is important that there is no detectable sequence similarity between the sequences used to train and the sequences used to test secondary structure prediction methods.

Classification and Classifiers
Given a database table DB with a special attribute C, called a class attribute (or decision attribute).
The values c1, c2, ..., cn of the class attribute are called class labels.
Example:
      A1  A2  A3  A4  C
  r1  1   1   m   g   c1
  r2  0   1   v   g   c2
  r3  1   0   m   b   c1

The attribute C partitions the records in the DB: it divides the records into disjoint subsets defined by the values of attribute C, i.e. it CLASSIFIES the records.
This means we use the attribute C and its values to divide the set R of records of DB into n disjoint classes:
  C1 = {r ∈ DB: C = c1}, ..., Cn = {r ∈ DB: C = cn}
Example (from our table):
  C1 = {(1,1,m,g), (1,0,m,b)} = {r1, r3}
  C2 = {(0,1,v,g)} = {r2}
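The partition can be sketched in a few lines. The record layout (attribute values followed by the class label as the last field) and the function name are assumptions made for illustration; the three records are the ones from the example.

```python
from collections import defaultdict


def partition_by_class(records):
    """Group records into disjoint classes keyed by their class label.

    Each record is a tuple whose last element is the value of attribute C.
    """
    classes = defaultdict(list)
    for *attributes, label in records:
        classes[label].append(tuple(attributes))
    return dict(classes)


records = [
    (1, 1, "m", "g", "c1"),  # r1
    (0, 1, "v", "g", "c2"),  # r2
    (1, 0, "m", "b", "c1"),  # r3
]
print(partition_by_class(records))
```

Every record lands in exactly one class, which is what makes the subsets disjoint.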

An algorithm is called a classification algorithm if it uses the data and its classification to build a set of patterns.
Those patterns are structured in such a way that we can use them to classify unknown sets of objects, i.e. unknown records.
For that reason (because of the goal), a classification algorithm is often simply called a classifier.
The name classifier implies more than just a classification algorithm: a classifier is the final product of a data set and a classification algorithm.

Building a classifier consists of two phases: training and testing.
In both phases we use data (a training data set and a test data set disjoint from it) for which the class labels are known for ALL of the records.
We use the training data set to create patterns
We evaluate the created patterns using the test data, whose classification is known.
The measure of a trained classifier's accuracy is called predictive accuracy.
The classifier is built, i.e. we terminate the process, once it has been trained and tested and its predictive accuracy is at an acceptable level.

Classifiers Predictive Accuracy
PREDICTIVE ACCURACY of a classifier is the percentage of correctly classified data in the testing data set.
Predictive accuracy depends heavily on the choice of the test and training data.
There are many methods of choosing test and training sets and hence of evaluating the predictive accuracy. This is a separate field of research.

Accuracy Evaluation
Use the training data to adjust the parameters of the method until it gives the best agreement between its predictions and the known classes
Use the testing data to evaluate how well the method works (without adjusting parameters!)
How do we report the performance?
  Average accuracy = fraction of all test examples that were classified correctly

Accuracy Evaluation
Multiple cross-validation tests have to be performed to exclude a potential dependency of the evaluated accuracy on the particular test set chosen
Jack-knife:
  Use 129 chains for setting up the tool (training set)
  Use 1 chain for estimating the performance (testing)
  This has to be repeated 130 times, until each protein has been used once for testing
  The average over all 130 tests gives an estimate of the prediction accuracy
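The jack-knife procedure can be sketched as a loop that holds out one chain at a time. The `train` and `evaluate` callables are placeholders for whatever prediction method is being assessed; the majority-state "model" below is a deliberately trivial, hypothetical stand-in so the example runs on its own.

```python
from collections import Counter


def jackknife_accuracy(chains, train, evaluate):
    """Leave-one-out estimate: every chain is the test set exactly once."""
    if len(chains) < 2:
        raise ValueError("need at least two chains")
    scores = []
    for i, held_out in enumerate(chains):
        training_set = chains[:i] + chains[i + 1:]
        model = train(training_set)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)


def train_majority(training_set):
    """Toy model: remember the most frequent state in the training chains."""
    counts = Counter(state for _, ss in training_set for state in ss)
    return counts.most_common(1)[0][0]


def fraction_correct(model_state, chain):
    """Fraction of residues whose observed state equals the toy prediction."""
    _, observed = chain
    return sum(1 for s in observed if s == model_state) / len(observed)


# Hypothetical (sequence, observed secondary structure) pairs.
chains = [("AAAA", "HHHH"), ("CCCC", "HHLL"), ("DDDD", "LLLE")]
print(jackknife_accuracy(chains, train_majority, fraction_correct))
```

With 130 chains the same loop trains 130 times on 129 chains each, as described above.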

PSSP Datasets
Historic RS126 dataset. Contains 126 sub-units with known secondary structure, selected by Rost and Sander. It is no longer used today
CB513 dataset. Contains 513 sub-units with known secondary structure, selected by Cuff and Barton. Used quite frequently in PSSP research
HS1771 dataset. Created by Hobohm and Scharf. In March 2002 it contained 1771 sub-units
Many authors have their own "secret" datasets

Measures for PSSP accuracy
Q3: three-state prediction accuracy (percentage of successfully classified residues)
Qi %obs: how many of the observed residues were correctly predicted?
Qi %prd: how many of the predicted residues were correctly predicted?

Measures for PSSP Accuracy
A_{ij} = number of residues predicted to be in structure type j and observed to be in type i
Number of residues predicted to be in structure i: a_i = \sum_j A_{ji}
Number of residues observed to be in structure i: b_i = \sum_j A_{ij}

Measures for SSP Accuracy
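The percentage of residues correctly predicted to be in class i relative to those observed to be in class i: Q_i^{\%obs} = 100 \cdot A_{ii} / b_i, where b_i = \sum_j A_{ij}
The percentage of residues correctly predicted to be in class i out of all residues predicted to be in i: Q_i^{\%prd} = 100 \cdot A_{ii} / a_i, where a_i = \sum_j A_{ji}
Overall 3-state accuracy: Q_3 = 100 \cdot \sum_i A_{ii} / N, where N is the total number of residues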
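These measures can be computed directly from a pair of observed and predicted state strings. The sketch below builds the matrix A (rows observed, columns predicted) and derives Q3 and the per-state percentages as ratios of the diagonal to the row and column sums; the function name and return layout are illustrative choices.

```python
def pssp_accuracy(observed, predicted, states="HEL"):
    """Return (Q3, Q_obs, Q_prd) as percentages for two equal-length strings."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    # A[i][j]: residues observed in state i and predicted in state j.
    A = {i: {j: 0 for j in states} for i in states}
    for obs, prd in zip(observed, predicted):
        A[obs][prd] += 1
    q3 = 100.0 * sum(A[i][i] for i in states) / len(observed)
    q_obs, q_prd = {}, {}
    for i in states:
        observed_i = sum(A[i][j] for j in states)   # row sum
        predicted_i = sum(A[j][i] for j in states)  # column sum
        q_obs[i] = 100.0 * A[i][i] / observed_i if observed_i else None
        q_prd[i] = 100.0 * A[i][i] / predicted_i if predicted_i else None
    return q3, q_obs, q_prd


# Hypothetical observed and predicted strings for eight residues.
print(pssp_accuracy("HHHHEELL", "HHHLEELH"))
```

A state that never occurs (or is never predicted) has no defined percentage, so the sketch reports `None` for it rather than dividing by zero.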

PSSP Algorithms
There are three generations of PSSP algorithms
First Generation: based on statistical information about single amino acids (1960s and 1970s)
Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids (dominated the field until the early 1990s)
Third Generation: based on the use of windows on evolutionary information

PSSP: First Generation
First generation PSSP systems are based on statistical information about single amino acids
The most relevant algorithms:
  Chou-Fasman, 1974
  GOR, 1978
Both algorithms claimed a predictive accuracy of 74-78%, but when tested on better-constructed datasets they were shown to have a predictive accuracy of about 50% (Nishikawa, 1983)

Chou-Fasman method
Uses a table of conformational parameters determined primarily from measurements of known structures (from experimental methods)
The table consists of one "likelihood" for each structure for each amino acid
Based on frequencies of residues in α-helices, β-sheets, and turns
Notation:
  P(H): propensity to form alpha helices
  f(i): probability of being in position 1 (of a turn)

Chou-Fasman Pij-values

Chou-Fasman
A prediction is made for each type of structure for each amino acid
Can result in ambiguity if a region has high propensities for both helix and sheet (the higher value is usually chosen)

Chou-Fasman
How it works:
1. Assign all of the residues the appropriate set of parameters
2. Identify α-helix and β-sheet regions. Extend the regions in both directions.
3. If structures overlap, compare the average values for P(H) and P(E) and assign the secondary structure based on the best scores.
4. Turns are calculated using 2 different probability values.

Assign Pij values
1. Assign all of the residues the appropriate set of parameters

Scan peptide for α-helix regions
2. Identify regions where 4 out of 6 residues have P(H) > 100: the "alpha-helix nucleus"

Extend α-helix nucleus
3. Extend the helix in both directions until a set of four consecutive residues with P(H) < 100 is reached.
Find the sum of P(H) and the sum of P(E) in the extended region
If the region is long enough (>= 5 letters) and sum P(H) > sum P(E), then declare the extended region an alpha helix
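The nucleation and extension steps can be sketched as follows. The code works on a list of per-residue propensities that the caller supplies, so no parameter table is assumed. One interpretation is assumed and should be read as such: extension stops when the average propensity of the next four residues beyond the current edge drops below the cutoff (fewer than four near the sequence ends). Function and parameter names are illustrative.

```python
def find_nuclei(p, window=6, needed=4, cutoff=100):
    """Start indices of windows in which at least `needed` values exceed cutoff."""
    starts = []
    for s in range(len(p) - window + 1):
        if sum(1 for value in p[s:s + window] if value > cutoff) >= needed:
            starts.append(s)
    return starts


def extend_region(p, start, end, cutoff=100, probe=4):
    """Grow the half-open region [start, end) residue by residue.

    Assumption: stop in a direction once the `probe` residues just beyond
    the edge have an average propensity below `cutoff`.
    """
    while start > 0:
        segment = p[max(0, start - probe):start]
        if sum(segment) / len(segment) < cutoff:
            break
        start -= 1
    while end < len(p):
        segment = p[end:end + probe]
        if sum(segment) / len(segment) < cutoff:
            break
        end += 1
    return start, end


def is_helix(p_h, p_e, start, end, min_length=5):
    """Final check: long enough and sum P(H) greater than sum P(E)."""
    return end - start >= min_length and sum(p_h[start:end]) > sum(p_e[start:end])


# Hypothetical propensity values, not taken from the Chou-Fasman table.
p_h = [80, 120, 130, 90, 140, 110, 70, 60, 50, 40]
print(find_nuclei(p_h))
```

The β-sheet scan has the same shape: `find_nuclei(p_e, window=5, needed=3)` looks for 3 out of 5 residues with P(E) > 100.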

Scan peptide for β-sheet regions
4. Identify regions where 3 out of 5 residues have P(E) > 100: the "β-sheet nucleus"
5. Extend the β-sheet until 4 contiguous residues with an average P(E) < 100 are reached
6. If the region average is > 100 and the average P(E) > average P(H), then declare a "β-sheet"

Overlapping
Resolving overlapping alpha helix and beta sheet regions
Compute the sum of P(H) and the sum of P(E) in the overlap.
  If sum P(H) > sum P(E) => alpha helix
  If sum P(E) > sum P(H) => beta sheet
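A minimal sketch of this rule is shown below. The propensity lists are supplied by the caller, the overlap is given as a half-open index range, and a tie is reported as `None` because the rule above does not say how to break it; all names are illustrative.

```python
def resolve_overlap(p_h, p_e, start, end):
    """Assign the overlap [start, end) to helix ("H") or sheet ("E")."""
    sum_h = sum(p_h[start:end])
    sum_e = sum(p_e[start:end])
    if sum_h > sum_e:
        return "H"
    if sum_e > sum_h:
        return "E"
    return None  # tie: left undecided in this sketch


# Hypothetical propensities for a three-residue overlap.
print(resolve_overlap([120, 110, 90], [100, 105, 130], 0, 3))
```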

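Turn Prediction A residue i is predicted as the start of a turn if all of the following hold for the tetrapeptide i, i+1, i+2, i+3: (1) f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075, where the four factors are the bend frequencies for turn positions 1 to 4; (2) Avg(P(t)) > 100 over positions i+k, k = 0, 1, 2, 3; (3) Sum(P(t)) > Sum(P(H)) and Sum(P(t)) > Sum(P(E)) over positions i+k, k = 0, 1, 2, 3.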
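The turn test can be written as a single predicate. The sketch below assumes four positional bend-frequency tables (one per turn position) and uses 0.000075 as the default cutoff on their product, which is the value usually quoted for Chou-Fasman; the function name and all table values are hypothetical.

```python
from math import prod

def is_turn_start(seq, i, f_pos, p_t, p_h, p_e, cutoff=0.000075):
    """Check the three turn conditions for the tetrapeptide starting at index i (0-based)."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False
    # Product of positional bend frequencies f(i) * f(i+1) * f(i+2) * f(i+3).
    bend = prod(f_pos[k][aa] for k, aa in enumerate(tetra))
    sum_t = sum(p_t[aa] for aa in tetra)
    sum_h = sum(p_h[aa] for aa in tetra)
    sum_e = sum(p_e[aa] for aa in tetra)
    return bend > cutoff and sum_t / 4 > 100 and sum_t > sum_h and sum_t > sum_e

# Toy tables (hypothetical values for two residue types only).
F_POS = [{"G": 0.1, "L": 0.01}] * 4
P_T = {"G": 150, "L": 50}
P_H = {"G": 60, "L": 120}
P_E = {"G": 70, "L": 110}
```

For example, `is_turn_start("LLGGGGLL", 2, F_POS, P_T, P_H, P_E)` evaluates the tetrapeptide GGGG.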

PSSP: Second Generation
Based on the information contained in a window of amino acids (11-21 aa). Most systems use algorithms based on: statistical information, physico-chemical properties, sequence patterns, graph theory, multivariate statistics, expert rules, nearest-neighbour algorithms.

PSSP: First & Second Generation
Main problems: Prediction accuracy <70%. SS assignments differ even between crystals of the same protein. SS formation is partially determined by long-range interactions, i.e., by contacts between residues that are not visible to any method based on windows of adjacent residues.

PSSP: First & Second Generation
Main problems: Prediction accuracy for β-strands is 28-48%, only slightly better than random; β-sheet formation is determined by more nonlocal contacts than α-helix formation. Predicted helices and strands are usually too short, a problem overlooked by most developers.

Example of Second Generation
Example for typical secondary structure prediction of the 2nd generation. The protein sequence (SEQ ) given was the SH3 structure. The observed secondary structure (OBS ) was assigned by DSSP (H = helix; E = strand; blank = non-regular structure; the dashes indicate the continuation). The typical prediction of too short segments (TYP ) poses the following problems in practice. (i) Are the residues predicted to be strand in segments 1, 5, and 6 errors, or should the helices be elongated? (ii) Should the 2nd and 3rd strand be joined, or should one of them be ignored, or does the prediction indicate two strands, here? Note: the three-state per-residue accuracy is 60% for the prediction given. My T. Thai

PSSP: Third Generation
PHD: First algorithm in this generation (1994) Evolutionary information improves the prediction accuracy to 72% Use of evolutionary information: 1. Scan a database with known sequences with alignment methods for finding similar sequences 2. Filter the previous list with a threshold to identify the most significant sequences 3. Build amino acid exchange profiles based on the probable homologs (most significant sequences) 4. The profiles are used in the prediction, i.e. in building the classifier My T. Thai

Many of the second generation algorithms have been updated to the third generation My T. Thai

Due to the improvement of protein information in databases, i.e. better evolutionary information, today’s predictive accuracy is ~80%. It is believed that the maximum reachable accuracy is 88%. Why such a conjecture?

Why 88%
SS assignments may vary for two versions of the same structure: proteins are dynamic objects, with some regions being more mobile than others. Assignments differ by 5-15% between different X-ray (NMR) versions of the same protein, and by about 12% between structural homologues. B. Rost, C. Sander, and R. Schneider, Redefining the goals of protein secondary structure predictions, J. Mol. Bio.

PSSP Data Preparation Public Protein Data Sets used in PSSP research contain protein secondary structure sequences. In order to use classification algorithms we must transform secondary structure sequences into classification data tables. Records in the classification data tables are called, in PSSP literature (learning) instances. The mechanism used in this transformation process is called window. A window algorithm has a secondary structure as input and returns a classification table: set of instances for the classification algorithm. My T. Thai

Window Consider a secondary structure (x, e).
where (x,e)= (x1x2 …xn, e1e2…en) Window of the length w chooses a subsequence of length w of x1x2 …xn, and an element ei from e1e2…en, corresponding to a special position in the window, usually the middle Window moves along the sequences x = x1x2 …xn and e= e1e2…en simultaneously, starting at the beginning moving to the right one letter at the time at each step of the process. My T. Thai

Window: Sequence to Structure
Such a window is called a sequence-to-structure window; we will call it a window for short. The process terminates when the window or its middle position reaches the end of the sequence x. The pair (subsequence, element of e) is often written in the form subsequence → H, E or L, and is called an instance, or a rule.

Example: Window Consider a secondary structure (x, e) and the window of length 5 with the special position in the middle (bold letters). First position of the window is: x = A R N S T V V S T A A …. e = H H H H L L L E E E Window returns instance: A R N S T → H

Example: Window Second position of the window is:
x = A R N S T V V S T A A …. e = H H H H L L L E E E Window returns instance: R N S T V → H. Next instances are: N S T V V → L, S T V V S → L, T V V S T → L
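The window mechanism of this example can be sketched in a few lines. The function name is hypothetical, the window is assumed to have odd length with the special position in the middle, and no padding is applied at the sequence ends, so the first and last few residues never become a window centre.

```python
def window_instances(x, e, w=5):
    """Slide a window of odd length w along x and label each window with e at its middle position."""
    n = min(len(x), len(e))
    mid = w // 2
    return [(x[i:i + w], e[i + mid]) for i in range(n - w + 1)]

# The example sequence, truncated to the ten positions for which e is given.
instances = window_instances("ARNSTVVSTA", "HHHHLLLEEE")
for subsequence, label in instances:
    print(" ".join(subsequence), "->", label)
```

Each returned pair is one instance (rule) in the sense used above.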

Symbolic Notation
Let f be a protein secondary structure (PSS) identification function: f : A* → S*, i.e. f ⊆ A* × S*, where A is the set of amino acid letters and S = {H, E, L}. Let x = x1x2…xn, e = e1e2…en, f(x) = e. We define f(x1x2…xn)|{xi} = ei, i.e. f(x)|{xi} = ei.

Example:Semantics of Instances
Let x = A R N S T V V S T A A …. e = H H H H L L L E E E and assume that the window returns the instance A R N S T → H. The semantics of the instance is: f(x)|{N} = H, where f is the identification function, N is preceded by A R and followed by S T, and the window has length 5.

Classification Data Base (Table)
We build the classification table with attributes being the positions p1, p2, p3, p4, p5 .. pw in the window, where w is the length of the window. The corresponding values of the attributes are the elements of the subsequence at the given positions. The classification attribute is S, with values in the set {H, E, L} assigned by the window operation (instance, rule). The classification table for our example (first few records) is the following.

Classification Table (Example)
x = A R N S T V V S T A A …. e = H H H H L L L E E E
p1 p2 p3 p4 p5 S
A  R  N  S  T  H
R  N  S  T  V  H
N  S  T  V  V  L
Semantics of record r = r(p1, p2, p3, p4, p5, S) is: f(x)|{Vp3} = Vs, where Va denotes a value of the attribute a.

Size of classification datasets (tables)
The window mechanism produces very large datasets For example window of size 13 applied to the CB513 dataset of 513 protein subunits produces about 70,000 records (instances) My T. Thai

Window Window has the following parameters:
PARAMETER 1: i ∈ N+, the starting point of the window as it moves along the sequence x = x1 x2 … xn. The value i = 1 means that the window starts at x1, i = 5 means that the window starts at x5. PARAMETER 2: w ∈ N+ denotes the size (length) of the window. For example: the PHD system of Rost and Sander (1994) uses two window sizes: 13 and 17.

Window PARAMETER 3: p ∈ {1, 2, …, w}, where p is a special position of the window that returns the classification attribute values from S = {H, E, L} and w is the size (length) of the window. PSSP PROBLEM: find the optimal size w and the optimal special position p for the best prediction accuracy.

Window: Symbolic Definition
Window Arguments: window parameters and secondary structure (x, e). Window Value: (subsequence of x, element of e). OPERATION (sequence-to-structure window): W is a partial function W: N+ × N+ × {1, …, k} × (A* × S*) → A* × S, with W(i, k, p, (x, e)) = (xi x(i+1) … x(i+k-1), f(x)|{x(i+p-1)}), where (x, e) = (x1x2…xn, e1e2…en). The index i+p-1 makes p = 1 select the first letter of the window, in agreement with the example of window length 5 and special position p = 3.
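The partial function W can be mirrored directly in code, returning None wherever W is undefined. This is only a sketch: i and p are taken as 1-based, and p is read as a position inside the window (so the labelled letter is x(i+p-1)), which is the reading that matches the worked example with window length 5 and p = 3.

```python
def W(i, k, p, x, e):
    """Sequence-to-structure window: 1-based start i, length k, special position p in 1..k."""
    if i < 1 or not (1 <= p <= k):
        return None
    if i + k - 1 > len(x) or i + p - 1 > len(e):
        return None  # window or its special position runs past the end: W is undefined here
    subsequence = x[i - 1:i - 1 + k]
    label = e[i + p - 2]  # element of e at 1-based index i + p - 1
    return subsequence, label

x, e = "ARNSTVVSTAA", "HHHHLLLEEE"
first = W(1, 5, 3, x, e)
third = W(3, 5, 3, x, e)
```

Searching for the best window size and special position then amounts to rebuilding the instance table for different (k, p) pairs and comparing classifier accuracy.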

Neural network models
Machine learning approach: provide training sets of structures (e.g. α-helices, non α-helices); the networks are trained to recognize patterns in known secondary structures; provide a test set (proteins with known structures). Accuracy ~70-75%. Simulate the brain. Selection of training sets is extremely important: use different protein families, with only one or two representatives from each family.

Reasons for improved accuracy
Align sequence with other related proteins of the same protein family Find members that has a known structure If significant matches between structure and sequence assign secondary structures to corresponding residues My T. Thai

3 State Neural Network My T. Thai

Neural Network My T. Thai

Input Layer Most approaches set w = 17. Why?
Based on evidence of statistical correlation with secondary structure as far as 8 residues on either side of the prediction point. The input layer consists of 17 blocks, each representing one position of the window. Each block has 21 units: the first 20 units represent the 20 aa, and one provides a null input used when the moving window overlaps the amino- or carboxyl-terminal end of the protein.

Binary Encoding Scheme
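Example: Let w = 5, and let us say we have the sequence A E G K Q…. Then the input layer has five blocks, one per window position, with units ordered A, C, D, E, F, G, …, N, P, Q, R, S, T, V, W, Y. In each block the unit of the residue at that position is set to 1 and all other units are 0.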
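A minimal sketch of this binary encoding, assuming the 21-unit blocks described for the input layer (20 amino acids plus one null unit) and an alphabetical ordering of the one-letter codes. The padding symbol "-" and the function name are hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 one-letter codes in alphabetical order
NULL = "-"  # hypothetical symbol for window positions beyond a terminus

def encode_window(window):
    """Concatenate one 21-unit block per window position; exactly one unit per block is 1."""
    units = []
    for symbol in window:
        block = [0] * 21
        index = 20 if symbol == NULL else AMINO_ACIDS.index(symbol)
        block[index] = 1
        units.extend(block)
    return units

encoded = encode_window("AEGKQ")  # w = 5, so 5 * 21 = 105 input units
```

For w = 17 the same function yields 17 * 21 = 357 input units.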

Hidden Layer Represent the structure of the central aa
Encoding scheme: Can use two units to represent the states: (1,0) = H, (0,1) = E, (0,0) = L. Some use three units: (1,0,0) = H, (0,1,0) = E, (0,0,1) = L. For each connection, we can assign a weight value. This weight value can be adjusted to best fit the data (training).

Output Level Based on the hidden level and some function f, calculate the output. Helix (H) is assigned to any group of 4 or more contiguous residues having helix output values greater than sheet outputs and greater than some threshold t. Strand (E) is assigned to any group of two or more contiguous residues having sheet output values greater than helix outputs and greater than t. Otherwise, the residue is assigned to L. Note that t can be adjusted as well (training).
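The assignment rule at the output level can be sketched as a post-processing step over per-residue network outputs. The function name and the default threshold t = 0.5 are assumptions made for illustration; the run lengths 4 and 2 come from the rule above.

```python
def assign_states(helix_out, strand_out, t=0.5, min_helix=4, min_strand=2):
    """Turn per-residue helix/strand outputs into an H/E/L string using run-length rules."""
    raw = []
    for h, s in zip(helix_out, strand_out):
        if h > s and h > t:
            raw.append("H")
        elif s > h and s > t:
            raw.append("E")
        else:
            raw.append("L")
    states = ["L"] * len(raw)
    i = 0
    while i < len(raw):
        j = i
        while j < len(raw) and raw[j] == raw[i]:
            j += 1  # [i, j) is a maximal run of one raw state
        needed = {"H": min_helix, "E": min_strand}.get(raw[i], 1)
        if j - i >= needed:
            states[i:j] = raw[i:j]  # runs that are too short stay L
        i = j
    return "".join(states)

helix = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1]
strand = [0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.1, 0.1, 0.8]
predicted = assign_states(helix, strand)
```

In this toy input the isolated helix signal at position 8 and the single strand signal at position 9 are both too short and fall back to L.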

How PHD works Step 1. BLAST search with input sequence
Step 2. Perform multiple seq. alignment and calculate aa frequencies for each position My T. Thai
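Step 2 can be illustrated with a small sketch that turns a multiple sequence alignment into per-position amino acid frequencies. Ignoring gap characters when computing the frequencies is an assumption of this sketch, as are the function name and the toy alignment.

```python
from collections import Counter

def column_frequencies(alignment):
    """Return one {amino acid: frequency} dict per alignment column, ignoring gaps ('-')."""
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        profile.append({aa: c / len(residues) for aa, c in counts.items()} if residues else {})
    return profile

# Toy alignment of five sequences and two columns (hypothetical data).
alignment = ["NK", "SK", "SR", "AK", "A-"]
profile = column_frequencies(alignment)
```

The first column of this toy alignment gives N = 0.2, S = 0.4, A = 0.4, the kind of per-position frequency vector that is fed to the first-level network.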

How PHD works Step 3. First Level: “Sequence to structure net”
Input: alignment profile, Output: units for H, E, L Calculate “occurrences” of any of the residues to be present in either an a-helix, b-strand, or loop. 1 2 3 4 5 6 7 H = 0.05 E = 0.18 L= 0.67 N=0.2, S=0.4, A=0.4 My T. Thai

How PHD works Step 4. Decision level
Step 3. Second Level: “Structure to structure net” Input: First Level values, Output: units for H, E, L Window size = 17 H = 0.59 E = 0.09 L= 0.31 E=0.18 Step 4. Decision level My T. Thai

Prepare Data for PHD Neural Nets
Starting from a sequence of unknown structure (SEQUENCE ) the following steps are required to finally feed evolutionary information into the PHD neural networks: a data base search for homologues (method Blast), a refined profile-based dynamic-programming alignment of the most likely homologues (method MaxHom) a decision for which proteins will be considered as homologues (length-depend cut-off for pairwise sequence identity) a final refinement, and extraction of the resulting multiple alignment. Numbers 1-3 indicate the points where users of the PredictProtein service can interfere to improve prediction accuracy without changes made to the final prediction method PHD . My T. Thai

PHD Neural Network My T. Thai

Prediction Accuracy My T. Thai

Where can I learn more?
Protein Structure Prediction Center, Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, CA. DSSP: Dictionary of Protein Secondary Structure (secondary structure assignments computed from known 3D structures).

Protein Structure: Tertiary Prediction via Threading

Objective Study the problem of predicting the tertiary structure of a given protein sequence My T. Thai

A Few Examples My T. Thai mythai@cise.ufl.edu actual predicted actual

Two Comparative Modeling
Homology modeling – identification of homologous proteins through sequence alignment; structure prediction through placing residues into “corresponding” positions of homologous structure models Protein threading – make structure prediction through identification of “good” sequence-structure fit We will focus on the Protein Threading. My T. Thai

Why it Works?
Observations: Many protein structures in the PDB are very similar, e.g. many 4-helical bundles, globins… in the set of solved structures. Conjecture: There are only a limited number of “unique” protein folds in nature.

Threading Method
General Idea: Try to determine the structure of a new sequence by finding its best ‘fit’ to some fold in a library of structures. Sequence-Structure Alignment Problem: Given a solved structure T for a sequence t1t2…tn and a new sequence S = s1s2… sm, we need to find the “best match” between S and T.

What to Consider How to evaluate (score) a given alignment of s with a structure T? How to efficiently search over all possible alignments? My T. Thai

Three Main Approaches Protein Sequence Alignment 3D Profile Method
Contact Potentials My T. Thai

Protein Sequence Alignment Method
Align two sequences S and T If in the alignment, si aligns with tj, assign si to the position pj in the structure Advantages: Simple Disadvantages: Similar structures have lots of sequence variability, thus sequence alignment may not be very helpful My T. Thai

3D Profile Method Actually uses structural information Main idea:
Reduce the 3D structure to a 1D string describing the environment of each position in the protein. (called the 3D profile (of the fold)) To determine if a new sequence S belongs to a given fold T, we align the sequence with the fold’s 3D profile First question: How to create the 3D profile? My T. Thai

Create the 3D Profile For a given fold, do:
For each residue, determine: How buried is it? Fraction of surrounding environment that is polar What secondary structure is it in (alpha-helix, beta-sheet, or neither) My T. Thai

Create the 3D profile 2. Assign an environment class to each position:
Six classes describe the burial and polarity criteria (exposed, partially buried, very buried, different fractions of polar environment) My T. Thai

Create the 3D Profile These environment classes depend on the number of surrounding polar residues and how buried the position is. There are 3 SS for each of these, thus have 18 environment classes My T. Thai

Create the 3D Profile 3. Convert the known structure T to a string E of environment descriptors. 4. Align the new sequence S with E using dynamic programming.

Scores for Alignment Need scores for aligning individual residues with environments. Key: Different aa prefer diff. environment. Thus determine scores by looking at the statistical data My T. Thai

Scores for Alignment Choose a database of known structures
Tabulate the number of times we see a particular residue in a particular environment class -> compute the score for each env class and each aa pair Choose gap penalties, eg. may charge more for gaps in alpha and beta environments… My T. Thai

Alignment This gives us a table of scores for aligning an aa sequence with an environment string Using this scoring and Dynamic Programming, we can find an optimal alignment and score for each fold in our library The fold with the highest score is the best fold for the new sequence My T. Thai
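The fold-recognition step can be sketched as a standard global alignment between the amino acid sequence and each fold's environment string. Everything concrete here is a simplification or an assumption: the two-letter environment alphabet, the score values, the single uniform gap penalty (rather than environment-dependent gap costs), and the function names.

```python
def align_to_profile(seq, env, score, gap=-2.0):
    """Optimal global alignment score of sequence seq against environment string env."""
    m, n = len(seq), len(env)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[(env[j - 1], seq[i - 1])],  # residue placed in environment
                dp[i - 1][j] + gap,  # residue left unaligned
                dp[i][j - 1] + gap,  # environment position left empty
            )
    return dp[m][n]

def best_fold(seq, library, score, gap=-2.0):
    """Return the name of the fold whose 3D profile aligns best with seq."""
    return max(library, key=lambda name: align_to_profile(seq, library[name], score, gap))

# Toy data (hypothetical): environments B = buried, X = exposed; residues L and K.
SCORE = {("B", "L"): 2.0, ("B", "K"): -1.0, ("X", "L"): -1.0, ("X", "K"): 2.0}
LIBRARY = {"fold1": "BBXX", "fold2": "XXBB"}
```

A real scoring table would have one row per environment class (18 in the scheme above) and one column per amino acid.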

Contact Potentials Method
Take 3D structure into account more carefully Include information about how residues interact with each other Consider pairwise interactions between the position pi, pj in the fold For a given alignment, produce a score which is the sum over these interactions: My T. Thai

Problem Have a sequence from the database T = t1…tn with known positions p1…pn, and a new sequence S = s1…sm. Find 1 <= r1 < r2 < … < rn <= m which maximize the total pairwise interaction score, where ri is the index of the aa in S which occupies position pi. This problem is NP-complete for pairwise interactions.
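To make the search problem concrete, the sketch below enumerates every increasing assignment r1 < … < rn by brute force. The objective is assumed to be a sum of a pair-score function over contacting template positions; the exact scoring formula is not shown in this transcript, so `pair_score`, the contact list and the function name are hypothetical.

```python
from itertools import combinations

def best_threading(template_len, contacts, seq, pair_score):
    """Try every increasing placement of seq residues on template positions (0-based indices)."""
    best_score, best_r = float("-inf"), None
    for r in combinations(range(len(seq)), template_len):
        # r[i] is the index in seq of the residue occupying template position i.
        total = sum(pair_score(seq[r[i]], seq[r[j]]) for i, j in contacts)
        if total > best_score:
            best_score, best_r = total, r
    return best_score, best_r

def hydrophobic_pair(a, b):
    """Toy pair score: reward two leucines in contact."""
    return 1 if a == "L" and b == "L" else 0

result = best_threading(3, [(0, 2)], "KLKKL", hydrophobic_pair)
```

The loop visits C(m, n) placements, which grows far too quickly for real proteins and motivates the structured search methods that follow.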

How to Define that Score?
Use so-called “knowledge-based potentials”, which comes from databases of observed interactions. The general form: My T. Thai

How to Define the Score General Idea:
Define a cutoff parameter for “contact” (e.g. up to 6 Angstroms). Use the PDB to count the number of times aa i and j are in contact. Several methods exist for normalization, e.g. normalization by hypothetical random frequencies.
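The counting and normalization idea can be sketched as follows. The log-odds form used here (negative logarithm of observed over expected contact frequency, with expected frequencies taken from residue composition) is one common choice and an assumption of this sketch, not necessarily the normalization intended above. One representative coordinate per residue is assumed, and all residue pairs are considered.

```python
from collections import Counter
from itertools import combinations
from math import dist, log

def count_contacts(residues, coords, cutoff=6.0):
    """Count residue-type pairs whose coordinates lie within cutoff Angstroms."""
    counts = Counter()
    for (a, pa), (b, pb) in combinations(zip(residues, coords), 2):
        if dist(pa, pb) <= cutoff:
            counts[tuple(sorted((a, b)))] += 1
    return counts

def log_odds_potential(counts, residue_freq):
    """Score each observed pair as -ln(observed frequency / expected random frequency)."""
    total = sum(counts.values())
    potential = {}
    for (a, b), observed in counts.items():
        expected = residue_freq[a] * residue_freq[b] * (1 if a == b else 2)
        potential[(a, b)] = -log((observed / total) / expected)
    return potential

# Toy structure (hypothetical coordinates in Angstroms).
counts = count_contacts(["L", "L", "K"], [(0, 0, 0), (4, 0, 0), (20, 0, 0)])
potential = log_odds_potential(counts, {"L": 2 / 3, "K": 1 / 3})
```

Negative values mark pairs seen in contact more often than chance would predict.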

Other Variations Many other variations in defining the potentials
In addition to pairwise potentials, consider single residue potentials Distance-dependent intervals: Counting up pairwise contacts separately for intervals within 1 Angstrom, between 1 and 2 Angstroms… My T. Thai

Threading via Tree-Decomposition

Contact Graph Each residue as a vertex
One edge between two residues if their spatial distance is within a given cutoff. Cores are the most conserved segments in the template.

Simplified Contact Graph

Alignment Example My T. Thai

Calculation of Alignment Score

Graph Labeling Problem
Each core as a vertex. Two cores interact if there is an interaction between any two residues, each in one core; add one edge between two cores that interact. (Figure: example graph G with vertices h, f, b, d, s, m, c, a, e, i, j, k, l.) Each possible sequence alignment position for a single core can be treated as a possible label assignment to a vertex in G. Let D[i] be the set of all possible label assignments to vertex i. Then for each label assignment A(i) in D[i], we have:

Tree Decomposition My T. Thai

Tree Decomposition [Robertson & Seymour, 1986]
Greedy: minimum degree heuristic. (Figure: example graph on vertices a to m and its first component abd.) We can use a greedy algorithm to decompose a graph into components:
1. Choose the vertex with minimum degree.
2. The chosen vertex and its neighbors form a component.
3. Add an edge between any two neighbors of the chosen vertex, so that the neighbors form a clique.
4. Remove the chosen vertex.
5. Repeat the above steps until the graph is empty.
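The greedy steps above can be sketched as follows, with the graph stored as a dictionary of neighbor sets. Breaking ties between equal-degree vertices by name is an assumption of this sketch, as is the function name; the sketch returns only the list of components and does not link them into a tree.

```python
def min_degree_components(graph):
    """Greedy minimum-degree elimination; returns the components in the order they are formed."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    components = []
    while g:
        v = min(sorted(g), key=lambda u: len(g[u]))  # minimum degree, ties broken by name
        nbrs = g[v]
        components.append({v} | nbrs)  # chosen vertex plus its neighbors
        for a in nbrs:
            g[a] |= nbrs - {a}  # make the neighbors a clique
            g[a].discard(v)     # remove the chosen vertex
        del g[v]
    return components

# A 4-cycle a-b-c-d-a as a small test graph.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
components = min_degree_components(cycle)
```

The size of the largest component minus one gives the width of the resulting decomposition, which bounds the cost of the dynamic programming run on it.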

Tree Decomposition (Cont’d)
(Figure: tree decomposition of the example graph on vertices a to m into the components abd, acd, clk, cdem, defm, fgh and eij.)

Tree Decomposition-Based Algorithms
A tree decomposition rooted at Xr. (Figure: components Xr, Xp, Xq, Xi, Xj, Xl with intersections Xir, Xji, Xli.)
1. Bottom-to-Top: calculate the minimal F function.
2. Top-to-Bottom: extract the optimal assignment.
Now assume that we have a tree decomposition of the contact graph, and let us look at how to calculate the optimal label assignment. In this decomposition, Xr is the root component and Xir is the intersection between Xi and Xr. Xi has two child components, Xj and Xl. If we fix the position assignment of all the residues in Xir, then the position assignment of the subtree rooted at Xi is independent of the rest of the tree. Therefore, let F(X_i, A(X_{ir})) be the score of the optimal label assignment of the subtree rooted at Xi, given the assignment A(X_{ir}). The score of the subtree rooted at Xi combines the scores of the subtrees rooted at Xj and Xl with the score of the component Xi:
F(X_i, A(X_{ir})) = \min_{A(X_i - X_{ir})} \{ F(X_j, A(X_{ji})) + F(X_l, A(X_{li})) + Score(X_i, A(X_i)) \}
