Neural Networks for Protein Structure Prediction Dr. B Bhunia.


1 Neural Networks for Protein Structure Prediction Dr. B Bhunia

2 Tools to analyze protein characteristics  Protein sequence: family membership, multiple alignments, identification of conserved regions, evolutionary relationships (phylogeny), 3-D fold models  Protein sorting and sub-cellular localization: anchoring into the membrane, signal sequences (tags), protein modifications  Some nascent proteins contain a specific signal, or targeting sequence, that directs them to the correct organelle (ER, mitochondrion, chloroplast, lysosome, vacuole, Golgi, or cytosol)

3 Questions  Can we train computers:  To detect signal sequences and predict protein destination?  To identify conserved domains (or a pattern) in proteins?  To predict the membrane-anchoring type of a protein? (Transmembrane domain, GPI anchor…)  To predict the 3-D structure of a protein?  Learning algorithms are good for solving problems in pattern recognition because they can be trained on a sample data set.  Classes of learning algorithms: artificial neural networks (ANNs), Hidden Markov Models (HMMs)

4 Artificial neural networks (ANNs)  Machine learning algorithms that mimic the brain. Real brains, however, are orders of magnitude more complex than any ANN.  ANNs, like people, learn by example; they are not programmed step-by-step to perform a specific task.  An ANN is composed of a large number of highly interconnected processing elements (neurons) working simultaneously to solve a specific problem.  The first artificial neuron was described in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.

5 Determining protein structure and function  Direct measurement of structure: X-ray crystallography, NMR spectroscopy  Site-directed mutagenesis  Computer modeling  Prediction of structure: comparative protein-structure modeling

6 Comparative protein-structure modeling  Goal: construct a 3-D model of a protein of unknown structure (the target), based on its sequence similarity to proteins of known structure (templates)  [Figure: blue, model predicted by PROSPECT; red, NMR structure]  Procedure: template selection; template–target alignment; model building; model evaluation

7 The protein 3-D database  The Protein Data Bank (PDB) contains 3-D structural data for proteins  Founded in 1971 with a dozen structures  As of June 2004, there were 25,760 structures in the database; all structures are reviewed for accuracy and data uniformity  Structural data from the PDB can be freely accessed at http://www.rcsb.org/pdb/  80% of structures come from X-ray crystallography, 16% from NMR, and 2% from theoretical modeling

8 High-throughput methods

9 Outline  Goal: predict the "secondary structure" of a protein from its sequence  An artificial neural network is used for this task  Evaluation of prediction accuracy

10 What is Protein Structure?

11 http ://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm

12 http://matcmadison.edu/biotech/resources/proteins/labManual/images/220_04_114.png

13 Protein Structure  An amino acid sequence “folds” into a complex 3-D structure  Finding out this 3-D structure is a crucial and challenging task  Experimental methods (e.g., X-ray crystallography) are very tedious  Computational predictions are a possibility, but very difficult

14 What is “secondary structure”?

15 http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif “Strand” “Helix”

16 http://www.npaci.edu/features/00/Mar/protein.jpg “Strand” “Helix”

17 Secondary structure prediction  The whole 3-D "tertiary" protein structure may be hard to predict from sequence  But can we at least predict the secondary structural elements, such as "strand", "helix", or "coil"?

18 A survey of structure prediction  The most reliable technique is "comparative modeling": find a protein P whose amino acid sequence is very similar to that of your "target" protein T; hope that P has a known structure; predict for T a structure similar to that of P, after carefully considering how the sequences of P and T differ

19 A survey of structure prediction  Comparative modeling fails if we don't have a suitable homologous "template" protein P for our protein T  "Ab initio" tertiary methods attempt to predict the structure without using a template structure: they incorporate basic physical and chemical principles into the structure calculation, which gets very hairy and is highly computationally intensive  The other option is prediction of secondary structure only (i.e., making the goal more modest); such predictions may be used to provide constraints for tertiary structure prediction

20 Secondary structure prediction  Early methods were based on stereochemical principles  Later methods realized that we can do better if we use not only the one sequence T (our sequence) but also a family of "related sequences"  Search for sequences similar to T, build a multiple alignment of these, and predict secondary structure from the multiple alignment of sequences

21 What’s multiple alignment doing here?  The most conserved regions of a protein sequence are either functionally important or buried in the protein "core"  More variable regions are usually on the surface of the protein; there are few constraints on what type of amino acid appears there (apart from a bias towards hydrophilic residues)  A multiple alignment tells us which portions are conserved and which are not

22 http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png hydrophobic core

23 What’s multiple alignment doing here?  Therefore, by looking at a multiple alignment, we can predict which residues are in the core of the protein and which are on the surface ("solvent accessibility")  Secondary structure is then predicted by comparing the accessibility patterns with those associated with helices, strands, etc.  This approach (Benner & Gerloff) was mostly manual

24 The PSI-PRED algorithm  PSI-BLAST-based secondary structure prediction  Given an amino-acid sequence, predict the secondary structure elements of the protein  Three stages: 1. Generation of a sequence profile (the "multiple alignment" step) 2. Prediction of an initial secondary structure (the neural network step) 3. Filtering of the predicted structure (another neural network step)

25 Generation of a sequence profile  A BLAST-like program called PSI-BLAST is used for this step  BLAST is a fast way to find high-scoring local alignments  PSI-BLAST is an iterative approach: an initial scan of a protein database using the target sequence T; align all matching sequences to construct a "sequence profile"; scan the database again using this new profile  It can also pick out and align distantly related protein sequences for our target sequence T

26 Creating a PSSM: Example  Three aligned sequences: NTEGEWI, NITRGEW, NIAGECC  [Table: amino acid frequencies at every position of the alignment]

27 Creating a PSSM: Example  Amino acids that do not appear at a specific position of a multiple alignment must also be considered, in order to model every possible sequence and have calculable log-odds scores. A simple procedure called pseudo-counts assigns minimal scores to residues that do not appear at a certain position of the alignment, according to the following equation: Score(i, j) = (count(i, j) + pseudocount) / (N + 20 × pseudocount), where count(i, j) is the number of occurrences of residue i in column j, pseudocount is a number greater than or equal to 1, and N is the number of sequences in the multiple alignment.

28 Creating a PSSM: Example  In this example, N = 3; let's use pseudocount = 1.  Score(N) at position 1 = 3/3 = 1. Score(I) at position 1 = 0/3 = 0.  Readjusting with pseudocounts: Score(I) at position 1 → (0 + 1) / (3 + 20) = 1/23 ≈ 0.043. Score(N) at position 1 → (3 + 1) / (3 + 20) = 4/23 ≈ 0.174.  The PSSM is obtained by taking the logarithm of the values obtained above divided by the background frequency of each residue. To simplify, we assume every amino acid appears equally often in protein sequences, i.e. f_i = 0.05 for every i:  PSSM Score(I) at position 1 = log((1/23) / 0.05) ≈ -0.061. PSSM Score(N) at position 1 = log((4/23) / 0.05) ≈ 0.541.
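The arithmetic on this slide is easy to check in code. Below is a minimal sketch of the pseudocount log-odds calculation, using the three-sequence alignment and the uniform 0.05 background from the example (the function name is ours):

```python
import math

# Toy alignment from the slides (three aligned sequences of length 7).
alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]

PSEUDOCOUNT = 1
NUM_AA = 20          # 20 standard amino acids
BACKGROUND = 0.05    # uniform background frequency, as assumed in the example
N = len(alignment)   # number of sequences (3)

def pssm_score(residue, position):
    """Log-odds (base-10) score of `residue` at `position`, with pseudocounts."""
    count = sum(1 for seq in alignment if seq[position] == residue)
    freq = (count + PSEUDOCOUNT) / (N + NUM_AA * PSEUDOCOUNT)
    return math.log10(freq / BACKGROUND)

# Reproduces the worked example for alignment position 1 (index 0).
print(round(pssm_score("N", 0), 3))   # 0.541
print(round(pssm_score("I", 0), 3))   # -0.061
```

Note that the score for N is positive (it occurs more often than the 0.05 background) and the score for I is negative, matching the interpretation on the next slide.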

29 The matrix assigns positive scores to residues that appear more often than expected by chance and negative scores to residues that appear less often than expected by chance. Creating a PSSM: Example

30 The sequence profile looks like this  It has 20 × M numbers (M is the sequence length)  The numbers are the log-odds score of each residue at each position

31 Preparing for the second step  Feed the sequence profile to an artificial neural network  But before feeding, do a simple "scaling" to bring the numbers to a 0–1 scale

32 Intro to Neural nets (the second and third steps of PSIPRED)

33 Artificial Neural Network  A supervised learning algorithm  Training examples: each example has a label, the "class" of the example, e.g., "positive" or "negative", or here "helix", "strand", or "coil"  The network learns how to predict the class of a new example

34 Artificial Neural Network Directed graph Nodes or “units” or “neurons” Edges between units Each edge has a weight (not known a priori)

35 Layered Architecture Input here is a four-dimensional vector. Each dimension goes into one input unit http://www.akri.org/cognition/images/annet2.gif

36 Layered Architecture http://www.geocomputation.org/2000/GC016/GC016_01.GIF (units)

37 What a unit (neuron) does  Unit i receives a total input x_i from the units connected to it, and produces an output y_i = f_i(x_i), where f_i() is the "transfer function" of unit i  w_i is called the "bias" of the unit

38 Weights, bias and transfer function  A unit takes n inputs  Each input edge has a weight w_i  Bias b  Output a  Transfer function f(): linear, sigmoidal, or other

39 Weights, bias and transfer function  The weights w_ij and bias w_i of each unit are the "parameters" of the ANN; parameter values are learned from input data  The transfer function is usually the same for every unit in the same layer  The graphical architecture (connectivity) is decided by you; one could use a fully connected architecture, in which all units in one layer connect to all units in the "next" layer
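Putting slides 37–39 together, a single unit computes a weighted sum of its inputs plus a bias, then applies its transfer function. A minimal sketch, with a sigmoidal transfer function and made-up weights:

```python
import math

def neuron(inputs, weights, bias):
    """One unit: weighted sum of inputs plus bias, passed through a sigmoid."""
    x = sum(w * v for w, v in zip(weights, inputs)) + bias   # total input x_i
    return 1.0 / (1.0 + math.exp(-x))                        # output y_i = f(x_i)

# A unit with three inputs; the weights and bias here are arbitrary examples
# (in a real ANN they would be learned from training data).
out = neuron(inputs=[0.2, 0.7, 0.1], weights=[1.5, -0.8, 2.0], bias=0.1)
print(round(out, 3))
```

With zero inputs and zero bias the sigmoid gives exactly 0.5, which is a handy sanity check when wiring up a layer.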

40 Where’s the algorithm?  It’s in the training of the parameters!  Given several examples and their labels (the training data), search for parameter values such that the output units make correct predictions on the training examples  This is done with the "back-propagation" algorithm; read up more on neural nets if you are interested

41 Back to PSIPRED …

42 Step 2  Feed the sequence profile to the input layer of an ANN  Not the whole profile, only a window of 15 consecutive positions  For each position, there are 20 numbers in the profile (one for each amino acid)  Therefore 15 × 20 = 300 numbers are fed, so there are ~300 "input units" in the ANN  There are 3 output units, for "strand", "helix", and "coil"; each number is the confidence in that secondary structure for the central position of the window of 15
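The windowing described above can be sketched as follows. Zero-padding at the sequence ends is our assumption for simplicity; the actual PSIPRED input encoding handles the ends differently (e.g., with an extra out-of-range flag):

```python
WINDOW = 15        # consecutive profile positions per input window
HALF = WINDOW // 2
NUM_AA = 20        # numbers per profile position (one per amino acid)

def window_input(profile, center):
    """Flatten the 15 profile rows around `center` into one 300-number vector."""
    values = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(profile):
            values.extend(profile[pos])
        else:
            values.extend([0.0] * NUM_AA)   # zero-pad past the sequence ends
    return values

# Dummy scaled profile for a protein of length 30 (all entries 0.5).
profile = [[0.5] * NUM_AA for _ in range(30)]
vec = window_input(profile, center=0)
print(len(vec))   # 300 = 15 x 20
```

Sliding `center` over every position of the protein yields one 300-number input vector, and hence one helix/strand/coil prediction, per residue.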

43 [Figure: an ANN whose input layer reads a window of 15 profile positions, with a hidden layer and three output units, helix / strand / coil, e.g., 0.18 / 0.09 / 0.67]

44 Step 3  Feed the output of the 1st ANN to the 2nd ANN  Each window of 15 positions gave 3 numbers from the 1st ANN  Take 15 successive windows' outputs and feed them to the 2nd ANN  Therefore, 15 × 3 = 45 input units in this ANN  Again 3 output units, for "strand", "helix", "coil"

45 Test of performance

46 Cross-validation  Partition the labeled data into a "training set" (two thirds of the examples) and a "test set" (the remaining one third)  Train PSIPRED on the training set, then make predictions on the test set and compare them with the known answers  What is an answer? For each position of the sequence, a prediction of what secondary structure that position is involved in; that is, a sequence over "H/S/C" (helix/strand/coil)  How to compare a predicted answer with the known answer? Count the number of positions that match
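Counting matching positions is the standard per-residue accuracy (often called Q3 in the secondary-structure literature). A minimal sketch, with made-up prediction strings:

```python
def q3(predicted, actual):
    """Fraction of positions whose predicted H/S/C label matches the known label."""
    assert len(predicted) == len(actual)
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical example: 12 residues, one mismatch at position 4.
pred = "HHHHCCCSSSCC"
true = "HHHCCCCSSSCC"
print(round(q3(pred, true), 2))   # 11 of 12 positions match -> 0.92
```

Averaging this score over all test-set proteins gives the overall accuracy figure reported for the method.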

