molecule's structure prediction



Outline. RNA: RNA folding; dynamic programming for RNA secondary structure prediction. Protein: secondary structure prediction; homology modeling; protein threading; ab-initio folding.

RNA Basics. RNA bases: A, C, G, U. Base pairs: A-U (2 hydrogen bonds), G-C (3 hydrogen bonds, more stable), and G-U (“wobble” pairing). Each base can pair with at most one other base. Image: http://www.bioalgorithms.info/

RNA Secondary Structure. Structural elements labelled in the figure: stem, hairpin loop, bulge loop, interior loop, junction (multiloop), single-stranded region, pseudoknot. Image: Wuchty

RNA secondary structure representation. Circular representation: Bacillus subtilis RNase P RNA.

RNA secondary structure representation. Dot plot representation of the same Bacillus subtilis RNA folding: a dot is placed at position (i, j) to represent a base pair between bases i and j.

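RNA secondary structure definition. An RNA sequence is represented as R = r1, r2, r3, …, rn (ri is the i-th nucleotide). Each ri belongs to the set {A, C, G, U}. A secondary structure on R is a set S of ordered pairs, written as i•j, 1≤i<j≤n, satisfying the restrictions listed on the next slide (no knots, no close base pairs, only allowed base pairs), with each base taking part in at most one pair.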

Computing RNA secondary structure. Working hypothesis: the native secondary structure of an RNA molecule is the one with the minimum free energy. Restrictions: No knots: two base pairs (ri,rj) and (rk,rl) with i<k<j<l are not allowed. No close base pairs: every pair (ri,rj) must satisfy j - i > 3. Allowed base pairs: A-U, C-G and G-U.
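The restrictions on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_structure` and the representation of a structure as a list of 1-based `(i, j)` tuples are assumptions made for illustration. The "each base pairs at most once" rule is taken from the RNA Basics slide.

```python
# Allowed pairs from the slide: A-U, C-G and G-U (in either orientation).
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def is_valid_structure(seq, pairs, min_sep=4):
    """Return True if the 1-based pairs (i, j) obey the slide's restrictions."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        if j - i < min_sep:                      # "close" pair: need j - i > 3
            return False
        if (seq[i - 1], seq[j - 1]) not in ALLOWED:
            return False
        if i in used or j in used:               # each base pairs at most once
            return False
        used.update((i, j))
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:                    # crossing pairs form a knot
                return False
    return True
```

For example, on the sequence AUACCCUGUGGUAU used later in the slides, the stem `[(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]` passes, while the pair `(5, 8)` is rejected as too close.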

Computing RNA secondary structure. Tinoco-Uhlenbeck postulate. Assumption: the free energy of each base pair is independent of all the other pairs and the loop structures. Consequence: the total free energy of an RNA is the sum of all of the base pair free energies.
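Under this independence assumption, scoring a structure is a plain sum over its base pairs. A minimal sketch follows; the function name `total_free_energy` and the pair-list representation are illustrative assumptions, and the energies are the illustrative values quoted in the DP notes of these slides (-3, -2 and -1 kcal/mol for GC, AU and GU).

```python
# Illustrative per-pair free energies in kcal/mol (values quoted in the DP notes).
PAIR_ENERGY = {
    frozenset("GC"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def total_free_energy(seq, pairs):
    """Sum the independent pair energies for 1-based pairs (i, j).

    Raises KeyError if a pair is not one of the allowed base pairs.
    """
    return sum(PAIR_ENERGY[frozenset((seq[i - 1], seq[j - 1]))] for i, j in pairs)
```

Applied to AUACCCUGUGGUAU with the stem `[(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]`, this gives three A-U pairs and two G-C pairs, i.e. -12 kcal/mol, which agrees with the total reported on the final prediction slide.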

Independent Base Pairs Approach. Use solutions for smaller strings to find solutions for larger strings. This is precisely the basic principle behind dynamic programming algorithms!

RNA folding: Dynamic Programming. Notation: e(ri,rj): free energy of a base pair joining ri and rj. Bij: a secondary structure of the RNA strand from base ri to base rj; its energy is E(Bij). S(i,j): optimal free energy score associated with segment ri…rj, defined as $S(i,j) = \max_{B_{ij}} \bigl(-E(B_{ij})\bigr)$, the maximum taken over all structures Bij.

RNA folding: Dynamic Programming. There are only four possible ways that a secondary structure of nested base pairs can be constructed on an RNA strand from position i to j. i is unpaired, added on to a structure for i+1…j: S(i,j) = S(i+1,j). j is unpaired, added on to a structure for i…j-1: S(i,j) = S(i,j-1).

RNA folding: Dynamic Programming. i and j are both paired, but not to each other; the structure for i…j adds together structures for two sub-regions, i…k and k+1…j: S(i,j) = max over i<k<j of {S(i,k)+S(k+1,j)}. i and j are paired with each other, added on to a structure for i+1…j-1: S(i,j) = S(i+1,j-1)+e(ri,rj).

RNA folding: Dynamic Programming. Since there are only four cases, the optimal score S(i,j) is just the maximum of the four possibilities: $S(i,j) = \max\{\, S(i+1,j),\; S(i,j-1),\; S(i+1,j-1) + e(r_i,r_j),\; \max_{i<k<j} [S(i,k) + S(k+1,j)] \,\}$. To compute this efficiently, we need to make sure that the scores for the smaller sub-regions have already been calculated: dynamic programming!

RNA folding: Dynamic Programming. Notes: S(i,j) = 0 if j-i < 4 (do not allow “close” base pairs). Reasonable values of e are -3, -2, and -1 kcal/mole for GC, AU and GU, respectively; in the DP procedure, we use 3, 2, 1 (or keep the negative values and replace max with min). Build the upper triangular part of the DP matrix: start with the diagonal (all 0), work outward on larger and larger regions, and end with S(1,n). Traceback starts at S(1,n) and finds an optimal path that leads there.
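The fill order described in these notes can be sketched as follows. This is an illustrative implementation of the four-case recurrence, not code from the slides; the names `pair_score` and `fill_matrix` are assumptions, indices are 0-based inside the code (so `S[0][n-1]` corresponds to S(1,n)), and the scores 3, 2, 1 are the positive values suggested in the notes.

```python
SCORES = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 1}


def pair_score(a, b):
    """Score e for pairing bases a and b; 0 if they cannot pair."""
    return SCORES.get(frozenset((a, b)), 0)


def fill_matrix(seq, min_sep=4):
    """Fill the upper triangle of S, working outward from the diagonal."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]          # cells with j - i < min_sep stay 0
    for span in range(min_sep, n):           # larger and larger regions
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j],          # i unpaired
                       S[i][j - 1])          # j unpaired
            e = pair_score(seq[i], seq[j])
            if e > 0:                        # i and j paired with each other
                best = max(best, S[i + 1][j - 1] + e)
            for k in range(i + 1, j):        # i and j paired, not to each other
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S
```

Calling `fill_matrix("AUACCCUGUGGUAU")` and reading `S[0][13]` gives the score for the whole sequence. The triple loop makes the running time cubic in the sequence length.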

[Figure: DP matrix with rows i and columns j. Initialisation: no close base pairs, so cells with j - i < 4 are set to 0.]

[Figure: DP matrix, propagation step.] Segment C5…U9: C5 unpaired: S(6,9) = 0. U9 unpaired: S(5,8) = 0. C5-U9 paired: S(6,8)+e(C,U) = 0. C5 and U9 both paired, but not to each other: S(5,6)+S(7,9) = 0, S(5,7)+S(8,9) = 0. All candidates are 0, so S(5,9) = 0.

[Figure: DP matrix, propagation step.] Segment C5…G11: C5 unpaired: S(6,11) = 3. G11 unpaired: S(5,10) = 3. C5-G11 paired: S(6,10)+e(C,G) = 6. C5 and G11 both paired, but not to each other: S(5,6)+S(7,11) = 1, S(5,7)+S(8,11) = 0, S(5,8)+S(9,11) = 0, S(5,9)+S(10,11) = 0. The maximum is 6, so S(5,11) = 6.

[Figure: completed DP matrix after propagation; cell values range from 1 to 12.]

[Figure: traceback path through the completed DP matrix.]

Final prediction. Sequence: AUACCCUGUGGUAU. Total free energy: -12 kcal/mol. [Figure: drawing of the predicted hairpin structure.]
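The traceback that produces such a prediction can be sketched as below. This is an illustrative, self-contained version of the fill plus traceback, assuming the scores 3, 2, 1 for GC, AU, GU; the names `fold` and `dot_bracket` and the dot-bracket output format are assumptions, not part of the slides. When several structures share the optimal score, which one is returned depends on the order in which the cases are tried (here the "i and j paired" case is tried first).

```python
SCORES = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 1}


def pair_score(a, b):
    return SCORES.get(frozenset((a, b)), 0)


def fold(seq, min_sep=4):
    """Return (optimal score, sorted list of 1-based base pairs)."""
    n = len(seq)
    if n == 0:
        return 0, []
    S = [[0] * n for _ in range(n)]
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            cands = [S[i + 1][j], S[i][j - 1]]
            e = pair_score(seq[i], seq[j])
            if e > 0:
                cands.append(S[i + 1][j - 1] + e)
            cands.extend(S[i][k] + S[k + 1][j] for k in range(i + 1, j))
            S[i][j] = max(cands)

    # Traceback: start at S(1,n) and re-derive which case produced each cell.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i < min_sep:
            continue                           # nothing can pair in here
        e = pair_score(seq[i], seq[j])
        if e > 0 and S[i][j] == S[i + 1][j - 1] + e:
            pairs.append((i + 1, j + 1))       # report 1-based positions
            stack.append((i + 1, j - 1))
        elif S[i][j] == S[i + 1][j]:
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:
            stack.append((i, j - 1))
        else:
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return S[0][n - 1], sorted(pairs)


def dot_bracket(n, pairs):
    """Render 1-based pairs as a dot-bracket string of length n."""
    out = ["."] * n
    for i, j in pairs:
        out[i - 1], out[j - 1] = "(", ")"
    return "".join(out)
```

On AUACCCUGUGGUAU this sketch returns a score of 12, i.e. -12 kcal/mol once the sign convention is undone, consistent with the slide. At least one other structure (with a one-base bulge) reaches the same score, so the exact pairing is not unique.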

Protein structure prediction

The sequence-structure gap. The gap is getting bigger. [Figure: number of known sequences versus number of known structures; vertical axis from 20,000 to 200,000.]

The protein folding problem. The information for the 3D structure is encoded in the protein sequence. Proteins fold into their native structure in seconds.

Secondary Structure Prediction Given a primary sequence ADSGHYRFASGFTYKKMNCTEAA what secondary structure will it adopt ?

Backbone: phi (φ) and psi (ψ). A polypeptide chain. The R side chains identify the component amino acids. Atoms inside each quadrilateral lie in the same plane, and the planes can rotate according to the angles φ and ψ.

Protein structure

Secondary Structure Prediction Methods. Chou-Fasman / GOR methods: based on amino acid frequencies. Machine learning methods: PHDsec and PSIPRED.

Chou and Fasman (1974). Success rate of 50%.

Name             P(a)   P(b)   P(turn)
Alanine           142     83      66
Arginine           98     93      95
Aspartic Acid     101     54     146
Asparagine         67     89     156
Cysteine           70    119     119
Glutamic Acid     151     37      74
Glutamine         111    110      98
Glycine            57     75     156
Histidine         100     87      95
Isoleucine        108    160      47
Leucine           121    130      59
Lysine            114     74     101
Methionine        145    105      60
Phenylalanine     113    138      60
Proline            57     55     152
Serine             77     75     143
Threonine          83    119      96
Tryptophan        108    137      96
Tyrosine           69    147     114
Valine            106    170      50

Each value is the propensity of an amino acid to be part of a certain secondary structure (e.g., proline has a low propensity for being in an alpha helix or beta sheet, so it acts as a breaker).

Secondary Structure Method Improvements. ‘Sliding window’ approach. Most alpha helices are ~12 residues long; most beta strands are ~6 residues long. Look at all windows and calculate a score for each window. If the score exceeds a threshold, predict that the window is an alpha helix or beta strand. Example sequence: TGTAGPOLKCHIQWMLPLKK
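A sliding-window score of this kind can be sketched as follows. This is a simplified illustration rather than the full Chou-Fasman procedure: the score is assumed to be the mean propensity over the window, the threshold is a free parameter (the slides do not give one), and the names `window_scores` and `predict` are hypothetical. The propensity values are those in the Chou and Fasman table in these slides, keyed by one-letter amino acid code.

```python
# (P(a), P(b), P(turn)) per residue, from the Chou and Fasman table.
PROPENSITY = {
    "A": (142, 83, 66),   "R": (98, 93, 95),    "D": (101, 54, 146),
    "N": (67, 89, 156),   "C": (70, 119, 119),  "E": (151, 37, 74),
    "Q": (111, 110, 98),  "G": (57, 75, 156),   "H": (100, 87, 95),
    "I": (108, 160, 47),  "L": (121, 130, 59),  "K": (114, 74, 101),
    "M": (145, 105, 60),  "F": (113, 138, 60),  "P": (57, 55, 152),
    "S": (77, 75, 143),   "T": (83, 119, 96),   "W": (108, 137, 96),
    "Y": (69, 147, 114),  "V": (106, 170, 50),
}


def window_scores(seq, window, column):
    """Mean propensity per window; column 0 = helix, 1 = strand, 2 = turn."""
    return [
        sum(PROPENSITY[aa][column] for aa in seq[start:start + window]) / window
        for start in range(len(seq) - window + 1)
    ]


def predict(seq, column, window, threshold, label):
    """Mark every residue covered by an above-threshold window with `label`."""
    pred = ["C"] * len(seq)                    # "C" = no prediction (coil)
    for start, score in enumerate(window_scores(seq, window, column)):
        if score > threshold:
            pred[start:start + window] = [label] * window
    return "".join(pred)
```

For instance, `predict(seq, 0, 12, 100.0, "H")` would scan for helices with a 12-residue window and `predict(seq, 1, 6, 100.0, "E")` for strands with a 6-residue window; the threshold of 100 is only a placeholder choice.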

Improvements since the 1980s. Adding information from conservation in the MSA. Smarter algorithms (e.g., machine learning). Success rate: 75%-80%.

Machine learning approach for predicting secondary structure (PHD, PSIPRED). Step 1: Generating a multiple sequence alignment. The query is searched against SwissProt and aligned with the matching subject sequences.

Step 2: Additional sequences are added using a profile built from the seed MSA. We end up with an MSA which represents the protein family.

Step 3: The sequence profile of the protein family, derived from the MSA, is compared (by machine learning methods) to sequences with known secondary structure.
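The "sequence profile" in these steps can be pictured as per-column residue frequencies of the MSA. The sketch below shows only that basic idea; the function name `sequence_profile`, the use of '-' for gaps, and the plain frequency counts are assumptions for illustration. Real predictors such as PHD and PSIPRED use more elaborate profiles and a trained network, which are not reproduced here.

```python
from collections import Counter


def sequence_profile(msa):
    """Column-wise residue frequencies for an MSA of equal-length strings.

    Gap characters ('-') are ignored; an all-gap column yields an empty dict.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in msa if row[col] != "-")
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile
```

Each position of the query is then described by a frequency vector rather than a single letter, which is the kind of conservation information a learning method can exploit.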

Neural Network architecture used in BetaTPred2

Predicting protein 3D structure. Goal: 3D structure from 1D sequence. If the protein adopts an existing fold: homology modeling or fold recognition. If it adopts a new fold: ab-initio prediction.

Homology Modeling. The simplest and most reliable approach. Basis: proteins with similar sequences tend to fold into similar structures. It has been observed that even proteins with only 25% sequence identity fold into similar structures. It does not work for remote homologs (< 25% pairwise identity).

Homology Modeling. Given: a query sequence Q and a database of known protein structures. Find a protein P such that P has high sequence similarity to Q. Return P’s structure as an approximation to Q’s structure.

Homology modeling needs three items of input: (1) the sequence of a protein with unknown 3D structure, the "target sequence"; (2) a 3D “template”, a structure having the highest sequence identity with the target sequence (> 25% sequence identity); (3) a sequence alignment between the target sequence and the template sequence.
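The template choice described here can be sketched as picking the candidate with the highest identity above the 25% cut-off. This is a minimal illustration under stated assumptions: the alignments are taken as given (gapped strings of equal length), identity is computed over columns where neither sequence has a gap (one of several conventions in use), and the names `percent_identity` and `pick_template` as well as the template identifiers are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identical residues over columns without a gap in either row."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    cols = [(a, b) for a, b in zip(aligned_target, aligned_template)
            if a != "-" and b != "-"]
    if not cols:
        return 0.0
    return 100.0 * sum(a == b for a, b in cols) / len(cols)


def pick_template(alignments, min_identity=25.0):
    """Return the id of the best template above the cut-off, or None.

    alignments maps a template id to (aligned target, aligned template).
    """
    best_id, best = None, min_identity
    for template_id, (target, template) in alignments.items():
        identity = percent_identity(target, template)
        if identity > best:
            best_id, best = template_id, identity
    return best_id
```

Returning `None` when no candidate exceeds the cut-off corresponds to the case where homology modeling is not applicable and another approach is needed.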

Fold recognition = Protein Threading Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?

Protein Threading. The goal: find the “correct” sequence-structure alignment between a target sequence (e.g., MTYKLILN …. NGVDGEWTYTE) and its native-like fold in the PDB. Energy function: knowledge (or statistics) based rather than physics based. It should be able to distinguish correct structural folds from incorrect structural folds, and correct sequence-fold alignments from incorrect sequence-fold alignments.

Protein Threading: basic premise. Statistics from the Protein Data Bank (~2,000 structures). The chances for a protein to have a structural fold that already exists in the PDB are quite good. The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand). 90% of new structures submitted to the PDB in the past three years have similar structural folds in the PDB.

Protein Threading: basic components. Structure database; energy function; sequence-structure alignment algorithm; prediction reliability assessment.
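How these components fit together can be shown with a deliberately tiny toy model. Everything in it is a simplifying assumption made for illustration and not the method of any real threading program: each fold in the "structure database" is reduced to a string of burial labels ('B' buried, 'E' exposed), the "energy function" rewards hydrophobic residues at buried positions and other residues at exposed positions, the alignment is gapless with equal lengths, and reliability assessment is omitted. The names `threading_energy`, `rank_folds` and the fold names are hypothetical.

```python
# Coarse hydrophobic class, chosen for this toy example only.
HYDROPHOBIC = set("AVILMFWC")


def threading_energy(seq, environment):
    """Toy energy (lower is better) of placing seq onto a burial pattern."""
    if len(seq) != len(environment):
        raise ValueError("toy model is gapless: lengths must match")
    energy = 0
    for aa, env in zip(seq, environment):
        compatible = (aa in HYDROPHOBIC) == (env == "B")
        energy += -1 if compatible else 1
    return energy


def rank_folds(seq, fold_library):
    """Return (energy, fold name) tuples sorted from best to worst."""
    return sorted(
        (threading_energy(seq, env), name) for name, env in fold_library.items()
    )
```

With a library such as `{"fold_alt": "BEBEBE", "fold_rev": "EBEBEB"}` and the sequence LKLKLK, the first fold ranks best because its buried positions line up with the leucines. A real system would replace the burial strings with known 3D structures, use a statistics-derived potential, and allow gaps in the alignment.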

ab-initio folding. Goal: predict structure from “first principles”. Requires: a free energy function sufficiently close to the “true potential”, and a method for searching the conformational space. Advantages: works for novel folds; shows that we understand the process. Disadvantages: applicable to short sequences only.

Qian et al. (Nature, 2007) used distributed computing* to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent. *70,000 home computers for about two years.

Overall Approach (flowchart). Protein sequence → database searching → multiple sequence alignment → homologue in PDB? If yes: homology modelling → 3-D protein model. If no: secondary structure prediction → fold recognition → predicted fold? If yes: sequence-structure alignment → homology modelling → 3-D protein model. If no: ab-initio structure prediction → 3-D protein model.

Thank you for learning with me!