RNA Secondary Structure Prediction Spring 2010


Objectives
- Can we predict the structure of an RNA?
- Can we predict the structure of a protein?

RNA: Hypothesis and exclusions
- Single-stranded chain
- An RNA molecule is a string of n characters $R = r_1 r_2 \dots r_n$ such that $r_i \in \{A, C, G, U\}$
- Prediction methods are based on computing minimum free-energy configurations
- We exclude knots.
- A knot exists when $(r_i, r_j) \in S$ and $(r_k, r_l) \in S$ and $i < k < j < l$, where S is the set of base pairs in the structure.
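The knot condition can be checked directly on a set of base pairs. The following is a minimal sketch, assuming a structure is given as a list of index pairs; the function name `has_knot` is chosen here for illustration.

```python
from itertools import combinations

def has_knot(pairs):
    """Return True if two base pairs (i, j) and (k, l) satisfy i < k < j < l."""
    # Put the smaller index first in every pair, then sort by that index.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # After sorting, i <= k always holds, so only this one order needs testing.
        if i < k < j < l:
            return True
    return False

print(has_knot([(1, 10), (2, 9)]))  # nested pairs: False
print(has_knot([(1, 5), (3, 8)]))   # crossing pairs: True
```

The check compares every pair of base pairs, so it is quadratic in the number of pairs, which is enough for illustration.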

Combinatorial Solution
- Enumerate all the possible structures
- Select the one with the lowest free energy
- Infeasible in practice: the number of structures is exponential in the number of bases!
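To get a feel for the growth, the number of knot-free structures can be counted without enumerating them. This is a rough sketch under a simplifying assumption that is not on the slide: any two positions are allowed to pair, so the count is an upper bound for a real sequence. The name `count_structures` is illustrative.

```python
from functools import lru_cache

def count_structures(n):
    """Count knot-free pairings of n positions, letting any two positions pair."""
    @lru_cache(maxsize=None)
    def count(length):
        if length <= 1:
            return 1  # only the structure with no base pairs
        # Either the last position is unpaired ...
        total = count(length - 1)
        # ... or it pairs with position k, which leaves k positions to the left
        # and length - k - 2 positions enclosed by the pair, folded independently.
        for k in range(length - 1):
            total += count(k) * count(length - k - 2)
        return total
    return count(n)

for n in (5, 10, 20, 30):
    print(n, count_structures(n))
```

Even at n = 30 the count is already far beyond what an explicit enumeration could handle comfortably, which is the point of the slide.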

Independent Base Pair
- Assumption: the energy of a base pair is independent of all the others.
- Let $\alpha(r_i, r_j)$ be the free energy of base pair $(r_i, r_j)$
- We assume that $\alpha(r_i, r_j) = 0$ if $i = j$
- We consider secondary structures and we use a dynamic programming approach

Dynamic programming
- Consider the string $R_{i,j} = r_i r_{i+1} \dots r_j$; we want to compute the secondary structure $S_{i,j}$ of minimum energy
Possible cases:
1. $r_j$ is base-paired with $r_i$: then $E(S_{i,j}) = \alpha(r_i, r_j) + E(S_{i+1,j-1})$
2. $r_j$ does not pair with any base: then $E(S_{i,j}) = E(S_{i,j-1})$
3. $r_j$ is base-paired with $r_k$ for some $i < k < j$: split the string into $R_{i,k-1}$ and $R_{k,j}$, and then $E(S_{i,j}) = \min_{i<k<j} \{ E(S_{i,k-1}) + E(S_{k,j}) \}$

Dynamic Programming
- The algorithm fills an n x n matrix using this minimization function
- Complexity of the algorithm: $O(n^3)$
- The matrix has $O(n^2)$ elements, and each element takes $O(n)$ time to compute
- Note that each element has a variable number of split points k to consider, which can be n in the worst case
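The three cases can be sketched as a table-filling routine. This is an illustration, not the course's implementation: the pair energy used here (-1 for A-U and C-G, 0 otherwise) is a made-up placeholder, and the names `pair_energy` and `min_energy` are hypothetical. Pairing $r_i$ with $r_j$ leaves the interior $r_{i+1} \dots r_{j-1}$, so the sketch uses $E(S_{i+1,j-1})$ for that case, and it takes the energy of an empty or single-base string to be 0.

```python
def pair_energy(a, b):
    """Hypothetical energy: -1 for Watson-Crick pairs, 0 for anything else."""
    return -1.0 if {a, b} in ({"A", "U"}, {"C", "G"}) else 0.0

def min_energy(rna, energy=pair_energy):
    """Minimum free energy of rna under the independent base pair assumption."""
    n = len(rna)
    # E[i][j] holds the minimum energy of r_i..r_j; entries with j <= i stay 0.
    E = [[0.0] * n for _ in range(n)]
    for span in range(1, n):              # shorter substrings first
        for i in range(n - span):
            j = i + span
            # Case 1: r_j pairs with r_i, the interior is r_{i+1}..r_{j-1}.
            best = energy(rna[i], rna[j]) + E[i + 1][j - 1]
            # Case 2: r_j is unpaired.
            best = min(best, E[i][j - 1])
            # Case 3: r_j pairs with r_k, i < k < j; split into R_{i,k-1} and R_{k,j}.
            for k in range(i + 1, j):
                best = min(best, E[i][k - 1] + E[k][j])
            E[i][j] = best
    return E[0][n - 1] if n else 0.0

print(min_energy("GGGAAACCC"))  # three nested G-C pairs: -3.0
```

The two outer loops visit $O(n^2)$ cells and the inner loop over k is $O(n)$, which matches the $O(n^3)$ bound.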

Structures with loops
- Assumption: the free energy of a base pair depends on adjacent base pairs
- A loop is the set of all bases accessible from a base pair $(r_i, r_j)$
- Consider $(r_i, r_j)$ in S and positions u, v, and w such that $i < u < v < w < j$. We say that $r_v$ is accessible from $(r_i, r_j)$ if $(r_u, r_w)$ is not a base pair in S for any u and w
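The accessibility definition translates almost word for word into code. A small sketch, assuming the structure S is a list of index pairs and reading the inequalities as strict; `accessible_positions` is a name invented for this example.

```python
def accessible_positions(i, j, pairs):
    """Positions v between i and j that no pair (u, w) with i < u < v < w < j encloses."""
    ordered = [(min(p), max(p)) for p in pairs]
    result = []
    for v in range(i + 1, j):
        enclosed = any(i < u < v < w < j for u, w in ordered)
        if not enclosed:
            result.append(v)
    return result

S = [(1, 10), (2, 9), (4, 7)]
print(accessible_positions(1, 10, S))  # [2, 9]: the adjacent pair, a helical region
print(accessible_positions(2, 9, S))   # [3, 4, 7, 8]: an interior loop closed by (4, 7)
print(accessible_positions(4, 7, S))   # [5, 6]: a hairpin loop
```

The kind of loop closed by a pair can then be read off from which accessible positions are themselves paired.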

Loops
[Figure: loop types: hairpin loop, bulge on i, interior loop, helical region]

Dynamic programming
- Determine $S_{i,j}$ for $R_{i,j} = r_i r_{i+1} \dots r_j$
Possible cases:
1. $r_i$ is not base-paired: then $E(S_{i,j}) = E(S_{i+1,j})$
2. $r_j$ is not base-paired: then $E(S_{i,j}) = E(S_{i,j-1})$
3. $r_i$ and $r_j$ are both base-paired, but not with each other: split the string into $R_{i,k}$ and $R_{k+1,j}$ for some $i < k < j$, and then $E(S_{i,j}) = \min_{i<k<j} \{ E(S_{i,k}) + E(S_{k+1,j}) \}$
4. $r_j$ forms a pair with $r_i$: there might be one or more loops between i and j, and $E(S_{i,j}) = E(L_{i,j})$

The energy of a loop
- hairpin: $E(L_{i,j}) = \alpha(r_i, r_j) + \xi(j-i-1)$, where $\xi(k)$ = destabilizing free energy of a hairpin loop of size k
- helical region: $E(L_{i,j}) = \alpha(r_i, r_j) + \eta + E(S_{i+1,j-1})$, where $\eta$ = stabilizing free energy of adjacent base pairs
- bulge on i: $E(L_{i,j}) = \min_{k \ge 1} \{ \alpha(r_i, r_j) + \beta(k) + E(S_{i+k+1,j-1}) \}$, where $\beta(k)$ = destabilizing free energy of a bulge loop of size k
- bulge on j: $E(L_{i,j}) = \min_{k \ge 1} \{ \alpha(r_i, r_j) + \beta(k) + E(S_{i+1,j-k-1}) \}$
- interior loop: $E(L_{i,j}) = \min_{k_1, k_2 \ge 1} \{ \alpha(r_i, r_j) + \gamma(k_1 + k_2) + E(S_{i+k_1+1,j-k_2-1}) \}$, where $\gamma(k)$ = destabilizing free energy of an interior loop of size k

Dynamic Programming
$$E(S_{i,j}) = \min \begin{cases} E(S_{i+1,j}) \\ E(S_{i,j-1}) \\ \min_{i<k<j} \{ E(S_{i,k}) + E(S_{k+1,j}) \} \\ E(L_{i,j}) \end{cases}$$
- Complexity of the algorithm: $O(n^4)$
- The complexity is worse because of the loops:
- it takes constant time to compute the hairpin and helical region energies, giving $O(n^2)$ overall
- it takes linear time to compute the bulge, since k ranges between 1 and j-i, giving $O(n^3)$ overall
- it takes quadratic time to compute the interior loop, since we are dealing with two parameters $k_1$ and $k_2$ and therefore have to look at a submatrix having $O(n^2)$ elements, giving $O(n^4)$ overall
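The recurrence with loops can be sketched as two tables, one for $E(S_{i,j})$ and one for $E(L_{i,j})$. Everything numeric below is a made-up placeholder, and the names `alpha`, `xi`, `ETA`, `beta`, `gamma` are labels chosen here for the five energy terms (pair, hairpin, stacking, bulge, interior loop). One further assumption, stated loudly: the inner term of each loop is read from the `L` table rather than the `E` table, so that the inner region really is closed by a base pair; the slides write that term as $E(S)$.

```python
import math

INF = math.inf
ETA = -2.0                                   # hypothetical stacking bonus

def alpha(a, b):
    """Hypothetical pair energy; non Watson-Crick pairs are forbidden."""
    return -3.0 if {a, b} in ({"A", "U"}, {"C", "G"}) else INF

def xi(k):    return 1.0 + k                 # hypothetical hairpin penalty
def beta(k):  return 1.0 + k                 # hypothetical bulge penalty
def gamma(k): return 1.0 + k                 # hypothetical interior loop penalty

def fold_with_loops(rna):
    n = len(rna)
    E = [[0.0] * n for _ in range(n)]        # E[i][j]: best energy of r_i..r_j
    L = [[INF] * n for _ in range(n)]        # L[i][j]: best energy if r_i pairs with r_j

    def closed(a, b):
        """Energy of region a..b whose ends are paired (assumption, see lead-in)."""
        return L[a][b] if a < b else INF

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            a = alpha(rna[i], rna[j])
            if a < INF:
                best = a + xi(j - i - 1)                                  # hairpin
                best = min(best, a + ETA + closed(i + 1, j - 1))          # helical region
                for k in range(1, j - i - 2):
                    best = min(best, a + beta(k) + closed(i + k + 1, j - 1))   # bulge on i
                    best = min(best, a + beta(k) + closed(i + 1, j - k - 1))   # bulge on j
                for k1 in range(1, j - i - 2):                            # interior loop
                    for k2 in range(1, j - i - 2 - k1):
                        best = min(best, a + gamma(k1 + k2)
                                   + closed(i + k1 + 1, j - k2 - 1))
                L[i][j] = best
            e = min(E[i + 1][j], E[i][j - 1], L[i][j])
            for k in range(i + 1, j):                                     # split at k
                e = min(e, E[i][k] + E[k + 1][j])
            E[i][j] = e
    return E[0][n - 1] if n else 0.0

print(fold_with_loops("GGGAAACCC"))  # hairpin of size 3 plus two stacked pairs: -9.0
```

The nested loops over `k1` and `k2` inside the two loops over `i` and `j` are what produce the $O(n^4)$ running time.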

Protein Folding: Example Amino Acid
[Figure: an amino acid with the alpha carbon, amino group (H2N), carboxy group (COOH), and side chain (CH3) labeled]

Common Secondary Structures

Protein Folding Problem
Assumption for all protein folding prediction methods: the amino acid sequence completely and uniquely determines the folding.
Problem: Given the amino acid sequence of a protein, we would like to process it and determine where exactly the α-helices, β-sheets and loops are, and how they arrange themselves in motifs and domains.

Combinatorial approach
1. Enumerate all the possible foldings
2. Compute the free energy of each
3. Choose the one with minimum free energy

From what we know…
1. If we assume that the angles φ and ψ between the alpha carbon and the neighboring atoms take only 3 possible values each, a protein with 100 residues has $(3^2)^{100}$ possible configurations!
2. How do we compute the energy of a configuration? Factors: shape, size, polarity of the molecules, relative strength of the interactions at the molecular level, etc. There are too many factors and no agreed definition. This problem applies to secondary structures as well.
3. What do we know? Hydrophobic amino acids stay "inside" the protein and hydrophilic amino acids stay "outside" the protein. This is not enough information to make a prediction.
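The size of that search space is easy to make concrete. The enumeration rate below is a hypothetical figure chosen only for illustration.

```python
# Three allowed values for each of the two backbone angles: 3**2 states per residue.
states_per_residue = 3 ** 2
residues = 100
configurations = states_per_residue ** residues    # (3^2)^100 = 9^100

print(len(str(configurations)))   # 96 digits, i.e. roughly 2.7 x 10^95

# Hypothetical rate: 10^12 configurations scored per second.
seconds = configurations // 10 ** 12
years = seconds // (365 * 24 * 3600)
print(len(str(years)))            # number of digits in the year count
```

Whatever rate is assumed, the year count stays astronomically large, which is why exhaustive enumeration is ruled out.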

Conclusions
- No dynamic programming approach is known for protein secondary structure prediction
- Programs that use neural networks for pattern recognition, based on statistical properties of residues in proteins, are available
- They are not as good as we desire
- New techniques must be developed
- We discuss a branch and bound solution

Protein Threading Problem
1. Similar sequences should have similar structures: if A, with a known protein structure, is similar to B at the sequence level, the structure of B should be nearly the same as the structure of A
2. Certain proteins are different at the sequence level but are structurally related, i.e. they have different kinds of loops but similar cores
3. Approach used for the solution: Branch and Bound

Core threading
- Loops are structures that are neither helices nor β-sheets
- The protein core consists of helices and sheets
- Motifs are simple combinations of a few secondary structures (e.g. helix-loop-helix)

Protein threading problem definition
- Input: Given protein sequence A; core structural model M; score functions $g_1$, $g_2$.
- Output: A threading T. In short: align A to model M.
- Given:
- A: protein sequence of length n: $a_1, a_2, a_3, \dots, a_n$;
- M: m core segments $C_1, C_2, C_3, \dots, C_m$;
- $c_1, c_2, c_3, \dots, c_m$: lengths of the core segments;
- $l_1, l_2, l_3, \dots, l_{m-1}$: loop regions connecting the core segments;
- $l_1^{max}, l_2^{max}, l_3^{max}, \dots, l_{m-1}^{max}$: maximum lengths of the loop regions;
- $l_1^{min}, l_2^{min}, l_3^{min}, \dots, l_{m-1}^{min}$: minimum lengths of the loop regions;
- Properties of each amino acid;
- f, $g_1$, $g_2$: score functions to evaluate a threading;

In the score function f, $g_1$ and $g_2$ are based on the given model M: $g_1$ scores how well each segment corresponds to core segment i in the model, and $g_2$ deals with the interactions between segments. To solve the threading problem, we have to decide on $t_1, t_2, t_3, \dots, t_m$ so that the overall score is optimal (the branch and bound formulation treats it as a minimization). Thus the threading problem, or alignment problem, is converted to an optimization problem.
Output: T: $t_1, t_2, t_3, \dots, t_m$, the start locations of the core segments.
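The formula for f did not survive in this transcript, so the following is only one plausible shape for it, stated as an assumption: a sum of per-segment terms $g_1(i, t_i)$ plus pairwise terms $g_2(i, j, t_i, t_j)$. The toy `g1` and `g2` below are invented for the example.

```python
def threading_score(t, g1, g2):
    """Assumed additive score: per-segment terms plus pairwise interaction terms."""
    m = len(t)
    total = sum(g1(i, t[i]) for i in range(m))
    total += sum(g2(i, j, t[i], t[j])
                 for i in range(m) for j in range(i + 1, m))
    return total

# Toy scoring terms, purely illustrative.
def g1(i, ti):
    return ti                      # pretend later start positions cost more

def g2(i, j, ti, tj):
    return abs(tj - ti)            # pretend distant segments interact weakly

print(threading_score([1, 5, 9], g1, g2))  # 15 + (4 + 8 + 4) = 31
```

Any scoring scheme with this shape can be plugged into a search over the start positions $t_1, \dots, t_m$.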

Threading constraints
- spacing constraint
- order of the core constraint
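The constraint formulas are not in this transcript. One way to read the two constraints, given the quantities defined earlier, is that every loop length must lie between its minimum and maximum (spacing) and that core segments must appear in order inside the sequence without overlapping (order). The sketch below encodes that reading; it is an assumption, and `is_valid_threading` and its parameter names are hypothetical.

```python
def is_valid_threading(t, core_len, loop_min, loop_max, n):
    """Check a threading t (1-based start positions) under the assumed constraints."""
    m = len(t)
    if m == 0:
        return True
    # All core segments must fit inside the sequence a_1..a_n.
    if t[0] < 1 or t[-1] + core_len[-1] - 1 > n:
        return False
    for i in range(m - 1):
        # Residues left between the end of segment i and the start of segment i+1.
        loop_length = t[i + 1] - (t[i] + core_len[i])
        # Spacing: the loop length must be within its limits. With loop_min >= 0
        # this also enforces the order of the cores, since segments cannot overlap.
        if not loop_min[i] <= loop_length <= loop_max[i]:
            return False
    return True

print(is_valid_threading([1, 10, 20], [5, 4, 6], [2, 2], [8, 8], 30))  # True
print(is_valid_threading([1, 7, 20], [5, 4, 6], [2, 2], [8, 8], 30))   # False: loop too short
```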

Branch and Bound
- Assume we are minimizing f(s) and we already know the value f(s) for some candidate solutions in a set.
- Branch: divide the solution space according to some constraints. For example, partition X into $X_1$ (all solutions having a certain property) and $X_2$ (all solutions that do not)
- The partition should be implicit, i.e. you do not explicitly enumerate the solutions
- Bound: for every partition X, obtain a lower bound lb on the value of f(s) for every solution $s \in X$.
- If a known solution s has f(s) < lb, then we can discard all the candidate solutions in X, because s scores better than every solution in X; otherwise we explore the set X
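The branch and bound steps can be sketched generically. This is a minimal best-first version under assumptions of my own: sets are represented implicitly (here as integer intervals), and the caller supplies the bounding and splitting functions. The toy objective and all names are illustrative.

```python
import heapq
from itertools import count

def branch_and_bound(root, lower_bound, split, solution_of):
    """Minimize f over an implicitly represented solution space.

    lower_bound(X) must be <= f(s) for every s in X; split(X) returns smaller
    sets covering X; solution_of(X) returns (s, f(s)) if X holds one solution.
    """
    best_s, best_f = None, float("inf")
    tie = count()                       # tie-breaker so sets are never compared
    heap = [(lower_bound(root), next(tie), root)]
    while heap:
        lb, _, subset = heapq.heappop(heap)
        if lb >= best_f:
            continue                    # Bound: the known solution is at least as good
        single = solution_of(subset)
        if single is not None:
            s, f = single
            if f < best_f:
                best_s, best_f = s, f
            continue
        for part in split(subset):      # Branch
            part_lb = lower_bound(part)
            if part_lb < best_f:
                heapq.heappush(heap, (part_lb, next(tie), part))
    return best_s, best_f

# Toy problem: minimize f over the integers 1..100, with intervals (lo, hi) as sets.
TARGET = 37
def f(x):
    return (x - TARGET) ** 2 + 5

def interval_bound(iv):
    lo, hi = iv
    return 5 if lo <= TARGET <= hi else min(f(lo), f(hi))

def halve(iv):
    lo, hi = iv
    mid = (lo + hi) // 2
    return [(lo, mid), (mid + 1, hi)]

def single_point(iv):
    lo, hi = iv
    return (lo, f(lo)) if lo == hi else None

result = branch_and_bound((1, 100), interval_bound, halve, single_point)
print(result)  # (37, 5)
```

Only a handful of intervals are ever split; every interval that does not contain the minimizer is discarded by its lower bound without being enumerated.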

Branch and Bound: Issues
- Constructing the score function
- Calculating the lower bound
- Choosing the split segment
- Choosing the split point

Protein Threading: Branch and Bound
1. The set of all possible threadings is defined by the initial position bounds. This set is implicitly defined by the number of amino acids in A
2. Divide the possible threadings into smaller sets, and compute new position bounds for each set
3. Compute a quick score lower bound for each set of threadings
4. Keep re-dividing the set with the smallest lower bound, until the set size is 1.
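The four steps can be put together in a small sketch. Several things here are assumptions rather than the course's method: the score is taken to be a sum of $g_1$ and pairwise $g_2$ terms, the lower bound minimizes each term separately over the position bounds, the constraints are the loop-length reading of the spacing and order constraints, and the split is three-way at the midpoint of the widest segment. It also assumes at least one core segment and that every segment fits in the sequence. All names are hypothetical.

```python
import heapq
from itertools import count

def thread(n, core_len, loop_min, loop_max, g1, g2):
    """Best-first branch and bound over position bounds; returns (t, score) or None."""
    m = len(core_len)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]

    def valid(t):
        # Assumed constraints: segments fit in 1..n, loop lengths within limits.
        if t[0] < 1 or t[-1] + core_len[-1] - 1 > n:
            return False
        return all(loop_min[i] <= t[i + 1] - t[i] - core_len[i] <= loop_max[i]
                   for i in range(m - 1))

    def lower_bound(box):
        # Minimize every term separately over the box: never above the true score.
        lb = sum(min(g1(i, t) for t in range(lo, hi + 1))
                 for i, (lo, hi) in enumerate(box))
        for i, j in pairs:
            lb += min(g2(i, j, ti, tj)
                      for ti in range(box[i][0], box[i][1] + 1)
                      for tj in range(box[j][0], box[j][1] + 1))
        return lb

    def split(box):
        # Split the segment with the widest bounds at its midpoint: below, at, above.
        i = max(range(m), key=lambda s: box[s][1] - box[s][0])
        lo, hi = box[i]
        mid = (lo + hi) // 2
        parts = [(lo, mid - 1), (mid, mid), (mid + 1, hi)]
        return [box[:i] + ((a, b),) + box[i + 1:] for a, b in parts if a <= b]

    # Step 1: initial position bounds follow from the sequence length n.
    root = tuple((1, n - c + 1) for c in core_len)
    tie = count()
    heap = [(lower_bound(root), next(tie), root)]
    while heap:
        lb, _, box = heapq.heappop(heap)        # Step 4: smallest lower bound first
        if all(lo == hi for lo, hi in box):     # the set holds a single threading
            t = tuple(lo for lo, _ in box)
            if valid(t):
                return t, lb                    # for one threading, lb is its score
            continue
        for part in split(box):                 # Steps 2 and 3
            heapq.heappush(heap, (lower_bound(part), next(tie), part))
    return None

# Toy instance: two core segments of length 2 in a sequence of 10 residues.
targets = [3, 7]
best = thread(10, [2, 2], [1], [3],
              lambda i, t: abs(t - targets[i]),
              lambda i, j, ti, tj: abs((tj - ti) - 4))
print(best)  # ((3, 7), 0)
```

Because the lower bound of a set never exceeds the score of any threading inside it, the first valid single threading taken from the queue is optimal for this toy score. The bound used here is deliberately naive and recomputed from scratch for each set.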

2. Division in smaller sets: Branch and Bound

3. Compute the lower bound: Branch and Bound
- Given a set of threadings defined by position bounds, one possible score lower bound is:
[Equation: score lower bound computed from the position bounds]
- This is an approximation

[Figure: successive splits of the set of possible threadings over amino acids (1…100): (1…49) (50) (51…100), then (1…23) (24) ( ), then (25…29) (30) (34…49)]