

Protein structure prediction
May 26, 2011
HW #8 due today. Quiz #3 on Tuesday, May 31.
Learning objectives: Understand the biochemical basis of secondary structure prediction programs. Become familiar with the databases that hold secondary structure information. Understand neural networks and how they help to predict secondary structure.
Workshop: Predict the secondary structure of p53.
Homework #9: due June 2.

What is secondary structure?
Three major types:
Alpha-helical regions
Beta-strand regions
Coils, turns, extended (anything else)

Can we predict the final structure?

Some prediction methods
Ab initio methods: based on physical properties of amino acids and bonding patterns.
Statistics of amino acid distributions in known structures: Chou-Fasman.
Sequence similarity to sequences with known structure: PSIPRED.

Chou-Fasman
First widely used procedure.
Output: helix, strand or turn.
Percent accuracy: 60-65%.

PSI-BLAST-based secondary structure prediction (PSIPRED)
Three steps:
1) Generation of a position-specific scoring matrix.
2) Prediction of the initial secondary structure.
3) Filtering of the predicted structure.

Conformational parameters for α-helical, β-strand, and turn amino acids (from Chou and Fasman, 1978)

PSIPRED
Uses multiple aligned sequences for prediction.
Uses a training set of folds with known structure.
Uses a two-stage neural network to predict structure based on position-specific scoring matrices generated by PSI-BLAST (Jones, 1999).
The first network converts a window of 15 amino acids into a raw score of h (helix), e (sheet), c (coil) or terminus.
The second network filters the first output. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.
Can obtain a Q3 value of 70-78% (may be the highest achievable).

Neural networks
Computer neural networks are based on simulation of adaptive learning in networks of real neurons. Neurons connect to each other via synaptic junctions, which are either stimulatory or inhibitory. Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.

Neural networks (cont. 1)
The computer version of the neural network takes a set of inputs, the amino acids in the sequence, which are transmitted through a network of connections. At each layer, the inputs are numerically weighted and the combined result is passed to the next layer. Ultimately a final output, a decision of helix, sheet or coil, is produced.
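The weighting and passing-on described above can be sketched as a tiny feed-forward pass. This is only an illustration: the layer sizes, the weights, the sigmoid activation and the function names are assumptions made for the example, not the trained PSIPRED network, and bias terms are left out for brevity.

```python
import math

STATES = "HEC"  # helix, strand (sheet), coil


def sigmoid(x):
    """Squash a weighted sum into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, activation):
    """Each row of weights belongs to one unit: weight the inputs, sum, activate."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def predict_state(inputs, hidden_weights, output_weights):
    """Pass the inputs through one hidden layer and pick the strongest output."""
    hidden = layer(inputs, hidden_weights, sigmoid)
    outputs = layer(hidden, output_weights, lambda v: v)
    best = max(range(len(STATES)), key=lambda i: outputs[i])
    return STATES[best]


# Hypothetical toy network: 2 inputs, 2 hidden units, 3 outputs (H, E, C).
hidden_weights = [[2.0, 0.0], [-2.0, 0.0]]
output_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(predict_state([1.0, 0.0], hidden_weights, output_weights))  # prints H
```

In a real predictor the weights are not written by hand; they are adjusted during training until the outputs agree with known structures.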

Neural networks (cont. 2)
90% of the set of known structures was used for training. The remaining 10% was used to evaluate the performance of the neural network after the training session.
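A minimal sketch of such a 90/10 split, assuming the known structures are available as a simple list of identifiers. The identifiers, the fixed random seed and the function name are hypothetical, and the slides do not say how the original split was chosen.

```python
import random


def split_training_set(proteins, test_fraction=0.10, seed=0):
    """Shuffle a copy of the list and hold back a fraction for evaluation."""
    shuffled = list(proteins)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for proteins of known structure.
proteins = [f"protein_{i:02d}" for i in range(50)]
train, held_out = split_training_set(proteins)
print(len(train), len(held_out))  # prints 45 5
```

The held-out proteins must never be used to adjust the weights; otherwise the measured accuracy is too optimistic.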

Neural networks (cont. 3)
During the training phase, selected sets of proteins of known structure were scanned, and if the decisions were incorrect, the input weightings were adjusted by the software to produce the desired result. Training runs were repeated until the success rate was maximized. Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible, without duplications of structural types that may bias the decisions.

Neural networks (cont. 4)
An additional component of the PSIPRED procedure involves sequence alignment with similar proteins. The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.) To predict secondary structure accurately, one should place less weight on the tolerant positions, which contribute little to the structure, and more weight on the intolerant positions.

Input layer: 15 groups of 21 units (1 unit for each amino acid plus one specifying the end). Each row specifies an amino acid position.
The three outputs are helix, strand or coil.
Filtering network.
The input provides information on tolerant or intolerant positions (Jones, 1999).
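The 15 x 21 input layout can be sketched as follows. This is a simplified illustration that switches on a single unit per position; PSIPRED itself feeds position-specific scores from PSI-BLAST into these units. The function and constant names are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 15                    # residues seen at once
UNITS = len(AMINO_ACIDS) + 1   # 20 amino acids plus one "past the end" unit


def encode_window(sequence, center):
    """Return the 15 x 21 = 315 input values for the window centred on `center`."""
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        units = [0.0] * UNITS
        if 0 <= pos < len(sequence):
            units[AMINO_ACIDS.index(sequence[pos])] = 1.0
        else:
            units[-1] = 1.0    # window hangs over a terminus
        inputs.extend(units)
    return inputs


inputs = encode_window("MEETHAPYRGVCNNM", 0)
print(len(inputs))  # prints 315
```

Sliding `center` along the sequence gives one input vector, and therefore one prediction, per residue.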

Example of output from PSIPRED
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
Conf:
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD

How to calculate Q3?
Sequence:           MEETHAPYRGVCNNM
Actual structure:   CCCCCHHHHHHEEEE
PSIPRED prediction: CCCCCHHHHHHEEEH
Q3 = 14/15 x 100 = 93%
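The same calculation can be written as a short function: Q3 is the percentage of residues whose predicted state agrees with the actual state. The function name is chosen for this example.

```python
def q3(actual, predicted):
    """Percentage of positions where the predicted state (H, E or C) is correct."""
    if len(actual) != len(predicted):
        raise ValueError("structure strings must have the same length")
    if not actual:
        raise ValueError("structure strings must not be empty")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


actual = "CCCCCHHHHHHEEEE"
predicted = "CCCCCHHHHHHEEEH"
print(f"Q3 = {q3(actual, predicted):.0f}%")  # prints Q3 = 93%
```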

Recognizing motifs in proteins. PROSITE is a database of protein families and domains. Most proteins can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

PROSITE database
Contains 1612 documentation entries.
Signature matches are found by scanning the PROSITE database with your query.
A "signature" of a protein allows one to place the protein within a specific functional class based on structure and/or function.
An example of a documentation entry in PROSITE is:

Signatures are produced from profiles and patterns.
Profile: "a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence."

Sequences in one profile and the PSSM associated with the profile
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
PSSM columns: A C D E F G H I K L M N P Q R S T V W Y
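To show how aligned sequences turn into position-specific scores, here is a minimal sketch that counts residues per column of the seven sequences and scores a candidate window against those counts. Raw counts are an assumption made to keep the example short: a real PROSITE profile uses weighted scores and gap costs, and the function names are hypothetical.

```python
from collections import Counter

ALIGNED = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]


def column_counts(sequences):
    """One Counter per alignment column: how often each residue occurs there."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("aligned sequences must have equal length")
    return [Counter(column) for column in zip(*sequences)]


def score(window, counts):
    """Sum the column counts of the residues in an ungapped window."""
    return sum(column[residue] for residue, column in zip(window, counts))


counts = column_counts(ALIGNED)
print(counts[0]["F"])               # prints 5 (F in five of seven sequences)
print(score("FKLLSHCLLV", counts))  # prints 26
```

A window made of residues that never occur in the alignment scores 0, which is why real profiles add pseudocounts or substitution-matrix information before converting counts to scores.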

How are the patterns constructed?
ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..
Sequences necessary for structure or function are aligned manually by experts in the field. Then a pattern is created:
A-T-H-[DE]-X-V-X(4)-{ED}
This pattern is translated as: Ala, Thr, His, [Asp or Glu], any, Val, any, any, any, any, any but Glu or Asp.
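One way to make the pattern syntax concrete is to translate it into a regular expression. The sketch below handles only the elements used on these slides (single residues, X, [..], {..} and repeat counts); the function name and the test sequence are hypothetical, and terminus anchors are not supported.

```python
import re

# One pattern element: a residue, x/X, [allowed], or {forbidden}, plus an optional count.
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as A-T-H-[DE]-X-V-X(4)-{ED}."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            piece = "."                        # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # any but these
        else:
            piece = core                       # one residue or an [..] choice
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
print(regex)  # prints ATH[DE].V.{4}[^ED]

# Hypothetical query sequence, made up for this example.
hit = re.search(regex, "MKATHDGVLLLLKP")
print(hit.start() if hit else None)  # prints 2
```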

Example of a pattern in a PROSITE record
ID ZINC_FINGER_C3HC4; PATTERN.
PA C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]

Scanning the PROSITE database
"Scan a sequence against PROSITE patterns and profiles" lets the user search a query sequence against the patterns and profiles in the PROSITE database. It uses dynamic programming to determine optimal alignments. If an alignment produces a high score (a hit), the program shows the region of the query that contains the pattern and, if available, a reference to the 3-D structure database.

Example of output from a PROSITE scan

RPS-BLAST
Reverse PSI-BLAST, or RPS-BLAST (rpsblast), is a program that searches one or more query protein sequences against a database of position-specific scoring matrices (PSSMs). The PSSMs are derived from conserved protein sequences that have known functions or structures.

3D structure data
The largest 3D structure database is the Protein Data Bank (PDB).
It contains over 20,000 records.
Each record contains 3D coordinates for macromolecules.
80% of the records were obtained from X-ray diffraction studies, 20% from NMR.

Part of a record from the PDB:
ATOM 1 N ARG A N
ATOM 2 CA ARG A C
ATOM 3 C ARG A C
ATOM 4 O ARG A O
ATOM 5 CB ARG A C
ATOM 6 CG ARG A C
ATOM 7 CD ARG A C
ATOM 8 NE ARG A N
ATOM 9 CZ ARG A C
ATOM 10 NH1 ARG A N
ATOM 11 NH2 ARG A N
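In a PDB flat file, each ATOM line is laid out in fixed columns, so it is read by character position rather than by splitting on spaces. The sketch below builds one sample line and parses it. The residue number, coordinates, occupancy and B-factor in the sample are made-up values, because the slide text does not preserve them, and the function name is hypothetical.

```python
def parse_atom(line):
    """Read the main fields of a fixed-column PDB ATOM line."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "element": line[76:78].strip(),    # columns 77-78
    }


# Sample line assembled with explicit widths so the columns are exact.
# All numeric values here are hypothetical.
SAMPLE = (
    f"ATOM  {1:>5} {' N':<4} {'ARG':>3} {'A'}{1:>4}    "
    f"{11.281:8.3f}{37.769:8.3f}{28.5:8.3f}{1.0:6.2f}{20.0:6.2f}{'':10}{'N':>2}"
)

atom = parse_atom(SAMPLE)
print(atom["name"], atom["res_name"], atom["chain"], atom["x"])  # prints N ARG A 11.281
```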

Quiz #3 prep
BLAST: three steps; gapped BLAST; heuristic program; uses the Smith-Waterman (S-W) algorithm for final scoring.
CLUSTAL W: pairwise alignments; difference matrix; guide tree; importance of having highly similar sequences.
Secondary structure prediction: Chou-Fasman; PSIPRED (good for secondary structure).
Protein analysis: ProScan; RPS-BLAST.