Protein Secondary Structure Prediction Using Data Mining Tool C5. Meiliu Lu †, Du Zhang †, Hongjun Xu †, Ken Tse-yau Lau ‡, and Li Lu § (ICTAI-99, Chicago, 11/9/99)

Presentation transcript:

Slide 1: Protein Secondary Structure Prediction Using Data Mining Tool C5
ICTAI-99, Chicago, 11/9/99
Meiliu Lu †, Du Zhang †, Hongjun Xu †, Ken Tse-yau Lau ‡, and Li Lu §
† Dept. of Computer Science, California State University
‡ Intel Corporation, Folsom CA
§ Sierra Systems Consultants Inc., Washington DC

Slide 2: Introduction
The advancement of medical science depends critically on understanding the structures of proteins, the fundamental molecules of all living organisms. Proteins have different structures depending on their location (intracellular, extracellular, membrane, cytosolic, nuclear) and function (structural, enzyme, antibody, etc.). All protein molecules are polymers built up from 20 different amino acid residues linked end to end by peptide bonds.

Slide 3: Protein Structures
Primary structure is the linear sequence of amino acids. Secondary structure is the spatial relationship of amino acid residues that are close to one another in the linear sequence. Tertiary structure is the spatial relationship of residues that are far apart in the linear sequence. Quaternary structure is the way the polypeptide chains of some proteins are packed together.

Slide 4: The Secondary Structure
The function of every protein depends on its tertiary (3D) structure. Secondary structure is the pivotal link between the linear amino acid sequence of a protein and its final 3D structure. Determining a protein's secondary structure from its primary structure would greatly help us unlock its 3D structure.

Slide 5: Types of Secondary Structure
α-helix: a rod-like structure.
β-sheet: several regions of the polypeptide chain aligned side by side.
Turns: parts where the direction of the polypeptide chain changes.
Coil: any part of the polypeptide chain not belonging to the above three.

Slide 6: Protein Structure Example 1: p21Ras

Slide 7: Protein Structure Example 2: MHC1

Slide 8: State of the Art in Protein Secondary Structure Prediction
Physical methods such as X-ray crystallography and nuclear magnetic resonance are slow and expensive. There are three broad groups of secondary structure prediction methods:
– empirical statistical methods, accuracy around 50%
– stereochemical criteria based methods, accuracy 50%
– machine learning based methods, accuracy up to %

Slide 9: The Challenge
Experimental determination of 3D structure is slow, while amino acid sequence data accumulate quickly. Different amino acid sequences may yield similar 3D structures. It is very difficult to predict the 3D structure of an unknown protein from its sequence.

Slide 10: Our Research Experiment
The goal is to predict the secondary structure of an unknown protein, Spermidine/Spermine N1-Acetyltransferase (SSAT), a target of cancer chemotherapy. The prediction task uses C5, a machine learning tool by J. Ross Quinlan that is based on a decision tree learning method.

Slide 11: Comparison of ML Tools

Slide 12: Prediction Considerations
Use of functional similarity and sequence homology in selecting training proteins. Incorporation of amino acid hydrophobicity into the process. Choice of training set sizes and sequence attribute sizes.

Slide 13: Selection of Training Proteins
A set (FS) of 23 known proteins that are functionally similar to SSAT is selected. A set (SH) of 32 known proteins that have sequence homology to SSAT is selected. A third set (MX) is constructed that consists of proteins from both FS and SH.

Slide 14: Incorporation of Hydrophobicity
The hydrophobic character of each amino acid residue is incorporated into the prediction process. The levels considered in our experiments are none (NH), residual-level (RH) and atomic-level (AH). Two methods are used to calculate the values.

Slide 15: Decision Tree Based Learning
Collect a large set of examples. Divide it into two disjoint sets: a training set (TR) and a test set (TT). Use the learning algorithm with TR to generate decision trees (if-then rules). Measure the percentage of examples in TT that are correctly classified by the trees (rules). Repeat the above steps for different sizes of TR and for different randomly selected TR of each size.
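The evaluation loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' setup: C5 is a separate tool, so a placeholder majority-class learner stands in for it, and the names `split_cases`, `train_majority`, `accuracy` and `evaluate` are hypothetical.

```python
import random
from collections import Counter


def split_cases(cases, train_size, test_size, rng):
    """Randomly draw disjoint training (TR) and test (TT) sets."""
    shuffled = cases[:]
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:train_size + test_size]


def train_majority(training_set):
    """Placeholder learner (NOT C5): always predicts the most common class in TR."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(classifier, test_set):
    """Fraction of test cases whose class is predicted correctly."""
    correct = sum(1 for attrs, label in test_set if classifier(attrs) == label)
    return correct / len(test_set)


def evaluate(cases, train_sizes, test_size, repeats, learner=train_majority, seed=0):
    """Average accuracy per training-set size over several random draws."""
    rng = random.Random(seed)
    results = {}
    for size in train_sizes:
        scores = []
        for _ in range(repeats):
            tr, tt = split_cases(cases, size, test_size, rng)
            scores.append(accuracy(learner(tr), tt))
        results[size] = sum(scores) / len(scores)
    return results
```

Any function that maps a training set to a classifier can be passed as `learner`, so a real decision tree learner could replace the placeholder without changing the loop.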

Slide 16: Training Sets and Test Sets
The total numbers of cases for FS, SH and MX are 6288, 7165 and 13453, respectively. Selection of training set and test set:
– Category 1: equal sized training and test sets.
– Category 2: 20% of the total cases form the test set; the training set varies in size (25%, 50%, 75% and 100% of the remaining cases).
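The set sizes implied by the two categories can be worked out from the case totals. The sketch below assumes integer truncation when a percentage does not divide evenly and, for Category 1, drops the leftover case when a total is odd; the slide does not say how the authors rounded, so the computed sizes are illustrative only. The function names are hypothetical.

```python
# Case totals as stated on the slide.
TOTAL_CASES = {"FS": 6288, "SH": 7165, "MX": 13453}


def category_one(total):
    """Equal sized training and test sets (assumption: odd leftover case is dropped)."""
    half = total // 2
    return {"train": half, "test": half}


def category_two(total, test_pct=20, train_pcts=(25, 50, 75, 100)):
    """20% test set; training sets are fractions of the remaining cases.

    Assumption: sizes are truncated to whole cases.
    """
    test = total * test_pct // 100
    remaining = total - test
    return {"test": test, "train": [remaining * p // 100 for p in train_pcts]}


if __name__ == "__main__":
    for name, total in TOTAL_CASES.items():
        print(name, category_one(total), category_two(total))
```

Note that the FS and SH totals add up to the MX total, which is consistent with MX being built from both sets.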

Slide 17: Training/Test Sets in Category
[Table: size of the test set and sizes of training sets one through four, for MX, FS and SH; the values are not preserved in this transcript.]

Slide 18: Sequence Attribute Sizes
The size of the sequence attributes indicates how many neighboring amino acid residues are included in a C5 case. Eight different sizes are considered in our experiments: 5, 9, 13, 17, 21, 25, 29, and 33.
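One common way to turn a sequence into fixed-size cases is a sliding window centered on the residue being classified. The sketch below assumes that layout; the slide does not spell out the exact case format, so the centering, the `PAD` filler for positions past the ends of the chain, and the single-letter class labels are all assumptions made for illustration.

```python
# The eight sequence attribute sizes listed on the slide: 5, 9, ..., 33.
WINDOW_SIZES = list(range(5, 34, 4))

PAD = "-"  # hypothetical filler for positions beyond either end of the chain


def make_cases(sequence, structure, window_size):
    """Build one (attributes, class) case per residue.

    sequence  : amino acid string, one letter per residue
    structure : class label per residue, same length (labels are hypothetical)
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    if window_size % 2 == 0:
        raise ValueError("window size must be odd so the window has a center")
    half = window_size // 2
    padded = PAD * half + sequence + PAD * half
    cases = []
    for i, label in enumerate(structure):
        # The window starting at i in the padded string is centered on residue i.
        window = padded[i:i + window_size]
        cases.append((list(window), label))
    return cases
```

With a window of 5, for example, each case holds the target residue plus two neighbors on each side, and larger sizes simply widen the context.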

Slide 19: Results
Six hundred runs are performed, each producing a decision tree as a classifier. The runs vary the following factors:
– Different data sets (FS, SH, MX).
– Hydrophobicity attributes (NH, RH, AH).
– Hydrophobicity value calculating methods.
– Varying training set sizes and sequence attribute sizes.

Slide 20: Results (continued)
Results obtained using training cases from SH are consistently better. The differences among the three data sets (FS, SH, MX) are significant. Incorporating hydrophobicity, with either calculation method, does not improve predictive accuracy. The error rate decreases as the training set size increases. There is no significant difference among the error rates of different sequence attribute sizes.

Slide 21: Average Error Percentage
[Table: average error percentage in Category one and Category two, for MX, FS and SH; the values are not preserved in this transcript.]

Slide 22: Predicted Secondary Structure of SSAT

Slide 23: Conclusions
C5 can be used to predict protein secondary structure. The prediction accuracy depends critically on the selection of training data. Training data selected by sequence homology give better results than data selected by functional similarity, and hydrophobicity does not help. The SH classifier achieves 75% accuracy.

Slide 24: Future Work
Improve predictive accuracy by setting new data selection criteria. Develop an on-line service for protein structure prediction.