Copyright © 2003 Limsoon Wong. Recognition of Protein Features. Limsoon Wong, Institute for Infocomm Research. BI6103 guest lecture on ?? March 2004.


Lecture Plan
Membrane proteins
Subcellular localization

Recognition of Transmembrane Helices

Eukaryotic Cells
Eukaryotic cells have membrane-bound compartments with specialized functions.

Lipids & Membrane
A membrane is a double layer of lipids and associated proteins that defines a subcellular compartment or encloses the cell.
Lipids consist of a “polar head group” and long-chain fatty acids.
This dual nature promotes the formation of lipid bilayers.
The “hydrophobic tails” are shielded from the aqueous environment.
Water-soluble (i.e., charged or polar) molecules can't pass through this impermeable barrier.
Permeability across the bilayer is regulated by membrane proteins that span the bilayer and function like channels or pores.

Membrane Proteins
Two types of membrane proteins: integral vs peripheral.
Two types of integral membrane proteins: all-α vs β-barrel.

Topography & Topology
Topography: predict the location of transmembrane segments.
Topology: predict the location of the N- and C-termini with respect to the lipid bilayer.
We focus on topography prediction for all-α membrane proteins.

Datasets
Jayasinghe et al., Protein Sci, 10, 2001
–59 high-resolution membrane proteins
Moller et al., Bioinformatics, 16, 2000
–151 low-resolution membrane proteins
Jones et al., Biochem., 33(10), 1994
–38 multi-spanning and 45 single-spanning membrane proteins
–topologies experimentally determined
Sonnhammer et al., ISMB, 6, 1998
–108 multi-spanning and 52 single-spanning membrane proteins
–most topologies experimentally determined, but less reliably than in Jones et al.

Monne et al., JMB, 288, 1999: Turn Propensity Scale for TM Helices
The E. coli Lep protein contains two TM domains (H1, H2) and a C-terminal domain P2.
Translocation of P2 to the lumenal side of the ER is easy to test by glycosylation.
Replace H2 by the 40-residue poly-L segment LIK₄L₂₁XL₇VL₁₀Q₃P.
The poly-L segment can form either one long TM helix or two closely spaced TM helices, depending on what is substituted for X.

Monne et al., JMB, 288, 1999: Turn Propensity Scale for TM Helices
Using the poly-L segment, measure the “turn” propensity of the 20 amino acids by substituting each of them for the X in the poly-L segment.
Hydrophobic residues (I, V, L, F, C, M, A) do not induce a turn.
Charged and polar residues (except S & T) induce a turn.
Exercise:
–What are the charged/polar residues?
–What could be the reason that S & T do not induce a turn?
Figure: glycosylated vs non-glycosylated products.

Monne et al., JMB, 288, 1999: Turn Propensity Scale for TM Helices
In all-α membrane proteins,
–hydrophobic residues prefer the membrane environment and have low turn propensity
–charged & polar residues induce turn formation to avoid the membrane interior
⇒ prediction of TM helices
⇒ distinction of 1 long TM helix vs 2 closely spaced TM helices

Weiss et al., ISMB, 1, 1993: Hydrophobicity Approach
The inside of a cellular membrane is hydrophobic.
A segment of a protein that spans the membrane is expected to contain many hydrophobic amino acids.
⇒ Locate segments that have a high average “hydrophobicity” score (hp).

Weiss et al., ISMB, 1, 1993: Hydrophobicity Approach
Find a segment of 10 to 70 aa with hp > 0.71.
Expand it to a longer segment with hp > 0.35.
Mark this segment as TM.
Repeat the above, starting from the position after the previous segment.
Caveats:
–may be unable to distinguish the hydrophobic core of non-membrane proteins from transmembrane regions
–what are the right thresholds? (The thresholds are adjustable.)
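The seed-and-extend procedure above can be sketched as follows. This is a minimal illustration, not the implementation of Weiss et al.: it assumes the Kyte-Doolittle hydropathy scale, a fixed 10-residue window, extension to the right only, and illustrative thresholds (1.6 and 1.0) chosen for that scale rather than the 0.71 and 0.35 quoted on the slide. The names `KD`, `window_hp`, and `find_tm_segments` are hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed scale, not the one on the slide).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hp(seq, start, end):
    """Average hydropathy of seq[start:end]; unknown residues count as 0."""
    return sum(KD.get(a, 0.0) for a in seq[start:end]) / (end - start)


def find_tm_segments(seq, win=10, seed_hp=1.6, extend_hp=1.0, max_len=70):
    """Return half-open (start, end) segments predicted as transmembrane."""
    seq = seq.upper()
    segments = []
    pos = 0
    while pos + win <= len(seq):
        if window_hp(seq, pos, pos + win) <= seed_hp:
            pos += 1          # no seed here, slide the window by one residue
            continue
        start, end = pos, pos + win
        # Extend while the trailing window stays above the lower threshold.
        while (end < len(seq) and end - start < max_len
               and window_hp(seq, end + 1 - win, end + 1) > extend_hp):
            end += 1
        segments.append((start, end))
        pos = end             # restart after the previous segment
    return segments
```

Lowering `extend_hp` lengthens the predicted segments at the cost of more false-positive residues, which is the trade-off the caveat about thresholds refers to.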

An Example: Bacteriorhodopsin
  1 gigtllmlig tfyfiargwg vtdkkareyy aitilvpgia saaylsmffg iglttvevag
 61 maepleiyya ryadwlfttp lllldlalla nadrttigtl igvdalmivt gligalshtp
121 larytwwlfs tiaflfvlyy lltvlrsaaa elsedvqttf ntltalvavl wtaypilwii
181 gtegagvvgl gvetlafmvl dvta
7 transmembrane helices

An Example: Bacteriorhodopsin
After applying the hydrophobicity scale...

An Example: Bacteriorhodopsin
Compute hydrophobicity score, hp > 7
TM identified: 6/7, TM FP: 0
TM residue identified: 62/117, TM residue FP: 4

An Example: Bacteriorhodopsin
Expand segment, maintain hp > 5, avoid low hydrophobicity
TM identified: 6/7, TM FP: 0
TM residue identified: 100/117, TM residue FP: 15

Sonnhammer et al., ISMB, 6, 1998: TMHMM, an HMM Approach
There are 3 main locations of a residue:
–TM helix core (viz., in the hydrophobic tail region of the membrane)
–TM helix cap (viz., in the head region of the membrane), on the cytoplasmic vs non-cytoplasmic side of the helix core
–loops: cytoplasmic vs non-cytoplasmic (short) vs non-cytoplasmic (long)
⇒ So it needs an HMM with 7 states
Exercise: What is the 7th state for?

Sonnhammer et al., ISMB, 6, 1998: TMHMM, Architecture
Each state has an associated probability distribution over the 20 amino acids, characterizing the variability of amino acids in the region it models.

Sonnhammer et al., ISMB, 6, 1998: TMHMM, Architecture
The first 3 and last 2 core states have to be traversed, but all other core states can be bypassed.
This models core regions of variable length, with a minimum of 5 residues (the 3 + 2 mandatory states).

Sonnhammer et al., ISMB, 6, 1998: TMHMM, Architecture
The states of the globular, loop, & cap regions.
The caps are 5 residues each, so a helix is 10 residues longer than its core.
Figure callouts: to model the bias in amino acid usage near the cap; to model a neutral amino acid distribution.

Sonnhammer et al., ISMB, 6, 1998: TMHMM, Training the HMM
Stage 1: Baum-Welch is used for maximum likelihood estimation from “diluted” labeled training data. As the precise end of a TM helix is only approximately known, we “dilute” the labels by unlabeling 3 residues on each side of a helix boundary.
Stage 2: Baum-Welch is used for maximum likelihood estimation from “relabeled” training data. The original training data are diluted by unlabeling 5 residues on each side of a helix boundary. The model from Stage 1 is used to produce the “relabeled” training data by relabeling this part under the constraints of the remaining labels.
Stage 3: The model from Stage 2 is further tuned by a method for “discriminative” training, to maximize the probability of correct prediction (Krogh, ISMB, 5, 1997).
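The “dilution” of labels around helix boundaries can be illustrated with a small sketch. It assumes per-residue labels written as a string with `M` for membrane helix and `i`/`o` for inside/outside loops, and uses `?` for unlabeled residues; these label characters and the function name `dilute_labels` are assumptions for illustration, not TMHMM's actual data format.

```python
def dilute_labels(labels, k=3, unknown="?"):
    """Unlabel k residues on each side of every helix boundary.

    labels: string of per-residue labels, 'M' marks helix residues.
    k: 3 for Stage 1 and 5 for Stage 2, following the slide.
    """
    out = list(labels)
    for i in range(1, len(labels)):
        prev_in_helix = labels[i - 1] == "M"
        cur_in_helix = labels[i] == "M"
        if prev_in_helix != cur_in_helix:
            # The boundary lies between i-1 and i: blank k residues per side.
            for j in range(max(0, i - k), min(len(labels), i + k)):
                out[j] = unknown
    return "".join(out)


print(dilute_labels("iiiiiMMMMMMMMMMooooo"))
```

The unlabeled positions are the ones Baum-Welch (Stage 1) or the Stage 1 model (Stage 2) is free to assign, while the remaining labels act as constraints.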

Krogh, ISMB, 5, 1997: Discriminative HMM Training

Sonnhammer et al., ISMB, 6, 1998: TMHMM, Example
Figure legend: non-cytoplasmic, cytoplasmic, TM segment.
Datasets: Jones et al., Biochem., 33(10), 1994; Sonnhammer et al., ISMB, 6, 1998

Sonnhammer et al., ISMB, 6, 1998: TMHMM, Accuracy (10-CV)
Measures reported: all TM segments & their orientation correctly predicted; all TM segments correctly predicted, ignoring orientation; precision.
Datasets: Jones et al.; Sonnhammer et al.

ENSEMBLE
Martelli et al., Bioinformatics, 19:i205--i211, 2003
Figure: NN, HMM1, and HMM2 feed into ENSEMBLE.

ENSEMBLE: The Neural Network Part
The NN part is a cascade of feed-forward back-propagation neural networks, a la Rost et al., Protein Science, 1995.
Figure labels: 17 * 20 input units; hidden units h1, h2, ..., h5; input layer 17*2 inputs; hidden units; HMM; LOOP.

ENSEMBLE: The HMM1 Part
HMM1 models the hydrophobic nature of most TM helices, a la Krogh et al., JMB 2001 & Sonnhammer et al., ISMB 1998.

ENSEMBLE: The HMM2 Part
HMM2 models TM helices that are a mix of hydrophobic and hydrophilic residues, a la Martelli et al., Bioinformatics 2002.

ENSEMBLE: Predicting if a residue is in TM
$\Delta NN(p,i) = NN(H,p,i) - NN(L,p,i)$
$\Delta HMM_1(p,i) = AP_1(H,p,i) - AP_1(I,p,i) - AP_1(O,p,i)$
$\Delta HMM_2(p,i) = AP_2(H,p,i) - AP_2(I,p,i) - AP_2(O,p,i)$
$E(p,i) = (\Delta NN(p,i) + \Delta HMM_1(p,i) + \Delta HMM_2(p,i)) / 3$
Here p is the protein, i is the position, H is helix, and L is loop (inner I, outer O).
E(p,i) > 0 means residue i of protein p is in a TM helix.
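The combination rule can be written out as a short sketch. It assumes each Δ term is the helix score minus the loop score(s) of the same predictor, which matches the statement that E(p,i) > 0 indicates a helix residue. The dictionary layout and the function names `ensemble_score` and `in_tm_helix` are hypothetical.

```python
def ensemble_score(nn, ap1, ap2):
    """Average the three helix-minus-loop margins for one residue.

    nn:  dict with keys 'H' (helix) and 'L' (loop) from the neural network.
    ap1: dict with keys 'H', 'I', 'O' from HMM1 (helix, inner loop, outer loop).
    ap2: same layout as ap1, from HMM2.
    """
    d_nn = nn["H"] - nn["L"]
    d_hmm1 = ap1["H"] - ap1["I"] - ap1["O"]
    d_hmm2 = ap2["H"] - ap2["I"] - ap2["O"]
    return (d_nn + d_hmm1 + d_hmm2) / 3


def in_tm_helix(nn, ap1, ap2):
    """A residue is called TM helix when the ensemble score is positive."""
    return ensemble_score(nn, ap1, ap2) > 0
```

Because the three margins are simply averaged, a confident vote from one predictor can outweigh two weak votes in the other direction.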

Ensemble: Topography Prediction
Fariselli et al., Bioinformatics, 2003
Figure: NN, HMM1, and HMM2 feed into ENSEMBLE, whose output is passed to MaxSubSeq.
Figure callouts: a TM helix found by MaxSubSeq that would be missed without it; taking this path means positions m to j form a helix.

Ensemble: Topography Prediction Results
A prediction is considered correct if (a) the number of TM segments is correct and (b) the overlap between a predicted and a real TM segment is > 8 aa.
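The correctness criterion can be made concrete with a short sketch. It assumes segments are half-open `(start, end)` index pairs and that predicted and real segments are paired in sequence order; the pairing rule and the function names are assumptions, since the slide does not spell them out.

```python
def overlap(a, b):
    """Number of positions shared by two half-open segments (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def prediction_correct(predicted, observed, min_overlap=8):
    """True if the segment counts match and every paired overlap exceeds min_overlap."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(overlap(p, o) > min_overlap for p, o in pairs)
```

Under this criterion a prediction with slightly shifted helix ends still counts as correct, while a missed or extra helix does not.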

Topology Prediction: Positive-Inside Rule
Gavel et al., FEBS, 282:41--46, 1991
Positively charged residues (Lys and Arg) are enriched more than 2-fold in stromal vs luminal loops.
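One simple way to apply the positive-inside rule once the TM segments are known is to count Lys and Arg in alternating loops and place the richer set on the inside. The sketch below is an illustration under that assumption, not the procedure used by ENSEMBLE; the function name `predict_n_terminus` and the tie-breaking choice are hypothetical.

```python
def predict_n_terminus(seq, tm_segments):
    """Return 'in' or 'out' for the N-terminus.

    seq: protein sequence.
    tm_segments: sorted, non-overlapping half-open (start, end) TM segments.
    """
    # Loop boundaries: start of sequence, helix ends, end of sequence.
    bounds = [0] + [x for seg in tm_segments for x in seg] + [len(seq)]
    loops = [seq[bounds[i]:bounds[i + 1]].upper() for i in range(0, len(bounds), 2)]

    def positives(loop):
        return loop.count("K") + loop.count("R")

    # Loops 0, 2, 4, ... lie on the same side of the membrane as the N-terminus.
    same_side = sum(positives(l) for l in loops[0::2])
    other_side = sum(positives(l) for l in loops[1::2])
    # Ties are resolved as 'in'; this is an arbitrary choice.
    return "in" if same_side >= other_side else "out"
```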

Topology Prediction: Ensemble
“Positive-inside” rule

Ensemble: Topology Prediction Results

Short Break

Subcellular Localization

Compartments and Sorting
Eukaryotic cells require that proteins be targeted to their subcellular destinations.
Protein sorting is determined by specific amino acid sequences, or “signals”, within the protein.
The secretory pathway targets proteins to the plasma membrane and to some membrane-bound organelles such as lysosomes, or exports proteins from the cell.

Secretory Pathway
The secretory pathway consists of the endoplasmic reticulum (ER), the Golgi apparatus, and transport vesicles.
The transport vesicles carry proteins from one compartment to another.
Exocytosis is mediated by fusion of secretory vesicles with the plasma membrane.
Endocytosis is the opposite of exocytosis and involves the uptake of extracellular material by pinching off vesicles from the plasma membrane.
The contents of the endocytic vesicles are delivered to the lysosomes by membrane fusion.
Lysosomes contain hydrolytic enzymes that break down macromolecules into smaller subunits, which the cell can use for its own biosynthesis.

Datasets
Reinhardt & Hubbard, NAR, 26, 1998
–2427 eukaryotic proteins for 4 locations (cytoplasmic, extracellular, nuclear, & mitochondrial)
–997 prokaryotic proteins for 3 locations (cytoplasmic, extracellular, & periplasmic)
Park & Kanehisa, Bioinformatics, 19, 2003
–7589 eukaryotic proteins from 709 organisms for 12 locations (chloroplast, cytoplasmic, cytoskeleton, ER, extracellular, golgi, lysosomal, mitochondrial, nuclear, peroxisomal, plasma membrane, vacuolar)
Chou & Cai, JBC, 277, 2002
–2191 proteins for 12 locations
Emanuelsson et al., JMB, 300, 2000
Gardy et al., NAR, 31, 2003

Common Eukaryotic Protein Sorting Signals
For a comprehensive list of cellular localization sites, see

Schematic View of Sorting Signals
Figure labels: cleavage site, ~25aa

Sequence Logos of SP, mTP, & cTP
SP: signal peptide
mTP: mitochondrial transfer peptide
cTP: chloroplast transit peptide

Neural Network Approach: TargetP
Emanuelsson et al., JMB, 300, 2000
cTP, mTP, SP networks
–4 hidden units
–feedforward NNs
–input windows: 55aa (cTP), 35aa (mTP), 27aa (SP), sparsely encoded
Integrating network
–0 hidden units
–feedforward NN
–input is taken from the outputs of the cTP, mTP, SP networks over 100aa at the N-terminal
cTP: chloroplast transit peptide, mTP: mitochondrial transfer peptide, SP: signal peptide
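“Sparsely encoded” input means each residue in the window becomes a one-hot vector. The sketch below shows that encoding under the assumption of a plain 20-letter alphabet with unknown residues mapped to all zeros; TargetP's exact alphabet and padding are not given on the slide, and the names `AMINO_ACIDS` and `sparse_encode` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(window):
    """One-hot encode a sequence window into a flat list of 20 * len(window) values."""
    vec = []
    for aa in window.upper():
        # Exactly one position is 1.0 per known residue; unknown residues give all zeros.
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec


# A 27-residue SP window would yield 27 * 20 = 540 network inputs.
sp_inputs = sparse_encode("MKTLLLTLVVVTIVCLDLGYTRRSPAE"[:27])
```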

TargetP: Performance
Dataset: Emanuelsson et al., JMB, 2000

Expert System Approach: PSORT
Horton & Nakai, ISMB, 1997
Figure: a simplified version of the decision tree that PSORT uses to check and reason over various sorting signals.

A Refinement: PSORT-B
Gardy et al., NAR, 31, 2003
Modules: SCL-BLAST, Motifs, HMMTOP, Outer Membrane Protein, SubLocC, Signal Peptides
A Bayesian Network combines the module outputs and returns localization sites or “unknown”.
Sites considered
–cytoplasm
–inner membrane
–periplasm
–outer membrane
–extracellular space

PSORT-B: SCL-BLAST
Homology to a protein of known localization is a good indicator of a protein’s actual localization site.
⇒ BLAST the target protein against a database of proteins whose localization sites are known
⇒ Return the localization sites of hits at an E-value of 10e-10 over 80% of length
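The hit-filtering step can be sketched as below. The sketch does not call BLAST; it assumes the hits have already been parsed into dictionaries with hypothetical keys `evalue`, `align_len`, and `site`, and it interprets “over 80% of length” as alignment length relative to the query, which is an assumption.

```python
def scl_blast_sites(hits, query_len, max_evalue=10e-10, min_coverage=0.8):
    """Return the localization sites of hits that pass both cut-offs.

    hits: iterable of dicts with keys 'evalue', 'align_len', 'site' (assumed layout).
    query_len: length of the target protein.
    """
    sites = []
    for hit in hits:
        significant = hit["evalue"] <= max_evalue
        covers_query = hit["align_len"] / query_len >= min_coverage
        if significant and covers_query and hit["site"] not in sites:
            sites.append(hit["site"])
    return sites
```

An empty result means the module abstains, leaving the decision to the other PSORT-B modules.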

PSORT-B: Motifs
Some motifs in PROSITE may be able to identify subcellular localization with 100% precision.
⇒ Scan the target protein against a database of such motifs (28 such 100%-precision motifs are known)
⇒ Return the localization sites corresponding to the motif hits

PSORT-B: HMMTOP
An α-helical transmembrane region is a reliable indicator of localization to the inner membrane.
⇒ Scan the target protein for transmembrane α-helices using HMMTOP
⇒ Return the localization site as “inner membrane” if >2 α-helices are found

PSORT-B: Outer Membrane Proteins
Outer-membrane proteins have a characteristic β-barrel structure.
⇒ Identify frequent sequences occurring only in β-barrel proteins (279 such frequent sequences are known)
⇒ Scan the target protein for these frequent sequences
⇒ Return the localization site as “outer membrane” if >2 such frequent sequences are found

PSORT-B: SubLocC
Overall amino acid composition is useful for recognizing cytoplasmic proteins.
⇒ Train an SVM on overall amino acid composition to predict cytoplasmic vs non-cytoplasmic, as in SubLoc
⇒ Analyze the target protein’s amino acid composition using this SVM

PSORT-B: Signal Peptides
The presence of a signal peptide at the N-terminal means the protein is not cytoplasmic.
⇒ Train an HMM and an SVM to recognize signal peptides and their cleavage sites
⇒ If a high-confidence cleavage site is found by the HMM in the first 70aa of the target protein, then “non-cytoplasmic”
⇒ If a low-confidence cleavage site is found, pass the candidate signal peptide to the SVM to confirm
⇒ If confirmed, then “non-cytoplasmic”
⇒ Otherwise, “unknown”
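The decision cascade of the signal peptide module can be summarized in a few lines. This is only a control-flow sketch: the HMM and SVM themselves are not implemented, and the argument names (`hmm_confidence`, `svm_confirms`) are hypothetical stand-ins for their outputs.

```python
def signal_peptide_call(hmm_confidence, svm_confirms):
    """Combine the HMM and SVM outcomes as described on the slide.

    hmm_confidence: 'high', 'low', or None (no cleavage site in the first 70 aa).
    svm_confirms: zero-argument callable returning True if the SVM confirms
                  the candidate signal peptide; only consulted for 'low'.
    """
    if hmm_confidence == "high":
        return "non-cytoplasmic"
    if hmm_confidence == "low" and svm_confirms():
        return "non-cytoplasmic"
    return "unknown"
```

Passing the SVM as a callable mirrors the slide's ordering: the second classifier is only run when the HMM is unsure.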

PSORT-B: Bayesian Network
The Bayesian Network integrates the results from the 6 modules.
It produces a score for each of the 5 possible localization sites.
If a site scores >7.5, it is predicted as a localization site of the target protein.
If no site scores >7.5, no prediction is made.
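The final thresholding step is simple enough to state as code. The sketch assumes the Bayesian Network has already produced one score per site, passed in as a dictionary; the function name `predict_sites` is hypothetical.

```python
def predict_sites(scores, threshold=7.5):
    """Return every site scoring above the threshold; an empty list means no prediction."""
    return sorted(site for site, score in scores.items() if score > threshold)
```

Abstaining when no site clears the threshold trades coverage for precision, which is why PSORT-B can report “unknown”.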

PSORT-B: Performance of Individual Modules
Dataset: Gardy et al., NAR, 2003

PSORT-B: Performance wrt Localization Sites
PSORT-B is a considerable improvement over the original PSORT.
Dataset: Gardy et al., NAR, 2003

PSORT vs PSORT-B: Some Remarks
PSORT considers various signals/features in a top-down way, driven by its reasoning tree.
PSORT-B generates all signals/features in a bottom-up way, then integrates them for decision making using a Bayesian Network.
Machine learning “beats” human expert? Probably the number of features/rules needed is too large or too complicated.

The amino acid compositions of proteins residing in different sites are different.

Amino Acid Composition Differences
Each cellular location has its own characteristic physicochemical environment.
Proteins in each location have adapted through evolution to that environment.
This adaptation is thus reflected in the protein structure and amino acid composition.
If the above is true, the amino acid composition differences with respect to cellular location should be more pronounced on protein surfaces than in protein interiors.
Exercise: Why?

Adaptation of Protein Surfaces
Andrade et al., JMB, 1998
Composition vector entries: the proportion of the j-th amino acid type in the i-th protein.
To test the theory of adaptation of protein surfaces to subcellular localization, we plot 3 types of composition vectors along their first two principal components.

Adaptation of Protein Surfaces
Andrade et al., JMB, 1998
Plots: total amino acid composition vector; surface amino acid composition vector; interior amino acid composition vector.
Clearly, the total & surface composition vectors show better separation than the interior composition vectors.

Amino Acid Composition
This means we can use amino acid composition vectors, especially those from protein surfaces, to predict subcellular localization!
Let’s see how this turns out….

Neural Networks: NNPSL
Reinhardt & Hubbard, NAR, 26, 1998
Inputs (Input 1 to Input 20): the fraction of each amino acid in the input protein.
Outputs: cytoplasmic, extracellular, mitochondrial, nuclear.
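The 20 inputs are just the amino acid composition of the protein. A minimal sketch of how such a vector can be computed is shown below; the alphabet ordering and the function name `aa_composition` are assumptions, and non-standard residue codes are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the fraction of each of the 20 amino acids, in AMINO_ACIDS order."""
    counts = Counter(a for a in seq.upper() if a in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]
```

The vector discards all order information, so two proteins with shuffled sequences receive identical inputs.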

NNPSL: Performance
The outputs of NNPSL have values from 0 to 1.
The difference (Δ) between the highest and the next highest output nodes can be used as a reliability index.
Results are reported for five ranges of Δ between 0 and 1.
Dataset: Reinhardt & Hubbard, NAR, 1998
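The reliability index is the margin between the two largest outputs. A small sketch, assuming the network outputs are available as a dictionary from location name to value (the function name `reliability` is hypothetical):

```python
def reliability(outputs):
    """Return (predicted location, margin between the top two output values)."""
    ranked = sorted(outputs.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    return best, top - second
```

A margin near 1 means one output node dominates; a margin near 0 means the network is torn between two locations.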

Performance
Emanuelsson, BIB, 3, 2002
Dataset: Emanuelsson et al., JMB, 2000 (940 proteins) (2738 proteins)

Markov Chain
Yuan, FEBS Letters, 451:23--26, 1999
Why?

Markov Chain: Performance
NNPSL vs 4th Order Markov (Eukaryotic)
Dataset: Reinhardt & Hubbard, NAR, 1998

Support Vector Machines: SubLoc
Hua & Sun, Bioinformatics, 17, 2001
Input: a 20-dimensional vector giving the amino acid composition of the input protein.
One X-vs-rest SVM per location X (extracellular vs rest, nuclear vs rest, cytoplasmic vs rest, mitochondrial vs rest); the prediction is the argmax over X.
The SVMs use
–a polynomial kernel with d = 9 (prokaryotic), $K(X_i, X_j) = (X_i \cdot X_j + 1)^d$
–an RBF kernel with $\gamma = 16$ (eukaryotic), $K(X_i, X_j) = \exp(-\gamma |X_i - X_j|^2)$
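The two kernels and the argmax over one-vs-rest classifiers can be written directly from their definitions. This sketch covers only those pieces; it does not train an SVM, and the function names and the dictionary of decision values are hypothetical.

```python
import math


def poly_kernel(x, y, d=9):
    """Polynomial kernel (x . y + 1)^d, with d = 9 as on the slide."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d


def rbf_kernel(x, y, gamma=16.0):
    """RBF kernel exp(-gamma * |x - y|^2), with gamma = 16 as on the slide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def predict_location(decision_values):
    """Pick the location whose X-vs-rest SVM gives the largest decision value."""
    return max(decision_values, key=decision_values.get)
```

Composition vectors have entries between 0 and 1, so squared distances are small; a fairly large gamma such as 16 keeps the RBF kernel from treating all proteins as near-identical.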

SubLoc: Performance
NNPSL vs SubLoc (Eukaryotic)
Dataset: Reinhardt & Hubbard, NAR, 1998

SubLoc: Robustness of Amino Acid Composition Approach
Amazingly, the accuracy of SubLoc is virtually unaffected when the first 10, 20, 30, & 40 amino acids in a protein are deleted.
Amino acid composition is a robust indicator of subcellular localization, and is insensitive to errors in N-terminal sequences.

Amino Acid Composition: Taking it Further
How about pairs of consecutive amino acids (a.k.a. 2-grams)?
How about 3-grams, …, k-grams?
How about pseudo amino acid composition?
How about the presence of entire functional domains? (I.e., think of the presence/absence of a functional domain as a summary of amino acid sequence info...)
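Extending composition from single residues to k-grams is mechanical, as the sketch below shows. It assumes overlapping k-grams over the standard 20-letter alphabet and skips any k-gram containing a non-standard residue; the function name `kgram_composition` is hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kgram_composition(seq, k=2):
    """Return a dict mapping each of the 20**k possible k-grams to its frequency."""
    seq = seq.upper()
    grams = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    grams = [g for g in grams if all(c in AMINO_ACIDS for c in g)]
    counts = Counter(grams)
    total = len(grams)
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    return {key: (counts[key] / total if total else 0.0) for key in keys}
```

The feature count grows as 20^k (400 for pairs, 8000 for triples), so larger k quickly outpaces the number of training proteins available.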

Functional Domain Composition
Chou & Cai, JBC, 277, 2002
Training seqs of various localization sites
BLAST against a db of known functional domains (SBASE-A)
amino acid composition +
Train an SVM using these vectors
x_i = 1 means the ith domain is present

Functional Domain Composition: Performance
Not so good. Why?
⇒ The number of known domains in SBASE-A is too small
⇒ Need to handle the situation where a protein has no hit among the known domains
Dataset: Reinhardt & Hubbard, NAR, 1998

Functional Domain Composition
Cai & Chou, BBRC, 305, 2003
Training seqs of various localization sites are BLASTed against a db of known functional domains (Interpro).
NN-5875D: train k-NN (k=1) using the functional domain vectors.
NN-40D: train k-NN (k=1) using vectors built from amino acid composition and pseudo amino acid composition, for use if no hit is found.
If a protein gets a hit in Interpro, use NN-5875D; else use NN-40D.
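The dispatch between the two nearest-neighbour predictors can be sketched as follows. The sketch uses squared Euclidean distance for the 1-NN search, which is an assumption (the slide does not name the distance measure), and the function names and data layout are hypothetical.

```python
def nearest_label(query, training):
    """1-NN: return the label of the training vector closest to query.

    training: list of (vector, label) pairs.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(training, key=lambda item: sq_dist(query, item[0]))[1]


def predict(domain_vec, composition_vec, domain_training, composition_training):
    """Use the domain-based predictor when any domain is present, else fall back."""
    if any(domain_vec):
        return nearest_label(domain_vec, domain_training)      # NN-5875D branch
    return nearest_label(composition_vec, composition_training)  # NN-40D branch
```

The fallback addresses the weakness noted for the earlier SVM version: a protein with no domain hit would otherwise be represented by an all-zero vector.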

Functional Domain Composition: Performance
Dataset: Reinhardt & Hubbard, NAR, 1998

Notes

References (Transmembrane)
Weiss et al. “Transmembrane segment prediction from protein sequence data”, ISMB, 1993
Gavel et al. “The positive-inside rule applies to thylakoid membrane proteins”, FEBS, 282:41--46, 1991
Monne et al. “A turn propensity scale for transmembrane helices”, JMB, 288, 1999
Sonnhammer et al. “A hidden Markov model for predicting transmembrane helices in protein sequences”, ISMB, 6, 1998
Martelli et al. “An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins”, Bioinformatics, 19(suppl):i205--i211, 2003
Von Heijne. “Membrane protein structure prediction”, JMB, 225, 1992
Jacoboni et al. “Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor”, Protein Sci., 10, 2001
Martelli et al. “A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins”, Bioinformatics, 18:S46--S53, 2002
Moller et al. “Evaluation of methods for the prediction of membrane spanning regions”, Bioinformatics, 17, 2001
Fariselli et al. “MaxSubSeq: an algorithm for segment-length optimization. The case study of the transmembrane spanning segments”, Bioinformatics, 19, 2003
Rost et al. “Transmembrane helices predicted at 95% accuracy”, Protein Sci., 4, 1995
Krogh et al. “Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes”, JMB, 305, 2001
Andersson et al. “Different positively charged amino acids have similar effects on the topology of a polytopic transmembrane protein in E. coli”, JBC, 267, 1992

References (Subcellular Localization)
Horton & Nakai, “Better prediction of protein cellular localization sites with the k-nearest neighbours classifier”, ISMB, 5, 1997
Gardy et al., “PSORT-B: Improving protein subcellular localization for Gram-negative bacteria”, NAR, 31, 2003
Emanuelsson, “Predicting protein subcellular localization from amino acid sequence information”, BIB, 3, 2002
Andrade et al., “Adaptation of protein surfaces to subcellular location”, JMB, 276, 1998
Yuan, “Prediction of protein subcellular locations using Markov chain models”, FEBS Letters, 451:23--26, 1999
Emanuelsson et al., “ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites”, Protein Sci., 8, 1999
Emanuelsson et al., “Predicting subcellular localization of proteins based on their N-terminal amino acid sequence”, JMB, 300, 2000
Hua & Sun, “Support vector machine approach for protein subcellular localization prediction”, Bioinformatics, 17, 2001
Reinhardt & Hubbard, “Using neural networks for prediction of the subcellular location of proteins”, NAR, 26, 1998
Cai & Chou, “Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition”, BBRC, 305, 2003
Chou & Cai, “Using functional domain composition and support vector machines for prediction of protein subcellular location”, JBC, 277, 2002
Park & Kanehisa, “Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs”, Bioinformatics, 19, 2003
