Laxman Yetukuri T-61.6070: Modeling of Proteomics Data

Presentation transcript:

PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search
Laxman Yetukuri, T-61.6070: Modeling of Proteomics Data

Outline
- Motivation
- Basics: MS and MS/MS for Protein Identification
- Computational Framework of Database Search
- Scoring Algorithms
  - PepHMM
  - MOWSE
- Results
- Summary

Motivation
- Proteomics studies are dynamic and context sensitive
- Speed and accuracy of omics-driven methods
- High-throughput MS-based approaches
- Real analysis starts with protein identification
- Protein identification is challenging
- The heart of a protein identification algorithm is its scoring function

Protein Identification Is Challenging
- Sample contamination
- Imperfect fragmentation
- Post-translational modifications
- Low signal-to-noise ratio
- Machine errors

Basics: MS and MS/MS for Protein Identification
- Trypsin digest
- Liquid chromatography
- Mass spectrometry (MS)
- Precursor selection + collision-induced dissociation (CID)
- MS/MS

Computational Problem
[Figure from Nesvizhskii and Aebersold, Drug Discovery Today, 2004, 9, 173-181]

Peptide Fragmentation: b & y ions
[Figure: peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH- with the cleavage sites that give the b_i, b_{i+1}, y_{n-i} and y_{n-i-1} ions]

Peptide Fragmentation: b & y ions
Residue   S     G     F     L     E     E     D     E     L     K
b ions    88    145   292   405   534   663   778   907   1020  1166
y ions    1166  1080  1022  875   762   633   504   389   260   147
[Figure: MS/MS spectrum, % intensity (up to 100) versus m/z (250 to 1000), with labelled peaks b3 to b9 and y2 to y9]
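The b- and y-ion ladder on this slide can be reproduced, to a first approximation, from residue masses. The sketch below is only an illustration: the function names `b_ions` and `y_ions` are hypothetical, the masses are standard monoisotopic values, and singly charged ions are assumed. The results agree with the slide's rounded values to within 1 Da.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..b(n-1), N-terminal fragments."""
    total, out = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..y(n-1), C-terminal fragments."""
    total, out = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        out.append(total + WATER + PROTON)
    return out


if __name__ == "__main__":
    print([round(m, 2) for m in b_ions("SGFLEEDELK")])
    print([round(m, 2) for m in y_ions("SGFLEEDELK")])
```

The last entry in each row of the slide (1166) is the full protonated peptide rather than a fragment, so the sketch stops at n-1.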

Peptide Fragmentation with other ions
[Figure: peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH- with the cleavage sites that give the a_i, b_i, c_i, b_{i+1} and x_{n-i}, y_{n-i}, z_{n-i}, y_{n-i-1} ions]

Peptide Identification
Two main methods for tandem MS:
- De novo interpretation
- Sequence database search

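De Novo Interpretation
[Figure: MS/MS spectrum, % intensity (up to 100) versus m/z (250 to 1000), with peak spacings labelled by the residues they imply (E, L, F, KL, SGF, G, D)]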

Sequence Database Search
- Widely used approach
- Compares peptides from a protein sequence database with experimental spectra
- A scoring function summarises the comparison
  - Critical for any search engine
- Score each peptide against the spectrum
  - Cross correlation (SEQUEST)
  - MOWSE scoring and its extensions (MASCOT)
  - Probabilistic scoring systems (OMSSA, OLAV, ProbID, ...)
- PepHMM is an HMM-based probabilistic scoring function

Computational Framework for pepHMM
- MSDB-based peptide extraction
- Hypothetical spectrum generation
  - b, y, y-H2O, b-H2O, b2+ and y2+
- Computing probabilistic scores
  - Initial classification: match, missing or noise
  - Compute pepHMM scores (discussed later)
  - Compute Z-score
  - Compute E-score
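The initial classification step could be sketched as follows. This is an illustration under stated assumptions, not the PepHMM implementation: the function name `classify_peaks`, the 0.5 Da tolerance, and the rule of taking the most intense unassigned peak within tolerance are all hypothetical choices. Each match carries the pair (T, I), where T is the observed minus theoretical m/z and I is the peak intensity.

```python
def classify_peaks(theoretical_mz, observed_peaks, tolerance=0.5):
    """Split a comparison into matches, missing ions and noise peaks.

    theoretical_mz: list of theoretical fragment m/z values
    observed_peaks: list of (mz, intensity) tuples from the spectrum
    tolerance:      maximum |observed - theoretical| in Da (assumed value)
    """
    used = set()              # indices of observed peaks already assigned
    matches, missing = [], []
    for t in theoretical_mz:
        best = None
        for idx, (mz, intensity) in enumerate(observed_peaks):
            if idx in used or abs(mz - t) > tolerance:
                continue
            # Assumption: prefer the most intense candidate peak.
            if best is None or intensity > observed_peaks[best][1]:
                best = idx
        if best is None:
            missing.append(t)
        else:
            used.add(best)
            mz, intensity = observed_peaks[best]
            matches.append((t, mz - t, intensity))  # (ion m/z, T, I)
    # Observed peaks that explain no theoretical ion are treated as noise.
    noise = [p for i, p in enumerate(observed_peaks) if i not in used]
    return matches, missing, noise


if __name__ == "__main__":
    theo = [147.11, 260.20, 389.24]
    obs = [(147.2, 30.0), (260.1, 80.0), (300.0, 5.0)]
    print(classify_peaks(theo, obs))
```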

Contents of pepHMM Model
- PepHMM combines information on the correlation among ions, peak intensity and match tolerance
- Input: sets of matches, missing and noise
- Model is based on b and y ions
- Each match is associated with an observation (T, I)
- Observation state = observed (T, I)
- Hidden state = true assignment of the observations

Model Structure
- Four possible assignments, corresponding to four hidden states

Model Computation
- Goal: find the highest-scoring peptide in the database
- Let a path in the HMM represent a configuration of states, with an associated path probability

Model Computation (continued)
- Considering all possible paths
- Forward algorithm: probability of all possible paths from the first position to state v at position i
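The forward recursion can be sketched for a generic HMM as below. This is a minimal illustration, not the PepHMM code: the function name `forward` and the argument layout are assumptions, and the emission probabilities are taken as precomputed per position. At each position i it keeps, for every state v, the summed probability of all paths that reach v.

```python
def forward(init, trans, emit):
    """Total probability of the observations, summed over all state paths.

    init[v]:     probability of starting in state v
    trans[u][v]: probability of moving from state u to state v
    emit[i][v]:  probability of the observation at position i given state v
    """
    n_states = len(init)
    # f[v]: probability of all paths from the first position to state v
    f = [init[v] * emit[0][v] for v in range(n_states)]
    for i in range(1, len(emit)):
        f = [
            emit[i][v] * sum(f[u] * trans[u][v] for u in range(n_states))
            for v in range(n_states)
        ]
    return sum(f)


if __name__ == "__main__":
    init = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.5, 0.1], [0.4, 0.3]]
    print(forward(init, trans, emit))
```

The recursion costs time proportional to (positions × states²), instead of enumerating every path explicitly. For long peptides a log-space or scaled variant would be needed to avoid underflow.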

Emission Probabilities
- Probability of observing (Tb, Ib) and (Ty, Iy) for state 1 at position i
  - Normal distribution
  - Exponential distribution
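The transcript does not say which quantity follows which distribution. The sketch below assumes that the match tolerance T is normally distributed and the intensity I is exponentially distributed, and that the four observed values are independent given the state. Those assumptions, the function names, and the parameters `mu`, `sigma` and `rate` are all hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def emission_both_matched(tb, ib, ty, iy, mu, sigma, rate):
    """Emission density when both the b and the y ion are matched.

    Assumes T ~ Normal(mu, sigma), I ~ Exponential(rate), and independence
    between the four observed values (an assumption of this sketch).
    """
    return (normal_pdf(tb, mu, sigma) * exponential_pdf(ib, rate)
            * normal_pdf(ty, mu, sigma) * exponential_pdf(iy, rate))


if __name__ == "__main__":
    print(emission_both_matched(0.05, 0.8, -0.1, 0.3, mu=0.0, sigma=0.2, rate=2.0))
```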

MOWSE Scoring System
- The MOWSE algorithm is implemented in the MASCOT software
- m_{i,j}: elements of the MOWSE frequency matrix
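The slide's formula is not preserved in the transcript. As a hedged reminder, the original MOWSE score is commonly written as 50,000 / (M_prot × Π m_{i,j}), where M_prot is the protein molecular mass and the product runs over the frequency-matrix entries of the matched peptides. The sketch below assumes that form; the function name `mowse_score` is hypothetical, and MASCOT's probability-based extension is not reproduced.

```python
import math


def mowse_score(protein_mass, matched_frequencies):
    """Classic MOWSE-style score (assumed form, see the text above).

    protein_mass:        molecular mass of the candidate protein (Da)
    matched_frequencies: frequency-matrix entries m_ij, one per matched peptide
    """
    return 50000.0 / (protein_mass * math.prod(matched_frequencies))


if __name__ == "__main__":
    # Rarer matched peptides (smaller m_ij) give a higher score.
    print(mowse_score(50000.0, [0.5, 0.5]))
    print(mowse_score(50000.0, [0.1, 0.1]))
```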

Data Sets
- ISB data set: A, B mixtures of 18 different proteins with modifications/relative amounts
- Analysed using SEQUEST and other in-house software
- Data set is curated
- Final data set with charge 2+ for trypsin digestion contains 857 spectra
- 5-fold cross validation by random selection
  - Training set: 687 spectra
  - Testing set: 170 spectra
- EM algorithm is used for estimating parameters

Results: Distributions of Ions
[Figure panels: Noise; b and y ions; Match tolerance; Parameter estimates]

Comparative Studies
- Data set selection repeated 10 times to select both training and test data sets
- For each group, the parameter estimates take similar values
- A prediction is considered correct if the true peptide has the highest score
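The correctness criterion can be stated as a top-1 accuracy over the test spectra. The sketch below is only an illustration: the function name `top1_accuracy`, the input layout and the peptide strings are hypothetical.

```python
def top1_accuracy(results):
    """Fraction of spectra whose true peptide has the highest score.

    results: list of (true_peptide, scores) pairs, where scores maps each
             candidate peptide to its score for that spectrum.
    """
    correct = 0
    for true_peptide, scores in results:
        best = max(scores, key=scores.get)  # top-scoring candidate
        if best == true_peptide:
            correct += 1
    return correct / len(results)


if __name__ == "__main__":
    results = [
        ("SGFLEEDELK", {"SGFLEEDELK": 12.3, "AGFLEEDELK": 7.1}),
        ("PEPTIDEK", {"PEPTIDEK": 3.0, "PEPTLDEK": 4.5}),
    ]
    print(top1_accuracy(results))
```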

Independent Data Set
- A.Y’s Lab: the other, independent data set, used for comparison with other tools such as SEQUEST and MASCOT
- Size of data set = 20,980 spectra

False/True Positive Rates

Summary
- Developed a probabilistic scoring function called pepHMM for improving protein identification
- PepHMM outperforms other tools such as MASCOT, with a low false positive rate (always?)
- Can it handle ion types other than b and y ions?
- Need to handle post-translational modifications