Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.



Re-ranking identified spectra (Keller Analytical Chemistry 2002) (Anderson J Proteome Research 2003) (Käll Nature Methods 2007)

This is the problem we set out to solve [Figure labels: EAMPK, EAMPK?]

Modified problem: Is this peptide assignment correct? [Figure: spectrum (intensity vs. m/z) assigned to peptide VVVTGLGMLSPVGNTVESTWK]

Peptide-spectrum match features
- Total peptide mass
- Charge (+1, +2 or +3)
- Total ion current
- Peak count
- Preliminary SEQUEST score (Sp)
- Sp rank
- Cross-correlation score (XCorr)
- Change in XCorr (delta Cn)
- Mass difference
- Percent of theoretical peaks matched
- Percent of observed peaks matched
- Percent of peptide fragment ion current matched
- Percent sequence identity between top and second-ranked peptides

- Uses linear discriminant analysis rather than SVM.
- Uses a four-dimensional feature space (XCorr, delta Cn, ln SpRank, delta mass).
- Uses EM to fit distributions to the discriminants of the two classes, yielding a probability.
- Learns a simple, independent probability model of the number of tryptic termini.
- Publicly available software, PeptideProphet, is widely used.
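The EM step in the list above can be sketched as fitting a two-component mixture to one-dimensional discriminant scores and reading off the posterior probability of the "correct" component. This is a minimal sketch under stated assumptions: both components are taken to be Gaussian for simplicity (PeptideProphet's actual distribution choices are not given on the slide), and the names `fit_two_gaussians` and `prob_correct` are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_correct(x, params):
    """Posterior probability that score x comes from the higher-scoring component."""
    pi, mu0, sd0, mu1, sd1 = params
    a = pi * normal_pdf(x, mu1, sd1)
    b = (1.0 - pi) * normal_pdf(x, mu0, sd0)
    return a / (a + b) if a + b > 0.0 else 0.5


def fit_two_gaussians(scores, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture (an assumed model).

    Returns (pi, mu0, sd0, mu1, sd1); component 1 is initialized at the
    higher scores and pi is its mixing weight.
    """
    lo, hi = min(scores), max(scores)
    spread = (hi - lo) or 1.0
    mu0, mu1 = lo + 0.25 * spread, lo + 0.75 * spread
    sd0 = sd1 = spread / 4.0
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for every score.
        resp = [prob_correct(x, (pi, mu0, sd0, mu1, sd1)) for x in scores]
        n1 = sum(resp)
        n0 = len(scores) - n1
        if n1 < 1e-9 or n0 < 1e-9:
            break  # one component has collapsed; keep the last estimates
        # M-step: weighted means, standard deviations, and mixing weight.
        mu1 = sum(r * x for r, x in zip(resp, scores)) / n1
        mu0 = sum((1.0 - r) * x for r, x in zip(resp, scores)) / n0
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, scores)) / n1
        var0 = sum((1.0 - r) * (x - mu0) ** 2 for r, x in zip(resp, scores)) / n0
        sd1 = max(1e-6, math.sqrt(var1))
        sd0 = max(1e-6, math.sqrt(var0))
        pi = n1 / len(scores)
    return pi, mu0, sd0, mu1, sd1
```

Calling `prob_correct(score, fit_two_gaussians(all_scores))` then gives a per-PSM probability from the fitted mixture.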

[Figure: peptide-spectrum matches against the real database vs. peptide-spectrum matches against the shuffled database; q = 0.01 marked]
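The q = 0.01 threshold in this figure comes from comparing real-database (target) and shuffled-database (decoy) matches. The sketch below shows one simple way to turn the two score lists into q-values. It assumes higher scores are better and estimates FDR as decoys divided by targets above each threshold; the function name `target_decoy_qvalues` is hypothetical, and real tools may use refined estimators.

```python
def target_decoy_qvalues(target_scores, decoy_scores):
    """Return [(score, q_value)] for each target PSM, best score first.

    FDR at a threshold is estimated as (#decoys >= t) / (#targets >= t).
    The q-value of a PSM is the minimum FDR over all thresholds that
    would still accept it.
    """
    labeled = [(s, True) for s in target_scores]
    labeled += [(s, False) for s in decoy_scores]
    # Sort by decreasing score; at ties, decoys come first (conservative).
    labeled.sort(key=lambda item: (-item[0], item[1]))

    fdrs = []
    n_target = n_decoy = 0
    for score, is_target in labeled:
        if is_target:
            n_target += 1
            fdrs.append((score, min(1.0, n_decoy / n_target)))
        else:
            n_decoy += 1

    # Enforce monotonicity by taking a running minimum from the worst score up.
    qvalues = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append((score, running_min))
    qvalues.reverse()
    return qvalues


def count_accepted(qvalues, q_threshold=0.01):
    """Number of target PSMs accepted at the given q-value threshold."""
    return sum(1 for _, q in qvalues if q <= q_threshold)
```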

Features

[Figure: PSM counts at 1% FDR; surviving labels: 2780 PSMs, 8050 PSMs]

Cleaving with elastase

Variation by data set [Figure panels: elastase data set; chymotrypsin data set. Black lines are q = 0.01; yellow line is y = x; red line = equal q value thresholds]

[Figure: Percolator best match vs. SEQUEST best match]

Protein identification

The protein ID problem [Figure: proteins, peptides, spectra; sequence labels: EEAMPFKCYCYGGLGKCYCLLIGKFTEILYCDLNRVNILLGLPK]

The peptide-to-protein mapping is many-to-many

One- and two-peptide rules use a simple threshold [Figure: proteins (X), peptides (Y), spectra (D); threshold ≥ 0.90]
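A threshold rule of this kind is short enough to write out. The sketch below assumes each peptide carries a probability and that a protein is accepted when at least `min_peptides` of its peptides reach the threshold; the function name and the dictionary layout are hypothetical.

```python
def proteins_by_peptide_rule(peptide_probs, protein_to_peptides,
                             threshold=0.90, min_peptides=2):
    """Accept proteins with at least `min_peptides` confident peptides.

    peptide_probs: dict peptide -> probability
    protein_to_peptides: dict protein -> iterable of peptides
    min_peptides=1 gives the one-peptide rule, 2 the two-peptide rule.
    """
    confident = {pep for pep, prob in peptide_probs.items() if prob >= threshold}
    accepted = []
    for protein, peptides in protein_to_peptides.items():
        if len(confident & set(peptides)) >= min_peptides:
            accepted.append(protein)
    return sorted(accepted)
```

Note that a shared peptide counts toward every protein that contains it, which is one reason the many-to-many mapping makes simple rules unreliable.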

IDPicker: Select the minimum number of proteins to explain the peptides [Figure: proteins (X), peptides (Y), spectra (D); threshold ≥ 0.90]
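Choosing the fewest proteins that explain all accepted peptides is a set-cover problem. The sketch below uses the standard greedy approximation to set cover. It is an illustration of the idea only, not IDPicker's own procedure, and the function name is hypothetical.

```python
def greedy_minimal_proteins(protein_to_peptides, observed_peptides):
    """Greedy set cover: repeatedly pick the protein explaining the most
    still-unexplained peptides.

    Returns (chosen_proteins, unexplained_peptides). Ties are broken by
    protein name so the result is deterministic.
    """
    remaining = set(observed_peptides)
    chosen = []
    names = sorted(protein_to_peptides)
    while remaining:
        best = max(names, key=lambda p: len(remaining & set(protein_to_peptides[p])))
        gain = remaining & set(protein_to_peptides[best])
        if not gain:
            break  # no protein explains the leftover peptides
        chosen.append(best)
        remaining -= gain
    return chosen, remaining
```

Greedy selection does not always return a true minimum, but it is simple and usually close on small examples.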

ProteinProphet: Use an EM-like procedure… [Figure: proteins (X), peptides (Y), spectra (D)]

ProteinProphet [Figure: proteins (X), peptides (Y), spectra (D)]

ProteinProphet [Figure: proteins (X), peptides (Y), spectra (D); surviving numeric labels: 0.8, 0.91, 0.7, 0.97]

EM-like algorithm (E-step, M-step). Quantities labeled in the equations:
- All proteins containing peptide i
- Probability of protein n
- Weight of link from peptide i to protein n
- Maximum probability assigned to peptide i
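The quantities named on this slide suggest an alternating update, sketched below. The two formulas are assumptions based on the commonly cited simplified form of ProteinProphet, not taken from this transcript: P_n = 1 - prod_i (1 - w_in * p_i) for the protein probability, and w_in = P_n / (sum of P_m over all proteins m containing peptide i) for the link weight. The function name is hypothetical.

```python
def protein_prophet_like(peptide_probs, peptide_to_proteins, n_iter=50):
    """Alternate between protein probabilities and peptide-protein weights.

    peptide_probs: dict peptide i -> maximum probability assigned to peptide i
    peptide_to_proteins: dict peptide i -> list of proteins containing it
    Returns (protein_probability, link_weight).
    """
    proteins = sorted({n for ns in peptide_to_proteins.values() for n in ns})
    # Start by splitting each peptide evenly among its proteins.
    weight = {(i, n): 1.0 / len(ns)
              for i, ns in peptide_to_proteins.items() for n in ns}
    prob = {}
    for _ in range(n_iter):
        # Assumed update: P_n = 1 - prod_i (1 - w_in * p_i).
        for n in proteins:
            miss = 1.0
            for i, ns in peptide_to_proteins.items():
                if n in ns:
                    miss *= 1.0 - weight[(i, n)] * peptide_probs[i]
            prob[n] = 1.0 - miss
        # Assumed update: w_in = P_n / sum of P_m over proteins containing i.
        for i, ns in peptide_to_proteins.items():
            total = sum(prob[m] for m in ns)
            for n in ns:
                weight[(i, n)] = prob[n] / total if total > 0.0 else 1.0 / len(ns)
    return prob, weight
```

In this sketch a shared peptide drifts toward the protein that has independent support, so a protein backed only by shared peptides ends up with low probability.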

Nested Mixture Model (Li Ann Applied Statistics 2010) [Figure: proteins (X), peptides (Y), spectra (D)]
- Modeled as mixture of present and absent
- Model number of matches conditional on protein states
- Model distribution of scores conditional on peptide states

The emergence of graphical Bayesian methods
Generatively model Y | X and D | Y, then perform inference to get Pr(X | D).

Shen et al
- Model the MS/MS process generatively (forward) using free parameters.
- Sum over all possible protein and peptide states to get posterior probabilities.
- Use expectation-maximization to get parameter estimates.

Li et al
- Model the MS/MS process generatively using an existing static peptide detectability model.
- Use Markov chain Monte Carlo to estimate posterior probabilities.
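To make "sum over all possible protein states to get Pr(X | D)" concrete, the sketch below enumerates every protein configuration on a tiny example. The parameterization is an assumption made for illustration: each protein is present with prior `gamma`, a present protein emits each of its peptides with probability `alpha`, a peptide appears spuriously with probability `beta`, and the spectra enter only through per-peptide likelihoods Pr(D | Y). None of the names or values come from the slides.

```python
from itertools import product


def protein_posteriors(proteins, peptide_parents, lik_present, lik_absent,
                       gamma=0.5, alpha=0.9, beta=0.01):
    """Exact Pr(X_n = 1 | D) by brute-force enumeration of protein states.

    peptide_parents: dict peptide -> list of proteins that contain it
    lik_present[j] = Pr(D_j | Y_j = 1), lik_absent[j] = Pr(D_j | Y_j = 0)
    Cost is exponential in the number of proteins, so this is only for
    tiny examples.
    """
    total = 0.0
    marginal = {n: 0.0 for n in proteins}
    for states in product([0, 1], repeat=len(proteins)):
        x = dict(zip(proteins, states))
        # Prior Pr(X): independent proteins.
        w = 1.0
        for n in proteins:
            w *= gamma if x[n] else 1.0 - gamma
        # Peptides are independent given X, so the sum over Y factorizes.
        for j, parents in peptide_parents.items():
            k = sum(x[n] for n in parents)
            p_y = 1.0 - (1.0 - beta) * (1.0 - alpha) ** k  # noisy-OR
            w *= p_y * lik_present[j] + (1.0 - p_y) * lik_absent[j]
        total += w
        for n in proteins:
            if x[n]:
                marginal[n] += w
    return {n: marginal[n] / total for n in proteins}
```

Exact enumeration is what becomes infeasible on real data, which is why the methods above turn to EM, sampling, or structure-exploiting exact inference.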

Fido performs exact calculations on a Bayesian network model

Barista uses a neural network to score PSMs [Figure: PSM feature vector; input units: 17 PSM features; hidden units; output unit]
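A network of the shape shown (17 input features, one hidden layer, one output) can be sketched as a plain forward pass. The hidden-layer size, the tanh activation, and the random initialization below are assumptions, since the slide does not specify them; `init_network` and `score_psm` are hypothetical names.

```python
import math
import random


def init_network(n_inputs=17, n_hidden=5, seed=0):
    """Small random weights for a one-hidden-layer network (sizes assumed)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.1) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.gauss(0.0, 0.1) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def score_psm(net, features):
    """Map one PSM feature vector to a single real-valued score."""
    if len(features) != len(net["W1"][0]):
        raise ValueError("unexpected number of PSM features")
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(net["W1"], net["b1"])
    ]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
```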

The Barista model includes spectra, peptides and proteins [Figure: proteins R1, R2, R3; peptides E1, E2, E3, E4; spectra S1, S2, S3, S4, S5, S6, S7. Equation labels: neural network score function; number of peptides in protein R]

Model Training
Search against a database containing real (target) and shuffled (decoy) proteins.
For each protein, the label y ∈ {+1, -1} indicates whether it is a target or decoy.
Hinge loss function: L(F(R), y) = max(0, 1 - yF(R))
Goal: Choose parameters W such that F(R) > 0 if y = +1, F(R) < 0 if y = -1.

repeat
    Pick a random protein (R_i, y_i)
    Compute F(R_i)
    if (1 - y_i F(R_i)) > 0 then
        Make a gradient step to optimize L(F(R_i), y_i)
    end if
until convergence
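The training loop above can be sketched in runnable form. To keep it short, the sketch swaps in a linear score F(R) = w · x + b over a hypothetical per-protein feature vector x instead of Barista's neural network over spectra and peptides, and it replaces "until convergence" with a fixed number of steps. All names and the learning rate are assumptions.

```python
import random


def protein_score(w, b, x):
    """Linear stand-in for F(R): dot product of weights and features plus bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_hinge(examples, n_features, lr=0.1, n_steps=2000, seed=0):
    """Stochastic subgradient descent on the hinge loss max(0, 1 - y F(R)).

    examples: list of (feature_vector, y) with y = +1 (target) or -1 (decoy)
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_steps):
        x, y = rng.choice(examples)       # pick a random protein (R_i, y_i)
        f = protein_score(w, b, x)        # compute F(R_i)
        if 1.0 - y * f > 0.0:             # margin violated: loss is positive
            # Subgradient of the loss w.r.t. w is -y * x, so step along +y * x.
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b
```

Examples that already satisfy the margin contribute zero loss and are skipped, which is why the update sits inside the `if`.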

Barista performs well in target/decoy evaluation

Why does Barista work well?
Sources of information loss during two-stage analysis:
- Spectra that are not confidently assigned to a peptide during the initial search are lost.
- Also lost are lower-ranked peptides that match a given spectrum, corresponding to
  – the correct peptide when the top-ranked peptide is incorrect, or
  – a second correct peptide when the spectrum is chimeric.
- A single score is less informative than a rich feature vector describing the PSM.