Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center.

Presentation transcript:

Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center

What is PepArML?
- Meta-search using seven search engines: Mascot; X!Tandem Native, K-Score, S-Score; OMSSA; Myrimatch; InsPecT + MSGF
- Automatic target + decoy searches
- Automatic construction of search configuration
- Automatic spectra and sequence (re-)formatting
- Heterogeneous cluster, grid, cloud computing
- Centralized scheduler
- Shared and private computational resources
- Integration with NSF TeraGrid and AWS

What is PepArML?
- A peptide identification result combiner
- Selects the best identification per spectrum
- Model-free, auto-train machine-learning
- Estimates false-discovery rates
- Formats output as pepXML and protXML
- In use: more than 23M spectra, 1.4M search jobs, and 1TB in spectra and results
- PepArML identifies significantly more spectra than single search engines
- Recovers more proteins with fewer replicates

PepArML Performance
[Figure: results for LCQ, QSTAR, and LTQ-FT instruments; Standard Protein Mix Database, 18 Standard Proteins, Mix1]

PepArML Advantages
- Can easily accommodate new search engines or spectrum and peptide features
- Learns the specific characteristics of each dataset from scratch!
- Provides a platform for comparing single search engine results under a common FDR estimation procedure

Search Engine Info. Gain

Precursor & Digest Info. Gain

Retention Time & Proteotypic Peptide Properties Info. Gain
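The three slides above report information gain per feature group, but the transcript does not say how it was computed. As a rough illustration only, the textbook definition for a discrete feature (entropy of the labels minus the weighted entropy after splitting on the feature) can be sketched as follows. The function names and toy data are hypothetical, and continuous features such as retention time would first need binning, which is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data (assumed): True = correct identification, False = incorrect.
labels = [True, True, False, False]
perfect = information_gain(["hi", "hi", "lo", "lo"], labels)  # feature separates the classes
useless = information_gain(["a", "b", "a", "b"], labels)      # feature carries no signal
```

A feature that separates correct from incorrect identifications perfectly recovers the full label entropy, while an uninformative feature yields a gain of zero.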

Search Engine Independent FDR Estimation
- Comparing search engines is difficult due to different FDR estimation techniques
- Implicit assumption: spectra scores can be thresholded
- Competitive vs Global: competitive controls some spectral variation
- Reversed vs Shuffled Decoy Sequence: reversed models target redundancy accurately
- Charge-state partition or Unified: mitigates effect of peptide-length-dependent scores
- What about peptide property partitions?
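To make the "Competitive vs Global" distinction concrete, here is a minimal sketch of the two counting schemes under a single score threshold. It follows one common convention (decoy count divided by target count); other variants exist, and the slides do not state which formula PepArML uses. The function names, the tie-breaking rule, and the scores are illustrative assumptions.

```python
def global_fdr(target_scores, decoy_scores, threshold):
    """Separate target and decoy searches: count each list independently."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


def competitive_fdr(target_scores, decoy_scores, threshold):
    """Per spectrum, only the better of the target and decoy hit is counted.

    target_scores[i] and decoy_scores[i] must belong to the same spectrum.
    Ties go to the target hit (an assumption of this sketch).
    """
    targets = decoys = 0
    for t, d in zip(target_scores, decoy_scores):
        if max(t, d) < threshold:
            continue
        if t >= d:
            targets += 1
        else:
            decoys += 1
    return decoys / targets if targets else 0.0


# Made-up best scores for five spectra (higher is better).
target = [9.0, 8.0, 7.0, 3.0, 2.0]
decoy = [1.0, 2.0, 7.5, 2.5, 1.0]

fdr_global = global_fdr(target, decoy, threshold=3.0)            # 1 decoy / 4 targets
fdr_competitive = competitive_fdr(target, decoy, threshold=3.0)  # 1 decoy / 3 targets
```

In the competitive scheme the third spectrum counts only as a decoy hit, because its decoy match outscores its target match. In the global scheme the same spectrum contributes to both counts.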

PepArML Disadvantages
- Training heuristic can fail to “get started”
- Works best on large datasets
- Iterative training can be time-consuming
- Machine-learning “confidence” is uninterpretable for peptide identification
- Requires two decoy searches to “calibrate” confidence as FDR
- Each spectrum searched ~21 times!
- Can we eliminate the internal decoy? This would reduce the search phase by 33%
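The ~21 searches and the 33% saving are consistent with a simple count, assuming each of the seven search engines runs once against the target database and once against each of two decoy databases. The slides do not spell out this breakdown, so treat the arithmetic below as a plausible reading rather than a documented fact.

```python
# Assumed breakdown: 7 engines x (1 target + 2 decoy searches).
engines = 7
searches_per_engine_now = 3         # target + internal decoy + external decoy
searches_per_engine_decoy_free = 2  # target + external decoy only

searches_now = engines * searches_per_engine_now                # 21 per spectrum
searches_decoy_free = engines * searches_per_engine_decoy_free  # 14 per spectrum
reduction = 1 - searches_decoy_free / searches_now              # about one third
```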

PepArML Workflow
- Select high-quality IDs
- Guess true proteins from search results
- Label spectra & train
- Calibrate confidence
- Guess true proteins from ML results
- Iterate!
- Estimate FDR using (external) decoy

Select High-Quality Unanimous Peptide Identifications
- Requires a fast and easy, but comparable, search-engine metric
[Figure: panels "min decoy hits" and "min z-score"]

Simulate Decoy Results by Sampling Target Results
[Figure: Target, Decoy, and Sampled Target]

Sampled Target Approximates Decoy Calibration
- Sample 75% of non-training “false” target results
- Rescale to # of spectra
- Approximates FDR well enough to replace the internal decoy
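One way to read the procedure above is sketched below. It samples 75% of the target results labeled "false" and held out of training, treats them as pseudo-decoy results, rescales their count to the number of spectra, and uses the rescaled count as the expected number of false identifications above a confidence threshold. This is an interpretation of the slide text, not the authors' code. The function name, its parameters, and the toy confidences are all hypothetical.

```python
import random


def sampled_target_fdr(target_conf, false_target_conf, threshold,
                       fraction=0.75, seed=0):
    """Estimate FDR at a confidence threshold without an internal decoy search.

    target_conf:       ML confidence of the best target result, one per spectrum
    false_target_conf: confidences of non-training target results labeled "false"
    """
    rng = random.Random(seed)
    k = int(round(fraction * len(false_target_conf)))
    pseudo_decoy = rng.sample(false_target_conf, k)
    if not pseudo_decoy:
        return 0.0
    # Rescale so the pseudo-decoy set stands in for one result per spectrum.
    scale = len(target_conf) / len(pseudo_decoy)
    expected_false = scale * sum(c >= threshold for c in pseudo_decoy)
    accepted = sum(c >= threshold for c in target_conf)
    return min(1.0, expected_false / accepted) if accepted else 0.0


# Toy confidences (assumed): 10 spectra, 4 held-out "false" target results.
target_conf = [0.9] * 8 + [0.1] * 2
false_target_conf = [0.1] * 4

fdr_strict = sampled_target_fdr(target_conf, false_target_conf, threshold=0.5)
fdr_loose = sampled_target_fdr(target_conf, false_target_conf, threshold=0.05)
```

With the strict threshold no pseudo-decoy result passes, so the estimate is zero. With the loose threshold every pseudo-decoy result passes and the estimate saturates at one.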

Decoy-free PepArML results
[Figure: results for LCQ, QSTAR, and LTQ-FT instruments; Standard Protein Mix Database, 18 Standard Proteins, Mix1]

Conclusions
- PepArML can significantly boost the number of spectra, peptides, and proteins identified
- Give it a try: free! Nothing to install!
- A common FDR framework facilitates head-to-head comparison of search engines and FDR estimation techniques
- Sampled target results can substitute for decoy results (internally)
- Reduces search time by 33%

Acknowledgements
- Growing list of PepArML users
- Fenselau lab (Maryland)
- Graham lab (JHU)
- Genovese lab (Bologna University, Italy)
- Dr. Brian Balgley, Bioproximity
- Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science
- Funding: NIH/NCI