Meta-Search and Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center.

Presentation transcript:

Meta-Search and Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Peptide Identifications
- Search engines provide an answer for every spectrum...
  - Can we figure out which ones to believe?
- Why is this hard?
  - Hard to determine “good” scores
  - Significance estimates are unreliable
  - Need more ids from weak spectra
  - Each search engine has its strengths and weaknesses
  - Search engines give different answers

Mascot Search Results

Translation start-site correction
- Halobacterium sp. NRC-1
  - Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  - Goo, et al. MCP
- GdhA1 gene: Glutamate dehydrogenase A1
  - Multiple significant peptide identifications
  - Observed start is consistent with Glimmer 3.0 prediction(s)

Halobacterium sp. NRC-1 ORF: GdhA1
- K-score E-value vs 10% FDR
- Many peptides inconsistent with annotated translation start site of NP_

Translation start-site correction

Search engine scores are inconsistent!
[Figure labels: Mascot, Tandem]

Common Algorithmic Framework – Different Results
- Pre-process experimental spectra
  - Charge state, cleaning, binning
- Filter peptide candidates
  - Decide which PSMs to evaluate
- Score peptide-spectrum match
  - Fragmentation modeling, dot product
- Rank peptides per spectrum
  - Retain statistics per spectrum
- Estimate E-values
  - Apply empirical or theoretical model
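The binning and dot-product steps of this framework can be illustrated with a minimal sketch. The function names, the 1.0 m/z bin width, and the cosine normalization are assumptions made for the example; no particular search engine is claimed to score this way.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (m/z, intensity) pairs. bin_width is an assumed value.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)

def normalized_dot(a, b):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical observed peaks and a unit-intensity theoretical spectrum.
observed = bin_spectrum([(175.1, 10.0), (276.2, 5.0)])
theoretical = bin_spectrum([(175.12, 1.0), (276.16, 1.0)])
score = normalized_dot(observed, theoretical)
```

Even in this toy form, the result depends on the bin width and on how intensities are pre-processed, which is one reason engines that share the framework can still return different scores.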

Comparison of search engines
- No single score is comprehensive
- Search engines disagree
- Many spectra lack confident peptide assignment
[Venn diagram: OMSSA, X!Tandem, Mascot; region labels 4%, 10%, 2%, 5%, 9%, 69%, 2%]

Simple approaches (Union)
- Different search engines confidently identify different spectra:
  - Due to search space, spectral processing, scoring, significance estimation
- Filter each search engine's results and union
  - Union of results must be more complete
  - But how to estimate significance for the union?
  - What if the results for same spectra disagree?
- Need to compensate for reduced specificity
  - How much?
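A minimal sketch of the filter-then-union approach follows. The data layout (one top hit per spectrum, an E-value per hit), the 0.01 cutoff, and the peptide strings are all hypothetical. Spectra where the two filtered results disagree are set aside, since the union alone cannot say which one to keep.

```python
from typing import Dict, Tuple

# Hypothetical layout: {spectrum_id: (peptide, e_value)} for each engine.
PSMs = Dict[str, Tuple[str, float]]

def confident(psms: PSMs, max_evalue: float) -> Dict[str, str]:
    """Keep only PSMs at or below an engine-specific E-value cutoff."""
    return {s: pep for s, (pep, e) in psms.items() if e <= max_evalue}

def union_ids(a: Dict[str, str], b: Dict[str, str]):
    """Union of two filtered result sets; conflicting spectra are kept apart."""
    merged, conflicts = {}, {}
    for s in a.keys() | b.keys():
        pa, pb = a.get(s), b.get(s)
        if pa is not None and pb is not None and pa != pb:
            conflicts[s] = (pa, pb)  # both engines are confident but disagree
        else:
            merged[s] = pa if pa is not None else pb
    return merged, conflicts

mascot = {"s1": ("PEPTIDEK", 1e-5), "s2": ("LSSPATLNSR", 0.5), "s3": ("AAGK", 1e-3)}
tandem = {"s1": ("PEPTIDEK", 1e-4), "s2": ("LSSPATLNSR", 1e-3), "s3": ("GGAK", 1e-4)}
merged, conflicts = union_ids(confident(mascot, 0.01), confident(tandem, 0.01))
```

Here `s2` enters the union on the strength of one engine only, which is exactly where specificity is lost and where a significance estimate for the union is needed.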

Union of filtered peptide ids
[Figures, three slides; labels: Mascot, Tandem]

Simple approaches (Intersection)
- Different search engines agree on many spectra
  - Agreement is unexpected given differences
- Filter each search engine's results and take the intersection
  - Intersection of results must be more significant
  - But how to estimate significance for the intersection?
  - What about the borderline spectra?
- Need to compensate for reduced sensitivity
  - How and how much?

Intersection of filtered peptide ids
[Figures, three slides; labels: Mascot, Tandem]

Combine / Merge Results
- Threshold peptide-spectrum matches from each of two search engines
  - PSMs agree → boost specificity
  - PSMs from one → boost sensitivity
  - PSMs disagree → ?????
- Sometimes agreement is "lost" due to threshold...
- How much should agreement increase our confidence?
- Scores easy to "understand"
- Difficult to establish statistical significance
- How to generalize to more engines?
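The cases on this slide can be made concrete with a small sketch that labels one spectrum's pair of top hits. The `(peptide, E-value)` layout, the shared 0.01 cutoff, and the label strings are assumptions for illustration only.

```python
from typing import Tuple

Hit = Tuple[str, float]  # (peptide, E-value); hypothetical layout

def classify(a: Hit, b: Hit, cutoff: float = 0.01) -> str:
    """Label one spectrum's pair of top hits from two search engines."""
    (pep_a, e_a), (pep_b, e_b) = a, b
    pass_a, pass_b = e_a <= cutoff, e_b <= cutoff
    if not (pass_a or pass_b):
        return "unidentified"
    if pep_a == pep_b:
        # Same peptide: either both pass, or the weaker hit fell below the cutoff.
        return "agree" if (pass_a and pass_b) else "agreement lost to threshold"
    if pass_a and pass_b:
        return "disagree"
    return "single engine"
```

The "agreement lost to threshold" case only shows up if the unfiltered top hits are kept; thresholding each engine first and merging afterwards would report it as a single-engine identification.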

Consensus and Multi-Search
- Multiple witnesses increase confidence
  - As long as they are independent
  - Example: Getting the story straight
- Independent "random" hits unlikely to agree
  - Agreement is indication of biased sampling
  - Example: loaded dice
- Meta-search is relatively easy
  - Merging and re-ranking is hard
  - Example: Booking a flight to Boston!
- Scores and E-values are not comparable
  - How to choose the best answer?
  - Example: Best E-value favors Tandem!
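A back-of-the-envelope version of the "loaded dice" point, under the idealizing assumption (not stated in the slides) that each engine's incorrect hit is drawn uniformly and independently from the same N candidate peptides:

```latex
P(\text{two random hits agree}) = \sum_{i=1}^{N} \frac{1}{N}\cdot\frac{1}{N} = \frac{1}{N},
\qquad
P(k \text{ random hits all agree}) = \frac{1}{N^{k-1}}
```

With N = 1000 candidates, two engines would agree by chance about once in a thousand spectra. Real engines see the same spectrum and the same database, so their errors are correlated and the true chance of spurious agreement is higher than this idealized figure.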

Search for Consensus
- Running many search engines is hard!
- Identifications must have every opportunity to agree:
  - No failed searches, matched search parameters, sequence databases, spectra
- But the search engines all use:
  - Varying spectral file formats, different parameter specifications for mass tolerance, modifications, pre-processing for sequence databases, different charge-state handling, termini rules
- Decoy searches must also use identical parameters

Searching for Consensus
- Initial methionine loss as tryptic peptide?
- Missing charge state handling?
- X!Tandem's refinement mode
- Pyro-Gln, Pyro-Glu modifications?
- Precursor mass tolerance (Da vs ppm)
- Semi-tryptic only (no fully-tryptic mode).
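The Da vs ppm item is a unit mismatch: a ppm tolerance scales with precursor mass, while a Da tolerance does not. A small conversion sketch (function names are illustrative):

```python
def ppm_to_da(mass_da: float, ppm: float) -> float:
    """Absolute tolerance in Da equivalent to a ppm tolerance at a given mass."""
    return mass_da * ppm / 1e6

def da_to_ppm(mass_da: float, tol_da: float) -> float:
    """Relative tolerance in ppm equivalent to a Da tolerance at a given mass."""
    return tol_da / mass_da * 1e6
```

For example, 10 ppm is 0.01 Da at 1000 Da but 0.02 Da at 2000 Da, so no single Da setting reproduces a ppm setting across all precursors. Engines configured in different units will therefore consider slightly different candidate peptides.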

Configuring for Consensus
- Search engine configuration can be difficult:
  - Correct spectral format
  - Search parameter files and command-line
  - Pre-processed sequence databases.
- Must strive to ensure that each search engine is presented with the same search criteria, despite different formats, syntax, and quirks.
- Search engine configuration must be automated.

Results Extraction for Consensus
- Must be able to unambiguously extract peptide identifications from results
  - Spectrum identifiers / scan numbers
  - Modification identifiers
  - Protein accessions
- How should we handle E-values vs. probabilities vs. FDR (partitioned)?
  - Cannot rely on these to be comparable
  - Must use consistent, external significance calibration

Search Engine Independent FDR Estimation
- Comparing search engines is difficult due to different FDR estimation techniques
  - Implicit assumption: Spectra scores can be thresholded
- Competitive vs Global
  - Competitive controls some spectral variation
- Reversed vs Shuffled Decoy Sequence
  - Reversed models target redundancy accurately
- Charge-state partition or Unified
  - Mitigates effect of peptide length dependent scores
- What about peptide property partitions?
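The competitive vs global distinction can be sketched as follows. The estimator used here (decoy hits divided by target hits above a score threshold) is one common convention and is an assumption, as are the function names and the toy scores; higher scores are taken to be better.

```python
def global_fdr(target_scores, decoy_scores, threshold):
    """Separate target and decoy searches: count each list independently."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0

def competitive_fdr(paired, threshold):
    """Target and decoy compete per spectrum; only the better hit is counted.

    paired: list of (target_score, decoy_score), one pair per spectrum.
    """
    targets = decoys = 0
    for target_score, decoy_score in paired:
        decoy_wins = decoy_score > target_score
        best = decoy_score if decoy_wins else target_score
        if best >= threshold:
            if decoy_wins:
                decoys += 1
            else:
                targets += 1
    return decoys / targets if targets else 0.0

# Hypothetical scores for five spectra.
paired = [(50, 10), (40, 45), (30, 12), (20, 25), (60, 5)]
g = global_fdr([t for t, _ in paired], [d for _, d in paired], threshold=28)
c = competitive_fdr(paired, threshold=28)
```

On the same five spectra the two conventions give different estimates (0.25 and 1/3), which shows why FDRs reported by different engines are hard to compare unless one external procedure is applied to all of them.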

Search Execution for Consensus
- Running many search engines takes time
  - 7 x 3 searches of the same spectra!
- Some search engines require licenses or specific operating systems
- How to use grid/cloud computing effectively?
  - Cannot assume a shared file-system
  - Search engines may crash or be preempted
  - Machine may "disappear"
  - Machine may consistently fail searches
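One way to cope with crashes and consistently failing machines is to requeue failed searches and retire workers after repeated failures. The sketch below is a serial toy model, not a description of any real scheduler: `execute` is a hypothetical callback, and the round-robin assignment and the failure limit of 2 are assumptions.

```python
from collections import defaultdict, deque

def run_all(jobs, workers, execute, max_worker_failures=2):
    """Run every job to completion, requeueing failures and retiring bad workers.

    execute(worker, job) -> bool is a hypothetical callback: True on success.
    """
    queue = deque(jobs)
    failures = defaultdict(int)   # consecutive failures per worker
    active = list(workers)
    done = {}
    turn = 0
    while queue:
        if not active:
            raise RuntimeError("no healthy workers left")
        worker = active[turn % len(active)]   # simple round-robin assignment
        turn += 1
        job = queue.popleft()
        if execute(worker, job):
            done[job] = worker
            failures[worker] = 0
        else:
            queue.append(job)                 # another worker may succeed
            failures[worker] += 1
            if failures[worker] >= max_worker_failures:
                active.remove(worker)         # machine consistently fails searches
    return done

# Toy run: the worker named "bad" fails every search it is given.
done = run_all(["a", "b", "c", "d"], ["good", "bad"], lambda w, j: w == "good")
```

Because a failed search is requeued instead of dropped, every spectrum chunk still gets a result from every engine, which consensus depends on.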

Combining Multi-Search Results
- Treat search engines as black-boxes
  - Generate PSMs + scores, features
- Apply machine learning / statistical modeling to results
  - Use multiple match metrics
- Combine/refine using multiple search engines
  - Agreement suggests correctness

Machine Learning / Statistical Modeling
- Use of multiple metrics of PSM quality:
  - Precursor delta, trypsin digest features, etc
- Often requires "training" with examples
  - Different examples will change the result
  - Generalization is always the question
- Scores can be hard to "understand"
- Difficult to establish statistical significance
- e.g. PeptideProphet/iProphet
  - Weighted linear combination of features
  - Number of sibling searches

Available Tools
- PeptideProphet/iProphet
  - Part of trans-proteomic-pipeline suite
- Scaffold
  - Commercial reimplementation of PP/iP
- PepArML
  - Publicly available from the Edwards lab
- Lots of in-house stuff…
  - Result combining mentioned in talks, lots of papers, etc. but no public tools

[Figure, credit Brian Searle: For each spectrum, get the Mascot, SEQUEST, and X!Tandem identifications (candidates Peptide 1 through Peptide 8; p=76%, p=81%, p=56%), then compute an agreement score.]
Using the probabilities given by each search engine and the probability of them agreeing, a better peptide ID is made.

PepArML Strategy
- Meta-Search for Multi-Search:
  - Automatic configuration of searches
  - Automatic preprocessing of sequence databases
  - Automatic spectral reformatting
  - Automatic execution of search on local or remote computing resources (AWS/grid/NFS).
- Result Combining:
  - Decoy-based FDR significance estimation
  - Unsupervised, model-free, machine-learning

Peptide Identification Meta-Search
- Simple unified search interface for:
  - Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT+MSSGF
- Automatic decoy searches
- Automatic spectrum file "chunking"
- Automatic scheduling
  - Serial, Multi-Processor, Cluster, Grid, Cloud
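Spectrum file "chunking" might look like the sketch below. It assumes the spectra are in MGF format, where each spectrum sits between `BEGIN IONS` and `END IONS` lines; the slides do not name a format, and the function names and the chunk size of 2 are illustrative.

```python
from typing import Iterable, Iterator, List

def mgf_spectra(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each BEGIN IONS ... END IONS block as a list of lines."""
    block = None
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "BEGIN IONS":
            block = [line]
        elif block is not None:
            block.append(line)
            if line.strip() == "END IONS":
                yield block
                block = None

def chunk(spectra: Iterable[List[str]], size: int) -> Iterator[List[List[str]]]:
    """Group spectra into batches of at most `size`, one batch per search job."""
    batch = []
    for spectrum in spectra:
        batch.append(spectrum)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Five hypothetical spectra split into chunks of two.
demo = []
for i in range(5):
    demo += ["BEGIN IONS", f"TITLE=scan{i}", "PEPMASS=500.25", "100.1 20", "END IONS"]
chunks = list(chunk(mgf_spectra(demo), 2))
```

Each chunk can then be reformatted and submitted as an independent search, which is what lets the work spread across heterogeneous machines.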

Grid-Enabled Peptide Identification Meta-Search
[Diagram labels: Amazon Web Services, University Cluster, Edwards Lab Scheduler & 80+ CPUs]
- Secure communication
- Heterogeneous compute resources
- Single, simple search request
- Scales easily to 250+ simultaneous searches

PepArML Combiner
- Peptide identification arbiter by machine learning
- Unifies these ideas within a model-free, combining machine learning framework
- Unsupervised training procedure

PepArML Overview
[Diagram labels: X!Tandem, Mascot, OMSSA, Other, Feature extraction, PepArML]

Dataset Construction
[Diagram labels: X!Tandem, Mascot, OMSSA; T/F labels]

Voting Heuristic Combiner
- Choose PSM with most votes
- Break ties using FDR
  - Select PSM with min. FDR of tied votes
- How to apply this to a decoy database?
  - Lots of possibilities – all imperfect
  - Now using: 100*#votes – min. decoy hits
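The scoring rule "100*#votes – min. decoy hits" can be sketched as below. The input layout and the reading of "decoy hits" (the number of decoy PSMs an engine scores at least as well as this hit) are assumptions for illustration; the slide gives only the formula.

```python
from collections import defaultdict

def vote(hits):
    """Pick one peptide for a spectrum by votes, using decoy hits as tie-breaker.

    hits: list of (engine, peptide, decoy_hits) for one spectrum, where
    decoy_hits is assumed to count decoy PSMs scoring at least as well.
    """
    decoys_by_peptide = defaultdict(list)
    for _engine, peptide, decoy_hits in hits:
        decoys_by_peptide[peptide].append(decoy_hits)
    scored = {
        pep: 100 * len(decoys) - min(decoys)   # 100 * #votes - min. decoy hits
        for pep, decoys in decoys_by_peptide.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]
```

With two votes for PEPK (3 and 7 decoy hits) against one for AAGK (0 decoy hits), PEPK wins with 197 to 100. With one vote each, the peptide with fewer decoy hits wins. The factor of 100 lets votes dominate only while the minimum decoy count stays below 100.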

Supervised Learning

Search Engine Info. Gain

Precursor & Digest Info. Gain

Retention Time & Proteotypic Peptide Properties Info. Gain
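For reference, information gain in its standard entropy-based form can be computed as in the sketch below. Whether the slides used exactly this definition is an assumption, and the toy features and labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy PSM labels (True = correct) and two hypothetical binary features.
labels = [True, True, False, False]
perfect = info_gain([1, 1, 0, 0], labels)   # feature separates the classes
useless = info_gain([1, 0, 1, 0], labels)   # feature carries no information
```

A feature that separates correct from incorrect PSMs perfectly scores 1 bit here, and one that is unrelated to correctness scores 0. Continuous features such as precursor delta would need to be discretized first.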

Application to Real Data
- How well do these models generalize?
- Different instruments
  - Spectral characteristics change scores
- Search parameters
  - Different parameters change score values
- Supervised learning requires
  - (Synthetic) experimental data from every instrument
  - Search results from available search engines
  - Training/models for all parameters x search engine sets x instruments

Model Generalization

Unsupervised Learning

Unsupervised Learning Performance

Unsupervised Learning Convergence

PepArML Performance
[Figure panels: LCQ, QSTAR, LTQ-FT; Standard Protein Mix Database, 18 Standard Proteins – Mix1]

Conclusions
- Combining search results from multiple engines can be very powerful
  - Boost both sensitivity and specificity
- Running multiple search engines is hard
- Statistical significance is hard
  - Use empirical FDR estimates...but be careful...lots of subtleties
- Consensus is powerful, but fragile
  - Search engine quirks can destroy it
  - "Witnesses" are not independent