Algorithms and Computation: Bottom-Up Data Analysis Workflows

Nathan Edwards Georgetown University Medical Center

Changing landscape
  Experimental landscape
    Spectra, sensitivity, resolution, samples
  Computational landscape
    Data-size, cloud, resources, reliability
Data-size and false positive identifications
  Controlling for false proteins/genes
Improving peptide identification sensitivity
  Machine-learning, multiple search engines
Filtered PSMs as a primary data-type

Changing Experimental Landscape
Instruments are faster…
  More spectra, better precursor sampling
Sensitivity improvements…
  More fractionation (automation), deeper precursor sampling, ion optics
Resolution continues to get better…
  Accurate precursors (fragments) make a big difference
Analytical samples per study…
  Fractionation, chromatography, automation improvements

Clinical Proteomic Tumor Analysis Consortium (NCI)
Comprehensive study of genomically characterized (TCGA) cancer biospecimens by mass-spectrometry-based proteomics workflows
~100 clinical tumor samples per study
  Colorectal, breast, ovarian cancer
CPTAC Data Portal provides
  Raw & mzML spectra; TSV and mzIdentML PSMs; protein reports; experimental meta-data

CPTAC Data Portal
…from Edwards et al., Journal of Proteome Research, 2015

CPTAC/TCGA Colorectal Cancer (Proteome)
Vanderbilt PCC (Liebler)
95 TCGA samples, 15 fractions / sample
Label-free spectral count / precursor XIC quant.
Orbitrap Velos; high-accuracy precursor
1425 spectra files
  ~600 GB / ~129 GB (mzML.gz)
Spectra: ~18M; ~13M MS/MS
~4.6M PSMs at 1% MSGF+ q-value

Changing Computational Landscape
A single computer operating on a single spectral data-file is no longer feasible
  MS/MS search is the computational bottleneck
Private computing clusters are quickly obsolete
  Need $$ to upgrade every 3-4 years
  Personnel costs for cluster administration and management
Cloud computing gets faster and cheaper over time…
  …but requires rethinking the computing model

PepArML Meta-Search Engine
Simple, unified, peptide identification search parameterization and execution:
  Mascot, MSGF+, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch
Cluster, grid, and cloud scheduler:
  Reliable batch spectra conversion and upload
  Automated distribution of spectra and sequence
  Job-failure tolerant with result-file validation
Machine-learning-based result combining:
  Model-free – heterogeneous features
  Adapts to the characteristics of each dataset

PepArML Meta-Search Engine
[Architecture diagram labels: Single, simple search request; Edwards Lab Scheduler & 48+ CPUs; Secure communication; Heterogeneous compute resources; Georgetown & Maryland HPC; Amazon Web Services]
Under the hood, the user interacts with a scheduler, uploading spectra and specifying a meta-search. The compute clients, local or remote, contact the scheduler to get spectra and jobs to compute.
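
The pull model described above can be sketched as follows. This is a minimal in-process illustration and not PepArML's actual scheduler: the class name `Scheduler`, its methods, the `(spectra file, engine)` job layout, and the `"complete"` flag used for result validation are all hypothetical. Real clients would contact the scheduler over a secure network connection rather than call it directly.

```python
from collections import deque


class Scheduler:
    """Hypothetical in-memory scheduler: clients pull jobs, failed jobs are requeued."""

    def __init__(self, spectra_files, engines):
        # One job per (spectra file, search engine) pair.
        self.pending = deque((f, e) for f in spectra_files for e in engines)
        self.running = set()
        self.done = {}

    def next_job(self):
        """Called by a compute client asking for work; returns None when no job is pending."""
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.running.add(job)
        return job

    def report(self, job, result):
        """Accept a result only if it validates; otherwise put the job back in the queue."""
        self.running.discard(job)
        if result is not None and result.get("complete"):
            self.done[job] = result
        else:
            self.pending.append(job)


def run_client(scheduler, search, max_jobs=1000):
    """A compute client loop: pull a job, run the search, report back."""
    for _ in range(max_jobs):
        job = scheduler.next_job()
        if job is None:
            break
        try:
            result = search(*job)
        except Exception:
            result = None  # a crash or preemption is treated as a failed job
        scheduler.report(job, result)
```

Because clients ask for work instead of being assigned it, a client that disappears simply stops asking. In this sketch only a reported failure is requeued; a real scheduler would also need a timeout for jobs that never report.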

Run all of the search engines!

Search Engine Running Time
Which (combination of) search engine(s) should I use?

Fault Tolerant Computing
Spot instances can be preempted in favor of users willing to pay more
Spot prices are cheaper (7¢/hour vs 46¢/hour)
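
The price gap quoted above leaves a lot of room for redone work. The arithmetic below is a rough sketch: the two hourly prices come from the slide, while the `rework_fraction` parameter (extra compute hours lost to preemption) and the function names are assumptions made for illustration.

```python
SPOT_CENTS_PER_HOUR = 7        # from the slide
ON_DEMAND_CENTS_PER_HOUR = 46  # from the slide


def spot_cost_cents(compute_hours, rework_fraction):
    """Cost on spot instances when part of the work must be redone.

    rework_fraction is a hypothetical parameter: 0.5 means 50% extra
    compute hours are spent repeating preempted jobs.
    """
    return SPOT_CENTS_PER_HOUR * compute_hours * (1 + rework_fraction)


def on_demand_cost_cents(compute_hours):
    """Cost on on-demand instances, assuming no work is ever repeated."""
    return ON_DEMAND_CENTS_PER_HOUR * compute_hours


# Spot stays cheaper until total hours (useful + redone) exceed
# 46/7 times the useful hours, i.e. until rework_fraction passes 46/7 - 1.
break_even_rework = ON_DEMAND_CENTS_PER_HOUR / SPOT_CENTS_PER_HOUR - 1
```

Under these assumptions, preempted work would have to be repeated more than five times over before on-demand pricing becomes the cheaper option.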

Identifications per $$
How long will a specific job take?
How much memory / data-transfer is needed?
What is a good decomposition size?
What cloud-instance to use?
Wall-clock time can be significantly reduced:
  …but management overhead costs too.
  Cost of total compute may even increase.
  Failed analyses cost too!
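
A toy cost model makes the decomposition trade-off concrete. Everything here beyond the 7¢/hour spot price is an assumption: a fixed per-job overhead (`overhead_hours`, standing in for start-up and data transfer), billing rounded up to whole instance-hours, and every chunk running concurrently on its own instance.

```python
import math


def decomposition_tradeoff(total_hours, n_chunks, overhead_hours=0.25,
                           cents_per_hour=7, bill_whole_hours=True):
    """Return (wall_clock_hours, total_cost_cents) for an even split into n_chunks.

    Assumes all chunks run at the same time, one per instance, and that each
    job pays the same fixed overhead. These are illustrative assumptions.
    """
    per_job = total_hours / n_chunks + overhead_hours
    billed = math.ceil(per_job) if bill_whole_hours else per_job
    return per_job, n_chunks * billed * cents_per_hour


for n in (1, 10, 100, 1000):
    wall, cost = decomposition_tradeoff(100, n)
    print(f"{n:5d} chunks: wall-clock {wall:7.2f} h, cost $ {cost / 100:7.2f}")
```

In this model, finer decomposition keeps shrinking wall-clock time while the overhead and rounding terms push the total bill up, which is the tension the slide points at.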

Data-Scale and False Positives
Big datasets have more false positive proteins and genes!
CPTAC Colorectal Cancer (CDAP)
  4.6M MSGF+ 1% FDR PSMs + 2 peptides/gene
  ~10,000 genes identified…
  …but ~40% gene FDR

Simple decoy protein model
Decoy peptides hit decoy proteins uniformly.
Each decoy peptide represents an independent trial.
Binomial distribution, parameterized by:
  size of protein database
  number of decoy peptides
Big datasets have more decoy peptides!
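
The model above can be written out directly. In this sketch each of n decoy peptides lands on one of N proteins uniformly and independently, so the number landing on any given protein follows Binomial(n, 1/N). The function names and the `min_peptides` threshold parameter are illustrative choices, not taken from the slides.

```python
from math import comb


def prob_at_least(min_peptides, n_decoy_peptides, n_proteins):
    """P(a given protein receives >= min_peptides decoy peptides) under Binomial(n, 1/N)."""
    p = 1.0 / n_proteins
    below = sum(
        comb(n_decoy_peptides, j) * p**j * (1 - p) ** (n_decoy_peptides - j)
        for j in range(min_peptides)
    )
    return 1.0 - below


def expected_decoy_proteins(min_peptides, n_decoy_peptides, n_proteins):
    """Expected number of proteins passing a min_peptides rule on decoy hits alone."""
    return n_proteins * prob_at_least(min_peptides, n_decoy_peptides, n_proteins)


# Hold the database size fixed and grow the number of decoy peptides
# (the sizes below are arbitrary illustration values).
for n_decoys in (250, 2_500, 25_000):
    print(n_decoys, round(expected_decoy_proteins(2, n_decoys, 20_000), 1))
```

With the database size fixed, the expected number of proteins supported by two or more decoy peptides grows much faster than the decoy peptide count itself, which is why a two-peptide rule that works on a small dataset can fail on a large one.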

Example
Large: 10,000 proteins, 100,000 peptides
Small: 1,000 proteins, 10,000 peptides

Data-Size and False Positives
CPTAC Colorectal Cancer
  1% FDR PSMs, but ~25% peptide FDR
  ~25,000 decoy peptides on ~20,000 genes
Control of gene FDR requires even more stringent filtering of PSMs.
If we require strong evidence in all 95 samples:
  No decoy genes, but fewer than 1,000 genes identified.
Bad scenario: PDHA1 and PDHA2 in CPTAC Breast Cancer – shared and unique peptides
  PDHA2 is testis-specific!
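
The gap between PSM-level and gene-level FDR can be illustrated with a toy target-decoy count. This is a sketch under simple assumptions: FDR is estimated as decoys divided by targets at each level, target and decoy identifiers never collide, and the `(peptide, gene, is_decoy)` tuple layout and the example data are invented for illustration.

```python
def level_fdrs(psms):
    """Estimate decoy/target ratios at the PSM, peptide, and gene level.

    psms: iterable of (peptide, gene, is_decoy) tuples for already-filtered PSMs.
    """
    psms = list(psms)

    def ratio(flags):
        decoys = sum(1 for is_decoy in flags if is_decoy)
        targets = len(flags) - decoys
        return decoys / targets if targets else float("nan")

    psm_flags = [d for _, _, d in psms]
    # Collapse repeated observations: each distinct peptide or gene counts once.
    peptide_flags = [d for _, d in {(pep, d) for pep, _, d in psms}]
    gene_flags = [d for _, d in {(gene, d) for _, gene, d in psms}]
    return ratio(psm_flags), ratio(peptide_flags), ratio(gene_flags)


# Toy data: true peptides are seen many times, a decoy peptide only once.
toy = (
    [("PEPA", "G1", False)] * 50
    + [("PEPB", "G1", False)] * 49
    + [("XXXA", "D1", True)]
)
psm_fdr, peptide_fdr, gene_fdr = level_fdrs(toy)
```

In the toy data, true peptides are observed repeatedly while the decoy appears once, so collapsing PSMs to peptides and then to genes removes far more targets than decoys and the estimated error rate climbs at each step.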

Improved Sensitivity
Machine-learning models
  Use additional metrics for good identifications
Combining multiple search engines
  Agreement indicates good identifications
Both approaches successful at boosting ids, particularly when adaptable to each dataset.
  Watch for the use of decoys in training the model.
Both have scaling issues and lack transparency
  …may add noise to comparisons
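
As a rough illustration of "agreement indicates good identifications", the sketch below counts how many engines report the same top peptide for each spectrum. This is plain vote counting, not PepArML's machine-learning combiner; the input layout and the function name are assumptions.

```python
from collections import Counter, defaultdict


def agreement_votes(results):
    """Count engine agreement per spectrum.

    results: {engine_name: {spectrum_id: top_peptide}}
    Returns {spectrum_id: (peptide, n_engines)} for the most-voted peptide.
    """
    votes = defaultdict(Counter)
    for engine, top_hits in results.items():
        for spectrum_id, peptide in top_hits.items():
            votes[spectrum_id][peptide] += 1
    return {sid: counts.most_common(1)[0] for sid, counts in votes.items()}


# Invented example: three engines agree on s1 and disagree on s2.
example = {
    "MSGF+": {"s1": "PEPTIDEK", "s2": "AAAK"},
    "X!Tandem": {"s1": "PEPTIDEK", "s2": "CCCK"},
    "OMSSA": {"s1": "PEPTIDEK"},
}
consensus = agreement_votes(example)
```

The vote count could serve as one feature among several in a learned model. Used alone, it ignores score strength and treats all engines as equally reliable.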

PepArML Performance
[Figure panels: LCQ, QSTAR, LTQ-FT; Standard Protein Mix Database, 18 Standard Proteins – Mix1]

Search Engine Info. Gain

Precursor & Digest Info. Gain

Filtered PSMs as Primary Data
For large enough spectral datasets, we might choose best effort peptide identification
  Filtered PSMs become primary data
  Spectral counts become more quantitative
  Need linear-time spectra → PSM algorithm
We work less hard to identify all spectra?
  Output as genome alignments, BAM files?
How should PSMs be represented to maximize their utility?
What about decoy peptide identifications?
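
One possible way to treat filtered PSMs as primary data is sketched below: a flat record per PSM and a single pass that turns filtered records into spectral counts. The `PSM` fields, the 1% q-value default, and the choice to tally decoys separately are all assumptions, and the sketch does not address the linear-time spectra → PSM search itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """Hypothetical minimal PSM record; field names are illustrative."""
    spectrum_id: str
    sample: str
    peptide: str
    gene: str
    q_value: float
    is_decoy: bool


def spectral_counts(psms, max_q=0.01):
    """Single pass over PSMs: count filtered PSMs per (sample, gene).

    Decoy PSMs are kept in a separate tally rather than discarded,
    so downstream error estimates remain possible.
    """
    targets, decoys = Counter(), Counter()
    for psm in psms:
        if psm.q_value > max_q:
            continue
        bucket = decoys if psm.is_decoy else targets
        bucket[(psm.sample, psm.gene)] += 1
    return targets, decoys
```

For example, calling `spectral_counts` on a list of `PSM` records returns two counters keyed by `(sample, gene)`, one for target PSMs and one for decoy PSMs.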

Nascent polypeptide-associated complex subunit alpha

Pyruvate kinase isozymes M1/M2
2.5 × 10⁻⁵

Questions?
