J. Proteome Res., 2011, 10 (1), pp 153–160 DOI: 10.1021/pr100677g.



3 J. Proteome Res., 2011, 10 (1), pp 153–160 DOI: 10.1021/pr100677g


5 Editorial: “There has been an unprecedented improvement in the quality and quantity of commercial proteomics data generation technologies, making data generation more accessible to many researchers. However, more and more discoveries will be led by researchers in command of the skills necessary to mine and extensively interpret the volumes of data. Already the ability to generate data vastly outpaces our ability to interpret it, and the lack of expertise in interpreting data is the current gating factor in the advancement of proteomics sciences. Proteomics scientists with training solely in data generation techniques will be shut out of more and more research opportunities.” Nuno Bandeira, July 2011

Edwards AM, Nature, Feb 2011


9 pcarvalho.com

10 [Figure: axes Mass / Charge and Time]

[Figure sequence: the doubly charged precursor AFYLK (NH2 terminus to COOH terminus) on an m/z axis, with B (N-terminal) and Y (C-terminal) fragment pairs added one cleavage at a time: A | FYLK, AF | YLK, AFY | LK, AFYL | K]
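The fragment ladder for AFYLK can be sketched in a few lines of Python. This is a minimal illustration, not any search engine's code: the function names `by_ions` and `precursor_mz` are hypothetical, the residue masses are standard monoisotopic values, and fragments are assumed to be singly charged.

```python
# Monoisotopic residue masses in Da (only the residues needed for AFYLK).
RESIDUE_MASS = {"A": 71.03711, "F": 147.06841, "Y": 163.06333,
                "L": 113.08406, "K": 128.09496}
WATER = 18.01056   # added to every C-terminal (y) fragment and to the intact peptide
PROTON = 1.00728   # one proton per positive charge


def by_ions(peptide):
    """Return (label, fragment, m/z) for singly charged b and y ions (hypothetical helper)."""
    ions = []
    for i in range(1, len(peptide)):
        prefix, suffix = peptide[:i], peptide[i:]
        b_mz = sum(RESIDUE_MASS[aa] for aa in prefix) + PROTON
        y_mz = sum(RESIDUE_MASS[aa] for aa in suffix) + WATER + PROTON
        ions.append((f"b{i}", prefix, round(b_mz, 3)))
        ions.append((f"y{len(suffix)}", suffix, round(y_mz, 3)))
    return ions


def precursor_mz(peptide, charge=2):
    """m/z of the intact peptide carrying `charge` protons."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return round((mass + charge * PROTON) / charge, 3)


for label, fragment, mz in by_ions("AFYLK"):
    print(label, fragment, mz)
print("precursor 2+", precursor_mz("AFYLK"))
```

Each cleavage site yields one b/y pair, so the five-residue peptide gives four pairs, matching the four splits in the figure sequence.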


17 [Figure: MS/MS spectrum, intensity vs. m/z, labeled with single-letter amino acid codes]

18 Na S et al., MCP, 2008


20 ProLuCID, X!Tandem, OMSSA, Andromeda, SEQUEST, Mascot, …

Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA NPC Progress Meeting (February 2nd, 2006) Illustrated by Toni Boudreault

B-type, A-type, and Y-type ions (spectrum: intensity vs. m/z). All these peaks are seen together simultaneously, and we don’t even know …

… what type of ion they are, making the mass differences approach even more difficult. Finally, as with all analytical techniques, …

… there’s noise, producing a final spectrum that looks like …

… this, on a good day. And so it’s actually fairly difficult to …


Known Ion Types: B-type ions, A-type ions, Y-type ions. We know a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but …

Known Ion Types: B-type ions (100%), A-type ions (20%), Y-type ions (100%), B- or Y-type +2H ions (50%), B- or Y-type -NH3 ions (20%), B- or Y-type -H2O ions (20%). … we also know the likelihood of seeing each type of ion, where generally B and Y ions are most prominent.

If we know the amino acid sequence of a peptide, we can guess what the spectra should look like! So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

Model Spectrum: ELVISLIVESK (courtesy of Dr. Richard Johnson). So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.

Model Spectrum. We can create a hypothetical spectrum based on our rules.

Model Spectrum. B/Y-type ions (100%); B/Y +2H-type ions (50%); A-type ions and B/Y -NH3/-H2O ions (20%). Here B and Y ions are estimated at 100%, +2H ions are estimated at 50%, and other stragglers are at 20%.
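As a rough sketch of how such a rule-based model spectrum could be generated for ELVISLIVESK, consider the Python below. It is an illustration under stated assumptions, not the method used to draw the slide: the function name `model_spectrum` is hypothetical, "+2H" is interpreted as the doubly protonated version of each B or Y fragment, A-type ions are taken as B ions minus CO, and masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (only the residues in ELVISLIVESK).
RESIDUE_MASS = {"E": 129.04259, "L": 113.08406, "V": 99.06841,
                "I": 113.08406, "S": 87.03203, "K": 128.09496}
WATER, AMMONIA, CO, PROTON = 18.01056, 17.02655, 27.99491, 1.00728


def model_spectrum(peptide):
    """Map predicted m/z (2 decimals) to relative intensity in percent (hypothetical helper)."""
    peaks = {}

    def add(mz, intensity):
        key = round(mz, 2)
        # If two predicted ions land on the same m/z, keep the stronger one.
        peaks[key] = max(peaks.get(key, 0), intensity)

    for i in range(1, len(peptide)):
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        for mz in (b, y):
            add(mz, 100)                  # B/Y-type ions
            add((mz + PROTON) / 2, 50)    # assumed meaning of +2H: doubly charged
            add(mz - AMMONIA, 20)         # -NH3 loss
            add(mz - WATER, 20)           # -H2O loss
        add(b - CO, 20)                   # A-type ion
    return peaks


spectrum = model_spectrum("ELVISLIVESK")
for mz in sorted(spectrum):
    print(f"{mz:9.2f}  {spectrum[mz]:3d}%")
```

The fixed 100/50/20 intensities come straight from the slide; real fragment intensities depend on sequence and instrument, which is why this is only a model.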

Model Spectrum. So if we consider the spectrum that was derived from the ELVIS LIVES K peptide …

Model Spectrum. We can find where the overlap is between the hypothetical and the actual spectra …

Model Spectrum. And say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.

Sequencing Explosion: 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced; Yeast Genome project announced; 1990 Human Genome project announced; 1992 First chromosome (Yeast) sequenced; 1995 H. influenzae sequenced; 1996 Yeast Genome sequenced; 2000 Human Genome draft. (Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5) In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was …

SEQUEST: … instead of searching all possible peptide sequences, search only those in genome databases. Now, in the post-genomic world this seems like a pretty trivial idea, but back then a lot was riding on the assumption that we’d actually have a complete human genome in a reasonable amount of time.

SEQUEST Model Spectrum. For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra. Like so.

SEQUEST Model Spectrum. And then they shifted the spectra back and …

SEQUEST Model Spectrum. … forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background.

SEQUEST XCorr. [Figure: correlation score vs. offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background); Gentzel M. et al., Proteomics 3 (2003)] This is another representation of the Cross Correlation and the Auto Correlation.

SEQUEST XCorr. [Same figure, with the XCorr definition] The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range. The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.
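The idea of comparing a direct overlap against a shifted background can be sketched as follows. This is a toy illustration, not SEQUEST's implementation: spectra are assumed to be pre-binned into 1 AMU bins, the function names are hypothetical, and the sketch compares the zero-offset value with the background by subtraction, which is an assumption about the exact arithmetic (the narration above describes a ratio). Either way, a high score means the direct comparison stands out from the background.

```python
def correlation(observed, model, offset):
    """Dot product of the observed spectrum with the model shifted by `offset` bins."""
    total = 0.0
    for i, intensity in enumerate(observed):
        j = i + offset
        if 0 <= j < len(model):
            total += intensity * model[j]
    return total


def xcorr(observed, model, max_offset=75):
    """Zero-offset correlation minus the mean correlation over the nonzero offsets."""
    direct = correlation(observed, model, 0)
    offsets = [t for t in range(-max_offset, max_offset + 1) if t != 0]
    background = sum(correlation(observed, model, t) for t in offsets) / len(offsets)
    return direct - background


def binned(peaks, size=200):
    """Build a 0/1 intensity vector with a peak in each listed bin (toy data only)."""
    return [1.0 if i in peaks else 0.0 for i in range(size)]


observed = binned({20, 50, 100, 150})       # three real peaks plus one noise peak
right_model = binned({50, 100, 150})        # hypothetical spectrum of the right peptide
wrong_model = binned({60, 110, 160})        # hypothetical spectrum of a wrong peptide

print("right peptide:", xcorr(observed, right_model))
print("wrong peptide:", xcorr(observed, wrong_model))
```

With `max_offset=75` the background is averaged over 150 shifted positions, in line with the 150 AMU range mentioned in the narration.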

SEQUEST DeltaCn. And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it. The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
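A small sketch of that calculation, assuming the commonly used normalized-difference form (the slide itself does not spell out the formula, and the function name `delta_cn` is hypothetical):

```python
def delta_cn(xcorrs):
    """Relative gap between the best and second-best XCorr for one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0  # no runner-up, or no positive best score, to compare against
    return (ranked[0] - ranked[1]) / ranked[0]


# Hypothetical XCorr values for the candidate peptides of a single spectrum.
print(delta_cn([3.0, 2.4, 1.0]))
```

A value near 0 means the runner-up scored almost as well as the top hit, so the top hit is less trustworthy; a value near 1 means the top hit stands alone.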

44 * Show an MS2 file

45 ProLuCID is a fast and sensitive tandem mass spectra-based protein identification program recently developed in the Yates laboratory at The Scripps Research Institute.

46 Show ProLuCID Runner (Carvalho PC et al., unpublished)

Workflow: MS + Database → Search Engine (e.g. ProLuCID, SEQUEST, etc.) → PSM


50 In the beginning… [Table columns: spectrum, scores, protein, peptide; sort by match score] Thresholds: SEQUEST XCorr > 2.5 and dCn > 0.1; Mascot Score > 45; X!Tandem Score < 0.01. Spectra were sorted according to some score and then a threshold value was set. Different programs have different scoring schemes, so SEQUEST, Mascot, and X!Tandem use different thresholds. Different thresholds may also be needed for different charge states, sample complexity, and database size.
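The threshold model amounts to a per-engine rule table. A minimal sketch, using the cutoffs shown on the slide; the record layout, the score key names, and the function name `accept` are hypothetical:

```python
# One acceptance rule per search engine, using the cutoffs from the slide.
THRESHOLDS = {
    "SEQUEST": lambda s: s["xcorr"] > 2.5 and s["dcn"] > 0.1,
    "Mascot": lambda s: s["score"] > 45,
    "X!Tandem": lambda s: s["score"] < 0.01,   # lower is better for this score
}


def accept(engine, scores):
    """Return True if a peptide-spectrum match passes its engine's fixed threshold."""
    return THRESHOLDS[engine](scores)


# Made-up matches, for illustration only.
psms = [
    {"spectrum": "scan_101", "engine": "SEQUEST", "scores": {"xcorr": 3.1, "dcn": 0.25}},
    {"spectrum": "scan_102", "engine": "SEQUEST", "scores": {"xcorr": 3.1, "dcn": 0.05}},
    {"spectrum": "scan_103", "engine": "Mascot", "scores": {"score": 52}},
    {"spectrum": "scan_104", "engine": "X!Tandem", "scores": {"score": 0.2}},
]
accepted = [p["spectrum"] for p in psms if accept(p["engine"], p["scores"])]
print(accepted)
```

Because each engine needs its own rule, and the same cutoffs may not suit other charge states or databases, scores filtered this way are hard to compare across engines.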

51 There has to be a better way. The threshold model has these problems, which PeptideProphet, DTASelect and others try to solve:
Poor sensitivity/specificity trade-off, unless you consider multiple scores simultaneously.
No way to choose an error rate (p=0.05).
Need to have different thresholds for:
– different instruments (QTOF, TOF-TOF, IonTrap)
– ionization sources (electrospray vs MALDI)
– sample complexities (2D gel spot vs MudPIT)
– different databases (SwissProt vs NR)
Impossible to compare results from different search algorithms, multiple instruments, and so on.

52 Creating a discriminant score. [Table columns: spectrum, scores, protein, peptide; sort by match score] PeptideProphet starts with a discriminant score. If an application uses several scores (SEQUEST uses XCorr, ΔCn, and Sp scores; Mascot uses ion scores plus identity and homology thresholds), these are first converted to a single discriminant score.

53 Scaffold :: Proteome Software

54 [Figure: sensitivity vs. error, with the ideal corner labeled “correctly identifies everything, with no error”; Keller et al., Anal Chem 2002] This graph shows the trade-offs between the errors (false identifications) and the sensitivity (the percentage of possible peptides identified). The ideal is zero error and everything identified (sensitivity = 100%). PeptideProphet corresponds to the curved line. Squares 1–5 are thresholds chosen by other authors.

55 Mixture of distributions. [Figure: histogram of the number of spectra in each bin vs. discriminant score (D), with “correct” and “incorrect” distributions] This histogram shows the distributions of correct and incorrect matches. PeptideProphet assumes that these distributions are standard statistical distributions. Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
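Once the two distributions are fitted, Bayes' rule turns a discriminant score into a probability of being correct. The sketch below is a simplified illustration and not PeptideProphet's code: both components are assumed to be Gaussian for simplicity, the fitted parameters and the prior are made-up numbers, and the curve-fitting step itself is omitted.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def probability_correct(d, prior_correct, correct=(4.0, 1.5), incorrect=(-1.0, 1.0)):
    """Posterior probability that a match with discriminant score d is correct.

    `correct` and `incorrect` are (mean, sd) pairs for the two fitted components;
    the defaults are hypothetical values chosen only for this example.
    """
    pos = prior_correct * normal_pdf(d, *correct)
    neg = (1.0 - prior_correct) * normal_pdf(d, *incorrect)
    return pos / (pos + neg)


# Assume 30% of all matches are correct (hypothetical mixing proportion).
for d in (-1.0, 1.0, 2.0, 5.0):
    print(d, round(probability_correct(d, prior_correct=0.3), 4))
```

Unlike a fixed threshold, the result is a probability, so a user can pick a cutoff that corresponds to a chosen error rate.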

56 Decoy strategy for FDR: target sequences + labeled decoys → search result. Elias and Gygi, Nature Methods, 2007
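A minimal sketch of how decoy hits can be turned into an FDR estimate. The function names and the example scores are hypothetical, and the estimate used here (decoy hits divided by target hits above the cutoff) is one common convention, stated as an assumption rather than as the exact formula from the cited paper.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among matches scoring >= threshold.

    `psms` is a list of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold(psms, max_fdr):
    """Lowest score cutoff whose estimated FDR is at or below max_fdr (None if none)."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
    return best


# Made-up search result: True marks a hit to a labeled decoy sequence.
psms = [(9.0, False), (8.0, False), (7.0, False), (6.0, True),
        (5.0, False), (4.0, False), (3.0, True), (2.0, True)]

print(fdr_at_threshold(psms, 4.0))
print(lowest_threshold(psms, 0.25))
```

Because decoy sequences cannot be true matches, the number of decoy hits above a cutoff serves as a stand-in for the number of false target hits above the same cutoff.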

57 SVM - example

58 Summary: “The use of iProphet in the TPP increases the number of correctly identified peptides at a constant false discovery rate (FDR) as compared to both PeptideProphet and another state-of-the art tool Percolator.”

59 Maximizing proteins under a given FDR


61 New FDR strategy: target sequences + labeled decoys + unlabeled decoys (U-Decoy) → search result

Table I.
Tool | Spectra | Peptides | Proteins (FDR) | UL FDR
SEPro | 104,654 | 17, | (0.9%) | 1%
Scaffold | 88,970 | 15,406 | 1,160 (2.3%) | 2%
Scaffold A refers to a 99% confidence level for proteins, 95% for peptides. Scaffold B refers to 95 and 80%, respectively, for proteins and peptides.


Thermo

Picture from Strassberger et al, JOP, 2010. * Search for examples in Xcalibur. Scan. How to deal with different charge states? Subject to random sampling; what are its implications?

74 Differential Analysis: Marginal Cases (found in only 1 condition); Differential (found in both)

75 Venn Diagrams of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 and B2. Panels A, B, and C consider only proteins that appeared in one or more, two or more, or in all three replicates, respectively.
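The three panels correspond to three replicate-count filters applied before the two states are compared. A minimal sketch with made-up protein identifiers; the function names `proteins_seen_in` and `venn` are hypothetical:

```python
from collections import Counter


def proteins_seen_in(replicates, min_replicates):
    """Proteins identified in at least `min_replicates` of the given replicate sets."""
    counts = Counter(protein for rep in replicates for protein in set(rep))
    return {protein for protein, n in counts.items() if n >= min_replicates}


def venn(state1, state2):
    """Split two protein sets into the three regions of a two-circle Venn diagram."""
    return {"only_B1": state1 - state2,
            "both": state1 & state2,
            "only_B2": state2 - state1}


# Three replicates per biological state (illustrative identifiers only).
b1 = [{"P1", "P2", "P3"}, {"P1", "P2"}, {"P1", "P4"}]
b2 = [{"P1", "P5"}, {"P1", "P5", "P2"}, {"P1", "P6"}]

for panel, k in (("A", 1), ("B", 2), ("C", 3)):
    regions = venn(proteins_seen_in(b1, k), proteins_seen_in(b2, k))
    print(panel, {name: sorted(members) for name, members in regions.items()})
```

Raising the replicate requirement shrinks the state-specific regions, which illustrates how sensitive the "found in only one condition" lists are to random sampling.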

76 Venn Diagrams of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 (A) and B2 (B). R1, R2, and R3 refer to the replicates from each state.

77 What proteins can be considered as statistically different for marginal cases?

Venn Diagram of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 and B2. Proteins that could not be statistically claimed to be differentially expressed in one of the two states according to the proposed Bayesian approach (those for which p-value > 0.05) were automatically filtered out during the generation of the Venn Diagram. Carvalho PC et al; Bioinformatics 2011

79 Differential Analysis: Marginal Cases (found in only 1 condition); Differential (found in both)

80 Traditional strategy: Data Dependent Analysis (DDA). New strategy: Extended Data Independent Analysis (XDIA).


DDA and XDIA

[Figure: peptides AAA and BBBB; axes: Time and Peptide Mass]


Pinpoint differentially expressed proteins; Venn Diagrams; Gene Ontology Analysis; Find trends in time-course experiments. Carvalho PC et al., Current Protocols in Bioinformatics, 2010


Finding Statistically Differentially Expressed Proteins / Data Analysis: PatternLab for proteomics (Trends, Venn Diagrams, Differential Statistics, Gene Ontology Analysis, etc.)
Protein Quantitation: Search Engine Processor / SEProQ
Protein Identification / Quality control: ProLuCID => Search Engine Processor
Search Engine Preprocessing: YADA, XDIA Processor, CPM
Experimental: Data acquisition using the mass spectrometer: DDA, XDIA