ISA 2013. 05. 28 Kim Hye mi. Introduction Input Spectrum data (Protein database) Peptide assignment Peptide validation manual validation PeptideProphet.

Presentation transcript:

ISA Kim Hye mi

Introduction: Input spectrum data (protein database); Peptide assignment; Peptide validation (manual validation, PeptideProphet, Target/Decoy); Protein assignment & validation; Output; Interpretation; Quantitation

Introduction: Proteome researchers must devise ways to distinguish correct from incorrect peptide identifications. The target-decoy search strategy is an effective way to do this and is simple to implement. The 'target' protein sequence database corresponds to the protein mixture to be analyzed. The 'decoy' database is built by reversing the target protein sequences, which minimizes the number of peptide sequences in common between target and decoy.
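The reversal step can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's tooling: the `DECOY_` prefix, the function names, and the two toy sequences are assumptions made for the example.

```python
def make_decoy(target, prefix="DECOY_"):
    """Return a decoy database built by reversing every target sequence."""
    return {prefix + name: seq[::-1] for name, seq in target.items()}


def make_composite(target, prefix="DECOY_"):
    """Concatenate target and decoy entries into one searchable database."""
    composite = dict(target)
    composite.update(make_decoy(target, prefix))
    return composite


# Toy sequences, invented for illustration only.
target_db = {"P1": "MKWVTFISLLK", "P2": "ACDEFGHIK"}
composite_db = make_composite(target_db)
```

The prefix keeps decoy entries recognizable after the search, so decoy hits can be counted later.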

Introduction: Target-decoy search strategy. With false positive (FP) estimations it is possible to derive other measurements that help evaluate and compare scoring methods and data sets. Peptide-spectral matches (PSMs) are either correct or incorrect; searching the composite target-decoy database makes it possible to evaluate FP rates in large PSM populations.

Introduction: Measurements derived from decoy database search results
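The slide's own table is not part of this transcript, so the formulas below are only one common reading of the target-decoy approach. The sketch assumes a concatenated search in which an incorrect match is as likely to land on a target as on a decoy sequence, so the number of false positives is estimated as twice the number of decoy hits. The function name and dictionary keys are hypothetical.

```python
def decoy_measurements(total_hits, decoy_hits):
    """Derive FP-based measurements from a concatenated target-decoy search.

    Assumption: incorrect matches are split evenly between target and decoy
    sequences, so estimated FPs = 2 * decoy hits.
    """
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    est_fp = min(2 * decoy_hits, total_hits)
    est_tp = total_hits - est_fp
    return {
        "estimated_fp": est_fp,
        "estimated_tp": est_tp,
        "fp_rate": est_fp / total_hits,
        "precision": est_tp / total_hits,
    }


# Hypothetical counts: 1000 PSMs pass the score threshold, 25 of them decoys.
m = decoy_measurements(total_hits=1000, decoy_hits=25)
```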

Introduction: Target-decoy search strategy. The two assumptions mentioned are reasonable: target and decoy databases do not overlap, and target and decoy false positives are equally likely. Concatenated database searches are preferable to separate searches. Estimating theoretical error of target-decoy false positive rates. Alternate decoy database constructions can be similarly effective.

Assumption 1: target and decoy databases do not overlap. If decoy hits are to count as incorrect, their sequences must not be present in the target database. Only very short peptides were found in both the target and decoy databases; practically no peptides (0.02%) longer than eight amino acids were shared between the two.
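One way to check this kind of overlap is to digest both databases in silico and intersect the peptide sets. The sketch below is an assumption-laden simplification: it cleaves after every K or R, ignores missed cleavages and the proline rule, and uses an invented toy protein. It is not the procedure used for the figures in the presentation.

```python
def tryptic_peptides(seq):
    """Split a protein after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(seq):
        if residue in "KR":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def shared_fraction(proteins, min_len=1):
    """Fraction of distinct target peptides (>= min_len) also found in the reversed decoy."""
    target = {p for s in proteins for p in tryptic_peptides(s) if len(p) >= min_len}
    decoy = {p for s in proteins for p in tryptic_peptides(s[::-1]) if len(p) >= min_len}
    if not target:
        return 0.0
    return len(target & decoy) / len(target)


# Toy protein, invented for illustration only.
proteins = ["AKAKMAVLDEFGHIR"]
overlap_all = shared_fraction(proteins, min_len=1)
overlap_long = shared_fraction(proteins, min_len=9)
```

In this toy case only the two-residue peptide `AK` is shared, and the overlap disappears once short peptides are excluded, which mirrors the pattern the slide reports.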

Assumption 1: target and decoy databases do not overlap (International Protein Index sequence database)

Assumption 2: target and decoy false positives are equally likely. The validity of this assumption can be tested in two ways. First, the search algorithm must be presented with equal numbers of target and decoy peptides. Second, the number of necessarily incorrect peptide hits should be equally distributed between target and decoy hits.

Assumption 2: target and decoy false positives are equally likely. The distributions of considered peptides were practically the same for target and decoy peptides, regardless of mass tolerance. (Figure panels: target database, decoy database.)

Assumption 2: target and decoy false positives are equally likely. Comparing these curves indicated substantial correspondence between target- and decoy-derived peptides.

Assumption 2: target and decoy false positives are equally likely. Top-ranked peptides showed a strong bias toward target database hits, unlike lower-ranked matches. We extended this idea by modifying MS/MS spectra to prevent any correct identifications from being made.

Concatenated database searches are preferable to separate searches. When decoy sequences are searched separately, target and decoy sequences cannot compete for the top-ranked score, and hits from decoy searches may often receive elevated scores relative to other top-ranked hits. In a concatenated search, MS/MS spectra are searched once against a single database consisting of target and decoy sequences.
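The difference between the two search modes can be illustrated for a single spectrum. The candidate peptides and scores below are invented, and `best` is a hypothetical helper; the sketch only shows how competition changes which hit is reported.

```python
def best(candidates):
    """Return the highest-scoring candidate match for one spectrum."""
    return max(candidates, key=lambda c: c["score"])


# Hypothetical candidate matches for one MS/MS spectrum.
candidates = [
    {"peptide": "LVNELTEFAK", "decoy": False, "score": 1.9},
    {"peptide": "KAFETLENVL", "decoy": True, "score": 2.4},
    {"peptide": "AEFVEVTK", "decoy": False, "score": 1.1},
]

# Separate searches: each database reports its own top-ranked hit.
separate_target = best([c for c in candidates if not c["decoy"]])
separate_decoy = best([c for c in candidates if c["decoy"]])

# Concatenated search: target and decoy candidates compete for one top rank.
concatenated = best(candidates)
```

With separate searches this spectrum contributes one top-ranked target hit and one top-ranked decoy hit. In the concatenated search only the higher-scoring of the two survives.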

Concatenated database searches are preferable to separate searches. The separate searching method forces one to assume that all peptide assignments are incorrect below the score at which decoy hits outnumber target hits, leading to overestimated FP rates.

Concatenated database searches are preferable to separate searches. Separate searching overestimates the FP rate: a separate search cannot estimate correct identifications in the score range where decoy hits outnumber target hits (0.8 – 2.3). In a concatenated search, target and decoy sequences compete, making it possible to estimate the distribution of low-scoring correct identifications (0.8 – 2.3).

Concatenated database searches are preferable to separate searches. Direct comparison of FP rates: separate database searches can overestimate FP rates by > 35% relative to concatenated searches.

Estimating theoretical error of target-decoy false positive rates. One criticism of the target-decoy approach is that one can never know exactly which or how many selected PSMs are incorrect. These estimations are expected to deviate substantially from the actual number of FPs when the number of returned hits is very small or the number of returned decoy hits is very large.

Estimating theoretical error of target-decoy false positive rates. Based on these findings, it was possible to place confidence intervals on target-decoy estimations given the number of total hits returned and the estimated precision rate derived from the decoy hits. A program was written to simulate the FP rate: it randomly assigned each of the remaining incorrect hits a 'target' or 'decoy' state. A larger standard deviation indicates less reliable precision rate estimations.
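A simulation of this kind might look like the sketch below. It is a reconstruction under stated assumptions, not the original program: each incorrect hit is given a 'target' or 'decoy' state with probability 0.5, false positives are estimated as twice the decoy count, and the function name, trial count, and seed are hypothetical.

```python
import random
import statistics


def simulate_precision_sd(total_hits, true_precision, trials=500, seed=0):
    """Simulate the spread of decoy-based precision estimates.

    Returns the mean and standard deviation of the estimated precision
    across the simulated trials.
    """
    rng = random.Random(seed)
    n_incorrect = round(total_hits * (1 - true_precision))
    estimates = []
    for _trial in range(trials):
        # Each incorrect hit lands on a decoy sequence with probability 0.5.
        decoys = sum(1 for _hit in range(n_incorrect) if rng.random() < 0.5)
        est_fp = 2 * decoys
        estimates.append((total_hits - est_fp) / total_hits)
    return statistics.mean(estimates), statistics.pstdev(estimates)


mean_small, sd_small = simulate_precision_sd(100, 0.9)
mean_large, sd_large = simulate_precision_sd(10_000, 0.9)
```

Comparing `sd_small` with `sd_large` shows the effect the slide describes: at the same underlying precision, the estimate from 100 hits scatters far more than the estimate from 10,000 hits.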

Estimating theoretical error of target-decoy false positive rates. Log-transformed graph of the preceding figure.

Estimating theoretical error of target-decoy false positive rates. The relationship between the slopes and precision: the slopes of these lines are related to the underlying precision rate.

Estimating theoretical error of target-decoy false positive rates. The relationship between the slopes and precision suggests that the expected standard deviation of a precision rate estimation can be calculated from the precision rate and sample size.

Estimating theoretical error of target-decoy false positive rates. The relationship between the standard deviation (σ) of error and the sample size (N).

Estimating theoretical error of target-decoy false positive rates. The expected standard deviation of error: the expected standard deviation of the error can therefore be expressed using the following formula.

Alternate decoy database constructions can be similarly effective: protein sequence reversal; a modified sequence reversal method; two stochastic methods (random and Markov chain model).

Alternate decoy database constructions can be similarly effective. Both stochastic databases produced more peptides. They were constrained to have amino acid compositions similar to the target database.
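The two stochastic constructions can be sketched as follows. The presentation does not say how its models were built, so everything here is an assumption: the random decoy draws residues from the pooled target composition, the Markov decoy uses a first-order chain (each residue depends only on the previous one), and all names and sequences are hypothetical.

```python
import random
from collections import defaultdict


def random_decoy(target_seqs, rng):
    """Draw each residue independently from the pooled target composition."""
    pool = "".join(target_seqs)
    return ["".join(rng.choice(pool) for _ in seq) for seq in target_seqs]


def markov_decoy(target_seqs, rng):
    """Generate decoys from a first-order Markov chain fitted to the targets."""
    pool = "".join(target_seqs)
    transitions = defaultdict(list)
    for seq in target_seqs:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev].append(nxt)
    decoys = []
    for seq in target_seqs:
        residues = [rng.choice(pool)] if seq else []
        while len(residues) < len(seq):
            # Fall back to the overall composition if a residue has no successor.
            options = transitions.get(residues[-1]) or pool
            residues.append(rng.choice(options))
        decoys.append("".join(residues))
    return decoys


# Toy sequences, invented for illustration only.
targets = ["MKWVTFISLLK", "ACDEFGHIK"]
rng = random.Random(1)
random_db = random_decoy(targets, rng)
markov_db = markov_decoy(targets, rng)
```

Both generators keep the protein lengths and only use residues that occur in the target database, which is one simple way to keep the amino acid composition similar.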

Alternate decoy database constructions can be similarly effective. Incorrect identifications were equally distributed. Both stochastic methods performed essentially identically to one another. The distribution of target and decoy sequences being incorrectly matched was not the desired 50% but decidedly skewed (63%). For estimating FP identifications, use a factor of 1.6 (≈1/0.63).
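The correction factor can be worked through numerically. The sketch assumes that the 63% is the share of incorrect matches that land on decoy sequences, which is how the 1/0.63 factor reads; the function name and the decoy count are hypothetical.

```python
def estimate_fp_from_decoys(decoy_hits, decoy_share=0.5):
    """Estimate total false positives from the number of decoy hits.

    decoy_share is the assumed fraction of incorrect matches that fall on
    decoy sequences: 0.5 gives the usual doubling, 0.63 gives about 1.6x.
    """
    if not 0 < decoy_share <= 1:
        raise ValueError("decoy_share must be in (0, 1]")
    return decoy_hits / decoy_share


# Hypothetical count of 63 decoy hits above the score threshold.
fp_even_split = estimate_fp_from_decoys(63, decoy_share=0.5)
fp_skewed = estimate_fp_from_decoys(63, decoy_share=0.63)
factor = 1 / 0.63
```

With an even split, 63 decoy hits imply 126 false positives. With the skewed 63% share, the same 63 decoy hits imply about 100, because the multiplier drops from 2 to roughly 1.6.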