Protein Identification by Database Searching John Cottrell Matrix Science.

Three ways to use mass spectrometry data for protein identification
1. Peptide Mass Fingerprint
   A set of peptide molecular masses from an enzyme digest of a protein

PMF Servers on the Web
  ASCQ_ME
  Bupid
  Mascot
  MassSearch
  MS-Fit (Protein Prospector)
  PepMAPPER
  Profound (Prowl): cgi/profound.exe
  Mowse, PeptideSearch, Protocall, Aldente, XProteo

Search Parameters
  database
  taxonomy
  enzyme
  missed cleavages
  fixed modifications
  variable modifications
  protein MW
  estimated mass measurement error

Henzel, W. J., Watanabe, C., Stults, J. T., JASMS 2003, 14,

Peptide Mass Fingerprint
  Fast, simple analysis
  High sensitivity
  Need database of protein sequences
    not ESTs or genomic DNA
  Sequence must be present in database
    or close homolog
  Not good for mixtures
    especially a minor component
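The peptide mass fingerprint idea can be illustrated with a minimal sketch: digest each database entry in silico, compute peptide masses, and count how many observed masses each entry explains. The two sequences, the 0.2 Da tolerance and the match-count "score" are illustrative assumptions, not the scoring used by any of the servers listed; real engines use probabilistic scores.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence, missed_cleavages=1):
    """Cleave C-terminal to K or R (not before P), allowing missed cleavages."""
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))
    peptides = []
    for i in range(len(sites) - 1):
        last = min(i + 1 + missed_cleavages, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(sequence[sites[i]:sites[j]])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, sequence, tol_da=0.2, missed_cleavages=1):
    """Number of observed masses explained by some peptide of this protein."""
    theoretical = [peptide_mass(p)
                   for p in tryptic_peptides(sequence, missed_cleavages)]
    return sum(any(abs(m - t) <= tol_da for t in theoretical) for m in observed)


def rank_proteins(observed, database, **kwargs):
    """Sort database entries by match count, best first (ties by name)."""
    scores = {name: count_matches(observed, seq, **kwargs)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical two-entry database; the fingerprint is simulated from "protA".
DATABASE = {
    "protA": "MKWVTFISLLLLFSSAYSRGVFRR",
    "protB": "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIR",
}
OBSERVED = [peptide_mass(p) for p in tryptic_peptides(DATABASE["protA"], 0)]
RANKED = rank_proteins(OBSERVED, DATABASE)
```

Because the fingerprint only has meaning as a set of masses from one protein, a mixture dilutes the match count of every component, which is consistent with the limitation noted for mixtures.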

[Figure: peptide backbone H-NH-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-CO-OH with a charging proton H+. Cleavages along the backbone give N-terminal fragment ions a1, b1, c1 to a3, b3, c3 and C-terminal fragment ions x3, y3, z3 to x1, y1, z1.]
Roepstorff, P. and Fohlman, J. (1984). Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11, 601.

Three ways to use mass spectrometry data for protein identification
1. Peptide Mass Fingerprint
   A set of peptide molecular masses from an enzyme digest of a protein
2. Sequence Query
   Mass values combined with amino acid sequence or composition data

Mann, M. and Wilm, M., Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem (1994).

tag( ,GWSV, )

Sequence Tag Servers on the Web
  Mascot
  MS-Seq (Protein Prospector)
  MultiIdent (TagIdent, etc.)
  PeptideSearch, Spider

Sequence Tag
  Rapid search times
    Essentially a filter
  Error tolerant
    Match peptide with unknown modification or SNP
  Requires interpretation of spectrum
    Usually manual, hence not high throughput
  Tag has to be called correctly
    Although ambiguity is OK: tag(977.4,[Q|K][Q|K][Q|K]EE,1619.7)
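How a tag acts as a filter can be sketched as follows. This assumes the two numbers in tag(...) are the m/z values of the fragment peaks flanking the tag and that the tag was read from a singly charged b-ion series; both are assumptions for illustration, as are the function names, the 0.5 Da tolerance and the example peptide. Only strict matching is shown.

```python
import re

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def parse_tag(tag):
    """Split e.g. '[Q|K]EE' into a list of allowed-residue sets."""
    tokens = re.findall(r"\[[A-Z|]+\]|[A-Z]", tag)
    return [set(tok.strip("[]").split("|")) for tok in tokens]


def match_tag(peptide, low_mz, tag, high_mz, tol=0.5):
    """Return start positions where the tag fits the peptide's b-ion ladder."""
    positions = parse_tag(tag)
    # prefix[k] is the m/z of the b ion containing the first k residues.
    prefix = [PROTON]
    for aa in peptide:
        prefix.append(prefix[-1] + RESIDUE_MASS[aa])
    hits = []
    for start in range(len(peptide) - len(positions) + 1):
        end = start + len(positions)
        residues_ok = all(peptide[start + i] in allowed
                          for i, allowed in enumerate(positions))
        flanks_ok = (abs(prefix[start] - low_mz) <= tol
                     and abs(prefix[end] - high_mz) <= tol)
        if residues_ok and flanks_ok:
            hits.append(start)
    return hits


# Hypothetical peptide; the ambiguous tag tolerates the isobaric Q/K calls.
HITS = match_tag("GASPVKQKEEVR", 412.2, "[Q|K][Q|K][Q|K]EE", 1054.6)
```

The residue check is a cheap string comparison, so most database peptides are rejected before any mass arithmetic matters, which is why the approach behaves essentially as a filter.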

Three ways to use mass spectrometry data for protein identification
1. Peptide Mass Fingerprint
   A set of peptide molecular masses from an enzyme digest of a protein
2. Sequence Query
   Mass values combined with amino acid sequence or composition data
3. MS/MS Ions Search
   Uninterpreted MS/MS data from a single peptide or from a complete LC-MS/MS run

Eng, J. K., McCormack, A. L. and Yates, J. R., 3rd., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom (1994)
SEQUEST

MS/MS Ions Search Servers on the Web
  Inspect: http://proteomics.ucsd.edu/LiveSearch/
  Mascot: http://
  MS-Tag (Protein Prospector)
  Omssa: http://pubchem.ncbi.nlm.nih.gov/omssa/index.htm
  PepFrag (Prowl)
  PepProbe: http://bart.scripps.edu/public/search/pep_probe/search.jsp
  RAId_DbS: http:// html
  Sonar (Knexus)
  X!Tandem (The GPM)
Not on-line
  Byonic, Crux, greylag, MassMatrix, Myrimatch, Paragon, Peaks, PepSplice, pFind, Phenyx, ProbID, ProLuCID, ProteinLynx GS, Sequest, SIMS, SpectrumMill

MS/MS Ions Search
  Easily automated for high throughput
  Can get matches from marginal data
  Can be slow
    No enzyme
    Many variable modifications
    Large database
    Large dataset
  MS/MS is peptide identification
    Proteins by inference

Search Parameters: Sequence Database
  Swiss-Prot (~500,000 entries)
    High quality, non-redundant
  NCBInr, UniRef100 (~19,000,000 entries)
    Comprehensive, non-identical
  EST databases (>400,000,000 entries)
    Very large and very redundant
  Sequences from a single genome
    A consensus sequence
    Peptides are lost at exon-intron boundaries
  (Entry counts are from mid-2012)

Search Parameters: Taxonomy
Swiss-Prot 2010_08
  Mammalia (mammals) = 65104
    Primates = 26940
      Homo sapiens (human) = 20292
      Other primates = 6648
    Rodentia (Rodents) = 25473
      Mus. = 16358
        Mus musculus (house mouse) = 16307
      Rattus = 7533
      Other rodentia = 1582
    Other mammalia = 12691

Search Parameters: Mass Tolerances
  Most search engines support separate mass tolerances for precursors and fragments
  May allow fixed units (Da, mmu) or proportional (ppm, %)
  Some search engines can correct for selection of 13C peak
  Unless search engine performs some type of re-calibration, need to provide conservative estimate of mass accuracy, not precision
  This doesn't have to be a guessing game. Run a standard, then look at the error graphs for strong matches
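The difference between fixed and proportional units, and the 13C correction, can be made concrete with a short sketch. Function names are hypothetical and masses are treated as neutral monoisotopic values; the 13C to 12C spacing of 1.003355 Da is a standard constant.

```python
C13_SHIFT = 1.003355  # mass difference between 13C and 12C, in Da


def tolerance_in_da(mass, tol, unit):
    """Convert a tolerance given in Da, mmu, ppm or % into Da at this mass."""
    if unit == "Da":
        return tol
    if unit == "mmu":
        return tol / 1000.0
    if unit == "ppm":
        return mass * tol / 1e6
    if unit == "%":
        return mass * tol / 100.0
    raise ValueError("unknown tolerance unit: " + unit)


def precursor_matches(observed, theoretical, tol, unit, max_13c=0):
    """True if observed fits theoretical, optionally allowing 13C peak picks.

    max_13c is the number of isotope peaks above the monoisotopic peak that
    the instrument is assumed to have possibly selected.
    """
    window = tolerance_in_da(theoretical, tol, unit)
    return any(abs(observed - k * C13_SHIFT - theoretical) <= window
               for k in range(max_13c + 1))
```

For example, 10 ppm corresponds to 0.01 Da at mass 1000 but 0.02 Da at mass 2000, whereas a fixed 0.01 Da window is the same at every mass.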

Search Parameters
  Enzyme can be
    Fully specific
    Non-specific ("no enzyme")
  Some search engines support
    Limited number of missed cleavage points
    Semi-specific enzymes
    Enzyme mixtures
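The effect of enzyme specificity on the size of the search space can be seen by counting candidate peptides. This is a minimal sketch assuming trypsin-like cleavage (after K or R, not before P); the function names and the toy sequence are illustrative, and no peptide length or mass limits are applied.

```python
def cleavage_boundaries(sequence):
    """Positions where the enzyme may cut, plus the two protein termini."""
    cuts = {0, len(sequence)}
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            cuts.add(i + 1)
    return cuts


def count_candidates(sequence, specificity, max_missed=None):
    """Count sub-sequences allowed under 'full', 'semi' or 'none' specificity."""
    if specificity not in ("full", "semi", "none"):
        raise ValueError("specificity must be 'full', 'semi' or 'none'")
    cuts = cleavage_boundaries(sequence)
    count = 0
    for start in range(len(sequence)):
        for end in range(start + 1, len(sequence) + 1):
            if specificity == "none":
                count += 1  # every sub-sequence is a candidate
                continue
            # Internal cleavage sites left uncut inside this peptide.
            missed = sum(1 for c in cuts if start < c < end)
            if max_missed is not None and missed > max_missed:
                continue
            specific_ends = (start in cuts) + (end in cuts)
            needed = 2 if specificity == "full" else 1
            if specific_ends >= needed:
                count += 1
    return count


TOY = "AKGGRCC"
COUNTS = {s: count_candidates(TOY, s) for s in ("full", "semi", "none")}
```

Even for this seven-residue toy sequence the candidate count grows from 6 to 22 to 28 as specificity is relaxed; for real proteins the gap is far larger because the non-specific count grows quadratically with sequence length.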

Search Parameters: Common peak list formats
  DTA (Sequest)
  PKL (Masslynx)
  MGF (Mascot)
  mzData (.XML)
  mzML (.mzML)
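Of these formats, MGF is simple enough to read by hand. The sketch below is a simplified reader that handles only BEGIN IONS / END IONS blocks, KEY=value parameter lines and "m/z intensity" peak lines; it ignores global parameters and is not a complete implementation of the format. The example spectrum is made up.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': list} spectra."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


def precursor_mz(spectrum):
    """First field of PEPMASS is the precursor m/z; intensity may follow."""
    return float(spectrum["params"]["PEPMASS"].split()[0])


EXAMPLE = """\
BEGIN IONS
TITLE=example spectrum
PEPMASS=498.27 12345.0
CHARGE=2+
175.119 1200.5
262.151 830.0
END IONS
"""
SPECTRA = parse_mgf(EXAMPLE)
```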

Search Parameters: Modifications
  Fixed / static / quantitative modifications cost nothing
  Variable / differential / non-quantitative modifications are very expensive
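Why variable modifications are expensive can be shown by enumerating the candidate forms of one peptide. In this sketch a fixed modification shifts every form by the same amount, while each site of a variable modification doubles the number of forms. The choice of carbamidomethyl (C) as fixed and oxidation (M) as variable, and the function name, are assumptions for illustration.

```python
from itertools import product

FIXED = {"C": 57.02146}     # carbamidomethyl, applied to every C
VARIABLE = {"M": 15.99491}  # oxidation, present or absent at each M


def modified_forms(peptide, fixed=FIXED, variable=VARIABLE):
    """Return the total mass shift (Da) of every candidate modified form."""
    # A fixed modification is a constant offset: it adds no new candidates.
    fixed_shift = sum(fixed.get(aa, 0.0) for aa in peptide)
    # Each variable site is independently on or off: 2**n combinations.
    site_deltas = [variable[aa] for aa in peptide if aa in variable]
    forms = []
    for pattern in product((0, 1), repeat=len(site_deltas)):
        shift = sum(d for on, d in zip(pattern, site_deltas) if on)
        forms.append(fixed_shift + shift)
    return forms
```

A peptide with two methionines yields 4 forms and one with five yields 32, and every additional variable modification type multiplies the count again.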

Search Parameters: Modifications
Common artefacts

Modification                                Delta  Site                           Cause
Carbamylation                               +43    N-term, K                      Urea in digest buffer
Deamidation                                 +1     N                              Low pH
Pyro-glutamic acid                          -17    Q at N-term                    Low pH
Pyro-carbamidomethyl or carboxymethyl Cys   +40    C at N-term                    Low pH, delta is relative to unmodified C
Oxidation                                   +16    M (many other residues also)   Gels
Over alkylation                             +57    N-term, W                      Iodoacetamide
Over alkylation                             +58    N-term, W                      Iodoacetic acid

Site Analysis

Ascore
  Beausoleil S.A., et al. (2006) Nat. Biotechnol. 24, 1285–1292
MaxQuant
  Cox J. & Mann M. (2008) Nat. Biotechnol. 26,
  Olsen J.V., et al. (2006) Cell 127, 635–48
Inspect, MS-Alignment, PTMFinder
  Tanner S., et al. (2008) J. Proteome Res. 7, 170–181
  Payne S., et al. (2008) J. Proteome Res. 7, 3373–3381
  Tsur D., et al. (2005) Nat. Biotechnol. 23, 1562–1567
  Tanner S., et al. (2005) Anal. Chem. 77,
PhosphoScore
  Ruttenberg B.E., et al. (2008) J. Proteome Res. 7,
Debunker
  Lu B., et al. (2007) Anal. Chem. 79,
SloMo - ETD/ECD
  Bailey C.M., et al. (2009) J. Proteome Res. 8,
ModifiComb
  Savitski M.M., et al. (2006) Mol. Cell. Proteomics 5, 935–48
Delta Score
  Savitski M. M., et al. (2010) Mol. Cell. Proteomics mcp.M

Multi-pass Searches
  Implemented under a variety of names
    X!Tandem: Model refinement
    Mascot: Error tolerant search
    Spectrum Mill: Search saved hits, homology mode, unassigned single mass gap
    Phenyx: 2-rounds
    Paragon: Thorough ID, fraglet-taglet

Scoring
[Figure: curves for total matches, incorrect matches and correct matches; axis label: Score]

Scoring
[Figure: Receiver Operating Characteristic]

Sensitivity & Specificity
  Search a "decoy" database
    Decoy entries can be reversed or shuffled or randomised versions of target entries
    Decoy entries can be separate database or concatenated to target entries
  Gives a clear estimate of false discovery rate
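A decoy search can be sketched in a few lines: build reversed decoy entries, then estimate the false discovery rate at a score threshold from the number of decoy matches that pass. The estimator used here (decoy matches divided by target matches) is one common convention and is an assumption; other formulas are in use. Function names and the "DECOY_" prefix are hypothetical.

```python
def reversed_decoys(targets, prefix="DECOY_"):
    """Build one reversed-sequence decoy entry per target entry."""
    return {prefix + name: seq[::-1] for name, seq in targets.items()}


def fdr_at_threshold(matches, threshold):
    """Estimate FDR from a list of (score, is_decoy) matches.

    Decoy matches above the threshold stand in for the unknown number of
    incorrect target matches above the same threshold.
    """
    targets = sum(1 for s, d in matches if s >= threshold and not d)
    decoys = sum(1 for s, d in matches if s >= threshold and d)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(matches, max_fdr):
    """Lowest observed score whose estimated FDR does not exceed max_fdr."""
    best = None
    for score in sorted({s for s, _ in matches}, reverse=True):
        if fdr_at_threshold(matches, score) <= max_fdr:
            best = score
    return best


# Made-up match list: (score, is_decoy).
MATCHES = [(90, False), (85, False), (80, False), (70, True),
           (65, False), (60, False), (50, True), (40, False)]
```

With this made-up list, a threshold of 60 passes five target matches and one decoy match, giving an estimated FDR of 0.2.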

Sensitivity & Specificity
[Figure: curves for total matches, incorrect matches and correct matches; axis label: Score]

Protein Inference

Protein A: Peptide 1, Peptide 2, Peptide 3
Protein B: Peptide 1, Peptide 3
Protein C: Peptide 2

General approach is to create a minimal list of proteins.
"Principle of parsimony" or "Occam's razor"
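One simple way to build such a minimal list is a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. This is a sketch of the general idea, not the algorithm of any particular search engine, and greedy selection is not guaranteed to find the true minimum in every case. The peptide-to-protein assignment below is the illustrative three-protein example, and the function name is hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy minimal protein list that explains every observed peptide."""
    unexplained = set().union(*protein_to_peptides.values())
    chosen = []
    while unexplained:
        # Sorting the names first makes tie-breaking deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda name: len(protein_to_peptides[name] & unexplained))
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen


EXAMPLE = {
    "Protein A": {"Peptide 1", "Peptide 2", "Peptide 3"},
    "Protein B": {"Peptide 1", "Peptide 3"},
    "Protein C": {"Peptide 2"},
}
MINIMAL = parsimonious_proteins(EXAMPLE)
```

Here Protein A alone accounts for all three peptides, so Proteins B and C are not needed to explain the data, although the evidence cannot rule out that they are present.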

Further Reading:
Exercises: ms.com/exercises/exercises. html