Towards the Management of Information Quality in Proteomics
David Stead, University of Aberdeen

What is Proteomics?
The large-scale study of the proteins of an organism, cell or tissue.
Figures:
- Colony morphologies of Candida albicans wild-type and nrg1 mutant
- Electron micrograph of a breast cancer cell (picture courtesy of the National Cancer Institute)
- MALDI protein imaging of a human glioblastoma slice (Stoeckli et al. Nature Medicine 7, 493)

“Classical” proteomics
- Separation (2-dimensional gel electrophoresis)
- Quantification (intensity of staining of protein spot)
- Identification (peptide mass fingerprinting)
- Biological function
[Figure: 2-D gel with axes pI (4 to 7) and Mr (kDa); plot of normalised spot volume]

Should we be concerned about information quality in proteomics?
- More, larger, datasets being generated
- Combine datasets from different labs
  - Answer new biological or technical questions
- Quality of information may affect decisions on how the data is used
Steven Carr et al. (2004) Molecular & Cellular Proteomics 3, 531:
"…a significant but undefined number of the proteins being reported as “identified” in proteomics articles are likely to be false positives."

Assessing the quality of protein identifications
Difficulties:
- Expert scrutiny of original MS data is not practical for large datasets
- No established minimum acceptance criteria for protein identifications by MS
Hypothesis:
- Any peptide mass fingerprinting search report contains information that enables a universal quality score to be calculated

Protein identification by peptide mass fingerprinting
Experimental route:
- Protein -> tryptic digestion -> MALDI-TOF -> experimental mass list
Theoretical route:
- Protein sequence database -> in silico digestion -> theoretical mass lists
The search engine compares the experimental mass list with the theoretical mass lists.
Example database entry:
>Candida albicans|CA0001|IPF19501 unknown function
MYQTDHGVHNVDGRMSRYIIIPDRSTIRPLLTSNLIAGSLL
PSLHCSVSLFLDRVRSSLSSVSVPARVSLPRCFWLSKCLSL
GARVRSLFPSLSLSRSYSSSSGPALLYSSVVHSPFLFLLLH
SSLFRLLSSPLSSCSLQHLLILNSQWTHRRWEGATQFSSVK
GISAVFRPSRASMCPRGFFXCSVCVPLSFRVSIGPFMLFRV
PIGFSCISGPLAICFPFNEFLSCLPFLLFRFLFHPLQFLSG
LPLLHYSPVINPRPFGFPHPAQPSSYV
[Figure: polypeptide chain from H2N to COOH with K, R and KP positions marked]
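The in silico digestion step can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: it assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), and the function name `tryptic_digest` and the `max_missed` parameter are hypothetical names chosen for this example.

```python
def tryptic_digest(sequence, max_missed=0):
    """Return (peptide, missed_cleavages) pairs for a protein sequence.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    """
    # Positions where the chain is cut (index of the residue after the cut).
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]

    peptides = []
    for start in range(len(bounds) - 1):
        # Join up to max_missed neighbouring fragments to model incomplete digestion.
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(bounds):
                break
            peptides.append((sequence[bounds[start]:bounds[end]], missed))
    return peptides


# First 24 residues of the example entry above.
for peptide, missed in tryptic_digest("MYQTDHGVHNVDGRMSRYIIIPDR", max_missed=1):
    print(missed, peptide)
```

A search engine would then convert each peptide to a theoretical mass; that step is omitted here to keep the sketch short.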

Protein identification quality indicators
Hit ratio (HR): the number of masses matched divided by the number of masses submitted to the search
- Provides a measure of the signal-to-noise ratio in the mass spectrum
Example: peptide mass fingerprint -> (spectrum processing) -> mass list m1 ... m10 -> (database searching) -> 6 highlighted peaks matched to protein
HR = 6/10 = 0.6
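As a rough sketch of the hit ratio calculation, the Python below counts how many submitted masses fall within a tolerance of any theoretical mass and divides by the number submitted. The ppm tolerance, its default value, and the function names are assumptions for illustration only; the slide gives just the ratio itself.

```python
def count_matches(experimental, theoretical, tol_ppm=50.0):
    """Count experimental masses within tol_ppm of any theoretical mass.

    The ppm tolerance is an assumed matching criterion, not taken from the slide.
    """
    matched = 0
    for obs in experimental:
        if any(abs(obs - calc) / calc * 1e6 <= tol_ppm for calc in theoretical):
            matched += 1
    return matched


def hit_ratio(n_matched, n_submitted):
    """HR = masses matched / masses submitted to the search."""
    if n_submitted <= 0:
        raise ValueError("at least one mass must be submitted")
    return n_matched / n_submitted


# The slide's example: 6 of 10 submitted masses matched.
print(hit_ratio(6, 10))  # 0.6
```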

Protein identification quality indicators
Mass coverage (MC): the percent sequence coverage multiplied by the protein mass in kDa
- Measures the amount of protein sequence matched
MC = (sequence coverage) x (protein mass) = 13.9 kDa
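A minimal sketch of the mass coverage calculation follows. It assumes that the percent coverage is applied as a fraction (divided by 100), which is inferred from the result being quoted in kDa. The operands of the slide's example are not preserved in this transcript, so the 25 % and 55.6 kDa used below are hypothetical values picked only because they give 13.9 kDa.

```python
def mass_coverage(coverage_percent, protein_mass_kda):
    """MC in kDa: fraction of the sequence covered times the protein mass.

    Assumption: percent coverage is converted to a fraction before multiplying.
    """
    if not 0.0 <= coverage_percent <= 100.0:
        raise ValueError("coverage must be between 0 and 100 percent")
    return coverage_percent / 100.0 * protein_mass_kda


# Hypothetical operands (not from the slide) that yield about 13.9 kDa.
print(round(mass_coverage(25.0, 55.6), 1))
```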

Protein identification quality indicators
Excess of limit-digested peptides (ELDP): the number of matched peptides having no missed cleavages minus the number of matched peptides containing a missed cleavage site
- Reflects the completeness of the digestion that precedes the peptide mass fingerprinting
ELDP = 5 - 3 = +2
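The ELDP definition translates directly into code. In this sketch the input is assumed to be a list holding the missed-cleavage count of each matched peptide; that representation and the function name are illustrative choices, not taken from the slides.

```python
def eldp(missed_cleavage_counts):
    """Excess of limit-digested peptides for one protein identification.

    Counts matched peptides with no missed cleavage, minus matched peptides
    that contain at least one missed cleavage site.
    """
    limit_digested = sum(1 for m in missed_cleavage_counts if m == 0)
    partially_digested = sum(1 for m in missed_cleavage_counts if m > 0)
    return limit_digested - partially_digested


# The slide's example: 5 limit-digested peptides and 3 with missed cleavages.
print(eldp([0, 0, 0, 0, 0, 1, 1, 2]))  # 2
```

A negative ELDP would mean that most matched peptides contain missed cleavages.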

Protein identification quality indicators
David A. Stead, Alun Preece, and Alistair J. P. Brown. Universal metrics for quality assessment of protein identifications by mass spectrometry. MCP, published March 27, 2006
Organisms: Streptomyces coelicolor, Clostridium difficile, Methanococcus jannaschii

ROC analysis shows that HR, MC, and ELDP can discriminate between correct and incorrect protein identifications
PMF score = (100 * HR) + MC + (10 * ELDP)
Data from 581 PMF experiments (protein identifications from 2-D gel spots)
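The combined score is a plain weighted sum, sketched below. The function name is hypothetical. The example feeds in the values from the three indicator slides (HR = 0.6, MC = 13.9 kDa, ELDP = +2) purely for illustration; the transcript does not say that they describe the same protein.

```python
def pmf_score(hr, mc, eldp):
    """PMF score = (100 * HR) + MC + (10 * ELDP), as given on the slide."""
    return 100.0 * hr + mc + 10.0 * eldp


# Illustrative inputs taken from the individual indicator examples.
score = pmf_score(hr=0.6, mc=13.9, eldp=2)
print(round(score, 1))
```

With these inputs the three terms are 60, 13.9 and 20, so the score is about 93.9.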

Qurator provides an information quality (IQ) framework
- Extend generic ontology of IQ concepts
  - Allow scientists to define quality characteristics specific to their domain (HR, MC, ELDP)
- Framework for managing IQ
  - Allow scientists to use their own IQ definitions
  - ... and reuse those created by others
- Annotate experimental data with quality characteristics
  - Produce “quality-aware” information resources
  - Allow user-scientists to access/select/filter data according to their quality preferences
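To make the idea of filtering by quality preference concrete, here is a small generic Python sketch. It is not Qurator's API: the `Identification` record, the `filter_by_quality` function, the threshold parameters and the protein labels are all hypothetical, and the data are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Identification:
    """A protein identification annotated with the three quality indicators."""
    protein: str
    hr: float    # hit ratio
    mc: float    # mass coverage in kDa
    eldp: int    # excess of limit-digested peptides


def filter_by_quality(items, min_hr=0.0, min_mc=0.0, min_eldp=None):
    """Keep identifications that meet the user's own quality thresholds."""
    kept = []
    for item in items:
        if item.hr < min_hr or item.mc < min_mc:
            continue
        if min_eldp is not None and item.eldp < min_eldp:
            continue
        kept.append(item)
    return kept


# Made-up annotations for three hypothetical gel spots.
annotated = [
    Identification("spot-A", hr=0.60, mc=13.9, eldp=2),
    Identification("spot-B", hr=0.15, mc=4.2, eldp=-3),
    Identification("spot-C", hr=0.45, mc=9.0, eldp=0),
]
for hit in filter_by_quality(annotated, min_hr=0.4, min_eldp=1):
    print(hit.protein)
```

Different users can pass different thresholds over the same annotated data, which is the sense in which the resource is quality-aware.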

Making the Qurator framework useful
A key aim of the Qurator project is to integrate IQ tools with existing standards
- IQ indicators should apply to common data formats
- Qurator functions should be plugged into tools already used by scientists
For proteomics we have aligned Qurator with
- the PEDRo standard data model (and its XML serialisation)
- the Pedro data entry tool (sourceforge.net/projects/pedro)

PEDRo: a standard format for proteomics data
Taylor CF et al. (2003) Nature Biotechnology 21(3), 247
Figures:
- PEDRo schema
- Section of XML output from PEDRo data collator tool

Qurator Pedro Plugin
- When a data model is selected, the Qurator Pedro plugin queries the IQ ontology to discover indicators relevant to the kind of data, e.g. HR, MC and ELDP for the PEDRo proteomics model
- Values of the calculated indicators for the selected data items are displayed along with basic provenance data (e.g. timestamp…)
- Web services that calculate the IQ indicators can be invoked using the “Plugins” button

Conclusions & future work
Numerical indicators (HR, MC, and ELDP) that describe the quality of protein identifications by peptide mass fingerprinting
- Useful for validation of protein identifications
- Can be computed from search reports (e.g. Mascot)
The proteomics case is a proof-of-concept for the Qurator IQ framework
- We are working to embed Qurator services in a wider range of desktop tools (e.g. Taverna workflow environment)
- Further usability/usefulness trials of the tools are planned

Acknowledgements
- Computing Science: Alun Preece, Binling Jin
- Medical Sciences: Al Brown
- Computer Science: Paolo Missier, Suzanne Embury