Towards the Management of Information Quality in Proteomics David Stead University of Aberdeen.


1 Towards the Management of Information Quality in Proteomics David Stead University of Aberdeen

2 What is Proteomics? The large-scale study of proteins of an organism, cell or tissue Colony morphologies of Candida albicans wild-type and nrg1 mutant Electron micrograph of a breast cancer cell (picture courtesy of the National Cancer Institute) MALDI protein imaging of a human glioblastoma slice (Stoeckli et al. Nature Medicine 7, 493)

3 “Classical” proteomics Identification (peptide mass fingerprinting) Quantification (intensity of staining of protein spot) Separation (2-dimensional gel electrophoresis) Biological function [Figure: 2-D gel image with axes pI (7 to 4) and Mr (10 to 100 kDa), and a chart of normalised spot volume]

4 Should we be concerned about information quality in proteomics? More, larger, datasets being generated Combine datasets from different labs –Answer new biological or technical questions Quality of information may affect decisions on how the data is used Quote from Steven Carr et al. (2004) Molecular & Cellular Proteomics 3, 531: “…a significant but undefined number of the proteins being reported as ‘identified’ in proteomics articles are likely to be false positives.”

5 Assessing the quality of protein identifications Difficulties: Expert scrutiny of original MS data is not practical for large datasets No established minimum acceptance criteria for protein identifications by MS Hypothesis: Any peptide mass fingerprinting search report contains information that enables a universal quality score to be calculated

6 Protein identification by peptide mass fingerprinting [Diagram: a protein (H2N…COOH) with K and R cleavage sites undergoes tryptic digestion, and the peptides are analysed by MALDI-TOF to give an experimental mass list. In parallel, a protein sequence database undergoes in silico digestion to give theoretical mass lists. A search engine compares the experimental mass list with the theoretical mass lists.]
Example database entry:
>Candida albicans|CA0001|IPF19501 unknown function
MYQTDHGVHNVDGRMSRYIIIPDRSTIRPLLTSNLIAGSLL
PSLHCSVSLFLDRVRSSLSSVSVPARVSLPRCFWLSKCLSL
GARVRSLFPSLSLSRSYSSSSGPALLYSSVVHSPFLFLLLH
SSLFRLLSSPLSSCSLQHLLILNSQWTHRRWEGATQFSSVK
GISAVFRPSRASMCPRGFFXCSVCVPLSFRVSIGPFMLFRV
PIGFSCISGPLAICFPFNEFLSCLPFLLFRFLFHPLQFLSG
LPLLHYSPVINPRPFGFPHPAQPSSYV
Example mass list: 783.3858 889.5141 1089.5898 1089.6163 1106.6204 1166.6390 1239.6004 1628.7234 2733.4504 3223.7871 3398.7783
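The in silico digestion step on this slide can be illustrated with a minimal sketch. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P); the slide does not state the exact rule used by the search engine, and the function names and the `max_missed` parameter are hypothetical.

```python
def cleavage_fragments(sequence):
    """Split a protein sequence at assumed tryptic sites (after K/R, not before P)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal fragment that does not end in K or R.
        fragments.append(sequence[start:])
    return fragments


def tryptic_peptides(sequence, max_missed=1):
    """Return (peptide, missed_cleavages) pairs, allowing up to max_missed missed sites."""
    fragments = cleavage_fragments(sequence)
    peptides = []
    for i in range(len(fragments)):
        for missed in range(max_missed + 1):
            if i + missed < len(fragments):
                peptides.append(("".join(fragments[i:i + missed + 1]), missed))
    return peptides
```

A real search engine would go on to compute a theoretical mass for each peptide and compare it with the experimental mass list; that step is omitted here.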

7 Protein identification quality indicators Hit ratio (HR) – the number of masses matched divided by the number of masses submitted to the search –Provides a measure of the signal-to-noise ratio in the mass spectrum [Figure: spectrum processing converts the peptide mass fingerprint into a mass list (m1 to m10); database searching matches the highlighted peaks to the protein] HR = 6/10 = 0.6
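As a rough illustration of the hit ratio, the sketch below counts submitted masses that fall within a tolerance of any theoretical mass. The function name and the 0.1 Da default tolerance are assumptions for illustration; in practice the matched count would normally be read from the search report.

```python
def hit_ratio(submitted_masses, theoretical_masses, tolerance_da=0.1):
    """HR = masses matched / masses submitted (tolerance_da is an assumed parameter)."""
    if not submitted_masses:
        raise ValueError("no masses were submitted to the search")
    matched = sum(
        1
        for mass in submitted_masses
        if any(abs(mass - theo) <= tolerance_da for theo in theoretical_masses)
    )
    return matched / len(submitted_masses)
```

With ten submitted masses of which six are matched, this reproduces the slide's HR = 6/10 = 0.6.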

8 Protein identification quality indicators Mass coverage (MC) – the percent sequence coverage (taken as a fraction) multiplied by the protein mass in kDa – Measures the amount of protein sequence matched Example (protein mass 55752 Da, 25% sequence coverage): MC = (55752 / 1000) x (25 / 100) = 13.9 kDa
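The mass coverage arithmetic can be written as a one-line function. This is a minimal sketch that reads the slide's numbers as a protein mass in Da and a sequence coverage in percent; the function and parameter names are hypothetical.

```python
def mass_coverage(protein_mass_da, percent_coverage):
    """MC in kDa: protein mass (converted to kDa) times fractional sequence coverage."""
    return (protein_mass_da / 1000.0) * (percent_coverage / 100.0)
```

For the slide's example, `mass_coverage(55752, 25)` gives about 13.9 kDa.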

9 Protein identification quality indicators Excess of limit-digested peptides (ELDP) – the number of matched peptides having no missed cleavages minus the number of matched peptides containing a missed cleavage site –reflects the completeness of the digestion that precedes the peptide mass fingerprinting ELDP= 5 – 3 = +2
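A minimal sketch of the ELDP calculation, assuming each matched peptide is described by its number of missed cleavage sites (as a search report typically lists); the function name and input format are hypothetical.

```python
def eldp(missed_cleavage_counts):
    """ELDP = matched peptides with no missed cleavages minus those with at least one."""
    limit_digested = sum(1 for n in missed_cleavage_counts if n == 0)
    partially_digested = sum(1 for n in missed_cleavage_counts if n > 0)
    return limit_digested - partially_digested
```

Five limit-digested peptides and three peptides containing a missed cleavage give the slide's value of +2. A negative ELDP would indicate that most matched peptides contain missed cleavages.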

10 Protein identification quality indicators David A. Stead, Alun Preece, and Alistair J. P. Brown. Universal metrics for quality assessment of protein identifications by mass spectrometry. MCP, published March 27, 2006. www.mcponline.org/cgi/reprint/M500426-MCP200v1 [Figure panels: Streptomyces coelicolor, Clostridium difficile, Methanococcus jannaschii]

11 ROC analysis shows that HR, MC, and ELDP can discriminate between correct and incorrect protein identifications PMF score = (100 * HR) + MC + (10 * ELDP) Data from 581 PMF experiments (protein identifications from 2-D gel spots)
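The combined score on this slide is a weighted sum of the three indicators and can be sketched directly from the formula. The function name is hypothetical. The example values in the usage note are borrowed from the separate illustrative slides for HR, MC, and ELDP and do not describe a single real identification.

```python
def pmf_score(hr, mc_kda, eldp):
    """PMF score = (100 * HR) + MC + (10 * ELDP), as given on the slide."""
    return (100.0 * hr) + mc_kda + (10.0 * eldp)
```

For instance, `pmf_score(0.6, 13.9, 2)` evaluates to roughly 93.9. The slide does not give an acceptance threshold, so none is assumed here.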

12 Qurator provides an information quality (IQ) framework Extend generic ontology of IQ concepts –Allow scientists to define quality characteristics specific to their domain (e.g. HR, MC, ELDP) Framework for managing IQ –Allow scientists to use their own IQ definitions –... and reuse those created by others Annotate experimental data with quality characteristics –Produce “quality-aware” information resources –Allow user-scientists to access/select/filter data according to their quality preferences www.qurator.org

13 Making the Qurator framework useful A key aim of the Qurator project is to integrate IQ tools with existing standards –IQ indicators should apply to common data formats –Qurator functions should be plugged into tools already used by scientists For proteomics we have aligned Qurator with –the PEDRo standard data model (and its XML serialisation) –the Pedro data entry tool sourceforge.net/projects/pedro

14 PEDRo: a standard format for proteomics data Taylor CF et al. (2003) Nature Biotechnology 21, 247 [Figures: PEDRo schema; section of XML output from PEDRo data collator tool]

15 Qurator Pedro Plugin When a data model is selected, the Qurator Pedro plugin queries the IQ ontology to discover indicators relevant to the kind of data e.g. for the PEDRo proteomics model, HR, MC and ELDP Values for the calculated indicators for the selected data items are displayed along with basic provenance data (e.g. timestamp…) Web services that calculate the IQ indicators can be invoked using the “Plugins” button

16 Conclusions & future work Numerical indicators (HR, MC, and ELDP) that describe the quality of protein identifications by peptide mass fingerprinting –Useful for validation of protein identifications –Can be computed from search reports (e.g. Mascot) The proteomics case is a proof-of-concept for the Qurator IQ framework –We are working to embed Qurator services in a wider range of desktop tools (e.g. Taverna workflow environment) –Further usability/usefulness trials of the tools are planned

17 Acknowledgements Alun Preece Binling Jin Al Brown Paulo Missier Suzanne Embury Computing Science Medical Sciences Computer Science www.qurator.org www.abdn.ac.uk/proteomics

