Protein grouping in mzIdentML


1 Protein grouping in mzIdentML

2 ProteinDetectionList
ProteinAmbiguityGroup id=“PAG1”
ProteinDetectionHypothesis id=“PDH1” dbseq_ref=“dbseq_Q05421|CP2E1_MOUSE” (anchor protein)
ProteinAmbiguityGroup id=“PAG2”
ProteinDetectionHypothesis id=“PDH2” dbseq_ref=“dbseq_Q05423|CP2E2_MOUSE” (sequence same-set)
ProteinDetectionHypothesis id=“PDH3” dbseq_ref=“dbseq_Q05312|CP2F1_MOUSE” (sequence subset)
....

3 ProteinAmbiguityGroup and ProteinDetectionHypothesis

4 Existing CV terms for ProteinDetectionHypothesis

id: MS:1001591
name: anchor protein
def: "A representative protein selected from a set of sequence same-set or spectrum same-set proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001592
name: family member protein
def: "A protein with significant homology to another protein, but some distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001593
name: group member with undefined relationship OR ortholog protein
def: "TO ENDETAIL: a really generic relationship OR ortholog protein." [PSI:MS]
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001594
name: sequence same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to an identical set of peptide sequences." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001595
name: spectrum same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

5 Existing CV terms for ProteinDetectionHypothesis

id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set of the peptide sequence matches for another protein, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001597
name: spectrum sub-set protein
def: "A protein with a sub-set of the matched spectra for another protein, where the matches cannot be distinguished using the evidence in the mass spectra, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001598
name: sequence subsumable protein
def: "A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001599
name: spectrum subsumable protein
def: "A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
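The terms on the two slides above are tag-value records, each beginning with an `id:` line. As a rough sketch only (the function name `parse_terms` and the trimmed sample text are illustrative assumptions, not part of the CV tooling), such records could be read into dictionaries like this:

```python
# Minimal sketch: split tag-value CV records into one dict per term.
# A new term is assumed to start at every "id:" line (hypothetical helper).
SAMPLE = """
id: MS:1001591
name: anchor protein
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001596
name: sequence sub-set protein
is_a: MS:1001101 ! protein group or subset relationship
"""


def parse_terms(text):
    terms = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first ": " only, so accessions such as "MS:1001591" stay whole.
        tag, _, value = line.partition(": ")
        if tag == "id":
            terms.append({})
        if terms:
            terms[-1].setdefault(tag, []).append(value)
    return terms


terms = parse_terms(SAMPLE)
names = {t["id"][0]: t["name"][0] for t in terms}
# The parent accession is the part of the is_a value before " ! ".
parents = {t["id"][0]: t["is_a"][0].split(" ! ")[0] for t in terms}
```

Each tag maps to a list because a record may repeat a tag (for example several `xref:` lines).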

6 Problems
No requirement for any exporter to use the terms (“MAY”)
“anchor protein” doesn’t capture intended role and isn’t used consistently
Example term:
id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set...." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
No definition of what should be put in the value slot of CV terms:
– Could be the PDH identifier, accession or DBSequence identifier of the group representative or of any other protein that is super-set to this protein
– Or anything else for that matter
What does passThreshold=“true” on a PDH mean?
Unclear how to count the number of identified proteins in an mzIdentML file
– Count PAGs or count PDHs?
No terms for the protocol describing how inference has been done or how to interpret results

7 Proposed work group outcomes
Attach CV terms describing how protein inference has been done
– Still under discussion, since these effectively describe parts of the algorithm used
Exactly one mandatory “representative protein” (new name for “anchor protein”) MUST be present on a PDH per group
– To be checked by semantic validator
ProteinDetectionList MUST have a CV term “number of identified proteins” (count PAGs that have a “representative protein” PDH with passThreshold=“true”)
Each PDH SHOULD be flagged with one term from a group stating whether it is “representative protein”, “sequence|spectrum same-set”, “sequence|spectrum subset”, “sequence|spectrum subsumed” or “marginally distinguished” (i.e. not strictly any of these, but not enough evidence to be a group representative)
– Value slot of these terms SHOULD contain a comma-separated list of super-set or same-set (as appropriate) PDH IDs
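The proposed counting rule can be sketched as follows. This is an illustration under stated assumptions, not the semantic validator itself: the inline XML omits the mzIdentML namespace and most required attributes, and the term is matched by the proposed name “representative protein” because the slides give no accession for it. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment (assumption: real files carry a
# namespace plus accession and cvRef attributes on each cvParam).
SAMPLE = """
<ProteinDetectionList id="PDL1">
  <ProteinAmbiguityGroup id="PAG1">
    <ProteinDetectionHypothesis id="PDH1" passThreshold="true">
      <cvParam name="representative protein"/>
    </ProteinDetectionHypothesis>
    <ProteinDetectionHypothesis id="PDH2" passThreshold="true">
      <cvParam name="sequence same-set protein" value="PDH1"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
  <ProteinAmbiguityGroup id="PAG2">
    <ProteinDetectionHypothesis id="PDH3" passThreshold="false">
      <cvParam name="representative protein"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
</ProteinDetectionList>
""".strip()

REPRESENTATIVE = "representative protein"  # proposed term name from the slide


def representatives(pag):
    """Return the PDH elements in a group flagged as representative protein."""
    return [
        pdh
        for pdh in pag.findall("ProteinDetectionHypothesis")
        if any(cv.get("name") == REPRESENTATIVE for cv in pdh.findall("cvParam"))
    ]


def count_identified_proteins(pdl):
    """Count groups whose single representative PDH has passThreshold="true"."""
    count = 0
    for pag in pdl.findall("ProteinAmbiguityGroup"):
        reps = representatives(pag)
        if len(reps) != 1:
            # The proposal requires exactly one representative per group.
            raise ValueError(f"{pag.get('id')}: expected 1 representative, found {len(reps)}")
        if reps[0].get("passThreshold") == "true":
            count += 1
    return count


identified = count_identified_proteins(ET.fromstring(SAMPLE))
```

Here PAG2 has a representative that did not pass the threshold, so only PAG1 contributes to the count.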

8 Table 1 – New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.

mzIdentML context: ProteinDetectionProtocol
CV terms: “No parsimony”, “Strict parsimony”, “Parsimony with additional considerations”. Parent term: “Parsimony usage”
Values: xsd:String (to allow free text description)
Requirement level: SHOULD
Description: No parsimony used means that no parsimony approach has been applied in generating the protein list. Strict parsimony used should be indicated if parsimony is the only consideration used to report proteins. Parsimony with additional considerations used should be indicated if additional information such as quantitation information is used to influence which proteins are reported, or if some additional proteins are reported for other reasons, such as a desire to report one protein from each gene to which any matched peptide maps.

mzIdentML context: ProteinDetectionProtocol
CV terms: “No intact protein separation for protein inference”, “Partial isolation for protein inference”, “Nearly complete isolation for protein inference”. Parent term: “Role of intact protein separation in protein inference”
Values: xsd:String (to allow free text description)
Requirement level: SHOULD
Description: In workflows where proteins are not separated to any degree, or in which protein separation information is not used in the protein inference, this will have a value of “No intact protein separation for protein inference”, as will be the case in strictly bottom-up proteomics. At the other limit, Nearly complete isolation should be indicated when separation of intact proteins is conducted and relied upon for protein inference, as is common in multi-dimensional gel-based work. The Partial isolation for protein inference value should be specified for cases where some level of protein isolation is used, for example if a sizing column is used to separate intact proteins into fractions, or in the common GeLC-MS workflow where 1D gel separation is followed by bottom-up analysis of the gel slices.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Attempted isoform differentiation”, “Prevented isoform differentiation”. Parent term: “Isoform Differentiation”
Values: -
Requirement level: SHOULD
Description: In the context of a parsimony approach, an inference tool can either attempt to report multiple protein forms by determining whether there is adequate evidence to support the detection of more than one isoform in a cluster (most common), or alternatively the tool can prevent this differentiation process and maximally group instead.

mzIdentML context: ProteinDetectionProtocol
CV terms: Accession Ambiguity is Reported
Values: “true”, “false”
Requirement level: SHOULD
Description: Used for reporting whether ambiguity is reported, i.e. if true, PAGs may contain one or more PDHs; if false, each PAG must contain only one PDH (no attempt to report ambiguity).

9 Table 1 cont. – New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.

mzIdentML context: ProteinDetectionProtocol
CV terms: Threshold applied to Peptides
Values: “true”, “false”
Requirement level: SHOULD
Description: Set to true if thresholds are applied at the PSM or peptide level prior to protein inference. If thresholds have been applied, these should be reported under ProteinDetectionProtocol->Threshold using appropriate CV terms.

mzIdentML context: ProteinDetectionProtocol
CV terms: Multiple matches per spectrum are considered
Values: “true”, “false”
Requirement level: SHOULD
Description: This should be set to false for protein inference approaches that limit consideration to a single top-ranking peptide per spectrum; it should be set to true for approaches that preserve multiple answers per spectrum and provide all of these to the protein inference algorithm.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Spectrum-centric parsimony minimization”, “Sequence-centric parsimony minimization”, “Sequence-centric parsimony minimization with additional rules”, “No parsimony minimization”. Parent term: Parsimony Minimization Method
Values: -
Requirement level: SHOULD
Description: Sequence-centric parsimony minimization means that the inference method has sought to find the minimal set of proteins that explains all the peptide sequences observed, while Spectrum-centric parsimony minimization means that the inference approach has sought to find the minimal set of proteins that explains the collection of observed spectra. Sequence-centric parsimony minimization with additional rules would apply if a sequence-centric approach is used together with additional rules, for example if allowances are made to compensate for limitations of this approach such as I/L and deamidation ambiguities. No parsimony minimization should be indicated only if the Parsimony usage field is set to No parsimony.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Exhaustive list ambiguity modeling”, “Limited list ambiguity modeling”. Parent term: Ambiguity Modeling Approach
Values: -
Requirement level: SHOULD
Description: In modelling a PAG, one approach is for an algorithm to list all known intersection relationships, including accessions that have very limited overlap with the representative protein in the group. Alternatively, an algorithm can limit the scope of accessions that are included. For example, one could list only accessions that have at least some minimal level of intersection with the representative protein in the group. This CV term simply captures whether the group modelling is limited in some way or is exhaustive in listing accessions.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Protein Quality Threshold: MinimumNumSequencesRequired
Values: Integer
Requirement level: SHOULD
Description: An integer value representing the number of identified peptide sequences required for creating a PDH.

mzIdentML context: ProteinDetectionProtocol
CV terms: TaxonomyBasedPreference
Values: “true”, “false”
Requirement level: SHOULD
Description: In some workflows, one might map identified peptides to a multi-species protein sequence database, but prefer matches to sequences from a particular species.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Other thresholding terms?
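Table 1 states one cross-term rule: “No parsimony minimization” should be indicated only if Parsimony usage is “No parsimony”. A minimal sketch of that check, assuming the two reported values are already available as plain strings (the function name is hypothetical):

```python
def parsimony_terms_consistent(parsimony_usage, minimization_method):
    """Check the one-way rule from Table 1.

    "No parsimony minimization" is only allowed when the Parsimony usage
    value is "No parsimony"; every other method is accepted here.
    """
    if minimization_method == "No parsimony minimization":
        return parsimony_usage == "No parsimony"
    return True


ok = parsimony_terms_consistent("No parsimony", "No parsimony minimization")
conflict = parsimony_terms_consistent("Strict parsimony", "No parsimony minimization")
```

The rule is deliberately one-directional, since the table does not say which minimization method must accompany “No parsimony”.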

10 Table 2 – New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.

mzIdentML context: ProteinDetectionList
CV term: number of identified proteins
Values: Integer
Requirement level: MUST
Description: The value reported should equal the number of PAGs containing a PDH flagged as Representative protein and with passThreshold=“true”.

mzIdentML context: ProteinAmbiguityGroup
CV term: Protein cluster identifier
Values: String. A within-file unique identifier
Requirement level: MAY
Description: Reporting a common identifier allows multiple PAGs to be linked, for example to indicate that some peptides are shared between different PAGs.

mzIdentML context: ProteinAmbiguityGroup
CV term: NumberDistinctProteinSequences
Values: Integer
Requirement level: SHOULD
Description: The number of distinct protein sequences among the PDHs in the group. For example, if there are two PDHs with different identifiers that have identical full-length sequences, the NumberDistinctProteinSequences would be one.

mzIdentML context: ProteinDetectionHypothesis
CV term: Representative protein
Values: -
Requirement level: MUST (be present on one PDH per PAG that is counted)
Description: The Representative protein will generally have likelihood greater than or equal to other proteins in the ProteinAmbiguityGroup, but this is not required. Exactly one PDH within a PAG must be assigned this label to serve as the representative for the putatively detected protein. A PDH labelled as the Representative protein can have passThreshold=“true|false”, i.e. it need not have passed the threshold reported in the ProteinDetectionProtocol.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Same-Set Protein
Values: xsd:String – comma-separated list of PDH IDs that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to another protein in the group, having matches to an identical set of peptide sequences.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Same-Set Protein
Values: xsd:String – comma-separated list of PDH IDs that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to the Representative protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra.
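The NumberDistinctProteinSequences example in Table 2 (two PDHs with identical full-length sequences count as one) can be sketched as below. The mapping from PDH identifier to sequence and the short sequences themselves are made up for illustration; in a real file the sequence would be looked up through the DBSequence each PDH references.

```python
def number_distinct_protein_sequences(pdh_to_sequence):
    """Count distinct full-length sequences among the PDHs of one group."""
    return len(set(pdh_to_sequence.values()))


# Hypothetical group: PDH1 and PDH2 have different identifiers but the same sequence.
group = {
    "PDH1": "MAVLGITVAL",
    "PDH2": "MAVLGITVAL",
    "PDH3": "MAVLGITVAK",
}
distinct = number_distinct_protein_sequences(group)
```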

11 Table 2 cont. – New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Subset Protein
Values: xsd:String – comma-separated list of PDH IDs that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the peptide sequence matches for the Representative protein, and no distinguishing peptide matches.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Subset Protein
Values: xsd:String – comma-separated list of PDH IDs that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the matched spectra for the Representative protein, where the matches cannot be distinguished using the evidence in the mass spectra.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Multiply Subsumable Protein
Values: xsd:String – comma-separated list of PDH IDs that subsume this PDH
Requirement level: SHOULD
Description: A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Multiply Subsumable Protein
Values: xsd:String – comma-separated list of PDH IDs that subsume this PDH
Requirement level: SHOULD
Description: A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Marginally distinguished protein
Values: -
Requirement level: MAY
Description: Assigned to a PDH that has some evidence to support its presence in addition to the representative protein, i.e. it has a unique peptide, but not sufficient evidence to be promoted to Representative protein in a PAG.

mzIdentML context: ProteinDetectionHypothesis
CV term: Covering Set Protein
Requirement level: MAY
Description: A member of a minimal set of proteins sufficient to explain all matched peptides/spectra via a parsimony approach. This provides an alternative means of reporting a parsimonious protein list when ParsimonyUsage=“Parsimony with additional considerations”. A PAG can contain zero, one, or multiple PDHs bearing this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Identical
Values: xsd:String – comma-separated list of native accession(s) of protein(s) with identical protein sequence
Requirement level: MAY
Description: The full-length protein sequence is identical to that of the protein specified in the value attribute of this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Subsequence
Values: xsd:String – native accession of protein with “super”-sequence
Requirement level: MAY
Description: The full-length protein sequence is a subsequence of the protein specified in the value attribute of this term.
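The table captions describe how the semantic validator maps requirement levels to message severities. A minimal sketch of that mapping, assuming a flat set of reported term names (the real validator would check each term in its own mzIdentML context; the function and variable names are hypothetical):

```python
# Severity per requirement level, as stated in the table captions.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "info"}

# A few Table 2 terms with their requirement levels.
TABLE2_LEVELS = {
    "number of identified proteins": "MUST",
    "NumberDistinctProteinSequences": "SHOULD",
    "Protein cluster identifier": "MAY",
}


def report_missing(levels, present_terms):
    """Return (severity, term) pairs for every expected term that is absent."""
    return [
        (SEVERITY[level], term)
        for term, level in levels.items()
        if term not in present_terms
    ]


messages = report_missing(TABLE2_LEVELS, {"NumberDistinctProteinSequences"})
```

A file reporting only NumberDistinctProteinSequences would therefore yield one error and one informational message under this sketch.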

12 Unresolved issues
Are the protocol terms necessary / sensible / overkill?
Is there general consensus on the idea that the number of identified proteins MUST be reported, and must equal the count of PAGs with PDH passThreshold=“true”?
Is it sensible to have SHOULD rules on all subset/same-sets?
Extra terms for relationships between protein sequences
– Probably these will be removed
Mechanism for updating the mzIdentML specifications and validation software
– Minor update + submission to shortened PSI process?

