Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation Alexander (“Sasha”) Schwarzman American Geophysical Union (AGU) JATS-Con.

1 Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (“Sasha”) Schwarzman
American Geophysical Union (AGU)
JATS-Con, November 2, 2010

2 Summary
We built a superset of the NLM Journal Publishing Tag Set (JPTS) in order to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset could have been sufficient to meet AGU's needs if it were used in conjunction with an appropriate layer validation technology, such as Schematron.

3 Contents
Why we built a JPTS superset
DTD vs. Schematron
– Attribute values
– Number of element occurrences
– Element position & sequence
– References
Lessons learned

4 Why we built a JPTS superset
No generic book model
Lack of familiarity with Schematron
Lack of mature tool support (running SVRL not a viable option in Production environment)
Lack of expertise on integrating Schematron with validation against relational DB
JATS v2.3: no Compound Keywords, not all content models parameterized

5 DTD vs. Schematron: Attribute values
Requirement: Article type is required and can be one of three types: a regular article (rga), a correction (cor), or an editorial (edt)
Strict DTD: [declaration not preserved in transcript]
JPTS: [declaration not preserved in transcript]

6 DTD vs. Schematron: Attribute values (cont’d)
XML instance (contains non-allowed article type): [markup not preserved in transcript]
Schematron: @article-type ' ' not allowed, must be 'rga', 'cor', or 'edt'
Schematron message: @article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
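
The attribute-value rule above can also be sketched outside Schematron. The following is an illustration in Python's standard library, not the AGU Schematron: the function name `check_article_type` and the one-element sample instance are assumptions, while the three allowed values and the message wording come from the slides.

```python
import xml.etree.ElementTree as ET

# Allowed values taken from the requirement on slide 5.
ALLOWED_ARTICLE_TYPES = {"rga", "cor", "edt"}


def check_article_type(xml_text):
    """Return a list of error messages (empty if the instance passes)."""
    root = ET.fromstring(xml_text)
    value = root.get("article-type")
    if value is None:
        return ["@article-type is required"]
    if value not in ALLOWED_ARTICLE_TYPES:
        return [
            f"@article-type '{value}' not allowed, "
            "must be 'rga', 'cor', or 'edt'"
        ]
    return []


# Hypothetical instance with a non-allowed article type.
print(check_article_type('<article article-type="xxx"/>'))
```

A loose DTD would accept any CDATA value here; the check is what narrows it to the house list.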

7 DTD vs. Schematron: Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly one paragraph, except for two journals (journal codes ‘ja’ and ‘rg’), where Acknowledgments must contain exactly two paragraphs
Strict DTD: [declaration not preserved in transcript]
JPTS: [declaration not preserved in transcript]

8 DTD vs. Schematron: Number of occurrences (cont’d)
XML instance (wrong number of paragraphs)
... jb ... Blah Blah-blah

9 DTD vs. Schematron: Number of occurrences (cont’d)
Schematron
' ' in ' ' must contain exactly two paragraphs
' ' in ' ' must contain only one paragraph

10 DTD vs. Schematron: Number of occurrences (cont’d)
Schematron message
'ack' in 'jb' must contain only one paragraph
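
The journal-dependent paragraph count is the kind of co-occurrence rule a DTD content model cannot state. A minimal Python sketch of the same logic follows; it is not the original Schematron. It assumes the journal code sits in a `journal-id` element and uses a hypothetical sample instance; the codes 'ja' and 'rg' and the message wording come from the slides.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must hold two paragraphs (slide 7).
TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}


def check_ack(xml_text):
    """Return error messages for <ack> elements with a wrong <p> count."""
    root = ET.fromstring(xml_text)
    # Assumption: the journal code is the text of the first <journal-id>.
    journal = root.findtext(".//journal-id", default="").strip()
    errors = []
    for ack in root.iter("ack"):
        count = len(ack.findall("p"))  # direct <p> children only
        if journal in TWO_PARAGRAPH_JOURNALS:
            if count != 2:
                errors.append(
                    f"'ack' in '{journal}' must contain exactly two paragraphs"
                )
        elif count != 1:
            errors.append(
                f"'ack' in '{journal}' must contain only one paragraph"
            )
    return errors


# Hypothetical instance: journal 'jb' with two paragraphs in <ack>.
SAMPLE = """<article>
  <front><journal-meta><journal-id>jb</journal-id></journal-meta></front>
  <back><ack><p>Blah</p><p>Blah-blah</p></ack></back>
</article>"""

print(check_ack(SAMPLE))
```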

11 DTD vs. Schematron: Element position & sequence
Requirement: If a journal has subject grouping (ToC category, subset) and an article belongs to a special collection (special section, theme), then the subject grouping information must precede the special collection information
Strict DTD
<!ELEMENT article-categories (subject-group*, special-collection?) >
JPTS
<!ELEMENT article-categories (subj-group*) >

12 DTD vs. Schematron: Element position & sequence (cont’d)
XML instance (wrong sequence of subject groups)
New Methods and Applications of Earthquake Early Warning
Solid Earth

13 DTD vs. Schematron: Element position & sequence (cont’d)
Schematron
<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
  <assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
    <name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear after a ToC Category or a Subset when either is present
  </assert>
</rule>
Schematron message
subj-group/@subj-group-type='special-section' must appear after a ToC Category or a Subset when either is present
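
The sequence rule on this slide reads: a special-collection group fails if any subject-grouping group follows it. The Python sketch below mirrors that `following-sibling` test for illustration only. The sample markup is hypothetical (the transcript lost the tags of the original instance); the attribute values and message wording come from the slide.

```python
import xml.etree.ElementTree as ET

SPECIAL_COLLECTION = {"special-section", "theme"}
SUBJECT_GROUPING = {"toc-category", "subset"}


def check_subj_group_order(xml_text):
    """Flag special-collection groups that precede a subject grouping."""
    root = ET.fromstring(xml_text)
    errors = []
    for categories in root.iter("article-categories"):
        groups = categories.findall("subj-group")
        for i, group in enumerate(groups):
            group_type = group.get("subj-group-type")
            if group_type not in SPECIAL_COLLECTION:
                continue
            # Equivalent of following-sibling::subj-group[...] in the rule.
            later = groups[i + 1:]
            if any(g.get("subj-group-type") in SUBJECT_GROUPING for g in later):
                errors.append(
                    f"subj-group/@subj-group-type='{group_type}' must appear "
                    "after a ToC Category or a Subset when either is present"
                )
    return errors


# Hypothetical instance: the special section wrongly comes first.
SAMPLE = """<article-categories>
  <subj-group subj-group-type="special-section">
    <subject>New Methods and Applications of Earthquake Early Warning</subject>
  </subj-group>
  <subj-group subj-group-type="toc-category">
    <subject>Solid Earth</subject>
  </subj-group>
</article-categories>"""

print(check_subj_group_order(SAMPLE))
```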

14 DTD vs. Schematron: References
Validating references is a challenge: variety vs. the need to enforce editorial style
Strict DTD:
– Fixed element order, no mixed content
– Punctuation, spacing, face markup generated on output
JPTS:
– Lots of elements, any order, mixed content
– Punctuation, spacing, face markup included

15 DTD vs. Schematron: References (cont’d)
Strict DTD
<!ELEMENT book-standalone-citation ((person-group | string-name), year, source, edition?, (person-group | string-name)?, size?, elocation-id?, publisher-name, publisher-loc) >
<!ATTLIST book-standalone-citation
  id ID #REQUIRED >

16 DTD vs. Schematron: References (cont’d)
JPTS
<!ELEMENT mixed-citation (#PCDATA | person-group | string-name | year | source | edition | size | elocation-id | publisher-name | publisher-loc |... |...)* >
<!ATTLIST mixed-citation
  id ID #IMPLIED
  publication-type CDATA #IMPLIED >

17 DTD vs. Schematron: References (cont’d)
Example: Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

18 DTD vs. Schematron: References (cont’d)
XML instance (strict DTD)
Mood A. M. Graybill F. A. 1963 Introduction to the Theory of Statistics 2nd 295 pp McGraw-Hill New York

19 DTD vs. Schematron: References (cont’d)
XML instance (JPTS)
Mood, A. M., and F. A. Graybill ( 1963 ), Introduction to the Theory of Statistics, 2 nd ed., 295 pp., McGraw-Hill, New York.

20 DTD vs. Schematron: References (cont’d)
Schematron can check that all required elements are present and are in the correct sequence (note the required elements and that edition, if present, follows source):
<!ELEMENT book-standalone-citation ((person-group | string-name), year, source, edition?, (person-group | string-name)?, size?, elocation-id?, publisher-name, publisher-loc) >

21 DTD vs. Schematron: References (cont’d)
Schematron can check that all required elements are present:
<assert test="(person-group | string-name) and year and source and publisher-name and publisher-loc">required element missing</assert>
& that the elements are in the correct sequence:

22 DTD vs. Schematron: References (cont’d)
XML instance (JPTS) (edition is in the wrong place)
Mood, A. M., and F. A. Graybill ( 1963 ), 2 nd ed., Introduction to the Theory …, 295 pp., McGraw-Hill, New York.

23 DTD vs. Schematron: References (cont’d)
This Schematron uses positional predicate [1] to check that year is immediately followed by source:
<rule context="mixed-citation[@publication-type='book-standalone']/year">
' ' must be followed by 'source', not by ' '
Schematron message
'year' must be immediately followed by 'source', not by 'edition'
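
The "immediately followed by" test looks only at the next element sibling, so the punctuation and spacing in mixed content do not interfere. A Python sketch of that idea follows. It is an illustration, not the slide's assert (whose markup the transcript lost), and the tagged sample citation is a hypothetical reconstruction of the misplaced-edition example.

```python
import xml.etree.ElementTree as ET


def check_year_then_source(xml_text):
    """Flag each <year> whose next element sibling is not <source>."""
    root = ET.fromstring(xml_text)
    errors = []
    for citation in root.iter("mixed-citation"):
        if citation.get("publication-type") != "book-standalone":
            continue
        children = list(citation)  # element children only; text is skipped
        for i, child in enumerate(children):
            if child.tag != "year":
                continue
            # Counterpart of the positional predicate [1] on the sibling axis.
            if i + 1 < len(children):
                follower = children[i + 1].tag
            else:
                follower = "(nothing)"
            if follower != "source":
                errors.append(
                    "'year' must be immediately followed by 'source', "
                    f"not by '{follower}'"
                )
    return errors


# Hypothetical tagging of the example with edition in the wrong place.
SAMPLE = (
    '<mixed-citation publication-type="book-standalone">'
    "<person-group>Mood, A. M., and F. A. Graybill</person-group> "
    "(<year>1963</year>), <edition>2nd</edition> ed., "
    "<source>Introduction to the Theory of Statistics</source>, "
    "<size>295 pp.</size>, <publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>."
    "</mixed-citation>"
)

print(check_year_then_source(SAMPLE))
```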

24 DTD vs. Schematron: References (cont’d)
But how to check the sequence of required elements when there might be optional elements interspersed between them?
This Schematron checks that required publisher-name is preceded by required source, regardless of any optional elements that may occur in-between:
<rule context="mixed-citation[@publication-type='book-standalone']/publisher-name">
' ' must be preceded by 'source'
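
Checking "preceded by, with anything optional in between" amounts to asking whether a source element occurs anywhere earlier among the siblings. The Python sketch below shows that logic under the same caveats as before: it is not the original assert, and both sample citations are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_publisher_after_source(xml_text):
    """Flag <publisher-name> elements with no earlier <source> sibling."""
    root = ET.fromstring(xml_text)
    errors = []
    for citation in root.iter("mixed-citation"):
        if citation.get("publication-type") != "book-standalone":
            continue
        seen_source = False
        for child in citation:
            if child.tag == "source":
                seen_source = True
            elif child.tag == "publisher-name" and not seen_source:
                errors.append("'publisher-name' must be preceded by 'source'")
    return errors


# Optional edition and size sit between source and publisher-name: passes.
GOOD = (
    '<mixed-citation publication-type="book-standalone">'
    "<year>1963</year>, <source>Introduction to the Theory of Statistics</source>, "
    "<edition>2nd</edition> ed., <size>295 pp.</size>, "
    "<publisher-name>McGraw-Hill</publisher-name>"
    "</mixed-citation>"
)

# publisher-name appears before source: fails.
BAD = (
    '<mixed-citation publication-type="book-standalone">'
    "<year>1963</year>, <publisher-name>McGraw-Hill</publisher-name>, "
    "<source>Introduction to the Theory of Statistics</source>"
    "</mixed-citation>"
)

print(check_publisher_after_source(GOOD))
print(check_publisher_after_source(BAD))
```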

25 DTD vs. Schematron: References (cont’d)
Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
– Each element's content is rewritten as a string of its child element names
– Content model represented as a regular expression
– Schematron checks the string of names against the regex
– Schematron generates an error message if content does not match the model

26 DTD vs. Schematron: References (cont’d)
An XML file, e.g., citation-models.xml, specifies structured citation models:
...
((string-name | person-group), year, source, edition, (string-name | person-group)?, size?, elocation-id?, publisher-name, publisher-loc)
...
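
The names-as-string technique can be tried out in a few lines. The sketch below uses Python's `re` module in place of the XSLT 2.0 regex support the slides rely on, so it illustrates the idea rather than the actual implementation. The regular expression is a hand translation of the book-standalone model, assuming edition is optional as in the strict DTD declaration; the function names and sample citations are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex for the book-standalone content model. Each element
# name is followed by one space so that names cannot run together.
BOOK_STANDALONE = re.compile(
    r"(person-group |string-name )"
    r"year source (edition )?"
    r"(person-group |string-name )?"
    r"(size )?(elocation-id )?"
    r"publisher-name publisher-loc "
)


def child_names(citation):
    """Rewrite an element's content as a string of its child element names."""
    return "".join(child.tag + " " for child in citation)


def matches_model(xml_text):
    """True if the citation's child sequence matches the content model."""
    citation = ET.fromstring(xml_text)
    return BOOK_STANDALONE.fullmatch(child_names(citation)) is not None


GOOD = (
    '<mixed-citation publication-type="book-standalone">'
    "<person-group>Mood, A. M., and F. A. Graybill</person-group> "
    "(<year>1963</year>), "
    "<source>Introduction to the Theory of Statistics</source>, "
    "<edition>2nd</edition> ed., <size>295 pp.</size>, "
    "<publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>."
    "</mixed-citation>"
)

# Same citation with edition moved in front of source.
BAD = GOOD.replace(
    "<source>Introduction to the Theory of Statistics</source>, "
    "<edition>2nd</edition> ed., ",
    "<edition>2nd</edition> ed., "
    "<source>Introduction to the Theory of Statistics</source>, ",
)

print(child_names(ET.fromstring(GOOD)))
print(matches_model(GOOD), matches_model(BAD))
```

Because only element names enter the string, the punctuation and spacing carried in mixed content never affect the match.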

27 DTD vs. Schematron: References (cont’d)
Advantages:
– XML instance is still DTD-valid
– Mixed content is permitted
– Type-sensitive handling of references is possible
Caveat: XSLT 2.0!

28 Lessons learned
AGU Tag Set + Schematron (200+ checks)
– Ensures data quality
– Ensures markup integrity
– Provides control over production processes
AGU Tag Set is a superset of JPTS
– Based on JPTS
– Uses the same modularization principles
– Can be easily mapped to JPTS
Were we to do this again, we would develop a JPTS subset and a Schematron schema

29 Lessons learned (cont’d)
Appropriate layer validation
– Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
– Rules-based checking is needed anyway
– May as well use the “Californian” JPTS (de facto industry standard), adopted by publishers, conversion & composition vendors, archives, etc.
Paradigm shift: the crux of validation shifts from the XML parser to the Schematron engine

30 Lessons learned (cont’d)
This shift is not without costs:
– Content may be valid to JPTS but make no sense
– Dependency on Schematron for semantic integrity
– Constraints on business partners: must be Schematron-capable and have tools
– Schematron does not “fix” problems; people do. Processes and procedures must be well-defined

31 Lessons learned (cont’d)
Writing a simple Schematron is easy; building a complex and efficient one is not:
– Elicit, document, convey, and clarify the requirements
– Ensure Schematron fits into your workflow
– Modularize Schematron
– Ensure that individual Schematron rules aren’t in conflict
– Optimize Schematron performance
– Employ XSLT 2.0
– Test, test, test
– Cultivate Schematron & XSLT 2.0 expertise in-house

32 Conclusion
What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say: “Superset Me—Not!”

