NATIONAL LIBRARY OF MEDICINE PubMed Central and the NLM Journal Archiving Vocabulary
NATIONAL LIBRARY OF MEDICINE What is PubMed Central? Digital archive of life sciences journals includes health policy, bioinformatics and other fields Participation is voluntary and limited to journals: covered by a major abstracting/indexing service, or have 3 editorial board members with current grants from major non-profit funding agencies Journals deposit an authoritative electronic copy that must meet PMC data quality standards Deposits are permanent Copyright retained by publisher or author
NATIONAL LIBRARY OF MEDICINE Access to PMC Content Free access to full-text articles and supporting data Not necessarily open access Journal may delay free access to its content research articles are generally free in a year or less Full-text searching in PMC Citations for all articles included in PubMed Fully integrated with other Entrez databases – sequence data, taxonomy, books, etc.
NATIONAL LIBRARY OF MEDICINE Why??? Why Free? The more eyes the better Readers provide another level of quality control Why XML? Preserves structure of an article Lends itself to intelligent processing Human readable – not dependent on technology Portable
NATIONAL LIBRARY OF MEDICINE PubMed Central DTD History pmc-1.dtd DTD currently in production (but not for long). Derived from keton.dtd and BMC article.dtd. Designed to be a simple DTD for online display and archive. Written with samples from PNAS, MBC, and BMC. Why a new DTD? Elements/attributes had to be added to accommodate new journals. DTD would become cumbersome quickly if we had to keep making changes for each new title. Original “simplicity” of design would lead to confusing data structures as the DTD expanded. Moved away from standard XML practices to accommodate source SGML. Needed an independent review.
NATIONAL LIBRARY OF MEDICINE The Reviewers Mulberry Technologies, Inc. An electronic publishing consultancy specializing in SGML- and XML-based systems. Has been active in SGML since 1984 and in XML since 1996. Has extensive experience in the development and maintenance of SGML and XML applications for STM publishers. The Task Review the pmc-1.dtd for XML best practices, applicability to archive and online retrieval use, and completeness in application to STM journals. Create an updated version of the DTD. Document the new DTD.
NATIONAL LIBRARY OF MEDICINE The Results pmc-2.dtd Mulberry’s Suggestions Create two DTDs: one for archiving to allow us to convert data from multiple sources to our DTD. a subset for authoring to allow us to retain some control when publishers create articles to the DTD. Use proven solutions like XLINK and the XHTML table standard. Use data models to simplify the DTD.
NATIONAL LIBRARY OF MEDICINE Harvard E-Journal Archiving Project The Mellon Foundation funded the Harvard Library to study the feasibility of using one DTD for archiving journal articles. Harvard commissioned Inera, Inc. for the E-Journal Archive DTD Feasibility Study. Conclusion – yes, it is feasible, but the right DTD does not exist. A meeting was held in April 2002 to discuss the changes needed to the PMC2 DTD to expand its range to include almost any journal. Attendees included PMC, Mulberry Technologies, Inc. (consultant to PMC), The Mellon Foundation, The Harvard Library, and Inera (consultant to Harvard-Mellon).
NATIONAL LIBRARY OF MEDICINE Conclusions 1. PMC and Harvard-Mellon had different ideas about what the DTD should do. Harvard was interested in an Interchange DTD, which would allow publishers to submit in multiple formats, all of which would be valid. PMC was interested in an Archive DTD, which would be open enough to allow conversion of multiple sources into one single format. 2. If the PMC2 DTD were modularized, and some pieces were added (like the OASIS table model), many DTDs could be built using the same elements, giving both flexibility and consistency.
NATIONAL LIBRARY OF MEDICINE Status The “NLM Archiving and Interchange DTD Suite” has been created and released. Mulberry and Inera analyzed hundreds of journals across subjects to ensure that the DTD Suite was powerful enough to tag them. The “NLM Journal Archiving DTD” and the “Journal Publishing DTD” have been created from the DTD Suite. The Archiving DTD and the Suite were circulated through Mulberry’s and Inera’s contacts in the electronic publishing world for comments and suggestions. Suggestions that made the DTD more usable were incorporated.
NATIONAL LIBRARY OF MEDICINE Archiving / Publishing DTDs PLoS is using the DTD for their journals TechBooks is using the Journal Publishing DTD to send PMC content for J. Athletic Training and using the DTD for internal journal production HighWire Press will use the DTDs for their content Atypon JSTOR will use the DTD for its E-Journal Archive CSIRO (Australia's Commonwealth Scientific & Industrial Research Organisation) will tag its journals with the new DTD Several other small journals are trying to use the DTD to submit content to PMC
NATIONAL LIBRARY OF MEDICINE JSTOR The Scholarly Journal Archive JSTOR’s Electronic-Archiving Initiative Archiving full journal issues Use Archiving DTD for article material Publishers supply sample data for analysis and development Association for Computing Machinery American Economic Association American Mathematical Society American Political Science Association Blackwell Publishing, Ltd. The Ecological Society of America John Wiley & Sons National Academy of Sciences The Royal Society The University of Chicago Press
NATIONAL LIBRARY OF MEDICINE HighWire Press Library of the Sciences and Medicine Currently using their own proprietary DTD Will be moving to the Archiving DTD Journals in Biological Sciences Physical Sciences Medical Sciences Social Sciences
NATIONAL LIBRARY OF MEDICINE CSIRO Commonwealth Scientific and Industrial Research Organisation Australia’s largest scientific research agency Independent science and technology publisher Journals, online journals, books, magazines and CD-ROMs Using Inera’s eXtyles to both clean up and convert from Microsoft Word
NATIONAL LIBRARY OF MEDICINE Centers for Medicare and Medicaid Services United States Department of Health and Human Services Centers for Medicare and Medicaid Services Office of Strategic Planning Publishing DTD Initial product is Health Care Financing Review 2004 CMS Statistics guide and other publications to follow FrameMaker application
NATIONAL LIBRARY OF MEDICINE Other Publishers Public Library of Science (PLoS Biology & PLoS Medicine) National Athletic Trainers' Association (Journal of Athletic Training) St. James Publishing (Journal of Burns & Surgical Wound Care) Amphibian and Reptile Conservation Journal of Medical Internet Research
NATIONAL LIBRARY OF MEDICINE Conversion Vendors Tested DTD within a week of release Tested in advance of clients Have converted for publishing clients Notable vendors (that we know about): TechBooks (Fairfax, VA) — submitted XML in Publishing DTD to PubMed Central within 2 weeks of the DTD release. using for & 30 journals Data Conversion Laboratory (Fresh Meadows, NY) — has agreed to convert content to the Archiving DTD for individual Open Access articles submitted to PubMed Central by authors. converted CMS publications (and others)
NATIONAL LIBRARY OF MEDICINE Other Service Providers Atypon Systems hosting, software, and operations provider using for Annual Reviews – 31 journals Lawrence Erlbaum Associates – 81 journals University of California Press – 33 journals Impressions, Inc. composition and publishing for print and online books and journals used both the DTD and a schema version with Word 2003
NATIONAL LIBRARY OF MEDICINE Who Owns the Tagset? The DTDs? Not “Open Source” DTDs and Tagset are in the public domain NLM retains control over changes and additions to the Tagset and DTDs But: Anyone may use them, or create a new DTD from them, without permission from NLM
NATIONAL LIBRARY OF MEDICINE NLM Requests 1. If you create a DTD from the DTD Suite 2. And intend it to stay compatible with the Suite 3. Then please include the following comment in modules: “Created from, and fully compatible with, the Archiving and Interchange DTD Suite.” 1. If you alter one or more modules of the Suite 2. Then please rename your version and all its modules to avoid any confusion with the original Suite 3. And please include the following statement as a comment in all your DTD modules: “Based in part on, but not fully compatible with, the Archiving and Interchange DTD Suite.”
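A publisher deriving a DTD from the Suite might want to confirm that each module carries one of the two requested statements. The following is only a minimal sketch of such a check, assuming each module is available as a text string; the helper name `compatibility_status` and its return labels are hypothetical and not part of any NLM tooling.

```python
import re

# The two statements NLM asks for, quoted from the slide above.
COMPATIBLE = (
    "Created from, and fully compatible with, "
    "the Archiving and Interchange DTD Suite."
)
ALTERED = (
    "Based in part on, but not fully compatible with, "
    "the Archiving and Interchange DTD Suite."
)

# Captures the body of each comment of the form <!-- ... -->
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def compatibility_status(module_text: str) -> str:
    """Return 'compatible', 'altered', or 'unmarked' for one DTD module.

    Hypothetical helper: it only looks for the requested wording inside
    comments and does not verify actual compatibility with the Suite.
    """
    for body in COMMENT_RE.findall(module_text):
        # Collapse line breaks and runs of spaces so wrapped comments match.
        normalized = " ".join(body.split())
        if COMPATIBLE in normalized:
            return "compatible"
        if ALTERED in normalized:
            return "altered"
    return "unmarked"
```

Running the helper over every module of a derived DTD would flag any file where the statement was forgotten.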
NATIONAL LIBRARY OF MEDICINE What’s Next?: Working Group To keep the DTD relevant to the publishing and archiving communities, we have created the XML Interchange Structure Working Group. This group advises NLM on recommended changes in and/or additions to the tagset. The Working Group met for the first time on August 18, 2003. The recommendations from this meeting led to version 1.1 of the DTDs, released on November 1, 2003.
NATIONAL LIBRARY OF MEDICINE What’s Next?: Other DTDs Because the DTD is built as a set of DTD modules, other document types can be created (relatively) easily using the same content models. We are building a Books DTD and planning an Online Documentation DTD.
NATIONAL LIBRARY OF MEDICINE What’s Next? PMC Complete redesign of software – built around NLM Archiving and Interchange DTD Portable PMC – toolset with basic functions to build SQL database from PMC archival files create standard TOC and article displays Citation linking based on automated parsing of reference citations from scanned OCR text Japanese (NIG/DDBJ) developing journal archiving system Wellcome Trust / JISC adding £1.75 million to digitize journals they enlist Journals will be regular PMC participants Titles include Annals of Surgery, Journal of Anatomy, Journal of Physiology, all of which go back to late 1800s
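The Portable PMC idea of building an SQL database from archival XML files and deriving a table of contents from it can be illustrated roughly as follows. This is a sketch under stated assumptions, not the actual PMC toolset: the sample document is invented and heavily simplified, its element names are only modeled on the NLM tag set, and the `articles` table and both function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented, simplified sample; element names are modeled on the NLM tag set.
SAMPLE = """<article>
  <front>
    <journal-meta><journal-title>Example Journal</journal-title></journal-meta>
    <article-meta>
      <title-group><article-title>An Example Article</article-title></title-group>
    </article-meta>
  </front>
</article>"""


def load_articles(xml_docs, conn):
    """Insert one row per article XML string into a hypothetical `articles` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS articles (journal TEXT, title TEXT)")
    for doc in xml_docs:
        root = ET.fromstring(doc)
        journal = root.findtext("front/journal-meta/journal-title", default="")
        title = root.findtext(
            "front/article-meta/title-group/article-title", default=""
        )
        conn.execute(
            "INSERT INTO articles (journal, title) VALUES (?, ?)", (journal, title)
        )
    conn.commit()


def table_of_contents(conn, journal):
    """Return the article titles stored for one journal, sorted alphabetically."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE journal = ? ORDER BY title", (journal,)
    )
    return [title for (title,) in rows]
```

For example, after `conn = sqlite3.connect(":memory:")` and `load_articles([SAMPLE], conn)`, calling `table_of_contents(conn, "Example Journal")` returns the single sample title. A real archive would also need volume, issue, and pagination fields to order a TOC properly.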
NATIONAL LIBRARY OF MEDICINE Intermission The PMC Back Issue Scanning Project or Digitization
NATIONAL LIBRARY OF MEDICINE Back Issue Digitization Create a complete digital archive of PMC journals Bring the collection to today’s “if not online, it doesn’t exist” user Cover-to-cover digital copy of everything up to where journal began producing electronic copy Publisher gets free, unencumbered digital copy First complete archive, Bulletin of the Medical Library Association (1911), released in November 2003
NATIONAL LIBRARY OF MEDICINE Digitization Details PDF file for each article with true reproduction of grayscale and color images Citation / abstract XML record (if not already in PubMed) Mechanically improved (5-pass) OCR text for: Searching across the collection and in individual PDFs Potential automated reference linking TIFF files for scanned page images and each grayscale and color figure
NATIONAL LIBRARY OF MEDICINE TOC for Digitized Issue
NATIONAL LIBRARY OF MEDICINE Digitized Article Summary Page
NATIONAL LIBRARY OF MEDICINE Page Browse, HiFi Image, PDF
NATIONAL LIBRARY OF MEDICINE OCR Text for J Virol Article
NATIONAL LIBRARY OF MEDICINE PMC and Open Access PMC exists with or without Open Access (OA) Deposit of OA content is subject to normal PMC standards and requirements OA source files can be downloaded by anyone: PMC-OAI service for full-text XML FTP service to download complete article packet – XML, images, PDF and supplemental data files OA simplifies distribution of content to collaborating archives Opens door for data mining and creation of innovative products by others
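As an illustration of how a client might address the PMC-OAI service, the sketch below builds a request URL, assuming the service follows the OAI-PMH protocol. The `GetRecord` verb and the `identifier` and `metadataPrefix` arguments come from OAI-PMH itself; the base URL is a placeholder because the slide does not give the service address, and the function name and sample identifier are hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the slide does not give the PMC-OAI address.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access is performed)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"
```

A call such as `get_record_url("oai:example.org:123", "oai_dc")` yields a URL that could then be fetched with any HTTP client; which metadata prefix returns the full-text XML would have to be taken from the service's own documentation.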
NATIONAL LIBRARY OF MEDICINE Stumping What the world needs now: XML-based authoring and editing products designed for scientific articles Straightforward, universal standard for defining access rights, similar to copyright indication Other operational, free archives that can form a collaborative archiving network
NATIONAL LIBRARY OF MEDICINE Links PubMed Central – http://www.pubmedcentral.gov NLM DTDs and documentation – http://dtd.nlm.nih.gov