Archiving Research Data, Dryad, and Publishers Neil Beagrie, Charles Beagrie Ltd Bloomsbury Conference June 2010 With contributions from Julia Chruszcz, Peter Williams, and Todd Vision
Overview The Challenge; The Dryad Consortium; Supplementary Data and Publishers; Research Data Preservation Costs (KRDS); The Future.
PRC Global Study [chart not reproduced; twelve sample sizes ranging from n=841 to n=3759] Source: PRC global study (forthcoming)
Requesting Data Wicherts et al. (2006 Am. Psychol. 61, 726) requested data from the 141 most recent articles in American Psychological Association (APA) journals. 6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes… Only 27% of authors shared their data
The Dryad Consortium of Scholarly Societies and publishers (and libraries)
Archiving at publication Avoids loss, corruption, obsolescence of data files; The point in time when authors are best able to ensure the correctness of data and metadata; Authors have incentive to deposit their data in order to complete the publication process; Journals are best able to monitor compliance with policy; In short, the GenBank model works.
Incentives to authors Access to colleagues' data Visibility and citability –Another way for work to have high impact Integration –Combinability with other data adds value Long-term preservation –Including data format migration Ad hoc data sharing can be burdensome –Deposition to multiple specialized repositories –Fulfilling individual requests for data takes effort
Joint Data Archiving Policy DEPOSIT AT PUBLICATION –As a condition for publication, all data used in the paper should be archived in an appropriate public archive. REPEATABILITY –Data should be given with sufficient detail so that, together with the paper content, each result in the published paper may be re-created. EMBARGO –Authors may elect to have the data publicly available at time of publication or, if the archive allows, opt to embargo access to the data. EXCEPTIONS –Exceptions may be granted at the discretion of the editor, especially for sensitive information such as the location of endangered species. COORDINATION –The aim is for the Dryad consortium of journals to adopt this policy simultaneously.
That's all well and good, but where's this appropriate public archive?
A mosaic of specialized databases There are a growing number to which deposition is encouraged/required (GenBank, TreeBASE) –And others are emerging A world in which every datatype had its own required database, each with its own submission system: –Would be a huge burden on authors –Would inevitably leave some data orphaned –Might never be financially possible
Overcoming the submission burden Integrating journal submission and data submission –Prepopulating bibliographic metadata –Handshaking with specialized repositories Enhancing low-quality author-provided metadata –Human curation –Machine-assisted metadata enhancement
The Repository Dryad is a repository (at Duke) for datasets underlying scientific research articles; Its initial focus has been evolution and ecology; Participating journals subscribe to the Joint Data Archiving Policy; Dryad datasets will have Digital Object Identifiers (DOIs) and Creative Commons CC-Zero licenses; Project funded by the National Science Foundation 2008-2012; Sustainability plan a key deliverable.
Overview Consultancy for Dryad Sustainability: covered areas of draft business plan and sustainability for Dryad; Presenting one of the contributions (publishers) to the section on Comparators and Costs; Outcomes from desk research and 12 interviews with publishers/data publishers, plus some additional input drawn from Keeping Research Data Safe; Very brief presentation: article in preparation for Learned Publishing Oct 2010 issue; KRDS2 available from JISC
Interviewees Journal of Clinical Investigation; Journal of the American Medical Association; Molecular Phylogenetics and Evolution (Elsevier); Journal of Heredity (OUP); Ecological Society of America; Wiley-Blackwell + Ecology Letters; Royal Society; Federation of American Societies for Experimental Biology; OECD Publishing; Internet Archaeology and Archaeology Data Service; Pangaea: Publishing Network for Geoscientific & Environmental Data; Dataverse Network (Social Sciences, Harvard)
Some Findings: growth Many interviewees stated that supplementary data and materials are showing rapid growth; 3 gave figures: from 32 articles in 2000 to 251 in 2009 (7.84 times the 2000 figure, an increase of 684%); from 6% in 2005 to 38% in 2009; from 2% a decade ago to 87% in 2009.
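The first of these figures can be checked directly: 251 is about 7.84 times 32, which corresponds to a rise of roughly 684% over the 2000 count. The short sketch below works through that arithmetic; the function name `percent_increase` is illustrative and does not come from the study.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the relative increase from old to new, as a percentage."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Figures quoted on the slide: 32 articles in 2000, 251 in 2009.
articles_2000 = 32
articles_2009 = 251

increase = percent_increase(articles_2000, articles_2009)
ratio = articles_2009 / articles_2000

print(f"increase over 2000: {increase:.0f}%")
print(f"2009 as a multiple of 2000: {ratio:.2f}x")
```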
Some Findings: workflow Supplementary data have grown organically at the various journals investigated (author driven); Both the work and the costs are being absorbed into the daily running of journals; in 4 cases there was minimal impact on work duties; in 5 others there was a significant but often unquantified impact (two of these might be considered data publications, with a focus on publishing data papers or datasets); and in 3 cases the information was not available or unknown; The differences can be explained in terms of the level of effort or importance applied: the greatest levels of effort are associated with copy editing, format migration, addition of metadata, etc., whilst the least effort is required for simply hosting the material; and/or by high levels of automation in the workflow.
Some Findings: costs These were in most cases unknown or only partially known; Costs mentioned but usually not quantified include digital storage costs, salary costs of journal staff, and long-term preservation costs; Detailed cost information was really only available from Internet Archaeology via Archaeology Data Service, which had participated in an activity-based costing study (KRDS2); Internet Archaeology archiving costs reflect those for a dataset publisher, so they are only a comparator for part of Dryad's content (large datasets).
Some Findings: revenue Only author fees and journal subscription fees were mentioned as current revenue sources for the supplementary materials in journals; 3 journals interviewed have author charges for supplementary materials (see next slide); The data archiving and sharing organisations interviewed relied primarily on (uncertain) research grants and temporary or recurrent core funding, but one had access to a small endowment and another has a charging policy for some depositors.
Some Findings: author charges Journal of Clinical Investigation: authors are charged $300 for supplemental data to appear online with accepted articles; Ecological Archives: submission of appendices and supplements is free up to 10 MB. Above this, there is a fee of $250 for the first 1 GB and $50 for each subsequent GB. The fee for publication of a data paper is $250 for publication of the abstract in the relevant journal plus publication of up to 10 MB in Ecological Archives. An additional $250 is charged for data sets between 10 MB and 1 GB, and for larger datasets there is an additional $50 per GB fee; The Federation of American Societies for Experimental Biology (FASEB) charges $100 for each supplemental file.
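The Ecological Archives schedule is tiered, so a worked example may help. The sketch below is one reading of the fees as quoted on the slide, not the publisher's own calculator. Three things are assumptions: that 1 GB means 1000 MB, that a partial GB beyond the first is rounded up to a whole GB, and that the "$50 per GB" for data papers applies only to GB beyond the first, mirroring the supplement rule. The function names are illustrative.

```python
import math

# Assumption: the slide does not say whether 1 GB is 1000 MB or 1024 MB.
MB_PER_GB = 1000


def supplement_fee(size_mb: float) -> int:
    """Fee in USD for appendices and supplements, as quoted on the slide."""
    if size_mb < 0:
        raise ValueError("size must be non-negative")
    if size_mb <= 10:
        return 0  # free up to 10 MB
    fee = 250  # covers everything above 10 MB up to the first 1 GB
    if size_mb > MB_PER_GB:
        # Assumption: a partial GB beyond the first is charged as a whole GB.
        extra_gb = math.ceil((size_mb - MB_PER_GB) / MB_PER_GB)
        fee += 50 * extra_gb
    return fee


def data_paper_fee(size_mb: float) -> int:
    """Fee in USD for a data paper: $250 base plus the size-based charge."""
    return 250 + supplement_fee(size_mb)


for size in (5, 500, 2500):
    print(size, "MB:", supplement_fee(size), "supplement,",
          data_paper_fee(size), "data paper")
```

Under these assumptions a 2500 MB supplement costs $250 for the first GB plus $50 for each of two further (rounded-up) GB.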
Keeping Research Data Safe (KRDS1 & KRDS2): JISC-funded studies of Research Data Preservation Costs (separate Dryad costing project by Lori Eakin-Richards based on KRDS approach)
KRDS: what did we learn? Whole of Service costing/Seeing the Big Picture; Selection of 2009 Allocation of UKDA Activity Costs: Acquisition 5.8%; Ingest 21.5%; A. Storage + Pres. Planning 3.1%; Access 16.9%
KRDS: Implications Changing view of digital preservation costs: –getting stuff in and out costs much more than keeping it (bit preservation + migration); –Staff costs c.70% of total costs; –Importance of economies of scale and automation; –Findings of KRDS and Dryad Repository's own activity costing projections fed into Dryad sustainability planning.
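The first implication can be illustrated with the UKDA percentages quoted on the preceding slide. The sketch below assumes that Acquisition, Ingest, and Access together represent "getting stuff in and out", and that the combined storage and preservation planning share represents "keeping it". That grouping is an interpretation, not something the slides state, and the figures are only a selection of activities, so they do not sum to 100%.

```python
# Selected 2009 UKDA activity cost shares (percent of total), from the slide.
ukda_cost_share = {
    "Acquisition": 5.8,
    "Ingest": 21.5,
    "A. Storage + Pres. Planning": 3.1,
    "Access": 16.9,
}

# Assumption: these three activities make up "getting stuff in and out".
in_and_out_activities = ("Acquisition", "Ingest", "Access")

in_and_out = sum(ukda_cost_share[name] for name in in_and_out_activities)
keeping = ukda_cost_share["A. Storage + Pres. Planning"]
ratio = in_and_out / keeping

print(f"in and out: {in_and_out:.1f}% of total costs")
print(f"keeping:    {keeping:.1f}% of total costs")
print(f"ratio:      {ratio:.1f} to 1")
```

On this grouping, moving data in and out accounts for roughly fourteen times the share attributed to storing and planning its preservation.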
Future Plans Dryad sustainability plan being put to Dryad member societies and publishers; Dryad extending consortium to new members –achieving economies of scale; Bid to JISC to establish Dryad-UK; Extending KRDS research and implementations.
Further Information Dryad: see www.datadryad.org; Keeping Research Data Safe 2 (KRDS2) webpage at www.beagrie.com/jisc.php; KRDS2 report available from JISC website http://www.jisc.ac.uk/publications/reports/2010/keepingresearchdatasafe2.aspx#downloads; Email: firstname.lastname@example.org