Data Reuse, Sharing, and Production: An article-centric investigation of data citation practices in prominent journals Sarah Judson DataONE Summer 2010
Motivation Many journals have data citation, or at least data sharing policies. – Most are “recommendations” – Many will soon be mandatory – But are they enacted? Multiple depositories exist for data sharing – Allow browsing for available data – Provide space for data storage – Recommend how data reuse should be properly credited. – But are they utilized?
Intentions Report current status of data citation and sharing to relevant journals Recommend best practices – Increase ability to retrace and reuse data – Ease transition to mandatory policies – Promote appropriate credit to data author
Background Advent of data sharing/citation policies Continued expression of the need for increased data sharing, esp. for meta-analysis and global change studies Similar studies in Biomedical journals* or focused on Genbank**, but few in Ecological/Evolutionary journals *Piwowar and Chapman. Public sharing of research datasets: A pilot study of associations. Journal of Informetrics April (2): **Noor et al Data Sharing: How Much Doesn't Get Submitted to GenBank? PLoS Biol. 4(7):228
Research Questions What are current practices for data citation within articles? – Do authors tend to cite that dataset itself or related paper? – How does the author obtain the dataset? How do these practices vary across discipline, journal, data type, data source? – Are data citation practices influenced more by attitude of the discipline towards data sharing or journal policy? How have these practices varied across time? – Does increased data reuse/sharing correlate with changes in journal policy? – Does data reuse/sharing simply increase with time since the advent of the internet?
Angles of Attack “Snapshot” approach – 1st issue in 2010 for journals of interest To assess “current state” To evaluate utility of a particular journal for more detailed “Time Series” investigation “Time Series” approach – Random sample of 25 articles per journal per year To investigate trends over time, especially considering changes in journal data/citation policies
Nitty-Gritty Methods Random sampling – Export all articles and accompanying metadata Journal-specific – Assign record number to each article – Generate random numbers to select 25 articles Data Extraction – Recorded on Excel spreadsheet, uploaded weekly to GoogleDocs – Read Journal Citation/Data Policy in Preparation for Extraction – Read through articles manually Special attention to the Methods and Acknowledgements sections. Identify instances of data reuse and sharing – Copy relevant excerpts Code according to established fields – Record additional metadata Open access, Discipline, Submission to Publication duration, etc.
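The random sampling step above (number each exported article, then draw random numbers to pick 25) could be sketched roughly as follows. This is only an illustration, not the procedure actually used in the study: the function name `sample_articles`, the fixed seed, the placeholder article list, and drawing without replacement are all assumptions.

```python
import random


def sample_articles(records, n=25, seed=None):
    """Pick n articles at random from one journal-year export.

    Assumption: sampling is without replacement, and a seed is used
    only so the draw can be repeated. Neither is stated in the slides.
    """
    rng = random.Random(seed)
    # Assign a 1-based record number to each exported article.
    numbered = list(enumerate(records, start=1))
    if len(numbered) <= n:
        # Fewer articles than requested: keep them all.
        return numbered
    # Draw n distinct records and return them in record-number order.
    return sorted(rng.sample(numbered, n))


# Hypothetical export of 120 articles for one journal and year.
articles = [f"article-{i}" for i in range(1, 121)]
sample = sample_articles(articles, n=25, seed=2010)
print(len(sample))
```

Keeping the record number alongside each article makes it possible to trace a sampled row back to the original export.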
Extracted Fields ISI metadata – DOI – Author and affiliation – Abstract and keywords – Journal and ISSN For each instance of Data Reuse, Sharing, or Production – Depository – Type of InText and Bibliographic Citation Author-Year, URL, Accession # – How dataset acquired Is depository clearly referenced? Was it obtained from a colleague? Is it previous work by one of the authors? – Where citation occurs – Type of Dataset Gene Sequence, Phylogenetic Tree, Ecological, etc
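One possible way to hold the extracted fields listed above in a structured record is sketched below. The class and attribute names are hypothetical and only mirror the slide; the study itself recorded these fields in an Excel spreadsheet, and the DOI and ISSN shown are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataInstance:
    """One instance of data reuse, sharing, or production in an article."""
    kind: str                                # "reuse", "sharing", or "production"
    depository: Optional[str] = None         # e.g. "GenBank", "TreeBase", "Dryad"
    in_text_citation: Optional[str] = None   # "author-year", "url", or "accession"
    bibliographic_citation: Optional[str] = None
    acquisition: Optional[str] = None        # "depository", "colleague", "own previous work"
    citation_location: Optional[str] = None  # e.g. "Methods", "Results"
    dataset_type: Optional[str] = None       # e.g. "gene sequence", "phylogenetic tree"


@dataclass
class ArticleRecord:
    """ISI metadata for one article plus its coded data instances."""
    doi: str
    authors: List[str]
    journal: str
    issn: str
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    instances: List[DataInstance] = field(default_factory=list)


# Placeholder values, for illustration only.
record = ArticleRecord(
    doi="10.0000/example",
    authors=["Doe, J."],
    journal="Systematic Biology",
    issn="0000-0000",
)
record.instances.append(
    DataInstance(
        kind="reuse",
        depository="GenBank",
        in_text_citation="accession",
        acquisition="depository",
        citation_location="Methods",
        dataset_type="gene sequence",
    )
)
print(len(record.instances))
```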
Selected Journals Dryad “Top Three” – Justification: 1. Most currently posted datasets...is it really being reused? 2. Known "High Impact" Journals 3. Cover target disciplines and depositories – Systematic Biology (Systematics, Phylogenetics/geography) – American Naturalist (Behavior, Natural History, Ecology) – Molecular Ecology (Genetics, Molecular Evolution) Other options: ESA family, Discipline-specific, Broad
Limitations Only looking at a few journals and disciplines Relying only on the main text – Not looking at supplementary material unless article extremely unclear – Have to assume if it wasn’t stated, it wasn’t reused/shared Would have developed automated extraction, text coding if time permitted – Process more articles – Remove bias – Standardization
Unresolved Problems (suggestions please!) Data Type Classification – Easy: Gene Sequences and Phylogenetic Trees – Biology vs. Ecology – Subdivisions in Biology, Earth, etc Bio: Morphology, Behavior Eco: Competition, Community Earth: Soils, GIS “Articles” according to ISI – AmNat: High % are models Notes and Comments Natural History Miscellany – SysBio: Points of View Author Recurrence – SysBio: only 50 articles per year and multiple publications/accreditations to the same people (Wiens, Sullivan) – AmNat: less pronounced problem (Abrams)
Findings Qualitative observations Good citation, bad citation Journal Comparison Time Series – % Reuse % Sharing results not presented – Data type – Depository
Qualitative Observations – Internal (journal) supplementary depositories used more as a dump than for reusable data Additional or color figures and tables Statistical outputs – InText citations allude to a raw data supplement, but it often ends up being raw results – Defunct data storage Personal URLs Problem retrieving supplementary data (SysBio) – More data produced than shared – Alignments and Trees often not posted to TreeBase – Ecological datasets grossly undershared
Haphazard citation practices Accessions cited in Text vs. Table Author vs. Accession Only depository referenced – Especially with large datasets Some in Methods, Some in Results – Majority of reuses cited in Methods – Sharing cited roughly 50/50 between Methods and Results Crediting self before others – Bibliographic citations not given or only for same author – Give article citation for self, but not accession; accession for others but not their article Disparate citation formats within a single paper
Good Citation “Previously published sequence data were used for V. velella 18S (Collins, 2002, GenBank AF358087), P. porpita 18S (Collins, 2002, AF358086), Staurocladia wellingtoni 18S (Collins, 2002, AF358084), S. wellingtoni 16S (Schuchert, 2005, AJ580934), Hydra circumcincta 18S (Medina et al., 2001, AF358080), and H. vulgaris 16S (Pont-Kingdon et al., 2000, AF100773).” Taxon Gene region Author-Year – accompanying bibliographic citation GenBank Accession
Bad Citation Incomplete “The sequences, which were all produced in our previous studies (Aceto et al. 1999; Cozzolino et al. 2001) and are available in GenBank” Usually missing accession, sometimes author and depository Sometimes the info is buried in tables or not given for large compilations Unclear – “During annual aerial surveys, observers sketch the extent of defoliation from the air on paper or digital maps (Ciesla 2000) that are then compiled as a series of polygons in a geographical information system (GIS) (Liebhold et al. 1997).” Who is the original data author? – Are these theoretical, methodological or data citations? – Bibliographic citations occasionally shed light Where is the data stored?
What is a good citation? Data easily retraceable Proper credit given Criteria – Depository mentioned in text – Accession mentioned in text – Author credit given in Bibliography
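The three criteria above could be applied to a coded citation as in the sketch below. The slide lists the criteria but gives no scoring scheme, so counting how many criteria are met is an assumption, as are the function and key names.

```python
def citation_score(depository_in_text, accession_in_text, author_in_bibliography):
    """Count how many of the three 'good citation' criteria are met.

    Assumption: each criterion is a simple yes/no judgement made by the
    person coding the article; equal weighting is illustrative only.
    """
    criteria = {
        "depository mentioned in text": depository_in_text,
        "accession mentioned in text": accession_in_text,
        "author credit given in bibliography": author_in_bibliography,
    }
    met = [name for name, ok in criteria.items() if ok]
    return len(met), met


# A citation naming the depository and author but omitting the accession.
score, met = citation_score(True, False, True)
print(score, met)
```

Returning the list of criteria met, not just the count, shows which piece of information is missing from a given citation.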
Citations: Systematic Biology
Citations: American Naturalist
Journal Comparison: Snapshot
Systematic Biology – Data Reuse: ~ Frequent use of Genbank ~ Occasional use of Treebase – Data Sharing: ~ Often post to Treebase, but often unclear about GA vs. PT ~ Internal Difficult (no unique accession [generic URL]; not accessible pre-2008)
American Naturalist – Data Reuse: ~ Varied data (biological) ~ Often extracted from literature or used to validate a model – Data Sharing: ~ Occasional sharing: Dryad, Treebase, Genbank, internal
Molecular Ecology – Data Reuse: ~ Frequent use of Genbank, but steadily drops off after 2009 ~ Some morphological data matrices – Data Sharing: ~ Posting to Genbank, but alternatively given in Methods and Results ~ Level of accessibility varies widely
Ecology – Data Reuse: ~ Minor datasets – Data Sharing: ~ Extensive datasets rarely shared ~ Ecological Archives (accessible but used for excess figures and results)
Journal Comparison: % Reuse and Sharing in 2010
Percent Reuse over time
Depository: Systematic Biology
Depository: American Naturalist
Data Types: Systematic Biology
Data Types: American Naturalist
Back to the big picture Inform journals and depositories about current practices vs. policy Best Practices recommendations Continued research on trends in data citation
Suggested Best Practices Accession numbers and Authors of each dataset (reused and shared) given in the Methods or Supplementary Table referenced in the Methods – Authors not charged extra page/online fees – Authors allowed to exceed Reference limit to credit data Editorial enforcement – Checklist Internal Depositories made more accessible – Usable formats – Unique and Stable URLs
Long Term Best Practices Separate Supplementary Data Section – Example: Molecular Ecology SysBio added a separate section but it is defunct AmNat has an “Online enhancements” header – For both internal and external deposits – Distinguish from “data-dump” (extra figures, outputs) – Accompanying References section Unlimited length DATA cited, in addition to publication – Could combine into a new reference type: Author. Year. Title. Journal. Pages. Depository. Accession. Track on par with publications in ISI, etc
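The combined reference type proposed above (Author. Year. Title. Journal. Pages. Depository. Accession.) could be assembled as in this sketch. The function name and all example values, including the accession, are hypothetical placeholders; only the field order comes from the slide.

```python
def format_data_reference(author, year, title, journal, pages, depository, accession):
    """Join the seven fields in the order given on the slide.

    Each field is closed with exactly one period, so a trailing
    period in an input such as "Doe, J." is not doubled.
    """
    parts = [author, str(year), title, journal, pages, depository, accession]
    return " ".join(f"{part.rstrip('.')}." for part in parts)


# Placeholder values, for illustration only.
reference = format_data_reference(
    author="Doe, J.",
    year=2010,
    title="Example dataset title",
    journal="Example Journal",
    pages="1-10",
    depository="GenBank",
    accession="XX000000",
)
print(reference)
# Doe, J. 2010. Example dataset title. Example Journal. 1-10. GenBank. XX000000.
```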
Continued Research Snapshot and time series of Molecular Ecology – Possibly Ecology if time permits – Alternative (suggestions please!): Just snapshots Trends over time – Has reuse and sharing increased? – Have citation practices improved over time? Is this influenced by journal/depository recommendations on citations? Correlation with influential factors – Is there more data reuse in articles that are also open access or share data? – Are certain dataset types or article disciplines more inclined to reuse/share data? Data shared vs. data produced Sync with Journal/Depository Metadata (Nic) and Search Findings (Valerie) – Refine “Good” citation criteria Journal and depository specific
Additional Exploration Track the cited or shared datasets Look at supplemental data alone – Internal (journal repositories) Additional data not cited in text? Data dumps? – Ease of access Accuracy of accession numbers Actual data reusability – Method/processing metadata – File format Software/model reuse and sharing – R-packages, GUIs – Encouraged by American Naturalist Databases – Independent databases vs. depositories % utilized out of available – Caching/stability options, linking metadata to depositories
Final products Reports to requisite journals/depositories Potential Manuscripts – Journal Comparison: Citation Practices – Treebase: Shared vs. Produced – Best Practices recommendations Shared dataset!
Thanks for listening! Questions? Suggestions? – Unresolved problems – Continued Research
Hurdles Determining extracted fields Coding data now vs. later
In light of data/citation policies… Compare “performance” of SysBio and AmNat against their depository and journal policies (do they meet the requirements?) – or state this in future research section OWW: Nic – do “editor” instructions or other sections of policy indicate how data/citation policies are enforced?