Aspiration: all scientific literature online, all data online, and for them to interoperate
Why is open data an urgent issue?
» Closing the concept–data gap
» Maintaining the credibility of science
» Exploiting the data deluge and computational potential
» Combating fraud
» Addressing planetary challenges
» Supporting citizen science
» Responding to citizens' demands for evidence
» Restraining the "Database State"
Intelligent openness
Openness of data per se has no value; open science is more than disclosure. Data must be:
» Accessible
» Intelligible
» Assessable
» Re-usable
Only when these four criteria are fulfilled are data properly open; metadata is central to all four.
The transition to open data Pathfinder disciplines where benefit is recognised and habits are changing
Databases as publications
Hosts/suppliers of databases are publishers:
» They have a responsibility to curate and provide reliable access to content
» They may also deliver other services around their products
» They may provide the data as a public good or charge for access
The Worldwide Protein Data Bank (wwPDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. As of January 2012 it held 78,477 structures; 8,120 were added in 2011, a rate of 677 per month. In 2011, an average of 31.6 million data files were downloaded per month. The total storage requirement for the archive was 135 GB. The total cost of the project is approximately $11–12 million per year (total costs, including overhead), spread across the four member sites, and it employs 69 FTE staff. wwPDB estimates that $6–7 million of this is "data in" expenses relating to the deposition and curation of data.
UK Data Archive
The UK Data Archive (UKDA), founded in 1967, is curator of the largest collection of digital data in the social sciences in the United Kingdom. It is funded mainly by the Economic and Social Research Council, the University of Essex and JISC, and is hosted at the University of Essex. On average around 2,600 (new or revised) files are uploaded to the repository monthly (this includes file packages, so the absolute number of files is higher). The baseline size of the main storage repository is under 1 TB, though with multiple versions and files outside this system a total capacity of around 10 TB is required. The UKDA currently (26/1/2012) employs 64.5 people. The total expenditure of the UK Data Archive in 2010–11 was approximately £3.43 million; total staff costs across the whole organisation in 2010–11 were £2.43 million. Non-staff costs in 2009–10 were approximately £580,000, but will be much higher in 2011–12 (almost £3 million) owing to additional investment.
Institutional Repositories (Tier 3)
» Most university repositories in the UK have small amounts of staff time. The Repositories Support Project survey in 2011 received responses from 75 UK universities. It found that the average university repository employed a total of 1.36 FTE, spread across managerial, administrative and technical roles. 40% of these repositories accept research data. In the vast majority of cases (86%), the library has lead responsibility for the repository.
» ePrints Soton
» ePrints Soton, founded in 2003, is the institutional repository for the University of Southampton. It holds publications including journal articles, books and chapters, reports and working papers, higher theses, and some art and design items. It is looking to expand its holdings of datasets.
» It has a staff of 3.2 FTE (1.0 FTE technical, 0.9 senior editor, 1.2 editors, 0.1 senior manager). Total costs of the repository are £116,318, comprising staff costs of £111,318 and infrastructure costs of £5,000. (These figures do not include a separate repository for electronics and computer science, which will be merged into the main repository later in 2012.) It is funded and hosted by the University of Southampton, and uses the ePrints server, developed by the University of Southampton School of Electronics and Computer Science.
Contingency of these databases
» PDB and arXiv depend on mixes of discretionary decisions by government bodies and philanthropy
» The UK Data Archive is unusual in its centrality to the social sciences funding system
» University repositories are highly varied in performance and in support from the top
» Funders and universities are under many pressures, but researchers can do more to promote data access, as can journals
Growth in formal corrections (examples from Nature, Nature Biotechnology, Nature Neuroscience, Nature Methods)
» Missing controls, results not sufficiently representative of experimental variability, data selection
» Investigator bias, e.g. in determining the boundaries of an area to study (lack of blinding)
» Technical replicates wrongly described as biological replicates
» Over-fitting of models to noisy datasets in various experimental settings: fMRI, X-ray crystallography, machine learning
» Errors and inappropriate manipulation in image presentation; poor data management
» Contamination of primary culture cells
Mandating reporting standards is not sufficient
2002: Nature journals mandate deposition of MIAME-compliant microarray data (MIAME: Minimum Information About a Microarray Experiment)
2006: compliance issues identified
Ioannidis et al., Nat. Genet. 41, 149 (2009): of 18 papers containing microarray data published in Nature Genetics in 2005–2006, 10 analyses could not be reproduced and 6 could be reproduced only partially.
Irreproducibility: NPG actions so far
» Awareness raising – meetings in 2013/14: NINDS, NCI, Academy of Medical Sciences, Royal Society, Science Europe and others
» Awareness raising – editorials and articles by experts
» Removed length limits on online methods sections
» Substantially increased figure limits in Nature and improved access to Supplementary Information data in research journals
» Statistical advisor (Terry Hyslop) and referees appointed
» 'Reducing our irreproducibility' Editorial, plus checklists for authors, editors and referees (23 April 2013)
» Nature + NIH + Science meeting of journal editors in Washington (May 2014)
Raising awareness: our content
» Tackling the widespread and critical impact of batch effects in high-throughput data – Leek et al., NRG, Oct 2010
» How much can we rely on published data on potential drug targets? – Prinz et al., NRDD, Sep 2011
» The case for open computer programs – Ince et al., Nature, Feb 2012
» Raise standards for preclinical cancer research – Begley & Ellis, Nature, Mar 2012
» Must try harder – Editorial, Nature, Mar 2012
» Face up to false positives – MacArthur, Nature, Jul 2012
» Error prone – Editorial, Nature, Jul 2012
» Next-generation sequencing data interpretation: enhancing reproducibility and accessibility – Nekrutenko & Taylor, NRG, Sep 2012
» A call for transparent reporting to optimize the predictive value of preclinical research – Landis et al., Nature, Oct 2012
» Know when your numbers are significant – Vaux, Nature, Dec 2012
» Reuse of public genome-wide gene expression data – Rung & Brazma, NRG, Feb 2013
Raising awareness: our content (2)
» Reducing our irreproducibility – Editorial, Nature, May 2013
» Reproducibility: Six red flags for suspect work – Begley, Nature, May 2013
» Reproducibility: The risks of the replication drive – Bissell, Nature, Nov 2013
» Of carrots and sticks: incentives for data sharing – Kattge et al., Nature Geoscience, Nov 2014
» Open code for open science? – Easterbrook, Nature Geoscience, Nov 2014
» Code share – Editorial, Nature, 29 Oct 2014
» Journals unite – joint Editorial with Science and NIH, 6 Nov 2014
Implementation of the reporting checklist
» Onerous for authors, referees, editors and copyeditors
» Referees: we are not yet sure whether they are paying much attention
» Authors: some papers are submitted with the checklist without prompting, and many have embraced source data
» Improves reporting (see following slide); we have commissioned an external assessment of the impact
» The checklist may be driving changes in experimental design in the longer term
Reporting animal experiments in Nature Neuroscience: Jan '12 (10 papers) vs Oct '13 – Jan '14 (41 papers). 'Not reported' includes cases for which the specific question was not relevant (e.g., the investigator cannot be blinded to treatment). Most frequent problems: power analysis calculations, low n (sample size justification), proper blinding or randomization, multiple t-tests.
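One of the most frequent problems flagged above is missing power analysis and sample size justification. As a rough illustration (not from the source; the effect size, alpha and power values below are arbitrary examples), the standard normal-approximation sample size for a two-sample comparison can be sketched in a few lines:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size: float, alpha: float = 0.05,
                          power: float = 0.8) -> int:
    """Approximate n per group for a two-sided, two-sample test detecting
    a standardized effect size (Cohen's d), using the normal approximation
    (the exact t-based answer is slightly larger)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return ceil(n)

# Even a "large" effect (d = 0.8) needs about 25 subjects per group at
# alpha = 0.05 and 80% power; a medium effect (d = 0.5) needs about 63.
print(sample_size_per_group(0.8), sample_size_per_group(0.5))
```

The point of the sketch is that studies with a handful of animals per group are underpowered for all but very large effects, which is exactly the "low n" problem the audits describe.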
Attention needed: cell line identity
Checklist question: identify the source of cell lines and indicate whether they were recently authenticated (e.g. by STR profiling) and tested for mycoplasma contamination. This question is not yet enforced as a mandate.
Audit of Nature Cell Biology papers (Aug '13 – Dec '13), covering 21 relevant papers:
» 20 indicate the source of cell lines (quality of information variable)
» 4 indicate authentication was done (timing of tests not always satisfactory)
» 5 acknowledge cell lines were not authenticated
» 17 indicate the cells were tested and demonstrated mycoplasma-free (timing of tests not always satisfactory)
Question about developing author-contribution transparency
Author contribution statements in Nature journals are informal, unstructured and non-templated. Should this change? How? (Possible goals: increased credit, increased accountability for potential flaws.) How granular should this information become?
Irreproducibility: underlying issues
» Experimental design: randomization, blinding, sample size determinations, independent experiments vs technical replicates
» Statistics: big data, overfitting (needs gut scepticism/tacit knowledge)
» Gels and microscopy images
» Reagent validity – antibodies, cell lines
» Description of animal studies
» Methods description
» Data deposition
» Publication bias and refutations – where to publish them?
» IP confidentiality – replication failures unpublishable
» Lab supervision and lab training
» Pressure to publish: "it pays to be sloppy"
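One item in that list, over-fitting of models to noisy data, is easy to demonstrate. The sketch below (illustrative only; the data and polynomial degrees are invented for this example) fits a low- and a high-degree polynomial to the same noisy linear data: the flexible model wins on the training points and loses badly on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a simple linear relationship, y = 2x + 1.
x_train = np.linspace(0.0, 1.0, 15)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.3, x_train.size)

# Held-out points just beyond the training range.
x_test = np.linspace(1.0, 1.5, 10)
y_test = 2.0 * x_test + 1.0

def fit_errors(degree):
    """Mean squared error on training and held-out data for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train_lo, test_lo = fit_errors(1)  # matches the true model
train_hi, test_hi = fit_errors(9)  # over-parameterized: chases the noise

# The high-degree fit has the lower training error but far worse error
# on the held-out points - the signature of over-fitting.
assert train_hi <= train_lo and test_hi > test_lo
```

In a paper, only the excellent training fit is visible; this is why the checklist pushes for independent replicates and validation data rather than goodness-of-fit alone.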
Funders: The NIH
Collins and Tabak, Nature, 27 January 2014
NIH is developing a training module on enhancing reproducibility and transparency of research findings, with an emphasis on good experimental design. This will be incorporated into the mandatory training on responsible conduct of research for NIH intramural postdoctoral fellows later this year. Informed by this pilot, final materials will be posted on the NIH website by the end of this year for broad dissemination, adoption or adaptation on the basis of local institutional needs.
Funders: The NIH
Collins and Tabak, Nature, 27 January 2014
Several of the NIH's institutes and centres are also testing the use of a checklist to ensure a more systematic evaluation of grant applications. Reviewers are reminded to check, for example, that appropriate experimental design features have been addressed, such as an analytical plan and plans for randomization, blinding and so on. A pilot was launched last year, which we plan to complete by the end of this year, to assess the value of assigning at least one reviewer on each panel the specific task of evaluating the 'scientific premise' of the application: the key publications on which the application is based (which may or may not come from the applicant's own research efforts). This question will be particularly important when a potentially costly human clinical trial is proposed on the basis of animal-model results. If the antecedent work is questionable and the trial is particularly important, key preclinical studies may first need to be validated independently. Informed by feedback from these pilots, the NIH leadership will decide by the fourth quarter of this year which approaches to adopt agency-wide, which should remain specific to institutes and centres, and which to abandon.
Universities/institutes: target issues
» Data validation
» Lab size and management
» Training
» Publication bias
» Access to data and notebooks
» Reagent access
Nature and NPG data policies
» Enforce community database deposition
» Encourage community database development
» Launch Scientific Data
» Nature-journal editors encourage submissions of Data Descriptors to Scientific Data