Making your data work for you: Scratchpads, publishing & the Biodiversity Data Journal. Vince Smith [1], Dave Roberts [1] & Lyubomir Penev [2]. [1] Natural History Museum, London; [2] Pensoft Publishers, Sofia, Bulgaria. firstname.lastname@example.org. EBI, UK, 25 September 2012
Our informatics grand challenge… “Link together evolutionary data… by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses” Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE. doi:10.1016/j.tree.2011.11.001 This requires data, information & knowledge to be… Digital (not printed paper); Openly accessible (not behind barriers); Linked-up (not in silos)
15-20k new spp. described annually (2M total) [1]; 30k nomenclatural acts (12M total) [1]; 20k phylogenies (750k total) [2]; 31k taxa sequenced (360k taxa total) [3]; 800k BioMed papers (40M total pp. of taxonomy) [4]; countless specimens, images, maps, keys… Most of our output is not digital, open or linked. It is typically generated by small communities for “local” research projects. Figures from [1] Zhang, Zootaxa 2011 4, 1-4; [2] Web-of-Science; [3] Genbank and [4] PubMed.
Scratchpad Virtual Research Environments Making taxonomy digital, open & linked
What is a Scratchpad? A website for you & your community. 1) Your data; 2) Uploaded & tagged; 3) “Published” & reviewed on your site. Fast, intuitive, fit for use
Scratchpads: EDIT (07-11), ViBRANT / eMonocot (11-13). Hosted websites for taxonomists; taxonomic, regional or societal; research & publication platform; supports the taxonomic workflow; modular (Drupal) & flexible; two full-time developers; ecosystem of communities (~450). http://scratchpads.eu
Summary of what Scratchpads can do: Taxon pages, generated from tagged content (plant/animal); Bibliography management; Character matrices; Specimen records; Distribution maps (from specimens and regional); Images, video and sound (bulk import); Excel spreadsheet import (dynamically generated); Darwin Core Archive export; Tabular data editing; Custom content; User management; Custom webforms; EOL data import (taxonomy, species information); GBIF Map integration
Scratchpad v.1 usage (2007 to Mar. 2012). Nodes: 430,948; Sites: 326; Users: 6809; Active users: 5733 (273 w / 759 m). [Chart: sites and users over time, with ViBRANT and SP 2 marked.] User types: professional scientists, amateur naturalists, citizen scientists. Range: 1-1049; Mean: 15; Mode: 1
Scratchpad 2, the new version of Scratchpads. More professional. Easier to configure (workflows), navigate (facets) & populate (MS Excel templates). Greater standardisation; still highly flexible; project profiles (eMonocot); framework for integration. Launched March 2012; 120 sites to date; EOL Fellows; SP1 migration ongoing, e.g. http://ihs.myspecies.info/
Getting data in and out of Scratchpads 2
Online community revision. Freeloader flies (http://milichiidae.info). Taxonomy is in perpetual beta: constantly evolving; changing contributors; small granular contributions. Sustainability: a permanent space to work; guaranteed access (2016); easy ways to get the data out. Open science: beyond Open Access; new ways of working; data management plans. Need incentives to use: more efficient (functions & reuse); attribution & provenance; credit via citation. New forms of publication
Publishing observations & taxon data. Specimen records & species pages on Scratchpads are pushed to GBIF & EOL via Darwin Core Archive (DwCA); this requires site registration with GBIF & EOL. On Scratchpads: >19k specimen records, >122k species pages. In GBIF: >377M specimen records. In EOL: >1M species pages. http://scratchpads.eu > http://gbif.org & http://eol.org
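To make the Darwin Core step a little more concrete, here is a minimal sketch of serialising specimen records as an occurrence table whose column headers are standard Darwin Core term names. The helper `write_occurrences`, the identifier scheme and the record values are invented for illustration and are not taken from Scratchpads; a real Darwin Core Archive also bundles a `meta.xml` descriptor in a zip file, which is left out here.

```python
import csv
import io

# Standard Darwin Core term names, used as column headers.
DWC_FIELDS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "country",
    "decimalLatitude",
    "decimalLongitude",
]

# Fields this sketch insists on; the choice is an assumption, not a GBIF rule.
REQUIRED_FIELDS = ("occurrenceID", "basisOfRecord")


def write_occurrences(records):
    """Serialise a list of dicts to Darwin Core style CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            raise ValueError("record is missing: " + ", ".join(missing))
        # Unknown keys are dropped, absent optional terms become empty cells.
        writer.writerow({name: record.get(name, "") for name in DWC_FIELDS})
    return buffer.getvalue()


# Hypothetical specimen record ("Aus bus" is a placeholder name).
EXAMPLE_RECORDS = [
    {
        "occurrenceID": "urn:example:occ-1",
        "basisOfRecord": "PreservedSpecimen",
        "scientificName": "Aus bus",
        "eventDate": "2012-09-25",
        "country": "United Kingdom",
        "decimalLatitude": "51.4967",
        "decimalLongitude": "-0.1764",
    }
]

occurrence_csv = write_occurrences(EXAMPLE_RECORDS)
```

Keeping the headers identical to the Darwin Core term names is the design choice that matters here: an aggregator can then map the columns without any site-specific configuration.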
Experiments with article publishing. Paper assembled from Scratchpad database; XML submission, peer review & marked-up publication by Pensoft; 5-step workflow for selecting data, adding metadata & previewing; published in ZooKeys & PhytoKeys (worldwide coverage) as PDF, HTML & XML. http://scratchpads.eu > http://pensoft.net doi:10.3897/zookeys.50.539
Example papers via Scratchpads… Blagoderov V, Hippa H, Nel A (2010). ZooKeys 50: 79–90. doi: 10.3897/zookeys.50.506; Faulwetter S, Chatzigeorgiou G, Galil BS, Nicolaidou A, Arvanitidis C (2011). ZooKeys 150: 327–345. doi: 10.3897/zookeys.150.1877; Brake I, von Tschirnhaus M (2010). ZooKeys 50: 91–96. doi: 10.3897/zookeys.50.505. Live (updated) versions of these papers: http://milichiidae.info/node/14995 http://polychaetes.marbigen.org/node/35 http://sciaroidea.info/node/44428
BDJ The Biodiversity Data Journal Making small data big!
Why do we need another new journal!!! Taxonomy needs less fragmentation, not more! BUT… We need to encourage taxonomists to mobilize & describe their data. This takes considerable effort (e.g. Scratchpads). “Arguably” this is best rewarded through credit. This means papers and citations. The process must be very easy for authors, must facilitate data reuse and must meet “Open Data” policy commitments. The Biodiversity Data Journal is very different…
Biodiversity Data Journal (BDJ). All data matters: no lower or upper limit of manuscript size! Multiple publishing routes (not just Scratchpads). ALL within a single online collaborative platform, including the writing of the manuscript! New collaborative article authoring tool. Community peer review with “open” & “public” options; this is in addition to conventional peer review. Online editorial process and version control. Standards-compliant (Darwin Core, Dublin Core, NLM etc.). Pre-defined, Code-compliant article templates
BDJ publication & dissemination workflow
Pensoft manuscript writing tool: collaborative online editing; rich text capabilities; various templates for taxon treatments; identification keys builder; assembling plates from single figures; references import (CrossRef, PubMed Central, etc.); species occurrence data import (Darwin Core compliant); smart citation for figures, tables, references & automated positioning
Testing screenshots of the writing tool: ID key preview; multi-figure plates; plate layout; ID key builder; manuscript preview
Why publish in the BDJ? Joining (small) data into a large data pool. Open access, archiving and re-use of your data through data aggregators. A citation record and credit for data in the form of peer-reviewed publications. Online article authoring and an online editorial process for authors, reviewers and editors. Truly innovative dissemination of atomized content. Very low cost: free in the launch phase, thereafter at a fee that anyone can afford!
What will BDJ publish? Single taxon treatments and nomenclatural acts; local or regional checklists; sampling reports and occasional inventories; habitat-based checklists and inventories; ecological and biological observations of species and communities?; single identification keys; ANY KIND of biodiversity-related database, including genomic, ecological and environmental data (data papers); biodiversity-related software tools. Starting late 2012, early 2013. Recruiting editors now
BDJ Barcoding, genomic & environmental sequence papers Making small data big!
Mammal taxa added to Genbank annually. Legend: proper Linnaean names; “Aus sp.” = “dark taxa”, taxa (specimens) that aren't identified to a known species
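The distinction between proper Linnaean names and “Aus sp.” labels can be approximated with a simple name filter. The sketch below is a rough heuristic under stated assumptions, not the method used to produce the charts: the function names, the list of open-nomenclature qualifiers (`sp.`, `cf.`, `aff.`, `nr.`) and the rule that any digit marks a provisional label are all choices made for illustration, and the example names are placeholders.

```python
import re

# A plain binomial or trinomial: capitalised genus plus one or two epithets.
LINNAEAN = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+( [a-z][a-z-]+)?$")

# Open-nomenclature qualifiers that signal an unidentified species.
QUALIFIER = re.compile(r"\b(sp|cf|aff|nr)\.")


def is_dark_taxon(name):
    """Return True when a label does not look like a full Linnaean name."""
    name = name.strip()
    if QUALIFIER.search(name):
        return True
    if any(ch.isdigit() for ch in name):
        # Voucher or cluster codes such as "Aus sp. ABC-2012".
        return True
    return LINNAEAN.match(name) is None


def dark_proportion(names):
    """Fraction of labels classified as dark taxa (0.0 for an empty list)."""
    if not names:
        return 0.0
    return sum(is_dark_taxon(n) for n in names) / len(names)


# Placeholder labels, invented for illustration.
EXAMPLE_NAMES = ["Aus bus", "Aus sp.", "Aus cf. bus", "Cus dus"]
example_share = dark_proportion(EXAMPLE_NAMES)
```

Real sequence-database labels are considerably messier (hybrids, authorities, strain codes), so a filter like this would need checking against a sample before any proportion it reports is trusted.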
Proportion of mammal dark taxa in Genbank. Legend: proper Linnaean names; “Aus sp.”
BOLD Proportion of invert. dark taxa in Genbank
Dark taxa are the norm for bacteria
A lesson in principles for dealing with dark taxa Roth v. Wikipedia http://www.newyorker.com/online/blogs/books/2012/09/an-open-letter-to-wikipedia.html
But Wikipedia said “no” “I understand your point that the author is the greatest authority on their own work,” writes the Wikipedia Administrator—“but we require secondary sources.”
But Wikipedia said “no” One of Wikipedia’s core principles, along with things like neutrality, is verifiability: a reader must be able to look at a statement in a Wikipedia article and find out where it comes from. http://quominus.org/archives/981
Lessons for taxonomy & dark taxa… Taxonomic statements should be verifiable. Literature is the evidence base for taxonomy. Literature should be the evidence base for dark taxa. http://quominus.org/archives/981
Example templates & dissemination. BIODIVERSITY MANUSCRIPT: occurrence data; “dark” taxon data; image galleries; morphometric data; environmental sequence data; genome descriptions; any other data. XML MARK UP: structured text (data!). Outputs and destinations: articles; occurrence data; taxon names; taxon treatments; Plazi; BHL; Wiki; COL; bibliographies
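The idea that marked-up text is itself data can be shown with a small sketch: once a treatment is stored as XML, occurrence rows and taxon names can be pulled out mechanically. The element and attribute names below (`treatment`, `taxon-name`, `occurrence`, `country`, `date`) are invented for this example and are not the journal's actual schema; the content is placeholder data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, for illustration only.
ARTICLE_XML = """
<article>
  <treatment>
    <taxon-name>Aus bus</taxon-name>
    <occurrence country="Bulgaria" date="2012-06-01"/>
    <occurrence country="UK" date="2012-07-15"/>
  </treatment>
  <treatment>
    <taxon-name>Aus sp. 1</taxon-name>
    <occurrence country="UK" date="2012-08-02"/>
  </treatment>
</article>
"""


def extract_occurrences(xml_text):
    """Flatten each treatment's occurrences into one row per record."""
    root = ET.fromstring(xml_text)
    rows = []
    for treatment in root.iter("treatment"):
        name = treatment.findtext("taxon-name", default="").strip()
        for occurrence in treatment.iter("occurrence"):
            rows.append(
                {
                    "scientificName": name,
                    "country": occurrence.get("country"),
                    "eventDate": occurrence.get("date"),
                }
            )
    return rows


occurrences = extract_occurrences(ARTICLE_XML)
taxon_names = sorted({row["scientificName"] for row in occurrences})
```

The same parsed tree could feed several destinations at once, for example a names list for one aggregator and an occurrence table for another, which is the point of marking up the manuscript rather than publishing flat text.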
Example template & data fields
Workflow describing “Dark Taxa”: dark taxon sequenced > automated submission to Pensoft Writing Tool (metadata: voucher specimen, images, locality, etc.) > PWT, collaborative article authoring tool > manuscript finalisation & submission > BDJ peer review > manuscript published > automated update of bibliographic metadata, taxon name, Zoobank record, etc.
Data published: descriptions; images; occurrences; nomenclature; literature. Plazi
“Dark Taxon” papers should contain: the scope of the taxonomic, ecological & geographic coverage; the sources of voucher specimens; the sampling & lab. protocols used; the process used to ID taxa to which vouchers belong. Possible data fields include: average no. of records per taxon; range of records per taxon (min-max); average, min. and max. sequence length; range of intraspecific variation; median variation within taxon X%; range of divergence to closest known taxon pairs (min & max?); median divergence between closest taxon pair
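Some of the proposed data fields are simple summary statistics that could be generated automatically from the sequence records. The following is a minimal sketch covering only the record-count and sequence-length fields; the function name `summarise`, the field names in the returned dictionary and the toy sequences are assumptions made for illustration. The variation and divergence fields would need aligned sequences and a chosen distance measure, so they are left out.

```python
from collections import defaultdict
from statistics import mean


def summarise(records):
    """Summarise (taxon_label, sequence) pairs; expects at least one record."""
    by_taxon = defaultdict(list)
    for taxon, sequence in records:
        by_taxon[taxon].append(sequence)

    # Number of records held for each taxon label.
    counts = [len(sequences) for sequences in by_taxon.values()]
    # Length of every individual sequence, across all taxa.
    lengths = [len(seq) for sequences in by_taxon.values() for seq in sequences]

    return {
        "taxa": len(by_taxon),
        "mean_records_per_taxon": mean(counts),
        "min_records_per_taxon": min(counts),
        "max_records_per_taxon": max(counts),
        "mean_sequence_length": mean(lengths),
        "min_sequence_length": min(lengths),
        "max_sequence_length": max(lengths),
    }


# Toy data, invented for illustration.
EXAMPLE_RECORDS = [
    ("Aus sp. 1", "ACGTACGT"),
    ("Aus sp. 1", "ACGTACGA"),
    ("Aus sp. 1", "ACGTAC"),
    ("Aus sp. 2", "ACGTACGTAC"),
]

summary = summarise(EXAMPLE_RECORDS)
```

Generating these fields from the data, rather than asking authors to type them, would keep the published metadata consistent with the underlying records.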
Possible discussion points… The concept: Is it a good approach to incentivize data publishing & good metadata practices? The suitability for “Dark Taxa”, new genomes and env. sequence data. Is this more suitable for some data papers (e.g. dark taxa) than others? The practicalities: the fit to existing systems (both for data collection and dissemination); the data fields (“Dark Taxa”, new genomes and env. sequence data); next steps in developing this concept
Acknowledgements. Scratchpad technical development: Simon Rycroft, Ben Scott, Ed Baker, Alice Heaton, Katherine Boulton. Scratchpad outreach: Irina Brake, Laurence Livermore, Dimitris Koureas. E-Monocot: Paul Wilkin & the Kew team, Charles Godfray & the Oxford team. ViBRANT: Dave Roberts, Lucy Reeve & many, many more. Pensoft: Lyubomir Penev, Teodor Georgiev & colleagues. Our 7,000+ users
Why we need new methods of publishing… Primary data; publishing and sharing of primary data; RE-USE of CONTENT. Drawings: Slavena Peneva