Adrian Hine, Natural History Museum, London

Presentation transcript:

iCollections, Mass Digitisation of British & Irish Lepidoptera. Adrian Hine, Natural History Museum, London

iCollections Background. iCollections began in March 2013, running for 3 years, using 8 full-time digitisers plus existing staff. Digitise the British Lepidoptera (butterflies and moths), ca. ½ million specimens (5000 drawers). Pilot project for mass digitisation of pinned insects. The main aim of digitisation is to capture the label data, not the specimen image per se. Workflow for the Digital Collections Programme (DCP), a Digital Museum: the DCP is shifting from uncoordinated digitisation projects to a planned programme, working toward a digital museum. Prototype mass digitisation workflows for pinned insects (one of the most challenging collection types); the NHM hasn’t engaged in any kind of mass digitisation outside the Botany department. ‘Digitise’ means: image, transcribe core data, interpret the sites/parties and georeference. High end of digitisation: high-quality data suitable for researchers. Phase 1: butterflies. Phase 2: macromoths. Phase 3: micromoths.

Digitisation Benefits. Three top-level themes: research, collections, public engagement. We have to choose carefully to maximise a limited budget. British Lepidoptera ticks all these boxes! Why digitise? Digitise for a purpose. We have limited funds, so we have to target them carefully to maximise the benefits. Amateur entomologists: there is a big amateur lepidopterist community (the twitchers of the entomology world), interested in looking at former distributions. UK Lepidoptera collection challenge: dry pinned material with data labels on the pins. Extremely time-consuming: working out efficient workflows and designing an infrastructure to implement them.

Research. Large, powerful dataset (50% usable), temporal and spatial. Climate change, distributional changes, migration, morphometrics. Occurrence records go to the National Biodiversity Network. NHM climate change research group. Suited to climate change studies such as phenology (responsive to climate change; dates of first occurrence can be extrapolated). Studies so far on a limited dataset show that for every 1 degree centigrade that spring is warmer, butterfly emergence is brought forward by 8 days. Post-1976 the rate of change is less: 2-3 days per 1 degree change. Ecologists and conservationists are looking at distribution changes.

Better Collections. Better curation, preservation and access. Will be interesting to see if there is a different pattern for macromoths and micromoths.

Public Engagement. Lepidoptera are a charismatic group with a lot of public interest. Explain our science: Science Uncovered, Nature Live, TV, radio.

Data Workflow. Data quality is at the heart of the digitisation process. We wish to control the quality of data going into EMu. We didn’t want to simply push large quantities of unqualified data into EMu and have to deal with it at a later stage. Consistent, systematic approach to data capture. Every stage of the digitisation process followed written protocols. Each specimen is given a unique specimen number (Data Matrix barcode and human-readable).

Data Workflow. Opted for data capture outside EMu: poor-quality data in EMu makes databasing directly into EMu difficult (sites, taxonomy, parties). Build a highly streamlined data entry interface for the transcription phase. Build harmonisation tools to control data going into EMu (reduce duplication). Developing an RDA for the future. The biggest challenge is harmonisation with existing data within EMu (taxonomy, sites, parties, specimens). Sites data: although there is lots of data inside EMu, it is generally very poor and not very usable. We would be spending all our time resolving the messy data inside, which would be an impediment to the digitisation project. Taxonomy data: likewise, we don’t want to get bogged down.

Digitisation Workflow. Steps: Specimen Preparation, Imaging, Transcription, Taxonomy Harmonisation, Georeferencing, Import into EMu. Roles: Digitiser, Taxonomist, Georeferencer, Data Manager. For the scale required it has to be a highly efficient production line, with steps optimised and independent of one another (so one step doesn’t act as a bottleneck). The digitisation workflow can be partitioned into a number of distinct steps. Treating them as discrete processes enables each task to be optimised by providing targeted tools and the appropriate personnel. Digitisers can’t be expected to make specialist interpretations in geography or taxonomy without a lot of training. Imaging preparation: focus on imaging, not on capturing basic metadata, which interrupts the work; however, a few basic pieces of data must be captured. Record ingestion: automated via script. Raw data capture: focus on speed and consistency of data capture; streamline the interface so data entry can be extremely rapid. Data validation: largely dealing with taxonomic names and collecting localities, ‘turning strings into things’. This step is often underappreciated, with insufficient resources allocated to generate good-quality content. Turn a simple string into a meaningful data concept, which may be new or may exist already. Biggest challenge to the project. Georeferencing. Import into EMu.

Specimen Preparation. Work in teams of two. Person 1: preparation and reassembly; person 2: imaging. Original drawers: organisation of specimens in old drawers. The determination doesn’t exist individually on the specimens; rather, there is a label separator in the drawers between batches of specimens that carries the determination for all the specimens in the batch. Each specimen is moved to a unit tray; all the labels are removed and placed on a stage adjacent to the specimen. Delicate operation: old specimens sitting in old cork drawers. If legs/antennae/abdomen fall off, they are placed in a gelatine capsule. The majority of specimens don’t have unique identifiers associated with them, so a unique identifier, a specimen number, is added as a Data Matrix

Imaging. A single image of the upperside of the specimen together with its labels is taken using a DSLR camera with a macro lens on an imaging station. The image is saved with a default file name, automatically formatted as a prefix of the digitiser’s name plus a running number. The image is saved in a folder structure that enables two core bits of metadata to be captured at imaging time: the top-level folder is the new drawer number and the subfolder is the taxon name (the filed-as name in the collection). At the point of capture three important pieces of data must be captured: specimen number, drawer number, taxon name. Reassembly of labels and placement into the new drawer.

Ingestion into Transcription Database. A script has been developed by our IT specialist Chris Sleep to try to automate some of the slow manual tasks. The script takes the image file and its location and generates a stub database record from them. It uses the application Barcodefiler to search the image for a barcode; if one is found, the script renames the image file with the specimen number. It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields: specimen number (from barcode), drawer number (from folder name) and taxon name (from folder name). Using ImageMagick libraries it creates a cropped label derivative image: a magnified crop of the labels for the digitisers to read, cropped from fixed coordinates. The prime reason for the label derivative was to improve the efficiency of the rapid data capture interface. These derivatives are to be imported into EMu as a distinct digital asset.
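The stub-record step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the folder layout `<drawer number>/<taxon name>/<image file>`, the function name `stub_record` and the sample identifiers are all assumptions, and the barcode is taken as already decoded by a separate tool.

```python
from pathlib import PurePosixPath
from typing import Optional


def stub_record(image_path: str, specimen_number: Optional[str]) -> Optional[dict]:
    """Build a stub record from the image's position in the folder tree.

    Assumed (hypothetical) layout: <drawer number>/<taxon name>/<image file>.
    specimen_number is whatever the barcode reader returned, or None on a no-read.
    """
    if specimen_number is None:
        # Barcode no-read: leave the image for manual handling.
        return None
    path = PurePosixPath(image_path)
    taxon_name = path.parent.name            # subfolder = filed-as taxon name
    drawer_number = path.parent.parent.name  # top-level folder = new drawer number
    # Rename the image after the specimen number, keeping the extension.
    new_name = specimen_number + path.suffix.lower()
    return {
        "specimen_number": specimen_number,
        "drawer_number": drawer_number,
        "taxon_name": taxon_name,
        "image_file": str(path.with_name(new_name)),
    }


# Hypothetical example values, for illustration only.
record = stub_record("D0123/Pieris napi/JSmith_0042.JPG", "SPEC000001")
```

Returning `None` on a no-read keeps unreadable barcodes out of the database so they can be handled by hand, which matches the barcode issues listed later in the talk.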

Transcription. (Perhaps show a demo.) The ingestion process pulls in the images (full and label derivative), specimen number, taxon name and drawer number. Transcription is done by the digitisers and focuses on the core label data. There may be all kinds of additional extraneous data on the labels (sale, unknown numbers, collector notations), but it’s often hard to interpret and to codify. We started capturing it but decided it was too time-consuming for the benefits (just flag the tick box instead). Core fields: collecting site; collection date; collector; registration number and detail; preparation details; type status. Lookups are used to control and speed up data entry. Collecting locality, collector and registration data are ‘harmonised’ or ‘normalised’. Sites: we wish to harmonise/normalise site data, and find this most easily achieved before ingest into EMu. Digitisers capture the raw variants, because interpretation is a specialism: interpreting at this stage slows down digitisation, there would be poor consistency if many people were making these interpretations, and it would take a lot of training as well. Box Hill as an example: many variants of the same site concept. Reconciliation to a master record is done in the next phase of the workflow, giving a single site for import into EMu that is georeferenced by a specialist. Unique strings that occur on labels: so far 7,500 unique strings for 97,000 specimen records. Taxonomy: pulled from the names of the folders the images are placed into. Does contain variants (not always consistently entered), typos and mistakes (many of these names do not occur in standard checklists). Collectors: entered ‘verbatim’ but atomised into title, first, middle and last names. Registration data.
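As an illustration of atomising a verbatim collector string into title, first, middle and last names, here is a deliberately naive sketch. The `TITLES` set, the function name and the "title first middle last" word order are assumptions; real labels (for example "Surname, Initials") would need more rules or manual entry.

```python
# Hypothetical set of titles; a real list would be agreed by the project.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir", "prof"}


def atomise_collector(verbatim: str) -> dict:
    """Split a collector string into title / first / middle / last.

    Assumes the order 'title first middle last'; anything else is
    left for a person to correct.
    """
    tokens = verbatim.split()
    title = ""
    if tokens and tokens[0].rstrip(".").lower() in TITLES:
        title = tokens.pop(0)
    last = tokens.pop() if tokens else ""
    first = tokens.pop(0) if tokens else ""
    middle = " ".join(tokens)  # whatever remains between first and last
    return {"title": title, "first": first, "middle": middle, "last": last}


# Hypothetical collector name, for illustration only.
parts = atomise_collector("Rev. A. B. Example")
```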

Data Harmonisation. The biggest challenge is how to harmonise data with existing EMu data. We wish to use appropriate records where they exist in EMu and not create additional duplicates. Data concepts we wish to harmonise with EMu records: Taxonomy (determination), Parties (collectors), Locations (drawers). Data concepts to create as new: Sites. The scale of the mess makes data management extremely challenging. Taxonomy (determination): difficult to reconcile in an automated way (i.e. by import algorithm). Parties (collectors): lots of duplicates and some messy data, but reconciling them with existing EMu records is very doable. Locations (drawer numbers): imported a fresh set of museum drawer location records for this project; straightforward to match on number. Sites (collecting site): create a new set of sites. The existing records are too messy: no consistency in how they were generated, incorrect, duplicated, no georeferences. Faced with the choice: do we attempt to clean up what is there, or start generating a good, clean set of site records?

Taxonomy Harmonisation. EMu Taxonomy is still a mess! For UK butterflies there are 1000s of names: duplicates, erroneous names, different combinations. We did not have the time to clean Taxonomy for UK Lepidoptera, so we have to live with the mess! Taxonomic expertise is needed to validate each iCollections name against the correct concept in EMu. Digitisers introduce typos and errors when entering names. We can’t rely on the EMu import algorithms, as matching taxon names is too complex; human intervention is needed. Built a mapping tool to map each taxon name to an existing EMu name. Although there is a limited number of UK butterfly species (ca. 60 resident species plus migrants), aberrations make it a lot more complicated (1500 names and growing). Many of these aberrational names are invalid MS names found only in the collection. Nevertheless, there is interest in the lepidopterist community in knowing what aberrations we hold. Ideally we would have resolved the mess before attempting this project, but in reality we don’t have the luxury of doing this. Live with the mess!!
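Since the talk stresses that name matching needs human intervention, a mapping tool can at most propose candidates for a taxonomist to confirm. The sketch below shows one way to do that with fuzzy string matching from Python's standard `difflib`; the choice of `difflib`, the function name and the sample names are assumptions, not a description of the tool the project built.

```python
import difflib


def suggest_emu_names(entered_name: str, emu_names: list, limit: int = 3) -> list:
    """Return up to `limit` existing names that resemble the entered name.

    The result is only a shortlist: a taxonomist still chooses the
    correct concept (or rejects every candidate).
    """
    return difflib.get_close_matches(entered_name, emu_names, n=limit, cutoff=0.6)


# Hypothetical extract of existing names, for illustration only.
existing = ["Pieris napi", "Pieris rapae", "Vanessa atalanta"]
candidates = suggest_emu_names("Pieris napii", existing)
```

Confirmed pairs of entered name and chosen record could then be stored so that the same typo or variant never has to be reviewed twice.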

Taxonomy Harmonisation Tool

Sites Harmonisation. Messy data makes databasing directly difficult. Sites has poor-quality data: very few records are usable, with very poor consistency in how data have been captured (diverse data sources). Mapping site variants to a site master record. Box Hill example: "Box Hill", "Box Hill; Surrey", "Box Hill; Kent", "Box Hill; Surrey; UK;", "Box Hill; near Dorking", "51.254 N, -0.308 W", "Box Hill, Dorking". Out of 181,000 specimens, just 9,681 unique site variants. For Sites it is an opportunity to create some good-quality content from scratch, based on a consistent method of interpretation according to an agreed protocol.
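The idea of mapping many raw site variants to one master record can be sketched as a lookup keyed on a lightly normalised string, with anything unmatched queued for a specialist. The normalisation rules, function names and the master identifier `SITE-0001` are assumptions for illustration; the variant-to-master table itself would be built by the people interpreting the sites.

```python
import re


def variant_key(raw: str) -> str:
    """Normalise a raw label string: collapse whitespace, drop stray
    leading/trailing separators, ignore case."""
    return re.sub(r"\s+", " ", raw).strip(" ;,").casefold()


def reconcile(raw_strings, variant_to_master):
    """Split raw strings into those with a known master site and the rest."""
    matched, unmatched = {}, []
    for raw in raw_strings:
        master = variant_to_master.get(variant_key(raw))
        if master is None:
            unmatched.append(raw)  # needs specialist interpretation
        else:
            matched[raw] = master
    return matched, unmatched


# Hypothetical mapping table, curated by a specialist.
mapping = {
    variant_key("Box Hill; Surrey"): "SITE-0001",
    variant_key("Box Hill, Dorking"): "SITE-0001",
}
matched, unmatched = reconcile(
    ["Box Hill;  Surrey;", "box hill, dorking", "Box Hill; Kent"], mapping
)
```

Working on unique strings rather than specimen records is what keeps the task tractable: each variant is interpreted once, however many specimens carry it.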

Sites Harmonisation & Georeferencing

Sites Georeferencing

Import into EMu. Import is a phased approach: (1) Images. KE have built a backend script to ingest multimedia server-side; it reports out a CSV with the EMu irn and file name identifier. (2) Specimen record (taxonomy, drawer location and multimedia). (3) Georeferenced collection event data. Much quicker than importing through the client. Uses the batch operations module. Automated generation of CSV import files.
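Automated generation of a specimen import file could look something like the sketch below, which joins specimen records to the multimedia irn reported back from the image ingest. The column headings are placeholders rather than real EMu column names, and the function and sample values are hypothetical.

```python
import csv
import io

# Placeholder column headings; a real import would use the column
# names expected by the target EMu module.
FIELDS = ["specimen_number", "drawer_number", "taxon_irn", "multimedia_irn"]


def build_import_csv(specimens, multimedia_irn_by_file) -> str:
    """Write one CSV row per specimen, attaching the multimedia irn
    looked up by image file name (blank if the image was not ingested)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for spec in specimens:
        writer.writerow({
            "specimen_number": spec["specimen_number"],
            "drawer_number": spec["drawer_number"],
            "taxon_irn": spec["taxon_irn"],
            "multimedia_irn": multimedia_irn_by_file.get(spec["image_file"], ""),
        })
    return out.getvalue()


# Hypothetical values, for illustration only.
csv_text = build_import_csv(
    [{"specimen_number": "S1", "drawer_number": "D1", "taxon_irn": 101, "image_file": "S1.jpg"}],
    {"S1.jpg": 9001},
)
```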

Issues. Barcode no-reads or misreads. Printing quality of barcodes. Multiple specimens on one pin. Conflicting data. Data difficult to interpret. Specimens with old-style specimen number labels (non-barcode). Specimen records that already exist in EMu.

Digitisation Progress. Preparation: 1.15 minutes. Imaging: 1.05 minutes. Transcription: 0.59 minutes. Total: 2.80 minutes. This doesn’t include the validation (Sites/Taxonomy), georeferencing and import; however, these are not going to be done on a specimen-by-specimen basis. These figures are for the relatively easy butterflies, which are large. It won’t be so straightforward for micromoths! Moving on to the moths next. Digitisers are involved with other projects; they are not dedicated to the iCollections digitisation.
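A rough sense of scale can be had by multiplying these timings out, assuming they are minutes per specimen (the slide does not say so explicitly) and taking the ca. ½ million specimens from the project background. Note that the three step times as given sum to 2.79, so the quoted total of 2.80 presumably reflects rounding. The function names below are illustrative.

```python
# Step timings from the slide, assumed to be minutes per specimen.
STEP_MINUTES = {"preparation": 1.15, "imaging": 1.05, "transcription": 0.59}


def minutes_per_specimen(steps: dict) -> float:
    """Sum the per-step timings."""
    return sum(steps.values())


def person_hours(specimens: int, minutes_each: float) -> float:
    """Total handling time in hours, excluding validation,
    georeferencing and import."""
    return specimens * minutes_each / 60


summed = minutes_per_specimen(STEP_MINUTES)  # about 2.79
hours = person_hours(500_000, 2.80)          # roughly 23,300 person-hours
```

This covers only the three per-specimen steps and applies the butterfly timings to the whole collection, so it is a lower bound rather than an estimate for the moths.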

iCollections Team The success is due to the project having a strong team ethic, pulling together museum staff from a wide variety of different disciplines. Gordon Paterson chair Victoria Carter project manager Darrell Siebert quality assurance Peter Wing digitiser Elisa Cane Flavia Toloni Jo Durant Lyndsey Douglas Sara Albuquerque Jasmin Perera Sophie Ledger Gerrardo Mazzetta Geoff Martin collections management Martin Honey Blanca Huertas Theresa Howard Steve Brooks research Angela Self Ian Kitching Malcolm Penn georeferencing Liz Duffell Caitlin McLaughlin Mike Sadka database & interface designer Adrian Hine data workflow Chris Sleep database Vladimir Blagoderov image workflow Steve Cafferty

Questions?