iCollections, Mass Digitisation of British & Irish Lepidoptera Adrian Hine, Natural History Museum, London
iCollections Background: iCollections began in March 2013 for 3 years, using 8 full-time digitisers plus existing staff. It digitises the British Lepidoptera (butterflies & moths), ca. ½ million specimens (5000 drawers). It is a pilot project for mass digitisation of pinned insects. The main aim of digitisation is to capture the label data, not the specimen image per se. It provides a workflow for the Digital Collections Programme (DCP), a Digital Museum.
Digitisation Benefits: Three top-level themes: research, collections, public engagement. We have to choose carefully to maximise a limited budget. British Lepidoptera ticks all these boxes!
Research: A large, powerful dataset (50% usable), temporal & spatial. Climate change, distributional changes, migration, morphometrics. Occurrence records go to the National Biodiversity Network.
Public Engagement: Lepidoptera are a charismatic group with a lot of public interest. They help us explain our science: Science Uncovered, Nature Live, TV, radio.
Data Workflow: Data quality is at the heart of the digitisation process. We wish to control the quality of data going into EMu. We did not want to push large quantities of unqualified data into EMu only to have to deal with it at a later stage. A consistent, systematic approach to data capture. Every stage of the digitisation process followed written protocols. Each specimen is given a unique specimen number (Data Matrix barcode & human-readable).
Data Workflow: Opted for data capture outside EMu: poor-quality data in EMu makes databasing directly into EMu difficult (sites, taxonomy, parties); build a highly streamlined data entry interface for the transcription phase; build harmonisation tools to control data going into EMu (reduce duplication). Developing an RDA for the future. The biggest challenge is harmonisation with existing data within EMu (taxonomy, sites, parties, specimens).
Digitisation Workflow (diagram). Stages: Transcription; Taxonomy Harmonisation; Import into EMu; Georeferencing; Imaging; Specimen Preparation. Roles: Digitiser; Taxonomist; Georeferencer; Data Manager.
Ingestion into Transcription Database: A script uses the application Barcodefiler to search each image for a barcode. If one is found, the script renames the image file with the specimen number. It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields: specimen number (from barcode), drawer number (from folder name), taxon name (from folder name). Using ImageMagick libraries, it creates a cropped label derivative image.
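The ingestion step can be sketched roughly as follows. This is a minimal illustration, not the project's script: the barcode reader is passed in as a placeholder function (the Barcodefiler interface is not described here), the folder naming convention `<drawer>_<taxon>` is an assumption, the table and column names are hypothetical, and SQLite stands in for whatever SQL backend was used. The cropped label derivative is not covered.

```python
import sqlite3
from pathlib import Path
from typing import Callable, Optional, Tuple


def create_stub_table(db: sqlite3.Connection) -> None:
    # Hypothetical schema holding only the three core fields from the slide.
    db.execute(
        "CREATE TABLE IF NOT EXISTS stub ("
        "specimen_number TEXT PRIMARY KEY, drawer_number TEXT, taxon_name TEXT)"
    )


def parse_folder_name(folder: str) -> Tuple[str, str]:
    # Assumed convention: "<drawer number>_<taxon name with underscores>".
    drawer, _, taxon = folder.partition("_")
    return drawer, taxon.replace("_", " ")


def ingest_image(
    image: Path,
    decode_barcode: Callable[[Path], Optional[str]],
    db: sqlite3.Connection,
) -> Optional[Path]:
    """Rename one image after its barcode and create a stub record for it."""
    specimen_number = decode_barcode(image)
    if specimen_number is None:
        return None  # barcode no-read: leave the file for manual handling
    drawer, taxon = parse_folder_name(image.parent.name)
    renamed = image.with_name(specimen_number + image.suffix)
    image.rename(renamed)
    db.execute(
        "INSERT INTO stub (specimen_number, drawer_number, taxon_name) "
        "VALUES (?, ?, ?)",
        (specimen_number, drawer, taxon),
    )
    db.commit()
    return renamed
```

Returning `None` on a no-read keeps unreadable images out of the database, so they can be handled separately.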
Data Harmonisation: The biggest challenge is how to harmonise data with existing EMu data. We wish to use appropriate records where they exist in EMu and not create additional duplicates. Data concepts we wish to harmonise with EMu records: taxonomy (determination), parties (collectors), locations (drawers). Data concepts to create as new: sites.
Taxonomy Harmonisation: EMu taxonomy is still a mess! For UK butterflies there are thousands of names: duplicates, erroneous names, different combinations. We did not have the time to clean the taxonomy for UK Lepidoptera, so we have to live with the mess. Taxonomic expertise is needed to match each iCollections name to the correct concept in EMu. Digitisers also make typos and errors when entering names. We can't rely on the EMu import algorithms because matching taxon names is too complex; human intervention is needed. We built a mapping tool to map each taxon name to an existing EMu name.
Taxonomy Harmonisation Tool
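As a rough illustration of how a mapping tool might assist the taxonomist, the sketch below proposes candidate EMu names for a transcribed name and leaves the final choice to a person. It is an assumption, not a description of the actual tool: the function name, the similarity cutoff, and the use of Python's `difflib` for fuzzy matching are all illustrative.

```python
import difflib
from typing import List


def suggest_emu_names(entered: str, emu_names: List[str], n: int = 3) -> List[str]:
    """Return up to n existing EMu names that resemble the transcribed name.

    The suggestions are only candidates; a taxonomist still confirms the match.
    """
    # Compare case-insensitively but return names in their original form.
    by_lower = {name.lower(): name for name in emu_names}
    hits = difflib.get_close_matches(
        entered.strip().lower(), list(by_lower), n=n, cutoff=0.8
    )
    return [by_lower[h] for h in hits]


# Usage with made-up catalogue content: a transposition typo still finds
# its intended name as the top candidate.
candidates = suggest_emu_names(
    "Peiris brassicae", ["Pieris brassicae", "Pieris rapae", "Vanessa atalanta"]
)
```

Fuzzy matching of this kind only narrows the search. It cannot decide between duplicates or different combinations of the same name, which is why the slide stresses human intervention.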
Sites Harmonisation: Messy data makes databasing directly difficult. Sites has poor-quality data. Very few records are usable, with very poor consistency in how data have been captured (diverse data sources). Mapping site variants to a site master record. Example variants: "Box Hill" | "Box Hill; Surrey" | "Box Hill; Kent" | "Box Hill; Surrey; UK" | "Box Hill; near Dorking" | "N, W Box Hill, Dorking". Out of 181,000 specimens, just 9,681 unique site variants.
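The idea of mapping site variants to a master record can be sketched as below. This is a minimal illustration under assumptions: the function names and the master-record keys are hypothetical, and the variant-to-master mapping is taken to be curated by a person rather than derived automatically, since similar strings may refer to different places.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def variants_by_frequency(site_strings: Iterable[str]) -> List[Tuple[str, int]]:
    """Collapse per-specimen site strings into unique variants, most common first."""
    return Counter(s.strip() for s in site_strings).most_common()


def master_for(variant: str, variant_to_master: Dict[str, str]) -> Optional[str]:
    """Look up the curated master record for a variant; None means not yet mapped."""
    return variant_to_master.get(variant.strip())


# Usage with made-up data: four specimens, three unique variants.
transcribed = ["Box Hill", "Box Hill; Surrey", "Box Hill", "Box Hill; Kent"]
variants = variants_by_frequency(transcribed)

# Hypothetical curated mapping; "SITE-1" is an illustrative master key.
curated = {"Box Hill": "SITE-1", "Box Hill; Surrey": "SITE-1"}
```

Working on unique variants rather than individual specimens is what makes the task tractable: each variant is georeferenced once and the result applies to every specimen that carries it.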
Sites Harmonisation & Georeferencing
Import into EMu: Import is a phased approach: 1) Images. KE have built a back-end script to ingest multimedia server-side; it reports out a CSV with the EMu irn & file name identifier. 2) Specimen record (taxonomy, drawer location & multimedia). 3) Georeferenced collection event data.
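One way the CSV from phase 1 could feed phase 2 is sketched below: the irn reported for each image is attached to the matching specimen record by file name. The column names (`irn`, `identifier`) and record fields (`image_file`, `multimedia_irn`) are assumptions for illustration; the real report layout is not given here.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_irn_report(csv_text: str) -> Dict[str, str]:
    """Map file name identifier -> EMu irn (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["identifier"]: row["irn"] for row in reader}


def attach_multimedia(
    specimens: List[Dict[str, str]], irn_by_file: Dict[str, str]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split specimen records into those linked to an image irn and those unmatched."""
    linked, unmatched = [], []
    for record in specimens:
        irn = irn_by_file.get(record["image_file"])
        if irn is None:
            unmatched.append(record)
        else:
            linked.append({**record, "multimedia_irn": irn})
    return linked, unmatched


# Usage with made-up values.
report = "irn,identifier\n1001,TEST000123.jpg\n"
records = [
    {"specimen_number": "TEST000123", "image_file": "TEST000123.jpg"},
    {"specimen_number": "TEST000124", "image_file": "TEST000124.jpg"},
]
linked, unmatched = attach_multimedia(records, read_irn_report(report))
```

Keeping the unmatched records separate makes it easy to see which specimens cannot yet be imported with their multimedia link.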
Issues: Barcode no-reads or misreads. Printing quality of barcodes. Multiple specimens on one pin. Conflicting data. Data difficult to interpret. Specimens with old-style specimen number labels (non-barcode). Specimen records that already exist in EMu.
iCollections Team: The success is due to the project having a strong team ethic, pulling together museum staff from a wide variety of different disciplines. Gordon Paterson (chair); Victoria Carter (project manager); Darrell Siebert (quality assurance); Peter Wing (digitiser); Elisa Cane (digitiser); Flavia Toloni (digitiser); Jo Durant (digitiser); Lyndsey Douglas (digitiser); Sara Albuquerque (digitiser); Jasmin Perera (digitiser); Sophie Ledger (digitiser); Gerrardo Mazzetta (digitiser); Geoff Martin (collections management); Martin Honey (collections management); Blanca Huertas (collections management); Theresa Howard (collections management); Steve Brooks (research); Angela Self (research); Ian Kitching (research); Malcolm Penn (georeferencing); Liz Duffell (georeferencing); Caitlin McLaughlin (georeferencing); Mike Sadka (database & interface designer); Adrian Hine (data workflow); Chris Sleep (database); Vladimir Blagoderov (image workflow); Steve Cafferty (image workflow).