Presentation on theme: "Adrian Hine, Natural History Museum, London"— Presentation transcript:
1 iCollections: Mass Digitisation of British & Irish Lepidoptera
Adrian Hine, Natural History Museum, London
2 iCollections Background
iCollections began in March 2013 and runs for 3 years, using 8 full-time digitisers plus existing staff.
Aim: digitise the British Lepidoptera (butterflies & moths), ca. ½ million specimens (5,000 drawers).
Pilot project for mass digitisation of pinned insects.
The main aim of digitisation is to capture the label data, not the specimen image per se.
Workflow for the Digital Collections Programme (DCP), a Digital Museum.
The DCP is shifting the museum from uncoordinated digitisation projects to a planned programme, working toward a digital museum.
Prototype mass digitisation workflows for pinned insects (one of the most challenging collection types): the NHM hasn’t engaged in any kind of mass digitisation outside the Botany department.
‘Digitise’ means: image, transcribe core data, interpret the sites/parties & georeference. This is the high end of digitisation: high-quality data suitable for researchers.
Phase 1: butterflies. Phase 2: macromoths. Phase 3: micromoths.
3 Digitisation Benefits
Three top-level themes: Research; Collections; Public engagement.
Have to choose carefully to maximise a limited budget. British Lepidoptera ticks all these boxes!
Why digitise? Digitise for a purpose. We have limited funds, so we have to target these carefully to ensure we maximise the benefits.
Amateur entomologists: there is a big amateur lepidopterist community (the twitchers of the entomology world), interested in looking at former distributions.
UK Lepidoptera Collection.
Challenge: dry pinned material with data labels on the pins. Working out efficient workflows and designing an infrastructure to implement them is extremely time consuming.
4 Research
Large, powerful dataset (50% usable), temporal & spatial.
Climate change, distributional changes, migration, morphometrics.
Occurrence records to the National Biodiversity Network.
NHM climate change research group.
Suited to climate change studies, in particular phenology studies (responsive to climate change: dates of first occurrence can be extrapolated).
Studies so far on a limited dataset show that for every 1 degree centigrade that spring is warmer, butterfly emergence is brought forward by 8 days.
Post 1976 the rate of change is less: 2-3 days per 1 degree change.
Ecologists & conservationists looking at distribution changes.
5 Better Collections
Better curation & preservation, access.
It will be interesting to see if there is a different pattern for macromoths and micromoths.
6 Public Engagement
Lepidoptera are a charismatic group, with a lot of public interest.
Explain our science: Science Uncovered, Nature Live, TV, radio.
7 Data Workflow
Data quality is at the heart of the digitisation process. We wish to control the quality of data going into EMu.
We didn’t want to simply push large quantities of unqualified data into EMu, only to have to deal with it at a later stage.
Consistent, systematic approach to data capture.
Every stage of the digitisation process follows written protocols.
Each specimen is given a unique specimen number (Data Matrix barcode & human readable).
8 Data Workflow
Opted for data capture outside EMu: poor-quality data in EMu makes databasing directly into EMu difficult (sites, taxonomy, parties).
Build a highly streamlined data entry interface for the transcription phase.
Build harmonisation tools to control data going into EMu (reduce duplication).
Developing a RDA for the future.
Biggest challenge is harmonisation with existing data within EMu (taxonomy, sites, parties, specimens).
Sites data: although there is lots of data inside EMu, it is generally very poor and not very usable. We would be spending all our time resolving the messy data inside, which would be an impediment to the digitisation project.
Taxonomy data: likewise, we don’t want to get bogged down
9 Digitisation Workflow
Workflow steps: Specimen Preparation, Imaging, Transcription, Taxonomy Harmonisation, Georeferencing, Import into EMu.
Roles: Digitiser, Taxonomist, Georeferencer, Data Manager.
For the scale required it has to be a highly efficient production line.
The steps are optimised and independent of one another, so that no one step acts as a bottleneck.
The digitisation workflow can be partitioned into a number of distinct steps. Treating them as discrete processes enables each task to be optimised by providing targeted tools and the appropriate personnel for that specific task.
Digitisers can’t be expected to make specialist interpretations in geography or taxonomy without a lot of training.
Imaging preparation: focus on imaging, not on metadata capture that interrupts it. However, a few basic pieces of data must be captured.
Record ingestion: automated via script.
Raw data capture: focus on speed & consistency of data capture. Streamline the interface so data entry can be extremely rapid.
Data validation: largely dealing with taxonomic names & collecting localities, ‘turning strings into things’. This step is often underappreciated, with insufficient resources allocated to generate good-quality content. It turns a simple string into a meaningful data concept, which may be new or may exist already. Biggest challenge to the project.
Georeferencing:
Import into EMu:
10 Specimen Preparation
Work in teams of TWO. Person 1: preparation & reassembly. Person 2: imaging.
Original drawers: organisation of specimens in the old drawers. The determination doesn’t exist individually on the specimens; rather, there is a label separator in the drawers between batches of specimens that carries the determination for all the specimens in the batch.
Each specimen is moved to a unit tray, and all the labels are removed and placed on a stage adjacent to the specimen.
Delicate operation: old specimens sitting in old cork drawers. If legs/antennae/abdomen fall off, they are placed in a gelatine capsule.
The majority of specimens don’t have unique identifiers associated with them, so a unique identifier, a specimen number, is added as a Data Matrix
11 Imaging
A single image of the upperside of the specimen together with its labels is taken using a DSLR camera with a macro lens on an imaging station.
A) The image is taken with a default file name, automatically formatted with a prefix of the digitiser’s name plus a running number.
B) The image is saved in a folder structure that enables two core bits of metadata to be captured at imaging time: the top-level folder is the new drawer number, and the subfolder is the taxon name (the filed-as name in the collection).
At the point of capture three important pieces of data must be captured: specimen number; drawer number; taxon name.
Reassembly of labels and placement into the new drawer.
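The folder convention described on this slide can be read back mechanically. The following is a minimal sketch only, assuming a path of the form drawer/taxon/file; the function name `metadata_from_path` and the example drawer, taxon and file names are hypothetical and are not taken from the project.

```python
from pathlib import PurePosixPath


def metadata_from_path(image_path: str) -> dict:
    """Read drawer number and taxon name from a <drawer>/<taxon>/<file> path.

    Assumption: the last two folders above the image are the drawer number
    and the filed-as taxon name, as described on the slide.
    """
    parts = PurePosixPath(image_path).parts
    if len(parts) < 3:
        raise ValueError(f"expected <drawer>/<taxon>/<file>, got: {image_path}")
    drawer_number, taxon_name, file_name = parts[-3], parts[-2], parts[-1]
    return {
        "drawer_number": drawer_number,
        "taxon_name": taxon_name,
        "file_name": file_name,
    }


# Hypothetical example path: digitiser prefix plus running number as file name.
example = metadata_from_path("scans/D0123/Pieris rapae/jsmith_0001.jpg")
```

Capturing the two values from the folder names means the person imaging never has to type them per specimen, which is the point of the convention.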
12 Ingestion into Transcription Database
A script has been developed by our IT specialist Chris Sleep to try to automate some of the slow manual tasks. The script takes the image file and its location and generates a stub database record from them.
The script uses the application Barcodefiler to search the image for a barcode. If one is found, the script renames the image file with the specimen number.
It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields: specimen number (from barcode); drawer number (from folder name); taxon name (from folder name).
Using the ImageMagick libraries it creates a cropped label derivative image.
Label derivative: the prime reason was to improve the efficiency of the rapid data capture interface. It is a magnified crop of the labels in the image for the digitisers to read, cropped from fixed coordinates. These derivatives are to be imported into EMu as a distinct digital asset.
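The stub-record step might look roughly like the sketch below. It is not the project's script: the table name `specimen_stub`, its columns, the function names and the use of SQLite are all assumptions made for illustration, and barcode reading is taken as already done, with `None` standing for a no-read.

```python
import sqlite3


def create_stub_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical stub table holding the three core fields."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimen_stub ("
        "specimen_number TEXT PRIMARY KEY, "
        "drawer_number TEXT NOT NULL, "
        "taxon_name TEXT NOT NULL, "
        "image_file TEXT NOT NULL)"
    )


def ingest(conn, specimen_number, drawer_number, taxon_name, extension=".jpg"):
    """Insert a stub record and return the new image file name.

    Returns None when no barcode was read, so the image can be
    set aside for manual handling instead of being renamed.
    """
    if not specimen_number:
        return None
    image_file = f"{specimen_number}{extension}"  # file renamed to the specimen number
    conn.execute(
        "INSERT INTO specimen_stub VALUES (?, ?, ?, ?)",
        (specimen_number, drawer_number, taxon_name, image_file),
    )
    conn.commit()
    return image_file
```

Using the specimen number as the primary key means a misread that duplicates an existing number fails loudly at ingest rather than silently overwriting a record.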
13 Transcription
Perhaps show a demo.
The ingestion process pulls in the images (full and label derivative), specimen number, taxon name & drawer number.
Transcription is done by the digitisers.
Transcription focuses on the core label data. There may be all kinds of additional extraneous data on the labels (sale, unknown numbers, collector notations), but it’s often hard to interpret and to codify. We started capturing it but decided it was too time consuming for the benefits (just flag the tick box instead).
Core fields: collecting site; collection date; collector; registration number and detail; preparation details; type status.
Lookups are used to control and speed up data entry.
Collecting locality, collector and registration data are ‘harmonised’ or ‘normalised’.
Sites
We wish to harmonise/normalise site data. We find this is most easily achieved before ingest into EMu.
Digitisers capture the raw variants (interpretation is a specialism). Interpreting at this stage would slow down digitisation, there would be poor consistency if many people were making these interpretations, and it would take a lot of training as well.
Box Hill as an example: many variants of the same site concept. Reconciliation to a master record is done in the next phase of the workflow, giving a single site for import into EMu that is georeferenced (by a specialist).
Unique strings that occur on labels: so far 7,500 unique strings for 97,000 specimen records.
Taxonomy
Pulled from the names of the folders the images are placed into. These contain variants (not always consistently entered), typos and mistakes (many of these names do not occur in standard checklists).
Collectors
Entered ‘verbatim’ but atomised into title, first, middle & last names.
Registration data
14 Data Harmonisation
Biggest challenge is how to harmonise data with existing EMu data.
We wish to use appropriate records where they exist in EMu and not to create additional duplicates.
Data concepts we wish to harmonise with EMu records: Taxonomy (determination); Parties (collectors); Locations (drawers).
Data concepts to create as new: Sites.
The scale of the mess makes data management extremely challenging.
Taxonomy (determination): difficult to reconcile in an automated way (i.e. import algorithm).
Parties (collectors): lots of duplicates and some messy data, but reconciling them with existing EMu records is very do-able.
Locations (drawer numbers): imported a fresh set of museum drawer location records for this project. Straightforward to match on number.
Sites (collecting site): CREATE NEW SET OF SITES. The existing records are too messy: no consistency in how they were generated, incorrect, duplicated, no georeferences. We were faced with a choice: do we attempt to clean up what is there, or start generating a good clean set of site records?
15 Taxonomy Harmonisation
EMu taxonomy is still a mess! For UK butterflies there are 1000s of names: duplicates, erroneous names, different combinations.
We did not have the time to clean the taxonomy for UK Lepidoptera. We have to live with the mess!
Taxonomic expertise is needed to validate the iCollections name against the correct concept in EMu.
Typos and errors are introduced when digitisers enter names.
We can’t rely on the EMu import algorithms, as matching taxon names is too complex. Human intervention is needed.
Built a mapping tool to map each taxon name to an existing EMu name.
Although there is a limited number of UK butterfly species (ca. 60 resident species + migrants), aberrations make it a lot more complicated (1,500 names and growing). Many of these aberrational names are invalid MS names only found in the collection. Nevertheless, there is interest in the lepidopterist community in knowing what aberrations we hold.
Ideally we would have resolved the mess before attempting this project. In reality we don’t have the luxury to do this. Live with the mess!!
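The slides do not describe how the mapping tool works internally. As an illustration only, a tool of this kind could offer a taxonomist a short list of candidate names for each entered string and leave the decision to them. The sketch below uses Python's standard `difflib` for that; the function name, the cutoff and the example names are assumptions.

```python
from difflib import get_close_matches


def suggest_names(entered_name, known_names, limit=3, cutoff=0.6):
    """Return up to `limit` known names that resemble the entered name.

    The result is a ranked suggestion list for a taxonomist to confirm or
    reject; it is deliberately not an automatic match.
    """
    return get_close_matches(entered_name, known_names, n=limit, cutoff=cutoff)


# Hypothetical stand-in for names already held in the collections database.
known = ["Pieris rapae", "Pieris napi", "Vanessa atalanta"]
candidates = suggest_names("Pieris rapea", known)  # folder name with a typo
```

An empty list signals that nothing similar exists, which is the case a taxonomist would need to treat as a possible new or manuscript name.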
17 Sites Harmonisation
Messy data makes databasing directly into EMu difficult. Sites has poor-quality data: very few records are usable, with very poor consistency in how data have been captured (diverse data sources).
Mapping site variants to a site master record. Example variants:
Box Hill
Box Hill; Surrey
Box Hill; Kent
Box Hill; Surrey; UK;
Box Hill; near Dorking N, W
Box Hill, Dorking
Out of 181,000 specimens, just 9,681 unique site variants.
For Sites it is an opportunity to create some good-quality content from scratch, based on a consistent method of interpretation according to an agreed protocol.
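Mapping variants to a master record can be sketched as a lookup from lightly normalised label strings to a master identifier. Everything named here is hypothetical: the identifier `SITE-0001`, the choice of which variants belong to it, and the normalisation rules. In the project the assignment is made by a specialist, not by code.

```python
def normalise(raw: str) -> str:
    """Collapse whitespace, drop leading/trailing separators, lower-case."""
    return " ".join(raw.split()).strip(" ;,").lower()


# Hypothetical specialist decisions: these verbatim strings mean the same site.
CONFIRMED_VARIANTS = {
    "SITE-0001": ["Box Hill; Surrey", "Box Hill; Surrey; UK;", "Box Hill, Dorking"],
}

VARIANT_TO_MASTER = {
    normalise(variant): master_id
    for master_id, variants in CONFIRMED_VARIANTS.items()
    for variant in variants
}


def resolve(raw: str):
    """Return the master site id for a label string, or None if unmapped."""
    return VARIANT_TO_MASTER.get(normalise(raw))


def unique_variants(raw_strings):
    """Set of distinct normalised strings, i.e. the specialist's work list."""
    return {normalise(s) for s in raw_strings}
```

Because many specimens share the same label string, the specialist only has to interpret each unique variant once, which is why the unique-variant count matters more than the specimen count.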
20 Import into EMu
Import is a phased approach:
A) Images. KE have built a backend script to ingest multimedia server side. It reports out a csv with the EMu irn & file name identifier.
B) Specimen record (taxonomy, drawer location & multimedia).
C) Georeferenced collection event data.
Much quicker than importing through the client. Uses the batch operations module.
Automated generation of csv import files.
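Generating an import file automatically can be as simple as writing the validated records out with a fixed column order. The sketch below uses Python's standard `csv` module; the column names are placeholders and are not EMu field names, which the slides do not give.

```python
import csv
import io


def build_import_csv(records, columns):
    """Serialise records (dicts) to CSV text with a fixed column order.

    Keys not listed in `columns` are ignored, so working fields used during
    transcription do not leak into the import file.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder column names and a hypothetical record.
columns = ["specimen_number", "drawer_number", "taxon_name"]
records = [
    {
        "specimen_number": "SPEC0001",
        "drawer_number": "D0123",
        "taxon_name": "Pieris rapae",
        "transcriber_note": "label faded",  # not exported
    }
]
csv_text = build_import_csv(records, columns)
```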
21 Issues
Barcode no-reads or misreads.
Printing quality of barcodes.
Multiple specimens on one pin.
Conflicting data.
Data difficult to interpret.
Specimens with old-style specimen number labels (non-barcode).
Specimen records that already exist in EMu.
22 Digitisation Progress
Preparation: 1.15 minutes
Imaging: 1.05 minutes
Transcription: 0.59 minutes
Total: 2.80 minutes
This doesn’t include the validation (Sites/Taxonomy), georeferencing & import. However, these are not going to be done on a specimen-by-specimen basis.
These figures are for the relatively easy butterflies, which are large. It won’t be so straightforward for micro moths!
Moving on to the moths next.
The digitisers are involved with other projects; they are not dedicated to the iCollections digitisation.
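To give a sense of scale, the stage times can be combined with the collection size quoted earlier in the talk (ca. ½ million specimens). This is a rough sketch that assumes the times are per-specimen averages and that they would hold across the whole collection, which the slide itself cautions against for the moths. The three rounded stage times sum to 2.79 rather than 2.80, so the reported total was presumably computed from unrounded figures.

```python
# Stage times in minutes, as reported on the slide (assumed to be per specimen).
STAGE_MINUTES = {"preparation": 1.15, "imaging": 1.05, "transcription": 0.59}
REPORTED_TOTAL_MINUTES = 2.80
SPECIMENS = 500_000  # "ca. ½ million specimens"


def person_hours(specimens: int, minutes_per_specimen: float) -> float:
    """Total hands-on time in hours for a given per-specimen time."""
    return specimens * minutes_per_specimen / 60


summed_stages = round(sum(STAGE_MINUTES.values()), 2)
estimated_hours = person_hours(SPECIMENS, REPORTED_TOTAL_MINUTES)

print(f"Sum of rounded stage times: {summed_stages} minutes")
print(f"Estimated hands-on time: {estimated_hours:,.0f} person-hours")
```

The estimate excludes validation, georeferencing and import, as the slide notes, and takes no account of the digitisers' other commitments.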
23 iCollections Team
The success is due to the project having a strong team ethic, pulling together museum staff from a wide variety of different disciplines.
Chair: Gordon Paterson
Project manager: Victoria Carter
Quality assurance: Darrell Siebert
Digitisers: Peter Wing, Elisa Cane, Flavia Toloni, Jo Durant, Lyndsey Douglas, Sara Albuquerque, Jasmin Perera, Sophie Ledger, Gerrardo Mazzetta
Collections management: Geoff Martin, Martin Honey, Blanca Huertas, Theresa Howard
Research: Steve Brooks, Angela Self, Ian Kitching
Georeferencing: Malcolm Penn, Liz Duffell, Caitlin McLaughlin
Database & interface designer: Mike Sadka
Data workflow: Adrian Hine
Database: Chris Sleep
Image workflow: Vladimir Blagoderov, Steve Cafferty