Adrian Hine, Data Manager (Entomology) Natural History Museum, London Making a molehill out of a mountain – is it achievable to clean 12 million EMu records?


Data Enhancements Project
Digital Collections Programme (DCP): a programme to kick-start mass digitisation of our collections. It comprises a number of projects:
– Pilot digitisation: iCollections, slide scanning, Picturae, eMesozoic.
– Mobilisation of digitised registers.
– Data standards & policies (core data, standards, barcoding, entry-level digitisation).
– System enhancements (improving our system, training, reporting).
– Data enhancements.
Data Enhancements has two main objectives:
– Improve the NHM EMu system.
– Improve the quality of existing data.
The project has two phases:
– Year 1: Assessment (April 2014-March 2015).
– Year 2: Implementation (April 2015-March 2016).

Improve the NHM EMu System
We built a bloated & complicated system: 5 separate departmental implementations. We didn’t have a clear idea what we wanted EMu for (collections/research).
Project objectives:
– Streamline the EMu Catalogue: remove low-usage columns/tabs. In progress.
– Harmonise the structure of EMu across departments. In progress.
– Simplify the Collection Events & Sites data model. In progress.
– Build a unified molecular data model across life science. Planned.
– Build a single model for how we capture verbatim data. Planned.

Improve the Quality of Existing Data
There is significant ongoing cleansing activity, but it is parochial: decided by individual curators and not aligned with the museum vision.
Project objectives:
– Assess the current state of data in EMu.
– Define a priority list of EMu data enhancements.
– Develop an overarching plan for data enhancements that fits with the strategic objectives of the DCP.
– Develop efficient workflows to clean data.
– Define data quality metrics to monitor improvement & progress.
– Develop an actual implementation plan to begin enhancement.

Scale of Problem
Snapshot Nov 2014
Module             BOT      ENT        MIN      PAL      ZOO        ALL
Catalogue          689,859  1,110,312  432,385  441,329  1,927,675  4,618,856
Collection Events  716,434  114,051    27,653   19,757   700,760    1,578,654
Collection Index   0695, ,799714,329
Multimedia         152,589  19,056     11,850   40,131   73,121     730,373
Analysis           7526,42035, ,690
Stratigraphy       00120,2230
Parties            192,401  42,671     14,726   13,152   266,464    530,415
Sites              82,272   101,965    139,073  37,123   402,071    763,259
Taxonomy           309,611  1,184,484  16,014   122,735  421,591    2,054,548
Bibliography       17,531   306,659    11,767   79,564   76, ,545,485

Assessment Phase
A series of different assessments to understand the current situation:
– Field usage
– Major data issues by department
– Duplicate records
– Lookups in EMu
– Existing EMu data against DCP core data
– Georeference data
– Catalogue record types

Assessment of Field Usage
Field usage across modules (columns: Module; Total Fields; No. Used; % Used. Rows: Analysis, Catalogue, Coll. Events, Collection Index, Multimedia, Parties, Sites, Stratigraphy, Taxonomy, Total). Collection Index: 90 fields, 18 used, 20%.
Distribution of Field Usage
Lower Bound  Upper Bound  No. of Fields
90%          100%         5
80%          90%          1
70%          80%          1
60%          70%          6
50%          60%          6
40%          50%          38
30%          40%          11
20%          30%          56
10%          20%          84
0%           10%          1122
Total                     1330
Conclusions: Large numbers of fields are not being used across modules, particularly in the Catalogue, where only 55% of the ca. 2,400 fields are used.
Recommendations: Unused & low-usage fields should be removed to simplify the Catalogue module.
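A field-usage assessment of this kind can be approximated outside EMu from an exported CSV. The sketch below is illustrative only: the column names and sample rows are made up, and it assumes that a field counts as "used" in a record when it holds any non-blank value. It computes the fraction of records that populate each field, then counts fields per 10% band in the style of the distribution table.

```python
import csv
import io
from collections import Counter


def field_usage(rows, fieldnames):
    """Return {field: fraction of records in which the field is non-blank}."""
    total = 0
    filled = Counter()
    for row in rows:
        total += 1
        for name in fieldnames:
            if (row.get(name) or "").strip():
                filled[name] += 1
    return {name: (filled[name] / total if total else 0.0) for name in fieldnames}


def usage_histogram(usage, include_unused=False):
    """Count fields per 10% band, keyed by the band's lower bound (0, 10, ..., 90)."""
    bands = Counter()
    for fraction in usage.values():
        if fraction == 0 and not include_unused:
            continue  # never-used fields are reported separately
        lower = min(int(fraction * 10), 9) * 10  # 100% falls in the 90-100% band
        bands[lower] += 1
    return dict(bands)


# Made-up export: four records, hypothetical column names.
SAMPLE = """irn,TaxonName,Collector,OldNotes
1,Apis mellifera,Smith,
2,Bombus terrestris,,
3,Vespa crabro,Jones,
4,Apis cerana,,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
usage = field_usage(reader, reader.fieldnames)
histogram = usage_histogram(usage)
print(usage)
print(histogram)
```

Fields that land in the lowest band, or are never populated at all, would be the first candidates for removal, subject to review by the relevant curators.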

Assessment of Major Data Issues
For each issue: scale, priority (ranked), % complete, reason, approach.
Conclusions: Our data issues are broad and deep, with large numbers of erroneous records, incomplete records & missing records. Current data enhancement workflows are inadequate and need scaling up.
Recommendations: Initiate a pilot project of data enhancements to develop efficient workflows that can tackle the scale of data issues we face.

Assessment of Duplicate Records
Module             Duplicated Values  Total duplicate records  Percentage duplication
Defined by Summary Data
Catalogue          100,065            1,523,                   %
Collection Events  157,906            771,                     %
Collection Index   5,701              12,                      %
Locations          4,604              10,                      %
Multimedia         32,733             82,                      %
Parties            16,748             235,                     %
Analysis           3943,                                       %
Stratigraphy       2,110              10,                      %
Sites              34,014             409,                     %
Taxonomy           90,266             263,                     %
TOTAL                                 3,321,                   %
Defined By Other Fields
Parties            24,068             305,                     %
Multimedia         30,355             94,                      %
Taxonomy           227,895            683,                     %
Conclusions: Duplicates are a serious impediment for users. A rough estimate of the overall duplication rate is 25%, but it is considerably higher in some modules.
Recommendations: Devise efficient workflows to remove duplicates in key modules. Implement a plan across science to remove duplicates from core modules, engaging collections staff.
– What is a duplicate? Defining uniqueness.
– Literal v. semantic duplicates.
– Duplicates are self-perpetuating.
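The distinction between literal and semantic duplicates can be made concrete with a small sketch. Everything here is an assumption for illustration: the records are invented (irn, summary) pairs, and the "semantic" key is only a crude normalisation (case, accents, punctuation, spacing), not how EMu or the project defined uniqueness. On one reading of the table, the number of clusters corresponds to "Duplicated Values" and the records inside them to "Total duplicate records".

```python
import re
import unicodedata
from collections import Counter, defaultdict


def literal_key(summary):
    """Literal duplicates: the summary strings match exactly."""
    return summary


def normalised_key(summary):
    """Crude stand-in for semantic matching: ignore case, accents, punctuation, spacing."""
    text = unicodedata.normalize("NFKD", summary)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def clusters(records, key_func):
    """Group irns by key and keep only the groups holding more than one record."""
    groups = defaultdict(list)
    for irn, summary in records:
        groups[key_func(summary)].append(irn)
    return {key: irns for key, irns in groups.items() if len(irns) > 1}


def duplication_rate(records, key_func):
    """Share of all records that sit inside a duplicate cluster."""
    in_clusters = sum(len(irns) for irns in clusters(records, key_func).values())
    return in_clusters / len(records) if records else 0.0


# Invented Parties-style records: (irn, summary data).
RECORDS = [
    (1, "Smith, J."),
    (2, "Smith, J."),
    (3, "smith j"),
    (4, "Müller, H."),
    (5, "Muller H"),
    (6, "Jones, A."),
]

literal = clusters(RECORDS, literal_key)
semantic = clusters(RECORDS, normalised_key)
cluster_sizes = Counter(len(irns) for irns in semantic.values())
print(literal, duplication_rate(RECORDS, literal_key))
print(semantic, duplication_rate(RECORDS, normalised_key))
print(cluster_sizes)
```

A looser key finds more duplicates but also risks merging records that are genuinely distinct, which is why the definition of uniqueness has to be agreed before any bulk merge.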

Duplicate Cluster Size

Assessment of Lookups in EMu
Surveyed data managers to determine which of the hundreds of lookups are most important to control and clean. Each lookup was scored on:
1. Scope (dept./museum)
2. Priority (low/medium/high)
3. Ease (easy/moderate/difficult)
Module             Count
Bibliography       3
Catalogue          53
Collection Events  7
Disposals          3
Loans              8
Locations          1
Multimedia         1
Parties            5
Registration Lots  1
Sites              7
Stratigraphy       4
Taxonomy           10
Valuations         1
Total              104
Conclusions: The survey identified a total of 104 lookups to clean in EMu, of which 42 were deemed high priority, many being important for museum compliance. Many lookups were not deemed worth cleaning (remove?).
Recommendations: Instigate a plan to define vocabularies, clean existing lookups and lock them down.

Assessment of Existing Data against DCP Core Data
Percentage of Specimen Records with DCP Core Data
Core Data Item                         BOT    ENT    MIN    PAL    ZOO    Mean
BM Number of object                    99.9%  22.8%  89.6%  99.2%  71.4%  76.6%
Object-level identifier if different   -      92.0%  -      -      -
Location within the NHM                38.9%  26.4%  52.3%  28.1%  22.3%  33.6%
Label and/or object image(s)           20.9%  5.2%   1.6%   16.1%  36.9%  16.1%
Taxonomic name                         96.3%  99.9%  83.4%  67.7%  68.8%  97.1%
Type status (if single specimen)       21.7%  5.2%          5.0%   5.7%   8.5%
Geographical location                  94.2%  23.4%  89.5%  73.2%  66.8%  69.4%
Date of collection                     98.2%  23.4%  15.2%  37.6%  66.8%  48.2%
Collector                              98.2%  23.4%  -      37.6%  66.8%  56.5%
NHM Acquisition Information            -      2.4%   33.1%  2.3%   61.2%  24.7%
Stratigraphy                           %-

Georeferencing
There is no existing museum-wide best practice, protocol or tool for georeferencing. There is a successful georeferencing software tool and workflow for the iCollections project. Only 12% of records are georeferenced.
Department     Georeferenced  Total  % Georeferenced
Botany         10,400         82,    %
Entomology     4,098          101,   %
Zoology        36,542         402,   %
Mineralogy     41,143         139,   %
Palaeontology  1,043          37,    %
All            93,226         763,   %
Conclusions: Most of our data are not georeferenced, and georeferencing them requires significant effort. There is a successful georeferencing pipeline (software tool, protocol & workflow) for the iCollections project.
Recommendations: Existing Site data requires significant cleaning before it can be georeferenced. The existing georeferencing pipeline needs to be made more generic and flexible to meet the requirements of all mass digitisation projects.
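The 12% headline figure can be checked with simple arithmetic. The sketch below assumes that the truncated totals in the georeferencing table are the Sites record counts from the Nov 2014 snapshot on the Scale of Problem slide (82,272; 101,965; 402,071; 139,073; 37,123; 763,259). Their leading digits match, but that link is an inference, so the resulting percentages are not the slide's own figures.

```python
# (georeferenced Site records, total Site records) per department.
# The totals are ASSUMED to be the Sites counts from the Nov 2014 snapshot.
SITES = {
    "Botany": (10_400, 82_272),
    "Entomology": (4_098, 101_965),
    "Zoology": (36_542, 402_071),
    "Mineralogy": (41_143, 139_073),
    "Palaeontology": (1_043, 37_123),
}
ALL_SITES_TOTAL = 763_259  # "ALL" column of the Sites row in the snapshot


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


by_department = {dept: percent(done, total) for dept, (done, total) in SITES.items()}
georeferenced_all = sum(done for done, _ in SITES.values())
overall = percent(georeferenced_all, ALL_SITES_TOTAL)

for dept, pct in by_department.items():
    print(f"{dept}: {pct}%")
print(f"All: {georeferenced_all} of {ALL_SITES_TOTAL} = {overall}%")
```

Under that assumption the departmental counts add up to the 93,226 shown in the "All" row, and the overall rate rounds to the 12% quoted on the slide.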

Assessment of Catalogue Record Types
Number of record types and kinds in EMu (columns: Department; No. Sub-Depart.; No. of Record Types; No. of Record Kinds; Combinations. Rows: Botany, Entomology, Mineralogy, Palaeontology, Zoology, Libraries, PEG, TOTAL).
Collections Custody tab
Conclusions: There is a plethora of record types (n=45) and kinds (n=64) in the Catalogue module. Many of these are not heavily used, no longer relevant or not well thought out.
Recommendations: Simplify and harmonise collection kinds and types across collection areas.

EMu Data Enhancements Phase 2 Objectives
Develop individual data enhancement plans for each of the 5 collections based on:
– Data enhancement priorities
– Ongoing collection activity
– Digitisation priorities
Establish a framework for data enhancements and embed these into business-as-usual activity by end of project (20%).
Develop efficient data enhancement workflows, piloting some key workflows, e.g. deduplication.
Measure improvements using KPIs and DQIs.
Establish mechanisms to maintain data quality going forward.
Simplify Catalogue record types.

Scaling Up Duplicate Removal
Given the size of the mountain, how do we make this achievable?
De-duping using manual merge & delete is not scalable for us (millions of records).
Advantages of using the Operations module:
– Runs server side, so it is significantly faster.
– Users don’t have to ‘nurse’ merges and deletes.
– Offers the opportunity to build a batch upload of merges.
Tim Conyers (Data Manager, Zoology) is developing a workflow using the Operations module to run batches of merges. It is being trialled in Parties but could be extended to any module.

Operations Module for Merging

Bulk Merge & Delete Workflow
Semi-automated workflow for removing duplicates in the Parties module:
1. Export data
2. Run Perl script to identify duplicates
3. Generates an import file for Operations
4. Import merge operations into EMu
5. Run merges in Operations
6. Export subordinate irns and collate in Excel
7. Import single delete operation
8. Run delete operation
Manual mark-up of duplicates by curators
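The script step of this workflow might look roughly like the sketch below. It is written in Python rather than Perl, and it is not the museum's script. The matching key, the rule that the lowest irn becomes the primary record, and the two-column output layout are all hypothetical, and the real import format expected by the Operations module is not reproduced here.

```python
import csv
import io
from collections import defaultdict


def simple_key(summary):
    """Hypothetical matching rule: ignore case and repeated whitespace."""
    return " ".join(summary.lower().split())


def merge_pairs(records, key_func=simple_key):
    """Return (primary_irn, subordinate_irn) pairs, one per record to be merged away.

    The lowest irn in each cluster is kept as the primary record; that choice
    is an assumption made for this sketch, not a rule taken from the workflow.
    """
    groups = defaultdict(list)
    for irn, summary in records:
        groups[key_func(summary)].append(irn)
    pairs = []
    for irns in groups.values():
        primary, *subordinates = sorted(irns)
        pairs.extend((primary, sub) for sub in subordinates)
    return pairs


def write_merge_file(pairs, handle):
    """Write the pairs as CSV; the column names are placeholders."""
    writer = csv.writer(handle)
    writer.writerow(["primary_irn", "subordinate_irn"])
    writer.writerows(pairs)


# Invented export from the Parties module: (irn, summary data).
EXPORT = [
    (101, "Natural History Museum"),
    (205, "natural history  museum"),
    (307, "Royal Botanic Gardens, Kew"),
    (150, "NATURAL HISTORY MUSEUM"),
]

pairs = merge_pairs(EXPORT)
subordinates_to_delete = [sub for _, sub in pairs]  # feeds the later delete step

buffer = io.StringIO()
write_merge_file(pairs, buffer)
print(buffer.getvalue())
```

Keeping the list of subordinate irns alongside the merge file mirrors the later steps, where the merged-away records are collated and removed in a single delete operation.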

Measuring Improvements
Monitoring improvements (to measure progress and show to management):
– DQIs (Data Quality Indicators): verification, completeness.
– KPIs (Key Performance Indicators): more granular, depending on the specific enhancement workflow.

Maintaining Data Quality
Maintaining data quality:
– Harmonisation of client (sharing common fields)
– Agreement of controlled vocabularies, locking lookups
– Defining what makes a record unique and setting unique registry
– Mandatory fields
– Standardised templates for data capture outside EMu
– Server-side scripts to run more sophisticated checks
– Authority lists
– Field-level help
– Data QC built into all data capture projects outside EMu
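Two of these mechanisms, mandatory fields and controlled vocabularies, lend themselves to an automated check on data captured outside EMu before it is imported. The sketch below is a minimal illustration: the field names and the vocabulary are invented, and it does not represent an EMu feature or the museum's own QC scripts.

```python
# Hypothetical rules for a data-capture template; not taken from EMu.
MANDATORY_FIELDS = ("RegistrationNumber", "TaxonName")
CONTROLLED_VOCABULARIES = {
    # An empty value is allowed here because most specimens are not types.
    "TypeStatus": {"", "Holotype", "Paratype", "Syntype"},
}


def check_record(record):
    """Return a list of human-readable problems found in one captured record."""
    problems = []
    for field in MANDATORY_FIELDS:
        if not (record.get(field) or "").strip():
            problems.append(f"{field}: mandatory field is empty")
    for field, allowed in CONTROLLED_VOCABULARIES.items():
        value = (record.get(field) or "").strip()
        if value not in allowed:
            problems.append(f"{field}: '{value}' is not in the controlled vocabulary")
    return problems


good = {"RegistrationNumber": "X-0001", "TaxonName": "Apis mellifera", "TypeStatus": "Holotype"}
bad = {"RegistrationNumber": "X-0002", "TaxonName": " ", "TypeStatus": "holotyp"}

for record in (good, bad):
    print(record["RegistrationNumber"], check_record(record))
```

Running a check like this at the point of capture stops new errors entering the system, which matters because cleaning them afterwards is the expensive part.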

Development of Tools in EMu
Data managers need more data management tools to tackle the scale of their data issues:
– Develop the Operations module further to better support bulk merges.
– Manage the queue of operations better; integrate merge & delete as a continuous process.
– Identify duplicates tool/record similarity tool.
– Move data from field to field tool.
– Find orphaned records tool (records with no attachments).

Lessons Learnt
It’s worth taking the time to analyse the state of your data. It will be very revealing and make you focus on where you want to go.
It’s important to set out clear goals and priorities for data: what we think is worth focussing on and, importantly, what is not a priority.
It’s as much about changing people’s culture, e.g. stopping people doing what they may currently be doing. This is far more difficult to realise and manage.

Questions?