Adrian Hine, Data Manager (Entomology) Natural History Museum, London Making a molehill out of a mountain – is it achievable to clean 12 million EMu records?

1 Adrian Hine, Data Manager (Entomology) Natural History Museum, London Making a molehill out of a mountain – is it achievable to clean 12 million EMu records?

2 Data Enhancements Project
Digital Collections Programme (DCP): a programme to kick-start mass digitisation of our collections.
A number of projects:
– Pilot digitisation: iCollections, slide scanning, Picturae, eMesozoic
– Mobilisation of digitised registers
– Data standards & policies (core data, standards, barcoding, entry-level digitisation)
– System enhancements (improving our system, training, reporting)
– Data enhancements
Data Enhancements has two main objectives:
– Improve the NHM EMu system
– Improve the quality of existing data
Two phases of the project:
– Year 1: Assessment (April 2014-March 2015)
– Year 2: Implementation (April 2015-March 2016)

3 Improve the NHM EMu System
We built a bloated & complicated system: 5 separate departmental implementations.
We didn’t have a clear idea what we wanted EMu for (collections/research).
Project objectives:
– Streamline the EMu Catalogue by removing low-usage columns/tabs. In progress.
– Harmonise the structure of EMu across departments. In progress.
– Simplify the Collection Events & Sites data model. In progress.
– Build a unified molecular data model across life science. Planned.
– Build a single model for how we capture verbatim data. Planned.

4 Improve the Quality of Existing Data
There is significant ongoing cleansing activity, but it is parochial, decided by individual curators, and not aligned with the museum vision.
Project objectives:
– Assess the current state of data in EMu.
– Define a priority list of EMu data enhancements.
– Develop an overarching plan for data enhancements that fits with the strategic objectives of the DCP.
– Develop efficient workflows to clean data.
– Define data quality metrics to monitor improvement & progress.
– Develop an actual implementation plan to begin enhancement.

5 Scale of Problem
Record counts by module and department (snapshot Nov 2014):

Module                  BOT        ENT      MIN      PAL        ZOO         ALL
Catalogue           689,859  1,110,312  432,385  441,329  1,927,675   4,618,856
Collection Events   716,434    114,051   27,653   19,757    700,760   1,578,654
Collection Index          0    695,529        0        0     18,799     714,329
Multimedia          152,589     19,056   11,850   40,131     73,121     730,373
Analysis                752      6,420   35,518        0          0      42,690
Stratigraphy        00120,2230 (cells run together in the transcript)
Parties             192,401     42,671   14,726   13,152    266,464     530,415
Sites                82,272    101,965  139,073   37,123    402,071     763,259
Taxonomy            309,611  1,184,484   16,014  122,735    421,591   2,054,548
Bibliography         17,531    306,659   11,767   79,564     76,617     492,138
Total                                                                11,545,485

6 Assessment Phase
A series of different assessments to understand the current situation:
– Field usage
– Major data issues by department
– Duplicate records
– Lookups in EMu
– Existing EMu data against DCP core data
– Georeference data
– Catalogue record types

7 Assessment of Field Usage
Field usage across modules:

Module             Total Fields   No. Used   % Used
Analysis                    510        176      35%
Catalogue                  2424       1330      55%
Coll. Events                970        161      17%
Collection Index             90         18      20%
Multimedia                  394         12      16%
Parties                     646        165      26%
Sites                       951        234      25%
Stratigraphy                266        151      57%
Taxonomy                    914        386      42%
Total                      7165       2633      37%

Distribution of Field Usage:

Lower Bound   Upper Bound   No. of Fields
90%           100%                      5
80%           90%                       1
70%           80%                       1
60%           70%                       6
50%           60%                       6
40%           50%                      38
30%           40%                      11
20%           30%                      56
10%           20%                      84
0%            10%                    1122
Total                                1330

Conclusions: Large numbers of fields are not being used across modules, particularly in the Catalogue, where only 55% of the ca. 2,400 fields are used.
Recommendations: Unused & low-usage fields should be removed to simplify the Catalogue module.
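
A field-usage count of this kind can be sketched as follows. This is a minimal illustration, not the method used at the NHM: it assumes records have been exported from EMu as dictionaries, and the field names in the example are hypothetical. It reports the fraction of records in which each field is populated, then bins the used fields into 10% bands as in the distribution table.

```python
from collections import Counter


def is_filled(value):
    """True when a value is neither missing nor blank."""
    return value is not None and str(value).strip() != ""


def field_usage(records, fields):
    """Fraction of records in which each field holds a non-empty value."""
    total = len(records)
    usage = {}
    for field in fields:
        filled = sum(1 for rec in records if is_filled(rec.get(field)))
        usage[field] = filled / total if total else 0.0
    return usage


def usage_bands(usage, width=10):
    """Count used fields per percentage band, keyed by the band's lower bound."""
    bands = Counter()
    for fraction in usage.values():
        if fraction > 0:  # fields never populated are 'unused', not banded
            lower = int(fraction * 100) // width * width
            bands[min(lower, 100 - width)] += 1  # 100% falls in the top band
    return dict(bands)


# Hypothetical export: four records, four candidate fields.
records = [
    {"RegNumber": "1", "Collector": "Smith", "Depth": "10"},
    {"RegNumber": "2", "Collector": "", "Depth": "12"},
    {"RegNumber": "3", "Collector": None, "Depth": " "},
    {"RegNumber": "4"},
]
usage = field_usage(records, ["RegNumber", "Collector", "Notes", "Depth"])
bands = usage_bands(usage)
```

Here `Notes` is never populated, so it counts as unused; the other three fields land in the 90-100%, 20-30% and 50-60% bands.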

8 Assessment of Major Data Issues
Conclusions: Our data issues are broad and deep, with large numbers of erroneous records, incomplete records & missing records. Current data enhancement workflows are inadequate and need scaling up.
Recommendations: Initiate a pilot project of data enhancements to develop efficient workflows that can tackle the scale of data issues we face.
For each issue, the assessment recorded: Scale, Priority (ranked), % complete, Reason, Approach.

9 Assessment of Duplicate Records

Module               Duplicated Values   Total duplicate records   Percentage duplication
Defined by Summary Data
Catalogue                      100,065                 1,523,739                   30.82%
Collection Events              157,906                   771,309                   38.86%
Collection Index                 5,701                    12,206                    0.91%
Locations                        4,604                    10,764                    2.41%
Multimedia                      32,733                    82,204                    6.77%
Parties                         16,748                   235,524                   41.25%
Analysis                           394                     3,002                    6.11%
Stratigraphy                     2,110                    10,109                   39.55%
Sites                           34,014                   409,107                   49.14%
Taxonomy                        90,266                   263,424                    8.43%
TOTAL                                                  3,321,388                   25.44%
Defined By Other Fields
Parties                         24,068                   305,455                   53.05%
Multimedia                      30,355                    94,658                    8.80%
Taxonomy                       227,895                   683,060                   22.15%

Conclusions: Duplicates are a serious impediment for users. A rough estimate of the overall duplication rate is 25%, but it is considerably higher in some modules.
Recommendations: Devise efficient workflows to remove duplicates in key modules. Implement a plan across science to remove duplicates from core modules, engaging collections staff.
Points to consider:
– What is a duplicate: defining uniqueness
– Literal v. semantic duplicates
– Duplicates are self-perpetuating
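
The percentages in the table above are consistent with (total duplicate records minus duplicated values) divided by the number of records in the module in the Nov 2014 snapshot. For the Catalogue, (1,523,739 - 100,065) / 4,618,856 is about 30.82%. That reading is inferred from the figures and is not stated on the slide. Under that assumption, the three columns can be computed from a list of per-record keys (for example the summary data string), as in this sketch; the sample keys are invented.

```python
from collections import Counter


def duplicate_stats(keys):
    """Summarise literal duplicates among per-record keys."""
    counts = Counter(keys)
    clusters = [n for n in counts.values() if n > 1]  # sizes of duplicate clusters
    duplicated_values = len(clusters)                 # distinct values seen more than once
    total_duplicate_records = sum(clusters)           # records sharing such a value
    surplus = total_duplicate_records - duplicated_values  # records a merge would remove
    total = sum(counts.values())
    return {
        "duplicated_values": duplicated_values,
        "total_duplicate_records": total_duplicate_records,
        "surplus_records": surplus,
        "pct_duplication": round(100 * surplus / total, 2) if total else 0.0,
        "cluster_sizes": dict(Counter(clusters)),     # cluster size -> number of clusters
    }


# Invented Parties-style summary strings.
sample = ["Smith, J.", "Smith, J.", "Smith, J.", "Jones, A.", "Brown, B.", "Brown, B."]
stats = duplicate_stats(sample)
```

This only finds literal duplicates (identical keys). Semantic duplicates, such as the same person entered under variant spellings, need a normalisation or similarity step before counting.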

10 Duplicate Cluster Size

11 Assessment of Lookups in EMu
Surveyed data managers to determine which of the hundreds of lookups are most important to control and clean. Each lookup was scored on:
1. Scope (dept./museum)
2. Priority (low/medium/high)
3. Ease (easy/moderate/difficult)

Module              Count
Bibliography            3
Catalogue              53
Collection Events       7
Disposals               3
Loans                   8
Locations               1
Multimedia              1
Parties                 5
Registration Lots       1
Sites                   7
Stratigraphy            4
Taxonomy               10
Valuations              1
Total                 104

Conclusions: The survey identified a total of 104 lookups to clean in EMu, of which 42 were deemed high priority, many being important for museum compliance. Many lookups were not deemed worth cleaning (remove?).
Recommendations: Instigate a plan to define vocabularies, clean existing lookups and lock them down.

12 Assessment of Existing Data against DCP Core Data
Percentage of Specimen Records with DCP Core Data:

Core Data Item                           BOT     ENT     MIN     PAL     ZOO    Mean
BM Number of object                    99.9%   22.8%   89.6%   99.2%   71.4%   76.6%
Object-level identifier if different       -   92.0%       -       -       -
Location within the NHM                38.9%   26.4%   52.3%   28.1%   22.3%   33.6%
Label and/or object image(s)           20.9%    5.2%    1.6%   16.1%   36.9%   16.1%
Taxonomic name                         96.3%   99.9%   83.4%   67.7%   68.8%   97.1%
Type status (if single specimen)       21.7%    5.2%            5.0%    5.7%    8.5%
Geographical location                  94.2%   23.4%   89.5%   73.2%   66.8%   69.4%
Date of collection                     98.2%   23.4%   15.2%   37.6%   66.8%   48.2%
Collector                              98.2%   23.4%       -   37.6%   66.8%   56.5%
NHM Acquisition Information                -    2.4%   33.1%    2.3%   61.2%   24.7%
Stratigraphy                               -       -       -   73.0%       -

13 Georeferencing
– There is no existing museum-wide best practice, protocol or tool for georeferencing.
– There is a successful georeferencing software tool and workflow for the iCollections project.
– Only 12% of records are georeferenced.

Department      Georeferenced     Total   % Georeferenced
Botany                 10,400    82,272            12.64%
Entomology              4,098   101,965             4.02%
Zoology                36,542   402,071             9.09%
Mineralogy             41,143   139,073            29.58%
Palaeontology           1,043    37,123             2.81%
All                    93,226   763,259            12.21%

Conclusions: Most of our data are not georeferenced, and georeferencing them requires significant effort. There is a successful georeferencing pipeline (software tool, protocol & workflow) for the iCollections project.
Recommendations: Existing Site data require significant cleaning before they can be georeferenced. The existing georeferencing pipeline needs to be made more generic and flexible to meet the requirements of all mass digitisation projects.

14 Assessment of Catalogue Record Types
Number of record types and kinds in EMu (Collections Custody tab).
Columns: Department, No. Sub-Depart., No. of Record Types, No. of Record Kinds, Combinations. The last three columns are run together in the transcript and are reproduced as they appear:

Department      No. Sub-Depart.   Types/Kinds/Combinations
Botany                       15   2024387
Entomology                    0   528
Mineralogy                    0   51827
Palaeontology                10   4949
Zoology                      22   2413195
Libraries                     0   304
PEG                           2   306
TOTAL                        49   4564683

Conclusions: There is a plethora of record types (n=45) and kinds (n=64) in the Catalogue module. Many of these are not heavily used, no longer relevant or not well thought out.
Recommendations: Simplify and harmonise collection kinds and types across collection areas.

15 EMu Data Enhancements Phase 2 Objectives
– Develop individual data enhancement plans for each of the 5 collections based on:
  – data enhancement priorities
  – ongoing collection activity
  – digitisation priorities
– Establish a framework for data enhancements and embed these into business-as-usual activity by end of project (20%).
– Develop efficient data enhancement workflows, piloting some key workflows, e.g. deduplication.
– Measure improvements using KPIs and DQIs.
– Establish mechanisms to maintain data quality going forward.
– Simplify Catalogue record types.

16 Scaling Up Duplicate Removal
Given the size of the mountain, how do we make this achievable?
De-duping using manual merge & delete is not scalable for us (millions of records).
Advantages of using the Operations module:
– Runs server side, so it is significantly faster.
– Users don’t have to ‘nurse’ merges and deletes.
– Offers the opportunity to build a batch upload of merges.
Tim Conyers (Data Manager, Zoology) is developing a workflow that uses the Operations module to run batches of merges. It is being trialled in Parties but could be extended to any module.

17 Operations Module for Merging

18 Bulk Merge & Delete Workflow
Semi-automated workflow for removing duplicates in the Parties module:
1. Export data
2. Run Perl script to identify duplicates
3. Script generates an import file for Operations
4. Import merge operations into EMu
5. Run merges in Operations
6. Export subordinate irns and collate in Excel
7. Import single delete operation
8. Run delete operation
Also shown: manual mark-up of duplicates by curators.
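
The duplicate-identification step might look something like the sketch below. It is written in Python for illustration, not the Perl script mentioned on the slide, and it makes several assumptions that the slides do not state: that exported Parties records arrive as (irn, name) pairs, that names differing only in case and punctuation are duplicates, and that the lowest irn is kept as the primary record. The two-column output is a hypothetical layout, not the actual Operations import format.

```python
import csv
import io
import re
from collections import defaultdict


def normalise(name):
    """Lower-case a name and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def plan_merges(parties):
    """Return (primary_irn, subordinate_irn) pairs for records sharing a normalised name."""
    groups = defaultdict(list)
    for irn, name in parties:
        groups[normalise(name)].append(irn)
    plan = []
    for irns in groups.values():
        if len(irns) > 1:
            primary, *subordinates = sorted(irns)  # assumption: keep the lowest irn
            plan.extend((primary, sub) for sub in subordinates)
    return sorted(plan)


def to_csv(plan):
    """Serialise the merge plan with hypothetical column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["primary_irn", "subordinate_irn"])
    writer.writerows(plan)
    return buffer.getvalue()


# Invented export rows.
parties = [(101, "Smith, J."), (205, "smith j"), (150, "SMITH  J"), (300, "Jones, A.")]
plan = plan_merges(parties)
merge_file = to_csv(plan)
```

The subordinate irns in the plan are the same records that would later be collated for the single delete operation, so keeping the plan as data makes both steps reproducible.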

19 Measuring Improvements
Monitoring improvements (to measure progress and show to management):
– DQIs (Data Quality Indicators): verification, completeness
– KPIs (Key Performance Indicators): more granular, depending on the specific enhancement workflow
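
As an illustration of how two such indicators could be tracked between snapshots, the sketch below treats completeness as the share of records with every core field populated and verification as the share of records carrying a verified flag. Both definitions, the field names and the `verified` flag are assumptions made for the example; the slides do not define the DQIs.

```python
CORE_FIELDS = ("taxonomic_name", "collector", "collection_date")  # hypothetical names


def is_filled(value):
    """True when a value is neither missing nor blank."""
    return value is not None and str(value).strip() != ""


def dqi(records, core_fields=CORE_FIELDS, verified_flag="verified"):
    """Completeness and verification as fractions of the record set."""
    total = len(records)
    if total == 0:
        return {"completeness": 0.0, "verification": 0.0}
    complete = sum(
        1 for rec in records if all(is_filled(rec.get(f)) for f in core_fields)
    )
    verified = sum(1 for rec in records if rec.get(verified_flag) is True)
    return {"completeness": complete / total, "verification": verified / total}


def improvement(before, after):
    """Change in each indicator between two snapshots."""
    return {name: after[name] - before[name] for name in before}


# Invented snapshots of the same four records before and after cleaning.
full = {"taxonomic_name": "Apis mellifera", "collector": "Smith", "collection_date": "1901"}
before = [dict(full), {"taxonomic_name": "Apis"}, {"collector": "Jones"}, {}]
after = [
    dict(full, verified=True),
    dict(full, verified=True),
    dict(full),
    {"collector": "Jones"},
]
change = improvement(dqi(before), dqi(after))
```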

20 Maintaining Data Quality
– Harmonisation of the client (sharing common fields)
– Agreement of controlled vocabularies, locking lookups
– Defining what makes a record unique and setting a unique registry
– Mandatory fields
– Standardised templates for data capture outside EMu
– Server-side scripts to run more sophisticated checks
– Authority lists
– Field-level help
– Data QC built into all data capture projects outside EMu
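
Three of these mechanisms (mandatory fields, controlled vocabularies and a uniqueness rule) lend themselves to an automated check on data captured outside EMu before it is imported. The following is a rough sketch under assumed field names and vocabulary values; it is not an EMu feature or an NHM script.

```python
def is_blank(value):
    """True when a value is missing or only whitespace."""
    return value is None or str(value).strip() == ""


def validate(records, mandatory, vocab, unique_key):
    """Return (record position, field, problem) tuples for a batch of records."""
    problems = []
    first_seen = {}
    for pos, rec in enumerate(records):
        for field in mandatory:
            if is_blank(rec.get(field)):
                problems.append((pos, field, "missing mandatory value"))
        for field, allowed in vocab.items():
            value = rec.get(field)
            if not is_blank(value) and value not in allowed:
                problems.append((pos, field, "value not in controlled vocabulary"))
        key = tuple(rec.get(field) for field in unique_key)
        if any(is_blank(part) for part in key):
            continue  # an incomplete key cannot be tested for uniqueness
        if key in first_seen:
            problems.append(
                (pos, "+".join(unique_key), f"duplicates record {first_seen[key]}")
            )
        else:
            first_seen[key] = pos
    return problems


# Invented capture-template rows.
rows = [
    {"reg_no": "BM001", "type_status": "Holotype"},
    {"reg_no": "", "type_status": "holo"},
    {"reg_no": "BM001", "type_status": "Paratype"},
]
issues = validate(
    rows,
    mandatory=("reg_no",),
    vocab={"type_status": {"Holotype", "Paratype"}},
    unique_key=("reg_no",),
)
```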

21 Development of Tools in EMu
Data managers need more data management tools to tackle the scale of their data issues:
– Develop the Operations module further to better support bulk merges.
– Manage the queue of operations better and integrate merge & delete as a continuous process.
– Identify-duplicates tool / record similarity tool.
– Tool to move data from field to field.
– Find-orphaned-records tool (records with no attachments).
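
The orphaned-record idea can be sketched outside EMu as a set difference: the irns in one module minus every irn referenced from the modules that attach to it. The record layout and field names below are hypothetical, and a real check would need to cover every attachment field across all modules.

```python
def find_orphans(module_irns, referencing_records, ref_fields):
    """Return irns in a module that no referencing record attaches to."""
    referenced = set()
    for rec in referencing_records:
        for field in ref_fields:
            value = rec.get(field)
            if value is None:
                continue
            # An attachment field may hold one irn or several.
            if isinstance(value, (list, tuple, set)):
                referenced.update(value)
            else:
                referenced.add(value)
    return sorted(set(module_irns) - referenced)


# Invented data: Parties irns, and Catalogue records that attach to Parties.
party_irns = [101, 150, 205, 300]
catalogue = [
    {"irn": 1, "collector_ref": 101, "identified_by_ref": [150, 101]},
    {"irn": 2, "collector_ref": None},
]
orphans = find_orphans(party_irns, catalogue, ["collector_ref", "identified_by_ref"])
```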

22 Lessons Learnt
– It’s worth taking the time to analyse the state of your data. It will be very revealing and will make you focus on where you want to go.
– It is important to set out clear goals and priorities for data: what we think is worth focussing on and, importantly, what is not a priority.
– It’s as much about changing people’s culture, e.g. stopping people doing what they may currently be doing. This is far more difficult to realise and manage.

23 Questions?

