Get your hands dirty cleaning data. 2008 European EMu Users Meeting, 3rd June. - Elizabeth Bruton, Museum of the History of Science, Oxford


1 Get your hands dirty cleaning data. 2008 European EMu Users Meeting, 3rd June. - Elizabeth Bruton, Museum of the History of Science, Oxford elizabeth.bruton@mhs.ox.ac.uk

2 Outline ► Data Migration ► Problem -> Solution approach ► Tools ► Manual Data Cleaning ► Examples ► Current and Future Practices (Documentation, Policing, Review)

3 Data Migration ► First step towards better, cleaner data ► Steps: ▪ Prepare and analyse legacy system ▪ Data mapping ▪ KE EMu system design ▪ Data migration

4 Legacy System Analysis ► Prepare and analyse previous (legacy) system ▪ Data: structure and relationships - tables and fields. ► Primary ► Secondary ► Cross-reference ▪ Documentation and usage ▪ Redundant data

5 Legacy Data analysis

6 Data Mapping

7 KE EMu system design ► Default and additional fields across different modules ► Field titles ► Screen Designer ▪ e.g. Summary tab for ecatalogue module ► Finally data migration

8 Data cleaning overview ► Problem -> solution approach ▪ Input data ▪ Operations ▪ Output data ► Manual or automated operations or both? ► Which tools to use for automated operations? ▪ KE EMu tools – many powerful built-in tools within EMu ▪ Non-KE EMu tools – scripts to run on data exported from EMu; results reimported into EMu ▪ Both

9 KE EMu Tools: Texql queries ► KE Texpress Texql queries ▪ Syntax similar to SQL (as used in MySQL) ► Uses: ▪ Analysing data and data structure ▪ Analysing search queries ▪ Advanced search queries

10 KE EMu Tools: Global Replace ► Very useful, powerful but also potentially ‘dangerous’ tool ► Can use in combination with search query or list options within EMu ► Can use regular expressions and/or wildcard searches ► Powerful tool for single field or Field A->Field B operations

11 KE EMu Tools: Record Merge ► Does what it says on the tin ► Merges one or more duplicate record(s) into a single record ► Only ‘attachments’ to other modules are merged into the record, not the data itself ► Ditto tool can be used for easily copying data from one record to another ► Attachments to original duplicate record(s) are removed so those records can be deleted

12 KE EMu Tools: Reports ► Tool to present information in assorted ways ► Can be used to produce reports but can also be used as data export tool ► Microsoft Excel or CSV format appropriate for more advanced data operations

13 Non-KE EMu Tools: Scripting ► Personally use PHP and MySQL ► Perl is also a useful scripting tool; used by KE ► Have written CSV to MySQL file checker and converter in PHP ► Then run more advanced operations on data using PHP scripts ► phpMyAdmin can export data in many formats including CSV
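
The PHP checker mentioned on this slide is not shown in the deck. As a rough idea of what such a check might involve, here is a minimal sketch in Python (not the author's PHP). The function name `check_csv`, the column names and the sample data are all hypothetical; the checks (consistent field count per row, no empty values in required columns) are assumptions about what a pre-import check could cover.

```python
import csv
import io


def check_csv(text, required=()):
    """Return a list of (line_number, message) problems found in CSV text."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return [(1, "file is empty")]
    # Required columns must exist in the header row.
    for name in required:
        if name not in header:
            problems.append((1, f"missing required column: {name}"))
    for row in reader:
        line_no = reader.line_num
        # Every data row should have as many fields as the header.
        if len(row) != len(header):
            problems.append(
                (line_no, f"expected {len(header)} fields, found {len(row)}")
            )
            continue
        record = dict(zip(header, row))
        for name in required:
            if name in record and not record[name].strip():
                problems.append((line_no, f"empty value in {name}"))
    return problems


# Hypothetical sample: row 2 has an empty maker, row 3 is short a field.
sample = "irn,maker,place\n1,Smith & Jones,London\n2,,Paris\n3,Dollond\n"
for line_no, message in check_csv(sample, required=("irn", "maker")):
    print(line_no, message)
```

Reporting problems with their line numbers, rather than silently fixing them, leaves the decision about each row to a person.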

14 Non-KE EMu Tools: Scripting ► Systematic Approach ▪ Keep copy of original data ▪ Produce data mapping or data cleaning document ▪ Perform operations using PHP file on MySQL table ▪ Check data produced (manual or automatic) and output logs ▪ Validate data in EMu and then import
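
The steps on this slide can be sketched roughly as follows. This is an illustration in Python rather than the PHP/MySQL setup the slide describes, and it works on an in-memory list instead of a database table. The function `clean_with_log` and the record layout are hypothetical. The point is only the pattern: leave the original untouched, apply operations to a copy, and log every change so it can be checked.

```python
import copy


def clean_with_log(records, operations):
    """Apply each (field, function) operation to copies of the records.

    Returns (cleaned_records, log). The input list is left untouched so the
    original data can always be compared against the result.
    """
    cleaned = copy.deepcopy(records)
    log = []
    for index, record in enumerate(cleaned):
        for field, operation in operations:
            before = record.get(field, "")
            after = operation(before)
            if after != before:
                record[field] = after
                # One log entry per changed value: where, what, old, new.
                log.append((index, field, before, after))
    return cleaned, log


# Hypothetical records; the only operation here trims stray whitespace.
original = [
    {"irn": "1", "maker": "  Smith & Jones "},
    {"irn": "2", "maker": "Dollond"},
]
operations = [("maker", str.strip)]
cleaned, log = clean_with_log(original, operations)
for entry in log:
    print(entry)
```

The log doubles as input for the manual check: only records that actually changed need to be looked at.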

15 Manual Data Cleaning ► Some problems cannot be solved automatically, either partially or entirely ► Need to be ‘eyeballed’ by a person, preferably someone familiar with the museum’s collections

16 Example: Parties Records ► Legacy system used two systems of noting object ‘makers’ ▪ Freetext ‘Maker’ field with no centralised system (1:1 ratio); used for applicable records ▪ Assigned makers with centralised system; only used for first 3,000 or so records ► Freetext data imported into EMu resulted in approximately 5,500 Parties records

17 Example: Parties Records ► Good example of mapping freetext field to more structured data field with 1:Many ratio ► KE ran script which ‘detected’ maker type and formatted accordingly, i.e. Maker Type etc ► But still much cleaning up to be done ► Two approaches: automatic then manual

18 Example: Parties Records ► Problem: Creation-related data within legacy system were all free-text fields ► The museum wanted to keep this data in some format as it contained valuable information, such as ambiguities or uncertainties ► e.g. Italy or France, Attributed to Smith & Jones, possibly last quarter of 19th century etc

19 Example: Parties Records ► This data did not fit neatly into defined, structured fields such as Parties, Places or Creation Date ► Also wanted to clean Parties records ► Solution: Automatic batch process then manual cleaning

20 Example: Parties Records – Automatic Approach ▪ Exported Creation data (Parties, Place, Creation Date) from EMu ▪ Ran script which checked for and removed duplicates in Parties and Place ▪ Note: The above operation deleted rather than manipulated data but was still an integral part of the data cleaning operation ▪ Copied cleaned Parties, Place, Creation Date into single free-text field: Creation Notes ▪ Re-imported data into EMu using Import Tool
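
The script used for this step is not included in the deck. A minimal sketch of the idea, in Python, might look like the following. It assumes each exported record carries a list of party values, a list of place values and one creation date string; the field names, the `dedupe` and `creation_notes` helpers, and the "; " separator are all hypothetical.

```python
def dedupe(values):
    """Drop repeated values, ignoring case and surrounding spaces; keep order."""
    seen = set()
    unique = []
    for value in values:
        key = value.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(value.strip())
    return unique


def creation_notes(record):
    """Join deduplicated parties, places and date into one free-text note."""
    parts = [
        ", ".join(dedupe(record.get("parties", []))),
        ", ".join(dedupe(record.get("places", []))),
        record.get("creation_date", "").strip(),
    ]
    # Skip empty parts so the note has no dangling separators.
    return "; ".join(part for part in parts if part)


# Hypothetical record built from the kinds of values quoted on slide 18.
record = {
    "parties": ["Attributed to Smith & Jones", "attributed to Smith & Jones "],
    "places": ["Italy or France"],
    "creation_date": "possibly last quarter of 19th century",
}
print(creation_notes(record))
```

Because the qualifiers ("Attributed to", "possibly") are copied into the note verbatim, the uncertainty the museum wanted to keep survives even after the structured fields are cleaned.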

21 Example: Parties Records – Automatic Approach ▪ Began data cleaning by running Global Replace operation within EMu eparties module, removing 'Signed by', 'Attributed to', or 'Made by' from the relevant parties records ▪ Next: Manual Approach
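
To show the kind of pattern involved, here is a sketch of the same prefix removal as a plain Python regular expression. This is not EMu's Global Replace syntax, which the deck does not show. The function `strip_prefix` is hypothetical; it returns the removed prefix alongside the cleaned name so the change can be reviewed.

```python
import re

# The three prefixes named on the slide, matched case-insensitively and
# only at the start of the value.
PREFIX = re.compile(r"^\s*(?:signed by|attributed to|made by)\s+", re.IGNORECASE)


def strip_prefix(name):
    """Return (cleaned_name, removed_prefix_or_None)."""
    match = PREFIX.match(name)
    if not match:
        return name.strip(), None
    return name[match.end():].strip(), match.group(0).strip()


for name in ["Attributed to Smith & Jones", "Made by Dollond", "Dollond"]:
    print(strip_prefix(name))
```

Anchoring the pattern at the start of the value avoids touching names that merely contain one of the phrases further along.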

22 Example: Parties Records – Manual Approach ▪ Cleaned records: checked Parties Type (Person or Organisation) and edited records (Surname, Forename, Organisation etc) ▪ Merged and deleted duplicate records ▪ Checked and deleted unattached parties records
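
The manual step was done by eye within EMu, but a script can narrow down what needs looking at. The sketch below, in Python, assumes Parties records exported as an ID-to-name mapping and catalogue records exported as a mapping to the party IDs they reference. Both layouts and all function names are hypothetical. It only lists candidates; merging and deleting remain a human decision.

```python
from collections import defaultdict


def normalise(name):
    """Lower-case a name and collapse punctuation and spacing for comparison."""
    kept = "".join(ch if ch.isalnum() else " " for ch in name.casefold())
    return " ".join(kept.split())


def duplicate_candidates(parties):
    """Group party IDs whose names normalise to the same string."""
    groups = defaultdict(list)
    for party_id, name in parties.items():
        groups[normalise(name)].append(party_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


def unattached(parties, catalogue_links):
    """Return party IDs that no catalogue record refers to."""
    used = {pid for linked in catalogue_links.values() for pid in linked}
    return sorted(pid for pid in parties if pid not in used)


# Hypothetical export: three spellings of one maker plus one other maker.
parties = {1: "Smith & Jones", 2: "Smith and Jones", 3: "SMITH & JONES.", 4: "Dollond"}
catalogue_links = {"obj-10": [1], "obj-11": [3, 4]}
print(duplicate_candidates(parties))
print(unattached(parties, catalogue_links))
```

With this sample, records 1 and 3 are grouped as likely duplicates, while record 2 ("Smith and Jones") is missed by the simple normalisation and shows up only as unattached. Cases like that are why the records still need to be checked by someone who knows the collections.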

23 Example: Parties Records – End Result ► Currently have 3,300 cleaner Parties records

24 Current and Future Practices ► Current ▪ Systematic approach to data cleaning; incorporated into monthly museum EMu Users' Meeting ▪ Review ► In Progress ▪ Documentation ► Future ▪ Policing

25 Conclusion ► Data cleaning and policing is an ongoing process for an institution of any size ► Data standards must be set and adhered to ► Needs to be approached and done in a systematic way ► Any questions?

