Presentation is loading. Please wait.

Presentation is loading. Please wait.

April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 1 Janeen Jones (IZ), Robert Lücking (Botany), Joanna McCaffrey (AA), Christine Niezgoda.

Similar presentations


Presentation on theme: "April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 1 Janeen Jones (IZ), Robert Lücking (Botany), Joanna McCaffrey (AA), Christine Niezgoda."— Presentation transcript:

1 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 1 Janeen Jones (IZ), Robert Lücking (Botany), Joanna McCaffrey (AA), Christine Niezgoda (Botany) & Mary Anne Rogers (Fishes) Data Scrubbing – Preparation for EMu

2 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 2 Greetings from The Field Museum

3 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 3 Premier Natural History Museum - collections cover Zoology, Botany, Geology & Anthropology Inaugurated in 1896, moved into beautiful Daniel Burnham building overlooking Lake Michigan in 1921 22+ M specimens, ~1650 new specimens/artifacts per day, 200 scientists in active research program in 70+ countries 1.5 M visitors to our building, with 4.7 M web visits annually Programs covering collections, research, conservation, education and exhibitions 400,000 sq. ft. of exhibition space, 180,000 sq. ft. collections resource center Who We Are

4 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 4 What Weve Done So Far Since early 2003 - Botany - 16 databases, 4 platforms, we cleaned 20M cells => 330 catalogue records, 110,000 taxonomy records, 75,000 parties records, 15,000 transaction records and 15,000 multimedia records Fishes - 1 database, 2 platforms - final stage IZ - 3 databases, 2 platforms - final stage Insects – 22 databases, in progress

5 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 5 Data Scrubbing/Cleansing It's ugly, no one wants to do it, it has to get done, and it is never ending. What is it? Who would want to do it? How to begin?

6 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 6 Definition Also referred to as data cleansing, the act of detecting and removing and/or correcting a databases dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly). The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data that have been merged from separate databases. Sophisticated software applications are available to clean a databases data using algorithms, rules and look-up tables, a task that was once done manually and therefore still subject to human error. [Wikipedia]databasesdirty datadatasoftware applications algorithms

7 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 7 Data Scrubbing Goals Keep costs down – reduce need for data transformations on KE side. Go for one file per module if possible. Keep it simple - create methods that can be used by different people (different skill levels, and different platforms)

8 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 8 Parameters of the Task Several dimensions: –Data Syntax – data format, data dictionaries –Catalogue Design – content informs design –Source/Destination – related to EMu and current data model(s) –Convenience / Expediency – freezing schema and data

9 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 9 Parameters – Data Syntax Make the same things the same, e.g., names Dates - textual dates (Spring 1910), dates (10/4/2006) Other measurements, e.g., lat, lon, depth, width (units of measure) Data types e.g., text, number Authority lists, e.g., taxonomy

10 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 10 Parameters - Data Destination The source and the destination databases may not match well, due to different purposes, not to mention the obvious differences of different designs/designers. –Part of scrubbing is mapping the source to the destination, and being sure not to leave anything behind (legacy fields are great).

11 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 11 Parameters - Design Data scrubbing is a means to design your catalogue. You should become familiar with your data. –When cleaning data youll recognize patterns of problems, you should note these and design your catalog to minimize the occurrence of these problems.

12 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 12 Parameters – Convenience / Expediency Data are not static –Build your methods of scrubbing to account for the fact that the data will NOT be frozen while you are scrubbing

13 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 13 Planning - The Plan Understand all the parameters before starting - one person in charge Close interaction with all members of the team, especially if the work is across multiple databases, and the scrubbing is dispersed – its a team exercise, shared goals, and tools

14 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 14 Planning - Which Standards? However long and hard you plan and estimate, it will take longer – count on it. Set standards, and decide on deviation leeway: how much backtracking you are willing to do to make things right? –Database –Work products: on the specimen label, in the record book, or invoices. Consider : Is this an opportunity to do data/audit inventory?

15 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 15 Planning - Which Standards? No matter what line you draw, youll cross it. Give yourself latitude to correct large mistakes with your data. –e.g, if your catalog does not allow duplicate catalog numbers and your cleaning finds duplicates, youll need to re-catalog at least one object or determine if it has the wrong number

16 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 16 Parameters - Tips Decide which can be done without freezing data and which need to be handled with other tools. After cleaning data in your live platform, you should decide if it is possible to alter your existing database to prevent bad data from being re-entered. (e.g., convert a text field to a look-up field with values built from the cleaned data.)

17 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 17 Techniques Analysis first Tokenization & Packetization : allows for massive cleaning of a target field, but requires analysis, part of mapping exercise Mapping tables : removes the problematic data from the live version and allows manipulation without affecting your data set. Once the work has been done on unique values, those changes can be applied at exportation. Allows you to clean data in a field, as well as parse apart data of a single field into many fields. Encoding : allows partial deferment of cleaning, or transformations at exportation Mirror EMu tables/modules : helps shape direction towards the goal, saves money Typos, language variations, abbreviations, diacritic marks : prevents duplication of names Site / Locality descriptions : Qualifiers :

18 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 18 Techniques Analysis first Tokenization & Packetization : allows for massive cleaning of a target field, but requires analysis, part of mapping exercise Mapping tables : removes the problematic data from the live version and allows manipulation without affecting your data set. Once the work has been done on unique values, those changes can be applied at exportation. Allows you to clean data in a field, as well as parse apart data of a single field into many fields. Encoding : allows partial deferment of cleaning, or transformations at exportation Mirror EMu tables/modules : helps shape direction towards the goal, saves money Typos, language variations, abbreviations, diacritic marks : prevents duplication of names Site / Locality descriptions : Qualifiers :

19 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 19 Techniques Tokenization & Packetization : make a unique token out of each data parcel, like a name, and then re-format it for export. E.g., J. Jones, R. Bieler, and Jochen Gerber becomes J. M. Jones\R. Bieler\J. Gerber as a packet of Brief Names for linking into the Parties module. Mapping tables : use to scrub tokens Encoding : field extensions, and notes fields, scrub now or later Mirror EMu tables/modules : Catalog, Parties, Taxonomy, MM, Collection Events & Sites Typos, language variations, abbreviations, diacritic marks : prevents duplication of names Qualifiers : * and ? In taxa can be treated at cf. and aff. when applied to catalogue entries and in general need to be dealt with as far as possible – cant search for.

20 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 20 Mapping Example - 1

21 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 21 Mapping Example - 2

22 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 22 Site / Locality Example Analyzing this set of records revealed to the field botanist that they were all the SAME! –Santa Rosa National Park, Sector Murcielago –Area de Conservacion Guanacaste, Sector Santa Rosa National Park, Murcielago –Guanacaste National Park, Santa Rosa Section, at Murcielago –Guanecaste Conservation Area, Murcielago –Guanecaste, Parque Nacional Santa Rosa Section, Sector Murcielago Preferred form is: Santa Rosa National Park (Guanacaste Conservation Area), Sector Murciélago

23 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 23 Name Blow-up Example Names have a way of propagating - L. A. de Escobar Collector Katherine Albert de Escobar L. Albert de Escobar L. de Escobar L. Escobar LastDerived Brief Katherine Albert de Middle EscobarL. K. A. de EscobarLinda First Linda E. Collector Katherine de Albert Linda Catherine Albert K. de Escobar L. K. A. de Escobar Linda Albert Collector L. C. A. de Escobar Linda C. Albert L. A. Escobar LAE Collector

24 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 24 Data Simplicity 1 There are past owners numbers recorded in two places

25 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 25 Data Simplicity 2

26 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 26 Tokenizing

27 April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 27 Summary Definition and Goals Parameters of the task are informed by the context –syntax, data destination, convenience/expediency, design Planning - the plan, standards Examples Q u e s t i o n s ?


Download ppt "April 11, 2006Data Scrubbing - EMu Users Group London Apr 2006 1 Janeen Jones (IZ), Robert Lücking (Botany), Joanna McCaffrey (AA), Christine Niezgoda."

Similar presentations


Ads by Google