Data Scrubbing - EMu Users Group London, April 11, 2006. Janeen Jones (IZ), Robert Lücking (Botany), Joanna McCaffrey (AA), Christine Niezgoda.

Presentation transcript:

Data Scrubbing – Preparation for EMu
Janeen Jones (IZ), Robert Lücking (Botany), Joanna McCaffrey (AA), Christine Niezgoda (Botany) & Mary Anne Rogers (Fishes)

Greetings from The Field Museum

Who We Are
Premier Natural History Museum - collections cover Zoology, Botany, Geology & Anthropology
Inaugurated in 1896, moved into beautiful Daniel Burnham building overlooking Lake Michigan in [...]
[...] M specimens, ~1650 new specimens/artifacts per day, 200 scientists in active research program in 70+ countries
1.5 M visitors to our building, with 4.7 M web visits annually
Programs covering collections, research, conservation, education and exhibitions
400,000 sq. ft. of exhibition space, 180,000 sq. ft. collections resource center

What We've Done So Far
Since early [...]
Botany - 16 databases, 4 platforms; we cleaned 20M cells => 330 catalogue records, 110,000 taxonomy records, 75,000 parties records, 15,000 transaction records and 15,000 multimedia records
Fishes - 1 database, 2 platforms - final stage
IZ - 3 databases, 2 platforms - final stage
Insects – 22 databases, in progress

Data Scrubbing/Cleansing
It's ugly, no one wants to do it, it has to get done, and it is never ending.
What is it?
Who would want to do it?
How to begin?

Definition
Also referred to as data cleansing, the act of detecting and removing and/or correcting a database's dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly). The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data that have been merged from separate databases. Sophisticated software applications are available to clean a database's data using algorithms, rules and look-up tables, a task that was once done manually and therefore still subject to human error. [Wikipedia]

Data Scrubbing Goals
Keep costs down – reduce need for data transformations on KE side. Go for one file per module if possible.
Keep it simple - create methods that can be used by different people (different skill levels, and different platforms)

Parameters of the Task
Several dimensions:
  – Data Syntax – data format, data dictionaries
  – Catalogue Design – content informs design
  – Source/Destination – related to EMu and current data model(s)
  – Convenience / Expediency – freezing schema and data

Parameters – Data Syntax
Make the same things the same, e.g., names
Dates - textual dates (Spring 1910), dates (10/4/2006)
Other measurements, e.g., lat, lon, depth, width (units of measure)
Data types, e.g., text, number
Authority lists, e.g., taxonomy
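
As an illustration of the date point above, here is a minimal Python sketch that separates parseable dates from textual ones. The function name `classify_date` and the month-first format are assumptions for the example, not part of the presentation; note that a value like 10/4/2006 reads differently under a day-first convention, which is exactly the kind of syntax decision that has to be made per source database.

```python
from datetime import datetime

def classify_date(raw, fmt="%m/%d/%Y"):
    """Return ("date", ISO string) if raw parses with fmt, else ("text", raw).

    fmt is an assumption: the same string means a different day
    under "%d/%m/%Y", so it must be chosen per source database.
    """
    value = raw.strip()
    try:
        return ("date", datetime.strptime(value, fmt).date().isoformat())
    except ValueError:
        # Textual dates such as "Spring 1910" are kept verbatim for review.
        return ("text", value)

for raw in ["10/4/2006", "Spring 1910", " 4/11/2006 "]:
    print(classify_date(raw))
```

Values tagged "text" can then be routed to a separate textual-date field rather than being forced into a date type.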

Parameters - Data Destination
The source and the destination databases may not match well, due to different purposes, not to mention the obvious differences of different designs/designers.
  – Part of scrubbing is mapping the source to the destination, and being sure not to leave anything behind (legacy fields are great).
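
One way to make sure nothing is left behind is to route every unmapped source field into a legacy bucket. The sketch below assumes records are plain dictionaries; the field names (`CatNo`, `CatalogNumber`, `LegacyData`, and so on) are hypothetical and are not actual EMu column names.

```python
# Hypothetical source-to-destination field names, for illustration only.
FIELD_MAP = {"CatNo": "CatalogNumber", "Collector": "CollectorName"}

def map_record(source, field_map):
    """Map known fields to the destination; keep the rest as legacy data."""
    dest = {}
    legacy = {}
    for field, value in source.items():
        if field in field_map:
            dest[field_map[field]] = value
        else:
            # Unmapped fields are preserved rather than dropped.
            legacy[field] = value
    if legacy:
        dest["LegacyData"] = legacy
    return dest

print(map_record({"CatNo": "123", "OldRef": "x"}, FIELD_MAP))
```

Reviewing what accumulates in the legacy bucket is also a quick check on how complete the mapping is.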

Parameters - Design
Data scrubbing is a means to design your catalogue. You should become familiar with your data.
  – When cleaning data you'll recognize patterns of problems. Note these, and design your catalogue to minimize their occurrence.

Parameters – Convenience / Expediency
Data are not static
  – Build your methods of scrubbing to account for the fact that the data will NOT be frozen while you are scrubbing

Planning - The Plan
Understand all the parameters before starting - one person in charge
Close interaction with all members of the team, especially if the work is across multiple databases and the scrubbing is dispersed – it's a team exercise, with shared goals and tools

Planning - Which Standards?
However long and hard you plan and estimate, it will take longer – count on it.
Set standards, and decide on deviation leeway: how much backtracking are you willing to do to make things right?
  – Database
  – Work products: on the specimen label, in the record book, or invoices.
Consider: Is this an opportunity to do data/audit inventory?

Planning - Which Standards?
No matter what line you draw, you'll cross it.
Give yourself latitude to correct large mistakes with your data.
  – e.g., if your catalog does not allow duplicate catalog numbers and your cleaning finds duplicates, you'll need to re-catalog at least one object or determine if it has the wrong number
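
Finding the duplicates in the first place can be sketched as follows. This is only an illustration: the record layout and the `catalog_number` key are assumptions, and trailing whitespace is treated as insignificant.

```python
from collections import defaultdict

def find_duplicates(records, key="catalog_number"):
    """Return {catalog number: [row indexes]} for numbers used more than once."""
    seen = defaultdict(list)
    for index, record in enumerate(records):
        # Strip whitespace so "F100" and "F100 " count as the same number.
        seen[record[key].strip()].append(index)
    return {number: rows for number, rows in seen.items() if len(rows) > 1}

records = [
    {"catalog_number": "F100"},
    {"catalog_number": "F101"},
    {"catalog_number": "F100 "},
]
print(find_duplicates(records))
```

Each reported group still needs a person to decide whether an object must be re-catalogued or simply carries the wrong number.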

Parameters - Tips
Decide which tasks can be done without freezing data and which need to be handled with other tools.
After cleaning data in your live platform, decide whether it is possible to alter your existing database to prevent bad data from being re-entered (e.g., convert a text field to a look-up field with values built from the cleaned data).
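
The look-up idea can be sketched in a few lines. This assumes the cleaned field is available as a list of strings; `build_lookup` and `rejects` are hypothetical helper names, and a real database would enforce the list through its own look-up or pick-list mechanism.

```python
def build_lookup(cleaned_values):
    """Build a sorted list of allowed values from an already cleaned field."""
    return sorted({v.strip() for v in cleaned_values if v.strip()})

def rejects(new_values, lookup):
    """Return incoming values that are not in the look-up list."""
    allowed = set(lookup)
    return [v for v in new_values if v.strip() not in allowed]

lookup = build_lookup(["Botany", "Fishes", "Botany ", "", "Insects"])
print(lookup)
print(rejects(["Fishes", "Botony"], lookup))
```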

Techniques
Analysis first
Tokenization & Packetization: allows for massive cleaning of a target field, but requires analysis, part of mapping exercise
Mapping tables: removes the problematic data from the live version and allows manipulation without affecting your data set. Once the work has been done on unique values, those changes can be applied at exportation. Allows you to clean data in a field, as well as parse apart data of a single field into many fields.
Encoding: allows partial deferment of cleaning, or transformations at exportation
Mirror EMu tables/modules: helps shape direction towards the goal, saves money
Typos, language variations, abbreviations, diacritic marks: prevents duplication of names
Site / Locality descriptions:
Qualifiers:

Techniques
Tokenization & Packetization: make a unique token out of each data parcel, like a name, and then re-format it for export. E.g., J. Jones, R. Bieler, and Jochen Gerber becomes J. M. Jones\R. Bieler\J. Gerber as a packet of Brief Names for linking into the Parties module.
Mapping tables: use to scrub tokens
Encoding: field extensions, and notes fields, scrub now or later
Mirror EMu tables/modules: Catalog, Parties, Taxonomy, MM, Collection Events & Sites
Typos, language variations, abbreviations, diacritic marks: prevents duplication of names
Qualifiers: * and ? in taxa can be treated as cf. and aff. when applied to catalogue entries, and in general need to be dealt with as far as possible, since they can't be searched for.
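
The collector example above can be sketched in Python. This is not the team's actual tooling, only an illustration: `NAME_MAP` stands in for a mapping table of scrubbed Brief Names, and the splitting rule (commas, semicolons, "&" and the word "and") is an assumption that would break on names stored as "Last, First".

```python
import re

# Stand-in mapping table: raw token -> scrubbed Brief Name (illustrative).
NAME_MAP = {"J. Jones": "J. M. Jones", "Jochen Gerber": "J. Gerber"}

def tokenize(raw):
    """Split a collector string into individual name tokens."""
    parts = re.split(r",|;|\band\b|&", raw)
    return [p.strip() for p in parts if p.strip()]

def packetize(raw, name_map, sep="\\"):
    """Scrub each token via the mapping table, then join into one packet."""
    return sep.join(name_map.get(token, token) for token in tokenize(raw))

print(packetize("J. Jones, R. Bieler, and Jochen Gerber", NAME_MAP))
# prints: J. M. Jones\R. Bieler\J. Gerber
```

Tokens with no entry in the mapping table pass through unchanged, so they can be reviewed and added to the table later.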

Mapping Example - 1

Mapping Example - 2

Site / Locality Example
Analyzing this set of records revealed to the field botanist that they were all the SAME!
  – Santa Rosa National Park, Sector Murcielago
  – Area de Conservacion Guanacaste, Sector Santa Rosa National Park, Murcielago
  – Guanacaste National Park, Santa Rosa Section, at Murcielago
  – Guanecaste Conservation Area, Murcielago
  – Guanecaste, Parque Nacional Santa Rosa Section, Sector Murcielago
Preferred form is: Santa Rosa National Park (Guanacaste Conservation Area), Sector Murciélago
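
A mapping table for these localities might be applied as in the sketch below. The variants and the preferred form come from the slide; the record layout, the `locality` field name and the helper names are assumptions. The work is done once on unique values, and the live data stay untouched until export.

```python
from collections import Counter

PREFERRED = "Santa Rosa National Park (Guanacaste Conservation Area), Sector Murciélago"
VARIANTS = [
    "Santa Rosa National Park, Sector Murcielago",
    "Area de Conservacion Guanacaste, Sector Santa Rosa National Park, Murcielago",
    "Guanacaste National Park, Santa Rosa Section, at Murcielago",
    "Guanecaste Conservation Area, Murcielago",
    "Guanecaste, Parque Nacional Santa Rosa Section, Sector Murcielago",
]
# Mapping table: each variant points at the preferred form.
LOCALITY_MAP = {variant: PREFERRED for variant in VARIANTS}

def unique_values(records, field):
    """List distinct values of a field with counts, most frequent first."""
    return Counter(r[field].strip() for r in records).most_common()

def export_locality(value, mapping):
    """Apply the mapping at export time; unmapped values pass through."""
    cleaned = value.strip()
    return mapping.get(cleaned, cleaned)

records = [{"locality": v} for v in VARIANTS] + [{"locality": "Elsewhere"}]
for value, count in unique_values(records, "locality"):
    print(count, value, "->", export_locality(value, LOCALITY_MAP))
```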

Name Blow-up Example
Names have a way of propagating -
First: Linda
Middle: Katherine Albert de
Last: Escobar
Derived Brief: L. K. A. de Escobar
Collector: L. A. de Escobar; Katherine Albert de Escobar; L. Albert de Escobar; L. de Escobar; L. Escobar
Collector: Linda E.; Katherine de Albert; Linda Catherine Albert; K. de Escobar; L. K. A. de Escobar; Linda Albert
Collector: L. C. A. de Escobar; Linda C. Albert; L. A. Escobar; LAE

Data Simplicity 1
There are past owners' numbers recorded in two places

Data Simplicity 2

Tokenizing

Summary
Definition and Goals
Parameters of the task are informed by the context
  – syntax, data destination, convenience/expediency, design
Planning - the plan, standards
Examples
Questions?