Normalizing Data for Migration Kyle Banerjee



Migrations are a fact of life
● Acquisitions data
● Item data
● ERM
● Bibliographic
● Patron data
● Statistics
● Holdings information
● Content management systems
● Link resolver
● Circulation data
● Archival management software
● Institutional repository

You can do a lot without programming skills

Absolutely!
✓ Carriage returns in data
✓ Retain preferred value of multivalued fields
✓ Missing or invalid data
✓ Find problems following complex patterns

Maybe...
? Conditional logic
? Changes based on multifield logic
? Convert free text fields to discrete values

Excel
● Mangles your data
  ○ Barcodes, identifiers, and numeric data at risk
● Cannot fix carriage returns in data
● Crashes with large files
● OpenRefine is a better tool for situations where you think you need Excel
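For anyone who does have a little scripting available, the two Excel risks listed above (mangled identifiers and carriage returns inside fields) can be sidestepped by reading the export as plain text. The following is a minimal sketch, not part of the original slides; the function name `clean_rows` and the sample column names are hypothetical, and it assumes a comma-separated export with quoted fields.

```python
import csv
import io


def clean_rows(csv_text):
    """Yield rows with line breaks inside fields replaced by single spaces.

    Hypothetical helper for illustration only.
    """
    # newline="" leaves line endings untranslated so the csv module can
    # recognize line breaks that sit inside quoted fields.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    for row in reader:
        # The csv module returns every value as a string, so barcodes and
        # other identifiers keep their leading zeros.
        yield [" ".join(field.splitlines()) for field in row]


sample = 'barcode,note\r\n00012345,"first line\r\nsecond line"\r\n'
rows = list(clean_rows(sample))
```

Because nothing is converted to a number, the barcode `00012345` comes back exactly as it was exported.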

Keys to success
● Understand differences between the old and new systems
● Manually examine thousands of records
● Learn regular expressions
● Ask for help!

Watch out for
✓ Creative use of fields
  ○ Inconsistencies and changing policies
  ○ Embedded code
  ○ Data that exploits buggy behavior
✓ Different data structures
  ○ Acq, licensing, electronic, items, etc.
✓ Different types of data within fields (e.g. codes vs. text)

CONTENTdm migration example
● XML metadata export contained errors in every field that contained an HTML entity (&amp; &lt; &gt; &quot; &apos; etc.)
  Oregon Health &amp Science University
● Error occurs in many fields scattered across thousands of records
● But this can be fixed in seconds!

Regular expressions to the rescue!
● “Whenever a field ends in an HTML entity minus the semicolon and is followed by an identical field, join those into a single field and fix the entity. Any line can begin with an unknown number of tabs or spaces”

/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

Regular expressions can...
● Use logic, capitalization, edges of words/lines, express ranges, use bits (or all) of what you matched in replacements
● Convert free text to XML, delimited text, or codes, and vice versa
● Find complex patterns using proximity indicators and/or involving multiple lines
● Select preferred versions of fields

Confusing at first, but easier than you think!
● Works on all platforms and is built into a lot of software
● Ask for help! Programmers can help you with syntax
● Let’s walk through our example, which involves matching and joining unknown fields across multiple lines...

Regular Expression Analysis

/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

^               Beginning of line
\s*<            Zero or more whitespace characters followed by “<”
\([^>]\+>\)     One or more characters that are not “>” followed by “>” (i.e. a tag). Store in \1
\(.*\)          Any characters to next part of pattern. Store in \2
\(&[a-z]\+\)    Ampersand followed by letters (HTML entities). Store in \3
<\/\1\n         “</” followed by \1 (i.e. the closing tag) followed by a newline
\s*<\1          Any number of whitespace characters followed by tag \1
/<\1\2\3;/      Replace everything up to this point with “<” followed by \1 (opening tag), \2 (field contents), \3, and “;” (fix HTML entity). This effectively joins the fields
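The expression above is written in the escaped-parenthesis style used by editors such as Vim. As a rough illustration, the same pattern might be expressed in Python as sketched below. This is an assumption-laden translation rather than the author's own code: the function name `join_split_fields` and the `<title>` tag in the sample are hypothetical, and `[ \t]*` stands in for `\s*` because Python's `\s` would also match newlines.

```python
import re

# Python translation of the slide's pattern (parentheses and + are not
# escaped in Python's regex syntax).
SPLIT_FIELD = re.compile(
    r"^[ \t]*<([^>]+>)"  # leading tabs/spaces, "<", then tag name and ">" -> group 1
    r"(.*)"              # field contents -> group 2
    r"(&[a-z]+)"         # entity missing its semicolon -> group 3
    r"</\1\n"            # closing tag followed by a newline
    r"[ \t]*<\1",        # next line opens with the same tag
    re.MULTILINE,
)


def join_split_fields(xml_text):
    """Join a field that was split after an entity and restore the semicolon."""
    return SPLIT_FIELD.sub(r"<\1\2\3;", xml_text)


# Hypothetical sample record; the real export's tag names are not known.
broken = "  <title>Oregon Health &amp</title>\n  <title> Science University</title>\n"
fixed = join_split_fields(broken)
```

As in the slide's replacement string, the leading whitespace of the first line is part of the match and is not written back, so the joined field starts at the left margin.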

A simpler example
● Find a line that contains 1 to 5 fields in a tab delimited file (because you expect 6)

^\([^\t]*\t\)\{0,4}[^\t]*$

● To automatically join it with the next line with a space

/^\(\([^\t]*\t\)\{0,4}[^\t]*\)\n/\1 /

However, it would be much safer and easier to use syntax that detects the first or last field
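The short-line check above can also be sketched in Python. This is an illustration under stated assumptions, not the author's method: the function name `join_broken_lines` is hypothetical, the input is assumed to be a list of lines with their newlines already removed, and the loop keeps joining until a line no longer looks short, which is a design choice beyond the single join shown in the slide.

```python
import re

# Matches a line with 1 to 5 tab-separated fields (6 are expected).
SHORT_LINE = re.compile(r"^(?:[^\t]*\t){0,4}[^\t]*$")


def join_broken_lines(lines):
    """Join each short line with the line(s) that follow it, using a space."""
    fixed = []
    pending = None  # a short line waiting for its continuation
    for line in lines:
        if pending is not None:
            line = pending + " " + line
            pending = None
        if SHORT_LINE.match(line):
            pending = line
        else:
            fixed.append(line)
    if pending is not None:
        # Nothing left to join with; keep the short line so no data is lost.
        fixed.append(pending)
    return fixed


records = [
    "1\tA\tB\tC\tD\tE",
    "2\tA\tB\tbroken",
    "note\tD\tE",
    "3\tA\tB\tC\tD\tE",
]
repaired = join_broken_lines(records)
```

Like the pattern in the slide, this also treats blank lines as short, so the output is worth checking by eye before loading it anywhere.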

If you want a GUI, use OpenRefine
● Sophisticated, including regular expression support and ability to create columns from external data sources
● Convert between different formats
● Up to a couple hundred thousand rows

Normalization is more conceptual than technical
● Every situation is unique and depends on the data you have and the config of the new system
● Don’t fob off data analysis on technical people who don’t understand library data
● It’s not possible to fix everything because the systems work differently (if they didn’t, migrating would be pointless)

Questions? Kyle Banerjee