Normalizing Data for Migration Kyle Banerjee
Migrations are a fact of life
●Acquisitions data
●Item data
●ERM
●Bibliographic data
●Patron data
●Statistics
●Holdings information
●Content management systems
●Link resolver
●Circulation data
●Archival management software
●Institutional repository
You can do a lot without programming skills
Absolutely!
✓ Carriage returns in data
✓ Retain the preferred value of multivalued fields
✓ Missing or invalid data
✓ Find problems that follow complex patterns
Maybe...
? Conditional logic
? Changes based on multifield logic
? Converting free-text fields to discrete values
Excel
●Mangles your data
○Barcodes, identifiers, and numeric data at risk
●Cannot fix carriage returns in data
●Crashes with large files
●OpenRefine is a better tool for situations where you think you need Excel
Keys to success
●Understand differences between the old and new systems
●Manually examine thousands of records
●Learn regular expressions
●Ask for help!
Watch out for
✓ Creative use of fields
○Inconsistencies and changing policies
○Embedded code
○Data that exploits buggy behavior
✓ Different data structures
○Acq, licensing, electronic, items, etc.
✓ Different types of data within fields (e.g. codes vs. text)
CONTENTdm migration example
●XML metadata export contained errors in every field containing an HTML entity (& < > " ' etc.), e.g. “Oregon Health & Science University”
●Errors occur in many fields scattered across thousands of records
●But this can be fixed in seconds!
Regular expressions to the rescue!
●“Whenever a field ends in an HTML entity minus the semicolon and is followed by an identical field, join those into a single field and fix the entity. Any line can begin with an unknown number of tabs or spaces”
/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/
Regular expressions can...
●Use logic, capitalization, and edges of words/lines, express ranges, and reuse bits (or all) of what you matched in replacements
●Convert free text, XML, delimited text, and codes into one another
●Find complex patterns using proximity indicators and/or spanning multiple lines
●Select preferred versions of fields
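As a small illustration of capture groups turning free text into delimited data (the citation format and sample line here are invented for demonstration, and Python’s re syntax is used rather than the sed/vim-style patterns on the other slides):

```python
import re

# Hypothetical example: convert free-text citation lines like
# "Author. Title. Year." into tab-delimited author<TAB>title<TAB>year
# by capturing each piece with a group.
cite = re.compile(r'^(.+?)\.\s+(.+?)\.\s+(\d{4})\.$')

line = 'Banerjee, Kyle. Normalizing Data. 2016.'
m = cite.match(line)
delimited = '\t'.join(m.groups())
print(delimited)  # Banerjee, Kyle	Normalizing Data	2016
```

The lazy quantifiers (`.+?`) stop at the first period followed by whitespace, so the comma inside the author name is left alone.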
Confusing at first, but easier than you think!
●Works on all platforms and is built into a lot of software
●Ask for help! Programmers can help you with syntax
●Let’s walk through our example, which involves matching and joining unknown fields across multiple lines...
Regular Expression Analysis
/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/
^	Beginning of line
\s*<	Zero or more whitespace characters followed by “<”
\([^>]\+>\)	One or more characters that are not “>” followed by “>” (i.e. a tag). Store in \1
\(.*\)	Any characters up to the next part of the pattern. Store in \2
\(&[a-z]\+\)	Ampersand followed by letters (an HTML entity). Store in \3
<\/\1\n	“</” followed by \1 (i.e. the closing tag) followed by a newline
\s*<\1	Any number of whitespace characters followed by “<” and tag \1
/<\1\2\3;/	Replace everything matched with “<” followed by \1 (opening tag), \2 (field contents), \3, and “;” (fixing the HTML entity). This effectively joins the fields
A simpler example
●Find a line that contains 1 to 5 fields in a tab-delimited file (because you expect 6)
^\([^\t]*\t\)\{0,4}[^\t]*$
●To automatically join it with the next line with a space:
/^\(\([^\t]*\t\)\{0,4}[^\t]*\)\n/\1 /
However, it would be much safer and easier to use syntax that detects the first or last field
If you want a GUI, use OpenRefine
●Sophisticated, including regular expression support and the ability to create columns from external data sources
●Converts between different formats
●Handles up to a couple hundred thousand rows
Normalization is more conceptual than technical
●Every situation is unique and depends on the data you have and the configuration of the new system
●Don’t fob off data analysis on technical people who don’t understand library data
●It’s not possible to fix everything because the systems work differently (if they didn’t, migrating would be pointless)
Questions? Kyle Banerjee