Normalizing Data for Migration Kyle Banerjee


1 Normalizing Data for Migration Kyle Banerjee banerjek@ohsu.edu

2 Migrations are a fact of life Acquisitions data Item data ERM bibliographic Patron data Statistics Holdings Information Content Management Systems Link resolver Circulation data Archival management software Institutional Repository

3 You can do a lot without programming skills Absolutely! ✓ Carriage returns in data ✓ Retain preferred value of multivalued fields ✓ Missing or invalid data ✓ Find problems following complex patterns Maybe.. ? Conditional logic ? Changes based on multifield logic ? Convert free text fields to discrete values


5 Excel ● Mangles your data ○ Barcodes, identifiers, and numeric data at risk ● Cannot fix carriage returns in data ● Crashes with large files ● OpenRefine is a better tool for situations where you think you need Excel: http://openrefine.org
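For anyone who does script, the same risks can be avoided by reading the export as plain text. The following is a minimal sketch, not something from the slides: it assumes a CSV export, and the column names `barcode` and `title` and the helper `read_rows` are hypothetical. Python's standard `csv` module returns every value as a string, so leading zeros survive, and a quoted field may contain line breaks.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV export; every value stays a string (no numeric coercion)."""
    return list(csv.DictReader(handle))


# Hypothetical export: a barcode with leading zeros and a title that
# contains a line break inside a quoted field.
sample = io.StringIO('barcode,title\n00012345678901,"Line one\nline two"\n')
rows = read_rows(sample)

# Flatten embedded line breaks to spaces before loading into the new system.
for row in rows:
    row["title"] = " ".join(row["title"].splitlines())

print(rows[0]["barcode"])
print(rows[0]["title"])
```

When reading from a real file rather than a string, the `csv` documentation recommends opening it with `newline=""` so embedded line breaks are passed through unchanged.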

6 Keys to success ● Understand differences between the old and new systems ● Manually examine thousands of records ● Learn regular expressions ● Ask for help!

7 Watch out for ✓ Creative use of fields ○Inconsistencies and changing policies ○Embedded code ○Data that exploits buggy behavior ✓ Different data structures ○ Acq, licensing, electronic, items, etc ✓ Different types of data within fields (e.g. codes vs. text)

8 CONTENTdm migration example ● XML metadata export contained errors on every field that contained an HTML entity (&amp; &lt; &gt; &quot; &apos; etc.), e.g. "Oregon Health &amp Science University", where the entity has lost its semicolon ● Error occurs in many fields scattered across thousands of records ● But this can be fixed in seconds!

9 Regular expressions to the rescue! ● “Whenever a field ends in an HTML entity minus the semicolon and is followed by an identical field, join those into a single field and fix the entity. Any line can begin with an unknown number of tabs or spaces.” /^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

10 Regular expressions can... ● Use logic, capitalization, edges of words/lines, express ranges, use bits (or all) of what you matched in replacements ● Convert free text into XML into delimited text or codes and vice versa ● Find complex patterns using proximity indicators and/or involving multiple lines ● Select preferred versions of fields
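As one illustration of converting free text to codes, here is a minimal sketch using word edges and case-insensitive matching. The status notes, the codes (`LOST`, `WDN`, `RPR`, `UNK`), and the function name `status_code` are all hypothetical; a real migration would take its code list from the new system's configuration.

```python
import re

# Hypothetical rules mapping free-text notes to discrete codes.
# Order matters: the first pattern that matches wins.
STATUS_RULES = [
    (re.compile(r"\b(lost|missing)\b", re.IGNORECASE), "LOST"),
    (re.compile(r"\bwithdr(?:awn|ew)\b", re.IGNORECASE), "WDN"),
    (re.compile(r"\b(bindery|repair)\b", re.IGNORECASE), "RPR"),
]


def status_code(note, default="UNK"):
    """Return the code of the first matching rule, or `default` if none match."""
    for pattern, code in STATUS_RULES:
        if pattern.search(note):
            return code
    return default


for note in ["Item MISSING since 2012 inventory", "sent to bindery", "checked shelf"]:
    print(note, "->", status_code(note))
```

Notes that fall through to the default code are the ones worth examining by hand.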

11 Confusing at first, but easier than you think! ●Works on all platforms and is built into a lot of software ●Ask for help! Programmers can help you with syntax ●Let’s walk through our example which involves matching and joining unknown fields across multiple lines...

12 Regular Expression Analysis
/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/
● ^   Beginning of line
● \s*<   Zero or more whitespace characters followed by “<”
● \([^>]\+>\)   One or more characters that are not “>” followed by “>” (i.e. a tag). Store in \1
● \(.*\)   Any characters up to the next part of the pattern. Store in \2
● \(&[a-z]\+\)   Ampersand followed by letters (an HTML entity without its semicolon). Store in \3
● <\/\1\n   “</” followed by \1 (i.e. the closing tag) followed by a newline
● \s*<\1   Any number of whitespace characters followed by “<” and \1 (the same opening tag)
● /<\1\2\3;/   Replace everything matched so far with “<” followed by \1 (opening tag), \2 (field contents), \3, and “;” (fix HTML entity). This effectively joins the fields
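The slide's pattern uses vi/sed-style escaping. As a rough sketch of the same idea in Python's `re` syntax (groups are unescaped, and `[ \t]*` replaces `\s*` because Python's `\s` also matches newlines), assuming a hypothetical `<publisher>` tag that is not taken from the actual export:

```python
import re

# Same three captures as on the slide: \1 = tag name plus ">",
# \2 = field contents, \3 = the entity missing its semicolon.
SPLIT_ENTITY = re.compile(
    r"^[ \t]*<([^>]+>)(.*)(&[a-z]+)</\1\n[ \t]*<\1",
    re.MULTILINE,  # let ^ match at the start of every line
)


def join_split_entities(xml_text):
    """Join a field split after a broken entity and restore the semicolon."""
    return SPLIT_ENTITY.sub(r"<\1\2\3;", xml_text)


# Hypothetical sample; the tag name is an assumption for illustration.
sample = (
    "  <publisher>Oregon Health &amp</publisher>\n"
    "  <publisher> Science University</publisher>\n"
)
print(join_split_entities(sample))
```

Like the original replacement, this drops the leading indentation of the joined line, which is harmless in XML.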

13 A simpler example ● Find a line that contains 1 to 5 fields in a tab-delimited file (because you expect 6): ^\([^\t]*\t\)\{0,4}[^\t]*$ ● To automatically join it to the next line with a space: /^\(\([^\t]*\t\)\{0,4}[^\t]*\)\n/\1 / ● However, it would be much safer and easier to use syntax that detects the first or last field.
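The short-line check can be sketched in Python as follows. This is an illustration under assumptions, not the presenter's method: the sample rows and the function name `join_short_lines` are hypothetical, and the join is done line by line so that a continuation line is not itself mistaken for a short record and merged with the record after it.

```python
import re

# 0 to 4 tabs on a line means 1 to 5 fields; 6 fields are expected.
SHORT_LINE = re.compile(r"(?:[^\t]*\t){0,4}[^\t]*")


def join_short_lines(text):
    """Join each short line to the line after it with a space; return lines."""
    lines = text.rstrip("\n").split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if SHORT_LINE.fullmatch(line) and i + 1 < len(lines):
            line = line + " " + lines[i + 1]
            i += 1  # the continuation line has been consumed
        out.append(line)
        i += 1
    return out


# Hypothetical data: the second record was broken by a line break in field 4.
sample = "1\tA\tB\tC\tD\tE\n2\tA\tB\tbroken\nnote\tD\tE\n3\tA\tB\tC\tD\tE\n"
for row in join_short_lines(sample):
    print(row.split("\t"))
```

A record broken in more than one place would still be short after a single join, which is one reason the slide recommends anchoring on a recognizable first or last field instead.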

14 If you want a GUI, use OpenRefine http://openrefine.org ●Sophisticated, including regular expression support and ability to create columns from external data sources ●Convert between different formats ●Up to a couple hundred thousand rows


16 Normalization is more conceptual than technical ●Every situation is unique and depends on the data you have and the config of the new system ●Don’t fob off data analysis on technical people who don’t understand library data ●It’s not possible to fix everything because the systems work differently (if they didn’t, migrating would be pointless)

17 Questions? Kyle Banerjee banerjek@ohsu.edu

