USING OPENREFINE FOR DATA-DRIVEN DECISION-MAKING

1 USING OPENREFINE FOR DATA-DRIVEN DECISION-MAKING
A case study from the University of Michigan Library
Matt Carruthers, Metadata Projects Librarian, University of Michigan

2 OpenRefine openrefine.org
OpenRefine is a great open source tool that is widely used for cleaning messy data.

3 OpenRefine openrefine.org
That's even the tagline for the software. It's often employed in libraries to clean messy bibliographic data.

4 But our bibliographic data at the University of Michigan is flawless. (He’s lying.)
For this particular project, we weren't interested in cleaning data; we needed to efficiently sort and filter through large sets of bibliographic data in order to make collection management decisions. So over the next ten minutes or so, I'm going to take you through a project we did recently at the library to assess the collections of six library buildings. I'll show you how the project started, how it changed once we implemented OpenRefine, and the impact it had on our outcome.

5 Project Summary Identify resources in our collection for which we own more than one copy. Assess whether or not they are low-use items. Withdraw duplicate copies from the collection or send them to remote storage. At UM Library, we decided to undertake a collection management project to identify resources where we had more than one copy in our collections and to assess whether they were low-use or outdated, based on criteria from our collection managers. Once those items were identified, one or more of the duplicate copies would be withdrawn from the collection or sent to remote storage to free up shelf space at the libraries for new collection growth.

6 Project Summary DataMart
Library’s reporting tool for extracting data from our ILS. Reports are generated in spreadsheet format. Spreadsheets contain a mix of bibliographic information and usage statistics. Our source of data was DataMart, the library's reporting tool for extracting reports from the ILS. We select call number ranges, and the spreadsheets produced are anywhere from 100,000 to 500,000 rows, containing bibliographic information as well as circulation statistics.

7 Project Summary Find items which:
Have more than one copy in the Library. Are not part of a series. Are at least five years old. Have circulated fewer than six times. Are not “in process”. Are not attached to a provisional cataloging record. Are part of the circulating collection. Selection criteria: find items which have more than one copy in the library; are not part of a series; are at least five years old; have circulated fewer than six times; are not “in process” (not currently on loan, in conservation for repair, labelled as missing, etc.; in other words, they must be on the shelf); are not attached to a provisional record; and are part of the circulating collection (nothing from special collections or reference resources). A sketch of how one of these criteria maps onto an OpenRefine facet follows below.
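To give a flavor of how one of these criteria translates into OpenRefine, the circulation test can be written as a custom facet on the circulation-count column. This is only an illustrative sketch; the column is whatever the DataMart report calls the circulation count, and the expression below is not necessarily the exact one used in the project:

value.toNumber() < 6

Faceting on that expression splits the rows into true and false, and selecting the true bucket keeps only the items that have circulated fewer than six times.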

8 In the Beginning Started using Excel for processing:
Used sorting, filters, and conditional formatting to identify and isolate duplicates that met the criteria. As a general metric, it took at least 2 hours of staff time to process 100,000 rows, because we had to manually rerun each command on each spreadsheet. After doing this for a while and seeing how time consuming, error prone, and generally not fun this process actually was, we started looking for a more efficient way to process the spreadsheets. So we turned to OpenRefine to essentially fully automate the processing. Using a combination of the facet functions and various data transformation commands, we cut that processing time down to about 15 seconds per 100,000 rows. So how did we do that? I won't take you through the process in OpenRefine step by step, because there are over 20 commands from start to finish. That may sound a little daunting, but it actually took less than an hour to set up the process, and OpenRefine actually does a lot of the heavy lifting for you.

9 So I'll take you through some of the highlights of the process:
First, we start with a typical spreadsheet of about 100,000 rows. We use a combination of the facet feature (which allows you to quickly identify and edit only the rows you are interested in) and custom transformation commands written in the Google Refine expression language.

10 For example, we can filter out anything that is part of a series by using the customized facet “Facet by blank” on the Description column, since any value in that column indicates something like a volume number. So any row with a value (i.e. not blank) in the Description column can be deleted.
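Under the hood, “Facet by blank” is just a convenience: the same test can be written as a custom facet in GREL. The expression below is shown purely to illustrate what the built-in facet is doing:

isBlank(value)

Faceting the Description column on that expression buckets the rows into true (no series information) and false (has a volume or part statement); selecting the false bucket and removing the matching rows clears out the series items in one step.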

15 Additionally, we can use a custom transformation written in the Google Refine Expression Language (or GREL) to calculate how old a resource is.
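To make the GREL part concrete, here is a minimal sketch of one way the age calculation could be written, assuming the publication year sits in its own column as a four-digit number (the setup and exact expression are illustrative, not necessarily what we used in the project):

datePart(now(), "years") - value.toNumber()

Writing the result to a new column with a transform, or using it directly in a custom numeric facet, gives each item's age in years; a numeric facet on that value then isolates everything at least five years old.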

23 You can save command sets to run automatically on additional spreadsheets.
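For context on what saving a command set means in practice: OpenRefine's Undo/Redo tab has an Extract option that exports the operation history as JSON and an Apply option that replays it on another project. The snippet below only illustrates the general shape of that JSON (field details vary by OpenRefine version); it is not the actual script from this project:

[
  {
    "op": "core/text-transform",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "Description",
    "expression": "grel:value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column Description"
  }
]

Once extracted, the same JSON can be pasted into Apply on each new spreadsheet, which is what makes the per-spreadsheet processing essentially automatic.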

31 Outcomes Cut processing time by approximately 99.8%.
Cut weekly staff time on the project from 34 hours to 7 hours. Processed over 2 million items. Withdrew 9,000 low-use duplicates from our collections. Outcomes: it used to take 2 hours to process 100,000 items, and it now takes about 15 seconds using OpenRefine, cutting processing time by approximately 99.8%. We had been spending 34 hours of staff time a week on this project; we have now cut that to 7 hours per week while being much more efficient, which frees up staff to do other things. It allows us to identify little-used duplicate items faster than we have ever been able to, so we can make more informed collection management decisions much more quickly. We have processed over 2 million items and withdrawn 9,000 duplicate items from our collection. The process requires no programming expertise or special skills, so anyone can run it.

32 Questions? OpenRefine command scripts on Github:
@mattadata2 The project is ongoing, and we are beginning to implement OpenRefine for other projects in similar ways. You can check out my Github repo for the command script used in this project; I also have a few other repositories with OpenRefine command scripts for various other projects.

