Enhancing Workflow through Batch Import from Excel to DSpace


1 Enhancing Workflow through Batch Import from Excel to DSpace
Sai Deng, Susan Matveyeva, Baseer Khan, Wichita State University Libraries

2 Outline
Introduction: Batch import in cataloging workflow
Batch Import from Excel to DSpace
  The collections, team and workflow
  An add-on to facilitate DSpace batch import procedure
  Program installation, customization and execution
  Server upload of generated data packages
  New challenges of the herbarium project (Dublin Core vs. Darwin Core standards)
Reflections and conclusion
Appendices

3 “Book in hand” cataloging
Traditionally, the original and copy cataloging workflow is based on a “book in hand” approach and does not require batch processing of cataloging records:
Copy cataloging: Have item in hand – Search OCLC – Identify the matching copy – Import record from OCLC to a local ILS – Edit record as needed – Save to ILS
Original cataloging: Have item in hand – Create record in OCLC – Import record from OCLC to a local ILS – Save to ILS
Database maintenance: Find and correct errors manually, one by one

4 Batch processing in early days of library automation
Batchloading: “A process by which records to be processed are collected into batches. The records in a batch are loaded all at once”. It was introduced in the early days of automation (1980s in the library literature); The major batchload projects at WSU (80s-90s): load of OCLC records to NOTIS; migration from NOTIS to Voyager, and also load of Marcive; Tapes were used before the Internet era.

5 Batch processing today
Batchloading has become a standard part of the cataloging workflow due to: Increased granularity of cataloging in order to improve access to “hidden collections” (article level; separate poems from a collection, etc.); Availability of records from vendors and OCLC; Repurposing of metadata for use in different systems (e.g. from IR to ILS and vice versa).

6 WSU batchloading services
Regular batch load:
  Acquisitions records from jobbers (e.g. YBP, Blackwell's);
  Marcive files (monthly);
  Serials Solutions records for e-journals;
  LTI authority processing (weekly; quarterly; and yearly).
One-time load of purchased sets of records:
  Anti-Slavery Collection (microforms)
  Early English Books (microforms)
  net-Library (e-books)

7 Workflow for batch import of MARC records to a library catalog
A sample file is received and reviewed by the cataloging administrator (the review may include subject librarians); Record quality is evaluated; A decision is made on local customization (is it needed, and if so, who will do it: the vendor or library catalogers); A decision is made to proceed or not proceed with the purchase of records.

8 Workflow for batch import of MARC records to a library catalog (Cont.)
The cataloging administrator creates a profile (template) for test loads; Testing (sample is loaded and reviewed; may be repeated as needed); Production load (done in University Computing Center); Review and clean up (done by cataloging staff and coordinated by the cataloging administrator).

9 Workflow for batch import of spreadsheet data to DSpace
IR Librarian works with faculty to prepare data; Spreadsheet data enhancement and enrichment by the IR Librarian and the Metadata Cataloger; Install and set up the Java-based add-on to facilitate DSpace batch import procedure by the DSpace Technician and the Metadata Cataloger; Program customization, sample data testing and source data adjustment by the Metadata Cataloger with assistance of a student programmer (when needed); Generate Submission Information Packages (SIPs) for individual collections; DSpace Server upload by the DSpace Technician.

10 Typical roles in batch import workflow
Professional catalogers/metadata experts create profiles (templates) for batch import/export; TS administrator coordinates batch loading to ILS; IR manager coordinates teamwork in batch loading to DSpace.

11 Typical roles in batch import workflow (Cont.)
In MARC/ILS projects, copy catalogers perform quality control (review records for duplicates, diacritics, errors and omissions) and local customization (notes, subjects, etc.).
In DC/DSpace projects, the metadata cataloger performs data standardization (field value format, e.g., date format, escaping special characters in XML, etc.) with the assistance of a student programmer.

12 The collections, team and workflow
The Waconda Lake collection
  Digital images of the Kansas archaeological artifacts;
  Source data provided by the Anthropology Dept.
Graduate Research and Scholarly Projects (GRASP) collection
  Conference papers presented at the annual symposiums of GRASP. Started in 2005;
  The Graduate School delivers a book and all the digital files of the papers to the library;
  Catalog and manually add records to DSpace from

13 The collections, team and workflow (Cont.)
The WSU Herbarium collection
  Contains over 4,500 specimens, dating back as far as 1895;
  Samples collected from Kansas, Massachusetts, Colorado, New Mexico, and Arizona;
  Faculty and interns in the Biological Science Department are responsible for digitization of the specimens and editing of the collected data.
WSU Libraries provide long-term preservation of and access to these collections through DSpace.
  Collaborate with faculty, department and Graduate School;
  Consultation on digitization specifications and department workflow;
  Compile, enrich, enhance and standardize source data;
  Batch load data to the DSpace server;
  Additional services: creating an image zoom site…
Library team: The Institutional Repository Librarian, the Metadata Cataloger and the DSpace Technician, with assistance from two WSU students

14 An add-on to facilitate DSpace batch import procedure
Google Summer of Code Project 2008;
Java-based program;
Transforms data prepared in Excel to DSpace batch import format;
Systems and environment (different packages created for):
  Windows environment/Microsoft Excel
  Windows environment/OpenOffice.org 2.4 Calc (spreadsheet)
  Linux environment/OpenOffice.org 2.4 Calc (spreadsheet)
Written by Blooma Mohan John of Nanyang Technological University;
It has been implemented at Nanyang (Singapore) and at other universities in France, India, China and the U.S.

15 An add-on to facilitate DSpace batch import procedure (Cont.)
What is in the Metadata Import Windows_Excel package?
  MetaDataImport (Java file)
  cEJ (sample spreadsheet)
  Documentation (in PDF)
Program Wiki:
Program download:

16 Submission Information Packages
Submission Information Packages (SIP)
Created for DSpace batch upload through the ItemImport procedure;
Each generated record folder will include:
  dublin_core.xml
  jpg image (or PDF, Word…)
  contents (content list file)
  metadata_dwc.xml (for the herbarium collection, generated from the customized program)

17 Algorithm of MetaDataImport
1. Start
2. Input the Main Submission Information Package (SIP) folder name
3. Create SIP folder with the name mentioned in Step 2
4. Store digital object metadata and location details in a Resultset
5. For each record in Resultset do Step 6 to Step 13
6. Create individual SIP folder
7. Create an xml file named dublin_core inside the SIP folder
8. Create a file named contents inside the SIP folder
9. Check the type of digital object
10. Add comments about digital object to dublin core file
11. Add digital object file name to contents file
12. Copy digital object from an external location to individual SIP folder
13. Add metadata details to Dublin Core file
14. Stop
(From
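The per-record part of this algorithm (creating the folder, writing dublin_core.xml and contents, copying the file) can be sketched in a few lines of modern Java. This is only an illustration under stated assumptions, not the MetaDataImport source: the class name SipBuilderSketch, the Item type and the buildSips method are hypothetical, rows are passed in as a list instead of a JDBC ResultSet, only the title element is written, the output is UTF-8 rather than the iso-8859-1 used in the slides, and the type check of Step 9 is omitted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical sketch of the SIP-building loop; not the MetaDataImport source. */
final class SipBuilderSketch {

    /** One spreadsheet row: a title and the location of its digital object. */
    static final class Item {
        final String title;
        final Path resourceLocation;

        Item(String title, Path resourceLocation) {
            this.title = title;
            this.resourceLocation = resourceLocation;
        }
    }

    /** Escapes the three characters that would break XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes one record folder per item under mainFolder (Steps 3 to 13, without Step 9). */
    static void buildSips(Path mainFolder, List<Item> items) throws IOException {
        Files.createDirectories(mainFolder);                          // Step 3
        String prefix = mainFolder.getFileName().toString();
        int recordNo = 1;
        for (Item item : items) {                                     // Step 5
            Path sip = Files.createDirectories(
                    mainFolder.resolve(prefix + "_" + recordNo++));   // Step 6
            String fileName = item.resourceLocation.getFileName().toString();
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                    + "<!-- digital object: " + fileName + " -->\n"   // Step 10
                    + "<dublin_core>\n"
                    + "  <dcvalue element=\"title\" qualifier=\"none\">"
                    + escape(item.title) + "</dcvalue>\n"             // Step 13
                    + "</dublin_core>\n";
            Files.write(sip.resolve("dublin_core.xml"),
                    xml.getBytes(StandardCharsets.UTF_8));            // Step 7
            Files.write(sip.resolve("contents"),
                    (fileName + "\n").getBytes(StandardCharsets.UTF_8)); // Steps 8 and 11
            Files.copy(item.resourceLocation, sip.resolve(fileName),
                    StandardCopyOption.REPLACE_EXISTING);             // Step 12
        }
    }
}
```

Calling buildSips with a main folder named GRASP and a single item would produce GRASP/GRASP_1/ containing dublin_core.xml, contents and the copied file, which is the folder layout the ItemImport procedure reads.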

18 Program installation and running
Program installation and setup:
  Install Java SE 6 for Windows (see appendix 1);
  Install a Java IDE, JCreator LE version for Windows (see appendix 2);
  Download the program at
  The Metadata_Import_Windows_Excel package will be used in the WSU cases;
  Create a DSN with the Microsoft Excel Driver to link the Java program to the source spreadsheet (see appendix 3).
Run the program (see appendix 4).

19 Tips
What is defined in the DSN needs to match the name and location of the spreadsheet, and
What is described in the resourceLocation field of the spreadsheet needs to match the actual locations of the PDF or image files.
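The second tip can be checked before the import program is run. The following is a minimal sketch, assuming the resourceLocation values have already been read from the spreadsheet into a list of strings; the class name ResourceLocationCheck and the method missingFiles are hypothetical and are not part of MetaDataImport.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for the resourceLocation column. */
final class ResourceLocationCheck {

    /** Returns every location that is blank or does not point to an existing regular file. */
    static List<String> missingFiles(List<String> resourceLocations) {
        List<String> missing = new ArrayList<>();
        for (String location : resourceLocations) {
            boolean blank = location == null || location.trim().isEmpty();
            if (blank || !Files.isRegularFile(Paths.get(location.trim()))) {
                missing.add(location);
            }
        }
        return missing;
    }
}
```

An empty result means every row points at a file that exists; any returned entry identifies a row to correct in the spreadsheet before generating the packages.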

20 Data preparation in Excel
An example: Graduate Research and Scholarly Projects (GRASP) papers
GRASP fields mapped to Dublin Core fields: title, contributor.author, identifier.citation, date.issued, description, description.abstract
Several common fields were added: publisher, language.iso, type, relation.ispartofseries; Values in these fields can be filled in by dragging and copying in the spreadsheet.
Naming fields in the Excel source file: contributorAuthor, dateIssued…

21 Program customization
Customize the section which includes all the metadata fields: change the field strings, the output of the DC elements and qualifiers, and the messages, e.g.
    String titleDC = rs.getString("title");
    if (titleDC == null || titleDC.trim().length() == 0) {
        // Report the record whose title cell is empty
        System.out.println("You have given null value for the title Dublin Core element of "
                + archiveFolderName + "_" + initialDocumentNo);
    } else {
        // Write the title as an unqualified DC element
        outDC.write(" <dcvalue element=\"title\" qualifier=\"none\">");
        outDC.write(titleDC);
        outDC.write("</dcvalue>\n");
    }
Data needs to be standardized and validated to guarantee correct display and program execution.

22 Tips
Fields in the Java program need to be correctly defined to reflect the columns in the Excel source file;
Upper and lower case in the metadata field names and their qualifiers (if any) must match exactly;
Dates need to be in a parseable format;
It is important to save the source file before re-running the program.

23 Data standardization
Date (e.g. “dateIssued”): Format date to “yyyy-mm-dd” in the Java program.
Special characters: These appear especially in the paper abstract field, such as "&", "<", ">", "'" and the double quotation mark; These XML reserved characters were replaced in the Java program.
Results: By running the customized MetaDataImport program, the SIP packages were generated for DSpace batch import. Each record package includes a dublin_core xml file, a PDF file and a contents list file.
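Both standardization steps can be illustrated with a short sketch. It is not the code used at WSU: the class FieldStandardizer and its two methods are hypothetical names, and it uses the java.time API, which is newer than the Java SE 6 setup described in the slides. Note that in Java date patterns the month is written "MM" (lowercase "mm" means minutes), so the “yyyy-mm-dd” target corresponds to the pattern yyyy-MM-dd.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helpers for date formatting and XML escaping of field values. */
final class FieldStandardizer {

    /** Replaces the five XML reserved characters with their predefined entities. */
    static String escapeXml(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '\'': out.append("&apos;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Parses a date written in inputPattern and returns it as yyyy-MM-dd. */
    static String toIsoDate(String raw, String inputPattern) {
        DateTimeFormatter in = DateTimeFormatter.ofPattern(inputPattern, Locale.US);
        // LocalDate.toString() always produces the ISO form, e.g. 2010-04-23
        return LocalDate.parse(raw.trim(), in).toString();
    }

    public static void main(String[] args) {
        System.out.println(toIsoDate("04/23/2010", "MM/dd/yyyy"));   // prints 2010-04-23
        System.out.println(escapeXml("Plants & \"weeds\" <draft>"));
    }
}
```

The ampersand has to be handled like any other character in a single pass, as here; replacing "&" after the other characters in a chain of string replacements would corrupt the entities already written.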

24 DSpace specification
DSpace Version 1.4.2
Java 1.4.2_17
Apache Ant 1.6.5
Tomcat
Postgres 8.1.4
We tested batch import on our DSpace server (version 1.4.2) and also on a DSpace test instance (version 1.5.2)

25 DSpace directories
Source Directory Layout: [dspace-source]
Installed Directory Layout: [dspace]
  assetstore/ - asset store files
  bin/ - shell and Perl scripts
  config/ - configuration, with sub-directories as above
  handle-server/ - Handles server files
  history/ - stored history files (generally RDF/XML)
  lib/ - JARs, including dspace.jar, containing the DSpace classes
  log/ - Log files
  reports/ - Reports generated by statistical report generator
  search/ - Lucene search index files
  upload/ - temporary directory used during file uploads etc.
  webapps/ - location where DSpace installs all Web Applications
    JSPUI Web application
    XMLUI Web Application (for version 1.5.2)

26 Server upload
Transfer generated data to the Linux server via flash drive, e.g.
    mnt]fdisk -l
    mnt]mount /dev/sde1 /mnt/usbflash/
Run the DSpace ItemImport command, e.g.
    bin]./dsrun org.dspace.app.itemimport.ItemImport -a -e -c 10057/2239 -s /mnt/usbflash/GRASP -m mapfileGRASP
10057/2239 is the collection handle in DSpace and mapfileGRASP is the name of the map file to be generated;
The e-person given with -e should have rights to the collection (e.g. /2239).
What if the process stops?
  Check the place where the program stopped running;
  Adjust the fields in the Java program, and standardize and validate the data;
  Make sure all fields are spelt correctly and registered in the DSpace metadata registry.

27 New challenges of the herbarium project
DarwinCore (DwC): DwC is a biodiversity information standard designed to facilitate the exchange of information about species and specimens in collections (Biodiversity Information Standards, Taxonomic Database Working Group, 2009).
Debates:
  Use DC elements only, or DwC only; use mostly DC, or mostly DwC;
  Use DSpace-based SOAR, or “Specify”, a database system designed for museum and herbarium data processing, or both.
Due to the time constraints of the faculty, SOAR was used as the depository, and two student interns were sought to assist in specimen digitization and source data editing.
To deposit data in SOAR, the DC standard needs to be followed, while an additional DwC metadata registry can be added as a supplement.

28 Specimen fields and mapping
The specimen-level information:
  family, scientific name, common name
  locality, country, state, county, elevation, latitude, longitude, habitat
  collector, identifier, date collected
  WSU collection number, image name
Most of the elements were mapped to DC where appropriate:
  “scientific name” is mapped to DC element “title”
  “common name” to “title.alternative”
  “locality”, “country”, “state” and “county” all mapped to “coverage.spacial”
  “collector” to “contributor.author”
  “date collected” to “date.created”
  “WSU collection number” to “identifier”
When there seems to be no DC element to map to, a DwC element is added.
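The mapping above amounts to a lookup table from spreadsheet column to target metadata field. A minimal sketch follows, with a hypothetical class name SpecimenFieldMapping. The DC entries are taken from the slide, including its "spacial" spelling of the qualifier. The pairing of "elevation", "latitude" and "longitude" with the verbatim* DwC elements is an assumption based on the element names used elsewhere in the deck, and the "identifier" and "image name" columns are left out because the slides do not state their targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup table from specimen spreadsheet columns to metadata fields. */
final class SpecimenFieldMapping {

    static Map<String, String> build() {
        Map<String, String> m = new LinkedHashMap<>();
        // Dublin Core targets, as listed on the slide
        m.put("scientific name", "dc.title");
        m.put("common name", "dc.title.alternative");
        m.put("locality", "dc.coverage.spacial");   // qualifier spelled as in the slides
        m.put("country", "dc.coverage.spacial");
        m.put("state", "dc.coverage.spacial");
        m.put("county", "dc.coverage.spacial");
        m.put("collector", "dc.contributor.author");
        m.put("date collected", "dc.date.created");
        m.put("WSU collection number", "dc.identifier");
        // Darwin Core targets for fields without a DC match (pairing is an assumption)
        m.put("family", "dwc.family");
        m.put("habitat", "dwc.habitat");
        m.put("elevation", "dwc.verbatimElevation");
        m.put("latitude", "dwc.verbatimLatitude");
        m.put("longitude", "dwc.verbatimLongitude");
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : build().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Keeping the table in one place makes the split visible: keys whose target starts with "dc." belong in dublin_core.xml, and those starting with "dwc." belong in the separate DwC file.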

29 DarwinCore registry and data preparation
A local DarwinCore registry was added to DSpace;
A few DwC elements were included: family, habitat, verbatimElevation, verbatimLatitude and verbatimLongitude
To make these fields searchable, local indexes need to be built.
The library further enriched the data by adding DC elements such as subject, rights, source, publisher, relation and relation.uri.
The locations of the access images were also specified in the spreadsheet.
Only the access image will be shown on the record homepage, and the large-sized herbarium image will be used as a zoomify source file in an external website;
The zoomed image page will be linked back to its corresponding DSpace record through the dc.identifier.uri field.

30 MetaDataImport modification
The MetaDataImport Java program was modified to generate the SIP packages. Following the same steps used to create the DC xml file (dublin_core.xml), additional code was added to the program to generate a separate xml file (metadata_dwc.xml).
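As a rough illustration of what such additional code has to produce, the sketch below builds the text of a metadata_dwc.xml file from a map of DwC element names to values. It is not the modified MetaDataImport code: the class DwcXmlSketch and the method toDwcXml are hypothetical, every qualifier is written as "none", empty values are skipped, and the declared encoding is UTF-8 rather than the iso-8859-1 used in the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical generator for the separate Darwin Core metadata file. */
final class DwcXmlSketch {

    /** Escapes the characters that would break XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds the XML text; the root element carries schema="dwc" as in the sample file. */
    static String toDwcXml(Map<String, String> values) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<dublin_core schema=\"dwc\">\n");
        for (Map.Entry<String, String> e : values.entrySet()) {
            String value = e.getValue() == null ? "" : e.getValue().trim();
            if (value.isEmpty()) {
                continue; // no element is written for an empty spreadsheet cell
            }
            xml.append("  <dcvalue element=\"").append(e.getKey())
               .append("\" qualifier=\"none\">")
               .append(escape(value))
               .append("</dcvalue>\n");
        }
        xml.append("</dublin_core>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("family", "Asclepidaceae");
        values.put("habitat", "Dry, rocky prairies, glades, ledges of bluffs.");
        System.out.print(toDwcXml(values));
    }
}
```

The schema attribute on the root element is what tells the importer to load these values into the dwc registry instead of the default dc one, which is why the elements must already be registered in DSpace.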

31 dublin_core.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- title of jpg HERBARIUM_3.jpg-->
<dublin_core>
  <dcvalue element="identifier" qualifier="none">2040.0</dcvalue>
  <dcvalue element="title" qualifier="none">Asclepias stenophylla A. Gray</dcvalue>
  <dcvalue element="title" qualifier="alternative">Narrow-leaved milkweed</dcvalue>
  <dcvalue element="coverage" qualifier="spacial">Sedgwick Co., KS</dcvalue>
  <dcvalue element="coverage" qualifier="spacial">North America</dcvalue>
  <dcvalue element="coverage" qualifier="spacial">United States</dcvalue>
  <dcvalue element="coverage" qualifier="spacial">Kansas</dcvalue>
  <dcvalue element="contributor" qualifier="author">Raugust, Barry M.</dcvalue>
  <dcvalue element="type" qualifier="none">image</dcvalue>
  <dcvalue element="date" qualifier="created"> </dcvalue>
  <dcvalue element="date" qualifier="issued"> </dcvalue>
  <dcvalue element="subject" qualifier="none">Asclepidaceae</dcvalue>
  <dcvalue element="subject" qualifier="none">Asclepias stenophylla A. Gray</dcvalue>
  <dcvalue element="subject" qualifier="none">Narrow-leaved milkweed</dcvalue>
  <dcvalue element="subject" qualifier="none">United States -- Kansas -- Sedgwick </dcvalue>
  <dcvalue element="rights" qualifier="none">Copyright Wichita State University, 2010</dcvalue>
  <dcvalue element="source" qualifier="none">WSU herbarium</dcvalue>
  <dcvalue element="identifier" qualifier="uri"> </dcvalue>
  <dcvalue element="publisher" qualifier="none">Wichita State University. Dept. of Biological Sciences</dcvalue>
  <dcvalue element="relation" qualifier="none">U.S. Dept. of Agriculture Natural Resources Conservation Service</dcvalue>
  <dcvalue element="relation" qualifier="uri"> </dcvalue>
</dublin_core>

32 metadata_dwc.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- title of jpg HERBARIUM_3.jpg-->
<dublin_core schema="dwc">
  <dcvalue element="family" qualifier="none">Asclepidaceae </dcvalue>
  <dcvalue element="habitat" qualifier="none"> Dry, rocky prairies, glades, ledges of bluffs. </dcvalue>
  <dcvalue element="verbatimElevation" qualifier="">1400’</dcvalue>
</dublin_core>

33 A sample record (in detailed view)
DC Field | Value | Language
dc.contributor.author | Raugust, Barry M. [collector] | en_US
dc.contributor.author | Raugust, Barry M. [cataloger] | en_US
dc.date.accessioned | T20:44:25Z | -
dc.date.available | T20:44:25Z | -
dc.date.created | | en_US
dc.date.issued | | en_US
dc.identifier | | en_US
dc.identifier.uri | | en_US
dc.identifier.uri | | -
dc.format.extent | bytes |
dc.format.mimetype | image/jpeg |
dc.language.iso | en_US |
dc.publisher | Wichita State University. Dept. of Biological Sciences | en_US
dc.relation | U.S. Dept. of Agriculture Natural Resources Conservation Service | en_US
dc.relation.uri | | en_US
dc.rights | Copyright Wichita State University, 2010 | en_US
dc.source | WSU herbarium | en_US
dc.subject | Asclepidaceae | en_US
dc.subject | Asclepias stenophylla A. Gray | en_US
dc.subject | Narrow-leaved milkweed | en_US
dc.subject | United States -- Kansas -- Sedgwick County | en_US
dc.title | Asclepias stenophylla A. Gray | en_US
dc.title.alternative | Narrow-leaved milkweed | en_US
dc.type | image | en_US
dc.coverage.spacial | Sedgwick Co., KS | en_US
dc.coverage.spacial | North America | en_US
dc.coverage.spacial | United States | en_US
dc.coverage.spacial | Kansas | en_US
dwc.family | Asclepidaceae | -
dwc.verbatimElevation | 1400' | -
dwc.habitat | Dry, rocky prairies, glades, ledges of bluffs. | en_US
Appears in Collections: Vascular Plants

34 A sample record as shown in DSpace

35 The external zoom site

36 Some reflections
Data mapping and transformation challenges;
Excel as source data and tool;
“Mashing up” data from various sources;
Add additional metadata registry to DSpace;
System upgrades, program compatibility and the library’s reaction;
Collaborations.

37 Conclusion
Metadata repurposing and batch processing at WSU Libraries promote the sharing and reuse of data resources and significantly improve the cataloging workflow;
While each new metadata-related project in a library brings many challenges, standards, XML technology, data processing strategies and metadata management all help pave the way to success;
Tools and programs can play an important role in metadata batch processing;
Collaboration and coordination within the library and beyond put the pieces together and make a project possible.

38 Appendices
Appendix 1: Install Java SE 6 for Windows
Appendix 2: Install Java IDE, JCreator LE version for Windows
Appendix 3: Create DSN (Microsoft Excel Driver)
Appendix 4: Run the MetaDataImport program
Appendix 5: Upload SIP to DSpace Server

39 Appendix 1: Install Java SE 6 for Windows
To install Java, which is the prerequisite for the DSpace batch import program, go to the link below (Install Java SE 6 for Windows), download the installer to your computer and run it.

40 Appendix 1: Install Java SE 6 for Windows
Read carefully the rules; Check the radio button on the top which says “yes” and click on “next” to go to the next step.

41 Appendix 1: Install Java SE 6 for Windows
Select the directory where you want to install Java. Remember the location, as it has to be set later as the default home directory for Java in the program.

42 Appendix 1: Install Java SE 6 for Windows
It will ask you to create a new directory. Select the option carefully, as this directory has to be chosen for the JCreator program.

43 Appendix 1: Install Java SE 6 for Windows
If you have a secure environment, it will generate a security alert. Click on “allow access” to add an exception to your firewall.

44 Appendix 1: Install Java SE 6 for Windows
Type in the admin credentials; you can leave the default ports.

45 Appendix 1: Install Java SE 6 for Windows
If you need any of the additional services, you can select from this menu or you can leave them as default.

46 Appendix 1: Install Java SE 6 for Windows
Now it has everything configured; click on “install now” to install the program.

47 Appendix 1: Install Java SE 6 for Windows
Now you should be able to see the progress of the installation.

48 Appendix 1: Install Java SE 6 for Windows
You can either register with Sun or skip the registration process.

49 Appendix 1: Install Java SE 6 for Windows
Once the installation is done, click on “finish”. This completes the Java installation, which is the prerequisite for the next program installation.

50 Appendix 2: Install Java IDE, JCreator LE version for Windows
You can download this at

51 Appendix 2: Install Java IDE, JCreator LE version for Windows
In our scenario, we have selected Windows XP. Click on the “Download” button to start the installation.

52 Appendix 2: Install Java IDE, JCreator LE version for Windows
Read the license agreements carefully and click on “I accept the agreement” to agree and click on “next” to go to the next step.

53 Appendix 2: Install Java IDE, JCreator LE version for Windows
Choose the directory where you want to install this program and click on “next” to go to the next step.

54 Appendix 2: Install Java IDE, JCreator LE version for Windows
If you have selected the default option from the previous step, it will ask you to create a new folder; click “yes” to create a new folder.

55 Appendix 2: Install Java IDE, JCreator LE version for Windows
Select the option which you want to have and click on “next” to go to the next step.

56 Appendix 2: Install Java IDE, JCreator LE version for Windows
Once the program is installed, click on “finish” to launch the program.

57 Appendix 2: Install Java IDE, JCreator LE version for Windows
Here are the settings for JCreator. Follow the steps carefully and in sequence.

58 Appendix 2: Install Java IDE, JCreator LE version for Windows
Click on “next” with the following selections.

59 Appendix 2: Install Java IDE, JCreator LE version for Windows
This is the most important step, where you select the path of Java (the prerequisite installed earlier). Refer to Appendix 1 for the default Java home and click on “new” to browse and add the path. When you are done, click on “apply”. This completes the prerequisites of the batch import program. If you are not familiar with the JCreator IDE, go through the help section “Creating your first application”.

60 Appendix 3: Create DSN (Microsoft Excel Driver)
Go to “Start-Control Panel-Administrative Tools-Data sources (ODBC)”, and click “Add…”

61 Appendix 3: Create DSN (Microsoft Excel Driver)
Select “Microsoft Excel Driver (*.xls)” and click “Finish”.

62 Appendix 3: Create DSN (Microsoft Excel Driver)
Fill in “Data Source Name” (DSN) as “connExcelJava”; Click “Select Workbook” to choose the Excel source file.

63 Appendix 3: Create DSN (Microsoft Excel Driver)
DSN for Excel driver is created.

64 Appendix 4: Run the MetaDataImport program
Run MetaDataImport using Java IDE.

65 Appendix 4: Run the MetaDataImport program

66 Appendix 4: Run the MetaDataImport program
Results (with only Dublin Core elements)

67 Appendix 4: Run the MetaDataImport program
Results (with Darwin Core elements)

68 Appendix 4: Run the MetaDataImport program
Generated SIP Individual record package

69 Appendix 5: Upload SIP to DSpace Server
bin]# ./dsrun org.dspace.app.itemimport.ItemImport -a -e -c 10057/1996 -s /mnt/usbflash/HERBARIUM/ -m mapfileHERBARIUM
Destination collections:
Owning Collection: HERBARIUM
Adding items from directory: /mnt/usbflash/HERBARIUM/
Generating mapfile: mapfileHERBARIUM
Adding item from directory HERBARIUM_1
Adding item from directory HERBARIUM_2
Adding item from directory HERBARIUM_3
Loading dublin core from /mnt/usbflash/HERBARIUM//HERBARIUM_3/dublin_core.xml
Schema: dc Element: identifier Qualifier: none Value:
Schema: dc Element: title Qualifier: none Value: Asclepias stenophylla A. Gray
Schema: dc Element: title Qualifier: alternative Value: Narrow-leaved milkweed
Loading dublin core from /mnt/usbflash/HERBARIUM/HERBARIUM_3/metadata_dwc.xml
Schema: dwc Element: family Qualifier: none Value: Asclepidaceae
Schema: dwc Element: habitat Qualifier: none Value: Dry, rocky prairies, glades, ledges of bluffs.
Schema: dwc Element: verbatimElevation Qualifier: none Value: 1400’
Processing contents file: /mnt/usbflash/HERBARIUM//HERBARIUM_3/contents
Bitstream: HERBARIUM_3.jpg
Processing handle file: handle

70 Acknowledgements
Susan Matveyeva, Catalog and Institutional Repository Librarian;
Sai Deng, Metadata Cataloger;
Baseer Khan, Technology Support Consultant II and DSpace Technician;
Levi Wang, WSU Graduate, Java Programming.
Thanks to Nancy Deyoe and the Cataloging Department at WSU Libraries for providing information on historical projects during Susan’s informal interview for the introduction part of this presentation.

71 Thank you!

