Enhancing the Performance and Extensibility of the XC MetadataServicesToolkit
Ben Anderson, Software Engineer, XCO

Download this presentation:

Timeline
– Jennifer Bowen presented at code4lib 2/10
– I began at XCO 3/10
– work began on 0.3 4/
– released 1/
– released
– 1.0 released

XC Software Components
– XC Drupal Toolkit: user interface for searching and browsing; library website (on Drupal)
– XC Metadata Services Toolkit: tools for automated processing of large batches of metadata
– XC OAI Toolkit and XC NCIP Toolkit: tools for connectivity between XC and an ILS (circ. status/req., authentication)
– Data sources: Integrated Library System, MARCXML (6M records); repository (irplus), DC-TERMS (13k records)


Learn more about XC at

One Example of Process Flow
– MARC BIB record from external repository
– Normalized MARC BIB record from normalization service
– FRBRized records from transformation service: work, expression, manifestation

MST Logical Process
– OAI-PMH Harvest
– MARC Normalization Service
– MARC-XC Transformation Service
– Pseudo OAI-PMH Harvests
– OAI-PMH Harvestable
– Diagram labels: provider, caches, repo

Add an External Repository

Schedule a Harvest

Configure Processing Rules

Browse Records

Goals for 0.3
– Each service should process one million records per hour on an “average library server”
  – 1.5 GHz SPARC V9
  – 8G RAM (3G for the JVM)
  – 10k RPM hard drive
– Services should have little to no degradation as the size of a repository grows
  – University of Rochester has 6M records
– Implementing a service should be easy
  – it should require no knowledge of MST internals
  – it should not be up to the service implementer to figure out how to build and package their service

Determine Throughput of 0.2
Using the MARC Normalization service as our metric, the first million records were processed at an average speed of:
– 29 ms/record = 120k/hr (goal is 3.6 ms/rec = 1M/hr)
Before the service had processed 2 million records, the process ground to a halt (the goal was little to no degradation up to at least 6 million records).
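The conversion between per-record latency and hourly throughput used on this slide can be checked with a few lines of Java. This is only an arithmetic sketch; the class and method names are made up for illustration and are not part of the MST.

```java
/** Illustrative helper (not MST code): converts ms/record to records/hour and back. */
final class Throughput {
    private static final double MS_PER_HOUR = 3_600_000.0;

    /** Records processed per hour at a given average latency. */
    static long recordsPerHour(double msPerRecord) {
        return Math.round(MS_PER_HOUR / msPerRecord);
    }

    /** Average latency needed to reach a given hourly throughput. */
    static double msPerRecord(long recordsPerHour) {
        return MS_PER_HOUR / recordsPerHour;
    }

    public static void main(String[] args) {
        // 0.2 baseline from the slide: 29 ms/record, roughly 120k records/hr.
        System.out.println(recordsPerHour(29.0));
        // Goal from the slide: 1M records/hr, which needs 3.6 ms/record.
        System.out.println(msPerRecord(1_000_000L));
    }
}
```

At 29 ms per record the exact figure is a little over 124k records per hour, which the slide rounds to 120k.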

Determine Bottlenecks with TimingLogger
This code produces this output
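The code and output from this slide did not survive in the transcript. As a rough idea of the technique, here is a minimal timing logger that accumulates time per labeled phase and reports an average per record. It is a hypothetical stand-in: the class name `SimpleTimingLogger` and its methods are assumptions and do not reflect the MST's actual TimingLogger API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a timing logger; NOT the MST's actual TimingLogger API. */
final class SimpleTimingLogger {
    private final Map<String, Long> startedAt = new LinkedHashMap<>();
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    /** Marks the start of one timed phase, e.g. "create DOM". */
    void start(String label) {
        startedAt.put(label, System.nanoTime());
    }

    /** Ends the phase and adds the elapsed time to that label's running total. */
    void stop(String label) {
        Long begin = startedAt.remove(label);
        if (begin == null) {
            throw new IllegalStateException("stop() called before start(): " + label);
        }
        totalNanos.merge(label, System.nanoTime() - begin, Long::sum);
    }

    /** Average milliseconds spent in a phase per processed record. */
    double averageMillis(String label, long recordCount) {
        return totalNanos.getOrDefault(label, 0L) / 1_000_000.0 / recordCount;
    }

    /** One line per phase, in the order the phases were first stopped. */
    String report(long recordCount) {
        StringBuilder out = new StringBuilder();
        for (String label : totalNanos.keySet()) {
            out.append(String.format("%s: %.3f ms/record%n", label, averageMillis(label, recordCount)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SimpleTimingLogger timer = new SimpleTimingLogger();
        timer.start("create DOM");
        timer.stop("create DOM");
        System.out.print(timer.report(1));
    }
}
```

Wrapping each phase of the record loop in `start`/`stop` calls gives the kind of per-phase breakdown shown on the next slide.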

Bottleneck Breakdown
29 ms per record
– 2.5 ms to create DOM
– 5 ms for actual service processing (the innards of the MARC-Normalization service)
– 21 ms for querying Solr and inserting
  – This is the average; both querying and inserting are done in batch. I had a hard time separating the two.

0.2 Design
– MySQL
  – All data needed for the UI except for searching and browsing records
  – All data needed for configuring harvests, services, processing rules, etc.
– Solr
  – Text indexes necessary for searching and browsing records
  – All record/repository data

0.3 Design: Change to use MySQL
– MySQL
  – All data needed for the UI except for searching and browsing records
  – All data needed for configuring harvests, services, processing rules, etc.
  – All record/repository data
– Solr
  – Doesn’t store any data
  – Used only for indexing records to support searching in the UI

0.3 Design – Keep the table sizes small
– One index for all repositories
– Each external repository cache and each service gets its own set of database tables
  – external provider repo
  – normalization repo
  – transformation repo

0.3 Design - Yes, a boring ERD
– record_updates (record_id, update_date)
– records_xml (record_id, xml)
– record_sets
– record_predecessors (record_id, pred_record_id)
– Cardinality labels in the diagram: one or more per record, zero or more per record, one per record

Did that improve things?
11 ms per record (previously 29)
– 2.5 ms to create DOM
– 5 ms for actual service processing (the innards of the MARC-Normalization service)
– 3.5 ms (previously 21) for querying MySQL and inserting into MySQL
  – again, both querying and inserting are done in batch
The query time is almost nil; it’s the inserting that takes time.
It’s faster, but still nearly 3x slower than our goal.
The performance showed little to no degradation.

Get rid of XPath
XPath isn’t a bad technology, but when you’re optimizing for performance, it can be beneficial to find other ways to accomplish the same task.
So, I changed this code… to this code…
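The before and after code from this slide is missing from the transcript. The sketch below shows the general kind of change described, using only the JDK: the same subfield lookup done once with an XPath expression and once by walking DOM child nodes directly. The element and attribute names (`record`, `datafield`, `subfield`, `tag`, `code`) are a simplified, namespace-free MARCXML-like layout assumed for illustration, and `SubfieldLookup` is a hypothetical class, not MST code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative comparison (not MST code): XPath lookup vs. direct DOM traversal. */
final class SubfieldLookup {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** "Before": evaluates an XPath expression for every lookup. */
    static List<String> withXPath(Document doc, String tag, String code)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "/record/datafield[@tag='" + tag + "']/subfield[@code='" + code + "']";
        NodeList hits = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            values.add(hits.item(i).getTextContent());
        }
        return values;
    }

    /** "After": walks the child nodes directly, with no expression evaluation. */
    static List<String> withDomWalk(Document doc, String tag, String code) {
        List<String> values = new ArrayList<>();
        Node field = doc.getDocumentElement().getFirstChild();
        for (; field != null; field = field.getNextSibling()) {
            if (!isElement(field, "datafield") || !tag.equals(((Element) field).getAttribute("tag"))) {
                continue;
            }
            for (Node sub = field.getFirstChild(); sub != null; sub = sub.getNextSibling()) {
                if (isElement(sub, "subfield") && code.equals(((Element) sub).getAttribute("code"))) {
                    values.add(sub.getTextContent());
                }
            }
        }
        return values;
    }

    private static boolean isElement(Node node, String name) {
        return node.getNodeType() == Node.ELEMENT_NODE && name.equals(node.getNodeName());
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<record><datafield tag=\"245\">"
                + "<subfield code=\"a\">pb&amp;j</subfield></datafield></record>");
        System.out.println(withXPath(doc, "245", "a"));
        System.out.println(withDomWalk(doc, "245", "a"));
    }
}
```

Both methods return the same values; the traversal version trades generality for less work per record, which is the trade-off the slide describes.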

Did that improve things?
7 ms per record (previously 11)
– 2.5 ms to create DOM
– 1.0 ms (previously 5) for actual service processing (the innards of the MARC-Normalization service)
– 3.5 ms for MySQL inserts
It’s faster, but still nearly 2x slower than our goal.

Delayed Indexing in MySQL
MySQL modifies table indexes with each insert. It is faster to drop the indexes, insert lots of rows into the tables, and then add the indexes back.
– This is the way mysqldump works
– This means you can’t read the data while doing an insert. No big deal – we’ll just do it during large loads.
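A minimal JDBC sketch of the drop, load, re-create pattern described above. The table, index, and column names passed in are placeholders chosen by the caller, and the `DelayedIndexLoad` class and its `BulkInsert` callback are hypothetical; the MST's own implementation is not shown in this transcript.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Illustrative sketch (not MST code): drop an index, bulk load, then rebuild the index. */
final class DelayedIndexLoad {

    /** Caller-supplied bulk insert work; hypothetical callback type. */
    interface BulkInsert {
        void run(Connection conn) throws SQLException;
    }

    static String dropIndexSql(String table, String index) {
        return "DROP INDEX " + index + " ON " + table;
    }

    static String createIndexSql(String table, String index, String column) {
        return "CREATE INDEX " + index + " ON " + table + " (" + column + ")";
    }

    /** Runs the inserts with the index removed, and always tries to put the index back. */
    static void load(Connection conn, String table, String index, String column, BulkInsert inserts)
            throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(dropIndexSql(table, index));
            try {
                inserts.run(conn);
            } finally {
                // Rebuild once, after all rows are in, instead of updating per insert.
                st.execute(createIndexSql(table, index, column));
            }
        }
    }
}
```

Because the index is missing while the load runs, this only suits large loads where the table does not need to be queried at the same time, as the slide notes.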

Did that improve things?
6 ms per record (previously 7)
– 2.5 ms to create DOM
– 1.0 ms for actual service processing (the innards of the MARC-Normalization service)
– 2.2 ms (previously 3.5) for MySQL inserts
It’s faster, but still nearly 2x slower than our goal.

Batch Prepared Statements
Java/JDBC provides a highly performant method for sending large chunks of data to the db at once using batch prepared statements. There’s no way to speed this part up… or so I thought…
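For readers unfamiliar with the JDBC feature mentioned here, a minimal sketch of a batched insert follows. The `records_xml (record_id, xml)` table comes from the ERD slide; the `BatchInsert` class, the method name, and the batch size parameter are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

/** Illustrative sketch (not MST code): inserting records with a batched PreparedStatement. */
final class BatchInsert {

    /**
     * Queues rows with addBatch() and sends them in groups of batchSize.
     * Returns how many batches were sent to the database.
     */
    static int insertRecords(Connection conn, Map<Long, String> xmlById, int batchSize)
            throws SQLException {
        String sql = "INSERT INTO records_xml (record_id, xml) VALUES (?, ?)";
        int batchesSent = 0;
        int pending = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Map.Entry<Long, String> row : xmlById.entrySet()) {
                ps.setLong(1, row.getKey());
                ps.setString(2, row.getValue());
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    batchesSent++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                // Flush the final, partially filled batch.
                ps.executeBatch();
                batchesSent++;
            }
        }
        return batchesSent;
    }
}
```

The statement is prepared once and reused for every row, so the per-row cost is mostly parameter binding plus one round trip per batch.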

LOAD DATA INFILE
When I discussed db optimizations with XC’s Drupal Toolkit developer, Peter Kiraly, he said PHP didn’t have the same ability. Instead he’d have to write out a CSV file and load that in. I figured I might as well try it.
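A rough sketch of the write-a-file-then-load approach follows. It writes tab-delimited rows (MySQL's default field format for LOAD DATA) rather than comma-separated ones, and builds the statement text. The `LoadDataInfile` class, its method names, and the choice of delimiters are assumptions; the slide does not show the actual file format or statement the MST uses. The server-side form of LOAD DATA INFILE also assumes the MySQL server can read the file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Illustrative sketch (not MST code): dump rows to a delimited file for LOAD DATA INFILE. */
final class LoadDataInfile {

    /** Escapes backslash, tab, and newline so a value stays on one line in one field. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    /** Writes one "record_id<TAB>xml" line per record. */
    static void writeRows(Path file, Map<Long, String> xmlById) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<Long, String> row : xmlById.entrySet()) {
                out.write(row.getKey() + "\t" + escape(row.getValue()) + "\n");
            }
        }
    }

    /** Builds the statement that loads the file written by writeRows(). */
    static String loadSql(Path file, String table) {
        String path = file.toAbsolutePath().toString().replace("\\", "/").replace("'", "''");
        return "LOAD DATA INFILE '" + path + "' INTO TABLE " + table
                + " FIELDS TERMINATED BY '\\t' ESCAPED BY '\\\\'"
                + " LINES TERMINATED BY '\\n' (record_id, xml)";
    }
}
```

The resulting statement would be executed through an ordinary JDBC `Statement`; the database then parses the whole file in one operation instead of receiving rows through the driver.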

Did that improve things?
4 ms per record (previously 6)
– 2.5 ms to create DOM
– 1.0 ms for actual service processing (the innards of the MARC-Normalization service)
– 0.6 ms (previously 2.2) for MySQL inserts
Pretty close, but still not there.

Sometimes it’s the little things
DomFactoryBuilderDOAServiceFactoryFactoryImpl
I knew enough not to create the DocumentBuilderFactory each time, but didn’t realize creating the DocumentBuilder each time would have that much of an effect.
Code was… Code is now…
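The two code snippets from this slide are not in the transcript. The sketch below illustrates the idea with the standard JAXP API: create the `DocumentBuilder` once and reuse it for every record instead of asking the factory for a new one each time. The `ReusableDomParser` class is hypothetical. A `DocumentBuilder` is not guaranteed to be thread-safe, so this assumes one parser instance per thread.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative sketch (not MST code): one DocumentBuilder reused across many records. */
final class ReusableDomParser {
    // Created once; the expensive factory lookup and builder setup are not repeated per record.
    private final DocumentBuilder builder = newBuilder();

    private static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create DocumentBuilder", e);
        }
    }

    /** Parses one record with the shared builder. Use one instance per thread. */
    Document parse(String xml) throws SAXException, IOException {
        builder.reset(); // restore the builder to its initial configuration before reuse
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        ReusableDomParser parser = new ReusableDomParser();
        System.out.println(parser.parse("<record/>").getDocumentElement().getNodeName());
        System.out.println(parser.parse("<other/>").getDocumentElement().getNodeName());
    }
}
```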

Did that improve things?
3 ms per record (previously 4)
– 0.9 ms (previously 2.5) to create DOM
– 1.0 ms for actual service processing (the innards of the MARC-Normalization service)
– 0.6 ms for MySQL inserts
WE DID IT! We have exceeded our goal!

0.2 Service Development
Internals of the MST were exposed to the service developer and the developer was expected to re-implement much of this internal code.

code.google.com/p/xcmetadataservicestoolkit/

0.3.x Service Development
Install Java, Ant, MySQL
$ wget '
$ unzip example dev-env.zip
$ cd example
$ ant retrieve
$ ant -Dtest=ProcessFiles test
$ ls -ladh ./build/test/actual_output_records/1/*
$ ant zip

Input Files for Testing
$ ls -1 ./test/input_records/1/* | xargs cat
oai:mst.rochester.edu:bib:1 pb&j oai:mst.rochester.edu:bib:1 pb&j

Output Files from Testing
$ ls -1 ./build/test/actual_output_records/1/* | xargs cat
oai:mst.rochester.edu:example/1 oai:mst.rochester.edu:bib:1 pb&j you've been foobarred!

Implementing in Code

More tidbits for interested implementers
– The MST is now configured via Spring
  – each service is given its own application context as well as its own classloader
  – This means it can use all the objects and services from the MST without worrying about collisions (naming or dependencies) with other services
– Each service is given its own db schema (again, so you don’t have to worry about name collisions). The db schema is prefixed with “xc_”

Other Services
– MARC-XC-Transformation
  – Just as fast as the MARC Normalization service
– DC-XC-Transformation
  – Initially contributed by Kyushu University (in Japan), now one of our core services

Photo Credits
All photos taken from flickr.com
– “Brick Wall” by somenametoforget
– “Snail” by DRB62
– “Paris Train” by Pictr 30D
– “Spaghetti with tomato sauce” by HatM
– “Hawk in Flight” by Nick Chill
– “Tortoise” by GraphicReality

Final Numbers (1.5 GHz CPU)
– 0.2
  – 120k records / hr
  – 29 ms / record
  – fell down before 2M records processed
  – not easily extensible
– 0.3
  – 3.0 ms / record (about 1.2M records / hr)
  – processed 16M records with no degradation
  – easily extensible

Download XC software at eXtensibleCatalog.org
contact me at