1
U of R eXtensible Catalog Team MetaCat
2
Problem Domain
3
A Modern Library: Card catalogs are stored on a computer. Card catalogs store metadata about books: subject, author(s). Searching for a book is done via an OPAC (Online Public Access Catalog). Example: http://albert.rit.edu/
4
Card Catalog Metadata: Two types of records. A bibliographic record represents a book and is linked to multiple authority records. An authority record represents a single author or subject. Metadata has been hand-typed by librarians across the country. MARC: MAchine Readable Cataloging (XML), specifies both bibliographic and authority record formats. Dublin Core: also an XML format, but only for bibliographic records.
5
Metadata Issues Since metadata has been hand-typed, it may be inconsistent An author could be: “Mark Twain” “Twain, Mark” “M. Twain” “Samuel Clemens” If a user searches for “Mark Twain”, the search may not return all related books
6
Goals Bibliographic Record Author field Name Date of Birth, Death Authority Record Authorized Form Alternate Forms: Alternate form 1 Alternate form 2 … See Also References to other authority records
7
Sponsor’s Solution
8
Iterative Process Flow
9
Metrics Effort by type of activity Test metrics (JUnit) Defects by types
10
Effort by Type (hours)
Period           Meeting   Development   Documentation
Before           ~40       0             0
1/12-1/18        45        29            2
1/18-1/25        20        43            5
1/26-2/1 (R1)    24        41            4
2/2-2/8          20        31            7
2/9-2/15 (R2)    24        2             0
Total            133       146           18
11
Effort by Type
12
Issuetracker Initially, issues were not recorded properly. The issue tracker is used to track: 1. Issues (design, documentation, process) 2. Bugs 3. Discussions (new features, nice to have)
13
Issuetracker
14
Defects by Type
15
Status
3.1 Import a record into database
(R1) FR-1.1: The system shall parse the XML record.
(R2) FR-1.2: The system shall store the information obtained from parsing the XML record in the MySQL database.
(R1) FR-1.3: The system shall be able to import multiple records at once (batch processing).
(R1) FR-1.4: The system shall normalize strings.
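FR-1.4 does not say which normalization steps are applied. The sketch below shows one plausible set, purely as an assumption: strip diacritics, replace punctuation with spaces, collapse whitespace, and lowercase. The class name NameNormalizer is hypothetical and not taken from the project.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; the actual normalization rules are not given in the slides.
final class NameNormalizer {
    private NameNormalizer() {}

    static String normalize(String raw) {
        // Decompose accented characters, then drop the combining marks.
        String s = Normalizer.normalize(raw, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Replace ASCII punctuation with spaces so "Twain, Mark." loses its comma and period.
        s = s.replaceAll("\\p{Punct}+", " ");
        // Trim, collapse runs of whitespace, and lowercase in a locale-independent way.
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

With these assumed rules, "  Twain,  Mark. " and "twain mark" normalize to the same string, so later comparisons do not depend on spacing, case, or punctuation.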
16
Status cont.
3.2 Matching records
(R1) FR-2.1: The system shall create a new authority record.
(R2) FR-2.2: The system shall match two strings and give a confidence level of the matching.
(R2) FR-2.3: The system shall store the results of the matching, including the degree of certainty and the link(s) to the matched authorized record(s).
(R1) FR-2.4: The system shall identify all unprocessed records in the records database. The unprocessed records are the records that have not yet been matched.
(R1) FR-2.5: The system shall create a new authority record, and store it in the database.
17
Status cont.
(R1) FR-2.6: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the degree of certainty is above the auto-accept threshold.
(R2) FR-2.7: The system shall mark the record to be reviewed by a person if the degree of certainty is between the auto-accept threshold and the auto-reject threshold.
FR-2.8: The system shall create a new authority record using the information from the current record, and create a link between those two records, if the degree of certainty is below the auto-reject threshold.
(R1) FR-2.9: The system shall analyze unprocessed records on demand.
(R1) FR-2.10: The system shall attempt to match records first by comparing authority names.
(R2) FR-2.11: The system shall attempt to match records by comparing alternative names if the first attempt (FR-2.10) fails.
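The three-way decision in FR-2.6 through FR-2.8 can be sketched as below. This is a minimal illustration, not the project's code: the class name MatchTriage, the enum names, and the treatment of values exactly equal to a threshold (sent to review here) are assumptions, and the slides do not give the threshold values.

```java
// Hypothetical sketch of the auto-accept / review / auto-reject decision.
final class MatchTriage {
    enum Decision { AUTO_ACCEPT, NEEDS_REVIEW, CREATE_NEW_AUTHORITY }

    private final double autoAccept;
    private final double autoReject;

    MatchTriage(double autoAccept, double autoReject) {
        if (autoReject > autoAccept) {
            throw new IllegalArgumentException("auto-reject must not exceed auto-accept");
        }
        this.autoAccept = autoAccept;
        this.autoReject = autoReject;
    }

    Decision decide(double confidence) {
        // FR-2.6: above the auto-accept threshold, link to the authorized form.
        if (confidence > autoAccept) return Decision.AUTO_ACCEPT;
        // FR-2.8: below the auto-reject threshold, create a new authority record.
        if (confidence < autoReject) return Decision.CREATE_NEW_AUTHORITY;
        // FR-2.7: everything in between (boundaries included, by assumption) goes to a person.
        return Decision.NEEDS_REVIEW;
    }
}
```

For example, with illustrative thresholds of 0.9 and 0.5, a confidence of 0.7 would be queued for human review.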
18
Status cont.
3.5 Review possible matches
(R2) FR-5.1: The system shall gather a collection of records that are marked for review from the database. Questionable matches have a degree of certainty between the auto-accept threshold and the auto-reject threshold.
(R2) FR-5.2: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.
(R2) FR-5.3: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.
19
Our Solution
20
Architecture
21
Matcher In NACM, we need to be able to match Bibliographic records (books) to Authorized records (authors). The information in the records may not always match exactly, or may match multiple records!
22
Matching Problems Different forms of the same name: Nate versus Nathan, typos. Different authors with the same name: George Bush (41) versus George Bush (43). Aliases or pen names: Samuel Clemens versus Mark Twain.
23
Matching Problems To assist in matching different forms of an author’s name, Authority records have a list of alternate names in addition to the authorized form. Alternate names may not be distinct.
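A rough sketch of how an authority record with alternate names might be looked up: try authorized forms first, then fall back to alternate forms, in the order described by FR-2.10 and FR-2.11. Because alternate names may not be distinct, the lookup returns a list rather than a single record. The class names AuthorityRecord and AuthorityIndex, and the sample headings used with them, are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model; the project stores these records in MySQL.
final class AuthorityRecord {
    final String authorizedForm;
    final Set<String> alternateForms;

    AuthorityRecord(String authorizedForm, String... alternateForms) {
        this.authorizedForm = authorizedForm;
        this.alternateForms = new LinkedHashSet<>(Arrays.asList(alternateForms));
    }
}

final class AuthorityIndex {
    private final List<AuthorityRecord> records = new ArrayList<>();

    void add(AuthorityRecord record) {
        records.add(record);
    }

    // Returns every record that matches; more than one hit means the name is ambiguous.
    List<AuthorityRecord> lookup(String name) {
        List<AuthorityRecord> hits = new ArrayList<>();
        // First attempt: compare against authorized forms only.
        for (AuthorityRecord r : records) {
            if (r.authorizedForm.equals(name)) hits.add(r);
        }
        if (!hits.isEmpty()) return hits;
        // Second attempt: compare against alternate forms.
        for (AuthorityRecord r : records) {
            if (r.alternateForms.contains(name)) hits.add(r);
        }
        return hits;
    }
}
```

If two records both list "George Bush" as an alternate form, looking up that name returns both, which is the kind of case that would need a lower confidence or human review.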
24
Matcher Design We need a matching strategy that is easy to extend to add new matching rules, while still being fast.
25
Matching Subsystem
26
MatchStrategy Abstract class that defines the basics of a matching rule Matching method Match confidence All matching strategies extend this class
27
StringTransformer Abstract class for string manipulation rules String transform method Transformation confidence All string manipulation rules extend this class
28
MatchDriver Handles performing a match Creates pairs of strategies & transformations Sorts Pairs based on overall confidence Iterates through the pairs looking for matches
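The driver behavior described above can be sketched as follows. The class names MatchStrategy, StringTransformer, and MatchDriver come from the slides; everything else is an assumption made for illustration, including the method signatures, the concrete rules (ExactMatch, Identity, LowerCase), their confidence values, and the choice to compute a pair's overall confidence as the product of the two confidences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.OptionalDouble;

abstract class MatchStrategy {
    abstract double confidence();
    abstract boolean matches(String candidate, String authorized);
}

abstract class StringTransformer {
    abstract double confidence();
    abstract String transform(String input);
}

// Illustrative rules; the confidence values are made up.
final class ExactMatch extends MatchStrategy {
    double confidence() { return 1.0; }
    boolean matches(String candidate, String authorized) { return candidate.equals(authorized); }
}

final class Identity extends StringTransformer {
    double confidence() { return 1.0; }
    String transform(String input) { return input; }
}

final class LowerCase extends StringTransformer {
    double confidence() { return 0.9; }
    String transform(String input) { return input.toLowerCase(Locale.ROOT); }
}

final class MatchDriver {
    private static final class Pair {
        final MatchStrategy strategy;
        final StringTransformer transformer;
        Pair(MatchStrategy strategy, StringTransformer transformer) {
            this.strategy = strategy;
            this.transformer = transformer;
        }
        // Assumption: overall confidence is the product of both confidences.
        double confidence() { return strategy.confidence() * transformer.confidence(); }
    }

    private final List<Pair> pairs = new ArrayList<>();

    MatchDriver(List<MatchStrategy> strategies, List<StringTransformer> transformers) {
        // Create every strategy/transformer pair.
        for (MatchStrategy s : strategies) {
            for (StringTransformer t : transformers) {
                pairs.add(new Pair(s, t));
            }
        }
        // Sort so the most trustworthy pair is tried first.
        pairs.sort((x, y) -> Double.compare(y.confidence(), x.confidence()));
    }

    // Returns the confidence of the first pair that matches, or empty if none does.
    OptionalDouble match(String candidate, String authorized) {
        for (Pair p : pairs) {
            String c = p.transformer.transform(candidate);
            String a = p.transformer.transform(authorized);
            if (p.strategy.matches(c, a)) return OptionalDouble.of(p.confidence());
        }
        return OptionalDouble.empty();
    }
}
```

Because pairs are tried in descending confidence order, the first hit is also the highest-confidence hit, so the loop can stop early. Under this design, a new rule only has to subclass one of the two abstract classes and be passed to the driver.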
29
Matcher Extensibility Adding new rules Extend MatchStrategy or StringTransformer implement new matching or transforming rules Assign a confidence Add to MatchDriver MatchDriver takes care of the rest
30
Importer Takes in input streams and parses them to extract authority and bibliographic data. Uses a SAX parser to build a Document Object Model (DOM) object. Data is extracted from the document, normalized, and inserted into the database.
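As a minimal sketch of the extraction step, the code below reads a MARCXML-style record from an input stream into a DOM and pulls out the personal name from field 100, subfield a. It uses the JDK's DocumentBuilder for brevity; the slides do not say which parser library the project uses, and the class name AuthorExtractor is hypothetical. Normalization and database insertion are left out.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical extractor; assumes unprefixed <datafield> and <subfield> elements.
final class AuthorExtractor {
    private AuthorExtractor() {}

    static List<String> personalNames(InputStream in) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse record", e);
        }
        List<String> names = new ArrayList<>();
        NodeList fields = doc.getElementsByTagName("datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // MARC tag 100 is the main entry for a personal name.
            if (!"100".equals(field.getAttribute("tag"))) continue;
            NodeList subfields = field.getElementsByTagName("subfield");
            for (int j = 0; j < subfields.getLength(); j++) {
                Element sub = (Element) subfields.item(j);
                // Subfield a carries the name itself.
                if ("a".equals(sub.getAttribute("code"))) {
                    names.add(sub.getTextContent().trim());
                }
            }
        }
        return names;
    }
}
```

The extracted string still carries cataloging punctuation such as a trailing comma, which is why a normalization pass is needed before matching.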
31
Importer
32
MySQL data model
33
Using Hibernate Transparent Data Persistence Manages relationships between entities Benefits Query caching Lazy-loading of associated entities Automatic flagging of changes Programmatic API for complex queries
34
How it Works Define Schema Define Domain Model Use XML to map fields in classes to columns in tables Define cascading behavior
35
Hibernate Caveats Designed with transactions in mind But, we use batch processing! Query language lacks some of the power of SQL Not 100% transparent Design and use of domain model is affected
36
Results Viewing GUI A table displaying all created links Can be filtered, sorted, and paged
37
Future Plans Verify that matching algorithm is doing the right things Implement string transformers Create new XC records Merge and update records with new data upon import Configuration files for the system
38
Demo!