
U of R eXtensible Catalog Team MetaCat. Problem Domain.


1 U of R eXtensible Catalog Team MetaCat

2 Problem Domain

3 A Modern Library Card catalogs are stored on a computer Card catalogs store metadata about books  Subject  Author(s) Searching for a book is done via an OPAC (Online Public Access Catalog)  Example: http://albert.rit.edu/

4 Card Catalog Metadata Two types of records  A bibliographic record represents a book, and is linked to multiple authority records.  An authority record represents a single author or subject. Metadata has been hand-typed by librarians across the country  MARC: MAchine-Readable Cataloging (XML), specifies formats for both bibliographic and authority records  Dublin Core: also an XML format, but for bibliographic records only

5 Metadata Issues Since metadata has been hand-typed, it may be inconsistent An author could be:  “Mark Twain”  “Twain, Mark”  “M. Twain”  “Samuel Clemens” If a user searches for “Mark Twain”, the search may not return all related books
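To make the inconsistency concrete, here is a minimal sketch of string normalization that reduces some hand-typed variants to one comparison key. The class `NameNormalizer` and its rules are assumptions for illustration, not the team's code. It reconciles "Mark Twain" with "Twain, Mark", but not "M. Twain" or "Samuel Clemens", which need the alternate forms stored in an authority record.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the MetaCat code base. */
final class NameNormalizer {
    private NameNormalizer() {}

    /** Reduces a hand-typed name to a lowercase "first last" comparison key. */
    static String normalize(String raw) {
        String s = raw.trim();
        // Inverted form "Last, First" becomes "First Last".
        int comma = s.indexOf(',');
        if (comma >= 0) {
            s = s.substring(comma + 1).trim() + " " + s.substring(0, comma).trim();
        }
        // Decompose accented characters and drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        // Replace punctuation with spaces, then collapse runs of whitespace.
        s = s.replaceAll("[^\\p{L}\\p{N}\\s]", " ").replaceAll("\\s+", " ").trim();
        return s.toLowerCase(Locale.ROOT);
    }
}
```

With these rules, `normalize("Twain, Mark")` and `normalize("Mark Twain")` both yield `mark twain`, while `normalize("M. Twain")` yields `m twain` and still fails an exact comparison.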

6 Goals Bibliographic Record  Author field  Name  Date of Birth, Death Authority Record  Authorized Form  Alternate Forms:  Alternate form 1  Alternate form 2  …  See Also  References to other authority records
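The two record shapes on this slide can be sketched as plain Java classes. Class and field names here are assumptions taken from the slide's wording, not the project's actual domain model or schema.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative authority record: one authorized form plus alternates and cross-references. */
final class AuthorityRecordSketch {
    final String authorizedForm;
    final List<String> alternateForms = new ArrayList<>();
    final List<AuthorityRecordSketch> seeAlso = new ArrayList<>();

    AuthorityRecordSketch(String authorizedForm) {
        this.authorizedForm = authorizedForm;
    }

    /** True if the name equals the authorized form or any alternate form, ignoring case. */
    boolean knows(String name) {
        if (authorizedForm.equalsIgnoreCase(name)) {
            return true;
        }
        for (String alt : alternateForms) {
            if (alt.equalsIgnoreCase(name)) {
                return true;
            }
        }
        return false;
    }
}

/** Illustrative bibliographic record: the author field plus a link to its authority record. */
final class BibliographicRecordSketch {
    final String authorName;
    final String authorDates;        // birth and death as transcribed, e.g. "1835-1910"
    AuthorityRecordSketch authority; // stays null until a match is made

    BibliographicRecordSketch(String authorName, String authorDates) {
        this.authorName = authorName;
        this.authorDates = authorDates;
    }
}
```

An authority record for "Twain, Mark" with the alternate form "Clemens, Samuel" would then answer `knows("clemens, samuel")` with `true`, which is the lookup a plain catalog search lacks.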

7 Sponsor’s Solution

8 Iterative Process Flow

9 Metrics Effort by type of activity Test metrics (JUnit) Defects by types

10 Effort by Type (hours)
Period            Meeting  Development  Documentation
Before            ~40      0            0
1/12-1/18         45       29           2
1/18-1/25         20       43           5
1/26-2/1 (R1)     24       41           4
2/2-2/8           20       31           7
2/9-2/15 (R2)     24       2            0
Total             133      146          18

11 Effort by Type

12 Issuetracker Initially, not all issues were recorded properly. Issue Tracker is used to track 1. Issues (design, documentation, process) 2. Bugs 3. Discussions (new features, nice to have)

13 Issuetracker

14 Defects by Type

15 Status
3.1 Import a record into database
  (R1) FR-1.1: The system shall parse the XML record.
  (R2) FR-1.2: The system shall store the information obtained from parsing the XML record in the MySQL database.
  (R1) FR-1.3: The system shall be able to import multiple records at once (batch processing).
  (R1) FR-1.4: The system shall normalize strings.

16 Status cont.
3.2 Matching records
  (R1) FR-2.1: The system shall create a new authority record.
  (R2) FR-2.2: The system shall match two strings and give a confidence level for the match.
  (R2) FR-2.3: The system shall store the results of the matching, including the degree of certainty and the link(s) to the matched authorized record(s).
  (R1) FR-2.4: The system shall identify all unprocessed records in the records database. Unprocessed records are those that have not yet been matched.
  (R1) FR-2.5: The system shall create a new authority record and store it in the database.

17 Status cont.
  (R1) FR-2.6: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the degree of certainty is above the auto-accept threshold.
  (R2) FR-2.7: The system shall mark the record to be reviewed by a person if the degree of certainty is between the auto-accept threshold and the auto-reject threshold.
  FR-2.8: The system shall create a new authority record using the information from the current record, and create a link between those two records, if the degree of certainty is below the auto-reject threshold.
  (R1) FR-2.9: The system shall analyze unprocessed records on demand.
  (R1) FR-2.10: The system shall attempt to match records first by comparing authority names.
  (R2) FR-2.11: The system shall attempt to match records by comparing alternative names if the first attempt (FR-2.10) failed.
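The three-way outcome in FR-2.6 through FR-2.8 can be sketched as a single decision function. The class `MatchDecision`, the enum names, and the treatment of a certainty exactly equal to a threshold (sent to review) are assumptions; the requirements only say "above", "between", and "below".

```java
/** Hypothetical sketch of the threshold rule in FR-2.6 to FR-2.8. */
final class MatchDecision {
    enum Outcome { AUTO_ACCEPT, NEEDS_REVIEW, CREATE_NEW_AUTHORITY }

    private MatchDecision() {}

    static Outcome decide(double certainty, double autoAccept, double autoReject) {
        if (autoReject > autoAccept) {
            throw new IllegalArgumentException(
                    "auto-reject threshold must not exceed auto-accept threshold");
        }
        if (certainty > autoAccept) {
            return Outcome.AUTO_ACCEPT;          // FR-2.6: link to the authorized form
        }
        if (certainty < autoReject) {
            return Outcome.CREATE_NEW_AUTHORITY; // FR-2.8: no plausible match exists
        }
        return Outcome.NEEDS_REVIEW;             // FR-2.7: a person decides
    }
}
```

For example, with an auto-accept threshold of 0.9 and an auto-reject threshold of 0.5 (illustrative values), a certainty of 0.7 is marked for review.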

18 Status cont.
3.5 Review possible matches
  (R2) FR-5.1: The system shall gather from the database the collection of records that are marked for review. These questionable matches have a degree of certainty between the auto-accept threshold and the auto-reject threshold.
  (R2) FR-5.2: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.
  (R2) FR-5.3: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.

19 Our Solution

20 Architecture

21 Matcher In NACM, we need to be able to match Bibliographic records (books) to Authorized records (authors). The information in the records may not always match exactly, or may match multiple records!

22 Matching Problems Different forms of the same name  Nate versus Nathan, typos Different authors with the same name  George Bush (41) versus George Bush (43) Aliases or pen names  Samuel Clemens versus Mark Twain
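One common way to score near-misses such as typos is edit distance. The sketch below uses Levenshtein distance scaled to a 0..1 confidence; this is a swapped-in technique chosen for illustration, and the slides do not say which string comparison the team used. The class name `EditDistanceConfidence` is hypothetical.

```java
/** Hypothetical sketch: Levenshtein distance turned into a match confidence. */
final class EditDistanceConfidence {
    private EditDistanceConfidence() {}

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 for identical strings, falling toward 0.0 as more edits are needed. */
    static double confidence(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

Edit distance helps with the first problem only. It cannot tell two different authors with the same name apart, and it gives pen names such as Samuel Clemens and Mark Twain a low score even though they are the same person.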

23 Matching Problems To assist in matching different forms of an author’s name, Authority records have a list of alternate names in addition to the authorized form. Alternate names may not be distinct.

24 Matcher Design We need a matching strategy that is easy to extend to add new matching rules, while still being fast.

25 Matching Subsystem

26 MatchStrategy Abstract class that defines the basics of a matching rule Matching method Match confidence All matching strategies extend this class

27 StringTransformer Abstract class for string manipulation rules String transform method Transformation confidence All string manipulation rules extend this class

28 MatchDriver Handles performing a match Creates pairs of strategies & transformations Sorts pairs by overall confidence Iterates through the pairs looking for matches

29 Matcher Extensibility Adding new rules: extend MatchStrategy or StringTransformer, implement the new matching or transforming rule, assign a confidence, and add it to MatchDriver. MatchDriver takes care of the rest.
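The matching subsystem described on the last few slides can be sketched as follows. The class names `MatchStrategy`, `StringTransformer`, and `MatchDriver` come from the slides; every method signature, the three concrete rules, the confidence values, and the choice to multiply confidences into an overall score are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** A matching rule with a fixed confidence in [0, 1]. Signatures are assumed. */
abstract class MatchStrategy {
    abstract boolean matches(String candidate, String authorityName);
    abstract double confidence();
}

/** A string manipulation rule with a fixed confidence in [0, 1]. */
abstract class StringTransformer {
    abstract String transform(String input);
    abstract double confidence();
}

final class ExactMatchStrategy extends MatchStrategy {
    @Override boolean matches(String candidate, String authorityName) {
        return candidate.equals(authorityName);
    }
    @Override double confidence() { return 1.0; }
}

final class IdentityTransformer extends StringTransformer {
    @Override String transform(String input) { return input; }
    @Override double confidence() { return 1.0; }
}

final class LowerCaseTransformer extends StringTransformer {
    @Override String transform(String input) { return input.toLowerCase(Locale.ROOT); }
    @Override double confidence() { return 0.9; } // arbitrary illustrative value
}

final class MatchDriver {
    /** One strategy combined with one transformation. */
    private static final class Pair {
        final MatchStrategy strategy;
        final StringTransformer transformer;
        Pair(MatchStrategy strategy, StringTransformer transformer) {
            this.strategy = strategy;
            this.transformer = transformer;
        }
        // Assumption: overall confidence is the product of the two confidences.
        double overall() { return strategy.confidence() * transformer.confidence(); }
    }

    private final List<MatchStrategy> strategies = new ArrayList<>();
    private final List<StringTransformer> transformers = new ArrayList<>();

    MatchDriver addStrategy(MatchStrategy s) { strategies.add(s); return this; }
    MatchDriver addTransformer(StringTransformer t) { transformers.add(t); return this; }

    /** Returns the confidence of the highest-ranked pair that matches, if any. */
    Optional<Double> match(String candidate, String authorityName) {
        List<Pair> pairs = new ArrayList<>();
        for (MatchStrategy s : strategies) {
            for (StringTransformer t : transformers) {
                pairs.add(new Pair(s, t));
            }
        }
        // Most trusted pairs first, so the first hit is also the best one.
        pairs.sort((x, y) -> Double.compare(y.overall(), x.overall()));
        for (Pair p : pairs) {
            String c = p.transformer.transform(candidate);
            String a = p.transformer.transform(authorityName);
            if (p.strategy.matches(c, a)) {
                return Optional.of(p.overall());
            }
        }
        return Optional.empty();
    }
}
```

In this sketch, adding a rule means writing one subclass and registering it with `addStrategy` or `addTransformer`; the driver pairs it with every existing rule on the next call to `match`. Trying pairs in descending confidence order also lets the driver stop at the first hit, which is one way to keep an extensible design fast.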

30 Importer Takes in input streams and parses them to extract authority and bibliographic data Uses a SAX parser to build a Document Object Model (DOM) object Data is extracted from the document, normalized, and inserted into the database
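As a rough illustration of the parsing step, the sketch below pulls personal-name entries out of a MARCXML-style record with the JDK's SAX parser. It is a simplification under stated assumptions: it handles SAX events directly and skips the DOM step the slide mentions, it reads only field 100 subfield a (the MARC main-entry personal name), and the class name `AuthorNameExtractor` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects the text of datafield 100, subfield a. */
final class AuthorNameExtractor extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inAuthorField;
    private boolean inNameSubfield;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("datafield")) {
            inAuthorField = "100".equals(atts.getValue("tag"));
        } else if (qName.equals("subfield") && inAuthorField) {
            inNameSubfield = "a".equals(atts.getValue("code"));
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate.
        if (inNameSubfield) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("subfield") && inNameSubfield) {
            names.add(text.toString().trim());
            inNameSubfield = false;
        } else if (qName.equals("datafield")) {
            inAuthorField = false;
        }
    }

    /** Parses one XML document held in a string and returns the author names found. */
    static List<String> extract(String xml) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            AuthorNameExtractor handler = new AuthorNameExtractor();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse record XML", e);
        }
    }
}
```

The extracted strings would then go through normalization before being written to the database, as the slide describes.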

31 Importer

32 MySQL data model

33 Using Hibernate Transparent Data Persistence Manages relationships between entities Benefits  Query caching  Lazy-loading of associated entities  Automatic flagging of changes  Programmatic API for complex queries

34 How it Works Define Schema Define Domain Model Use XML to map fields in classes to columns in tables  Define cascading behavior

35 Hibernate Caveats Designed with transactions in mind  But, we use batch processing! Query language lacks some of the power of SQL Not 100% transparent  Design and use of domain model is affected

36 Results Viewing GUI A table displaying all created links Can be filtered, sorted, and paged

37 Future Plans Verify that the matching algorithm produces correct results Implement string transformers Create new XC records Merge and update records with new data upon import Configuration files for the system

38 Demo!

