Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison. Lynn Silipigni Connaway, Consulting Research Scientist III; Akeisha Heard, Technical Intern. XXV Annual Charleston Conference, 04 November 2005
Research Goals Develop a service to support advanced collection intelligence Cluster collected objects based on their issuing entity As can be determined via metadata about the objects Gain intelligence about the nature of individual publishers Collection intelligence Acquisition patterns User behavior
Research Objectives Resolve ISBN prefixes to publisher names and variant publisher names to a preferred form Capture and make available for use various attributes of individual publishers: location of publisher, language(s) of materials published, genre(s)/format(s) of materials published, dominant subject domain(s) of the publisher's output, parent company and subsidiaries
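The attributes listed in these objectives suggest a record shape along the following lines. This is only a minimal sketch: the class name `PublisherRecord`, its field names, the helper `resolve_prefix`, and the sample values are hypothetical and are not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class PublisherRecord:
    """Hypothetical authority record for one publisher (illustrative only)."""
    preferred_name: str
    variant_names: Set[str] = field(default_factory=set)
    isbn_prefixes: Set[str] = field(default_factory=set)
    location: str = ""
    languages: Set[str] = field(default_factory=set)
    formats: Set[str] = field(default_factory=set)
    subjects: Set[str] = field(default_factory=set)
    parent: Optional[str] = None
    subsidiaries: Set[str] = field(default_factory=set)


def resolve_prefix(records: Iterable[PublisherRecord], isbn: str) -> Optional[str]:
    """Return the preferred name whose longest ISBN prefix matches, or None."""
    digits = isbn.replace("-", "")
    best_name, best_len = None, 0
    for record in records:
        for prefix in record.isbn_prefixes:
            if digits.startswith(prefix) and len(prefix) > best_len:
                best_name, best_len = record.preferred_name, len(prefix)
    return best_name


# Sample data: prefix 019 for Oxford University Press is taken from the slides.
oup = PublisherRecord(
    preferred_name="Oxford University Press",
    variant_names={"Oxford Univ. Press", "OUP"},
    isbn_prefixes={"019"},
)
```

Preferring the longest matching prefix is a design assumption: it lets a more specific prefix override a shorter one when both are present.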
Theoretical Foundation: Authority Control Adhere to authorized form Personal names Corporate entities Why no authorized form for publishing entities?
Pragmatic Foundation: Collection Development Identified publisher series Retrospective conversion project (1984) Family tree Which publishers are related? Approval plans Which publishers publish which subjects?
Pragmatic Foundation: OCLC WorldCat Data Mining Collection Analysis Which libraries have the most items by a publisher in a particular subject area? How do library holdings by publisher compare? E-books for a particular STM publisher (2000) Cataloged as reproductions 2 publishers!
Pragmatic Foundation: Citation Analysis Sweetland (1989) Reader functions of citations Information retrieval via citation databases Document retrieval Includes interlibrary loan verification Bibliometrics Faculty and researcher productivity measure Other functions Creation of references/bibliographies
Pragmatic Foundation: Education for Librarians Collection development & acquisitions librarian education Subject focuses of publishers Parent and subsidiary relationships
Specialized Corporate Authority Files ACOLIT (Ruggeri, 2004) Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions Related to the Catholic Church, Papal State, and Vatican City State COPAR (Boddaert, 2004) French official corporate bodies Mainly national and preceding the French Revolution CORELI (Boddaert, 2004) Religious corporate bodies from 3 French ancient specialized catalogues
Specialized Corporate Authority Files Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004) Chinese authors of expanded works and Chinese corporate bodies since 1912 Chinese Name Authority Database (Hu, Tam & Lo, 2004) Mainly Taiwanese personal names with some Taiwanese corporate bodies
Specialized Corporate Authority Files Case study by Elias & Fair (1983) Standard Oil Co.'s Media Query File No authority control 3 professionals in 6 months averaged 12 telephone calls/day from reporters Decided against canonical list for media names Noted 20 unique variants for Wall Street Journal, including WSJ, Wall St. Jnl, Wall Street Jnl
Specialized Corporate Authority Files Case study by French, Powell & Schulman (1997, 2000) Smithsonian Astrophysical Observatory's Astrophysics Data System database Programmatically identify author affiliations and map variant names to canonical name Investigated various techniques separately and iteratively to bring variants together, including: Lexical cleanup Data clustering algorithms Approximate string-matching Reduced number of unique strings by 55% Required manual review of clusters
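Two of the techniques named on this slide, lexical cleanup and approximate string-matching, can be sketched together. This is not French, Powell & Schulman's method: the greedy grouping, the 0.85 threshold, and the function names are assumptions chosen for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher


def cleanup(name):
    """Lexical cleanup: lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def cluster_variants(names, threshold=0.85):
    """Greedy grouping: put each name in the first cluster whose
    representative is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (cleaned representative, original members)
    for name in names:
        key = cleanup(name)
        for representative, members in clusters:
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((key, [name]))
    return [members for _, members in clusters]
```

As on the slide, output like this would still need manual review: a threshold low enough to catch real variants will also merge some distinct names.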
Literature: Database Quality Intner (1989) Reviewed 215 matching records in OCLC and RLIN Errors relating to publishers:
Error type | OCLC count (total) | OCLC % | RLIN count (total) | RLIN %
Application of AACR2 & LCRI | 64 (205) | | (191) | 27.2
MARC tagging in 260 field | 4 (25) | | (26) | 11.5
Typographic errors | 4 (32) | | (45) | 13.3
Literature: Database Quality Romero (1994) Evaluated cataloging of library science students Noted 221 errors (28.22%) in the publisher description area
Issues: Historical Practices Different rules for abbreviations LC Rule Interpretation B.14 State postal (2-letter) abbreviation if it appears in the item along with the place Anglo-American Cataloguing Rules, Revised (2002) Abbreviations included in Appendix B.14
Issues: Historical Practices ALA Catalog Rules (1941) Multiple places of publication and publishers and neither or first is prominent Include first listed first, indicate omission Multiple places of publication and publishers and first is not prominent Include prominent first Include first listed second Unknown place of publication – [n.p.]
Issues: Historical Practices Anglo-American Cataloging Rules (1967) Multiple places of publication and publishers and neither or first is prominent Include first listed only, omit others Multiple places of publication and publishers and first is not prominent Include prominent only, omit others Unknown place of publication – [n. p.]
Issues: Historical Practices Anglo-American Cataloguing Rules, Revised (2002) Multiple places of publication and publishers and neither or first is prominent Include first listed only, omit others Multiple places of publication and publishers and first is not prominent Include first listed first Include prominent second Unknown place of publication – [S.l.]
Issues: Historical and Local Practices "u.a." At least one German institution uses "u.a." as a mark of omission, meaning "et al." Not an AACR2r rule Local practice? Is local practice/policy an error?
Issues: Historical and Local Practices WorldCat enhanced records Eliminate or lessen the probability of these issues
Examining Quality of WorldCat
WorldCat: Publisher Name Selection Criteria Fixed field lang = eng
WorldCat: ISBN Validation Errors WorldCat records with ISBNs: 22.69%
WorldCat: ISBN Validation Errors English Language: Valid 7,561, % Invalid 7, % All Languages: Valid 13,147, % Invalid 15, %
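The slides do not say how ISBNs were validated. One standard check, sketched here as an assumption about what "valid" could mean, is the ISBN-10 check digit: the ten characters are weighted 10 down to 1, a final "X" counts as 10, and the weighted sum must be divisible by 11. The function name is hypothetical.

```python
def is_valid_isbn10(raw):
    """Return True if raw is a well-formed ISBN-10 with a correct check digit."""
    chars = [c for c in raw.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char in "0123456789":
            value = int(char)
        elif char == "X" and position == 9:
            value = 10  # "X" is allowed only as the check character
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0
```

A check like this catches malformed strings and most single-character keying errors, but it cannot tell whether a well-formed ISBN was attached to the wrong record.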
WorldCat: MARC Tagging Errors Examined English language records based on some known issues and manual evaluation Total MARC tagging errors found: 11,874 (0.03%)
WorldCat: MARC Tagging Errors MARC 260 vs 300 tagging In 260 field, information from 300 field in $a, $b, $c and/or $e Dates tagging Date in $a or $b Five digit year cm follows year
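The tagging problems listed on this slide lend themselves to simple pattern checks. The sketch below is an assumption about how such checks might look, not the project's code: the regular expressions, the function name `check_260`, and the dictionary input format are all hypothetical, and every hit would still need manual confirmation.

```python
import re

# A plausible four-digit year (1500-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")
# Exactly five consecutive digits, e.g. a mistyped year.
FIVE_DIGIT = re.compile(r"(?<!\d)\d{5}(?!\d)")
# Physical-description residue from the 300 field, e.g. "24 cm" or "250 p".
PHYSICAL = re.compile(r"\b\d+\s*(?:cm|p)\b", re.IGNORECASE)


def check_260(subfields):
    """Return warnings for one 260 field.

    `subfields` maps a subfield code ("a", "b", "c", "e") to its text.
    """
    warnings = []
    for code in ("a", "b"):
        if YEAR.search(subfields.get(code, "")):
            warnings.append(f"date in ${code}")
    if FIVE_DIGIT.search(subfields.get("c", "")):
        warnings.append("five-digit year in $c")
    for code in ("a", "b", "c", "e"):
        if PHYSICAL.search(subfields.get(code, "")):
            warnings.append(f"physical description in ${code}")
    return warnings
```

The "cm follows year" case from the slide is covered by the physical-description pattern, since a size statement such as "24 cm" belongs in the 300 field.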
WorldCat: Typographical Errors Used Typographical Errors in Library Databases to identify and quantify English language WorldCat errors (Ballard, 2005) Total errors: 26,599 (0.08%) Require manual examination to determine if actual errors Searching for Institi* Misspelled: American Institite of Physics British Standards Institition Spelled correctly: Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
WorldCat: Typographical Errors Top words (10.4%):
Word | Probability according to Ballard | Error type | WorldCat count
Worchester | Highest | Insertion | 398
Metheun | High | Transposition | 355
Universt* | Highest | Omission | 299
Unives* | Highest | Omission | 275
Westminister [and] Press | Highest | Insertion | 266
Niagr* | High | Omission | 260
Phildel* | High | Omission | 235
Tallahasee | High | Omission | 234
John Hopkins Press | Highest | Omission | 227
Institi* | High | Substitution | 226
WorldCat: Typographical Errors Westminister Only included on Ballard list in combination with other words Total errors in WorldCat: 628 (2.36%) Require manual review
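The truncated search terms on these slides (for example Institi*) can be applied as wildcard patterns, with an exception list for strings that match but are spelled correctly. This is a minimal sketch under assumptions: the patterns come from the slides, while the exception list, the tokenization, and the function name are hypothetical.

```python
from fnmatch import fnmatchcase

# Truncation patterns taken from the slides, lowercased for matching.
PATTERNS = ["institi*", "universt*", "unives*", "niagr*", "phildel*"]
# Words that match a pattern but are correct (assumed exception list).
KNOWN_GOOD = {"institiúid"}


def typo_candidates(publisher_names):
    """Yield (name, word, pattern) for each word that matches a typo pattern."""
    for name in publisher_names:
        for word in name.lower().split():
            word = word.strip(".,;:[]()")
            if word in KNOWN_GOOD:
                continue
            for pattern in PATTERNS:
                if fnmatchcase(word, pattern):
                    yield name, word, pattern
```

The exception list only removes known false positives; as the slides note, the remaining candidates still require manual examination.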
Where are we now?
WorldCat: MARC 260 Evaluation Top 10 terms in 260 $b in WorldCat
Term | Count
press | 2,094,111
co | 1,664,005
university | 1,550,435
dept | 1,084,647
pub | 984,234
research | 853,954
service | 710,314
institute | 660,346
office | 649,794
chu ban she | 620,735
WorldCat: MARC 260 Evaluation University Press names in 260 $b in WorldCat
Term | Count
oxford | 35,804
hopkins | 22,564
cambridge | 21,951
harvard | 17,069
cornell | 11,305
stanford | 10,900
purdue | 5,468
yale | 5,076
princeton | 4,746
rutgers | 3,854
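Term counts like these can be produced with a simple frequency tally over 260 $b strings. The sketch below assumes single-word, lowercase tokens, so it would not reproduce a multi-word entry such as "chu ban she" as one term; the function name and tokenization rule are hypothetical.

```python
import re
from collections import Counter


def top_terms(publisher_strings, n=10):
    """Count lowercase alphabetic tokens across 260 $b strings
    and return the n most common as (term, count) pairs."""
    counts = Counter()
    for text in publisher_strings:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts.most_common(n)
```

For example, `top_terms(["Oxford University Press", "Harvard University Press", "Yale Univ. Press"], 2)` gives `[("press", 3), ("university", 2)]`.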
Clustering Attempting programmatic clustering of publishers using ISBN prefixes Data clustering (The Free Dictionary) "The science of extracting useful information from large data sets or databases" Classification of similar objects into different groups Partitioning of a data set into subsets (clusters) Data in each subset (ideally) share some common trait
WorldCat: Clustering Example Used ISBN prefix 019 (Oxford University Press) Total WorldCat records: 58,004,317 Records with ISBN prefix 019: 84,276 (0.15%) Non-unique publisher names from ISBN prefix records: 91,528
Measure | One or more 019 ISBN | All 019 ISBNs
NACO normalized unique publisher names | 1,550 | 1,386
Number of clusters | |
Non-singleton clusters | 222 (24.16%) | 205 (25.66%)
Largest cluster | 82 text strings | 81 text strings
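The first two steps of this example, selecting records by ISBN prefix and reducing publisher strings to normalized unique names, can be sketched as follows. The `normalize` function is a simplified stand-in and does not implement the full NACO normalization rules; the function names and the `(isbn, publisher)` input format are assumptions. The later step of clustering the normalized names themselves is not reproduced here.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name):
    """Simplified stand-in for NACO normalization (not the full rules):
    strip diacritics, lowercase, replace punctuation, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^\w\s]", " ", no_marks.lower()).split())


def names_for_prefix(records, prefix):
    """Group publisher strings of records whose ISBN starts with prefix.

    `records` is an iterable of (isbn, publisher_string) pairs.
    Returns {normalized_name: set of original strings}.
    """
    groups = defaultdict(set)
    for isbn, publisher in records:
        if isbn.replace("-", "").startswith(prefix):
            groups[normalize(publisher)].add(publisher)
    return dict(groups)
```

The number of keys in the returned dictionary corresponds to the count of normalized unique publisher names; groups with more than one original string show how many raw variants a single normalized form absorbs.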
Challenges: Publisher Name Authority File Quality issue Level of acceptance for cluster What is acceptable? Subsidiaries and Relationships Oxford & Auckland Examined manually to determine relationship Form of name What is acceptable? Likely to use the most prominent form of name
Questions and Discussion Contact Information: Project Web Site: