Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison Lynn Silipigni Connaway Consulting Research Scientist.

Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison
Lynn Silipigni Connaway, Consulting Research Scientist III
Akeisha Heard, Technical Intern
XXV Annual Charleston Conference, 04 November 2005

Introduction

Research Goals
- Develop a service to support advanced collection intelligence
- Cluster collected objects based on their issuing entity, as can be determined via metadata about the objects
- Gain intelligence about the nature of individual publishers
- Collection intelligence
- Acquisition patterns
- User behavior

Research Objectives
- Resolve ISBN prefixes to publisher name
- Resolve variant publisher names to a preferred form
- Capture and make available for use various attributes of individual publishers:
- Location of publisher
- Language(s) of materials published
- Genre(s)/format(s) of materials published
- Dominant subject domain(s) of the publisher's output
- Parent company and subsidiaries

Theoretical Foundation: Authority Control
- Adhere to authorized form: personal names, corporate entities
- Why no authorized form for publishing entities?

Pragmatic Foundation: Collection Development
- Identified publisher series: retrospective conversion project (1984)
- Family tree: which publishers are related?
- Approval plans: which publishers publish which subjects?

Pragmatic Foundation: OCLC WorldCat Data Mining
- Collection analysis
- Which libraries have the most items by a publisher in a particular subject area?
- How do library holdings by publisher compare?
- E-books for a particular STM publisher (2000): cataloged as reproductions, 2 publishers!

Pragmatic Foundation: Citation Analysis
Sweetland (1989): reader functions of citations
- Information retrieval via citation databases
- Document retrieval (includes interlibrary loan verification)
- Bibliometrics (faculty and researcher productivity measure)
- Other functions (creation of references/bibliographies)

Pragmatic Foundation: Education for Librarians
- Collection development & acquisitions librarian education
- Subject focuses of publishers
- Parent and subsidiary relationships

Specialized Corporate Authority Files
- ACOLIT (Ruggeri, 2004): names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions related to the Catholic Church, Papal State, and Vatican City State
- COPAR (Boddaert, 2004): French official corporate bodies, mainly national and preceding the French Revolution
- CORELI (Boddaert, 2004): religious corporate bodies from 3 French ancient specialized catalogues

Specialized Corporate Authority Files
- Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004): Chinese authors of expanded works and Chinese corporate bodies since 1912
- Chinese Name Authority Database (Hu, Tam & Lo, 2004): mainly Taiwanese personal names with some Taiwanese corporate bodies

Specialized Corporate Authority Files
Case study by Elias & Fair (1983): Standard Oil Co.'s Media Query File
- No authority control
- 3 professionals in 6 months averaged 12 telephone calls/day from reporters
- Decided against canonical list for media names
- Noted 20 unique variants for Wall Street Journal, including WSJ, Wall St. Jnl, Wall Street Jnl

Specialized Corporate Authority Files
Case study by French, Powell & Schulman (1997, 2000): Smithsonian Astrophysical Observatory's Astrophysics Data System database
- Programmatically identify author affiliations and map variant names to canonical name
- Investigated various techniques, separately and iteratively, to bring variants together, including:
- Lexical cleanup
- Data clustering algorithms
- Approximate string-matching
- Reduced number of unique strings by 55%
- Required manual review of clusters
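
The lexical cleanup and approximate string-matching steps named above can be illustrated with a minimal Python sketch. This is not the method used by French, Powell & Schulman: the similarity measure (the standard library's difflib.SequenceMatcher), the 0.8 threshold, the greedy single-pass grouping, and the function names are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def clean(name):
    """Lexical cleanup: lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def cluster_names(names, threshold=0.8):
    """Greedy clustering: put each name in the first cluster whose
    first member is similar enough, otherwise start a new cluster.
    The threshold is an illustrative assumption, not a published value."""
    clusters = []
    for name in names:
        key = clean(name)
        for members in clusters:
            ratio = SequenceMatcher(None, key, clean(members[0])).ratio()
            if ratio >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


variants = ["Wall Street Journal", "Wall Street Jnl", "New York Times"]
print(cluster_names(variants))
```

A stricter threshold leaves more variants unmerged and a looser one merges unrelated names, which is one reason clusters produced this way still need manual review.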

Database Quality

Literature: Database Quality
Review by O'Neill & Vizine-Goetz (1988)
- Busch (1981): < 35% of 141 OCLC libraries routinely reported errors
- Pollock & Zamora (1983): noted misspellings comprise 90-96% of errors and include:
- Omission
- Insertion
- Substitution
- Transposition

Literature: Database Quality
Intner (1989)
- Reviewed 215 matching records in OCLC and RLIN
- Errors relating to publishers, given as count (total) and %:
- Application of AACR2 & LCRI: OCLC 64 (205); RLIN (191), 27.2%
- MARC tagging in 260 field: OCLC 4 (25); RLIN (26), 11.5%
- Typographic errors: OCLC 4 (32); RLIN (45), 13.3%

Literature: Database Quality
Romero (1994)
- Evaluated cataloging of library science students
- Noted 221 errors (28.22%) in the publisher description area

Issues: Historical Practices
Different rules for abbreviations
- LC Rule Interpretation B.14: state postal (2-letter) abbreviation if it appears in the item along with the place
- Anglo-American Cataloguing Rules, Revised (2002): abbreviations included in Appendix B.14

Issues: Historical Practices
ALA Catalog Rules (1941)
- Multiple places of publication and publishers, and neither or first is prominent: include first listed first, indicate omission
- Multiple places of publication and publishers, and first is not prominent: include prominent first, include first listed second
- Unknown place of publication: [n.p.]

Issues: Historical Practices
Anglo-American Cataloging Rules (1967)
- Multiple places of publication and publishers, and neither or first is prominent: include first listed only, omit others
- Multiple places of publication and publishers, and first is not prominent: include prominent only, omit others
- Unknown place of publication: [n. p.]

Issues: Historical Practices
Anglo-American Cataloguing Rules, Revised (2002)
- Multiple places of publication and publishers, and neither or first is prominent: include first listed only, omit others
- Multiple places of publication and publishers, and first is not prominent: include first listed first, include prominent second
- Unknown place of publication: [S.l.]

Issues: Historical and Local Practices
u.a.
- At least one German institution uses u.a. as mark of omission
- Means et al.
- Not an AACR2r rule
- Local practice?
- Is local practice/policy an error?

Issues: Historical and Local Practices
WorldCat enhanced records
- Eliminate or lessen the probability of these issues

Examining Quality of WorldCat

WorldCat: Publisher Name Selection Criteria Fixed field lang = eng

WorldCat: ISBN Validation Errors WorldCat records with ISBNs: 22.69%

WorldCat: ISBN Validation Errors
- English Language: Valid 7,561, %; Invalid 7, %
- All Languages: Valid 13,147, %; Invalid 15, %
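
The slides do not say how ISBNs were validated. One common check, sketched here as an assumption, is the ISBN-10 check digit: the ten characters are weighted 10 down to 1, a final X counts as 10, and the weighted sum must be divisible by 11. The function name is hypothetical.

```python
def is_valid_isbn10(isbn):
    """Return True if the string passes the ISBN-10 check-digit test.
    Hyphens and spaces are ignored; 'X' is allowed only as the last character."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("0-19-852663-6"))  # check digit matches
print(is_valid_isbn10("0-19-852663-7"))  # check digit does not match
```

A check like this only catches malformed numbers; a well-formed ISBN attached to the wrong record would still pass.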

WorldCat: MARC Tagging Errors
- Examined English language records based on some known issues and manual evaluation
- Total MARC tagging errors found: 11,874 (0.03%)

WorldCat: MARC Tagging Errors
- MARC 260 vs 300 tagging: in 260 field, information from 300 field in $a, $b, $c and/or $e
- Dates tagging:
- Date in $a or $b
- Five digit year
- cm follows year
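
Two of the date problems listed above lend themselves to simple pattern checks. The sketch below is an assumption about how such checks might look, not the procedure used in the project; the function name and the regular expressions are illustrative, and it works on a plain string holding the 260 $c value rather than on a parsed MARC record.

```python
import re


def flag_260c(date_subfield):
    """Return a list of suspected tagging problems in a 260 $c string."""
    problems = []
    # A run of exactly five digits suggests a mistyped year.
    if re.search(r"(?<!\d)\d{5}(?!\d)", date_subfield):
        problems.append("five-digit year")
    # A size statement such as '23 cm' after a four-digit year suggests
    # that 300-field data was entered in the 260 field.
    if re.search(r"\d{4}.*\bcm\b", date_subfield):
        problems.append("cm follows year")
    return problems


for value in ["c1998.", "19985.", "1998. 23 cm."]:
    print(value, flag_260c(value))
```

Flags like these only mark candidates; as with the typographical checks, a person still has to confirm each one.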

WorldCat: Typographical Errors
- Used Typographical Errors in Library Databases (Ballard, 2005) to identify and quantify English language WorldCat errors
- Total errors: 26,599 (0.08%)
- Require manual examination to determine if actual errors
Searching for Institi*
- Misspelled: American Institite of Physics; British Standards Institition
- Spelled correctly: Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
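
The Institi* example shows why wildcard hits need manual review. The following minimal sketch assumes the trailing asterisk means right-hand truncation (any word beginning with the stem); the function name is hypothetical, and the sample names are the ones from the slide plus one correctly spelled control.

```python
import re


def wildcard_hits(pattern, names):
    """Return the names containing a word that starts with the stem
    of a truncation pattern such as 'Institi*' (case-insensitive)."""
    stem = pattern.rstrip("*")
    word_re = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [name for name in names if word_re.search(name)]


names = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "American Institute of Physics",
]
for hit in wildcard_hits("Institi*", names):
    print(hit)
```

The search returns the two misspellings and the correctly spelled Irish name alike, so the pattern alone cannot separate errors from legitimate forms.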

WorldCat: Typographical Errors
Top words (10.4%), listed as word: probability according to Ballard, error type, WorldCat count
- Worchester: Highest, Insertion, 398
- Metheun: High, Transposition, 355
- Universt*: Highest, Omission, 299
- Unives*: Highest, Omission, 275
- Westminister [and] Press: Highest, Insertion, 266
- Niagr*: High, Omission, 260
- Phildel*: High, Omission, 235
- Tallahasee: High, Omission, 234
- John Hopkins Press: Highest, Omission, 227
- Institi*: High, Substitution, 226
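
The four error types used above (omission, insertion, substitution, transposition) can be told apart mechanically when the intended spelling is known. The sketch below is an illustration under that assumption; the function name is hypothetical, the correct spellings are supplied by hand, and anything other than a single error is reported as "other".

```python
def error_type(correct, typo):
    """Classify a single-error misspelling of `correct`."""
    if correct == typo:
        return None
    if len(typo) == len(correct) - 1:
        # One character of the correct word is missing.
        if any(correct[:i] + correct[i + 1:] == typo for i in range(len(correct))):
            return "omission"
    if len(typo) == len(correct) + 1:
        # One extra character was added.
        if any(typo[:i] + typo[i + 1:] == correct for i in range(len(typo))):
            return "insertion"
    if len(typo) == len(correct):
        diffs = [i for i in range(len(correct)) if correct[i] != typo[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == typo[diffs[1]]
                and correct[diffs[1]] == typo[diffs[0]]):
            return "transposition"
    return "other"


pairs = [("Worcester", "Worchester"), ("Methuen", "Metheun"),
         ("Tallahassee", "Tallahasee"), ("Institute", "Institite")]
for correct, typo in pairs:
    print(typo, "->", error_type(correct, typo))
```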

WorldCat: Typographical Errors
Westminister
- Only included on Ballard list in combination with other words
- Total errors in WorldCat: 628 (2.36%)
- Require manual review

Where are we now?

WorldCat: MARC 260 Evaluation
Top 10 terms in 260 $b in WorldCat (term: count)
- press: 2,094,111
- co: 1,664,005
- university: 1,550,435
- dept: 1,084,647
- pub: 984,234
- research: 853,954
- service: 710,314
- institute: 660,346
- office: 649,794
- chu ban she: 620,735

WorldCat: MARC 260 Evaluation
University Press names in 260 $b in WorldCat (term: count)
- oxford: 35,804
- hopkins: 22,564
- cambridge: 21,951
- harvard: 17,069
- cornell: 11,305
- stanford: 10,900
- purdue: 5,468
- yale: 5,076
- princeton: 4,746
- rutgers: 3,854

Clustering
- Attempting programmatic clustering of publishers using ISBN prefixes
- Data clustering (The Free Dictionary): "The science of extracting useful information from large data sets or databases"
- Classification of similar objects into different groups
- Partitioning of a data set into subsets (clusters)
- Data in each subset (ideally) share some common trait

WorldCat: Clustering Example
- Used ISBN prefix 019 (Oxford University Press)
- Total WorldCat records: 58,004,317
- Records with ISBN prefix 019: 84,276 (0.15%)
- Non-unique publisher names from ISBN prefix records: 91,528
Results for records with one or more 019 ISBN vs. records with all 019 ISBNs
- NACO normalized unique publisher names: 1,550 vs. 1,386
- Number of clusters
- Non-singleton clusters: 222 (24.16%) vs. 205 (25.66%)
- Largest cluster: 82 text strings vs. 81 text strings
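
The first steps of this example, selecting records by ISBN prefix and reducing their publisher strings to unique normalized names, can be sketched as follows. The normalization here is a simplified stand-in and not the full NACO rules: it strips diacritics, turns punctuation into spaces, and uppercases. The record layout (ISBN, publisher string) and all function names are assumptions for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def simplified_normalize(name):
    """Simplified stand-in for NACO normalization: remove diacritics,
    replace punctuation with spaces, uppercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", no_marks)
    return " ".join(spaced.upper().split())


def names_for_prefix(records, prefix):
    """Group the publisher strings of records whose ISBN starts with
    `prefix` (hyphens ignored) under their normalized form."""
    groups = defaultdict(list)
    for isbn, publisher in records:
        if isbn.replace("-", "").startswith(prefix):
            groups[simplified_normalize(publisher)].append(publisher)
    return dict(groups)


records = [
    ("0-19-852663-6", "Oxford University Press"),
    ("0198526636", "OXFORD UNIVERSITY PRESS."),
    ("0-19-852663-6", "Oxford Univ. Press,"),
    ("0-306-40615-2", "Plenum Press"),
]
for key, variants in names_for_prefix(records, "019").items():
    print(key, variants)
```

Normalization alone leaves forms such as "OXFORD UNIV PRESS" and "OXFORD UNIVERSITY PRESS" separate, which is where a further clustering step and manual review come in.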

Challenges: Publisher Name Authority File
- Quality issue
- Level of acceptance for cluster: what is acceptable?
- Subsidiaries and relationships: Oxford & Auckland; examined manually to determine relationship
- Form of name: what is acceptable? Likely to use the most prominent form of name

Questions and Discussion Contact Information: Project Web Site: