Charleston Conference 7 November 2008 Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Research Timothy J. Dickey, Ph.D. Post-Doctoral Researcher.

1 Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File
Charleston Conference, 7 November 2008
Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC Research
Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC Research

2 Overall Research Goals
To Build a Database that Will Identify:
Authoritative strings for publisher names
Common variants for names and locations
Hierarchical references indicating relationships and nesting of subsidiaries
Definitions of publishing entities

3 Overall Research Goals
To Build a Database that Will:
Produce Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers
Conform to international authority and standards practice, and inter-operate with other OCLC products

4 Issues & Challenges
Database Quality: Historical Practices
"…the shortest form in which it can be understood." [AACR2 2004]
Different versions of cataloging rules
Abbreviations
Errors and misspellings
Local Practices

5 Method: Data Mining in an Aggregate Collection
Data Mining and Analysis of WorldCat: "…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making."
Lavoie, B. F., Connaway, L. S., & O'Neill, E. T. (2007). Mapping WorldCat's digital landscape. Library Resources & Technical Services, 51, at 107.

6 WorldCat: July 2008
Total holdings: 1,292,763,300
Manifestations (records): 108,828,533
Works: 84,096,107
Digital Items: 3,182,550
Physical Items: ~1.2 billion
Institutions: 69,000

7 Global Origins of WorldCat Materials
US 28%; Germany 10%; UK 8%; France 4%; Canada 3%; Rest of World 27%; Unknown 17%

8 Global Origins of WorldCat Materials
Content Languages: % of WC non-English
Top 5 non-English: German 12 million; French 6.1 million; Spanish 3.5 million; Dutch 2.6 million; Japanese 2.4 million
Materials w/non-US origins: 57.9 million (55%)
Top 5: Germany 10.0 million; UK 8.8 million; France 4.2 million; Netherlands 2.9 million; Canada 2.9 million
Non-English Metadata Language: 28 million (66 languages)
Top 5: German 11 million; Dutch 5.0 million; Swedish 1.9 million; French 1.8 million; Finnish 0.7 million

9 OCLC Publisher Name Server

10 Publisher Name Server: Objectives
Resolve, for data mining and quality of WorldCat:
ISBN prefixes to publisher name
Variant publisher names to a preferred form
Complement Collection Analysis Service
Librarians & Publishers

11 Publisher Name Server: Objectives
Capture and profile attributes of individual publishers:
Location(s)
Language(s) of materials published
Genre(s)/format(s)
Dominant subject domain(s)
Parent company and subsidiaries

12 Publisher Name Server: Methodology
Programmatically cluster publisher records using ISBN prefixes
Data clustering: classification of similar objects into different groups; partitioning of a data set into subsets (clusters)
Hand parse the entities and resolve ISBN prefixes
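
The clustering step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the prefix list, the function names, and the sample records are all assumptions, and a production system would derive registrant prefixes from the ISBN agency's range tables instead of a hand-written list. Apart from the Prentice-Hall example, the sample ISBNs are made up and their check digits are not validated.

```python
from collections import defaultdict

# Hypothetical registrant prefixes (group + publisher elements, hyphens removed).
# Assumption: a real system would load these from ISBN range tables.
KNOWN_PREFIXES = ["019", "013", "0201", "3540"]


def isbn_prefix(isbn, prefixes=KNOWN_PREFIXES):
    """Return the longest known prefix the ISBN starts with, or None."""
    digits = isbn.replace("-", "").replace(" ", "")
    matches = [p for p in prefixes if digits.startswith(p)]
    return max(matches, key=len) if matches else None


def cluster_by_prefix(records):
    """Partition (isbn, publisher_string) pairs into clusters keyed by prefix."""
    clusters = defaultdict(set)
    for isbn, publisher in records:
        clusters[isbn_prefix(isbn)].add(publisher)
    return dict(clusters)


# Illustrative records only; publisher strings are invented variants.
records = [
    ("0-19-280000-0", "Oxford University Press"),
    ("0198000000", "Oxford Univ. Pr."),
    ("0-13-110362-8", "Prentice-Hall"),
    ("0-201-00000-0", "Addison-Wesley Pub. Co."),
]
clusters = cluster_by_prefix(records)
```

Each cluster then collects the variant name strings seen under one prefix, which is the material a human reviewer would hand parse into entities.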

13 Publisher Name Server: Database
1750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
Top 10 lists
Top 10 university presses
Mergers and acquisitions, last 8 years

14 Example: Top U.S. Publishing Entities by ISBN

15 Publisher Name Server: Data Captured
Data: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL; Languages; Formats; Conspectus Subjects
Sources:
U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
Books In Print Online (R.R. Bowker)
The International ISBN Registry (K.G. Saur)
Publishers Weekly Online
Hoover's Handbook Online
Standard and Poor's Corporate Descriptions
The Directory of Corporate Affiliations (DIALOG)
Company websites
DATA MINING

16

17 Publisher Name Server: Current Scope
More than 56,000 separate strings mapped to 1750 entities
8.5 million OCLC records
22% of these are Library of Congress records
~490 million holdings
Hierarchical relationships maintained
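
Mapping many name strings onto one entity can be illustrated with a small lookup keyed on a normalized form. This is only a sketch under stated assumptions: the normalization rules, the variant strings, and the function names are hypothetical, and the slides do not describe how the actual mapping was implemented.

```python
import re


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)
    return re.sub(r"\s+", " ", name).strip()


# Preferred forms come from the slides; the variant strings are invented examples.
VARIANTS = {
    "Oxford University Press": ["Oxford Univ. Press", "Oxford U.P."],
    "Springer (Firm)": ["Springer-Verlag", "Springer"],
}

# Build one lookup table from every normalized variant (and the preferred form
# itself) to the preferred form.
LOOKUP = {
    normalize(variant): preferred
    for preferred, variants in VARIANTS.items()
    for variant in variants + [preferred]
}


def preferred_form(publisher_string):
    """Return the preferred form for a publisher string, or None if unmapped."""
    return LOOKUP.get(normalize(publisher_string))
```

For instance, `preferred_form("OXFORD UNIV. PRESS,")` resolves to the Oxford entity because case and punctuation are discarded before the lookup, while an unknown string returns `None` and would be left for manual review.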

18 Entity-Parsing in a World of Mergers and Acquisitions
Prentice-Hall, Inc.; Pearson Education, Inc.; Addison-Wesley Publishing Company; Allyn and Bacon; Dominie Press; Benjamin/Cummings Publishing Company; Scott, Foresman and Company; HarperCollins Educational Publishers; Longmans, Green, and Co.
Pearson PLC; Pearson Canada; Pearson Technology Group; Copp Clark; Adobe Press; Cisco Press
Penguin Books; Allen Lane; Ladybird Books; Riverhead Books; Puffin Books; Putnam Books; Berkley Publishing Group; Avery
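
One common way to keep parent and subsidiary relationships in a relational database is a self-referencing table queried with a recursive common table expression. The sketch below uses Python's built-in `sqlite3` module; the table layout, column names, and the particular parent assignments shown are assumptions for illustration, not the schema of the OCLC database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each entity optionally points at its parent entity.
conn.execute(
    "CREATE TABLE entity ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "parent_id INTEGER REFERENCES entity(id))"
)
# Assumed nesting for a handful of the names listed on the slide.
conn.executemany(
    "INSERT INTO entity VALUES (?, ?, ?)",
    [
        (1, "Pearson PLC", None),
        (2, "Pearson Education, Inc.", 1),
        (3, "Prentice-Hall, Inc.", 2),
        (4, "Penguin Books", 1),
        (5, "Puffin Books", 4),
    ],
)


def descendants(conn, root_name):
    """Return the names of all entities nested under root_name, sorted."""
    rows = conn.execute(
        """
        WITH RECURSIVE tree(id, name) AS (
            SELECT id, name FROM entity WHERE name = ?
            UNION ALL
            SELECT e.id, e.name FROM entity e JOIN tree t ON e.parent_id = t.id
        )
        SELECT name FROM tree WHERE name != ? ORDER BY name
        """,
        (root_name, root_name),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `descendants(conn, "Pearson PLC")` walks every level of the tree, which is what allows aggregate counts to roll up from imprints to the parent company.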

19 Publisher Profiles within WorldCat
Oxford University Press: 119,237 records with ISBNs, mapped to 210,095 records (0.19% of WorldCat)
Pearson PLC: includes 14 subsidiaries and acquisitions; aggregate 291,433 records (0.27% of WorldCat)
Springer (Firm): 197,263 records (0.18% of WorldCat)
Reed Elsevier PLC: includes dozens of subsidiaries; aggregate 370,029 records (0.34% of WorldCat)
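
The percentages above can be reproduced with simple arithmetic. The slides do not state the denominator, so this sketch assumes it is the July 2008 count of WorldCat manifestations (records), 108,828,533, reported on slide 6; the variable and function names are illustrative.

```python
# Assumption: the denominator is the July 2008 manifestation (record) count.
WORLDCAT_RECORDS = 108_828_533

# Aggregate record counts as reported on the slide.
aggregate_records = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}


def share_of_worldcat(record_count, total=WORLDCAT_RECORDS):
    """Return record_count as a percentage of total, rounded to two places."""
    return round(100 * record_count / total, 2)


shares = {name: share_of_worldcat(n) for name, n in aggregate_records.items()}
```

Under that assumption the computed shares round to the figures quoted on the slide for all four publishers.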

20 WorldCat Publisher Profiles – Top Languages
Oxford Univ. Press: English 96.74%; Latin 0.51%; German 0.39%; Chinese 0.39%; French 0.37%; Spanish 0.28%; Afrikaans 0.14%; Middle English 0.13%; Malay 0.09%; Swahili 0.09%
Pearson PLC: English 95.27%; Spanish 1.43%; German 1.33%; French 0.60%; Dutch 0.55%; Latin 0.26%; Malay 0.06%; Ancient Greek 0.05%; Portuguese 0.05%; Italian 0.04%

21 WorldCat Publisher Profiles – Top Languages
Springer (Firm): English 61.25%; German 37.10%; French 1.02%; Italian 0.29%; Polish 0.13%; Czech 0.04%; Spanish 0.04%; Hungarian 0.03%; Dutch 0.02%; Danish 0.02%
Reed Elsevier PLC: English 83.64%; French 9.34%; Dutch 2.32%; Spanish 0.95%; Italian 0.60%; Latin 0.27%; Afrikaans 0.16%; Ancient Greek 0.12%; Portuguese 0.09%; Polish 0.06%

22 WorldCat Publisher Profiles - Formats
Oxford University Press: Printed Material 89.57%; Computer File 8.23%; Microform 1.39%; Sound Recording 0.50%; Video Recording 0.16%
Springer (Firm): Printed Material 81.69%; Computer File 17.51%; Microform 0.71%; Video Recording 0.05%
Pearson PLC: Printed Material 92.98%; Microform 2.82%; Computer File 2.15%; Video Recording 0.70%; Sound Recording 0.67%
Reed Elsevier PLC: Printed Material 92.31%; Computer File 5.46%; Microform 1.85%; Video Recording 0.14%

23 WorldCat Publisher Profiles – Conspectus Divisions
Oxford Univ. Press: Language/Literature 27.12%; History 11.92%; Music 9.78%; Philosophy/Religion 9.55%; Business/Economics 6.15%; Medicine 4.36%; Law 3.85%; Sociology 3.75%; Political Science 3.58%; Biology 2.60%
Pearson PLC: Language/Literature 18.67%; Business/Economics 13.30%; Computer Science 9.42%; Engineering 8.04%; History 7.59%; Mathematics 6.04%; Education 5.64%; Sociology 4.18%; Philosophy/Religion 3.81%; Physical Sciences 2.75%

24 WorldCat Publisher Profiles – Conspectus Categories
Oxford Univ. Press: English literature 10.66%; English language 5.86%; Instrumental music 3.48%; Vocal music 3.09%; Literature on music 2.26%; History – Britain 1.82%; Economic history 1.38%; American lit. 1.35%; History – S. Asia 1.30%; General history 1.29%
Pearson PLC: English language 7.74%; Business admin. 4.62%; English literature 3.63%; Economics 2.94%; Comp. programming 2.39%; Electrical engineering 2.24%; Early childhood ed. 2.05%; Computer software 1.88%; U.S. federal law 1.80%; Computer Science 1.54%

25 WorldCat Publisher Profiles – Conspectus Subjects
Oxford Univ. Press: English – modern 5.57%; English lit. – prose 2.51%; English lit. – 19th c. 2.23%; Juvenile lit. 1.06%; English lit. – poetry 1.03%; English lit. – collections 0.80%; Biographies 0.76%; English lit. – %; Shakespeare 0.68%; Sacred choruses 0.66%
Pearson PLC: English – modern 7.68%; Management 2.53%; Programming 1.74%; Arithmetic 1.09%; Economic theory 1.06%; Marketing 1.06%; General algebra 1.04%; Accounting 0.97%; Juvenile lit. 0.93%; English lit. – 19th c. 0.89%

26 WorldCat Publisher Profiles – Conspectus Divisions
Springer (Firm): Computer Science 16.83%; Engineering 15.12%; Mathematics 12.96%; Medicine 9.93%; Physical Sciences 9.83%; Biology 5.22%; Business/Economics 5.13%; Health Professions 4.48%; Chemistry 3.14%; Geography 2.58%
Reed Elsevier PLC: Language/Literature 14.18%; Law 11.78%; Engineering 11.73%; Business/Economics 6.82%; Medicine 6.50%; Physical Sciences 5.01%; History 4.57%; Biology 4.32%; Health Professions 3.70%; Chemistry 3.51%

27 WorldCat Publisher Profiles – Conspectus Categories
Springer (Firm): Computer science 5.23%; General math 4.48%; Health professions 4.03%; Electrical engineering 3.73%; General engineering 3.25%; Mathematical analysis 3.06%; Computer software 2.37%; Comp. programming 2.34%; Probability/Statistics 2.20%; Mech. engineering 2.17%
Reed Elsevier PLC: English literature 5.84%; Health professions 3.40%; English language 2.79%; U.S. federal law 2.32%; General engineering 2.26%; Electrical engineering 2.10%; General law 1.70%; Industrial economics 1.65%; Business admin. 1.53%; U.S. state law 1.46%

28 WorldCat Publisher Profiles – Conspectus Subjects
Springer (Firm): Health professions 3.56%; Math collections 2.76%; Computer science 1.84%; Programming 1.46%; Access/security 1.10%; Artificial intelligence 1.03%; Mathematical stats 1.03%; Analytical physics 1.02%; Industrial management 0.99%; Engineering materials 0.90%
Reed Elsevier PLC: English – modern 2.68%; English - prose 2.06%; Health professions 1.92%; U.S. state law 1.37%; Industrial management 1.22%; Legal periodicals 1.16%; English lit %; Engineering materials 0.86%; English fiction 0.83%; Nuclear physics 0.68%

29 Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name
Add $4 for publisher name
Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
752 Added Entry – Hierarchical Place Name
Add $2 FAST where place of publication matches FAST geographical subject headings
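
As a rough illustration of the projected 710 coding, the sketch below assembles a display string with the subfields named on the slide. It makes several assumptions: the helper name and the set of authority headings are hypothetical, indicators are omitted, `pbl` is used as the MARC relator code for publisher in $4, and the lowercase source code `naf` stands in for the slide's "NAF". It does not use any MARC library.

```python
# Hypothetical set of preferred forms that already have an authority record.
NAF_HEADINGS = {"Oxford University Press", "Springer (Firm)"}


def build_710(preferred_form, naf_headings=NAF_HEADINGS):
    """Return a display string for a 710 added entry for a publisher name."""
    # $a carries the name; $4 carries the relator code (pbl = publisher).
    subfields = [("a", preferred_form), ("4", "pbl")]
    # Add $2 only where the preferred form matches an existing authority record.
    if preferred_form in naf_headings:
        subfields.append(("2", "naf"))
    return "710 " + " ".join(f"${code} {value}" for code, value in subfields)
```

A name with a matching authority record, such as `build_710("Oxford University Press")`, gets the extra $2 subfield; a name without one carries only $a and $4.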

30 Ongoing Research
Further data mining
Profile other aspects of publication output
Profile other publishers
Trends over time
Author clusters
Geographic holdings patterns
Collection Analysis

31 Ongoing Research
Plan for long-term maintenance
ISBN-13 compliance
File expansion of ongoing mergers/acquisition activities
Deeper scaling into WorldCat (beyond ISBN)
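
ISBN-13 compliance typically involves converting legacy ISBN-10 values. The standard conversion prefixes the first nine digits with 978 and recomputes the check digit using alternating weights of 1 and 3. The function below is a minimal sketch of that public algorithm, not code from the project, and the function name is an assumption.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 (hyphens allowed) to an unhyphenated ISBN-13."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10 or not chars[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Drop the old check digit and add the 978 prefix.
    body = "978" + chars[:9]
    # Weighted sum with weights 1, 3, 1, 3, ... over the 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)
```

Because the ISBN-10 check digit is discarded, the same prefix-based clustering can be applied to either form once the leading 978 is accounted for.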

32 OCLC Publisher Name Server Project page:

33 Thank You! Questions and Discussion
Lynn Silipigni Connaway
Timothy J. Dickey

