ASIS&T 2008 Annual Meeting Columbus, OH 28 October, 2008 Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Timothy J. Dickey, Ph.D. Post-Doctoral.

1 ASIS&T 2008 Annual Meeting Columbus, OH 28 October, 2008 Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Timothy J. Dickey, Ph.D. Post-Doctoral Researcher OCLC Beyond Data Mining: Delivering the Next Generation of Services from Library Data

2 WorldCat as an Aggregate Collection Data Mining and Analysis of WorldCat: "…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making." Lavoie, B. F., Connaway, L. S., & O'Neill, E. T. (2007). Mapping WorldCat's digital landscape. Library Resources & Technical Services, 51, at 107.

3 WorldCat: July 2008 Total holdings: 1,292,763,300 Manifestations (records): 108,828,533 Works: 84,096,107 Digital Items: 3,182,550 Institutions: 69,000 Physical Items: ~1.2 billion

4 Global Origins of WorldCat Materials US 28% UK 8% Canada 3% Rest of World 27% Unknown 17% France 4% Germany 10%

5 Global Origins of WorldCat Materials
Content Languages: % of WC non-English
Top 5 non-English: German: 12 million; French: 6.1 million; Spanish: 3.5 million; Dutch: 2.6 million; Japanese: 2.4 million
Materials w/non-US origins: 57.9 million (55%)
Top 5: Germany: 10.0 million; UK: 8.8 million; France: 4.2 million; Netherlands: 2.9 million; Canada: 2.9 million
Non-English Metadata Language: 28 million (66 languages)
Top 5: German: 11 million; Dutch: 5.0 million; Swedish: 1.9 million; French: 1.8 million; Finnish: 0.7 million

6 WorldCat as a Decision-Making Resource Collection management Cooperative collection development Comparative collection analysis Collection assessment Mass digitization Off-site storage Preservation

7 WorldCat as a Decision-Making Resource Services Virtual reference Recommender services Social networking Systems Precision

8 WorldCat as a Decision-Making Resource Three Areas of Data Mining Research: OCLC WorldMap Audience Level Publisher Name Server

9 OCLC WorldMap

10 OCLC WorldMap™: Objectives Geographically represent WorldCat data Titles published in each country Holdings for titles published in each country Languages represented for titles published in each country

11 OCLC WorldMap™: Objectives Geographically represent data from UNESCO, ARL, and NCES for each country Number of Libraries Library volumes Certified/degreed librarians Registered library users Library expenditures Cultural heritage institutions (museums and archives) Publishers

12 OCLC WorldMap™: Objectives Research prototype Support OCLC data mining research Visually display data for review and analysis Internal use Sales and marketing External use Library collection assessment and comparison Data may be processed AT A GLANCE Complement the AAU/ARL Global Resources Network project Project of the Council on Library and Information Resources (CLIR)

23 OCLC Audience Level

24 Audience Level: Rationale and Objectives Holdings represent selection decisions by librarians, which implies there are more than 1 billion individual selection decisions in the WorldCat holdings file. Selections serve the interests of a library's target community. Associate community (audience level) to library profiles, e.g., ARL, non-ARL academic, public, K-12 school. Thus we can infer a material's audience level from holdings patterns, which in turn can support: collection management; readers' advisory services; reference services; information retrieval.

29 Example Computation: Build Community

Library symbol  Library name                     Library type  Weight
OHI             State Library of Ohio            Other         x
OCO             Columbus Metropolitan Library    Public        0.33
CDC             Cedarville University            Academic      0.67
LIM             Lima Public Library              Public        0.33
OUN             Ohio University                  Research      1.00
OSD             SEO Automation Consortium        Other         x
BGU             Bowling Green State University   Academic      0.67
MIA             Miami University                 Academic      0.67
AKR             University of Akron              Academic      0.67
BGF             Firelands College                Academic      0.67
CIN             University of Cincinnati         Research      1.00
TOL             University of Toledo             Academic      0.67
KSU             Kent State University            Research      1.00
HIR             Hiram College                    Academic      0.67
YNG             Youngstown State University      Academic      0.67

30 FRBRizing Audience Level Results Calculate Audience Level for each Manifestation. Aggregate weighted holdings for Work. Table columns: OCLC Number, Total Holdings, Usable Holdings, Manifestation Audience Level x
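
The computation outlined on slides 29 and 30 can be sketched roughly as follows. The slides do not state the exact formula, so this sketch assumes that the audience level is the mean weight of the usable holdings and that libraries marked "x" (type Other) are not usable; the function names are hypothetical.

```python
# Weights by library type, copied from the "Build Community" slide.
# Libraries of type "Other" are marked "x" there and are treated here as
# unusable holdings (an assumption about what "x" means).
TYPE_WEIGHTS = {"Public": 0.33, "Academic": 0.67, "Research": 1.00}


def audience_level(holding_types):
    """Return (level, total_holdings, usable_holdings) for one manifestation.

    The level is assumed to be the mean weight of the usable holdings.
    """
    weights = [TYPE_WEIGHTS[t] for t in holding_types if t in TYPE_WEIGHTS]
    level = sum(weights) / len(weights) if weights else None
    return level, len(holding_types), len(weights)


def work_audience_level(manifestations):
    """Pool the holdings of all manifestations of a work, then score them."""
    pooled = [t for holding_types in manifestations for t in holding_types]
    return audience_level(pooled)


# Library types of the 15 Ohio holdings listed on the example slide.
example = ["Other"] * 2 + ["Public"] * 2 + ["Academic"] * 8 + ["Research"] * 3
level, total, usable = audience_level(example)
print(f"total={total} usable={usable} level={level:.2f}")
```

Under these assumptions the 15 example holdings yield 13 usable holdings and a level of about 0.69, that is, between the Academic and Research weights.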

32 Evaluating the OCLC Audience Level Random sample of 30 zoology books, all audience levels. Human subjects ranked books in increasing order of difficulty. Strong statistical correlation between human subjects' ranking and programmatic ranking.
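
The slide does not say which statistic was used. One common way to compare two rankings of the same items is Spearman's rank correlation; the sketch below uses the textbook formula for rankings without ties. The two rankings are invented purely for illustration and are not the study's data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two rankings without ties."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented ranks for six books: position 1 is the easiest book.
human_rank = [1, 2, 3, 4, 5, 6]
program_rank = [2, 1, 3, 4, 6, 5]
rho = spearman_rho(human_rank, program_rank)
print(f"rho={rho:.3f}")
```

A value near 1 means the two rankings largely agree; a value near 0 means they are unrelated.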

33 Evaluating the OCLC Audience Level

35 OCLC Publisher Name Server

36 Publisher Name Server: Research Objectives Resolve, for data mining and quality of WorldCat: ISBN prefixes to publisher name; variant publisher names to a preferred form. Complement Collection Analysis Service: librarians; publishers. Capture and profile attributes of individual publishers: location(s); language(s) of materials published; genre(s)/format(s); dominant subject domain(s); parent company and subsidiaries.

37 Publisher Name Server: Methodology Programmatically cluster publisher records using ISBN prefixes. Data clustering (The Free Dictionary): "The science of extracting useful information from large data sets or databases"; classification of similar objects into different groups; partitioning of a data set into subsets (clusters); data in each subset (ideally) share some common trait. Hand parse the entities and resolve ISBN prefixes.
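
A minimal sketch of the clustering step, under stated assumptions: ISBNs are hyphenated ISBN-10 strings, so the publisher prefix is the first two hyphen-separated parts, and the most frequent name string in a cluster is offered only as a candidate preferred form for the hand-parsing step. The function names and the sample records (with placeholder ISBNs) are hypothetical.

```python
from collections import Counter, defaultdict


def isbn_prefix(isbn):
    """Return 'group-registrant' from a hyphenated ISBN-10 string."""
    parts = isbn.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Group (isbn, publisher string) pairs by ISBN prefix, counting names."""
    clusters = defaultdict(Counter)
    for isbn, name in records:
        clusters[isbn_prefix(isbn)][name.strip()] += 1
    return clusters


def candidate_preferred_forms(clusters):
    """Suggest the most frequent name per prefix for human review."""
    return {prefix: names.most_common(1)[0][0]
            for prefix, names in clusters.items()}


# Hypothetical records; the ISBNs are placeholders, not real identifiers.
records = [
    ("0-19-000000-0", "Oxford University Press"),
    ("0-19-000001-0", "Oxford Univ. Press"),
    ("0-19-000002-0", "Oxford University Press"),
    ("0-13-000000-0", "Prentice-Hall"),
    ("0-13-000001-0", "Prentice-Hall, Inc."),
    ("0-13-000002-0", "Prentice-Hall"),
]
candidates = candidate_preferred_forms(cluster_by_prefix(records))
print(candidates)
```

The variant strings stay attached to each cluster, so a reviewer can see every form that was merged before accepting or overriding the suggestion.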

38 Publisher Name Server: Database 1750 publishing entities Relational database, preserving hierarchical relationships Begins with high-occurrence entities: Top 10 lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand) Top 10 university presses Mergers and acquisitions, last 8 years

39 Publisher Name Server: Data Captured
Database Fields: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL; Languages; Formats; Conspectus Subjects
Data Sources: U.S. Library of Congress, Name Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers Weekly Online; Hoover's Handbook Online; Standard & Poor's Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites; DATA MINING

41 Publisher Name Server: Database More than 56,000 separate strings mapped to 1750 entities 8.5 million OCLC records 22% of these are Library of Congress records ~490 million holdings Hierarchical relationships maintained

42 Entity-Parsing in a World of Mergers and Acquisitions Prentice-Hall, Inc.; Pearson Education, Inc.; Addison-Wesley Publishing Company; Allyn and Bacon; Dominie Press; Benjamin/Cummings Publishing Company; Scott, Foresman and Company; HarperCollins Educational Publishers; Longmans, Green, and Co.; Pearson PLC; Pearson Canada; Pearson Technology Group; Copp Clark; Adobe Press; Cisco Press; Penguin Books; Allen Lane; Ladybird Books; Riverhead Books; Puffin Books; Putnam Books; Berkley Publishing Group; Avery
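
One way such hierarchical relationships could be stored and used is a child-to-parent map that is walked up to the top-level entity. The slide's chart does not survive in this transcript, so the four links below are assumptions chosen for illustration, the record counts are invented, and the function names are hypothetical.

```python
# Child -> parent links (assumed for illustration; the slide's chart is lost).
PARENT = {
    "Prentice-Hall, Inc.": "Pearson Education, Inc.",
    "Pearson Education, Inc.": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Penguin Books": "Pearson PLC",
}


def top_level_entity(name, parent=PARENT):
    """Follow parent links until an entity without a parent is reached."""
    seen = {name}
    while name in parent:
        name = parent[name]
        if name in seen:
            raise ValueError(f"cycle in hierarchy at {name!r}")
        seen.add(name)
    return name


def roll_up(record_counts, parent=PARENT):
    """Sum per-entity record counts under each top-level entity."""
    totals = {}
    for name, count in record_counts.items():
        top = top_level_entity(name, parent)
        totals[top] = totals.get(top, 0) + count
    return totals


# Invented counts, only to show the aggregation.
totals = roll_up({
    "Prentice-Hall, Inc.": 10,
    "Puffin Books": 5,
    "Pearson PLC": 1,
    "Oxford University Press": 7,
})
print(totals)
```

Keeping the links rather than flattening names to the parent lets the same data answer questions at either level, for an imprint alone or for the whole group.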

43 Publisher Profiles Oxford University Press 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat) Pearson PLC Includes 14 subsidiaries and acquisitions Aggregate: 291,433 records (0.27% of WorldCat)
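
The percentages on this slide appear consistent with dividing each publisher's record count by the July 2008 WorldCat record count from slide 3. That choice of denominator is an assumption; the check below only reproduces the arithmetic.

```python
# Manifestations (records) in WorldCat, July 2008, from slide 3.
WORLDCAT_RECORDS = 108_828_533


def percent_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Share of WorldCat records, as a percentage."""
    return 100 * records / total


for name, records in [("Oxford University Press", 210_095),
                      ("Pearson PLC", 291_433)]:
    print(f"{name}: {percent_of_worldcat(records):.2f}% of WorldCat")
```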

44 Publisher Profiles – Top Languages

Oxford Univ. Press:
English          96.74%
Latin             0.51%
German            0.39%
Chinese           0.39%
French            0.37%
Spanish           0.28%
Afrikaans         0.14%
Middle English    0.13%
Malay             0.09%
Swahili           0.09%

Pearson PLC:
English          95.27%
Spanish           1.43%
German            1.33%
French            0.60%
Dutch             0.55%
Latin             0.26%
Malay             0.06%
Ancient Greek     0.05%
Portuguese        0.05%
Italian           0.04%

45 Publisher Profiles – Conspectus Divisions

Oxford Univ. Press:
Language/Literature    27.12%
History                11.92%
Music                   9.78%
Philosophy/Religion     9.55%
Business/Economics      6.15%
Medicine                4.36%
Law                     3.85%
Sociology               3.75%
Political Science       3.58%
Biology                 2.60%

Pearson PLC:
Language/Literature    18.67%
Business/Economics     13.30%
Computer Science        9.42%
Engineering             8.04%
History                 7.59%
Mathematics             6.04%
Education               5.64%
Sociology               4.18%
Philosophy/Religion     3.81%
Physical Sciences       2.75%

46 Publisher Profiles – Conspectus Categories

Oxford Univ. Press:
English literature      10.66%
English language         5.86%
Instrumental music       3.48%
Vocal music              3.09%
Literature on music      2.26%
History – Britain        1.82%
Economic history         1.38%
American lit.            1.35%
History – S. Asia        1.30%
General history          1.29%

Pearson PLC:
English language         7.74%
Business admin.          4.62%
English literature       3.63%
Economics                2.94%
Comp. programming        2.39%
Electrical engineering   2.24%
Early childhood ed.      2.05%
Computer software        1.88%
U.S. federal law         1.80%
Computer Science         1.54%

47 Publisher Profiles – Conspectus Subjects

Oxford Univ. Press:
English – modern            5.57%
English lit – prose         2.51%
English lit – 19th c.       2.23%
Juvenile lit.               1.06%
English lit – poetry        1.03%
English lit – collections   0.80%
Biographies                 0.76%
English lit –                   %
Shakespeare                 0.68%
Sacred choruses             0.66%

Pearson PLC:
English – modern            7.68%
Management                  2.53%
Programming                 1.74%
Arithmetic                  1.09%
Economic theory             1.06%
Marketing                   1.06%
General algebra             1.04%
Accounting                  0.97%
Juvenile lit.               0.93%
English lit – 19th c.       0.89%

48 Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name: Add $4 for publisher name. Add $2 NAF where preferred form matches existing authority record (44% of current PNAF).
752 Added Entry – Hierarchical Place Name: Add $2 FAST where place of publication matches FAST geographical subject headings.
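
As a rough illustration of the projected coding, the sketch below renders the two added entries as display strings. Several details are assumptions rather than anything stated on the slide: the relator code "pbl" in $4, the lowercase source codes "naf" and "fast" in $2, the indicator values, the "#" shown for a blank indicator, and the sample names. The helper functions are hypothetical and do not produce real MARC records.

```python
def render_field(tag, indicators, subfields):
    """Render a field as 'TAG II $a value $b value ...' for display only."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {indicators} {body}"


def publisher_added_entry(preferred_form, in_naf):
    """710 with $4 for the publisher role, plus $2 when NAF matches."""
    subfields = [("a", preferred_form), ("4", "pbl")]  # "pbl" assumed
    if in_naf:
        subfields.append(("2", "naf"))
    return render_field("710", "2#", subfields)


def place_added_entry(country, city, in_fast):
    """752 with country in $a and city in $d, plus $2 when FAST matches."""
    subfields = [("a", country), ("d", city)]
    if in_fast:
        subfields.append(("2", "fast"))
    return render_field("752", "##", subfields)


# Sample values chosen only for illustration.
print(publisher_added_entry("Oxford University Press", in_naf=True))
print(place_added_entry("England", "Oxford", in_fast=True))
```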

49 Future Research Further data mining: profile aspects of publication output; deeper scaling into WorldCat (beyond ISBN). Plan for long-term maintenance: ISBN-13 compliance; file expansion for ongoing merger/acquisition activities.
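
For the ISBN-13 compliance item, the standard conversion from ISBN-10 is to prepend 978 to the first nine digits and recompute the check digit with alternating weights of 1 and 3. The sketch below shows that conversion only; how the Publisher Name Server would handle it is not described in the slides, and the function name is hypothetical.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to its 978-prefixed ISBN-13 (digits only).

    The ISBN-10 check character is discarded without being validated.
    """
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    core = "978" + digits[:9]
    # Weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0-306-40615-2"))
```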

50 Thank You! Questions and Discussion Lynn Silipigni Connaway Timothy J. Dickey

