Matching names in parallel. T. Hickey, Access 2006, October 2006.

Presentation transcript:

Matching names in parallel. T. Hickey, Access 2006, October 2006

Virtual International Authority File
- Link national authority records
- Build on their authority work
- Move towards universal bibliographic control
  - Allow national or regional variations in authorized forms to co-exist
  - Support needs for variations in preferred language, script, and spelling
  - 10 million WorldCat records in non-English metadata

Joint VIAF Project

Matching Variations
In the LCNAF and PND authority files:
- Same name, same person
- Same name, different people
- Different names, same person
- Missing person in one file

Two Different People – One Name
Adams, Mike
- PND: a golfer
- LCNAF: author of a Beatles collector's guide
Same Name, Different People

One Person – Two Names
- LCNAF: Morel, Pierre
- PND: Morellus, Petrus
Same Person, Different Names

Enhancing the Authorities Bibliographic Record Derived Authority Record Enhanced Authority

Strong Matching Attributes
- A work (title) in common
- Common control numbers (ISBN, ISSN, or LCCN)
- Exact birth and death year
- Joint authors
- Name as subject

Weaker Attributes
- Only one of birth/death date(s) (allows some variation)
- Subject area of works (two levels)
- Format (books, films, musical scores, etc.)
- Language
- Publisher
- Partial title match
- Date of publication
- Country
- Role (author, illustrator, composer, etc.)
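
The strong and weaker attributes listed on these two slides could be combined into a pair score along the following lines. This is only a sketch: the slides do not give the actual scoring rule, so the attribute names, the weights (5 and 1), and the sample records are all assumptions made for illustration. It also simplifies the date handling, treating dates as a single strong attribute.

```python
# Hypothetical attribute names; the real VIAF code may organize these differently.
STRONG_ATTRIBUTES = ("titles", "control_numbers", "life_years",
                     "joint_authors", "subject_of")
WEAK_ATTRIBUTES = ("subject_areas", "formats", "languages",
                   "publishers", "countries", "roles")


def match_score(a, b, strong_weight=5, weak_weight=1):
    """Score two name records by the attributes they share.

    Each record is a dict mapping an attribute name to a list of values.
    An attribute counts once if the two records have any value in common.
    The weights are illustrative assumptions, not values from the slides.
    """
    score = 0
    for attr in STRONG_ATTRIBUTES:
        if set(a.get(attr, ())) & set(b.get(attr, ())):
            score += strong_weight
    for attr in WEAK_ATTRIBUTES:
        if set(a.get(attr, ())) & set(b.get(attr, ())):
            score += weak_weight
    return score


# Made-up records: one shared title (strong), one shared language (weak),
# and differing roles (no contribution).
lc_record = {"titles": ["hypothetical title a"], "languages": ["lat"],
             "roles": ["author"]}
pnd_record = {"titles": ["hypothetical title a"], "languages": ["lat"],
              "roles": ["editor"]}
score = match_score(lc_record, pnd_record)
```

A threshold on such a score would then decide whether two authority records are linked; where that threshold sits is a tuning decision the slides do not describe.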

Computing it
- Standard approach
  - Generate keys and data
  - Load information into a database
  - Index it
  - Extract fields needed
- Map/Reduce approach
  - Split the database up
  - Run parallel jobs
  - Bring information together via map/reduce
  - Assemble information in stages

Map/Reduce
- Two stages
  - Map
    - Read in source file (e.g. MARC-21)
    - Write out key + data
  - Reduce
    - Read in array of data for each unique key
    - Write out key + data
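
The two stages can be sketched in a few lines of single-process Python. This is a minimal illustration of the pattern, not the OCLC implementation: the function names are invented here, and the input records are plain dictionaries standing in for MARC-21 records.

```python
from collections import defaultdict


def map_records(records, mapper):
    """Map stage: run the mapper over every record, collecting (key, data) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def group_by_key(pairs):
    """Gather all data written under the same key into one list."""
    grouped = defaultdict(list)
    for key, data in pairs:
        grouped[key].append(data)
    return grouped


def reduce_groups(grouped, reducer):
    """Reduce stage: hand each unique key and its list of data to the reducer."""
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def name_mapper(record):
    # Hypothetical mapper: key on the lowercased name, carry the record id.
    yield record["name"].lower(), record["id"]


def collect_ids(key, values):
    # Hypothetical reducer: write the key with the sorted ids seen for it.
    yield key, sorted(values)


records = [
    {"id": "lc-1", "name": "Morel, Pierre"},
    {"id": "pnd-7", "name": "Adams, Mike"},
    {"id": "lc-2", "name": "Adams, Mike"},
]
result = reduce_groups(group_by_key(map_records(records, name_mapper)),
                       collect_ids)
```

In a parallel setting the map calls run on separate slices of the input and the grouping step becomes a sort and merge across machines, but the mapper and reducer keep the same shape.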

Overview of MapReduce Source: Dean & Ghemawat (Google)

Our Implementation
- Written in Python
- Uses ssh and XML-RPC for control and communication
- Map/Reduce seems to add ~10% overhead
- Ran an earlier implementation on a 48-CPU cluster
- Current VIAF cluster is a 12-CPU cluster on 4 nodes
- Running Linux and 64-bit Python

VIAF Matching Code
- 17 modules
- 1,100 lines of code
- Plus 600 lines configuration
  - 2,755 lines of tables embedded in code

VIAF Data Flow (diagram)
Inputs: LC Authority, LC Catalog, PND Authority, PND Catalog, each passed through an Extract Data step.
Labeled steps, with their key: value formats where given:
- get changed ids (changed authority ids)
- build buckets (surname: forename, date)
- eliminate forename, date conflicts from buckets
- potential pairs
- build name:id map (name: id)
- map authorities (authority id: bib id)
- build compare data (id: tag, data)
- identify compare data (pair id: [bib/auth] id)
- select compare data (pair id: compare data)
- compare (pair id: scores)
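
The bucket steps named in the diagram (build buckets keyed on surname, eliminate forename and date conflicts, emit potential pairs) might look roughly like this. It is a sketch under assumptions: the record fields, the conflict rule, and the restriction to cross-file pairs are guesses made for illustration, and the sample records are invented.

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(records):
    """Group records into buckets keyed on lowercased surname."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["surname"].lower()].append(rec)
    return buckets


def conflicts(a, b):
    """Assumed rule: a conflict exists when both records carry a value
    for forename or birth year and the two values differ."""
    for field in ("forename", "birth_year"):
        if a.get(field) and b.get(field) and a[field] != b[field]:
            return True
    return False


def potential_pairs(buckets):
    """Pair records from different source files that share a bucket
    and show no forename or date conflict."""
    pairs = []
    for surname in sorted(buckets):
        for a, b in combinations(buckets[surname], 2):
            if a["source"] != b["source"] and not conflicts(a, b):
                pairs.append((a["id"], b["id"]))
    return pairs


# Invented sample records; ids and birth years are placeholders.
records = [
    {"id": "lc-1", "source": "LC", "surname": "Adams",
     "forename": "Mike", "birth_year": None},
    {"id": "pnd-1", "source": "PND", "surname": "Adams",
     "forename": "Mike", "birth_year": None},
    {"id": "pnd-2", "source": "PND", "surname": "Adams",
     "forename": "John", "birth_year": 1900},
]
pairs = potential_pairs(build_buckets(records))
```

Note that the surviving pair is only a candidate. Two records named "Adams, Mike" can still be different people, which is why the later compare step scores each pair on the matching attributes.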

WorldCat Identities
- Bring together all of WorldCat’s information about people
  - Name(s)
  - Works by and about
  - Subjects
  - Dates
  - Fiction/non-fiction
  - Roles
  - Co-authors
- Add links
  - Wikipedia
  - Authority files

Sample Identity

Statistics
- Nearly 19 million different ‘identities’ in WorldCat
- 80 million (nominally) controlled headings
- The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)

Identities Data Flow (diagram)
Stages: Stage 1, Stage 2, Stage 3, Stage 4
Data labels: WorldCat, Authorities, Cover Art, FRBR, Audience, Wikipedia, Name Info, Citations, Identities

Identities Stage 1: Extract Data From WorldCat
- Input: WorldCat (MARC-21)
- Map output: NameKey WorkID
- Reduce output: WorkID NameKey
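
One possible reading of Stage 1 is that the map step keys each work on a normalized name, and the reduce step writes the pairs back out keyed on the work so that the next stage can group by WorkID. The sketch below follows that reading. The record layout, the `normalize` rule, and the sample data are assumptions, since the slide gives only the key and value names.

```python
from collections import defaultdict


def normalize(name):
    """Hypothetical NameKey: lowercase with whitespace collapsed."""
    return " ".join(name.lower().split())


def stage1_map(record):
    """Map: write NameKey + WorkID for every name on a bibliographic record."""
    for name in record["names"]:
        yield normalize(name), record["work_id"]


def stage1_reduce(name_key, work_ids):
    """Reduce: for one NameKey, write WorkID + NameKey for each distinct work."""
    for work_id in sorted(set(work_ids)):
        yield work_id, name_key


def run_stage1(records):
    """Single-process driver standing in for the parallel framework."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in stage1_map(record):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(stage1_reduce(key, grouped[key]))
    return output


# Invented records; real input would be MARC-21 from WorldCat.
records = [
    {"work_id": "w1", "names": ["Morel, Pierre"]},
    {"work_id": "w2", "names": ["Morel,  Pierre", "Adams, Mike"]},
]
stage1_output = run_stage1(records)
```

Grouping on the name key first means variant spacing or capitalization of the same heading collapses before the data is re-keyed on the work.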

Identities Stage 2: Extract Data From Authorities
- Input: NACO Authorities file (MARC-21)
- Map output: NameKey XTos XFroms
- Reduce output: NameKey

Identities Stage 3: Connect Citations with Names
- Input: Stage 1 output, WorkID’s NameKey
- Map output: NameKey

Identities Stage 4: Create Identities
- Input
  - Authority info from stage 2
  - Merged name info from stage 3
  - Merged citations from stage 3
- Map output: pass through
- Reduce output: Pnkey

Schedules
- Identities
  - Up this year?
- VIAF
  - Reload, rematch this year
  - Public service up early 2007

Conclusions
- Our merged files (e.g. WorldCat) are really quite large
- More processing power opens up new ways of manipulating and looking at our data
- Parallel processing is the only way to obtain the cycles needed
- Map-Reduce is an attractive way to do parallel processing
  - Forces decomposition
  - Scales well
  - Opens up new possibilities

Thank you. T. Hickey, VIAF.org, Access 2006, October 2006