Automatic cataloging & classification Eric Childress OCLC Research OCLC Members Council Research and New Technologies Interest Group 25 October 2005.

1 Automatic cataloging & classification Eric Childress OCLC Research OCLC Members Council Research and New Technologies Interest Group 25 October 2005

2 The key question
Can machines be leveraged for:
–Baseline metadata
    Critical data present
    Accurate tagging
    Accurate values
–Ideal: Enriched metadata
The answer:
–Yes…with caveats
[Diagram: "Status quo", with human labor as input and metadata as output]

3 Automation approaches
Harvesting: Drawing from extant metadata in one or more sources
Extraction: Drawing from attributes of the resource and/or content in the resource
Both: Integrating both harvesting & extraction in metadata generation
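
The "both" approach can be sketched in a few lines of Python. This is a minimal illustration, not any of the tools named in this talk: it harvests whatever `<meta>` values a Web page already embeds, and falls back to extracting the `<title>` text only when no title was harvested. The `dc.title` field name, the `MetadataParser` class, and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects embedded <meta> values (harvesting) and <title> text (extraction)."""

    def __init__(self):
        super().__init__()
        self.harvested = {}    # values the page author embedded as <meta name=... content=...>
        self.title_text = ""   # text pulled from the document itself
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.harvested[attrs["name"].lower()] = attrs["content"].strip()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def generate_record(html):
    """Prefer harvested values; use extraction only to fill a missing title."""
    parser = MetadataParser()
    parser.feed(html)
    record = dict(parser.harvested)
    if "dc.title" not in record and parser.title_text.strip():
        record["dc.title"] = parser.title_text.strip()
    return record


# Hypothetical sample page: it embeds a creator but no title metadata.
SAMPLE = """<html><head>
<title>Annual Report 2005</title>
<meta name="DC.creator" content="Example Library">
</head><body><p>Text.</p></body></html>"""

record = generate_record(SAMPLE)
print(record)
```

Preferring harvested values over extracted ones is a design choice for this sketch; a production system might weigh the two sources differently.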

4 Approaches (cont.)
Harvesting & extraction can be integrated with other tactics:
–Point-of-transaction capture: Manual and/or automatic capture of metadata during the lifecycle of the resource and/or metadata (e.g., the source agency, date of record)
–Human review/prompting: Integrating human decision-making to address cases machines cannot handle efficiently (e.g., linking name references to the correct authority file when several names are similar)
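
The human review/prompting tactic can be illustrated with a small routing function. This is only a sketch under stated assumptions: the similarity measure (`difflib.SequenceMatcher`), the 0.7 threshold, and the sample headings are all hypothetical choices, not how any authority control system named here works. A name is linked automatically when exactly one heading is close enough, and passed to a person when several are.

```python
from difflib import SequenceMatcher


def link_name(name, authority_headings, threshold=0.7):
    """Return ("linked", heading), ("review", candidates) or ("unmatched", [])."""
    candidates = []
    for heading in authority_headings:
        score = SequenceMatcher(None, name.lower(), heading.lower()).ratio()
        if score >= threshold:
            candidates.append(heading)
    if len(candidates) == 1:
        return "linked", candidates[0]      # unambiguous: the machine decides
    if candidates:
        return "review", sorted(candidates)  # several similar names: prompt a human
    return "unmatched", []


# Hypothetical authority headings for illustration only.
HEADINGS = ["Smith, John, 1950-", "Smith, John, 1962-", "Childress, Eric"]

print(link_name("Childress, Eric", HEADINGS))
print(link_name("Smith, John", HEADINGS))
```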

5 Harvesting options
New record, same database:
–OCLC derive record technique
External metadata files:
–Z39.50/Zing/MXG
–OAI harvesting
–Citation tools (e.g., EndNote)
Embedded metadata harvesting:
–Processes structured metadata
–Various tools (e.g., DC tools list)
Many harvesting tools include some extraction features (and vice-versa)
–Example: InfoLibrarian appliance
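
As a rough illustration of OAI harvesting, the sketch below parses an OAI-PMH `ListRecords` response carrying unqualified Dublin Core (`oai_dc`) and turns each record into a dictionary. The XML namespaces are the standard OAI-PMH 2.0 and Dublin Core ones; the sample response, the `oai:example.org:1` identifier, and the `harvest_records` function are made up for the example. A real harvester would fetch the response over HTTP and follow resumption tokens, which is omitted here.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, inlined so the example runs without a network.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample report</dc:title>
          <dc:creator>Example, Author</dc:creator>
          <dc:subject>Cataloging</dc:subject>
          <dc:subject>Metadata</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_records(xml_text):
    """Return one dict per OAI record: its identifier plus repeatable DC fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = {}
        dc_root = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:
            for element in dc_root:
                name = element.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
                fields.setdefault(name, []).append((element.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records


harvested = harvest_records(SAMPLE_RESPONSE)
print(harvested)
```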

6 Extraction landscape
Many tools from many sources
–Features vary widely
–Some are narrow-band (e.g., domain-specific, narrow scope of data work)
–Standalone or highly integrated in systems (often as part of digital access mgt. systems)
Frequently-encountered features:
–Simple: document statistics, file type
–Complex: (reliable) language detection, audience level, topics, entities represented, document parts, taxonomy derivation
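
The "simple" end of that feature list needs very little machinery. The sketch below, which assumes nothing beyond the Python standard library, guesses a media type from the file name and computes a few document statistics; the `simple_features` function and its output field names are hypothetical. The "complex" features (language detection, topics, entities) require far more than this.

```python
import mimetypes
import re


def simple_features(filename, text):
    """Extract basic technical metadata: media type and document statistics."""
    media_type, _encoding = mimetypes.guess_type(filename)
    words = re.findall(r"\w+", text)
    return {
        "media_type": media_type or "unknown",
        "characters": len(text),
        "words": len(words),
        "unique_words": len({word.lower() for word in words}),
    }


features = simple_features("report.html", "Metadata about metadata.")
print(features)
```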

7 Extraction approaches
Information extraction:
–Automatically extract structured or semistructured information from unstructured machine-readable documents - Wikipedia
Natural language processing
–A range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications - AHIMA
–Extracts both explicit & implicit meaning

8 Some work of interest
Library of Congress
NSF-funded NSDL projects
AMeGA
iVia software
RLG's Automatic Exposure

9 Library of Congress
BEAT (Bibliographic Enrichment Advisory Team) activities & projects:
–MARC records from harvesting:
    E-CIP
    Web access to publications in series
–Numerous enrichment activities:
    TOCs: E-CIP, ONIX, dTOC project, more
    Reviews: HNET, Outstanding Reference Sources, HLAS reviews, MARS Best Free Reference Sites
    Contributor biographic information, ONIX descriptions, sample texts
    Links to e-versions of various texts
    Special projects for select LC collections
–Work with bibliographies & pathfinders

10 NSDL-related projects (selected)
MetaExtract: An NLP System to Automatically Assign Metadata
–CNLP (Syracuse U) & SIS (Syracuse U)
–Builds on several previous projects including:
Breaking the MetaData Generation Bottleneck [ ]
–CNLP (Syracuse U) & U Washington iSchool
–Application of NLP to automatically generate metadata for course-oriented materials
Lenny
–Cornell NSDL group & INFOMINE
–Orchestrated application of a suite of activities
    OAI harvesting with metadata augmentation using iVia
    Loosely-coupled third party services to provide metadata enhancements (correction, augmentation) to metadata destined for a central repository
    Interactions orchestrated by centralized software application

11 MetaExtract study findings
Auto-generated versus manually-assigned:
–Comparable
    Performance in Retrieval
    Quality of most elements (for Browsing)
–Better
    Coverage of metadata elements
Auto-generated versus full-text:
–Comparable
    Performance in Retrieval
–Better
    Enables Fielded searching
    Enables Browsing of results
–Provides useful structuring of data

12 Other projects
AMeGA (Automatic Metadata Generation Applications Project)
–UNC-CH SILS Metadata Research Center
–Research initiated to fulfill LC Bibliographic Control Action Plan 4.2 (deliver specifications for tools to effect automated processing of Web-based resources)
–Final report identifies and recommends functionalities for automatic metadata generation applications
iVia software
–Developed by INFOMINE & in use by NSDL, various other digital library projects; LC looking at using iVia
–Sophisticated open source harvester software that can assign LCSH, LCC
Automatic Exposure
–RLG-led initiative advocates capturing standard technical metadata about digital images automatically, as part of image creation

13 OCLC activities
OCLC Research projects:
–Automatic classification
–FRBR-related record harvesting
–SchemaTrans
OCLC production services:
–OCLC Digital Archive
–WorldCat link
–OCLC Connexion

14 Automatic classification work
Scorpion
–Open source software that implements a system for automatically classifying Web-accessible text documents
–Incorporated into Connexion extractor
FAST as a knowledge base for automatic classification project
–Evaluated FAST as a database to support automatic classification
ePrints-UK project
–A collaboration with RDN to pilot Web services to classify records by DDC and provide authority control for personal names for RDN eprint metadata records
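
One generic way to classify a text document automatically is to score it against the vocabulary attached to each class in a knowledge base and return the best-scoring class. The sketch below shows that idea only; it is not Scorpion's implementation, and the three DDC-style class numbers with their term lists are a toy knowledge base invented for the example.

```python
import re
from collections import Counter

# Toy knowledge base: class number -> descriptive terms (invented for illustration).
KNOWLEDGE_BASE = {
    "025.3": "cataloging classification bibliographic records metadata subject headings",
    "004": "computer software programming data processing systems",
    "641.5": "cooking recipes food kitchen meals",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(document, knowledge_base=KNOWLEDGE_BASE, top_n=1):
    """Rank classes by how often their terms occur in the document."""
    doc_counts = Counter(tokenize(document))
    scores = []
    for class_number, terms in knowledge_base.items():
        score = sum(doc_counts[term] for term in set(tokenize(terms)))
        if score:
            scores.append((score, class_number))
    # Highest score first; ties broken by class number for a stable result.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [class_number for _score, class_number in scores[:top_n]]


print(classify("Automatic cataloging and classification of bibliographic records"))
```

Raw term counts are the crudest possible scoring; weighting schemes and a much larger knowledge base (such as the FAST evaluation above explores) are what make the approach usable in practice.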

15 Other OCLC Research activities
FRBR-related record harvesting
–Best elements of all records in workset used to build a work record (Fiction Finder)
SchemaTrans project
–Adopts a novel approach to translating structured metadata between schemes
–Should be friendly to modular augmentation/correction activities
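
Building a work record from the best elements of a workset can be sketched as a field-by-field selection. The slide does not say how "best" is decided, so the rule used here is an assumption: take the most frequent non-empty value for each field, breaking ties in favor of the longer value. The `build_work_record` function and the sample records are hypothetical.

```python
from collections import Counter


def build_work_record(workset):
    """Pick, for each field, the most common non-empty value across the workset."""
    fields = sorted({field for record in workset for field in record})
    work = {}
    for field in fields:
        values = [record[field].strip() for record in workset
                  if record.get(field, "").strip()]
        if not values:
            continue
        counts = Counter(values)
        # Most frequent wins; ties go to the longer (then alphabetically later) value.
        work[field] = max(counts, key=lambda value: (counts[value], len(value), value))
    return work


# Hypothetical workset: three bibliographic records for the same work.
WORKSET = [
    {"title": "Moby Dick", "author": "Melville, Herman", "subject": ""},
    {"title": "Moby Dick", "author": "Melville, Herman, 1819-1891"},
    {"title": "Moby-Dick; or, The whale", "author": "Melville, Herman, 1819-1891",
     "subject": "Whaling -- Fiction"},
]

work_record = build_work_record(WORKSET)
print(work_record)
```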

16 OCLC products
OCLC Digital Archive
–Various harvesting options
    Capture of technical metadata
    Start descriptive records in Connexion
WorldCat link
–Scheduled ingest of metadata from OAI servers and batch processing into WorldCat
OCLC Connexion
–Extractor processes metadata from web sites
    Relatively sophisticated harvesting
    Processes non-canonical metadata
    Slated for significant upgrade in 2006
–Rules-aided LCSH assignment while editing bibs
–Automatic base authority record generation from relevant bibliographic record (NACO)

17 Links
Recommended reading:
–Liddy, Elizabeth, Metadata: A Promising Solution in EDUCAUSE Review, v. 40, n. 3 (May/June 2005)
OCLC Research links:
–Automatic classification projects
–SchemaTrans
–ResearchWorks

18 The following slides were not included in the original presentation due to time limitations

19 NSDL/GEM-related projects
MetaTest [ ]
–CNLP (Syracuse U) & Cornell HCI
–Evaluated retrieval effectiveness - and subject expert judgment of quality - of automatically & manually generated metadata
StandardConnection [ ]
–CNLP (Syracuse U) & U Washington iSchool
–Used NLP to correlate educational resources in digital libraries to educational content standards

20 Other projects
Data Fountains
–UC Riverside
–Fully-automated collection aggregation and metadata generation
Project CLiMB (Computational Linguistics for Metadata Building)
–Columbia University
–Extracts metadata from associated scholarly texts
