docWORKS/METAe: The Engine for Automated Metadata Extraction and XML Tagging. Claus Gravenhorst, Content Conversion Specialists (www.ccs-gmbh.de). METS Opening Day UK, July 2004.

Presentation transcript:

1 docWORKS/METAe: The Engine for Automated Metadata Extraction and XML Tagging. Claus Gravenhorst, Content Conversion Specialists. METS Opening Day UK, July 2004.

2 What is docWORKS/METAe?
- Production tool for converting printed documents into fully tagged digital objects
- The METAe edition of docWORKS is the result of the EU-funded project METAe
- Start of project: September 2000
- End of project: August 2003
- Product launch: March 2003, CeBIT exhibition

3 The project group
1. Leopold-Franzens-Universität Innsbruck (Co-ordinator), Austria
2. Universität Linz, Institut für Angewandte Informatik, University of Linz, Austria
3. Mitcom Neue Medien GmbH (ABBYY Europe), Germany
4. CCS Compact Computer Systeme, Germany
5. Universidad de Alicante, Spain
6. Friedrich-Ebert-Stiftung, Germany
7. Cornell University Library, Department of Preservation and Conservation, USA
8. Bibliothèque nationale de France
9. The National Library of Norway, Rana division, Norway
10. Biblioteca Statale A. Baldini, Italy
11. Dipartimento di Sistemi e Informatica, University of Florence, Italy
12. Karl-Franzens-Universität Graz, Universitätsbibliothek, Austria
13. Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy
14. Higher Education Digitisation Service (HEDS), UK

4 Challenges
Digitization and retro-conversion of printed or textual material is becoming increasingly important:
- Keep knowledge and cultural heritage alive
- Preserve the original
- Enable quick and enhanced access through highly structured documents
- Open up new dimensions of research
- Provide standardized output formats

5 Goals
- Automate the conversion process
- Make digitization more effective and safer
- Increase the added value of digitized collections
- Provide a standardized output format so that metadata can be transformed for use in various applications and systems

6 docWORKS – System Overview
Diagram labels:
- Input: document; Scanning, Import
- docWORKS engine: Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis; Correction; Rules DB
- Output: Export; METS/ALTO, METS/TEI, PDF, TIFF, JPEG

7 docWORKS – recording as much metadata as possible!
- Available data. Formats: library records (e.g. MARC), TIFF images. docWORKS engine: import of subsets, linking to record. User mode: automated.
- Descriptive metadata. Formats: METS with DC or MODS, linking to catalogue record. docWORKS engine: creates descriptive records for articles, pictures, … User mode: semi-automated, correction recommended.
- Administrative metadata. Formats: METS incl. NISO (mix). docWORKS engine: records metadata. User mode: fully automated after defining a profile.
- Structural metadata, logical. Formats: METS structural map. docWORKS engine: suggests labels of logical elements and structures. User mode: automated, correction recommended.
- Structural metadata, physical. Formats: ALTO (Analyzed Layout and Text Object). docWORKS engine: provides suggestion for physical structure. User mode: automated, correction in special cases.

8 docWORKS – Matching of Image Files and Page Numbers
Image file | Pagination | Page number
tif | Not counted | Np
tif | Not counted | Np
tif | Counted | I
tif | Counted | II
tif | Counted | III
tif | Counted | IV
tif | Counted | V
tif | Counted | VI
tif | Counted |
tif | Counted, not paginated | (2)
tif | Counted |
tif | Counted | 4
placeholder | Missing page | 5
placeholder | Missing page |
tif | Counted |
tif | Counted | 8
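The matching rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the docWORKS implementation: the `Leaf` class, its flags, and the `label_pages` function are hypothetical names, and the Arabic labels that are blank in the transcript (1, 3, 6, 7) are inferred here by simple counting.

```python
from dataclasses import dataclass
from typing import List, Optional


def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


@dataclass
class Leaf:
    """One entry in the scan sequence (hypothetical model)."""
    image: Optional[str]      # None stands for a placeholder (missing page)
    counted: bool = True      # False: the leaf is outside the page count
    paginated: bool = True    # False: counted, but no number is printed
    roman: bool = False       # True: belongs to the Roman-numbered front part


def label_pages(leaves: List[Leaf]) -> List[str]:
    """Assign a page label to every leaf, following the rules on the slide."""
    labels = []
    roman_count = 0
    arabic_count = 0
    for leaf in leaves:
        if not leaf.counted:
            labels.append("Np")           # not counted, as on the slide
            continue
        if leaf.roman:
            roman_count += 1
            label = to_roman(roman_count)
        else:
            arabic_count += 1             # placeholders still consume a number
            label = str(arabic_count)
        if not leaf.paginated:
            label = f"({label})"          # counted but not printed on the page
        labels.append(label)
    return labels


leaves = (
    [Leaf("scan.tif", counted=False)] * 2
    + [Leaf("scan.tif", roman=True)] * 6
    + [Leaf("scan.tif"), Leaf("scan.tif", paginated=False),
       Leaf("scan.tif"), Leaf("scan.tif"),
       Leaf(None), Leaf(None),
       Leaf("scan.tif"), Leaf("scan.tif")]
)
print(label_pages(leaves))
```

Placeholders are modelled as leaves without an image file, so a missing page still advances the count and the following scans keep their printed numbers.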

9 docWORKS – Structural Analysis: FRONT, MAIN, BACK

10 docWORKS – Structural Analysis: Chapter 1, Chapter 2, Subchapter 1, Subchapter 2

11 docWORKS – Structural Analysis: Preface, Table of contents, Title page, Statement page

12 docWORKS – Document layers
Document layers are differentiated automatically. Selecting particular layers enables targeted searches and the presentation of electronic text without unnecessary items:
- Body text, independent of its presentation
- Margin notes, footnotes
- Pictures and captions
- Advertisements
- Annexes and supplements
- Navigation layer: table of contents, running title, document index, page number, volume index
- Book: separation of "intellectual" and "artificial" content

13 docWORKS – Digitization of books and journals (METAe)

14 docWORKS – Digitization of books and journals (METAe)

15 docWORKS – Digitization of scientific documents

16 docWORKS – Manual editing of descriptive metadata / volume

17 docWORKS – Manual editing of descriptive metadata / illustration

18 docWORKS – Basic Workflow
Diagram labels: Digitization / Scanning; DB, OPAC, MARC; Quality Control Images; Conversion; Quality Control Output; Export; Presentation (XML/METS, PDF)

19 docWORKS – Scalable Client / Server architecture
Diagram labels:
- Clients: Scan / Import; Quality Control
- Servers: Server 1, Server 2, Server 3, ... Server n
- Server tasks: Auto-Import, Image Preprocessing, Layout Analysis, OCR, Structural Analysis, Export

20 docWORKS – METS / ALTO
Diagram labels: METS document, TIFF, ALTO. ALTO stands for Analyzed Layout and Text Object.

21 docWORKS – METS
- Header
- MODS or DC: descriptive metadata
- NISO (mix): technical metadata
- Structural Map: Physical Structure
- Structural Map: Logical Structure
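As a rough illustration of how the parts listed on this slide sit inside one METS document, the sketch below builds an empty skeleton with Python's standard library. It is not docWORKS output: the function name and the ID values `DMD1` and `TECH1` are invented for the example, and the element and attribute names are taken from the public METS schema rather than from the slides.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Return the namespace-qualified name of a METS element."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(descriptive_type: str = "MODS") -> ET.Element:
    """Build an empty METS document with the sections named on the slide."""
    root = ET.Element(q("mets"))

    # Header
    ET.SubElement(root, q("metsHdr"))

    # Descriptive metadata, wrapped as MODS or DC
    dmd = ET.SubElement(root, q("dmdSec"), ID="DMD1")         # hypothetical ID
    ET.SubElement(dmd, q("mdWrap"), MDTYPE=descriptive_type)

    # Technical metadata for the images (NISO image metadata)
    amd = ET.SubElement(root, q("amdSec"))
    tech = ET.SubElement(amd, q("techMD"), ID="TECH1")        # hypothetical ID
    ET.SubElement(tech, q("mdWrap"), MDTYPE="NISOIMG")

    # File inventory, then the two structural maps
    ET.SubElement(root, q("fileSec"))
    ET.SubElement(root, q("structMap"), TYPE="PHYSICAL")
    ET.SubElement(root, q("structMap"), TYPE="LOGICAL")
    return root


skeleton = build_mets_skeleton("DC")
print(ET.tostring(skeleton, encoding="unicode"))
```

The sections are created in the order the METS schema expects (header, descriptive, administrative, files, structure), so the skeleton can be filled in without reordering.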

22 docWORKS – ALTO
Styles:
- Paragraph (alignment, line spacing, etc.)
- Font (name, size, bold, italic, etc.)
Layout:
- Printspace
- TopMargin
- InnerMargin
- OuterMargin
- BottomMargin
Objects in the 5 areas above:
- Text block
- Text lines
- Strings [coordinates, string (as printed), substitution (hyphenation)]
- Spaces
- Composed block
- Picture
- Table
- Formula
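The "substitution (hyphenation)" entry is what lets a search index see whole words even when the printed line breaks them. The following sketch shows one way a consumer might read that information. It assumes the attribute names from the published ALTO schema (`CONTENT`, `SUBS_TYPE` with `HypPart1`/`HypPart2`, `SUBS_CONTENT`, and the `HYP` element); the sample XML, its coordinates, and the function names are invented for the example and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Invented sample: the word "document" is split across two text lines.
SAMPLE_ALTO = """
<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="docu" HPOS="10" VPOS="10" WIDTH="40" HEIGHT="12"
                    SUBS_TYPE="HypPart1" SUBS_CONTENT="document"/>
            <HYP CONTENT="-"/>
          </TextLine>
          <TextLine>
            <String CONTENT="ment" HPOS="10" VPOS="30" WIDTH="42" HEIGHT="12"
                    SUBS_TYPE="HypPart2" SUBS_CONTENT="document"/>
            <SP/>
            <String CONTENT="layers" HPOS="60" VPOS="30" WIDTH="55" HEIGHT="12"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix, so any ALTO version can be read."""
    return tag.rsplit("}", 1)[-1]


def searchable_words(alto_xml: str) -> list:
    """Return the words of a page with hyphenated words rejoined."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        part = element.get("SUBS_TYPE")
        if part == "HypPart2":
            continue  # the full word was already emitted with the first half
        if part == "HypPart1":
            words.append(element.get("SUBS_CONTENT", element.get("CONTENT", "")))
        else:
            words.append(element.get("CONTENT", ""))
    return words


print(searchable_words(SAMPLE_ALTO))
```

The printed form ("docu" and "ment") stays available in `CONTENT` with its coordinates, so the same file can drive both faithful display and full-word search.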

23 docWORKS – METS / physical structure
Diagram labels: METS sections DC, FILEGRP, PHYS, LOGICAL; page attributes ORDER (…), LABEL (II, III, IV, V, VI, …), ORDERLABEL (I, II, III, IV, V, VI, …)

24 docWORKS – METS / physical structure
Diagram labels: par, fptr; METS sections DC, FILEGRP, PHYS, LOGICAL; DIV (page) with FILEID ALTO and FILEID IMAGE
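The page-level linking shown here, one DIV per page whose file pointer groups the ALTO file and the image in parallel, can be sketched as follows. This is an assumption-laden illustration rather than the docWORKS export: the function name, the outer `volume` div, and the file IDs are hypothetical, while `div`, `fptr`, `par`, `area`, `ORDER`, `ORDERLABEL`, and `FILEID` are names from the public METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Return the namespace-qualified name of a METS element."""
    return f"{{{METS_NS}}}{tag}"


def build_physical_map(pages):
    """Build a physical structMap.

    `pages` is a list of (order_label, image_file_id, alto_file_id) tuples
    in scan order; the file IDs are assumed to exist in the file section.
    """
    struct_map = ET.Element(q("structMap"), TYPE="PHYSICAL")
    volume = ET.SubElement(struct_map, q("div"), TYPE="volume")  # assumed wrapper
    for order, (label, image_id, alto_id) in enumerate(pages, start=1):
        page = ET.SubElement(volume, q("div"), TYPE="page",
                             ORDER=str(order), ORDERLABEL=label)
        # One file pointer per page; <par> marks ALTO and image as parallel
        # representations of the same page.
        fptr = ET.SubElement(page, q("fptr"))
        par = ET.SubElement(fptr, q("par"))
        ET.SubElement(par, q("area"), FILEID=alto_id)
        ET.SubElement(par, q("area"), FILEID=image_id)
    return struct_map


physical = build_physical_map([
    ("I", "IMG0001", "ALTO0001"),    # hypothetical file IDs
    ("II", "IMG0002", "ALTO0002"),
])
print(ET.tostring(physical, encoding="unicode"))
```

`ORDER` carries the position in the scan sequence and `ORDERLABEL` the page number as printed, which keeps Roman-numbered front matter and unnumbered leaves addressable.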

25 docWORKS – METS / logical structure
Diagram labels: seq, fptr; METS sections DC, FILEGRP, PHYS, LOGICAL; DIV (volume) with DCMD_PHYS, DCMD_ELEC; DIV (issue) with DCMD_ISSUE#; DIV (contrib.) with DCMD_#CONT#; DIV (chapter) with DCMD_CHAP#; DIV (paragraph); FILEID ALTO; text block BEGIN, FILEID, Coordinates; XSLT
Sample text: "Those who have read the History of Columbus will, doubtless, remember the character and exploits..."

26 docWORKS – ALTO / page layout and text content

27 docWORKS – ALTO / hyphenated word

28 docWORKS – ALTO / hyphenated word

29 docWORKS – Workshop UK 2004
- University Library of Southampton, September 28/29, free of charge
- 1st day: product information; output, metadata standards; workflow, use cases
- 2nd day: "Hands on", working with your own samples; individual consultancy sessions
- Contact: Simon Brackenbury, Hartmut Janczikowski

30 Thank you! Claus Gravenhorst, Content Conversion Specialists