© 2010 Microsoft Corporation. All rights reserved. Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents Natasa Milic-Frayling.


Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents
Natasa Milic-Frayling, Microsoft Research Cambridge, UK

What is the problem?
- Digital is a victim of its own success, i.e., of the advances in digital technologies that have made digital media so broadly used and adopted.
- Document formats, software and hardware are becoming obsolete faster than we can ensure the forward compatibility of the content.

What are the technical solutions?
We have two main strategies:
- Emulation and simulation: create emulators of hardware and simulators of software systems to enable old programmes to run and old data to be used.
- Content migration: migrate to standards that are likely to be supported in the future.

Preservation and Long-term Access through NETworked Services
Ensure long-term access to Europe’s cultural and scientific heritage
- Improve decision-making about long-term preservation
- Ensure long-term access to valued digital content
- Control the costs through automation and scalable infrastructure
- Ensure wide adoption across the user community
- Establish a marketplace for preservation services and tools
Build practical solutions
- Integrate existing expertise, designs and tools
- Share and build

PLANETS Partners
- The British Library
- National Library, Netherlands
- Austrian National Library
- State and University Library, Denmark
- Royal Library, Denmark
- National Archives, UK
- Swiss Federal Archives
- National Archives, Netherlands
- HATII at University of Glasgow
- University of Freiburg
- Technical University of Vienna
- University at Cologne
- Tessella Plc
- IBM Netherlands
- Microsoft Research, Cambridge
- ARC Seibersdorf research

PLANETS Sub Projects

CONVERSION TOOLS preserving office documents

Microsoft & PLANETS: Preserving Office Documents
Microsoft Research role within PLANETS:
- Conversion of binary Microsoft Office documents into the Office Open XML file format (OpenXML)
We extended the effort to include other formats:
- More legacy formats, e.g. WordPerfect
- Other open standards, e.g. Open Document Format (ODF)
[Diagram labels: Binary MS Office, OpenXML, WordPerfect, DOS Word, ODF, UOF]

Document Conversion Tools: Our Approach
Three-step approach, resulting in a modular and extendible infrastructure:
1. Identify existing conversion tools and libraries
2. Wrap these tools and libraries into reusable components
3. Integrate these components into PLANETS and other systems
If possible, do not use the office applications (e.g., Microsoft Office or OpenOffice.org):
- They are designed as interactive applications
- Message boxes might pop up (“Do you want …”)
- The licensing situation is unclear when they run on a server
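The "wrap" step can be sketched as follows. This is a minimal illustration, not the PLANETS implementation: the class name `CommandLineTransformer`, the `{src}`/`{dst}` placeholders and the stand-in converter `DEMO_TOOL` (a Python one-liner that upper-cases a text file) are all assumptions made for the example. The point it shows is that a stand-alone executable can be driven non-interactively, with a timeout and an explicit failure path, which is what an interactive office application cannot guarantee.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class CommandLineTransformer:
    """Hypothetical wrapper: one convert() call around a stand-alone executable."""

    def __init__(self, source_format, target_format, command):
        self.source_format = source_format
        self.target_format = target_format
        # argv template; the items "{src}" and "{dst}" are replaced per call.
        self.command = command

    def convert(self, src, dst, timeout=60):
        argv = [
            str(src) if part == "{src}" else str(dst) if part == "{dst}" else part
            for part in self.command
        ]
        # No shell and no GUI: the tool exits cleanly or the call fails loudly.
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        if result.returncode != 0:
            raise RuntimeError(f"conversion failed: {result.stderr.strip()}")
        if not Path(dst).exists():
            raise RuntimeError("converter reported success but wrote no output")
        return Path(dst)


# Stand-in for a real converter (assumption): upper-cases a text file.
DEMO_TOOL = [
    sys.executable, "-c",
    "import sys, pathlib; "
    "pathlib.Path(sys.argv[2]).write_text(pathlib.Path(sys.argv[1]).read_text().upper())",
    "{src}", "{dst}",
]


def demo():
    with tempfile.TemporaryDirectory() as folder:
        src = Path(folder) / "in.txt"
        src.write_text("hello")
        box = CommandLineTransformer("text", "upper-case text", DEMO_TOOL)
        return box.convert(src, Path(folder) / "out.txt").read_text()
```

Swapping `DEMO_TOOL` for the argument list of a real command-line converter would be the only change needed; the actual executables and their flags are not given in the slides.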

Reusable Components
[Diagram: Transformer Box (wrapper) “Binary → OpenXML”; TB Interface; front ends: Watch Folder Tool, Web Service, ToooXML (GUI)]
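As an illustration of one of the front ends named on this slide, a watch folder can be reduced to a single scan pass over an inbox directory. The sketch below is an assumption about how such a tool might work, not a description of the actual Watch Folder Tool: the function name `scan_once`, the `.doc`/`.docx` suffixes and the `convert` callback (bytes in, bytes out) are hypothetical.

```python
import tempfile
from pathlib import Path


def scan_once(inbox, outbox, convert, suffix=".doc", target_suffix=".docx"):
    """Convert every not-yet-processed file in inbox; return the names written."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(inbox.glob(f"*{suffix}")):
        dst = outbox / (src.stem + target_suffix)
        if dst.exists():
            continue  # already converted on an earlier pass
        dst.write_bytes(convert(src.read_bytes()))
        written.append(dst.name)
    return written


def demo():
    with tempfile.TemporaryDirectory() as folder:
        inbox = Path(folder) / "in"
        outbox = Path(folder) / "out"
        inbox.mkdir()
        (inbox / "a.doc").write_bytes(b"legacy")
        fake_convert = lambda data: b"converted:" + data  # placeholder converter
        first = scan_once(inbox, outbox, fake_convert)
        second = scan_once(inbox, outbox, fake_convert)  # nothing new to do
        return first, second
```

A long-running tool would call `scan_once` in a loop with a sleep between passes, or use file-system notifications instead of polling.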

Extendible Architecture
[Diagram: Transformer Boxes (wrappers) “ODF → OpenXML”, “WP → OpenXML” and “Binary → OpenXML”; TB Interface; front ends: Watch Folder Tool, Web Service, ToooXML (GUI)]
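One way to read this architecture is that every front end talks to a single interface, and new conversions are added by registering another box behind it. The sketch below assumes that reading. `TransformerBox`, `TaggingBox` and `BoxRegistry` are hypothetical names, and `TaggingBox` only labels the payload rather than converting anything.

```python
from abc import ABC, abstractmethod


class TransformerBox(ABC):
    """Common interface; a hypothetical stand-in for the slide's TB Interface."""

    def __init__(self, source_format, target_format):
        self.source_format = source_format
        self.target_format = target_format

    @abstractmethod
    def transform(self, data: bytes) -> bytes:
        """Convert a document from source_format to target_format."""


class TaggingBox(TransformerBox):
    """Placeholder converter: it only prefixes a label, it converts nothing."""

    def transform(self, data):
        return f"[{self.source_format}->{self.target_format}]".encode() + data


class BoxRegistry:
    """Front ends call convert(); they never see individual boxes."""

    def __init__(self):
        self._boxes = {}

    def register(self, box):
        self._boxes[(box.source_format, box.target_format)] = box

    def supported(self):
        return sorted(self._boxes)

    def convert(self, data, source_format, target_format):
        box = self._boxes.get((source_format, target_format))
        if box is None:
            raise LookupError(f"no transformer box for {source_format} -> {target_format}")
        return box.transform(data)


# The three conversions shown on the slide, each backed by a placeholder box.
registry = BoxRegistry()
for source in ("Binary", "WP", "ODF"):
    registry.register(TaggingBox(source, "OpenXML"))
```

Adding a fourth conversion is one more `register` call; the watch folder, web service and GUI code would stay unchanged.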

More Technical Details (1)
Currently two types of wrappers, for:
- Command-line tools (stand-alone executables)
  - OpenXML/ODF Translator (OpenXML → ODF)
  - OpenXML Document Viewer (OpenXML → HTML)
- Microsoft conversion libraries (CNV libraries)
  - WordPerfect → RTF
  - RTF → OpenXML
  - …
We allow wrappers to be chained:
- WordPerfect → RTF → OpenXML → ODF
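Chaining can be viewed as path finding over the available single-step conversions. The sketch below is one possible approach, assuming a shortest chain is wanted; the slides do not say how chains are actually selected. `STEPS` lists only the four conversions named on this slide, and `find_chain` is a hypothetical helper using breadth-first search.

```python
from collections import deque

# Single-step conversions named on the slide, as (source, target) pairs.
STEPS = [
    ("WordPerfect", "RTF"),
    ("RTF", "OpenXML"),
    ("OpenXML", "ODF"),
    ("OpenXML", "HTML"),
]


def find_chain(steps, source, target):
    """Return the shortest list of formats from source to target, or None."""
    graph = {}
    for step_source, step_target in steps:
        graph.setdefault(step_source, []).append(step_target)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_format in graph.get(path[-1], []):
            if next_format not in seen:
                seen.add(next_format)
                queue.append(path + [next_format])
    return None  # no sequence of wrappers reaches the target
```

For example, `find_chain(STEPS, "WordPerfect", "ODF")` yields the chain shown on the slide, and a request for which no wrappers exist, such as ODF to WordPerfect, returns `None`.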

More Technical Details (2)
Microsoft conversion libraries (CNV libraries):
- Originally designed to import/export “foreign” document formats into/from Microsoft Word
- Based on the Microsoft Conversion API
  - Foreign2RTF
  - RTF2Foreign
- The Transformer Box CNV Wrapper follows this API
[Diagram labels: Microsoft Word, Transformer Box CNV Wrapper, CNV Library, Foreign2RTF, RTF2Foreign]

Supported Formats
Source formats:
- WordPerfect 5
- WordPerfect 6
- DOS Word
- Word 2, 6, 95
- Word
- RTF
- ODF
- OpenXML
Target formats:
- OpenXML
- ODF
- UOF
- HTML
- XCDL (format defined in PLANETS/PC)

CONVERSION SERVICES preserving office documents

Conversion applications and service

SIMILARITY ASSESSMENT understanding the quality criteria

How do we explore and compare digital artefacts?
- Perceptive aspects of the digital object
  - In the past: the printed version of the document and the screen display
- Interactive aspects of the digital object
  - Dynamic content includes both individual artefacts and the ‘stream characteristics’
- Non-perceptive aspects of the digital object
  - Document object model, cached data, action-generated metadata, hidden formulas, etc.

EXAMPLE: Perceptive features for Word Documents
- Two objects in different formats are mapped onto the normalized form
  - E.g., a WP file converted into .docx
  - For both we create an XPS representation of the document
- Feature extraction and comparison
  - For each feature, develop a ‘digital object probe’ that extracts the feature and measures a property of it
  - E.g., pass the XPS through an OCR package and extract various layout features
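The probe idea can be sketched as a set of small functions, each measuring one property of a rendered document, applied to both the original and the converted version. Everything below is an assumption made for illustration: the `Rendering` type (pages, each a list of text lines) stands in for whatever an OCR pass over the XPS would produce, and the three probes and the relative-difference score are examples, not the features or metrics used by the authors.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed OCR output: a document is a list of pages, a page a list of text lines.
Rendering = List[List[str]]


@dataclass
class Probe:
    """A hypothetical 'digital object probe': one named measurement."""
    name: str
    extract: Callable[[Rendering], float]


def page_count(rendering):
    return len(rendering)


def word_count(rendering):
    return sum(len(line.split()) for page in rendering for line in page)


def longest_line(rendering):
    return max((len(line) for page in rendering for line in page), default=0)


PROBES = [
    Probe("pages", page_count),
    Probe("words", word_count),
    Probe("longest_line_chars", longest_line),
]


def compare(original, converted, probes=PROBES):
    """Run every probe on both renderings and report the relative difference."""
    report = {}
    for probe in probes:
        a, b = probe.extract(original), probe.extract(converted)
        scale = max(abs(a), abs(b))
        rel_diff = 0.0 if scale == 0 else abs(a - b) / scale
        report[probe.name] = {"original": a, "converted": b, "rel_diff": rel_diff}
    return report


# Toy data: same words, but the converted file reflows onto a second page.
original = [["Quality assurance", "for documents"]]
converted = [["Quality assurance for"], ["documents"]]
report = compare(original, converted)
```

In this toy case the word count is unchanged while the page count differs, which is the kind of layout drift a perceptive-feature comparison is meant to surface.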

Conversion applications and service

What is ahead of us?
Research
- What is the relationship between the human criteria and automated measurements?
- What usage scenarios do we aim for?
Technology
- What ‘instruments’ do we need to extract and measure properties of the digital content?
- How do we automate the process of inspection and quality assurance?
Legal
- How do we run legacy software as services? We need updated licensing agreements.
- How do we provide services that combine open source and non-open source software?

THANK YOU
Contact: Natasa Milic-Frayling, Microsoft Research Cambridge