Digital Formats: Factors for Sustainability, Functionality, and Quality. Caroline R. Arms, Carl Fleischhauer. DLF Forum, November 17, 2003.


1 Digital Formats: Factors for Sustainability, Functionality, and Quality
Caroline R. Arms
Carl Fleischhauer
DLF Forum, November 17, 2003
Temporary URL for this slide show and related documents: http://memory.loc.gov/ammem/techdocs/digform/

2 Analysis of Digital Formats
Goal: provide information to help LC staff develop strategies and practices for incoming content
Begin by identifying preferred formats – a continuing process
Later, move to appropriate actions for non-preferred formats
Project includes media-independent formats ("intangible"), e.g., MP3 files
Project excludes media-dependent formats ("tangible"), e.g., audio CDs, DVDs
Synergy with the proposed Global Digital Format Registry
Initial analysis focuses on four “easy” categories: still images, audio, video, and text.
Draft documents reviewed by a small group of LC and outside readers.
Additional format categories to be added as the work proceeds

3 Section I. Factors for Evaluation

4 Two Types of Evaluation Factors
Sustainability factors for all formats
  –influence feasibility and cost of preserving content in the face of future change
Quality and functionality factors that vary by content category
  –reflect considerations that will be expected by future users
Factors compete and the process of selection entails finding a good balance

5 Sustainability: Disclosure
Degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content.
A spectrum of disclosure levels can be observed for digital formats. What is most significant is not approval by a recognized standards body, but the existence of complete documentation.
Preservation of content in a given digital format over the long term is not feasible without an understanding of how the information is represented (encoded) as bits and bytes in digital files.
Examples:
  –TIFF image format: well documented, many products and shareware
  –MrSID image format: partially documented, proprietary elements protected

6 Sustainability: Adoption
Degree to which the format is already used by the primary creators, disseminators, or users of information resources. This includes use as a master format, for delivery to end users, and as a means of interchange between systems.
If a format is widely adopted, it is less likely to become obsolete rapidly, and tools for migration and emulation are more likely to emerge from industry without specific investment by archival institutions.
Examples:
  –PDF text format: very widely used
  –Microsoft eBook Reader: not widely used

7 Sustainability: Transparency
Degree to which the digital representation is open to direct analysis with basic tools, such as human readability using a text-only editor.
Digital formats in which the underlying information is represented simply and directly will be easier to migrate to new formats and more susceptible to digital archaeology; rendering software for new technical environments will also be easier to develop.
Examples:
  –Uncompressed raster image: bitstream easy to interpret
  –Lossy compressed image: bitstream requires complex decoding

8 Sustainability: Self-documentation
Self-documenting digital objects contain basic descriptive, technical, and other administrative metadata.
Self-documenting digital objects are likely to be easier to sustain over the long term and to transfer reliably from one archival system to another, including a successor system.
LC wants to take advantage of the trend towards embedded metadata for business reasons. Some metadata will be extracted to support discovery and collection management.
Examples:
  –JPEG (.jpg) image files contain very scant metadata
  –EXIF JPEG wraps JPEG compression with richer metadata
  –JPEG2000 (.jpx) image files may contain metadata ‘boxes’ – can include an extensive DIG35 record

9 Sustainability: External Dependencies
Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments.
Some interactive digital content is designed for use with specific hardware, such as a joystick. Scientific datasets built from sensor data may require specialized software for analysis and visualization.
Content with external dependencies will be more difficult and costly to sustain than static content. The specialized software required by some scientific datasets may itself be very difficult to sustain.
Examples:
  –Adobe eBooks require a Microsoft Passport account
  –Open eBook format is free of external dependencies

10 Sustainability: Impact of Patents
Degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents.
Although the costs for licenses to decode current standard formats are often low, the development of open source decoders will be inhibited. Tools to transcode content in these formats when they become obsolete may be more costly to develop.
This factor was recently added to our list. We are uncertain whether it will prove significant and welcome the chance to discuss with others.
Examples:
  –Makers of tools for MPEG-1 moving image format do not [appear to] require licenses
  –Makers of tools for MPEG-2 and MPEG-4 moving image formats must pay for licenses and/or pass through royalties

11 Sustainability: Tech Protection Mechanisms
Implementation of mechanisms such as encryption that prevent the preservation of content by a trusted repository.
Preservation of the digital content requires replicating it on new media, migrating and normalizing it in the face of changing technology. Protection mechanisms may also prevent the dissemination of content to authorized users.
Exploitation of technical protection mechanisms is generally optional; their use depends on how a format is used in a particular context.
Examples:
  –Sound recordings from Audible.com will only play with software and/or devices from Audible
  –MP3 files play anywhere

12 Sustainability Evaluation (not weighted)

                          TEI-XML   TIFF-uncomp   WAVE-LPCM   Real Media
Disclosure                   +           +            +           -
Adoption                    +/-          +            +           +
Transparency                 +           +            +           -
Self documentation           +          -/+           -           +
External dependencies        +           +            +           -
Patents                      +           +            +           -
Technology protection when implemented:  Not available w/in fmt   Available
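
The unweighted grid on this slide can be held as a small lookup table. The following is a minimal Python sketch, not the authors' method: it assumes the flattened ratings read left to right as one value per format (including the mixed values "+/-" and "-/+"), and the names FORMATS, RATINGS, and count_ratings are illustrative. The tally is only a convenience for reading a column of the grid; the slide itself stresses that the factors are not weighted.

```python
from collections import Counter

# Column order as shown on the slide.
FORMATS = ["TEI-XML", "TIFF-uncomp", "WAVE-LPCM", "Real Media"]

# One rating per format, transcribed from the slide (assumed reading of the
# flattened rows). The technology-protection row is omitted because it is
# descriptive text rather than a +/- rating.
RATINGS = {
    "Disclosure":            ["+", "+", "+", "-"],
    "Adoption":              ["+/-", "+", "+", "+"],
    "Transparency":          ["+", "+", "+", "-"],
    "Self documentation":    ["+", "-/+", "-", "+"],
    "External dependencies": ["+", "+", "+", "-"],
    "Patents":               ["+", "+", "+", "-"],
}


def count_ratings(format_name):
    """Tally how often each rating symbol appears for one format."""
    col = FORMATS.index(format_name)
    return Counter(row[col] for row in RATINGS.values())


if __name__ == "__main__":
    for name in FORMATS:
        print(name, dict(count_ratings(name)))
```

A plain count like this hides which factor earned the minus, so it is a reading aid rather than a ranking.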

13 Quality and Functionality Factors
Vary according to content type, e.g., text, image, sound
Pertain to current and future usefulness, e.g., for scholarship or repurposing
Identification of factors reflects consideration of what are likely to be significant or essential features of some content items
  –Surround sound for audio
  –Color maintenance for still images
  –Logical structure for text documents
Trying to get at this at a simplified high level

14 Quality & Functionality: Still Images
Normal rendering includes on-screen viewing and printing to paper; likely to also include the ability to zoom in to study detail or produce publication quality output
Clarity (support for high image resolution)
  –Format allows for high pixel count and bit depth
  –Prefer implementations that eschew or minimize compression loss or effect of watermarking
Color maintenance (support for color management)
  –Format allows for color management, e.g., ICC profiles
Functionality beyond normal image rendering (vector graphics, 3-D models, etc.)

15 Quality & Functionality: Sound
Normal rendering includes playback in mono and stereo; typical software provides user control over volume, tone, and balance, as well as navigation (fast forward, go-to-segment, etc.).
Fidelity (support for high audio resolution)
  –For LPCM bitstream, format allows for high sampling frequency and word length
  –Prefer implementations that eschew or minimize compression loss or watermarking effect
Sound field (support for multi-channel audio)
  –Allows for surround sound or other multi-track representation
Functionality beyond normal rendering applies to music notation formats, e.g., MIDI

16 Quality & Functionality: Video
Normal rendering includes playback of a single image stream with sound in mono or stereo through speakers or headphones; typical software provides user control over picture elements (brightness, hue, contrast), sound elements (volume, tone, balance), and navigation (fast forward, go-to-segment, etc.).
Clarity (support for high image resolution)
  –Format allows for large picture size (pixels), progressive scan option
  –Prefer implementations that eschew or minimize compression loss or watermarking effect
Fidelity (support for high audio resolution)
Sound field (support for multichannel audio)
  –Considerations for fidelity and sound field same as for sound formats
Functionality beyond normal rendering
  –Work in progress: “animation” formats (ShockWave), frame-accurate editing

17 Quality & Functionality: Text
A work in progress...
Normal rendering includes linear reading on screen, print to paper, search for words, and index for searching; rendering must reflect the intent of the author in representing individual characters, paragraph structure, lists, headings, and indicators of emphasis.
Support for integrity of document structure and navigation
  –Format allows for navigation and automated analysis that reflects the logical structure of a work; important for directories, encyclopedias, works that use a formal structure
Support for integrity of layout, font, and other design features
  –Allows for reliable presentation in terms of look and feel, when exact choices of features like font and column layout are essential to meaning
Support for rendering for mathematics, formulae, diagrams, etc.
  –Allows for accurate rendering of non-textual elements that are crucial to informational content (markup languages sometimes fall short in this area)
Functionality beyond normal rendering
  –More work in progress, e.g., talking books (ANSI/NISO Z39.86 for the blind)

18 Evaluate All Factors, Example of Sound

                        WAVE-LPCM   WAVE-BWF-LPCM   MP3   AAC (MPEG-4)   Real Media
Disclosure                  +             +          +         +             -
Adoption                    +             +          +         +             +
Transparency                +             +          -         -             -
Self documentation          -             +          +         +             +
External dependencies       +             +          +         +             -
Patents                     +             +          +         -             -
Tech protect when implemented:  Not available w/in format   Not available w/in format (?)   Available (?)
Fidelity                    +             +          -         -             -
Sound Field                 -             -          -         +             ?

19 Section II. Relationships

20 What is a format?
Working definition from format registry proposal: A format is a fixed, byte-serializing encoding of an information model.
Working definition we have used: Formats are packages of information that can be stored as data files or sent via network as data streams (aka bitstreams, byte streams)

21 Formats: Types & Relationships
file formats
  –at the level indicated by file extensions, e.g., .mp3
  –as indicated by Internet MediaType (aka MIME type), e.g., text/html
  –versions develop through time
  –refinements are tailored to specific purposes, e.g., TIFF-EP for electronic photography
class of related formats whose familial characteristics are important
  –e.g., the WAVE audio format is an instance of the RIFF format class
"wrappers" distinguished in terms of their underlying bitstreams
  –e.g., WAVE files may contain linear pulse code modulated [LPCM] audio (like a CD) or highly compressed audio as used for digital telephony
file formats may have optional features significant to sustainability
  –e.g., for encryption
bundling formats bind together files comprising a single digital work
  –e.g., text and supporting illustrations, or a movie with sound tracks in different languages
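
The two coarsest levels of identification named on this slide, file extension and Internet media type, can be read off a file name with Python's standard library. This is a minimal sketch; the helper name identify is illustrative, and the media type returned depends on the mapping table available to the mimetypes module on the machine running it.

```python
import mimetypes
from pathlib import PurePath


def identify(filename):
    """Return (extension, media_type) for a file name.

    media_type is None when the extension is not in the mimetypes table.
    Only the name is inspected; the file's contents are never opened.
    """
    extension = PurePath(filename).suffix.lower()
    media_type, _encoding = mimetypes.guess_type(filename)
    return extension, media_type


if __name__ == "__main__":
    for name in ["page.html", "song.mp3", "interview.wav"]:
        print(name, identify(name))
```

Identification at this level says nothing about version, refinement, or the bitstream inside a wrapper, which is why the slide goes on to list finer-grained relationships.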

22 Simple Example: WAVE
Wrapper for different bitstreams
Simple, but extensible method for embedding metadata
  subtype of:   RIFF
  may contain:  Linear PCM, μ-law, A-law (bitstreams)
  has subtype:  Broadcast WAVE (Linear PCM + EBU metadata)
  has subtype:  AES46-2002 (BWF + cart metadata)
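
To make the wrapper idea concrete, the sketch below walks the chunks of a RIFF/WAVE byte string, reads the format tag in the "fmt " chunk to name the underlying bitstream, and notes whether a "bext" chunk (the Broadcast WAVE metadata chunk) is present. It is an illustration under stated assumptions, not a validator: describe_wave and make_wave are hypothetical helper names, only three format tags are mapped, and the test file is built in memory with a dummy payload.

```python
import struct

# Format tags as registered for WAVE: 1 = linear PCM, 6 = A-law, 7 = mu-law.
FORMAT_TAGS = {0x0001: "Linear PCM", 0x0006: "A-law", 0x0007: "mu-law"}


def iter_chunks(data):
    """Yield (chunk_id, body) for each chunk inside a RIFF/WAVE byte string."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield chunk_id, data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length


def describe_wave(data):
    """Report the bitstream encoding and whether a 'bext' chunk is present."""
    encoding = None
    has_bext = False
    for chunk_id, body in iter_chunks(data):
        if chunk_id == b"fmt ":
            (tag,) = struct.unpack_from("<H", body, 0)
            encoding = FORMAT_TAGS.get(tag, f"other (0x{tag:04x})")
        elif chunk_id == b"bext":
            has_bext = True  # detected by chunk ID only; contents not checked
    return {"encoding": encoding, "broadcast_wave": has_bext}


def make_wave(format_tag, extra_chunks=b""):
    """Build a tiny in-memory WAVE file for demonstration (hypothetical helper)."""
    # tag, channels, sample rate, byte rate, block align, bits per sample
    fmt_body = struct.pack("<HHIIHH", format_tag, 1, 8000, 8000, 1, 8)
    fmt_chunk = b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
    data_chunk = b"data" + struct.pack("<I", 2) + b"\x80\x80"
    payload = b"WAVE" + fmt_chunk + extra_chunks + data_chunk
    return b"RIFF" + struct.pack("<I", len(payload)) + payload


if __name__ == "__main__":
    print(describe_wave(make_wave(0x0001)))
    print(describe_wave(make_wave(0x0007)))
```

The same ".wav" extension covers every case here, so the distinction the slide draws between wrapper and bitstream is only visible once the file is opened.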

23 More Complex Example -- PDF
Much more than text
A file format, a wrapper, a bundling format, all in one
Complexity of relationships
  has version:  1.3 (July 2000, 696 pages)
  has version:  1.4 (December 2001, 978 pages)
  has version:  1.5 (August 2003, 1172 pages)
  may contain:  TIFF, JPEG, JPEG2000, etc., etc., etc. (all at once)
  has subtype:  Tagged PDF (can represent logical document structure)
  has subtype:  Accessible PDF (tagged + further constraints)
  has subtype:  PDF/X (ISO standard, for pre-press use, e.g., submission of graphics to magazine publishers)
  has subtype:  PDF/A (under development as ISO standard, for archiving)
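
The relationship labels used on the WAVE and PDF slides ("subtype of", "may contain", "has version", "has subtype") lend themselves to a list of subject-predicate-object triples. The sketch below is one possible encoding, assumed for illustration only; the slides do not prescribe a data model, and RELATIONSHIPS and related are hypothetical names.

```python
# Triples transcribed from the WAVE and PDF slides.
RELATIONSHIPS = [
    ("WAVE", "subtype of", "RIFF"),
    ("WAVE", "may contain", "Linear PCM"),
    ("WAVE", "may contain", "μ-law"),
    ("WAVE", "may contain", "A-law"),
    ("WAVE", "has subtype", "Broadcast WAVE"),
    ("WAVE", "has subtype", "AES46-2002"),
    ("PDF", "has version", "1.3"),
    ("PDF", "has version", "1.4"),
    ("PDF", "has version", "1.5"),
    ("PDF", "may contain", "TIFF"),
    ("PDF", "may contain", "JPEG"),
    ("PDF", "may contain", "JPEG2000"),
    ("PDF", "has subtype", "Tagged PDF"),
    ("PDF", "has subtype", "Accessible PDF"),
    ("PDF", "has subtype", "PDF/X"),
    ("PDF", "has subtype", "PDF/A"),
]


def related(subject, predicate):
    """Return every object linked to subject by predicate, in listed order."""
    return [o for s, p, o in RELATIONSHIPS if s == subject and p == predicate]


if __name__ == "__main__":
    print(related("PDF", "has subtype"))
    print(related("WAVE", "may contain"))
```

A flat list like this treats all "has subtype" links alike, which leaves open the question of whether different kinds of subtype relationship need to be distinguished.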

24 Sidebar on PDF/A for Archiving
To be open standard, not proprietary to Adobe
Constrained for sustainability
  –No encryption (transparency, technical protection)
  –No audio or video embedded
  –No Javascript or executable file launches
  –All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering (disclosure, self-documentation)
  –Colorspaces must be specified in a device independent manner (external dependencies)
  –Embedded metadata must be in XMP -- XML-based (transparency, self-documentation)
LC & other DLF participants involved in development

25 Format Description Documents
Structured documents
  –Identification & description
  –Local use/preference/expertise/tools
  –Sustainability factors
  –Quality & functionality factors
  –Useful references
Intended primarily for people making plans/decisions now
  –Choosing formats
  –Making provision for systems
  –Assembling documentation & tools
Plan to develop XML Schema
Clearly much in common with other efforts
  –Diffuse, PRONOM, Wotsit
  –Global Digital Format Registry

26 Synergy with Registry
Global Digital Format Registry plans to build a resource that can support more automated services
Common challenges
Granularity of identification
  –How finely to identify
Are different types of subtype relationships needed?
  –Simple restriction on bitstream encoding
  –Formal sections or explicit profiles in standards
  –Mandatory metadata
  –Mandatory structural features
  –Options outlawed
Hard questions. Clear value from collaboration

27 Complexity Increasing
New standards have portmanteau nature
  –Many parts, many options, as already noted in the case of PDF
JPEG2000
  –Part 1: .jp2 (core lossless and lossy compression schemes for continuous tone, replacement for JPEG)
  –Part 2: .jpx (extensions, including more capabilities for embedding metadata)
  –Part 6: .jpm (multi-layer images, can embed other bitstream encodings, including bitonal)
MPEG-4
  –Many ‘profiles’ for different contexts
Which parts of these standards will be widely adopted?

28 Section III. Content States in a Production and Distribution Cycle

29 Content States in a Production Process
A bit of a simplification but...
Content in a publishing or distribution stream can be seen as existing in three states:
  –Initial: “while the author is creating it”
  –Middle: “while the publisher manages and archives it”
  –End: “what is presented or sold to an end-user”
Different formats are often associated with these three states, appropriate to the task at hand
Debt owed to Mellon-funded journals projects (and others) for these ideas

30 Initial State, Early Creative Processes
Example for sound recording
  –Multiple separate tracks in a recording studio
  –Complex multipart entity, e.g., twelve tracks for instruments and voices
  –“Edit decision list” manages elements
  –Very high fidelity
  –Specialized production formats, e.g., proprietary format produced by the SADiE digital audio workstation
Other examples
  –Text: writer using word processing software, e.g., MS-Word
  –Video: multi-segment work in progress has elements in AAF wrapper
Initial state formats are often proprietary and may be limited to creator's favorite software package

31 Middle State, in Hands of Publisher
Example for sound recording
  –Mixed master, often in stereo or surround sound; possible multi-track with “sub-mixes” completed
  –Not as complex as the studio session recordings, ready or close-to-ready for distribution as digital-file or compact disk
  –Edit decision list may still be required
  –Very high fidelity
  –Specialized industry formats, e.g., AES-31 recorded sound format
  –No technological protections embedded in bitstream
Other examples
  –Text: author’s journal article marked up and in document management system
  –Video: program archived in MXF format, transmitted by TV network at designated time
Middle state formats, used by industry to send or exchange data, may emerge as preferred formats for archiving within an industry.

32 Final State, Distributed to End Users
Example for sound recording
  –Simple entity, may be high, moderate, or low fidelity
  –Common, current media-independent formats, e.g., WAVE-LPCM file, Windows Media Audio (.wma), or MP3
  –Security elements may be embedded in the bitstream
Other examples
  –Journal article “published” as PDF file
  –Video program disseminated as MPEG-4 compressed file
Final state formats are for items in the marketplace. “This year, we released the song in RealAudio, next year we’ll probably reissue it as encrypted AAC.”

33 Prefer the Middle-state Formats?
For Library of Congress collections the best formats for the long term may well be the middle state formats. These are likely to have higher quality than final-state formats, may be easier to manage for preservation, and may be the focus of developing archiving approaches by industry.
Of course, we do sometimes collect initial-state works, and will often receive final-state.

34 Middle-state Preference Challenges
Seeking middle-state digital formats for LC collections would differ from the most widespread current practice.
The selection of best editions authorized by copyright law and LC practices today is generally limited to works in their final state.

35 Section IV. Curator’s Judgment

36 Putting Format Preferences to Work
Illustration from the realm of sound recordings
How a curator might
  –analyze significant or essential characteristics for categories of content
  –combine that analysis with technical information about formats, and
  –develop format-preference statements for content subcategories.
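
The three curator steps on this slide can be sketched as a lookup: state which quality factors are essential for a content subcategory, then list the formats rated "+" on all of them. In the Python sketch below, the Fidelity and Sound Field ratings are taken from the sound evaluation table on slide 18; the subcategory names and their essential factors are invented placeholders, since the transcript does not include the contents of the illustrative tables, and preferred_formats is a hypothetical helper.

```python
# Quality ratings for sound formats, transcribed from slide 18.
QUALITY = {
    "WAVE-LPCM":     {"Fidelity": "+", "Sound Field": "-"},
    "WAVE-BWF-LPCM": {"Fidelity": "+", "Sound Field": "-"},
    "MP3":           {"Fidelity": "-", "Sound Field": "-"},
    "AAC (MPEG-4)":  {"Fidelity": "-", "Sound Field": "+"},
    "Real Media":    {"Fidelity": "-", "Sound Field": "?"},
}

# HYPOTHETICAL subcategories and essential factors, for illustration only.
ESSENTIAL = {
    "example: high-fidelity recording": ["Fidelity"],
    "example: surround-sound work": ["Sound Field"],
    "example: high-fidelity surround work": ["Fidelity", "Sound Field"],
}


def preferred_formats(subcategory):
    """List formats rated '+' on every factor essential to the subcategory."""
    needed = ESSENTIAL[subcategory]
    return [
        fmt for fmt, ratings in QUALITY.items()
        if all(ratings[factor] == "+" for factor in needed)
    ]


if __name__ == "__main__":
    for subcategory in ESSENTIAL:
        print(subcategory, "->", preferred_formats(subcategory))
```

With these ratings no format satisfies both factors at once, which illustrates the earlier point that factors compete and selection means finding a balance; a real preference statement would also weigh the sustainability factors.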

37 Sound content subcategories and their significant characteristics
Illustrative example

38 Sound content subcategories and format preferences
Illustrative example

39 Project Next Steps
Continuing inside and outside expert review – we want YOUR comments!
Plan to work with registry as it takes shape
Implementation at LC:
  –Develop web site for LC staff
  –Launch process to identify which formats to analyze next

