1
Digital Formats: Factors for Sustainability, Functionality, and Quality
Caroline R. Arms and Carl Fleischhauer
IS&T Archiving Conference, Washington, April 29, 2005

The Library of Congress is drafting a decision-support framework pertaining to the preservation of digital content. The intent is to serve staff who evaluate born-digital content for selection and make provisions to sustain that content. The resource is intended to . . .
2
Analysis of Digital Formats
Provide inventory of information about formats
Identify and describe the formats promising for long-term sustainability
Synergy with Global Digital Format Registry and JHOVE
Web site:

. . . identify and describe the formats that are promising for long-term sustainability. We have been working closely with those planning for a Global Digital Format Registry, a collaborative activity based at Harvard, that aims for an active registry that will support the execution of operations on files, to identify, validate, and even transform them. We want our work to complement that effort, and the related JHOVE toolset.
3
Format descriptions as of July 2004 - UPDATE
AAC_MP2 (Advanced Audio Coding, MPEG-2)
AAC_MP4 (Advanced Audio Coding, MPEG-4)
AAC_ADIF (Advanced Audio Coding, MPEG-2, Audio Data Interchange Format)
AAC_M4A (Advanced Audio Coding, MPEG-2, m4a File Format)
AIFF (Audio Interchange File Format)
AIFF_LPCM (AIFF File Format with LPCM Audio)
ASF (Advanced Systems Format)
AudCom (Audible.Com File Format)
AudCom_MP3 (Audible.Com MP3)
AVI (Audio Video Interleaved)
AVI_MJPEG (AVI, MJPEG Codec)
AVI_Indeo (AVI, Indeo Codec)
AVI_Cinepak (AVI, Cinepak Codec)
AVI_DivX (AVI, DivX Codec)
Cinepak (video codec)
DLS, Downloadable Sounds Format
DivX_5, Version 5 (video codec)
ID3 (ID3 Metadata for MP3)
ID3v1 (ID3, version 1)
ID3v2 (ID3, version 2)
IFF (Electronic Arts Interchange File Format 1985)
Indeo_3, Version 3 (video codec)
Indeo_5, Version 5 (video codec)
LPCM
MIDI_SD, MIDI Sequence Data
MJPEG (Motion JPEG)
MODS, Module Music Format (Mods)
MP3_ENC (MP3 Encoding)
MP3_FF (MP3 File Format)
MPEG-1
MPEG-2
MPEG-2_SP, Simple Profile
MPEG-2_MP, Main Profile
MPEG-2_422, 4:2:2 Profile
MPEG-4
MPEG-4_V, Visual Coding (Part 2)
MPEG-4_V_SP, Visual Coding, Simple Profile
MPEG-4_V_SSP, Visual Coding, Simple Scalable Profile
MPEG-4_V_ASP, Visual Coding, Advanced Simple Profile
MPEG-4_V_CP, Visual Coding, Core Profile
MPEG-4_V_MP, Visual Coding, Main Profile
MPEG-4_V_SStP, Visual Coding, Simple Studio Profile
MPEG-4_AVC, Advanced Video Coding (Part 10)
MPEG-4_AVC_BP, Advanced Video Coding, Baseline Profile
MPEG-4_AVC_MP, Advanced Video Coding, Main Profile
MPEG-4_AVC_EP, Advanced Video Coding, Extended Profile
NITF, News Industry Text Format
Ogg, Ogg File Format
Ogg_Vorbis, Ogg Vorbis Audio Format
PCM
PDF, Portable Document Format
PDF/A, PDF for Preservation
QuickTime
QTA_MP3, QuickTime Audio, MP3 Codec
QTA_AAC, QuickTime Audio, AAC Codec
QTV_Apple, QuickTime Video, Apple Codec
QTV_Cinepak, QuickTime Video, Cinepak Codec
QTV_Sorenson, QuickTime Video, Sorenson Codec
QTV_MJPEG, QuickTime Video, Motion JPEG Codec
QTV_MPEG, QuickTime Video, MPEG-1 Codec
RealAudio_10, Version 10
RealAudio_RA, RealAudio Codec
RealAudio_AAC, AAC Codec
RealAudio_LL, Lossless Codec
RealAudio_MC, Multichannel Codec
RealVideo_10, Version 10
RMID, RIFF-based MIDI File Format
Sorenson_3, Version 3 (video codec)
SMF, Standard MIDI File Format
SVG, Version 1.1
TIFF, Revision 6.0 and earlier
Vorbis, Vorbis Audio Codec
WAVE
WAVE_LPCM
WAVE_LPCM_BWF
WMA, Windows Media Audio
WMA_WMA9, Windows Media Audio File with WMA9 Codec
WMA_WMA9_PRO, Windows Media Audio File with WMA9 Professional Codec
WMA_WMA9_LL, Windows Media Audio File with WMA9 Lossless Codec
WMA9, Windows Media 9 Audio Codec
WMA9_PRO, Windows Media 9 Professional Audio Codec
WMA9_LL, Windows Media 9 Lossless Audio Codec
WMV, Windows Media Video
WMV_WMV9, Windows Media Video with WMV9 Codec
WMV_WMV9_PRO, Windows Media Video with WMV9 Professional Codec
WMV9, Windows Media 9 Video Codec
WMV9_PRO, Windows Media 9 Professional Video Codec
XMF, eXtensible Music Format
XML

The list of formats and “subformats” described at our Web site is already long, well over . . . We believe that in order for custodians to preserve content in digital form, they must be able to distinguish between format refinements and variants. The relationships and dependencies between formats must be understood and expressed.
4
Sample format description, top
Our online description pages start off with a general description and indications of relationships to other formats. In this example NITF is based on XML.
5
Sample format description, middle
Then our pages go on to highlight the factors that we think appropriate for comparing and contrasting the formats . . .
6
Sample format description, bottom
And, at the bottom, list relevant specifications or other information sources.
7
Formats: Types & Relationships
file formats
  at the level indicated by file extensions, e.g., .mp3
  as indicated by Internet MediaType (aka MIME type), e.g., text/html
versions develop through time
refinements are tailored to specific purposes, e.g., TIFF/EP for electronic photography
class of related formats whose familial characteristics are important, e.g., the WAVE audio format is an instance of the RIFF format class
"wrappers" distinguished in terms of their underlying bitstreams, e.g., WAVE files may contain linear pulse code modulated [LPCM] audio (like a CD) or highly compressed audio as used for digital telephony
bundling formats bind together files comprising a single digital work, e.g., text and supporting illustrations, or a movie with sound tracks in different languages

We cover not only formats at the level indicated by a file extension (e.g., .tif), but versions developed over time, refinements tailored to a particular use, and variants distinguished by different bitstream encodings, even if in a common wrapper. We also include format descriptions for bitstream encodings that may be incorporated into or used as the basis for various wrapper or bundling formats, for example, XML and LPCM (Linear Pulse Code Modulation, the audio equivalent to an uncompressed bitmap).
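To make the wrapper-versus-bitstream distinction concrete, here is a minimal Python sketch that inspects a RIFF file, reports its form type (for example WAVE), and reads the format tag that identifies the audio encoding inside a WAVE wrapper. The function names and the in-memory test header are hypothetical and are not part of the Library's tooling or of JHOVE. The tag values shown (1 for linear PCM, 6 and 7 for the A-law and mu-law encodings used in telephony) are the commonly documented WAVE format tags.

```python
import struct

# Common WAVE format tags (wFormatTag); 1 is linear PCM.
FORMAT_TAGS = {1: "LPCM", 6: "A-law", 7: "mu-law"}


def describe_riff(data: bytes) -> dict:
    """Return the wrapper, its form type, and (for WAVE) the audio encoding."""
    if len(data) < 12 or data[0:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    form_type = data[8:12].decode("ascii")
    info = {"wrapper": "RIFF", "form": form_type, "encoding": None}
    pos = 12
    # Walk the chunk list: 4-byte id, 4-byte little-endian size, payload.
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        if form_type == "WAVE" and chunk_id == b"fmt ":
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            info["encoding"] = FORMAT_TAGS.get(tag, f"tag {tag}")
            break
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    return info


def make_wave_header(format_tag: int) -> bytes:
    """Build a header-only WAVE file in memory (hypothetical test data)."""
    fmt = struct.pack("<HHIIHH", format_tag, 1, 8000, 8000, 1, 8)
    body = b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
    return b"RIFF" + struct.pack("<I", len(body)) + body
```

Two files with the same .wav extension can therefore report different encodings, which is why the descriptions treat WAVE and WAVE_LPCM as separate entries.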
8
Simple Example: TIFF Wrapper for different bitstreams
Simple, but extensible method for embedding metadata
may contain
  Uncompressed bitmap, LZW compressed bitmap, bitonal Group IV (bitstreams)
has subtype
  TIFF/EP (for electronic photography)
  TIFF/IT (for prepress applications)
  DNG (Adobe’s proposed format for digital negatives)

TIFF provides a relatively simple example of relationships. TIFF images may contain bitmaps represented by a number of different bitstream encodings: uncompressed, compressed using LZW, or for a bitonal image, compressed using ITU’s G4. Future migration of a bitonal G4 TIFF will be to a different target format than that selected for a 24-bit uncompressed TIFF. TIFF also has subtypes including ISO’s TIFF/EP (for electronic photography) and TIFF/IT (for prepress).
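Because migration decisions depend on which bitstream a TIFF contains, a custodian needs to read that fact from the file itself. The sketch below is one hedged way to do so in Python: it parses the TIFF header and first image file directory and looks up the Compression tag (tag 259), where 1 means uncompressed, 4 means CCITT Group 4, and 5 means LZW. The function names and the tiny in-memory test file are hypothetical. A production tool would also handle multiple directories and malformed files.

```python
import struct

# Compression tag (259) values discussed in the text.
COMPRESSION = {1: "uncompressed", 4: "CCITT Group 4", 5: "LZW"}


def tiff_compression(data: bytes) -> str:
    """Read the Compression tag from the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}.get(data[0:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, _type, _count = struct.unpack_from(order + "HHI", data, entry)
        if tag == 259:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack_from(order + "H", data, entry + 8)
            return COMPRESSION.get(value, f"code {value}")
    return "uncompressed"  # TIFF 6.0 default when the tag is absent


def make_tiff(compression: int, order: str = "<") -> bytes:
    """Build a minimal one-entry TIFF in memory (hypothetical test data)."""
    byte_order_mark = b"II" if order == "<" else b"MM"
    header = byte_order_mark + struct.pack(order + "HI", 42, 8)
    entry = struct.pack(order + "HHIHH", 259, 3, 1, compression, 0)
    next_ifd = struct.pack(order + "I", 0)
    return header + struct.pack(order + "H", 1) + entry + next_ifd
```

A repository could use a check like this to route bitonal Group 4 files and 24-bit uncompressed files to different migration paths.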
9
More Complex Example -- PDF
Much more than text
A file format, a wrapper, a bundling format, all in one
Complexity of relationships
has subtype
  v.1.3 (July 2000, 696 pages)
  v.1.4 (December 2001, 978 pages)
  v.1.5 (August 2003, 1172 pages)
  v.1.6 (November 2004, 1236 pages)
may contain
  TIFF, JPEG, JPEG2000, etc., etc., etc. (all at once)
Tagged PDF (can represent logical document structure)
Accessible PDF (tagged + further constraints)
PDF/X (ISO standard, for pre-press use, e.g., submission of graphics to magazine publishers)
PDF/A (Under development as ISO standard, for archiving)

A more complex example is Adobe's Portable Document Format (PDF). PDF can act as a relatively straightforward format for paginated text, a wrapper for many different image formats, or a bundling format for complex documents and interactive multimedia.
10
Complexity Increasing
New standards have portmanteau nature
Many parts, many options, as already noted in the case of PDF
JPEG2000
  Part 1. .jp2 (core lossless and lossy compression schemes for continuous tone, replacement for JPEG)
  Part 2. .jpx (extensions, including more capabilities for embedding metadata)
  Part 6. .jpm (multi-layer images, can embed other bitstream encodings, including bitonal)
MPEG-4
  Many profiles for different contexts, also advanced video coding
Which parts of these standards will be widely adopted?

New formats are very complex. This is evident in the versions and subtypes for PDF; similar differentiations pertain to JPEG 2000 and MPEG-4, whose specifications are published in multiple parts often intended for different uses. Many new formats have profiles (and sometimes “levels”) that indicate degrees of complexity. The most common rendering software will be described as capable of reading, say, the core or baseline profiles of a given format.
11
Object-based design: QuickTime example
Digital works created in these formats are also complex. The auto manufacturer BMW has sponsored short films, famous on the Web, made by prominent directors, featuring well-known actors and, of course, starring BMW cars.
12
Object-based design: QuickTime example
Multiple soundtracks, script as text

Several versions of these shorts can be downloaded. The enhanced QuickTime version is a particularly complex example. From a single .mov file, you can switch from the normal soundtrack to a commentary track, display a text transcription . . .
13
Object-based design: QuickTime example
Virtual reality navigation at right, choose from six points of view

. . . or switch over to what they call a virtual reality presentation that shows off the car in all its splendor.
14
Object-based design: QuickTime example
List of tracks in the file

This QuickTime file, like its MPEG-4 counterparts, uses an object-based design internally. The player lists all of the file's elements (in effect, the objects in the file) under the properties setting.
15
Content States in a Production Process
Content in a publishing or distribution stream can be seen as existing in three states, and different formats are often associated with these three states, appropriate to the task at hand.
  Initial: author creates
  Middle: publisher manages and archives
  Final: end user receives

Different formats are favored in different stages of a content item's lifecycle. Albeit a bit of a simplification, we distinguish three states in a publishing or distribution stream:
  Initial: while the author is creating it
  Middle: while the publisher manages and archives it
  Final: what is presented or sold to an end-user
16
Initial State: While the author is creating it
Still images:
  “raw” in a digital camera
  .psd Photoshop files, with layers for component images while experimenting with cropping, special effects, and color

Initial state formats are often proprietary and may be limited to the creator's favorite software package. These formats tend to be complex, for example, retaining information about current choices for cropping and layering components of an image being prepared for advertising purposes.
17
Middle State: While publisher manages content, generates final product, and archives it for future use
High resolution, uncompressed/lossless compressed (.tif, .jp2)
Self-describing bundles to support “blind” exchange between creators and print publishers (PDF/X, TIFF/IT)
Example: submitting ads to a magazine

Middle state formats are used by industry to send or exchange data, as exemplified by the PDF/X or TIFF/IT files required for submitting digital art to a magazine. These prepress formats use separate layers to support color separation and spot color in ways compatible with printing technology. Middle-state formats may emerge as preferred for archiving within an industry.
18
Final State: What is presented or sold to an end-user
Low resolution, lossy compressed (.gif, .jpeg)
May be watermarked or copy-protected
Example: image on a Web page

Final state formats are for items in the marketplace and are often transient. A record company might release a song in RealAudio one year and sell it through iTunes as encrypted AAC the next. Depending on the delivery system, the disseminated files may even be generated dynamically from a master to a customer's particular requirements, using some of the new scalable formats coming into use.
19
Middle-state Formats for Sustaining
Best formats for long term may be middle state formats.
  Likely to have higher quality than final-state formats
  May incorporate metadata useful to support preservation
  Archiving and preservation practices may emerge from industry.

We think that the best formats from a preservation perspective will be those in the middle state. These are likely to have higher quality than final-state formats and may also be the focus of developing archiving approaches by industry.

HANDOFF TO CAROLINE
20
Two Types of Evaluation Factors
Sustainability factors for all formats
  influence feasibility and cost of preserving content in the face of future change
Quality and functionality factors that vary by content category
  reflect considerations that will be expected by future users

We have identified two types of factors to consider when selecting formats: sustainability factors and quality and functionality factors. Sustainability factors apply across digital formats for all categories of information, and they will be significant whether preservation strategies entail future migration to new formats, emulation of current software on future computers, a hybrid of migration and emulation, or normalization on receipt. Quality and functionality factors pertain to the ability of a format to represent the significant characteristics required or expected by current and future users. These factors will vary for particular genres or forms of expression.
21
Sustainability: Disclosure
Disclosure refers to the degree to which complete specifications and tools for validating technical integrity exist and are accessible. Non-proprietary, open standards are usually more fully documented and more likely to be supported by tools for validation than proprietary formats. However, what is most significant for sustainability is not approval by a recognized standards body, but the existence of complete documentation. Preservation of content in a given digital format is not feasible without an understanding of how the information is encoded.
Examples:
  TIFF, well documented, many third-party tools
  MrSID, only partially documented

I will run through our seven sustainability factors. First, disclosure.
22
Sustainability: Adoption
Adoption refers to the degree to which the format is already used by the primary creators, disseminators, or users of information resources. A widely adopted format is less likely to become obsolete rapidly, and tools for migration and emulation are more likely to emerge without specific investment by archival institutions.
Examples:
  TIFF uncompressed, widely recommended as master
  PDF/X, increasingly required for submission to magazines, etc.
  JPEG2000 Part 1, increasingly adopted

Evidence of adoption includes bundling of tools with personal computers, native support in web browsers or in market-leading tools for creators, and the existence of competing products for creation, manipulation, or rendering.
23
Sustainability: Transparency
Degree to which the digital representation is open to direct analysis with basic tools, such as human readability using a text-only editor. Digital formats in which the underlying information is represented simply and directly will be easier to migrate to new formats, more susceptible to digital archaeology, and easier to develop rendering software for.
Examples:
  Uncompressed raster image bitstream easy to interpret or reverse engineer
  Lossy compressed image bitstream requires algorithm to decode

Uncompressed raster images are an example of direct, transparent encoding.
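A small Python sketch can show what transparency means in practice. In an uncompressed 8-bit grayscale raster, a pixel can be found with offset arithmetic alone, whereas compressed data must first pass through the matching decoder. The dimensions and pixel values here are made up, and zlib (a lossless compressor from the Python standard library) merely stands in for any encoding that needs an algorithm to decode; it is not one of the lossy image codecs the slide has in mind.

```python
import zlib


def pixel_at(raster: bytes, width: int, x: int, y: int) -> int:
    """Read one 8-bit pixel from an uncompressed, row-major raster."""
    return raster[y * width + x]


# Hypothetical 4x3 image in which each pixel value equals its own offset.
width, height = 4, 3
raster = bytes(range(width * height))

# Direct analysis: the pixel in column 2 of row 1 sits at offset 1 * 4 + 2.
direct_value = pixel_at(raster, width, 2, 1)

# After compression the same offset no longer addresses that pixel;
# the bytes are only meaningful once the decoder has been applied.
compressed = zlib.compress(raster)
recovered_value = pixel_at(zlib.decompress(compressed), width, 2, 1)
```

If the decoder were lost, the raw raster could still be reverse engineered by trial and error on the width, while the compressed stream could not be read without reconstructing the algorithm.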
24
Sustainability: Self-documentation
Self-documentation. Digital objects that contain basic descriptive metadata (the analog to the title page of a book) as well as technical and administrative metadata will be easier to manage over the long term than data objects that do not incorporate the metadata needed to render or understand them. Some metadata elements will likely be extracted to support discovery and collection management.
Examples:
  JPEG (.jpg) image files contain very scant metadata
  EXIF JPEG combines JPEG compression with richer metadata
  JPEG2000 (.jpx) image files may contain metadata ‘boxes’ and can include an extensive DIG35 record

The value of richer metadata has been recognized in communities that create and exchange digital content. Related capabilities are built in to newer formats and standards (e.g., JPEG 2000, and the Extensible Metadata Platform [XMP] for PDF). This development is illustrated by the progression from the original JPEG standard, with scant provision for metadata, to the EXIF JPEG used in many digital cameras, which combines JPEG compression with richer metadata, and now to JPEG 2000.
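The difference between a plain JPEG and an EXIF JPEG is visible in the file's leading segments. As a hedged illustration, the Python sketch below walks the marker segments that precede the compressed image data and reports whether a JFIF header (APP0) or an Exif block (APP1) is present. The function and helper names are hypothetical, and the sketch ignores rarer cases such as fill bytes and markers without a length field, which a real validator would need to handle.

```python
import struct


def jpeg_metadata_segments(data: bytes) -> list:
    """List the JFIF/Exif application segments found before the image data."""
    if data[0:2] != b"\xff\xd8":  # every JPEG stream starts with SOI
        raise ValueError("not a JPEG stream")
    found = []
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload.startswith(b"JFIF\x00"):
            found.append("JFIF")
        elif marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            found.append("Exif")
        pos += 2 + length
    return found


def segment(marker: int, payload: bytes) -> bytes:
    """Build one marker segment (hypothetical helper for test data)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload
```

A collection manager could run a check like this at ingest to decide whether embedded metadata is available for extraction.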
25
Sustainability: External Dependencies
Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments. Some interactive digital content is designed for use with specific hardware, such as a joystick. Scientific datasets built from sensor data may require specialized software for analysis and visualization. External dependencies will make content more difficult and costly to sustain than static content. The specialized software required by some scientific datasets may itself be very difficult to sustain.

External dependencies refers to the need for particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in the future. For example, scientific datasets built from sensor data may be useless without specialized software for analysis and visualization.
26
Sustainability: Impact of Patents
Degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents. Although the costs for licenses to decode current standard formats are often low, the development of open source decoders will be inhibited. Tools to transcode content in these formats when they become obsolete may be more costly to develop. It is not the existence of patents that is a potential problem, but the terms that patent-holders might choose to apply.
Examples:
  No patents are involved in use of uncompressed TIFF
  MrSID patent exploited through licensing terms with fees depending on transaction volume

Impact of patents. This refers to the degree to which the ability of archival institutions to sustain content in a format may be inhibited by patents. Although the costs for licenses to decode current formats are often low or nil, the existence of patents may slow the development of free or reasonably priced encoders and decoders.
27
Sustainability: Tech Protection Mechanisms
Refers to the implementation of mechanisms such as encryption that prevent the preservation of content by a trusted repository. Preservation of the digital content requires replicating it on new media, migrating and normalizing it in the face of changing technology. Protection mechanisms may also prevent the dissemination of content to authorized users. Exploitation of technical protection mechanisms is generally optional; their use in a particular context may depend on business decisions.

Finally, our seventh factor: Technical Protection Mechanisms. Encryption and digital rights management mechanisms applied to content files will complicate the task of a trusted repository. To preserve digital content and provide service to future users, custodians must be able to replicate the content on new media, and migrate or normalize it in the face of changing technology.
28
Quality and Functionality Factors
Vary according to content type, e.g., text, image, sound
Pertain to current and future usefulness, e.g., for scholarship or repurposing
Identification of factors reflects consideration of what are likely to be significant or essential features of some content items
  Surround sound for audio
  Color maintenance for still images
  Logical structure for text documents
Trying to get at this at a simplified high level

Quality and functionality factors vary by genre. To date, our analysis has been limited to four familiar content categories: still images, sound, textual materials, and video. Ahead lie categories like Web sites and datasets, which will probably be treated in subcategories: for example, geospatial data and social science surveys. We think that all formats worth considering must offer certain base-level functionality, what we call normal rendering. And we are cognizant of the several valuable “beyond normal” functions offered by formats like JPEG 2000.
29
Quality & Functionality Example: Still Images
Clarity (support for high image resolution)
  Bitmaps: high pixel count and bit depth
  Vector: choice of “clean edges” or “geometric precision”
  Avoid or minimize compression loss, watermarking, etc.

Here’s what we use when we describe still image formats: Clarity refers to the degree to which "high resolution" content may be represented within this format. Quality tends to correlate to pixel counts and bit depth for bitmapped images. Some vector formats offer a choice between "clean edges" and "geometric precision."
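Pixel count and bit depth also determine how much storage an uncompressed master requires, which is part of the trade-off a custodian weighs. The following Python sketch works through that arithmetic. The function name and the 3000 by 2000 pixel example are hypothetical, and the calculation ignores headers, metadata, and any row padding a particular file format may add.

```python
def uncompressed_size_bytes(width, height, samples_per_pixel, bits_per_sample):
    """Bytes needed to store every sample of a bitmap without compression."""
    total_bits = width * height * samples_per_pixel * bits_per_sample
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical 3000 x 2000 pixel image at three depths.
rgb_8bit = uncompressed_size_bytes(3000, 2000, 3, 8)    # 24-bit color
rgb_16bit = uncompressed_size_bytes(3000, 2000, 3, 16)  # 48-bit color
bitonal = uncompressed_size_bytes(3000, 2000, 1, 1)     # 1 bit per pixel
```

Doubling the bit depth doubles the size of the image data, while a bitonal scan of the same dimensions needs only a small fraction of the space.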
30
Quality & Functionality Example: Still Images
Color maintenance (support for color management)
  Format has features that support color management, e.g., embedded ICC profiles

Color maintenance refers to the degree to which the color gamut represented in a given image can be managed, with an eye on inputs and outputs. Formats that embed ICC profiles will be preferred.
31
Quality & Functionality Example: Still Images
Support for graphic effects and typography
  Applies to vector graphics formats or to layers in formats that support both bitmapped and vector layers
  Desirable: support for
    the use of shadows, filters or other effects as applied to fill areas and text
    levels of transparency
    specification of fonts and patterns

Support for graphic effects and typography is usually associated with vector graphics formats or formats that support bit-mapped and vector layers.
32
Finding the balance
Preferences will be based on balancing the factors.
Sometimes the factors compete. Trade-offs will come into play.
Selection of acceptable formats will have to consider how digital content may be received.
Sometimes, adoption may be paramount
But for some content of high cultural value, particular functionality may outweigh sustainability factors

In practice, preferences among digital formats will be based on finding a balance among all the factors. Sometimes the factors compete. For example, some formats adopted widely for delivery of content to end users are proprietary or apply lossy compression. In such cases, disclosure can substitute for transparency. For content of high cultural value and for which a special functionality has particular significance, the ability of a format to support that functionality may outweigh the sustainability factors.
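One hedged way to picture this balancing is a weighted sum over per-factor scores. The Python sketch below is purely illustrative: the formats "format_a" and "format_b", their scores, and the weights are all invented for the example and do not come from the Library's assessments, and the presentation does not say that preferences are computed numerically.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Sum factor scores, each multiplied by its weight (default weight 1.0)."""
    return sum(weights.get(factor, 1.0) * value for factor, value in scores.items())


# Invented scores: +1 favorable, 0 neutral, -1 unfavorable.
format_a = {"disclosure": 1, "adoption": -1, "transparency": 1}
format_b = {"disclosure": 0, "adoption": 1, "transparency": -1}

# With equal weights, the well-documented, transparent format scores higher.
equal_a = weighted_score(format_a, {})
equal_b = weighted_score(format_b, {})

# If adoption is treated as paramount, the ranking flips.
adoption_first = {"adoption": 3.0}
adoption_a = weighted_score(format_a, adoption_first)
adoption_b = weighted_score(format_b, adoption_first)
```

The same two formats rank differently under the two weightings, which is the sense in which the factors compete and trade-offs come into play.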
33
Factor Scorecard for Still Images
Table (scorecard). Formats compared: TIFF_UNC, EXIF-TIFF, JPEG, JP2, MrSID. Factors scored: Disclosure, Adoption, Transparency, Self-documentation, External dependencies, Patents, Tech protection (possibility), Clarity, Color Maintenance.

In some of our discussions with curators, we have illustrated our points in a simplified tabular comparison. For example, this rough and ready table illustrates how one might use the factors to score some formats for bitmapped images.
34
Categories of Bit-mapped Images
Columns: Description, Clarity, Color maintenance

1 Pictorial expression of high value. Examples: Works by graphic artists, photographers, advertisers for whom the designated community has high interest in artist’s intent. **
2 Images for which artist’s pictorial intent is less significant but color or tonality is significant. Examples: documentary photographs of nature, fashion, architecture; newspaper “file” photos; Landsat images *
3 Images for which spatial resolution is important, but color depth and precise color accuracy are not important. Examples: maps, graphs, technical drawings, vector graphics "frozen" as bit-maps
4 Pictorial expression of lower artistic value, such as: routine output of a portrait studio; images with significance as the expression of everyday life (“snapshots”); interesting-but-not-artistically valuable images associated with oral histories.
5 Images incidental to Web harvesting, including animations consisting of only a few frames

Some still image items warrant higher functionality and quality than others. For example the original artwork of a cartoonist, a digital snapshot from an oral history project, and a documentary nature photograph may warrant different balances of the factors. Here is an illustrative set of categories for bitmapped still images likely to be added to the Library’s collections. The significant characteristics are potentially different for each category.
35
Category 1 Format Preferences
Columns: Encoding type; File type, subtype (Preferred, Acceptable)

1 Pictorial expression of high value. Examples: Works by graphic artists, photographers, advertisers for whom the designated community has high interest in artist’s intent.
  Bit-mapped, rich color, uncompressed: TIFF_UNC
  Lossless compression: TIFF_LZW, JPX_LL, JP2_LL
  Same, produced by a digital camera: TIFF/EP (EXIF)
  Same, prepared to produce printing plates, e.g., for a magazine: PDF/X, TIFF/IT

We haven’t vetted this yet with our colleagues, but we have begun to develop a short list of preferred and acceptable formats. Here’s the first of the categories from the previous slide. Our current, admittedly conservative, general preference is for TIFF with no compression, although lossless JPEG 2000 is acceptable, especially if color management data is included in the file. If the image was created in a digital camera, we would prefer TIFF/EP or EXIF TIFF; if the image is graphic art for, say, a magazine, then PDF/X or TIFF/IT would be preferred.

These preferences will not remain static. For example, we may cling less firmly to uncompressed TIFF as a preferred format as we overcome our shyness about JPEG 2000. We fret about the reduced transparency of JPEG 2000 encoding and wonder about support for so-called linear gamma. But as the format's adoption grows, we see its many advantages in terms of increased metadata, color management support, and functionality, remembering again its “beyond normal rendering” capabilities.
36
In conclusion, we plan to describe many more formats and hope to add new categories during 2005, for example GIS and CAD. We seek review from specialists; our web site has an online comment form. Meanwhile, we hope to maximize our synergy with the Global Digital Format Registry. We see our role as providing information to custodians of digital content, while GDFR and entities like JHOVE build tools that assist those custodians to execute tasks like identification and validation.
37
Thank you