Sustainability, Functionality, and Quality Analyzing Formats for Digital Content Caroline R. Arms and Carl Fleischhauer Digital Preservation Management Workshop Cornell University Library July 21, 2004
National Digital Information Infrastructure and Preservation Program Builds on experience with American Memory/National Digital Library Program Created by federal legislation (PL ) in December 2000 Up to $175 million potentially available –$5 million (available immediately to support planning) –$20 million (subject to Congressional approval) –$75 million (subject to dollar-for-dollar match from non-federal sources) –$75 million (private funds)
NDIIPP Activities in National Digital Strategy Advisory Board – 26 members Series of technical workshops – 47 experts NDIIPP plan drafted and approved by Congress Areas for investment –Network of Preservation Partners –Preservation Architecture –Digital Preservation Research
Network of Preservation Partners Capture digital content at risk Public policy and cultural heritage emphases Multiple partners Requires matching funds from applicants Selection process –December 2003-Summer 2004 Announcement expected soon
Preservation Architecture For the national infrastructure –Revised document on web, late 2003 –http://www.digitalpreservation.gov/repor/NDIIPP_v02.pdf Related activity –Archive Ingest and Handling Test (AIHT) announced June 2004 –Resource from George Mason University, Sept. 11, ,000 digital images, texts, sound, video; Data extent: 12 GB –Project duration: 12 months –Joint project with Old Dominion, Johns Hopkins, Stanford, and Harvard Universities –Will help validate technical architecture
Digital Preservation Research Tools & technologies for preserving digital content Partnership with NSF –2002: Planning workshop –2003: Report –2004: Announcement of program by NSF Expectation: –12 to 15 grants –$2.3 M funding
Analysis of Digital Formats To support strategic planning at the Library To provide an inventory of information about current and emerging formats To identify and describe the formats that –are promising for long-term sustainability, and develop strategies for sustaining these formats –are not promising for long-term sustainability, and develop strategies for sustaining the content they contain
Analysis of Digital Formats Project includes media-independent formats ("intangible"), e.g., MP3 files Project excludes media-dependent formats ("tangible"), e.g., audio CDs, DVDs Synergy with the proposed Global Digital Format Registry Web site currently limited to LC staff
Sample format description, top
Sample format description, middle
Sample format description, bottom
Format descriptions as of July 2004: AAC_MP2 (Advanced Audio Coding, MPEG-2); AAC_MP4 (Advanced Audio Coding, MPEG-4); AAC_ADIF (Advanced Audio Coding, MPEG-2, Audio Data Interchange Format); AAC_M4A (Advanced Audio Coding, MPEG-2, m4a File Format); AIFF (Audio Interchange File Format); AIFF_LPCM (AIFF File Format with LPCM Audio); ASF (Advanced Systems Format); AudCom (Audible.Com File Format); AudCom_MP3 (Audible.Com MP3); AVI (Audio Video Interleaved); AVI_MJPEG (AVI, MJPEG Codec); AVI_Indeo (AVI, Indeo Codec); AVI_Cinepak (AVI, Cinepak Codec); AVI_DivX (AVI, DivX Codec); Cinepak (video codec); DLS, Downloadable Sounds Format; DivX_5, Version 5 (video codec); ID3 (ID3 Metadata for MP3); ID3v1 (ID3, version 1); ID3v2 (ID3, version 2); IFF (Electronic Arts Interchange File Format 1985); Indeo_3, Version 3 (video codec); Indeo_5, Version 5 (video codec); LPCM; MIDI_SD, MIDI Sequence Data; MJPEG (Motion JPEG); MODS, Module Music Format (Mods); MP3_ENC (MP3 Encoding); MP3_FF (MP3 File Format); MPEG-1; MPEG-2; MPEG-2_SP, Simple Profile; MPEG-2_MP, Main Profile; MPEG-2_422, 4:2:2 Profile; MPEG-4; MPEG-4_V, Visual Coding (Part 2); MPEG-4_V_SP, Visual Coding, Simple Profile; MPEG-4_V_SSP, Visual Coding, Simple Scalable Profile; MPEG-4_V_ASP, Visual Coding, Advanced Simple Profile; MPEG-4_V_CP, Visual Coding, Core Profile; MPEG-4_V_MP, Visual Coding, Main Profile; MPEG-4_V_SStP, Visual Coding, Simple Studio Profile; MPEG-4_AVC, Advanced Video Coding (Part 10); MPEG-4_AVC_BP, Advanced Video Coding, Baseline Profile; MPEG-4_AVC_MP, Advanced Video Coding, Main Profile; MPEG-4_AVC_EP, Advanced Video Coding, Extended Profile; NITF, News Industry Text Format; Ogg, Ogg File Format; Ogg_Vorbis, Ogg Vorbis Audio Format; PCM; PDF, Portable Document Format; PDF/A, PDF for Preservation; QuickTime; QTA_MP3, QuickTime Audio, MP3 Codec; QTA_AAC, QuickTime Audio, AAC Codec; QTV_Apple, QuickTime Video, Apple Codec; QTV_Cinepak, QuickTime Video, Cinepak Codec; QTV_Sorenson, QuickTime Video, Sorenson Codec; QTV_MJPEG, QuickTime Video, Motion JPEG Codec; QTV_MPEG, QuickTime Video, MPEG-1 Codec; RealAudio_10, Version 10; RealAudio_RA, RealAudio Codec; RealAudio_AAC, AAC Codec; RealAudio_LL, Lossless Codec; RealAudio_MC, Multichannel Codec; RealVideo_10, Version 10; RMID, RIFF-based MIDI File Format; Sorenson_3, Version 3 (video codec); SMF, Standard MIDI File Format; SVG, Version 1.1; TIFF, Revision 6.0 and earlier; Vorbis, Vorbis Audio Codec; WAVE; WAVE_LPCM; WAVE_LPCM_BWF; WMA, Windows Media Audio; WMA_WMA9, Windows Media Audio File with WMA9 Codec; WMA_WMA9_PRO, Windows Media Audio File with WMA9 Professional Codec; WMA_WMA9_LL, Windows Media Audio File with WMA9 Lossless Codec; WMA9, Windows Media 9 Audio Codec; WMA9_PRO, Windows Media 9 Professional Audio Codec; WMA9_LL, Windows Media 9 Lossless Audio Codec; WMV, Windows Media Video; WMV_WMV9, Windows Media Video with WMV9 Codec; WMV_WMV9_PRO, Windows Media Video with WMV9 Professional Codec; WMV9, Windows Media 9 Video Codec; WMV9_PRO, Windows Media 9 Professional Video Codec; XMF, eXtensible Music Format; XML
Formats: Types & Relationships file formats –at the level indicated by file extensions, e.g., .mp3 –as indicated by Internet MediaType (aka MIME type), e.g., text/html –versions develop through time –refinements are tailored to specific purposes, e.g., TIFF-EP for electronic photography class of related formats whose familial characteristics are important –e.g., the WAVE audio format is an instance of the RIFF format class "wrappers" distinguished in terms of their underlying bitstreams –e.g., WAVE files may contain linear pulse code modulated [LPCM] audio (like a CD) or highly compressed audio as used for digital telephony. file formats may have optional features significant to sustainability –e.g., for encryption bundling formats bind together files comprising a single digital work –e.g., text and supporting illustrations, or a movie with sound tracks in different languages
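The first two identification levels on this slide, file extension and Internet media type, can be related in a few lines. The sketch below relies on Python's standard `mimetypes` module; the file names are invented for illustration, and the answer depends on the type registry of the machine it runs on, so treat it as a rough first-pass guess rather than a reliable identification of the underlying bitstream.

```python
import mimetypes

def media_type_for(filename):
    """Guess the Internet media type (MIME type) from the file extension alone."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type  # None when the extension is not registered

# Hypothetical file names, used only for illustration.
for name in ("index.html", "interview.mp3", "scan.unknownext"):
    print(name, "->", media_type_for(name))
```

An extension says nothing about versions, refinements, or the bitstream inside a wrapper, which is why the slide treats those as separate levels.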
Simple Example: WAVE Wrapper for different bitstreams Simple, but extensible method for embedding metadata –subtype of: RIFF –may contain: Linear PCM, μ-law, A-law (bitstreams) –has subtype: Broadcast WAVE (Linear PCM + EBU metadata) –has subtype: AES (BWF + cart metadata)
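The wrapper/bitstream relationship for WAVE can be inspected directly, because a RIFF file is a sequence of tagged chunks. The following is a minimal sketch, not a validator: the function name `describe_wave` is hypothetical, only three common format tags are mapped, and the presence of a `bext` chunk is used as a simple indicator of Broadcast WAVE.

```python
import io
import struct
import wave

# Format tags stored in the WAVE "fmt " chunk (only three common ones listed).
FORMAT_TAGS = {0x0001: "Linear PCM", 0x0006: "A-law", 0x0007: "mu-law"}

def describe_wave(data):
    """Walk the RIFF chunks of a WAVE file held in memory (hypothetical helper)."""
    if len(data) < 12:
        raise ValueError("too short to be a RIFF/WAVE file")
    riff, _size, form = struct.unpack("<4sI4s", data[:12])
    if riff != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info = {"encoding": None, "chunks": []}
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack("<4sI", data[pos:pos + 8])
        info["chunks"].append(chunk_id.decode("ascii", "replace"))
        if chunk_id == b"fmt ":
            (tag,) = struct.unpack("<H", data[pos + 8:pos + 10])
            info["encoding"] = FORMAT_TAGS.get(tag, hex(tag))
        pos += 8 + chunk_size + (chunk_size & 1)  # chunks are padded to even length
    # Broadcast WAVE adds a "bext" metadata chunk to an ordinary WAVE file.
    info["broadcast_wave"] = "bext" in info["chunks"]
    return info

# Build a tiny example file in memory with the standard library.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 2 * 10)  # ten silent stereo frames
print(describe_wave(buffer.getvalue()))
```

The same chunk walk works for any RIFF-class format, which is the practical value of the "class of related formats" idea.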
More Complex Example -- PDF Much more than text A file format, a wrapper, a bundling format, all in one Complexity of relationships –has version: 1.3 (July 2000, 696 pages) –has version: 1.4 (December 2001, 978 pages) –has version: 1.5 (August 2003, 1172 pages) –may contain: TIFF, JPEG, JPEG2000, etc., etc., etc. (all at once) –has subtype: Tagged PDF (can represent logical document structure) –has subtype: Accessible PDF (tagged + further constraints) –has subtype: PDF/X (ISO standard, for pre-press use, e.g., submission of graphics to magazine publishers) –has subtype: PDF/A (under development as ISO standard, for archiving)
Complexity Increasing New standards have portmanteau nature –Many parts, many options, as already noted in the case of PDF JPEG2000 –Part 1: .jp2 (core lossless and lossy compression schemes for continuous tone, replacement for JPEG) –Part 2: .jpx (extensions, including more capabilities for embedding metadata) –Part 6: .jpm (multi-layer images, can embed other bitstream encodings, including bitonal) MPEG-4 –Many profiles for different contexts, also advanced video coding Which parts of these standards will be widely adopted?
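Because the JPEG2000 parts share one box-based file structure, a tool can usually tell them apart from the first 24 bytes: a fixed signature box followed by a File Type box whose brand names the part. The sketch below assumes that layout; the function name `jpeg2000_part` and the synthetic header are illustrative, and real files may declare further compatible brands that this code ignores.

```python
import struct

# Fixed 12-byte JPEG 2000 signature box that opens JP2-family files.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")

# Brand values in the File Type box, mapped to the parts named on the slide.
BRANDS = {b"jp2 ": "Part 1 (.jp2)", b"jpx ": "Part 2 (.jpx)", b"jpm ": "Part 6 (.jpm)"}

def jpeg2000_part(header):
    """Return the JPEG2000 part suggested by the file's brand, or None."""
    if len(header) < 24 or not header.startswith(JP2_SIGNATURE):
        return None
    _box_length, box_type, brand = struct.unpack(">I4s4s", header[12:24])
    if box_type != b"ftyp":
        return None
    return BRANDS.get(brand, "unknown brand")

# Synthetic header for illustration: signature box + File Type box (brand "jpx ").
sample = JP2_SIGNATURE + struct.pack(">I4s4sI4s", 20, b"ftyp", b"jpx ", 0, b"jp2 ")
print(jpeg2000_part(sample))
```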
Two Types of Evaluation Factors Sustainability factors for all formats –influence feasibility and cost of preserving content in the face of future change Quality and functionality factors that vary by content category –reflect considerations that will be expected by future users
Sustainability: Disclosure Degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content. A spectrum of disclosure levels exists. What is most significant is not approval by a recognized standards body, but the existence of complete documentation. Preservation of content in a given digital format is not feasible without an understanding of how the information is encoded. Examples: –TIFF image format well documented, many products and shareware –MrSID image format partially documented, proprietary elements protected
Sustainability: Adoption Degree to which the format is already used by the primary creators, disseminators, or users of information resources. If a format is widely adopted, it is less likely to become obsolete rapidly, and tools for migration and emulation are more likely to emerge from industry without specific investment by archival institutions. Examples: –PDF text format very widely used –Microsoft eBook Reader not widely used
Sustainability: Transparency Degree to which the digital representation is open to direct analysis with basic tools, such as human readability using a text-only editor. Digital formats in which the underlying information is represented simply and directly will be easier to migrate to new formats, more susceptible to digital archaeology, and easier to develop rendering software for. Examples: –Uncompressed raster image bitstream easy to interpret –Lossy compressed image bitstream requires algorithm to decode
Sustainability: Self-documentation Digital objects that contain basic descriptive metadata (the analog to the title page of a book) as well as technical and administrative metadata relating to creation and the early stages of the life cycle will be easier to manage over the long term than data objects that are stored separately from the metadata needed to render or understand them. Some metadata will be extracted to support discovery and collection management. Examples: –JPEG (.jpg) image files contain very scant metadata –EXIF JPEG combines JPEG compression with richer metadata –JPEG2000 (.jpx) image files may contain metadata ‘boxes’ and can include an extensive DIG35 record Important metadata includes administrative types called for by the OAIS reference model.
Sustainability: External Dependencies Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments. Some interactive digital content is designed for use with specific hardware, such as a joystick. Scientific datasets built from sensor data may require specialized software for analysis and visualization. External dependencies will make content more difficult and costly to sustain than static content. The specialized software required by some scientific datasets may itself be very difficult to sustain. Examples: Adobe eBooks require a Microsoft Passport account. Open eBook format is free of external dependencies
Sustainability: Impact of Patents Degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents. Although the costs for licenses to decode current standard formats are often low, the development of open source decoders will be inhibited. Tools to transcode content in these formats when they become obsolete may be more costly to develop. Examples: –Makers of tools for MPEG-1 moving image format are not required to have licenses –Makers of tools for MPEG-2 and MPEG-4 moving image formats must pay for licenses and/or pass through royalties; dissemination of MPEG-4 content may carry per-viewing fees
Sustainability: Tech Protection Mechanisms Refers to the implementation of mechanisms such as encryption that prevent the preservation of content by a trusted repository. Preservation of the digital content requires replicating it on new media, migrating and normalizing it in the face of changing technology. Protection mechanisms may also prevent the dissemination of content to authorized users. Exploitation of technical protection mechanisms is generally optional; their use in a particular context may depend on business decisions. Examples: –Sound recordings from Audible.com will only play with software and/or devices from Audible –MP3 files play anywhere
Quality and Functionality Factors Vary according to content type, e.g., text, image, sound Pertain to current and future usefulness, e.g., for scholarship or repurposing Identification of factors reflects consideration of what are likely to be significant or essential features of some content items –Surround sound for audio –Color maintenance for still images –Logical structure for text documents Trying to get at this at a simplified high level
Quality and Functionality Factors Normal rendering –Baseline for behavior of content when presented to a user, e.g., images permit zooming; sounds can be played, stopped, and restarted. Functionality beyond normal rendering –For special interests, e.g., vector-based images (architectural drawings) remain malleable so that they can be modified after being copied from a library collection; music notation formats (MIDI) are structured to permit association with particular sounds or tone sets. At the moment, our analysis is limited to four familiar content categories: still images, sound, textual materials, and video. More to come later.
Quality & Functionality: Still Images Normal rendering includes on-screen viewing and printing to paper; likely to also include the ability to zoom in to study detail or produce publication quality output Clarity (support for high image resolution) –Bitmap format allows for high pixel count and bit depth, vector format offers choice of “clean edges” or “geometric precision” –Prefer implementations that eschew or minimize compression loss, effect of watermarking, etc. Color maintenance (support for color management) –Format allows for color management, e.g., ICC profiles Support for graphic effects and typography –Applies to vector graphics formats Functionality beyond normal image rendering (3-D models, etc.)
Quality & Functionality: Text Normal rendering includes linear reading on screen, print, search, and index; must reflect the intent of the author in representing characters, paragraph structure, lists, headings, and emphasis. Support for integrity of document structure and navigation –Format allows for navigation and automated analysis that reflects the logical structure of a work Support for integrity of layout, font, and other design features –Allows for reliable presentation in terms of look and feel, when exact choices of features like font and column layout are essential to meaning Support for rendering for mathematics, formulae, diagrams, etc. –Allows for accurate rendering of non-textual elements that are crucial to informational content (markup languages sometimes fall short in this area) Functionality beyond normal rendering –Talking books (ANSI/NISO Z39.86 for the blind), other elements
Finding the balance Preferences will be based on balancing the factors. Sometimes the factors compete. Trade-offs will come into play. Selection of acceptable formats will have to consider how digital content may be received. Sometimes, adoption may be paramount...
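One way to picture the balancing act is a weighted score across the sustainability factors. The presentation does not define any numeric method, so everything below is an assumption made for illustration: the 0-to-1 scale, the weights, and the two invented formats "A" and "B". The point is only that changing the weight on adoption can reverse a preference.

```python
SUSTAINABILITY_FACTORS = (
    "disclosure", "adoption", "transparency", "self_documentation",
    "external_dependencies", "patents", "technical_protection",
)

def weighted_score(scores, weights):
    """Weighted average of per-factor scores (0 = poor, 1 = good; assumed scale)."""
    total_weight = sum(weights[f] for f in SUSTAINABILITY_FACTORS)
    return sum(scores[f] * weights[f] for f in SUSTAINABILITY_FACTORS) / total_weight

# Hypothetical formats: A is mediocre but universally adopted,
# B is better on every other factor but not adopted at all.
format_a = {f: 0.5 for f in SUSTAINABILITY_FACTORS}
format_a["adoption"] = 1.0
format_b = {f: 0.75 for f in SUSTAINABILITY_FACTORS}
format_b["adoption"] = 0.0

equal_weights = {f: 1.0 for f in SUSTAINABILITY_FACTORS}
adoption_heavy = dict(equal_weights, adoption=3.0)  # assumed weighting

for label, weights in (("equal", equal_weights), ("adoption-heavy", adoption_heavy)):
    a = weighted_score(format_a, weights)
    b = weighted_score(format_b, weights)
    print(label, "prefers", "A" if a > b else "B")
```

With equal weights the invented format B scores higher; once adoption is weighted more heavily, A does, which mirrors the remark that adoption may sometimes be paramount.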
Quality & Functionality: Sound Two subcategories First is recorded or waveform sound. –Popular music recordings, recorded books, and digital oral histories Second is data for the dynamic construction of sound through software and hardware. –Software includes sequencers and trackers –Underlying data controls when sounds start and stop, attributes such as volume and pitch, and other effects –Sound elements may be samples of waveform sound or data elements that characterize a sound so that a synthesizer can produce it. –Sounds are generated at runtime. Second category sometimes called structured audio. Our focus is on note-based formats, the most prominent of which is MIDI, the Musical Instrument Digital Interface.
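To make the note-based idea concrete, the sketch below decodes a single three-byte MIDI channel message of the kind that tells a synthesizer when a sound starts or stops. It handles only note-on and note-off messages; the function name is hypothetical, and a full reader for MIDI files would also need to handle timing, running status, and other message types.

```python
def decode_channel_message(message):
    """Decode a 3-byte MIDI note-on/note-off message (hypothetical helper)."""
    status, note, velocity = message
    kind = status & 0xF0             # high nibble: message type
    channel = (status & 0x0F) + 1    # low nibble: channel, shown as 1-16
    if kind == 0x90 and velocity > 0:
        event = "note on"
    elif kind == 0x80 or (kind == 0x90 and velocity == 0):
        event = "note off"           # note-on with velocity 0 also ends a note
    else:
        raise ValueError("not a note-on/note-off message")
    return {"event": event, "channel": channel, "note": note, "velocity": velocity}

# Note 60 (middle C) started on channel 3 at velocity 100.
print(decode_channel_message(bytes([0x92, 60, 100])))
```

The message carries no audio at all, only instructions; the sound itself is generated at runtime, as the slide says.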
Quality & Functionality: Sound Normal rendering includes playback in mono and stereo; typical software provides user control over volume, tone, and balance, as well as navigation (fast forward, go-to-segment, etc.).
Quality & Functionality: Sound Fidelity (support for high audio resolution) –Factor associated with waveform formats –Refers to the degree to which "high fidelity" content may be reproduced within this format –Fidelity is put to the test when the reproduction is repurposed, e.g., when a "master file" is used to produce, say, a new audio-CD music release. –For linear PCM (Pulse Code Modulated) data, the two characteristics most often associated with fidelity are sampling frequency and word length (i.e., bit depth).* –Other factors may also influence fidelity, such as the presence of distortion, watermarking, or artifacts from compression. * Leaving aside for the moment the one-bit deep coding called DSD (Direct Stream Digital), developed by Sony and presented as an alternative to PCM.
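Sampling frequency and word length translate directly into data volume for linear PCM, which is worth seeing as arithmetic. In the sketch below the audio-CD parameters (44.1 kHz, 16-bit, stereo) are standard; the 96 kHz / 24-bit "rich master" is an assumed example of a higher-resolution file, not a figure from the presentation.

```python
def pcm_bit_rate(sampling_hz, bits_per_sample, channels):
    """Data rate of uncompressed linear PCM, in bits per second."""
    return sampling_hz * bits_per_sample * channels

def bytes_per_minute(sampling_hz, bits_per_sample, channels):
    """Storage needed for one minute of uncompressed linear PCM, in bytes."""
    return pcm_bit_rate(sampling_hz, bits_per_sample, channels) * 60 // 8

cd_quality = (44_100, 16, 2)   # audio CD parameters
rich_master = (96_000, 24, 2)  # assumed example of a higher-resolution master

for label, params in (("CD quality", cd_quality), ("rich master", rich_master)):
    print(label, pcm_bit_rate(*params), "bit/s,", bytes_per_minute(*params), "bytes/min")
```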
Quality & Functionality: Sound Support for multiple channels – two meanings 1. Aural space or sound field, e.g., as stereo or surround sound. 2. Multiple signal streams that provide alternate or supplemental content, e.g., narration in French and German, sound effects separate from music, etc. Waveform bitstreams –encode multiple channels in interleaved or matrixed structures –typical stereo or two-channel sound PCM bitstream in a WAVE file alternates the information from the two channels –surround-sound content employs additional data that is matrixed (or multiplexed) and decoded at playback time Note-based files, e.g., MIDI –can have as many as sixteen channels –allows separate "instruments" to play simultaneously for polyphony –can represent aural space
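The alternating layout of a two-channel PCM bitstream can be shown by splitting it back into its channels. This is a minimal sketch assuming 16-bit little-endian samples, as commonly found in WAVE files; the function name and the six-sample test signal are made up for illustration.

```python
import struct

def deinterleave_stereo(pcm_bytes):
    """Split 16-bit little-endian interleaved stereo PCM into (left, right) lists."""
    count = len(pcm_bytes) // 2                      # number of 16-bit samples
    samples = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    return list(samples[0::2]), list(samples[1::2])  # even = left, odd = right

# Three stereo frames: left carries 1, 2, 3 and right carries -1, -2, -3.
frames = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
left, right = deinterleave_stereo(frames)
print(left, right)
```

Matrixed surround sound is different in kind: the extra channels are recovered by a decoding algorithm at playback time rather than by simple indexing.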
Quality & Functionality: Sound Support for downloadable or user-defined sounds, samples, and patches –Applies to note-based formats –Does the format permit references to or the inclusion of digital sound data and articulation parameters? These are needed to create one or more voices or instruments in a musical presentation.
Quality & Functionality: Sound Functionality beyond normal rendering –Beyond play, stop, etc –Note-based formats Can produce notation on screen or on paper Permit file playback with selective control Feature karaoke content, with synchronized texts –Waveform formats “Rich-data versions” - high sampling rates and 24-bit data samples Not convenient to "play" such a file in real time But excellent for repurposing and for the production of reduced-data service copies. The “beyond normal” feature of rich data is more strikingly felt with still and moving images.
Factor scorecard, waveform sound Formats compared: WAVE-LPCM, WAVE-BWF-LPCM, MP3, AAC (MPEG-4), RealAudio Factors scored: Disclosure, Adoption, Transparency, Self-documentation, External dependencies, Patents, Tech protection (when implemented), Fidelity, Multiple channels, Downloadable sounds and samples Tech protection (when implemented): Not available w/in frmt; Not available w/in frmt (?); Available; Available (?) Downloadable sounds, samples: n/a
Content States in a Production Process A bit of a simplification but... Content in a publishing or distribution stream can be seen as existing in three states: –Initial “while the author is creating it” –Middle “while the publisher manages and archives it” –End “what is presented or sold to an end-user” Different formats are often associated with these three states, appropriate to the task at hand
Initial State, Early Creative Processes Example for sound recording Multiple separate tracks in a recording studio Complex multipart entity, e.g., twelve tracks for instruments and voices “Edit decision list” manages elements Very high fidelity Specialized production formats, e.g., proprietary format produced by the SADiE digital audio workstation No technological protections embedded in bitstream Other examples Text: writer using word processing software, e.g., MS-Word Video: multi-segment work in progress has elements in AAF wrapper Initial state formats are often proprietary and may be limited to creator's favorite software package
Middle State, in Hands of Publisher Example for sound recording Mixed master, often in stereo or surround sound Not as complex as the studio session recordings, ready or close-to-ready for distribution as digital-file or compact disk Edit decision list may still be required Very high fidelity Specialized industry formats, e.g., AES-31 recorded sound format No technological protections embedded in bitstream Other examples Text: author’s journal article marked up and in document management system Video: program archived in MXF format, transmitted by TV network at designated time Middle state formats used by industry to send or exchange data, possible formats for archiving within an industry.
Final State, Distributed to End Users Example for sound recording Simple entity, may be high, moderate, or low fidelity Common, current media-independent formats, e.g., WAVE-LPCM file, Windows Media Audio (.wma), or MP3 Security elements may be embedded in the bitstream Other examples Journal article “published” as PDF file Video program disseminated as MPEG-4 compressed file Final state formats are for items in the marketplace. “This year, we released the song in RealAudio, next year we’ll reissue it as encrypted AAC on iTunes.”
Prefer the Middle-state Formats? The best formats for the long term may well be the middle-state formats. These are likely to have higher quality than final-state formats, may be easier to manage for preservation, and may be the focus of developing archiving approaches by industry. To seek middle-state digital formats for LC collections, however, would differ from the most widespread current practice. The selection of best editions authorized by copyright law and LC practices today is generally limited to works in their final state.
What is a format? Working definition from global format registry: A format is a fixed, byte-serialized encoding of an information model. Working definition we have used: Formats are packages of information that can be stored as data files or sent via network as data streams (aka bitstreams, byte streams)
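The registry's definition, "a fixed, byte-serialized encoding of an information model", can be illustrated with a toy format. Everything in this sketch is invented for the purpose: the `Tone` information model, the `TONE` magic number, and the 12-byte big-endian layout are hypothetical and correspond to no real format.

```python
import struct
from dataclasses import dataclass

@dataclass
class Tone:
    """Hypothetical information model: three technical properties of a recording."""
    sampling_hz: int
    channels: int
    bits_per_sample: int

MAGIC = b"TONE"                     # invented identifying signature
LAYOUT = struct.Struct(">4sIHH")    # magic, sampling rate, channels, bits per sample

def serialize(tone):
    """Encode the information model as a fixed 12-byte sequence."""
    return LAYOUT.pack(MAGIC, tone.sampling_hz, tone.channels, tone.bits_per_sample)

def parse(data):
    """Recover the information model from its byte-serialized form."""
    magic, sampling_hz, channels, bits = LAYOUT.unpack(data)
    if magic != MAGIC:
        raise ValueError("not a TONE record")
    return Tone(sampling_hz, channels, bits)

encoded = serialize(Tone(44_100, 2, 16))
print(encoded.hex(), parse(encoded))
```

The same bytes could be stored as a file or sent as a stream, which is where the second working definition on this slide picks up.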