Challenges of Digital Media Preservation


1 Challenges of Digital Media Preservation
Karen Cariani, Director Media Library and Archives Dave MacCarn, Chief Technologist Today I will talk about WGBH’s DAM system, our open source project and plans for future digital preservation.

2 Who we are: WGBH Media Library and Archives
Who are we? WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Downton Abbey and Sherlock. In addition to television, we have 2 radio stations and a large, award-winning Interactive department that is the number one producer for the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs, to history and science, to children’s programs, arts, culture, drama and how-tos. We have been on the air since 1951 with radio and 1955 with television. At heart and through our mission we are an educational and cultural institution. We originated out of a consortium of universities in the Boston area. Because we have produced so much, we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.

3 Challenges Transitions (Analog to Digital) What Born Digital Brings
New Workflows New Capture Formats Access vs. Preservation New Systems New Storage This is what we plan to talk about

4 Transitions (Analog to Digital)

5 Transition challenges (Analog to Digital)
Preservation needs are more complicated New and changing content formats Network connections Software Storage media Hardware Access expectations challenging Faster access Anywhere, anytime So these are some of the main challenges we have with the transition to digital content. A physical item is easier to keep track of, sort of. You put it on a shelf, you can see it and find it. Keep it cold and dry and it’s good for 20 years at least. With digital content, we need to pay closer attention to different formats, storage, access expectations and workflows. Because we are now managing digital files, not physical tapes, the preservation needs change and become more complicated: we need to pay attention to hardware, software, storage media, network connections and content formats, upgrading and migrating all these facets every 3-5 years. And because it’s digital, everyone thinks everything should be available from their desktop immediately, so access expectations are challenging.

6 Common Physical Formats (and some not so common)
16mm film (some 35mm) 2” videotape 1” videotape 3/4” videotape Betacam Betamax MII Digital Betacam D1, D2, D3, D4, D5 DVCam DVCPro & DVCProHD Mini DV & HDV Keep in mind: The Machine was the Format. Here is a list of the most common analog and digital tape formats over the past 60 years. We’ve never been in control of the formats our producers use to create programs, or that PBS mandates as the broadcast master. But the choices for professional broadcast quality formats were limited, or at least manageable. And you had an item you could put on a shelf. The machines you could play them on were also limited, which is both a good and a bad thing. The biggest challenges to legacy digitization are the condition of the tape and the playback facilities and expertise. We have one of the few working 2” machines in New England and we have insisted that we keep 2, one for parts. Our 3/4” deck sounds like the inside of an aircraft hangar when you switch it on. When it fails, we might not have the parts to repair it because 3/4” decks are no longer being manufactured or maintained. And then there is the quality of the tapes themselves. As they get older, tapes will begin to stick and shed oxide. Parts of cassettes will become loose and affect play and rewind. The original quality of the stock purchased will affect the overall quality of the recording. Tapes may be so fragile that you will only get one pass at transferring them. We recently transferred some 2” and 3/4” interview masters from the series Vietnam: A Television History. Several of them had deteriorated so much that we lost significant portions of the audio and video.

7 While there are a bewildering number of options for what resolution to digitize at, choices have somewhat settled over the last five or so years. Yes, you can go ahead and capture 3/4” at 10-bit uncompressed, but you’ll be wasting bits. DVCPro50 or DVCPro25 are more suitable for that format. We’ve created guidelines for tape capture. We can also use the pre-existing data for the physical tape to help populate the record in our digital asset management system. So analog and digital tape migration is more a question of finding the resources to tackle our collection in a systematic way. These recommendations are from Broadway Video and the chart was put together by Dirk Van Dall.

8 Thought Migration TAPE - Condition - Machinery - Priorities
BORN DIGITAL - Codecs/wrappers - Nested data - Playback - Storage We pretty much feel we have the workflow for legacy digitizing sorted out. It’s more of a question of funds and priorities. Setting aside our current grant projects, we see 3/4” masters as the next big task to tackle. The format is deteriorating rapidly and machinery is scarce and fragile. Born-digital? Another story. The variations on each file format are enormous. And there are more factors to consider for playback, storage, etc.

9 Thought Migration Before we had 1 shoot, 10 physical tapes, many shots per tape Now each time the camera is in record, a new file is created Now 1 shoot can generate many files across multiple digital storage containers (Optical Disk, Hard Drive, Solid State Memory) From analog to digital content creation we need to go through a thought process change. Until relatively recently, one shoot would produce a finite number of physical tapes. Each tape would hold a variety of shots, and was represented as one record item. Now each time a camera is switched in and out of record, a new file is created, and that creates a lot of files. And the files can be stored across multiple types of storage containers. A label written on the outside of a storage container tends to be more cryptic than a label on a tape.

10 Digital Formats If this looks daunting, it is. This is a chart of the different file formats coming out of digital cameras. So the content formats have multiplied. We’re not just managing 4-5 tape formats. The sheer scale and variety of capture formats is overwhelming. We need to know if they are progressive, or interlaced, the container, and the wrapper. And this is just HD file formats. We need to make sure that productions capture this type of information even though in the future the information may be automatically pulled in. All this information is important for us to know how to store the files, play them back, reuse them, etc. Instead of decks to play it back we need software.

11 New Workflows

12 For Access: Data Organizational Issues
Technical Metadata Characterize files Descriptive metadata Need description for video to be useful, findable How to capture that How to make sure it is linked to video files In storing these digital files, we need to have descriptive data about them so we can find them again, and know what they are. We need to be able to search for content.

13 Original Footage For many years we’ve been supplying productions with Filemaker templates to organize their media. As we move towards tapeless, we are finding we are stretching the limit of what we can do with this framework. This is the record relating to Sony EX materials. The field names highlighted in blue are what we have added in the transition phase to capture information about the tapeless materials. There are dropdowns to help standardize the information. But the problem is that these databases are usually filled out at the end of production. For tapeless acquisition, organization and workflow have to be decided before shooting.

14 That was easy, wasn’t it? Plug-ins to view files
Depending on the file type we may have to re-wrap to a QuickTime wrapper, or fully transcode the source file Redundant storage of raw and unwrapped materials Quality of data But as fast as we are testing workflows and adding dropdowns to our Filemaker templates, the ground beneath us is changing. We are fighting a continuous battle to stay abreast of changing file formats and the plugins needed to even view these formats, let alone process them. An MXF file will not ingest correctly or play in our Digital Asset Management system. For some formats, we will also have to sync up the audio and video files pre-ingest. For preservation we want the best, highest quality file, so we are looking to store both the raw files (straight out of the camera) and the re-wrapped or transcoded files needed for editing. This doubles our storage needs. And data, just like with the items in our physical archives, is only as good as the person who types it in. The difference with born-digital is that there is no easy way to retrieve the file if the data is bad or incomplete. And how do we link the descriptive data to the file in our system? This is an uphill battle, as productions often leave data entry to the end of their project and assign it to the most junior member of the team.

15 Metadata Entry in the Field:
In-camera - encourage tagging files with data. User-generated clip naming --- (Health001, Health002) Card Labeling Content Labeling through folder structures We are telling production teams that file-based workflows really need to start in pre-production, and that they begin with more than just hitting record. Metadata starts here. The camera tags each file with certain pieces of data such as date, time code, codec, etc. But it can also add additional information such as aperture, focal length, camera set-up, type of lens, etc. The majority of the technical metadata is automatic with every shot. But there is some user-generated metadata that can be added on a clip-by-clip basis, such as limited shot logging or adding your own name to the head of every clip. And because of cost, most file-based workflows require that productions re-use the media cards, so they have to save the files on another medium, like a laptop or hard drive, on location. We encourage everyone to back up at least once, if not twice. We highly recommend that one of the drives be kept offsite for the duration of the project. How productions move this media and organize it is very important, as it will save time in post production.

16 Folder Structure Create folders by card Assign unique number
Continue numbers Add description Place ENTIRE card contents into this folder!! Just so you know: every time the camera starts or stops recording, a new file is created. That means LOTS of files are being created as content. So we have decided to manage the files coming from our productions in a standardized folder structure. That way everything is organized in the same way and we can develop some systematic processing to ingest the data and the files into our DAM system. Here you see two layers of folder structures from one production program. This would be 2 different shoots with 2 different cameras. One folder is filled with P2 material and the other is XDCam EX. Both used memory cards that can be re-used after dumping to a hard drive.
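The folder rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not WGBH's actual tooling: the function names, the four-digit numbering and the NNNN_description folder pattern are all assumptions made for the example.

```python
import re
import shutil
from pathlib import Path


def next_card_number(project_dir: Path) -> int:
    """Continue numbering from the highest-numbered card folder already present."""
    highest = 0
    for entry in project_dir.iterdir():
        match = re.match(r"(\d{4})_", entry.name)
        if entry.is_dir() and match:
            highest = max(highest, int(match.group(1)))
    return highest + 1


def archive_card(card_root: Path, project_dir: Path, description: str) -> Path:
    """Copy the ENTIRE contents of one card into a new, uniquely numbered folder."""
    project_dir.mkdir(parents=True, exist_ok=True)
    # Reduce the free-text description to something safe in a folder name.
    safe = re.sub(r"[^A-Za-z0-9]+", "-", description).strip("-")
    dest = project_dir / f"{next_card_number(project_dir):04d}_{safe}"
    # copytree preserves the camera's own nested structure and refuses to
    # overwrite an existing destination folder.
    shutil.copytree(card_root, dest)
    return dest
```

Copying the whole card, rather than picking out the video files, keeps the sidecar and index files that some cameras rely on to play a clip back.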

17 Video, Video, Wherefore art thou video?
So when the MLA gets the footage on a drive, the folder structure on the drive may look like this. To find the XDCam EX video, you have to look 5 folders down from the main project folder. These have one folder for the recording, with the audio and the video multiplexed in the same file.

18 Finding both audio & video
With these files the camera manufacturer uses a different MXF application specification (OP-Atom) that splits the audio and the video into separate folders. For an archivist this is harder to manage, because you need to be able to put the two together and ingest them together into the DAM system. So the naming convention is important in order to find the matches. So, how do we marry this nested material with our production databases, DAM metadata model, and QuickTime-based ingest tools? Use an OP-Atom to OP1a tool, or use BagIt (but that limits partial restore).
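Matching by naming convention can be sketched as follows. The layout here is an assumption for illustration: video essence in a VIDEO folder, audio essence in a sibling AUDIO folder, and each audio file name beginning with the stem of its video clip (for example 0001AB.MXF alongside 0001AB00.MXF and 0001AB01.MXF). Cameras differ, so the matching rule would need checking against real cards.

```python
from pathlib import Path


def pair_op_atom_clips(contents_dir: Path) -> dict:
    """Group split video and audio files by shared clip stem.

    Assumes contents_dir/VIDEO and contents_dir/AUDIO (hypothetical layout).
    """
    video_dir = contents_dir / "VIDEO"
    audio_dir = contents_dir / "AUDIO"
    audio_files = sorted(audio_dir.glob("*.MXF")) if audio_dir.is_dir() else []
    clips = {}
    for video in sorted(video_dir.glob("*.MXF")):
        stem = video.stem
        clips[stem] = {
            "video": video,
            # Audio tracks are matched purely on the file name prefix.
            "audio": [a for a in audio_files if a.stem.startswith(stem)],
        }
    return clips


def clips_missing_audio(clips: dict) -> list:
    """Clip stems with no audio found; these need a human look before ingest."""
    return [stem for stem, files in clips.items() if not files["audio"]]
```

A report of clips with no matching audio is worth producing before ingest, since a mismatch is much harder to repair once the card has been re-used.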

19 Storage and Retrieval How do we:
Capture the audio and video generated by myriad cameras Store the project information to allow potential re-edit Store files with rich, meaningful metadata Store born-digital materials Display and retrieve born-digital materials So the question now is: how do we archive all this stuff, capturing the audio and video generated by many camera types? And it’s not just about storing. It’s also about storing so you can see it and get it out again. What goes in must come out in the same way or in a usable way. We need to archive not only a transcoded file and have that be the new master, but the original source file should always be there for future use. Compression algorithms change, platforms change, so it’s important to be able to match back to that original file.

20 Original Footage This is how we capture more descriptive data: shot logs. It’s easier for our productions to use, and it creates standard fields to be filled out and ingested. What images are on this digital file? Create a mapping document between Filemaker and DAM. © 2011 WGBH

21 Proposed tapeless workflow
Create a mapping document between Filemaker & DAM Used to generate an xml style sheet Video is ingested simultaneously with the metadata from Filemaker using the xml style sheet Technical metadata is ingested simultaneously with the video and production data using the xml generated by the source digital files We have mapped all our Filemaker templates for original, stock and stills to our DAM framework and are working on creating xml style sheets to ingest the data at the same time as the video. Technical metadata will be automatically generated, though we will need to map certain format fields such as the camera type.
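A mapping document of this kind is essentially a lookup table from Filemaker field names to DAM field names. The sketch below uses Python's standard library rather than an XML style sheet, and every field name in it is invented for illustration; the real templates and DAM schema are not shown in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: Filemaker field name -> DAM field name
FIELD_MAP = {
    "Clip Name": "title",
    "Shot Description": "description",
    "Camera Type": "captureDevice",
    "Shoot Date": "dateCreated",
}


def record_to_xml(fm_record: dict) -> ET.Element:
    """Build one <asset> element from a Filemaker record.

    Fields that are unmapped or empty are skipped rather than guessed at.
    """
    asset = ET.Element("asset")
    for fm_field, dam_field in FIELD_MAP.items():
        value = fm_record.get(fm_field)
        if value:
            ET.SubElement(asset, dam_field).text = str(value)
    return asset


def unmapped_fields(fm_record: dict) -> list:
    """Fields present in the record that the mapping document does not cover yet."""
    return sorted(set(fm_record) - set(FIELD_MAP))


if __name__ == "__main__":
    record = {"Clip Name": "Health001", "Camera Type": "XDCam EX", "Reel": "12"}
    print(ET.tostring(record_to_xml(record), encoding="unicode"))
    print("Not yet mapped:", unmapped_fields(record))
```

Listing the unmapped fields is the useful part in practice: it shows where a production template has drifted away from the mapping document.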

22 Access vs. Preservation

23 Access File size – need proxies Speed of access Formats
Want consistency for playback Reuse Retrieval of original files/preservation files Search/findable Metadata Organize files Once you’ve put your files into a system you will need to be able to access them again. And you will want this access to be easy and intuitive. You will want to view the object, potentially reuse the object in another way, and therefore be able to get it out of the system. Hopefully you will get it out of the system in the same format that it was put in, meaning there was no conversion to store it in the system. Or at least you will get it out in a format that is usable and currently accessible. Many DAM systems will only ingest a small number of formats, so you have to transcode your files to meet those format requirements. And usually that is because of the need for uniformity of formats for easier access and easier migration. Many DAM systems also manage versions of items for production workflow, and that is one of their main features, so you can track changes among working groups. But that isn’t necessarily important for preservation, and it means you are storing more stuff long term, possibly as larger files. You are storing drafts of final work or working copies of originals.

24 Preservation Needs Multiple Copies Save original files
Validity: checksum Regular storage migration Persistence File format issues Migration ease Future playback Fixity check big files Big files Speed of access of preservation files for reuse Processing speed And then what do you need to do to preserve these digital files? If it were stone it would be easier; it would last longer. It’s always useful to have multiple copies, and with digital copies there is generally no quality loss if an exact copy is made in the same format with no compression or transcoding. You need to make sure the file you have placed into storage is indeed an exact copy of the original, so you need to validate the file in storage. To guarantee preservation over time you need to check that all the bits (1’s and 0’s) are still there in the file. Loss of 1’s and 0’s means loss of bits of the object. For video it means less image quality, or it could even mean file corruption and loss. Preservation means it will last a long time and be accessible over a long time. Persistence. So you need to migrate files and systems forward as technology evolves. Access to formats will change and you need to be able to deal with that for long-term access. Think about how much your computer or digital camera has changed in the last 5 years. The repository systems will also change, and you need to migrate the files and the systems (software, hardware, storage media) forward to make sure you can still access the materials. For preservation, you need to make decisions about what versions you want to keep long term, and what might be a work in progress. If you plan to go with a normalized format, with everything in the same format, you also have to decide what the transcoding to that format does to your file. Any loss? Is the loss acceptable? What needs to happen every time you migrate?
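Validating a stored copy and re-checking it later both come down to comparing checksums. The following is a minimal sketch using Python's standard hashlib module; the choice of SHA-256 and the function names are assumptions, since the talk says only "check sum".

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large video files never have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original: Path, copy: Path) -> bool:
    """Validation at ingest: is the stored file an exact copy of the original?"""
    return file_checksum(original) == file_checksum(copy)


def fixity_check(path: Path, recorded_checksum: str) -> bool:
    """Fixity over time: does the file still match the checksum recorded earlier?"""
    return file_checksum(path) == recorded_checksum
```

The checksum recorded at ingest has to be stored somewhere other than alongside the file itself, otherwise damage to the storage can take out both. Reading every byte of a large preservation file is slow, which is the "fixity check big files" problem listed on the slide.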

25 What makes video different?
Preservation files are large Uncompressed Slow to move around Need proxy files for viewing Smaller size for quick transport over network Complicated formats Not just one file type Codecs, wrappers, frame speed, etc. For us at WGBH most of the materials we are storing are media files, audio and video. They are different from other files in that they are bigger and the content is time based: it changes as the files are played. Because the uncompressed preservation files are so large, we need to create smaller proxies that are easier to view and stream over the network. And then there is the sheer variety of file types, codecs, wrappers and frame speeds. With analog to digital conversion we have standards we can adhere to, like 10-bit uncompressed, and we can mandate the file format and codec. But born digital creates a completely different state of affairs, or the wild west as I like to call it. We have no power over the way our producers capture born-digital content or what camera they use. And each camera is creating a different file format. We make recommendations about file and folder structure and some producers do adhere to them, but that’s about it. And different camera types nest the files differently in their folder structure or, as you know, split the audio and video tracks. So what can we do to build a repository and user interface to handle complex and large files that are not consistent in their format? DAM systems really just want to manage same-format viewing proxies for fast access. But then they need some way to pull down the preservation or “Master” file for re-use. What do you do if those file formats are all different? And the issues with that are as much about network pipes not being big enough for large files as about systems management.
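To give a sense of scale, the data rate of uncompressed video can be estimated from frame size, bit depth and frame rate. The figures below are assumptions chosen for illustration (standard-definition 720x486 frames, 4:2:2 chroma subsampling at 10 bits per sample, 30000/1001 frames per second, picture data only with no audio or container overhead); the talk does not state WGBH's exact capture parameters.

```python
from fractions import Fraction


def uncompressed_bytes_per_hour(width, height, bits_per_sample, frames_per_second,
                                samples_per_pixel=2):
    """Estimate picture data per hour of uncompressed video.

    4:2:2 sampling averages 2 samples per pixel: one luma sample plus,
    on average, one chroma sample.
    """
    bits_per_frame = width * height * samples_per_pixel * bits_per_sample
    bytes_per_second = bits_per_frame * Fraction(frames_per_second) / 8
    return float(bytes_per_second * 3600)


# Assumed example: SD, 10-bit, 4:2:2, NTSC frame rate
sd_10bit = uncompressed_bytes_per_hour(720, 486, 10, Fraction(30000, 1001))
print(f"{sd_10bit / 1e9:.1f} GB per hour")
```

Under these assumptions an hour of picture data comes to roughly 94 GB, which is why proxies are needed for viewing over the network.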

26 New systems New storage

27 Software / Network File management Needed for access to files
Where are the files? Needed for access to files Large preservation files Smaller access, proxy files Network speed Larger files, need faster network to meet speed expectations With digital files, we also need software and networks to manage the files and move them around. For software there are Vendor solutions vs. open solutions. For the network, the speed of moving files around impacts access expectations.

28 Issues with current mgmt. systems/software
Preservation not a priority Interface issues Access vs. Preservation IT relationship Tech support Vendor reliance issues Need library based system for Archivist needs rather than traditional IT company needs Expense License cost Development Customizations We currently use a vendor system to manage our files. So why don’t we want to stay with our current vendor system? Well, first and foremost, the interface is not in the least intuitive and is difficult to learn, making user adoption extremely hard. It is difficult to customize if we want to change or add features. And it’s expensive: we have to pay a license fee to Open Text and Oracle, plus professional services if we want anything major changed or customized. Also, we have to rely on our IT dept. for technical help and they don’t really understand our needs as a library. They are focused on business software needs such as telephone, desktop services, etc. I’m sure we’re starting to paint a familiar picture: vendor solutions failing to adapt to local needs, workflows and budgets.

29 Access repository as preservation
Fedora repository only stores proxy files Great interface Great search and indexing Faceted searches No preservation files or migration process For our website we are using a Fedora repository and a Blacklight/Solr front end. It’s easy to use, looks nice and we can find stuff. The issues with our website as a repository are the following. Currently our Fedora repository only references the proxy files, for streaming and viewing on the web. It has no preservation activity. But it’s flexible, it’s easy to configure the interface, and we think we can extend its function to include preservation. Our original plan was to have Fedora sit on top of our DAM system: the DAM system would be the storage and Fedora would be the common manager that allows a variety of web applications to display the content. Potentially Fedora could become a DAM system, but it would take much more development to build out. But now we are leaning towards another solution.

30 Technology Mix: This is an example of the combination of technology and software bundles needed to build a system. I’m trying to show that it’s complicated. And different software solutions address different needs; it’s a puzzle. A vendor will give you all this in a package. One advantage of building it yourself is you can pick components that speak to your specific needs.

31 New Tools Combine preservation system with access system
Better interface Flexible design Easy to evolve We have undertaken an NEH project to build an open source DAM system to manage preservation and access of media materials using a Fedora, Hydra, Blacklight technology stack. This is a one-year test project to get the Hydra head implemented and to test a number of audio and video file formats. We won’t be able to test all file formats, codecs or wrappers, but we hope to get to a place where we can share a working prototype, based on our HSM integration, that is extensible enough to be scaled up or down to suit the needs of different cultural organizations. It might not work. We might not get as far as we planned, but we feel open source tools are at a stage where we can tackle the types of files currently unserved or underserved by cultural institutions.

32 Hydra project Blacklight Hydra heads Hydra mgmt. layer
Fedora repository HSM storage system The overall idea is that we’ll have a common repository with different Hydra heads serving as different web access points into the repository. So depending on authorization and security permissions, different users have access to different content. Our external users would have access to the proxy files, as in Open Vault, and internal users would have access to the preservation files, much as in our DAM system. But it would all be in the one repository, so there is no need to duplicate content. Fedora is used for storage, although I’m not sure we will actually store video files in Fedora. Solr is the indexer, Blacklight is for browse and search, and Hydra is for ingest, describing and managing. Hydra adds micro services on top of the repository that allow for rapid development of web applications and user interfaces. It’s very friendly and flexible. It was built to be the front end of a Fedora repository. We are very hopeful that it will be a great solution for us: a flexible repository, easy to evolve, based on an open source community, and great for users. We are not embarking on this alone. We have 2 partners, WNYC New York Public Radio for audio files and South Carolina Educational Television for video and audio, as well as advisors including Tom Cramer, John Dunn, Adam Wead, and Richard Wright at PrestoPrime. We see the need to get the best advice from the community.

33 Hydra Fundamental Assumption #1 & 2
No single system can provide the full range of repository-based solutions for a given institution’s needs ...yet sustainable solutions require a common repository infrastructure. No single institution can resource the development of a full range of solutions on its own, …yet each needs the flexibility to tailor solutions to local demands and workflows In particular I like the Hydra philosophy. It is very community driven and depends on the collaboration of institutions.

34 New Storage Types and Costs
Need hierarchical storage (HSM) Video files are large Spinning disks are expensive Tape can help save cost Tape copies/migration can be automated For storage we use a SUN SL8500, an HSM robotic tape system that lays the files onto LTO-4 tape. Spinning disk is expensive, so we only keep proxy files for immediate on-line access, files under 1GB, and larger files that have been very recently ingested or staged. Those large files are quickly written back to the LTO-4. Later retrieval of these high-res files can be extremely time consuming depending on network load (download shares the network with the rest of the foundation). We are told we should plan for migration every 3-5 years, both for data tape integrity and to make sure we keep up with technology changes and software versioning. This is expensive, and as the digital collection builds, migration can take a long time.
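The disk-versus-tape rule described here can be written down as a small policy function. This is only a sketch of the rule as stated in the talk, not the HSM's actual configuration: the 7-day window standing in for "very recently" and all of the names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ONE_GB = 1_000_000_000
RECENT_WINDOW = timedelta(days=7)  # assumption: the talk says only "very recently"


@dataclass
class StoredFile:
    name: str
    size_bytes: int
    is_proxy: bool
    last_staged: datetime  # time of ingest or most recent recall from tape


def keep_on_disk(f: StoredFile, now: datetime) -> bool:
    """True if the file should stay on spinning disk.

    Proxies and small files always stay online; large files stay only
    while they are recent, after which the tape copy is the one relied on.
    """
    if f.is_proxy or f.size_bytes < ONE_GB:
        return True
    return now - f.last_staged <= RECENT_WINDOW
```

Keeping the rule this simple makes the cost trade-off visible: widening the window improves retrieval time for recent work at the price of more spinning disk.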

35 Hardware/Storage media: HSM
Access Online XX bytes Spinning disk Offline Nearline Preservation (offline) Robotic tape library system LTO-4 data tapes 2 copies One stored off site Migration needs 3-5 years Both tape migration to newer formats Technology migration Hardware: to store digital files you need a computer, and for lots of files, and big files, you need a really big one, or at least one that has lots of processing power. Spinning disk for access copies, storage tape for a large library. And it needs to migrate every 3-5 years to new hardware. So we need to hook this new software into our storage system. In fact, whatever management software we use, we have to do this.

36 New Storage Types and Costs
Proprietary HSM has licensing issues Some systems license by gigabyte managed, others by tapes managed Need Open Source alternative Other systems license by number of tapes managed on and off line

37 Q & A Karen: Dave:

