Transcodes and Transfers: Exploring the MDPI Back-end Processing
Brian Wheeler, Senior System Engineer, Indiana University Bloomington Libraries. I'm Brian Wheeler, Senior System Engineer for the IUB Libraries, and I'm very excited for this opportunity to talk about the MDPI back-end processing. This is by far the most complex system I've ever created. People's eyes usually glaze over when I talk about it, so hopefully I can avoid that today.
Media Digitization and Preservation Initiative
Announced 10/2013 by President McRobbie. Goal: digitize and preserve rare and unique time-based media in the University collections by 2020. Around 272,000 objects identified for digitization. Partnership with Memnon Archival Services; IU will digitize fragile objects. More Information. A quick review of MDPI: it was announced by President McRobbie in 2013 after several years of study. The purpose is digitizing rare and unique audio and video media by the University's bicentennial. More than a quarter of a million objects have been identified for preservation. The bulk of the digitization will be performed by Memnon Archival Services, a division of Sony; they have facilities at 10th and the Bypass. Fragile items will be digitized by IU staff located next door to the Memnon facility.
My MDPI Timeline 2/2011: Provide technical information for pilot
6/2013: Council of Enterprise Architects meeting. 3/2014: Basic processing architecture defined. 4/2014: Implementation begins. 5/2015: First test batches from Memnon. 6/2015: First production batches processed. In 2011 I was asked to provide technical information for the pilot project which would eventually become MDPI – things like storage cost estimates, video processing questions, etc. During this phase the production requirements were starting to come together, and I was feeling sorry for the poor fool who would eventually have to implement this. In June 2013 I was asked to attend a meeting of the UITS Council of Enterprise Architects to discuss some of the technical questions I'd been asking. At the meeting I found out that I was doing the implementation; it was kind of a surprise. Over the next 10 months I worked with UITS to define the processing architecture. I started the prototype software implementation in April 2014, and the first test batches of content from Memnon were processed just over a year later. The first production content was processed on June 12th, 2015 – two years after I officially became involved.
Definitions Derivative – Something created from a master file
Thumbnail image. Transcode – converting a file from one format to another (preservation video -> streaming video derivative). Physical Object Database (POD) – software used by the MDPI project to track processing of physical objects. Scholarly Data Archive (SDA) – IU's tape-based storage system. 1 Petabyte = 1,000 Terabytes = 1,000,000 Gigabytes. A derivative is something that's created from a master file; in the image world, a good example would be a thumbnail image. Transcoding is the conversion of one file format to another. A preservation video is huge and hard to manage, so it is transcoded into a lower-quality streaming video derivative that can be used easily. The Physical Object Database (POD) is software used to track the MDPI physical objects; it's also used by the QC staff to interact with the back-end processing software. The Scholarly Data Archive is IU's tape-based storage system where all of the MDPI data is stored. I may refer to it as HPSS, since that's the name of the software running underneath. The data sizes for MDPI are a little bigger than what most people are used to: a petabyte is a thousand terabytes, or a million gigabytes, and the total MDPI content is going to be over 6 petabytes – 6 million gigabytes. When I say "I" or "we" when describing the software functionality, I really mean what I told the software to do. The system is automated, so I'm not actually doing many of those things.
Processing Overview Every object must be verified … and processed
Is the barcode in the POD? Are the files from the digitizer complete? Is it stored correctly onto SDA's tapes? Does the object match what we're expecting to get? … and processed: create derivatives for delivery, gather metadata for preservation, perform quality control. So what happens to all of the objects as they're processed? We have to verify that the objects are what we're actually expecting and not damaged: the objects may be incomplete, not stored correctly on the tapes, have had errors during transfer, or may simply have been created incorrectly by the digitizer. Once they've been determined to be what we're expecting, derivatives are created for eventual delivery. Preservation metadata for the object is gathered from different sources, including IUCAT and the POD. Beyond automated processing, a small portion of objects have quality control checks performed by humans.
Conceptually Simple Conceptually, it's easy. The objects get uploaded to SDA, a bit of magic happens, and then we get the derivatives and preservation data
The devil is in the details!
Unfortunately, it isn't quite that simple – reality is a harsh mistress. There are lots of things I had to consider when building the magic part of this process...
Technical Inspiration
My First Job; Digital Library's Image Processing System; Video Streaming Service Transcoding; Digital Library's Archiver System; LEGO Robotics Programming; IU Computer Hardware Design Class; 20 years of Perl Programming and Unix System Administration. When building this system, I drew a lot of inspiration from my previous experiences. [ask who's worked in a factory] My first job out of high school was Quality Assurance Technician in a factory that supplied parts for Toyota in the early 90s. The factory used a just-in-time methodology, which is very similar to the MDPI model, and in that job I learned a lot about how to design quality control systems. For those of you who know Jim Champion, he was an engineer at that factory when I was there. The ImageProc system that I created for the Libraries utilizes transaction-based processing and is highly threaded, and when the Library's Video Streaming Service was created, I modified ImageProc to do video transcoding. The Archiver tool that's used in the Digital Libraries to archive data to SDA was the first tool I wrote to automate SDA access. My experience with robotics programming came in handy when creating image recognition algorithms. I took a computer hardware design class here at IU where we had to recreate a mid-60s minicomputer; I learned a lot about state machines. And finally, 20 years of Perl programming and system administration turned out to be useful.
Overall Considerations
Murphy's Law: transfer corruption, transcoding failures, machine crashes. Processing must be reliable: available 24/7; fail-safe, with no data loss or corruption; extensive logging required to track down issues. Staff must be able to direct processing: failed objects require a human decision, and staff can QC objects and pass or fail them. Murphy was an optimist: what can go wrong will go wrong. Files get truncated or corrupted during transfers, the transcoding can fail, or machines can crash. There's only a limited amount of time available to process a day's worth of content, so it's important that the system be running all the time. Since this is preservation, data loss and corruption are not acceptable. Extensive logging is required to make sure that in the event of a problem, the cause and the scope of the problem can be determined and resolved. The QC staff need to be able to direct the processing system in the event that an object fails for some reason, to retry it or schedule it for redigitization. Additionally, the staff manually checks a percentage of the objects to verify that there are not any systemic issues.
Daily Data Quantity Estimated peak transfer is 9TB per day
Enough to fill 13,000 CD-R discs! Transfer over common 1 Gigabit Ethernet takes more than 26 hours. Many hours to transcode daily: up to 117 hours of video and 283 hours of audio content, with nearly 54 hours of computer time required for transcoding. More than 3 days are needed to handle 1 day of data! When I first learned about these numbers, I wasn't sure if it was even possible to handle this much data. 9TB is a lot of data: 13,000 CD-Rs stacked without cases is nearly 50 feet tall – every day, for around 3 years. Transferring the data between machines wasn't feasible with the existing network infrastructure, since it would take more than a day with 1GbE. In addition to the size of the objects, the content length is also daunting: processing a day's worth of audio and video takes more than 2 days of computer time. When run one at a time, back-to-back, it would take more than 3 days to process a single day's content.
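As a rough sanity check on those figures (assuming decimal terabytes and roughly 75% effective throughput on the link): 9 TB is about 72,000 gigabits, and at 1 Gb/s that is 72,000 seconds, or roughly 20 hours of raw wire time, so with real-world protocol overhead the transfer easily exceeds 26 hours. The same arithmetic at 10 Gb/s gives about 2 hours raw, which lines up with the "less than 3 hours" figure on the next slide.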
Transcoding Hardware 4 Servers for Transcoding
10 Gigabit Ethernet to SDA, 136 GB RAM, 48 vCPU, 1TB fast disk array. Like most of life's problems, it can be partially solved by throwing more hardware at it. For transfers, we were able to utilize 10 GbE connections to SDA, which cut the transfer time down to less than 3 hours per day. For transcoding, after a lot of math and testing, I determined that the processing could be handled using 4 fairly hefty servers. This configuration reduced the transcoding time from 54 hours down to 13½ hours. For reliability, the servers have redundant systems and monitoring facilities which can alert us if a component fails or warn us if a failure is predicted.
Transcoding Software: all processing parallel; minimize waiting.
FFMPEG software is used for transcoding and can use many CPUs. Multiple objects are processed concurrently: up to 3 video objects, 24 audio objects, or a mix, per transcoder batch. To minimize waiting, transcoder object batches are scheduled independently and objects are queued for processing as they become available. Processing is transaction-based: objects cannot be partially OK. With 192 CPUs in the transcoding farm, it's important to make sure that all of the CPUs are busy. For many media types, FFMPEG will use multiple CPUs when transcoding; for video, I've seen it use 15 CPUs. The scheduling software will schedule a batch of objects to a transcoder so all its CPUs can be used – up to 3 videos or 24 audio files simultaneously, or a mix. Each transcoder is scheduled individually to keep them busy, and individual objects are put into the transcoding queues as soon as they are ready; so, if only one object is ready for processing, it will start transcoding without waiting to fill up a transcoder. If one part of object processing fails, the entire object is marked as failed. This avoids the question of what to do if something "mostly" worked.
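As a rough illustration of that batching pattern, here is a minimal Perl sketch of a per-transcoder batch: fork one worker per object and run FFmpeg in each. The file paths, derivative names, and encoder settings are illustrative assumptions, not the MDPI production values.

```perl
#!/usr/bin/perl
# Minimal sketch of per-transcoder batch processing; paths and encoder
# settings are placeholders, not the actual MDPI configuration.
use strict;
use warnings;
use Parallel::ForkManager;

my @batch = @ARGV;    # e.g. 3 video masters or up to 24 audio masters
exit 0 unless @batch;

my $pm = Parallel::ForkManager->new(scalar @batch);

for my $master (@batch) {
    $pm->start and next;                 # fork one worker per object
    my $derivative = $master;
    $derivative =~ s/\.\w+$//;           # strip the extension
    $derivative .= '_stream.mp4';
    # FFmpeg uses multiple CPUs on its own; running several objects at once
    # keeps the rest of the transcoder's 48 vCPUs busy.
    my $rc = system('ffmpeg', '-y', '-i', $master,
                    '-c:v', 'libx264', '-c:a', 'aac', $derivative);
    # Transaction-based: any failure marks the whole object as failed.
    $pm->finish($rc == 0 ? 0 : 1);
}
$pm->wait_all_children;
```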
SDA as Primary Storage SDA has real benefits … and some real drawbacks
Capacity to store the 6PB that MDPI will need. Tape systems are cheaper than spinning disks. Uses a disk cache to speed up access. The UITS SDA crew is awesome. … and some real drawbacks: SDA's implementation requires re-reading the tape copy to ensure data integrity; concurrent accesses to a single tape kill throughput; it was not designed for this kind of use. MDPI will create over 6 petabytes of data over the project's lifetime, and SDA is the only storage system on campus that can handle that amount of data for preservation. Tape is cheaper in the long term because when a tape isn't being used, it doesn't require any power. SDA utilizes a disk cache that stores frequently requested files to minimize access times, and files can be staged from tape onto disk in advance if access will be needed in the near future. The UITS crew is great: they've been incredibly supportive, and so far they haven't blacklisted me, despite my constant pestering! Due to the way that SDA is implemented, the transfer from the SDA disk cache to the tapes is not verified. This is a preservation project, so we have to verify that the data on the tape matches what the digitizer has sent us; we have to discard the SDA disk copy and re-read the tape to verify the integrity of the content. This substantially increases the amount of time required to retrieve data from SDA – it can take several minutes to retrieve a single file from tape. Since tape is linear, if multiple processes are trying to access a tape, it will spend a lot of time fast-forwarding and rewinding to find files on the tape; this becomes especially bad when one process is writing, since writes can only occur at the end of the tape. SDA was never designed for this kind of usage, so a lot of work was required to optimize access times.
SDA Interaction Tape reads and writes are serialized
Waits for all tape writes to finish before starting reads. All staging from tape is centralized: requests are sorted by tape order, requests for the same tape are bundled together (removing duplicates), and files already staged to disk are skipped. Different tape pools are used for masters and derivatives. To avoid trying to read and write a single tape at the same time, the processing system is designed to wait for all writes to complete before reads are started. This increases overall system throughput, even though transcoding has to wait until the digitizer has stopped transferring new objects. All of the requests for staging files from tape to the SDA disk cache are sent through a single process which can guarantee that all of the tape reads happen in the most efficient way: re-ordering the requests to be in the same order they are stored on tape, bundling similar requests, removing duplicates, and skipping files we already have on disk. The derivatives are stored on different tapes than the masters, so there is no contention when the masters are read from tape while the derivatives are being written.
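To make that ordering concrete, here is a minimal sketch of the kind of filtering and sorting the central staging process performs. The request fields (tape ID, position, on-disk flag) are assumptions for illustration, not the real data structures.

```perl
# Sketch of centralized stage-request ordering: sort by tape and position,
# bundle per-tape work, drop duplicates, skip files already on disk cache.
use strict;
use warnings;

sub order_stage_requests {
    # each request: { file => ..., tape => ..., position => ..., on_disk => 0|1 }
    my @requests = @_;
    my %seen;
    return
        grep { !$seen{ $_->{file} }++ }                # remove duplicate requests
        grep { !$_->{on_disk} }                        # skip files already staged to disk
        sort {    $a->{tape} cmp $b->{tape}            # bundle requests for the same tape
               || $a->{position} <=> $b->{position} }  # and read each tape in order
        @requests;
}
```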
Software Environment: Linux, Perl 5, Cron. Stable & efficient.
Most tools included and installed by default; blank machine to working transcoder in less than 30 minutes. Perl 5: rapid development, CPAN, and text manipulation and system tools are what it does best! Cron: system components start automatically – no need to restart daemons; only minimal locking required; avoids memory leaks. Some technical details – feel free to cover your ears for this slide! The software environment for this project is based around Linux, Perl, and cron. Linux is the base OS: it is stable and efficient, and I was able to fully utilize the server hardware with minimal tweaking. Most of the tools needed to make this work were included and installed with the OS, or readily available. In the event that I need to rebuild a transcoder, I can have one up and running in less than 30 minutes. Perl 5 provides a rapid development environment since it is script-based. CPAN – the Comprehensive Perl Archive Network – provided many of the libraries needed to handle things like HTTP, XML manipulation, and Amazon S3 integration, and a module called Parallel::ForkManager made it trivial to create a highly parallelized environment for processing. There's a lot of text manipulation and pattern matching in this project, and Perl excels at that; additionally, there's a lot of system interaction, which Perl handles very well. I'm using cron to start all of the programs: it runs the system programs every minute, and the only locking required is for ensuring that only one copy of a program is running at a time. If another copy is already running or there's nothing to do, they just exit. This avoids memory leaks, since a fresh copy of the program is started each time. Currently the code is spread across about 100 files.
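The "cron starts it every minute; it locks or exits" pattern is simple enough to show in a few lines. This is a generic sketch with a made-up lock file path and work query, not one of the actual MDPI programs.

```perl
#!/usr/bin/perl
# Run from cron every minute, e.g.:  * * * * *  /usr/local/bin/mdpi_step.pl
use strict;
use warnings;
use Fcntl qw(:flock);

# Only one copy at a time: if the lock is already held, another copy is running.
open my $lock, '>', '/tmp/mdpi_step.lock' or die "lock: $!";
exit 0 unless flock $lock, LOCK_EX | LOCK_NB;

my @work = ();         # e.g. ask the POD which objects are ready for this step
exit 0 unless @work;   # nothing to do; cron will start a fresh copy in a minute

# ... process @work, then exit so all memory is released ...
```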
System Integration
Activity                                    Protocol
SDA data access                             HSI command line wrapper
SDA change ownership                        SSH wrapper to SDA node
POD data access                             XML via HTTP
Memnon status change notifications (PAM)    SOAP via HTTP
Avalon metadata ingest                      JSON via HTTP
Avalon derivative storage                   S3 via HTTP
IUCAT metadata                              z39.50
ADS for master access control               LDAPS
MDPI statistics                             MySQL
Media Transcoding and Identification        Wrapper around FFMPEG and FFPROBE
Last technical slide! The processing system ties lots of different resources together when processing. Unfortunately, many of them use different protocols! This is a list of most of them; I'm sure I've forgotten some.
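As one example from that table, SDA access goes through a thin wrapper around the HSI command line client. The sketch below shows the general shape; the exact hsi options and error handling are assumptions, not the production wrapper.

```perl
# Thin wrapper around the hsi command line client for SDA (HPSS) access.
use strict;
use warnings;

sub sda_get {
    my ($hpss_path, $local_path) = @_;
    # "get local : remote" copies a file out of HPSS into the local path.
    my $rc = system('hsi', '-q', "get $local_path : $hpss_path");
    die "hsi get failed for $hpss_path\n" if $rc != 0;
    return $local_path;
}

sub sda_put {
    my ($local_path, $hpss_path) = @_;
    my $rc = system('hsi', '-q', "put $local_path : $hpss_path");
    die "hsi put failed for $local_path\n" if $rc != 0;
    return $hpss_path;
}
```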
Conceptually Simple Remember this slide? Here’s where we started
Reality This is what actually happens with the objects after dealing with all of the details and working around the quirks in the different systems involved. In computer science-y terms, this is a state machine diagram. Each bubble is a different state that an object can be in. A successful object will go through 19 states between the time it is uploaded and when it is archived. The implementation continues to evolve. [Show online version for clarity]
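To give a flavor of what "state machine" means here, a toy sketch: each object carries a current state, and each processing step knows which states it may move the object into. The state names below are a small, invented subset (only qc_wait is mentioned later in this talk); the real diagram has many more states and transitions.

```perl
# Toy state-transition table; the state names are illustrative only.
use strict;
use warnings;

my %transitions = (
    uploaded     => [ 'basic_qc' ],
    basic_qc     => [ 'tape_wait', 'rejected' ],
    tape_wait    => [ 'staging' ],
    staging      => [ 'transcoding' ],
    transcoding  => [ 'qc_wait', 'failed' ],
    qc_wait      => [ 'distributing', 'failed' ],
    distributing => [ 'archived', 'failed' ],
);

sub can_move {
    my ($from, $to) = @_;
    return grep { $_ eq $to } @{ $transitions{$from} || [] };
}

print can_move('transcoding', 'qc_wait') ? "ok\n" : "illegal transition\n";
```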
Processing Walk-through
Everyone take a deep breath -- we’re going to follow an object through the system
Processing walk-through
Following a single object through the system. MDPI Barcode; Betacam SP tape from IU Archives; AR-266, B-Town Sounds #507, Beth Lodge-Rigal, August 12, 1998; digitized by Memnon on January 29th, 2016. This is a recent object I picked at random. It's a video tape of a B-Town Sounds performance that came from the IU Archives and was digitized at the end of January.
Object Delivery
Memnon uploaded this object along with 300 others to SDA starting at 3:35am Feb 1st. At 5:27am, this object finished transferring and the POD was notified. After the transfer is complete, the first thing that we do is change ownership from Memnon to IU. Usually this is a near-instant operation, but this time it took 10 minutes because SDA was busy. At 5:37 basic_qc started. A number of things are checked: the filenames are correct, the data was transferred to SDA correctly, and all of the expected files are present. It took 6 seconds to do those checks. Before processing takes place, a copy of the files has to be written to SDA tape, so at 5:37 the object began waiting for SDA to make the tape copy from the data on the SDA cache disk. Sometimes things go wrong: if the object had been a duplicate or unknown to the POD, the staff would have been notified and processing stopped; if any of the delivery steps had failed, the object would have been rejected and the QC staff would have the option to discard the object or retry it.
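A hypothetical sketch of what a basic_qc step can look like follows. The filename convention and manifest format here are invented for illustration; only the kinds of checks (naming, presence, integrity) come from the talk.

```perl
# Sketch of basic QC: filename pattern, file presence, and checksum checks.
use strict;
use warnings;
use Digest::MD5;

sub basic_qc {
    my ($barcode, $manifest) = @_;    # manifest: { filename => expected_md5, ... }
    for my $file (sort keys %$manifest) {
        return "bad filename: $file"  unless $file =~ /^MDPI_\Q$barcode\E_/;
        return "missing file: $file"  unless -e $file;
        open my $fh, '<:raw', $file or return "unreadable: $file";
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        return "checksum mismatch: $file" unless $md5 eq $manifest->{$file};
    }
    return;    # undef means the object passed basic QC
}
```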
Tape Motion
After two hours SDA had finished writing a copy of the object to tape. To verify that the tape copy has been written correctly, we have to erase the SDA disk copy so the next read is from the tape. At 7:36 the object began purging from the disk cache. Like changing ownership, this is something that usually happens in seconds, but for this object it took 7 minutes; at 7:43 the purge was complete. To make sure that SDA isn't writing the same tapes that I want to read from, processing will wait until all of the incoming objects have been purged before continuing. By 10:06 all of the incoming objects from the night's upload had been written to tape and erased from the disk, so all of the night's objects (including this one) began staging from tape to SDA disk. Finally, at 12:42pm the object had been read from tape back onto SDA disk, and the object was then accepted for processing.
Transcoder Queueing
Even though the object was ready for processing at 12:42, it took 45 minutes for the transcoders to finish processing the objects that were ahead of it in line. The object was queued to be processed on transcoder 3, along with 2 other video objects, at 1:27.
Transcoding
This is the transcoding stage – where most of the heavy lifting takes place. One minute after it was queued on transcoder 3, the object started downloading from SDA's disk to the transcoder; downloading took two minutes. Because this object was digitized by Memnon, we have to inform their system that we have a valid copy, which took place at 1:30. The object then began a more thorough automated QC process. This verifies that the files are the right format and the media has the right parameters, like video frame size and the right number and type of audio tracks. Since all of the data is on the transcoder, the check is very fast – roughly 1 second. Derivative processing began at 1:30 and finished 16 minutes later. During processing, all of the derivatives that were created were added to the object stored on SDA: the 3 streaming derivatives, a thumbnail sheet, a closed caption extraction, and preservation metadata files. At 1:46 the object had completed processing and began waiting for a QC check by a human. If any of the processing steps had failed, the object would be marked as failed and the QC staff could either retry it or discard it in favor of a newly digitized version.
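That "right parameters" check can be done by parsing FFPROBE's JSON output. A sketch is below; the expected frame size and audio stream count are illustrative values, not the real MDPI specifications.

```perl
# Check frame size and audio stream count using ffprobe's JSON output.
use strict;
use warnings;
use JSON::PP;

sub check_video_params {
    my ($file) = @_;
    my $json = qx{ffprobe -v quiet -print_format json -show_streams "$file"};
    my $info = JSON::PP->new->decode($json);
    my @video = grep { $_->{codec_type} eq 'video' } @{ $info->{streams} || [] };
    my @audio = grep { $_->{codec_type} eq 'audio' } @{ $info->{streams} || [] };
    return "no video stream"       unless @video;
    return "unexpected frame size" unless $video[0]{width} == 720 && $video[0]{height} == 486;
    return "expected 2 audio streams, found " . scalar @audio unless @audio == 2;
    return;    # undef means the parameters look right
}
```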
Object Manual QC
The objects will spend most of their time in the system waiting for QC. Memnon gives us 30 days to check video objects and 40 days for audio objects before finalizing them. The QC staff will manually check a percentage of the items to identify any systemic issues. After reviewing the content, the QC staff can either pass or fail the object. In the event that the object has failed manual QC, the QC staff can ask the system to delete the object so Memnon can send a new one, or distribute it anyway if the error was on the original media. Since not all objects will get manually QC'd, there is a timeout process which automatically passes any object whose window has expired. This object is a video that was not manually checked, so it was automatically passed 30 days after it was transferred – March 2nd at 4:19pm. All objects which have passed are automatically marked for distribution.
Avalon Distribution
We're now on the tail end of processing, and things start moving fast again. Objects which have made it this far will be distributed to an instance of Avalon Media System that is accessible to collection managers only. Avalon objects may consist of multiple MDPI objects; for example, a multi-LP set is a single Avalon object consisting of several MDPI objects. When all of the constituent objects for an Avalon object are ready to distribute, the processing continues. In this case, there is a 1:1 mapping between the Avalon and MDPI object. At 4:42 the object began ingesting into Avalon. At this time copies of the derivatives are made on the streaming server storage, metadata is structured to be compatible with Avalon's ingest, and content segments are identified in the media. By 4:44 it had been submitted to Avalon and was waiting for the ingest to complete. Less than a minute later, Avalon indicated a successful ingestion, and the object was marked as being ready for archiving. If, for some reason, Avalon rejects the object, the staff can retry the ingestion or just archive the object without it being in Avalon.
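The ingest itself is JSON over HTTP, per the integration table earlier. Here is a hedged sketch of that kind of call; the endpoint URL, payload fields, and file names below are placeholders, not Avalon's actual API.

```perl
# Generic JSON-over-HTTP ingest call; URL and fields are placeholders.
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP;

my $ua = LWP::UserAgent->new(timeout => 60);
my $payload = {
    collection => 'IU Archives',
    fields     => { title => 'B-Town Sounds #507', date_issued => '1998-08-12' },
    files      => [ { derivative => 'EXAMPLE_BARCODE_high.mp4' } ],
};
my $resp = $ua->post(
    'https://avalon.example.edu/media_objects.json',
    'Content-Type' => 'application/json',
    Content        => JSON::PP->new->encode($payload),
);
die 'ingest failed: ' . $resp->status_line unless $resp->is_success;
print "submitted; now waiting for the ingest to complete\n";
```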
Finalize Objects
Object finalization is the last step for objects in the system. The object we're following was moved out of the processing system and into an archive directory on SDA at 4:46pm on March 2nd. Overall, the object took just over 30 days to process.
Access to Completed Object
Ingested into "dark" Avalon. Collection managers have access to media for review. May be published to Media Collections Online at a later date. Show the example! The completed object is moved into a "dark" Avalon where only collection managers have access to the media. Content may be published to the public at a later date, depending on rights.
Mission Accomplished?
Processing Totals
Type                   Count    SDA Usage (G)   Duration (h)
45 RPM Disc            1,576    845             209
8mm                    225      18,072          307
Betacam                13,300   564,372         8,999
DAT                    5,610    16,103          11,694
Lacquer Disc           1,059    1,314           322
LP                     31,253   91,877          22,633
Open Reel Audio Tape   29,511   49,729          16,858
All Types              82,534   742,312         61,022
Values as of 4/3/16 (around 9½ months in) – that's nearly 7 years of content! We crossed the ¾-petabyte mark yesterday.
Problems encountered Less than 0.5% of the objects have required redelivery Murphy was right – but no data loss or corruption! Disk failures in different machines Transcoder restarts Network outages SDA and POD downtime Overloading on larger than anticipated data transfers Less than 400 objects have been rejected that required redelivery from the digitizer – nearly all were during the startup for a new format We’ve had servers with disk problems, required reboots, network problems, service downtimes, and times when Memnon has sent us way more data than we planned for -- but no data loss or corruption!
Future Directions Support for new formats Continued Improvement Film?
Video: VHS, U-matic, …; Audio: Cassette, 78 rpm, …. Continued improvement: SDA updates should speed up retrievals, plus new features and improvements, code cleanup, and bug fixes! Film? Apply what I've learned about SDA to other Digital Library processes. New formats are always on the horizon. Updates to SDA should include Tape Ordered Recall, which will shorten the amount of time I have to wait for files to come from tape, so I'm excited about that. New features and improvements are always on the horizon. There are a few places where the code is a bit of a mess – it could use some cleanup. Most of the bugs have been squashed, but there are always more. There's a persistent rumor that film will be added some day, so I have to be ready to handle that. Knowing what I now know about SDA, the other SDA-using systems can be improved.
More Than Just Transcoding
Besides driving the transcoding, there have been a lot of interesting side projects which are tightly integrated with the processing
System Monitoring. Live system monitoring, so administrators can check the status of the system. It is updated every 60 seconds and shows the count of objects in different states. As explained earlier, most of the objects in the system are in qc_wait, so they dwarf the other parts of the chart.
Statistics. Statistics are run nightly and provide several reports to track overall progress. Reports by object, date, and unit are available. The reports are emailed every day, although I'm considering creating a web page as well.
Master Access Tool. The master access tool provides content managers and QC staff a method for retrieving master media files from SDA. It's written in a combination of Perl and JavaScript using AngularJS and Google's Material Design, so it looks like an Android app. Downloaded data is cached on library servers for a period of time, and content download links can be generated and shared with others. I plan to make a variation of this tool for retrieving content that is in the Digital Library archives on SDA.
Thumbnail sheet. During processing a thumbnail sheet is created for all videos. It shows a frame capture every 30 seconds and the audio waveform for all of the audio channels in the file. Below the color bars you can see that there's a tone on audio tracks 1 & 2. I originally wrote it for debugging the transcoding, but it proved useful in troubleshooting other problems as well, so I left it in.
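FFmpeg can produce both halves of a sheet like this: periodic frame grabs tiled into a contact sheet, and a waveform image per audio track. The sketch below uses illustrative filter settings (grid size, image dimensions, interval), not the MDPI ones.

```perl
# Rough sketch: a 5x5 contact sheet of frames taken every 30 seconds, plus a
# waveform image for the first audio track. Sizes and intervals are examples.
use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 video_file\n";

system('ffmpeg', '-y', '-i', $file,
       '-vf', "select='isnan(prev_selected_t)+gte(t-prev_selected_t,30)',scale=320:-1,tile=5x5",
       '-frames:v', '1', '-vsync', 'vfr', 'thumbs.png') == 0
    or die "thumbnail sheet failed\n";

system('ffmpeg', '-y', '-i', $file,
       '-filter_complex', '[0:a:0]showwavespic=s=1600x120',
       '-frames:v', '1', 'wave0.png') == 0
    or die "waveform image failed\n";
```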
Closed Captions
MAJOR FUNDING FOR THIS PROGRAM WAS MADE POSSIBLE BY THE GENEROUS SUPPORT OF THE GRAHAM FOUNDATION FOR STUDIES IN THE FINE ARTS, THE COLUMBUS AREA VISITORS CENTER, COLUMBUS CONTAINER, AND THE
I also wrote a closed caption extraction tool that is run on every video. It isn't perfect, but it should at least give us an idea of which videos have captioning and which ones do not. That is the actual text extracted from the video. The image above the title is an example of the raw encoding for two characters in the caption; on the video, it appears as a flickering line at the top of the picture.
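One way to pull EIA-608 (line-21) captions out of a digitized video is FFmpeg's lavfi movie source with the subcc output; whether this matches the tool used here is an assumption, and the SRT output format is just an example.

```perl
# Extract line-21 (EIA-608) closed captions to an SRT file via FFmpeg's
# "movie=...[out0+subcc]" source. Output format and filename are examples.
use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 video_file\n";
# Note: filenames containing filter-special characters (colons, commas)
# would need escaping before being embedded in the filter graph.
my $graph = 'movie=' . $file . '[out0+subcc]';
my $rc = system('ffmpeg', '-y', '-f', 'lavfi', '-i', $graph,
                '-map', '0:s', 'captions.srt');
die "caption extraction failed\n" if $rc != 0;
```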
Content Section Detection
Finally, since we're digitizing the physical media from beginning to end, the files are a continuous stream of content. For audio there are usually different tracks, and video will have different program segments. Also, if the content ends before the media does, there will be long periods of blank or silent content at the end of the media. To add some structure to the content, I wrote a tool which will try to determine when sections of the content begin and end. Like the closed captions, it isn't perfect, but it is a good starting point. In the video depicted here, the software found 5 segments. It works reasonably well on things like video and LPs; it needs more tweaking on music school recital performances. One thing I've noticed: there always seems to be an awkward silence before applause during many of the recitals – probably because nobody wants to be the person who claps in the middle of a work… so many of the segments found by my tool start with applause!
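For the audio side, one plausible building block is FFmpeg's silencedetect filter: long silences become candidate section boundaries. The noise floor and minimum duration below are guesses, not the tool's actual settings, and the real tool clearly does more than this.

```perl
# Find long silences with ffmpeg's silencedetect filter and report them as
# candidate section boundaries. Threshold values here are assumptions.
use strict;
use warnings;

sub find_candidate_boundaries {
    my ($file) = @_;
    my $log = qx{ffmpeg -hide_banner -i "$file" -af silencedetect=noise=-40dB:d=3 -f null - 2>&1};
    my @boundaries;
    while ($log =~ /silence_end:\s*([\d.]+)\s*\|\s*silence_duration:\s*([\d.]+)/g) {
        push @boundaries, { end_seconds => $1, silence_seconds => $2 };
    }
    return @boundaries;    # still needs cleanup, e.g. recitals where applause follows silence
}
```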
Thank you! Questions? Try to stump me! Thank you for listening!
I’ve been working on this for far too long, so try to stump me with your questions!