Technology Fundamentals for Digital Preservation


1 Technology Fundamentals for Digital Preservation
Curriculum developed in partnership with the Digital Preservation Coalition (DPC). This POWRR Institute is generously funded.

2 What We’ll Be Learning Common Computer Systems & File Formats
We’ll become familiar with the main aspects of many computer systems we may encounter, including operating systems, file systems, and file formats. Open Source Software, Packages, & Metadata: We’ll learn how to begin choosing and deploying open source software at our institutions. OAIS Standard: We’ll gain familiarity with the main concepts of OAIS, particularly with regards to the Information Model.
Section One – Common Computer Systems and File Formats (30 mins): The first section of the module will look at some of the most common computer systems the participants may need to work with. It will provide an overview of their main features as well as a basic introduction to some useful functions. Some consideration will also be given to file formats and the issues relating to their preservation.
Section Two – Open Source Software (40 mins): Section two of the module will allow participants to understand the role open source software can play in their preservation work. The section will explain the ethos of the open source movement as well as the benefits and constraints of using open source software. It will also introduce the key open source products that are available for digital preservation in libraries and archives.
Section Three – OAIS (40 mins): The final section of the module will provide participants with a broad overview of the OAIS standard, including a deeper dive into the information model. This will include an understanding of the structure and purpose of information packages and their place in the digital preservation lifecycle.

3 Common Computer Systems & File Formats
Section One – Common Computer Systems and File Formats (30 mins): The first section of the module will look at some of the most common computer systems that we may need to work with. It will provide an overview of their main features as well as a basic introduction to some useful functions. Some consideration will also be given to file formats and the issues relating to their preservation.

4 Expected Outcomes Common Computer Systems & File Formats
Identify Windows and Unix Operating Systems and the key similarities and differences Navigate the standard file systems for these OS and use basic functions Describe the main issues relating to the preservation of common file formats Section One – Common Computer Systems and File Formats (30 mins): The first section of the module will look at some of the most common computer systems that we may need to work with. It will provide an overview of their main features as well as a basic introduction to some useful functions. Some consideration will also be given to file formats and the issues relating to their preservation.

5 What is an Operating System?
System software Manages hardware and software programs Schedules tasks Exists on all platforms PCs/Laptops Smart phones/tablets Servers Operating Systems are the layer of system software that sits between the software programs we interact with and the computer’s hardware. The Operating System manages both the software programs and the hardware, scheduling tasks relating to each to ensure the most efficient use of the computer’s capabilities. Operating Systems are required on all forms of computer platform, including PCs and laptops, smart phones and tablets, and servers. Although all Operating Systems carry out similar functions, they have their individual quirks and it is important to be familiar with these to aid your digital preservation efforts.

6 Many Flavors: The OS Family Tree
Many different types of Operating Systems exist, as can be seen in the family tree diagram here, but the majority you will encounter are either Microsoft (DOS)- or UNIX-based. For example, the leading operating systems on the following platforms are:
PCs/Laptops – more than 80% Windows, c. 10% MacOS (Unix) and c. 2% Linux (Unix)
Smart phones/tablets – c. 90% Android and c. 10% iOS, both Unix-based
Servers – Unix systems are dominant

7 Some Important Differences
Cost Licenses Customization Command Line and GUIs Storage There are many similarities and differences between Operating Systems, particularly between Microsoft (DOS)-based systems and UNIX-based systems. Some key differences to be aware of are:
Cost – Microsoft Operating System software usually costs several hundred dollars; most UNIX-based systems are available free or very cheaply as open source software.
Licences – Microsoft OSs are accompanied by strict commercial licences which restrict how they can be used and distributed. Licences for most UNIX-based systems are more open and allow for redistribution and reuse.
Customization – Microsoft OSs allow minimal customisation compared with UNIX-based systems. This is probably a negative for the user but a positive for digital preservation, as it means there is more reliable consistency between systems.
Command line and GUIs – Interaction with Microsoft OSs can be managed almost exclusively through graphical user interfaces, making the user experience simpler and more consistent. Using UNIX-based systems will almost certainly require some use of the command line for actioning processes, which can be intimidating for some users but (as with customisation) allows the user more power.
Storage – Microsoft OSs organise information into files and folders, but the actual physical locations of the data can be spread across different parts of the storage, making it more difficult to copy to a new system without relying heavily on the OS. UNIX-based systems use the terminology files and directories for their storage structure, and data is co-located so it is easier to find on the storage and to move and copy.
One small but important detail to note if moving data between UNIX and Windows systems, and vice versa, is that Windows systems use backslashes for file locations while UNIX systems use forward slashes (like website URLs). These need to be converted if data is moved; thankfully there are free tools available to automate this process.
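As a minimal sketch of this slash conversion, Python's standard pathlib module can parse a Windows-style path and rewrite it in forward-slash form (the path below is a made-up example, not one from the slides):

```python
from pathlib import PureWindowsPath

# A hypothetical Windows-style file location recorded in a manifest.
win_path = r"C:\Users\archivist\Documents\report.docx"

# PureWindowsPath parses the backslash-separated path without touching the
# disk; as_posix() rewrites it with forward slashes for a UNIX-style system.
posix_path = PureWindowsPath(win_path).as_posix()
print(posix_path)  # C:/Users/archivist/Documents/report.docx
```

Dedicated migration tools do the same thing in bulk, but it is useful to see that the conversion itself is mechanical.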

8 Getting to Know What’s Inside
As already mentioned, one of the key differences between Microsoft and UNIX-based systems is the way files are structured. In Microsoft systems the main folders to be aware of, and to look in for content for preservation, are C:\Windows, C:\Program Files, and C:\Users. In UNIX-based systems content is most likely to be in the /home (or /Users on Macs) or /mnt (/Volumes on Macs) directories, the first being the users’ home directories and the second temporarily mounted directories which may include shared content or external storage media.

9 Don’t Fear the Command Line
Before GUIs, this was the primary way to interact with computers Benefits: Fewer system resources used More control, power and precision Can automate common processes Used to run many digital preservation tools The Command Line Interface was the primary way in which users interacted with early computers, before Graphical User Interfaces became commonly available. The command line is still used by many advanced users as it brings benefits such as:
Quicker processing times, as fewer system resources are used to execute a command line instruction
Greater control over the processes actioned, as well as more power and precision in issuing instructions to the computer
The automation of common processes through simple scripting, such as creating folders/directories or moving data
Becoming comfortable with some of the simplest commands is useful for digital preservation purposes, as some tools only operate via the command line. But don’t panic: often only a few simple commands need to be used! There are differences between commands used in Microsoft and UNIX-based environments, but there are plenty of guides and introductions to their use.
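The kind of automation mentioned above — creating directories and moving data — can be scripted in a few lines. This sketch uses only the Python standard library and works inside a throwaway temporary directory, so the folder names are illustrative rather than a real workflow:

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the sketch is safe to run anywhere.
root = Path(tempfile.mkdtemp())

# Simulate a file arriving in an "incoming" folder.
incoming = root / "incoming" / "report.txt"
incoming.parent.mkdir(parents=True)
incoming.write_text("example content")

# Create a dated ingest folder and move the file into it -- the kind of
# repetitive task that is tedious by hand but trivial to script.
ingest = root / "ingest" / "2024-01-15"
ingest.mkdir(parents=True)
shutil.move(str(incoming), str(ingest / incoming.name))

print((ingest / "report.txt").exists())  # True
```

The same steps could equally be written as a shell script; the point is that once a process is scripted it can be repeated identically across thousands of files.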

10 File Formats: Just Keep the Bits…
We’ll now move on to consider the main issues relating to file formats and digital preservation. There are 3 key issues relating to the preservation of file formats: retaining the original bitstream of the file, making sure it isn’t altered over time, and providing access to the file.

11 What’s In a File? SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 200x392 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2… To start, it is important to understand what is in a file. At a fundamental level, all digital data is stored as a series of 0s and 1s. These are “binary digits” or “bits”. These bits are interpreted by the computer to render the information we ultimately see onscreen. Most files will contain a file header, which holds information about the file. This can include the file format and version and information about the contents of the file. The column above shows an example of information from a JPEG file header; useful metadata can often be extracted from this automatically. Using the bits and the information in the file header (and sometimes a footer), the computer will render the file onscreen, be it an image, document, spreadsheet, etc.
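Format identification often starts with the first few bytes of the header, the so-called “magic numbers”. This is only a toy sketch with three hand-picked signatures; real tools such as DROID match against a far richer signature registry:

```python
# A minimal sketch of format identification from "magic numbers" in the
# file header. The three signatures below are well-known published values.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG (starts with the SOI marker)",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF": "PDF",
}

def identify(first_bytes: bytes) -> str:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return name
    return "unknown"

# In practice you would read the header from disk, e.g.:
#   with open(path, "rb") as f: header = f.read(8)
print(identify(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # JPEG (starts with the SOI marker)
```

Note how the JPEG example begins with the SOI marker (FF D8) followed by the APP0/JFIF segment, exactly the structure listed on the slide.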

12 What Are the Risks? Media obsolescence
Media failure or decay (such as “bit rot”) Natural / human-made disaster File format obsolescence Digital preservation is necessary because the files we keep face a variety of risks. If left alone, they are not likely to survive intact into the future in the way that physical documents might. So, what are the key risks faced? They include:
Media obsolescence – this is when storage media, such as tape, floppy disks or CDs, become obsolete and you no longer have the hardware needed to read them. Example – most new laptops don’t have a disc drive.
Media failure – storage media is a commodity product and tends to have a reasonably short lifespan. Most hard disks tend to have a reliable lifetime of around 5 years. A commonly cited example of media failure is ‘bit rot’: though all forms of storage media are subject to different forms of decay, bit rot refers to the loss of data when the small electronic charge of a bit is ‘flipped’ from 1 to 0 or vice versa; this can also happen due to cosmic rays or other high-energy particles.
Disasters that damage digital data can come in many forms, from fire, flood, etc. to human-caused issues such as viruses and malicious attacks.
Finally, file formats themselves can become obsolete as the software they were created in goes out of use (backwards compatibility is not always guaranteed) or the file format itself is no longer used. This is a particular concern for proprietary formats, as they are more difficult to reverse engineer.
Worse still, any loss may not be entirely clear to the casual observer. In fact, unless some care is taken to manage and preserve the data properly, it may require massive manual effort just to work out whether any of your data has become damaged.
Images by Aldric Rodríguez Iborra, Erin Standley, Marie Van den Broeck, Edward Boatman and Dilon Choudhury from the Noun Project

13 What Is the Result? So what happens when these risks bite? The outcome is often unpredictable: Media degradation will often lead to a complete failure of the storage device. In other words, you can’t read back any of the data stored on it. In some cases, damage might be more subtle. Some of the bits in a bitstream might become lost or damaged. This might lead to an obvious result, as in the case of this before (left) and after (right) screenshot of a digitised newspaper page. Alternatively, damage might be more difficult to recognise visually. Some of the damaged newspaper pages from the same collection as the one here looked fine until you zoomed in, when they became fuzzy. Although the bitstream was damaged, the viewer software did its best to render the image without informing the user. Things are not always as they seem! Image courtesy of the British Library

14 Stuff Happens Whenever a digital collection is moved, processed, curated or altered in any way.... things can go wrong! Network dropouts at critical times Disks get full, subsequent data copied there is lost Software bugs lead to unexpected results Human error leads to all sorts of issues Stuff happens a lot more at scale! It’s important to remember that the risks described are always present, but also that whenever a digital collection is moved, processed, curated or altered in any way, things can and will go wrong! This can include:
Network dropouts at critical times, such as in the middle of moving a large number of files. This can damage files, or result in incomplete transfers.
Without careful planning, storage disks can get full and any subsequent data copied there can be lost.
Bugs in software can lead to unexpected results, including changes to files and data copied to unknown locations.
And the biggest danger is often human error; a simple loss of concentration can sometimes lead to problems like files being accidentally deleted.
It’s also important to remember that these problems can be multiplied many times when doing things at scale!

15 How Do We Solve These Problems?
Keep more than one copy Refresh storage media Know what you have Integrity check your data (also called “Fixity”) Use ‘open’ formats Carry out preservation actions So how do we solve these problems? Keeping more than one copy of the data is essential. 2 is okay, 3 is great. Some organisations store 4 or more copies. It is also recommended to use more than one form of storage media. Note that digital preservation is a trade-off between risk and cost. The more copies the better, but keeping more copies is costly. There is no precise answer for the perfect number of copies as the sweet spot is likely to depend on your own circumstances. Keep one copy in a different geographical location – provides some insurance against natural or unnatural disaster Storage media will degrade over time, so be prepared to periodically migrate data to new storage media. Having a refreshment plan is good practice. Understanding what you have in your collections is key, even if this is only a list of the number of files, their sizes, types and locations. There are tools to help generate this type of list (sometimes called a manifest). Things will still go wrong, so implement a process of integrity (or fixity) checking so you can automatically tell if your bitstreams are still intact. Encourage the use of open or stable file formats where you can. Many commonly used formats are open, and their specifications may even be international standards. PDF is a good example of an open standard, but that doesn’t necessarily mean it is the right format for you! Carrying out preservation actions will help ensure continued access to your data. Migration and emulation are the most common preservation actions and we will examine them in more detail shortly.

16 Making Sense of a Collection
Understand the data, then assess risks, plan, take action to preserve Characterization: How many files? How big are the files? What file formats? Is the data dynamic or interactive? Does it contain personal information? Is it encrypted? Scale = automation = software tools Knowing what you have in your collections is an important first step to understanding them and planning for their preservation, allowing you to understand the risks faced and to take appropriate actions. This process is commonly referred to as Characterization in digital preservation and can answer questions such as:
How many files are there?
How big are the files in the collection?
What file formats are included?
Is any of the data dynamic or interactive?
Does it contain personal information?
Are any of the files encrypted?
What risks are associated?
If you are carrying out this process at any kind of scale, it is obviously useful to be able to automate it rather than having to analyse each file individually. Thankfully, there are a number of characterization tools available for digital preservation. Which one you choose will depend on the type of collections and systems you have.
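To make the first few of those questions concrete, here is a minimal sketch of a manifest generator. It tallies files by extension only, which is a crude stand-in for the signature-based identification that tools like DROID perform; the demo files are invented for the example:

```python
import tempfile
from collections import Counter
from pathlib import Path

def characterize(root: Path) -> dict:
    """Build a minimal manifest: file count, total size in bytes, and a
    tally of formats (by extension only -- real tools inspect signatures)."""
    count, total_size = 0, 0
    formats = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            count += 1
            total_size += path.stat().st_size
            formats[path.suffix.lower() or "(none)"] += 1
    return {"files": count, "bytes": total_size, "formats": dict(formats)}

# Demo against a throwaway collection.
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("hello")
(root / "b.jpg").write_bytes(b"\xff\xd8\xff")
print(characterize(root))
```

Even a simple listing like this answers “how many files, how big, what types” and is a useful starting point before deploying a full characterization tool.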

17 Characterization Tools
Pronom: a register of file formats and their behaviors (probably the world’s most boring database) DROID: a tool that analyses the files on a system (using the most boring database in the world) Also in this space: C3PO JHOVE Tika FITS One of the most commonly used characterization tools is DROID, developed by The National Archives in the United Kingdom. DROID produces detailed reports on the files in a particular folder or group of folders by harvesting information such as the file extension or data from the file header and comparing this with the Pronom file format registry. Pronom contains information on a large number of file formats and their behaviours. DROID is just one of many characterization tools available; others include C3PO, JHOVE, Apache Tika and FITS.

18 Assume nothing, validate everything
While characterization is a very useful process, it is essential to note that these tools are not infallible and it is important to validate their data through quality control. One way this can be done is by using validation tools, a number of which are available. A recent example is a PDF validator produced by a project called veraPDF.

19 What is a “checksum” or “hash value”?
The past: 02ace44afd49e9a522c9f14c7d89c3e9
The future: 02ace44afd49e9a522c9f14c7d89c3e9 (match – file intact)
A less pleasant future: 02ace11afd49e9a522c9f14c7d79c3e2 (mismatch – file damaged)
Another process we mentioned that is core to digital preservation is integrity checking. This is a way to check that files remain unchanged over time, using values called checksums or hash values. The animation on this slide shows how they work. Let’s say we have a bitstream, or digital file, that we want to preserve. We begin by creating a checksum: a fairly unique short number derived from the file using a software tool. Think of a checksum as a fingerprint. At some point in the future, we want to verify that our file remains exactly as it was back when we first created the checksum. We generate a new checksum from the file and compare it with the old one. In this case, the checksums are identical, so we know the file is undamaged and exactly as it was. However, if the file had become damaged, the checksum we would generate from it would be different. On comparing the two checksums, we can see that they differ, which confirms that our file is no longer identical to how it was. Perhaps it has been damaged by media failure or “bit rot”. Image by Arthur Shlain from the Noun Project
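The fingerprint idea can be shown in a few lines with Python's hashlib. MD5 is used here only because the slide's example values are MD5-length hashes; for new workflows a stronger algorithm such as SHA-256 is usually preferred:

```python
import hashlib

def checksum(data: bytes) -> str:
    # MD5 matches the 32-character example values on the slide; SHA-256
    # would be the more robust choice for a real workflow.
    return hashlib.md5(data).hexdigest()

original = b"the bitstream we want to preserve"
stored = checksum(original)          # the "fingerprint" taken in the past

# Later: recompute the checksum and compare it with the stored one.
print(checksum(original) == stored)                # True  -> file is intact
print(checksum(original + b"!") == stored)         # False -> file has changed
```

Tools like Fixity wrap exactly this comparison in scheduling, reporting, and directory-walking logic.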

20 Combined Strategies: Keep 3 Copies & Perform Integrity Checks
Integrity checking can be combined with the strategy of keeping multiple copies of each file to provide a robust digital preservation approach. This works as follows: • We make 3 copies of the file we want to preserve, ideally placing one file offsite to protect against natural disasters • We generate checksums from each file, and we can see that they are all the same, and each file is good. • Over time we can then recalculate our checksums, and see that the three copies of the file are still exactly as they were • Until at some point in the future we recalculate our checksums and discover that one of them is different! • Straight away we know that the middle copy has become damaged • So we then discard the damaged file • And replace it with a copy of one of the others Using these techniques we can dramatically reduce the chance of losing any of our data.
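The repair logic described above — spot the odd copy out and replace it from a good one — amounts to a majority vote over checksums. This sketch assumes at most one of the three copies is damaged, as in the slide's scenario; the byte strings are invented stand-ins for real files:

```python
import hashlib
from collections import Counter

def repair(copies: list[bytes]) -> list[bytes]:
    """Given copies of a file, vote by checksum: any copy whose checksum
    disagrees with the majority is replaced by a copy that matches it.
    Assumes the majority of copies are undamaged."""
    sums = [hashlib.sha256(c).hexdigest() for c in copies]
    good_sum, _ = Counter(sums).most_common(1)[0]   # majority checksum
    good = copies[sums.index(good_sum)]             # a known-good copy
    return [c if hashlib.sha256(c).hexdigest() == good_sum else good
            for c in copies]

copies = [b"data", b"daXa", b"data"]   # middle copy has suffered bit rot
print(repair(copies) == [b"data", b"data", b"data"])  # True
```

With two copies you can detect a mismatch but not tell which copy is damaged; three copies is the minimum for this kind of automatic vote, which is one reason the “2 is okay, 3 is great” guidance keeps recurring.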

21 Integrity Checking – Tools
Fixity Auditing Control Environment (ACE) For alternatives – see COPTR Some characterization tools include functionality for integrity checking but there are also software tools specifically for this process. These include: The tool “Fixity” is a good place to start for generating checksums. ACE is a more advanced tool, ideal for performing scheduled integrity checks. More options can be found on COPTR, a registry of digital preservation tools.

22 Approaches to Preservation
Bit-Level Migration Emulation Hardware Preservation Digital Archaeology etc.…… Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark The processes described in the last few slides combine to form one of the main forms of preservation, often known as bit-level preservation. As mentioned earlier, this addresses the issue of preserving the original bitstream, but we must also address how to preserve access to that bitstream. There are several different approaches to this issue, including preserving original hardware (sometimes known as the computer museum approach) and using digital forensic techniques to uncover ‘lost’ data via digital archaeology. The most common approaches, however, are migration and emulation.

23 Migration Normalization To New Versions
There are two main forms of migration for digital preservation, and it is possible to use one or both. The first is a method often referred to as ‘normalisation’. This is where all files of a particular type (for example, text documents) are ‘normalised’ to one file format. The example on the slide shows Word documents being normalised to PDF. For images this could be JPEGs and GIFs normalised to TIFFs. The choice of normalised file format will depend on the needs of the organisation and its users. The second method involves migrating old file formats to newer versions when they are at risk of becoming obsolete. This could be migrating an old .xls spreadsheet to the newer .xlsx format. Both methods have their positives and negatives: normalisation creates homogenous, easier-to-manage collections and means that users need to know how to use fewer file types, while migrating to new versions means that files can be accessed in current computer environments. Both processes can be automated, but quality control is incredibly important and careful consideration must be given to migration pathways to avoid loss of data and functionality.

24 Emulation Emulation is the process of recreating the original environment in which a file was created and used via a layer of specially written software: the emulator. Emulation has been particularly successful in the world of computer games, where enthusiasts create emulators to allow them to play older games. It is also an increasingly popular preservation method, and several projects have produced emulators for everything from early browsers to old versions of PowerPoint. Many of these emulators are freely available, either online or as software downloads. Emulation perhaps seems like the ideal version of digital preservation, as it allows users to access files in their original environment, providing a more authentic experience. It is, however, very resource-intensive, and emulators will require updates (or their own emulators) as computer environments change. It can also be difficult to confirm that an emulator truly captures the original environment unless there is still access to an original example to compare.

25 Common Computer Systems
& File Formats QUESTIONS?

26 Open Source Software
Section Two – Open Source Software (40 mins): Section two of the module will allow participants to understand the role open source software can play in their preservation work. The section will explain the ethos of the open source movement as well as the benefits and constraints of using open source software. It will also introduce the key open source products that are available for digital preservation in libraries and archives.

27 Expected Outcomes Open Source Software
Explain the ethos of the open source software movement and the main benefits and constraints of using this type of software product List the main digital preservation open source software tools for libraries and archives Describe the differences between using open source software and products offered by a vendor. Section Two – Open Source Software (40 mins) Section two of the module will allow participants to understand the role open source software can play in their preservation work. The section will explain the ethos of the open source movement as well as the benefits and constraints of using open source software. It will also introduce the key open source products that are available for digital preservation in libraries and archives.

28 Software 101 Written in a human-readable programming language
Most often ‘Compiled’ using an intermediary program into computer-readable form Proprietary software provides only the compiled version Can’t make modifications beyond the program’s inbuilt functionality Source Code → Compiler → Machine Code To understand some of the issues and benefits of Open Source Software it is important to first be familiar with the basics of how software works. Most computer programs are written in a human-readable programming language such as Java or C. These contain complicated series of instructions for the computer to carry out. These programs are not, however, understandable by a computer’s hardware, so an intermediary translation must happen. This can happen in 2 ways:
Programs written in ‘interpreted languages’ are parsed action by action as the program is run, and the translations are supplied to the computer hardware.
Alternatively, the program is passed through an intermediary program called a compiler to translate the complete program into machine-readable code. This is the example we will use in this presentation.
Most proprietary software is supplied in its compiled form. As this is not human-readable, it is virtually impossible to understand and alter.

29 History of OSS First conceived in late 1990s
Adopt best practices from Free and Commercial Software Open development = better software First program released as OSS: Netscape browser Server/software infrastructure early priorities The concept of Open Source Software (OSS) was first introduced in the late 1990s as an evolution of the Free Software movement. It was created with the idea of adopting the best practices from both Free and Commercial software development. Its founders hoped to retain the superior open development model of Free Software, which had been proven to produce better software, while couching it in a more structured (but open) legal framework. The first program to be released as OSS was Netscape’s browser, the code for which has since become the basis for the development of several other open source browsers, including Mozilla’s Firefox. Early efforts in the OSS domain focused mostly on server and software infrastructure projects, but the movement has since expanded to include all forms of software.

30 Ethos of OSS “Software should be made universally available in its entirety, with everyone afforded the opportunity to understand, change and re-distribute it.” Andrew McHugh, DCC Manual, 2005 Key Elements of OSS: Transparency Openness Community The ethos of the OSS movement has been summarised well by Andrew McHugh in his chapter on the subject for the Digital Curation Centre’s Manual. The OSS movement believes that “software should be made universally available in its entirety, with everyone afforded the opportunity to understand, change and re-distribute it”. So, key to OSS are:
Transparency – Making the development process and decision-making transparent for all stakeholders.
Openness – The source code for all OSS should be open; likewise, development should be open to all those wishing to participate.
Community – Fostering a strong and engaged community will create the best products.

31 Ten Criteria for OSS Free Redistribution Include Source Code
Allow Derived Works Integrity of Author’s Source Code No Discrimination Against Persons or Groups No Discrimination Against Fields of Endeavor Inherited Distribution of License License Must Not Be Specific to a Product License Must Not Restrict Other Software License Must Be Technology-Neutral As well as the general ethos of OSS, there are 10 key criteria that a product must adhere to in order to be considered Open Source. They are:
The product’s license must allow for free redistribution, under the same license conditions.
All releases must include full access to the original source code.
Licenses must allow derived works, allowing users to customise software to their own needs.
Restrictions can only be placed on the redistribution of altered source code if the license allows the alternatives of redistributing either the original source code with patch files, or the altered source code with a different name or version number.
OSS cannot discriminate against any persons or groups.
Nor can it discriminate against any fields of endeavor, common examples being businesses or controversial domains such as genetic research.
The terms of the original license are inherited by all who use a redistributed version of the product; this is sometimes referred to as a viral license.
If the product is part of a bigger package with other software, the original license remains relevant even if the individual software is redistributed separately.
The license cannot restrict what other software the product may be used with.
The license cannot dictate the use of a particular technology or interface.

32 A Free Beer, A Free Cat, or Free Speech?
OSS is not necessarily free as in ‘gratis’ A Free Cat Costs relating to implementation, upkeep, training, support, etc. Free Speech Access to source code Ability to adapt to own needs Can redistribute Freedom and openness are key to OSS, but many mistake this to mean that the software should be available free of charge. This is not the case, and the freedom of OSS is often expressed using the analogies of ‘a free beer’, ‘a free cat’ and ‘free speech’. If you receive a free beer, it comes to you at no cost and you can consume it without any further ramifications other than slight inebriation. This is not the type of ‘free’ that applies to OSS: it is often offered at no or low cost, but there is no requirement for it to be free as in ‘gratis’. Some have likened OSS instead to the idea of a free cat; while the original gift may not cost you anything, caring for the cat will cost money for food, toys, vet bills, etc. With OSS, although the original software may be free or relatively cheap, you will likely incur costs relating to implementation, upkeep, training, support and other issues. The other essential freedom of OSS has been likened to free speech, in that there is a requirement for free access to the source code, the ability to adapt the software if desired, and the freedom to redistribute the product.

33 Development Model Users as co-developers Early releases
Frequent integration Different versions: beta vs stable High modularization Dynamic decision-making There are several key ways in which the development of OSS differs from commercial solutions, all aimed at creating more complete and stable products. The differences include:
Users are considered to be co-developers alongside programmers. This emphasises both the collaborative nature of OSS and the belief that testing and bug identification are as important to the development process as writing code.
Programmers are encouraged to release code as early as possible to allow the interaction described in the previous point: users can check functionality is fit for purpose and spot bugs early. This input leads to more productive development cycles.
Multiple programmers may be working independently on the product, so they are encouraged to frequently integrate their work to ensure consistency.
The creation of modularized products makes it easier for multiple people to work on the product, as well as enabling customization and updates.
Dynamic decision-making is encouraged to ensure changes can be incorporated quickly.

34 Different Types of Contributions
“Give as you can” Help with: Scoping developments Identifying requirements Writing code Providing feedback Identifying Bugs As mentioned in the previous slide, the term ‘developers’ is used quite widely in the OSS world, including more than just those creating code. This is an important factor to remember if you are planning to use OSS but do not have the skills or resources to contribute to the programming efforts. Contributions are encouraged on a “give as you can” basis and all types are equally valued. These can include: Making suggestions for and scoping new developments; Identifying the more detailed requirements for developments; Contributing to the writing of code; Providing feedback on new functionality to make sure it is fit for purpose; and Identifying bugs early to create more stable software. Even a small amount of time spent on one of these activities helps the community at large.

35 SPRUCE Project Community orientated approach to digital preservation
Collaboration on tools and resources Held 3 Mashups and 1 Hackathon SPRUCE Mashup Manifesto Be agile Re-use, don’t reinvent the wheel Keep it small, keep it simple Make it easy to use, build on, re-purpose and ultimately, maintain Share outputs, exchange knowledge, learn from each other The community orientated approach of OSS complements the approach many take to digital preservation and an excellent example of this was the SPRUCE project. It used a similar ethos to bring practitioners together to collaborate on the development of various tools and resources. These efforts included 3 ‘Mashups’ and 1 ‘Hackathon’ where DP practitioners and developers came together to work on small scale solutions to a variety of practical digital preservation problems. Like many similar OSS endeavors the project had a manifesto that encouraged participants to: Be agile in their developments Re-use existing code or solutions where possible so they were not reinventing the wheel Keep things small and simple, approaching problems at a more atomic level with the idea that tools could be used together for more complex issues Create tools that were easy to use, build on, re-purpose and, ultimately, maintain Share outputs and exchange knowledge between different development groups so they could learn from each other. These are all generally great points to remember for those getting started with digital preservation. All of the outputs of the SPRUCE project can now be found on the project pages hosted by the Open Planets Foundation.

36 Some Major OSS Organizations
Open Source Initiative Apache Foundation Mozilla Linux Foundation Free Software Foundation WordPress OSS is more widely used than many realise and a large number of organizations exist to oversee various processes and products. The Open Source Initiative is the original Open Source organization and the key body for oversight of OSS. They set the requirements for what can be considered OSS and approve licenses. The Apache Foundation maintains a large number of products that include the Apache HTTP Server, the market leading server software. They also now maintain the Open Office suite. Mozilla offer a range of Open Source Internet-related products including their most well-known output, the Firefox browser. The Linux Foundation also oversees a number of projects, the most famous of which is the open source operating system Linux. Linux has been adopted by a wide variety of organisations around the world including several government bodies and some of the world’s biggest financial exchanges: the NASDAQ and the London and Tokyo Stock Exchanges. The Free Software Foundation was the precursor to the Open Source Initiative and has a more socio-political focus. It provides the framework to support the GNU project which, among other things, is responsible for one of the most common open source licenses. Another high-profile example of OSS is the WordPress blogging platform. Users can pay to have a blog hosted by the company or they can download the blogging software for free through an open source agreement and install it on their own web space. Open Source solutions have also been adopted and are distributed by a number of large commercial companies including IBM, Oracle and Google.

37 Benefits/Opportunities
Likely to be lower cost More freedom Influence new tools/functionality Fewer license restrictions Improved debugging Builds communities Easier to emulate Can share tools with data creators If you are considering using OSS it is useful to be aware of the potential benefits and opportunities it provides as well as the risks and constraints. Benefits and opportunities include: Lower costs – Many of the OSS packages and tools for digital preservation are available free of charge. There may be costs involved in relation to implementation, upkeep and support but it has generally been found that OSS costs less in the long term than vendor solutions. More freedom – As already stated, freedom to access, alter and redistribute is key to OSS. Influencing new tools/functionality – participation in the community surrounding an OSS project allows users to have direct influence on how it is developed. Fewer license restrictions – Licenses for OSS are far more open and through clauses like free redistribution can significantly reduce costs. Improved debugging – As debugging is a key part of the development process, with users actively contributing, this tends to lead to stable releases with far fewer bugs. Builds communities – The creation of communities around the development and use of the software provides users with a peer network to provide support and discuss implementations. Easier to emulate – key to digital preservation, the ability to access the original source code means that it is easier to emulate the original software environment if the program itself becomes obsolete. Can share tools with data creators – As well as only needing one license for multiple users, OSS allows you to share tools with data creators inside and outside of your organisation. This can help with preparing data before it is ingested into your repository.

38 Risks/Constraints Tech resources/skills needed
Lack of clear leadership and governance Requires community engagement Variable documentation Misconception about costs Securing institutional buy-in Potentially less diversity Too much customization Funding/sustainability Although there are many benefits and opportunities to the use of OSS, it is essential to be aware of the potential risks and constraints so that you can take steps to mitigate them. They include: Skills/resources needed - OSS may need more technical skills and/or resources to allow its implementation than commercial products. This may be as simple as using a few command line operations through to full programming skills. Leadership/governance - sometimes OSS development suffers from the lack of clear governance and leadership which can result in the software becoming unfocussed or stagnating. This is mitigated by the existence of an identified owner or groups of owners to oversee decision making. Requires community engagement – without community engagement OSS may fail to meet requirements or find and sustain an audience. Variable documentation – The quality of documentation accompanying OSS has at times been poor, although this has generally improved in recent years. Misconception about costs – Two common and opposite misconceptions often exist about the costs of OSS. The first being that it is free, the second that a large amount of resources are needed to maintain it without vendor support. The truth being somewhere in the middle, but explanation of true costs may be needed to get permission to use OSS. Institutional buy-in – it can often be difficult to secure institutional buy-in as those in management will often believe that OSS is inherently unstable and therefore presents a risk. Being prepared to advocate for OSS is important. Less diversity – Too much investment from the community in a single solution can sometimes lead to negative outcomes, reducing competition and diversity.
Too much customization – Customizing OSS too much can move the software too far away from the main development and leave you without community support. Keeping customizations to atomic add-ons where possible is advised. Funding/sustainability – Some OSS projects suffer from a lack of funding and planning for sustainability. This has been true in relation to a number of tools developed by digital preservation projects. Once the project funding is finished the tool is no longer maintained. It is therefore important to check on these issues before committing to a particular solution.

39 OSS Licenses ‘Copyleft’ licenses Approved by OSI
Emphasis on collaboration, openness and reuse Derived works must have same license Popular licenses include: Apache License 2.0 GNU General Public or Library General Public Licenses BSD 3-Clause or 2-Clause Licenses Mozilla Public License The types of license used with OSS are often referred to as ‘copyleft’ as their emphasis is on providing a framework for freedom of use rather than focusing primarily on restrictions. To be accepted as a true Open Source license it must be approved by the Open Source Initiative. They maintain a list of approved licenses on their site, which also includes information on the most commonly used. OSS licenses are written to encourage collaboration, openness and reuse, with proper attribution to the original creators being one of the few restrictions on reuse and redistribution. They are usually far simpler than their commercial cousins and a fraction of the length. Copyleft licenses, such as the GNU GPL, also require that all derived works adhere to the same license, ensuring the ethos of OSS is passed on to those works; permissive licenses, such as the BSD and Apache licenses, do not impose this requirement. Although there are a large number of custom OSS licenses, many projects choose to use one of the more common standard licenses listed here. More information on these can be found on the Open Source Initiative website.

40 Comparison with Vendor Solutions
Issue OSS Vendor Initial Cost Installation Source Code Customization Licenses Bugs Support Documentation Training Motivation for Developments Succession The table on this slide shows a simplified visual comparison of using Open Source Software versus vendor provided solutions. Green = good in this area, yellow = mixed, some strengths and weaknesses, red = not available. Both have their strengths and weaknesses. Looking at these in a little more detail: Initial Cost - Much OSS is free but even if there is an initial cost in procuring OSS it is likely to be small in comparison with vendor solutions which may require a significant investment as well as an ongoing commitment to additional services or updates. Installation – Vendors are likely to provide support with the installation of more complex pieces of software and simpler pieces are distributed in an executable format that is usually easy to install. Installation of OSS is more variable and may require an investment of resources and more technical skills to get it up and running. Source Code – OSS provides access to the original source code whereas vendor supplied solutions are pre-compiled. Customization – Access to the original source code means it is possible to fully customise OSS for your organization if you have sufficient programming skills. There is also the potential for greater customisation and collaboration from the wider community. Customization of vendor solutions is usually limited to the in-programme options and tools. Any more significant customisation will depend on the vendor and their priorities. Some vendors will create customised modules for their software for a fee. Licenses – It is generally only necessary to acquire one license for OSS no matter how many installations are needed. They are also more open to the creation of derivative works and redistribution. Vendor licenses tend to be far more restrictive, limiting how and where the software can be used.
It is also normal that multiple licenses may need to be purchased, one for each user or installation of the software. Bugs – Due to the collaborative nature of OSS development stable versions of the software tend to be less buggy and when bugs are identified they are addressed more promptly if a reasonable-sized community exists for the product. Vendor software tends to be more buggy when first released as getting the product to market is a key priority. They may also be slower to address identified bugs later depending on their current commercial and/or development priorities. Support – Vendor software often comes with a support package, or this can be purchased as an addition, meaning that there is some expectation of good and prompt support. The situation is more mixed for OSS software and depends on factors such as the size and engagement of the user community or the availability of paid-for support services. Documentation – Documentation was historically poor for OSS and, although the situation has much improved in recent years, good documentation cannot be relied upon for all software. For vendor solutions, there is a reasonable expectation that good documentation will be provided when the product is procured. Training – Like documentation, training for OSS is mixed and sometimes only available for larger/more established programmes. Commercial vendors will more likely have training resources available. Depending on the size and complexity of the software this may be anything from online resources to the provision of in-house training for staff. Motivation for Developments – One of the key strengths of OSS is that developments are normally motivated directly by the needs of the user community. The main motivations for vendors are usually focused on commercial concerns, such as making a profit and strengthening their position in the market. This can mean they are less responsive to user needs and will adhere to business models such as planned obsolescence.
Succession – No matter the type of solution chosen it is important to carry out succession planning to make sure data can be retrieved in the event of the discontinuation of the solution. With OSS the continued support and development of a product relies on the ongoing engagement of the community. If this abruptly ends, access to the source code means users are in a strong position as long as they have the skills/resources required. With vendors, it is very important to include succession planning in any service agreements but issues may still occur in cases of bankruptcy.

41 Things to Consider When Selecting OSS
Longevity Stability Costs Ubiquity Skills required Documentation/training Compatibility There are lots of factors to consider when choosing an OSS solution or tool, but a short checklist of issues might include: Longevity – How long has the software been available? Does it have a robust and active community of support? Stability – Do user comments indicate that the software is buggy? Has a stable version been introduced? Costs – Is there a purchase cost? What will be the costs for implementation? Will you need to pay for support? Ubiquity – Is the software used by similar organisations? Can you rely on peer-to-peer support? Skills required – Do you have the necessary skills required to implement and use the software? If not, would they be easy to acquire? Documentation/training – Is the software supported by good documentation? Are there training resources available? Compatibility – Is the software compatible with your systems and other solutions or tools you have or would like to implement?

42 Beta vs. Stable Beta Version for community testing More bugs
Latest features More updates Stable Thoroughly tested Less buggy May lack new features Security updates One last important decision you may need to make when using OSS is whether to implement a Beta or Stable version of the software. Both choices have their positives and negatives. Beta versions are early releases of software that provide users with early access to new features so they can be tested by the community. You will have quicker access to new and potentially useful functionality but the software is also likely to have more bugs. This will mean that you will probably need to make more frequent updates to the version you are using (usually c. weekly). If you want to actively contribute to the development of the software it is usually the beta version that you should use. The stable version has generally been thoroughly tested and is more robust and less buggy. But it is likely to lack the latest features and, if the user community is not active enough, it may be a while before these are released. Stable versions generally require fewer updates (c. monthly or less frequent) and these are generally to fix security issues identified between versions.

43 GitHub A code hosting platform
Collaboration Version Control (Git) Used by developers of the majority of OSS digital preservation tools and solutions Public and private development spaces Basic account = free Access to full source code Best way to contribute to software development GitHub is a platform for hosting software source code. It allows developers to collaborate on the creation of software no matter their location and to manage version control using Git. GitHub is the most popular online code hosting platform and is used by the majority of digital preservation-related OSS development. GitHub provides spaces for both public and private developments and basic accounts, which allow participation in public projects, are free. A project’s ‘repository’ will provide full access to the software’s source code. The repository is the name used by GitHub for a particular project. Interacting with developers on GitHub is the best way to help identify bugs and make suggestions for new functionality.

44 Search Starred Projects User Info The first step of participating in GitHub is signing up for a user account. As mentioned before, a basic account is free and paid upgrades are only required if you want functionality such as hosting a private development space or more advanced permission controls. • Having a GitHub account will allow you to: Add information about yourself and interact with other users in a ‘social media-style’ ‘Star’ (or bookmark) projects you are interested in. The repositories for projects are named by the owner (an organisation or individual) and the project name. Begin contributing to developments You can also see here the search box which will allow you to search for projects of interest.

45 Project Name Issue Log Bookmarking Contributors License Tags Download Source Code Files ReadMe File The GitHub Repository Page is the main portal for accessing information on a development project. Some important parts of the page include: • In the top left is the Project Name, which has the owner (an individual or organisation) and the specific project. Tags providing information on the type of project which can also be clicked to find similar projects. All of the Source Code Files and Supporting Documentation, this should always include a ReadMe file which should provide an introduction to the software and project. This will usually include a link to where you can download an executable version of the software (if available). The files can all be downloaded individually or there is a button to allow you to Download a complete version. There is a section that contains details of the corresponding License for the software. You can also see a list of Contributors to the project, including data on when they contributed and how much. At the top right, there are buttons to facilitate Bookmarking a project. Choosing to ‘Star’ a project will add it to your ‘Starred’ projects list. ‘Watching’ a project will notify you of any conversations around the project. This is useful for projects you plan to actively participate in. Finally, there is a tab for the Issue Log; this is the primary place where non-programmers can contribute to a project.

46 Raise New Issue Issue Types The Issue log allows project contributors to raise issues they have found with the software. This can include bugs that have been identified, suggestions for improvements in performance and suggestions for new features. • A well-structured issue log will have the issues tagged by type to help the developers identify priorities. There is a button in the top right of the page to allow participants to raise new issues. As previously stated, testing code and providing feedback is essential to the successful development of OSS and so contributions to the Issue Log are both important and welcomed.

47 Types of OSS for Digital Preservation
Two main types of open source for digital preservation Large-scale applications Repository systems Storage Workflow Tools for particular functions Characterization Migration De-duplication When looking at OSS software for digital preservation there are two main types of product you may consider using. The first are large scale applications which can be used to manage multiple processes. These can include complete repository systems, software for managing storage and workflow management systems for implementing potentially complex processes. The other main type of OSS for digital preservation is smaller tools that carry out particular functions, which can be smaller-scale processes or a step in larger processes. These can include tools for characterising a digital collection, for migrating a particular file type or to check a folder for duplicate files. There are many of these smaller tools and often several that will carry out the same function.
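To make the "small tool" idea concrete, the de-duplication check mentioned above can be sketched in a few lines of Python. This is an illustrative example rather than the code of any particular project: it groups the files in a folder by their SHA-256 digest, so any digest shared by more than one file indicates duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(folder):
    """Group files under `folder` by SHA-256 digest and return only
    the digests shared by more than one file (i.e. the duplicates)."""
    by_digest = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Production tools stream large files in chunks rather than reading them whole, but the principle is the same.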

48 Example Repository Systems
OSS repository systems include: Archivematica RODA DSpace Fedora Islandora Eprints Samvera (Hyku) If you wish to implement a full repository system for your digital collections, there are an increasing number available. Archivematica and RODA are two examples of repositories that are well supported by both a specific organisation and their user community. Both systems offer free repository solutions with additional plug-ins and paid-for support services. DSpace, Fedora and Eprints have emerged from and generally been used more in the research data and publication domains but have been implemented by a variety of different organisations. Samvera is a repository solution that has evolved from the Hydra project and the collaborative work of a number of Higher Education institutions. Their Hyku solution aims to be an easy to install ‘out of the box’ repository.

49 Example Tools: Characterization
Various tools with different functionality: DROID Apache Tika C3PO FIDO JHOVE FITS One of the most widely used types of OSS tools for digital preservation are those used for characterization, and several are available. DROID is developed by The National Archives of the United Kingdom and links to their PRONOM database of file format information. It is one of the most widely used characterisation tools as it is available with a graphical user interface and includes functionality such as fixity checking. Apache Tika, C3PO, FIDO and JHOVE all offer similar functionality but with different strengths and weaknesses. For example, JHOVE provides the richest output including file format validity but only for a limited number of format types. FITS (the File Information Tool Set) is a little different from the other tools as it actually packages together a number of the other characterisation tools including DROID, Apache Tika and JHOVE. This means it can produce rich results but also inconsistencies between the different tools.
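At their simplest, signature-based identification tools of this kind work by comparing the first bytes of a file against a registry of known "magic number" patterns (DROID draws these from PRONOM). The following much-simplified sketch hard-codes just a handful of signatures to illustrate the principle; it is not how any of the tools above are actually implemented.

```python
# A few well-known file signatures ("magic numbers"); real registries
# such as PRONOM hold thousands, with offset and container rules too.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (also used by DOCX, ODT, EPUB)",
}

def identify(path):
    """Return a coarse format name based on the file's leading bytes."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"
```

Identification by signature is more reliable than trusting the file extension, which is why characterization tools favour it.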

50 Other Types of Tools De-duplication Forensics Decryption Fixity
Planning Migration Emulation Validation Policy etc…… There are many other types of OSS tools for digital preservation such as tools for: Identifying duplicate files Carrying out forensic analysis of files or directories (particularly useful for handling processes such as disk imaging as well as working with protected or sensitive files) Decrypting files Checking file integrity using fixity values Carrying out preservation planning Migrating file formats Accessing files using an emulator Validating that a file format matches the format specification Helping to write policy And many other tasks and processes….
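Checking file integrity with fixity values is one of the simplest of these functions to picture: compute a cryptographic checksum when a file is ingested, store it, and later recompute and compare. A minimal sketch (not the code of any specific fixity tool):

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks so
    that very large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(path, stored_digest):
    """True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == stored_digest
```

A mismatch signals that the file has been corrupted or altered since the stored digest was recorded.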

51 COPTR Tools registry for digital preservation
Includes OSS and Vendor solutions Part of DigiPres Commons Hosted by the Open Preservation Foundation Browse by: Name Function Type of content When looking for tools for digital preservation, one of the most useful places to start is the tools registry COPTR. The registry includes listings of both OSS and vendor software and solutions. It is one of the information resources offered by the DigiPres Commons and is hosted by the Open Preservation Foundation. COPTR allows you to browse tools by name, function and type of content. It is a community developed resource so the amount of information varies by tool but contributions are welcomed.

52 POWRR Tool Grid The POWRR Project has been a major contributor to the COPTR registry and another way to navigate the site is using the POWRR Tool Grid shown here. It allows users to identify relevant tools by object type and by lifecycle stages, which is particularly useful when developing new processes.

53 Open Source Software QUESTIONS?

54 The OAIS Reference Model, Developed in partnership with the
Packages, & Metadata Section Three – OAIS (40 mins) The final section of the module will provide participants with a broad overview of the OAIS standard, including a deeper dive into the information model. This will include an understanding of the structure and purpose of information packages and their place in the digital preservation lifecycle. This section will provide an overview of the Open Archival Information System Reference Model. It will introduce key terms and concepts as well as giving guidance on how to practically implement aspects of the model

55 Expected Outcomes OAIS, Packages, & Metadata
Explain at a high level the main components of the OAIS standard; including the mandatory responsibilities, functional model, information model and key terms Describe the elements of the information model and their relevance to the preservation lifecycle Design a basic information package and select relevant metadata standards Section Three – OAIS (40 mins) The final section of the module will provide participants with a broad overview of the OAIS standard, including a deeper dive into the information model. This will include an understanding of the structure and purpose of information packages and their place in the digital preservation lifecycle.

56 Why Do We Need Models? High-level conceptual map for activities
Can help set requirements Supports identification and development of standards Framework for comparing and assessing approaches Before getting into the detail of OAIS it is important to understand why having models for activities like digital preservation is important. Models: Provide a high level conceptual map of the generic activities that need to be undertaken, providing a view of the bigger picture. Allow us to start understanding our needs and set requirements for the policies, documentation, tools and systems we will look to develop. Provide a consensus between different organizations and practitioners allowing them to identify and develop standards that will benefit their efforts. Can be a basis for comparing and assessing the approaches taken by different organisations. In digital preservation, this has seen the development of audit methodologies.

57 What is OAIS? Open Archival Information System Reference Model
Originally developed by Consultative Committee for Space Data Systems An international standard ISO 14721:2012 Vocabulary and basic framework for much digital preservation work The acronym OAIS can be used to refer to the Reference Model for an Open Archival Information System standard itself, or a repository that adheres to the standard (an OAIS). The standard was originally developed out of the space data community through the Consultative Committee for Space Data Systems. It has since been established as an international standard, with the 1st edition published in 2006 and a 2nd edition in 2012 (ISO 14721:2012). A 3rd edition is currently in development. The full text is still available for free via the CCSDS website in the form of the ‘Magenta Book’. OAIS is particularly important as it provides a basic common framework for communicating about digital preservation activities and has provided much of the vocabulary used by practitioners. An important caveat: while understanding OAIS will facilitate your work in digital preservation it is important to remember that you may not want to use all of the standard, instead using what is useful/relevant to your organization.

58 Basic Definition of an OAIS
a reference model … to establish a system for archiving information, both digitalized and physical, with an organizational scheme composed of people who accept the responsibility to preserve information and make it available to a designated community The basic definition of an OAIS is included here. It is important to note that it mentions both systems and organizational activities Not just technology It also establishes that the OAIS assumes a responsibility for the information within it, to be preserved for a particular period of time Designated community is a key term from OAIS which is important to understand The designated community is an identified group of primary users for the OAIS’ content. The size of this group may depend on both the content/collection and the organization. It can be anything from one person to the whole world. It is important to define your designated community as their needs should shape decisions about how the content of the OAIS will be managed and preserved. The designated community must be monitored on an ongoing basis to capture changing needs. Who are your stakeholders (e.g. patrons, donors, other systems) and what are their needs?

59 Scary OAIS Spaghetti Monster
But let’s be honest…. An OAIS covers all activities from the creation of the data to its use by the designated community. This encompasses activities such as: Acceptance into the repository Cataloguing Storage Preservation actions Access provisions Important to note: it is intended that the model defined by the standard is meant to be relevant for both physical and digital collections.

60 Functional Model…. …still scary but let’s give it a chance.
WHAT needs to happen The OAIS standard includes within it a number of models describing aspects of an OAIS. The main two are the functional model and the information model. Pictured here is the high level functional model which shows: The OAIS functional entities The information objects that they interact with The actors who interact with the OAIS The following slides will examine these in more detail

61 Actors…. …are just the folks* in your normal professional encounters.
COMPARE TO PEOPLE THEY ALREADY INTERACT WITH WHO makes it happen The standard defines 3 actors who interact with the OAIS: Producer – The individuals, organizations, or systems that transfer information to the OAIS for long-term preservation. Interactions include negotiation, completing agreements and transfer of data. Management – The individuals or organizations responsible for formulating, revising and enforcing the high-level policy framework governing an OAIS. They may also provide the funding required to run the OAIS. Consumer – The individuals, organizations, or systems that use the information preserved in the OAIS. Interactions may include searching catalogs and making requests for access. The designated community is a special subset of potential consumers. MANAGEMENT *and sometimes Systems

62 Objects…. …are just the materials and the information about them that bounce around your world. COMPARE TO OBJECTS THEY ALREADY INTERACT WITH To WHAT is it happening The objects shown in the functional model here are the information to be preserved in various states as it is processed in the OAIS. These objects are referred to as Information Packages and there are three types: Submission Information Packages Archival Information Packages Dissemination Information Packages Information Packages contain the original information object as well as any necessary documentation and metadata. We will examine the structure of information packages further when we look at the Information Model.

63 Functional Entities…. …are just the activities that someone needs to do in your world.
COMPARE TO ACTIVITIES THEY ALREADY DO
HOW does it happen
The functional entities defined by the OAIS standard are the core set of mechanisms that allow it to deliver its primary mission of preserving information. They are:
• Ingest: the set of processes responsible for accepting information submitted by producers and preparing it for archival storage. This may include checking that data is uncorrupted and complete, virus checking, and creation of finding aids/catalog information.
• Archival Storage: performs the storage function of the OAIS, including ensuring information is stored in the appropriate location, error-checking, and refreshing storage as required.
• Data Management: maintains the descriptive and administrative data required for the management of the OAIS.
• Administration: undertakes the day-to-day management of the OAIS and coordinates the activities of the other functional entities. This includes varied tasks such as negotiating with producers, managing policies, and customer service.
• Preservation Planning: maps the preservation strategy for the OAIS, including monitoring changes in technology and the designated community, as well as developing preservation plans.
• Access: manages the processes and services which provide consumers with access to the information held by the OAIS.

64 Information Packages….
…are just a way to keep the materials and the necessary information about them together.
Introduced earlier, Information Packages are the main information objects within the OAIS model. They are the original digital objects plus relevant documentation and metadata, and come in three forms:
• Submission Information Package
• Archival Information Package
• Dissemination Information Package
The remaining slides in this section will examine the structure of IPs in more detail, and how you can begin designing/constructing them.

65 Information Package Structures
May be influenced by:
• Designated community needs
• Existing systems
• Resources available
• Preservation plans
Options from simple to complex:
• Standard folder system
• Databases
• XML wrappers
Tools available to help with creation of IPs
How information packages are implemented will differ from organization to organization. Decisions on how to structure the packages will depend on a number of factors:
• What are the needs of your designated community? This will influence choices about what metadata is kept and where, and will drive all decisions about Dissemination Information Packages.
• Do you have any existing systems that the information packages will need to be compatible with? Will these determine the formats and/or technology you will use?
• What resources do you have available? Consider skills as well as costs.
• Have you already made preservation decisions? These may determine the amount of metadata needed and what form of the data object to store in the Archival Information Packages.
The options for formats are virtually endless and can be tailored to specific organizations. Common choices include:
• Storing the data objects in specially ordered folder structures
• Capturing metadata in a database with links to the locations of the data objects
• ‘Wrapping’ the data objects and their metadata using an XML file that includes file locations and some or all of the metadata
There are a number of tools available for managing the creation of information packages. Most commercial repository systems include this functionality, but there are also free or open source tools available, the most popular being Bagger/BagIt.
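Bagger/BagIt, mentioned above, packages files into a simple, self-verifying folder layout. The sketch below (Python, standard library only) builds a minimal BagIt-style bag: a data/ payload directory, a bagit.txt declaration, and a manifest-sha256.txt of payload checksums, following the BagIt convention (RFC 8493). It is a toy illustration of the layout, not a replacement for the real tools, which add further tag files and validation.

```python
import hashlib
import shutil
from pathlib import Path

def make_minimal_bag(source_dir: Path, bag_dir: Path) -> None:
    """Package files from source_dir into a minimal BagIt-style bag."""
    payload = bag_dir / "data"
    payload.mkdir(parents=True)

    manifest_lines = []
    for src in sorted(source_dir.rglob("*")):
        if src.is_file():
            rel = src.relative_to(source_dir)
            dest = payload / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # preserve timestamps with the copy
            digest = hashlib.sha256(dest.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel.as_posix()}")

    # Bag declaration and payload manifest, per the BagIt convention
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
```

Because the manifest records a checksum for every payload file, anyone receiving the bag can recompute the digests to confirm nothing was lost or altered in transfer.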

66 What’s in the AIP? An Example
AIP example from Chris Prom at UIUC Archives:
• Unique ID: Accession #, System ID
• Descriptive info
• Access copies of the original digital files, maybe migrated to new formats
  • Online = available online
  • Nearline = made available in person
• Original submitted digital files, pre-preservation actions
The AIP contains your materials from the SIP, with preservation actions done to them; they may be migrated or transformed in some way so that they can be better preserved.
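One simple way to realize components like these is a plain folder structure per AIP. The sketch below (Python, standard library) uses hypothetical folder names loosely modeled on this slide's example; it illustrates the "specially ordered folder structures" option, and is not a layout prescribed by OAIS or by UIUC.

```python
from pathlib import Path

def scaffold_aip(root: Path, unique_id: str) -> Path:
    """Create an empty AIP folder skeleton under root, named by unique ID.

    Folder names are illustrative: untouched originals, access copies
    split into online/nearline, and metadata kept alongside them.
    """
    aip = root / unique_id
    for sub in ("originals", "access/online", "access/nearline", "metadata"):
        (aip / sub).mkdir(parents=True)
    return aip
```

A layout like this keeps the pre-preservation originals clearly separated from access derivatives, so later migrations never overwrite the submitted files.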

67 Planet of the AIPs Just remember,
You might have a number of different-looking AIPs depending on what kind of material you are preserving. Your AIPs will live on something of a spectrum, in terms of what components will be included and how you choose to package them up. So, like this slide, some AIPs may be VERY FANCY… some of them even wear clothes and talk! Some of them are very specialized… but some of them are very simple and pure. And your AIP can grow over time and mature, depending on the resources you have at your disposal. There's NO WRONG WAY TO BE AN APE. And there's really no wrong way to MAKE an AIP.

68 Getting From Objects To Information
DIGITAL: Image + information about it & how to render it = Digital Information Object
PHYSICAL: Slide + information about it & a projector = Physical Information Object
"Information about it & how to render it" is what's known as "representation information"
In OAIS, the Data Object is interpreted by the Representation Information to yield a usable Information Object.
A comparable example in the physical world: a slide is interpreted by a projector and yields an image (or series of images) on screen.
The Representation Information is the mechanism for providing access to the preserved information.

69 Representation Information
Two types:
• Structure Information – file format, software… how to render it… the projector!
• Semantic Information – user documentation, data dictionary… the information about it!
Can range from simple to very complex
Determined by the needs of your Designated Community
Tends to become more complex over time
There are two types of representation information:
• Structure information – covers the technical requirements for rendering the data object, and includes information such as file format and software package.
• Semantic information – covers how to interpret and understand the data object, and can include user documentation for a software package or a data dictionary identifying the columns in a database.
Representation information can be simple or very complex, and may include things like full software packages. Some representation information may require its own representation information to interpret it. How much representation information is saved is determined by the needs of the designated community: less may be needed for a specialist user group, more for the general public. Needs tend to become more complex over time as users become less familiar with the data objects and the environments they were created in.
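Structure information such as file format is typically derived by identification tools that inspect a file's leading bytes ("magic bytes") rather than trusting its extension. The toy sketch below shows the idea with a handful of well-known signatures; real tools like DROID or Siegfried match against the much larger PRONOM signature registry.

```python
from pathlib import Path

# A few well-known file signatures. Identification tools use far larger,
# registry-backed signature sets; this is only a toy illustration.
SIGNATURES = {
    b"%PDF": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (also DOCX/ODT, etc.)",
}

def identify(path: Path) -> str:
    """Guess a file's format from its leading bytes, ignoring the extension."""
    head = path.read_bytes()[:16]
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"
```

Signature-based identification matters for preservation because depositors' file extensions are often missing or wrong, while the bytes themselves rarely lie.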

70 Preservation Description Information (PDI)
Supports preservation, authenticity, and dissemination
Describes "the past and present states" of the data
Consists of 5 components:
• Reference information
• Context information
• Provenance information
• Fixity information
• Access Rights information
Preservation Description Information is the additional metadata required to support and document the OAIS's preservation processes. This includes:
• Recording preservation activities that have been undertaken
• Retaining metadata to demonstrate the continued authenticity of the data objects
• Providing metadata to facilitate the dissemination of the data objects
The OAIS model describes this as metadata that "is specifically focused on describing the past and present states" of the data objects. There are five components of Preservation Description Information:
• Reference Information – a unique identifier for the data object within the OAIS and, potentially, outside it. Examples include an ISBN or a persistent identifier like a DOI.
• Context Information – describes the data object's relationships to others within the OAIS, for example whether it belongs to a particular collection or to other versions of the same document.
• Provenance Information – captures information about the history of the data object, which can include who created it, the chain of custody, and any actions taken to preserve it (e.g. migration).
• Fixity Information – documents that no changes have been made to the data object, through mechanisms such as checksums, digital signatures, or watermarks.
• Access Rights Information – captures any restrictions relating to the data objects in relation to preservation and access. This may include information on licenses, lists of those with permission to access, and information on agreed preservation options from the depositor agreement.
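Fixity Information is most commonly implemented with checksums: compute a digest at ingest, store it as PDI, and recompute it later so that any mismatch reveals the data object has changed. A minimal sketch using Python's standard hashlib (the record's field names here are illustrative, not from any standard):

```python
import hashlib
from pathlib import Path

def fixity_record(path: Path, algorithm: str = "sha256") -> dict:
    """Compute a simple fixity record for one file.

    Reads the file in chunks so that large data objects do not have to
    fit in memory.
    """
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return {"file": path.name, "algorithm": algorithm, "digest": h.hexdigest()}
```

Re-running the same function during a later audit and comparing digests is the basic mechanism behind the fixity checks that repository systems automate.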

71 A Little Bit On PREMIS Widely adopted preservation metadata standard.
Covers elements of representation information and preservation description information.
Output is NOT created by hand; it depends upon the output of tools that perform actions on your files.
The record can grow over time, as preservation actions occur.
Steep learning curve. ☹ But various repository platforms (Archivematica, DataAccessioner) and other tools/systems will create PREMIS records for you.
PREMIS is a metadata standard specifically created for preservation metadata and has become the de facto standard for digital preservation. It covers many of the elements required within the OAIS standard's Representation Information and Preservation Description Information, but is specifically focused on the needs of those managing the repository and of the repository systems themselves. It does not claim to be comprehensive, and additional standards or custom metadata elements may be required. PREMIS exists only for capturing information about preservation actions; it does not include elements for descriptive metadata. Users are instead advised to use existing standards such as MARC, ISAD(G), or Dublin Core. Always consider what is necessary and what is feasible when deciding what metadata to capture. It is advisable to carefully consider how much of PREMIS you might use, as it is a large and complex standard; adopt only the elements that are necessary.

72 What does PREMIS capture?
PREMIS can capture:
• The program with which the file was created
• The version of that program
• The operating system on which that program ran
• Who created the file
• The rights associated with the file
• When the file was ingested into the preservation system
• Dates the file was validated
• And more…
These are the types of things that PREMIS captures. You may or may not be able to implement PREMIS – or maybe you can create some PREMIS now but not yet automate any preservation actions that would add to it; maybe you can do that later. You can get much of the information that lives in PREMIS by running tools that extract various kinds of technical metadata – tools that do file format identification, validation, and characterization, such as JHOVE, DROID, Fido, Siegfried, FITS, etc. We will be working with some of these tomorrow! Most of these tools produce XML output that can be converted to fit the PREMIS schema if you learn a bit about XML and XSLT processing. Even if you're just saving those XML files, you are adding Representation Information and PDI to your AIP.
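To make the "tools create PREMIS records for you" point concrete, here is a simplified sketch of building one PREMIS-style event record with Python's standard XML library. The element names are drawn from the PREMIS 3 vocabulary, but this fragment is deliberately incomplete and not schema-validated; in practice systems such as Archivematica generate full, conformant records.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)

def premis_event(event_type: str, date_time: str, detail: str) -> ET.Element:
    """Build a simplified PREMIS-style event record (not schema-validated)."""
    def q(tag: str) -> str:
        # Qualify a tag with the PREMIS namespace
        return f"{{{PREMIS_NS}}}{tag}"

    event = ET.Element(q("event"))
    ET.SubElement(event, q("eventType")).text = event_type
    ET.SubElement(event, q("eventDateTime")).text = date_time
    ET.SubElement(event, q("eventDetail")).text = detail
    return event
```

Each preservation action (virus check, migration, fixity check) would append another event like this, which is how the record "grows over time" as the notes above describe.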

73 METS Standard for packaging.
A wrapper for XML metadata – you put PREMIS, Dublin Core, MODS, etc. INSIDE it.
Contains seven sections:
• Header
• Descriptive Metadata
• Administrative Metadata
• File Section
• Structural Map
• Structural Links
• Behavior
METS and PREMIS together cover most of the metadata requirements of OAIS.
METS – the Metadata Encoding and Transmission Standard – is an XML-based standard for encoding the metadata necessary for the management of digital data objects. It also helps ensure information packages are transferable between different repository systems. It is one of the most commonly used structures for creating information packages and is compatible with the PREMIS standard. A METS document contains seven major sections:
• METS Header – information about the METS document itself, e.g. its creator.
• Descriptive Metadata – may include a pointer to descriptive metadata held elsewhere (e.g. in a database) or contain it within the METS document. Multiple descriptions can be included.
• Administrative Metadata – information required to manage the data object, e.g. provenance information.
• File Section – a list of all of the individual files that make up the data object; if there are several, it is possible to group them as needed.
• Structural Map – the heart of the METS document; stores the structure of the data object, including both how files relate to each other and how metadata relates to the parts of the object.
• Structural Links – allows the storage of links between different parts of the object as defined in the Structural Map. Particularly useful for websites, as hyperlinks between pages can be stored.
• Behavior – captures information on behaviors of the data object, including mechanisms for their execution.
Many organizations use both the METS and PREMIS metadata standards, as between them they cover most of the metadata and packaging requirements of OAIS.
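The seven sections can be seen in the skeleton of a METS document. The sketch below builds an empty shell with Python's standard XML library; the element names (metsHdr, dmdSec, amdSec, fileSec, structMap, structLink, behaviorSec) come from the METS schema, and real documents populate each section with IDs, attributes, and content such as embedded PREMIS or Dublin Core.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def mets_skeleton() -> ET.Element:
    """Create an empty METS document containing its seven major sections."""
    def q(tag: str) -> str:
        # Qualify a tag with the METS namespace
        return f"{{{METS_NS}}}{tag}"

    mets = ET.Element(q("mets"))
    for section in ("metsHdr", "dmdSec", "amdSec", "fileSec",
                    "structMap", "structLink", "behaviorSec"):
        ET.SubElement(mets, q(section))
    return mets
```

This "wrapper" design is why METS and PREMIS pair so well: the PREMIS records produced by your tools slot into the administrative metadata section of the METS shell.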

74 The OAIS Reference Model,
Packages, & Metadata
QUESTIONS?
Curriculum developed in partnership with DPC; it is on your flash drives and the POWRR website.

75 OAIS in the Wild A quick case study
The final few slides in this section provide a short case study of how OAIS was used by an archive to carry out a gap analysis of its digital preservation capabilities and needs.

76 An Introduction to RCAHMS
A medium-sized archive and survey institution based in Edinburgh
Mission to record Scotland's built heritage
Archive built from:
• Outputs of RCAHMS' own survey work
• Material collected from external depositors
Acronym commonly pronounced 'R-CAMS'
The Royal Commission on the Ancient and Historical Monuments of Scotland (now part of Historic Environment Scotland) was a medium-sized organization based in Edinburgh, Scotland. Its mission was to maintain a record of Scotland's built heritage through architectural and archaeological surveys and the collecting and management of its archive collections. RCAHMS' collections were built from the outputs of the organization's survey work and from material collected from external depositors (primarily architectural firms, archaeological companies, and individuals with an interest in built heritage).

77 First Digital Archive First received digital data in 1992
Report detailing preservation needs
Contract to develop systems in 2003
Limited standards and tools available
Systems:
• Area in database to record metadata
• Dedicated storage area
• Batch processing for digital images
• No preservation or dissemination systems
RCAHMS received its first deposit containing digital data in 1992, from an archaeological dig. At this time (and for the next decade) digital data was stored on the original media alongside the physical archives. In 2002, it was decided that steps needed to be taken to manage this information and, in partnership with the UK's Archaeology Data Service, a report was produced detailing RCAHMS' digital preservation needs. A 6-week contract was awarded in 2003 to an external contractor to develop solutions for digital archiving. At that time, limited tools and standards were available. The solutions that were developed were:
• An area in the RCAHMS catalogue database to record limited technical metadata
• A dedicated 'digital archive' storage area on the organization's network
• A tool for batch processing digital images, including automated generation of metadata
These solutions addressed only a limited number of issues relating to ingest and archival storage. No consideration was given to solutions for preservation and dissemination.

78 Motivations for Redevelopment
New Digital Archivist hired
Exponential growth of digital deposits
Emergence of new standards and tools
A more strategic approach to management and development required
In 2006, RCAHMS hired a new Digital Archivist, and this was the initial catalyst for a review of the organization's digital preservation activities. Over the previous 5 years RCAHMS had also seen exponential growth in the number of deposits received containing digital material, with:
• Over 500,000 digital objects now in the archive
• New and more complex data types being received, particularly CAD, GIS, and laser scanning data
• The current systems not coping with the strain of the increased input
During this time, a number of new standards and tools had been developed, meaning:
• There were opportunities for new and enhanced systems and procedures
• The current systems were far from adhering to 'best practice' and did not meet minimum standards for preservation
It was agreed that a more strategic approach to the management and development of the digital archive was required, to make better use of the limited resources available.

79 A Plan of Action Questions considered: What might success look like?
How could buy-in be secured from stakeholders?
How to identify useful standards and tools?
OAIS core to the process:
• Helped set aims
• Provided a framework to guide choices
• Used to carry out a gap analysis
It was important to approach the redevelopment carefully, thinking about the process and the ultimate aims. Questions considered included:
• What might success look like, and how could it be measured?
• How could buy-in from stakeholders both inside and outside the organization be achieved?
• How could the best and most relevant standards and tools be identified?
It was quickly established that OAIS would be core to the redevelopment process. The standard helped:
• As a benchmark for setting aims and objectives for what was to be achieved
• As a framework for planning what standards and tools were needed and how they should fit together
• As the basis for a gap analysis of the existing systems: current workflows were mapped against proposed/ideal workflows and gaps identified. This was a particularly successful process.

80 Gap Analysis: Ingest
Using the full functional model of OAIS as the basis, ideal workflows were created for Ingest, Archival Storage, Preservation, and Dissemination. The slide shows the Ingest workflow. Each element of the workflow was labelled with the number of the corresponding function from the full functional model. A comparison was then made between the existing workflows and the ideal workflows, with each element color-coded depending on whether:
• The element existed and was fit for purpose – the black circles
• The element existed but was not fit for purpose and required upgrading – the beige circles
• The element did not exist – the white circles
It can easily be seen that a very large gap exists, covering almost the entire workflow. These graphics proved very useful for planning developments, but also for advocacy purposes: they were instrumental in gaining senior management support and in having a digital repository programme established as a key aim of the organization. A fuller account of this work can be found in an article in the Journal of the Society of Archivists, listed in the course resources.

