File Format Identification and Archival Processing

1 File Format Identification and Archival Processing
William Underwood, GTRI, Atlanta, Georgia
NARA Briefing, Washington, DC, February 6, 2009

2 Overview
• Background
• File Command: Magic Expressions
• DROID: File Format Signature Expressions
• Comparison: File Command/Magic & DROID/FFSignatures
• Summary

3 Background – Projects
• Presidential Electronic Records PilOt System (PERPOS) (2001-2006)
• Advanced Decision Support for Archival Processing of Presidential Electronic Records ( )

4 Background: Electronic Records at George H.W. Bush Pres. Library
One of the first presidential libraries to have electronic presidential records, particularly from hard drives:
• Word Processing Files
• Databases
• Spreadsheets
• Presentations
• Computer Programs
• Scanned Paper Records
During the administration of George H.W. Bush, new computer and electronic mail technologies were available to the President and the White House Staff for creating and saving records. These included:
• IBM PS/2 personal computers based on the Intel 386 microprocessor chip
• New operating systems such as PCDOS, Windows 3.1 and IBM’s OS/2
• Word processing software such as WordPerfect 5.1 and Microsoft Word 2.0
• Database management systems such as dBase IV and Advanced Revelation
• Spreadsheet programs such as Lotus and Quattro Pro
• Presentation programs such as Harvard Graphics
• Electronic messaging systems such as Digital Equipment’s All-in-One
• Computer programs created by White House Staff
• Document scanners for creating and storing digital images of paper records

5 Background: Where We Began
The archival functions needed to process paper records are well understood. We had few tools to identify, view or review electronic records in response to PRA/FOIA requests.
Tools Initially Needed:
• File Format Identification Tool
• Viewers for Records in Legacy File Formats
• Tool for Filtering OS and Office Applications Software from User-created Files
• Tools for Converting Legacy to Current Formats
• Tools to Support Redaction of E-records
While we knew the requirements for processing paper records, the processing of electronic records introduces new issues that we did not thoroughly understand.
File name extensions alone are not adequate to identify the file formats of e-records. For instance, a .doc file extension might indicate a WordPerfect or an MS Word document. An .exe file extension usually indicates an executable program, but may indicate a self-extracting archive, which might contain presidential records. Therefore, a file format identification tool was needed, because we do not know which viewer to use until we know the file format.
The office automation software used to create the records could not be used to view them, because that software often does not operate on current computers and operating systems. Furthermore, using such software could result in the record being accidentally modified. File viewers were needed for viewing the records on current computers without the possibility of changing the records.
An archivist must open documents as fully as possible, and redact only the portions that have restrictions. How can one be sure that the contents of the redacted parts of e-records are absolutely removed? We acquired redaction tools for experimentation.
There were no viewers for some of the legacy file formats encountered, for example, Borland Reflex databases. One needs to be able to convert these files into another file format for which there is a viewer.
We either acquired or built these tools, tested them, and learned what other capabilities were needed.
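The point about file name extensions can be illustrated with a minimal Python sketch that identifies a file from its leading bytes instead of its name. The three-entry table and the function names are illustrative assumptions, not the PERPOS signature database or its code; the byte values themselves are the well-known leading bytes of WordPerfect files, OLE2 compound files, and DOS/Windows executables.

```python
from typing import Optional

# (offset, magic bytes, label). A tiny illustrative table only;
# it is NOT the PERPOS signature set.
SIGNATURES = [
    (0, b"\xffWPC", "WordPerfect document"),
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (0, b"MZ", "DOS/Windows executable (may be a self-extracting archive)"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first signature found at its offset, else None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str) -> Optional[str]:
    """Identify a file from its first bytes; the extension is never consulted."""
    with open(path, "rb") as handle:
        return identify(handle.read(64))
```

A file named MEMO.DOC whose content starts with the bytes FF 57 50 43 would be reported as a WordPerfect document, whatever its extension suggests.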

6 Background: Evolutionary Prototyping
• Computer Scientists Build Tools
• Archivists Test Tools
• Archivists Formulate New Requirements
Result: Integrated set of tools called PERPOS
As the diagram illustrates, our approach was:
• The Computer Scientists develop a set of tools that are used by archivists on actual e-records.
• The Archivists learn more about the processing requirements and report these to the Computer Scientists.
• Computer Scientists refine these tools to meet new and better understood requirements and suggest new technologies.
The result: evolution of a prototype system called the Presidential Electronic Records PilOt System, or PERPOS.
PERPOS is a research prototype for investigating advanced technologies supporting archival decisions in processing Presidential e-records. The project is a cooperative research effort between the Bush Library archivists and the Georgia Tech Research Institute, sponsored by NARA’s ERA Research Program. PERPOS is an evolving research prototype which provides an environment for evaluating the system’s performance based on testing done by archivists at the George H.W. Bush Presidential Library.

7 Background: Archival Activities Supported by PERPOS
Diagram labels: PERPOS Repository (Ingestion), Accession, Arrange, Preserve, Search, Review, Describe
The resulting Prototype System supports the accession or ingestion, storage, arrangement, and preservation of e-records in a repository. As you are all aware, archivists review records for a variety of restrictions – statutes, deed of gift, State Freedom of Information Act restrictions, Sunshine Laws, HIPAA, and so on. At the George H.W. Bush Presidential Library, archivists review records under the restrictions of the Presidential Records Act of 1978, or PRA, and the Freedom of Information Act, or FOIA. Therefore, the System’s developing capabilities and prototypes support the search of the repository for records relevant to a PRA/FOIA request, the review of those records, and the description of those records in the form of a finding aid.
[Pursuant to the Presidential Records Act, both the former and current presidents must be notified of all presidential records proposed for opening. At this time, this is not a function that has been developed for or tested on the PERPOS system.]
In the early stages of the PERPOS project, each of these functions was accomplished by separate tools/programs/icons on a computer. Now, these have all been integrated into two subsystems – the Archival Repository and Archival Processing Tools (or ART and APT).
Before we move on, however, it is important to note that PERPOS is a research prototype, not an operational system, and for purposes of this presentation, all “electronic” records you see today are actually open textual documents that have been scanned, OCR’d, and accessioned into PERPOS for experimental purposes.

8 Contents of PC Hard Disk

9 File Format Names

10 Filter Contents of a Hard Drive

11 OS and Software Application Files Blocked by Filter

12 File Types of Passed Files

13 Properties of Filtered Files

14 OS/App Hash Code Filter

15 National Software Reference Library

16 NSRL Reference Data Set
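The hash code filter named on the preceding slides can be sketched as follows, assuming that filtering means comparing each file's digest against a set of digests of known operating system and application files. The NSRL Reference Data Set does publish SHA-1 values for known software files, but the in-memory set, the function names, and the choice of SHA-1 here are assumptions for illustration and do not describe the PERPOS implementation.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sha1_hex(data: bytes) -> str:
    """Upper-case hexadecimal SHA-1 digest of the file content."""
    return hashlib.sha1(data).hexdigest().upper()


def filter_files(
    files: Dict[str, bytes], known_hashes: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split files into (passed, blocked) by looking up their digests.

    `known_hashes` stands in for a reference set of OS/application
    file digests; here it is just a hypothetical in-memory set.
    """
    passed: List[str] = []
    blocked: List[str] = []
    for name, data in files.items():
        if sha1_hex(data) in known_hashes:
            blocked.append(name)   # known software file, not a user record
        else:
            passed.append(name)    # candidate user-created file
    return passed, blocked
```

Because the lookup is by content digest, a renamed copy of a known system file is still blocked, while any file whose content is not in the reference set passes through for further identification.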

17 Viewers, Archive Extractors, Password Recovery, Decrypters, Converters, Repairers
Currently, PERPOS has the capability to recognize about 500 legacy and current file formats or File Types. Most of these can be associated with Viewers. For instance, the picture on the right shows the association of an ARJ Self Extracting file format with a viewer for that file format. Archive and self-extracting file formats can also be associated with archive extractors as seen in the picture at the left. Password protected files can also be associated with password recovery programs and with programs that decrypt the password protected programs, given the password. Damaged files can be associated with repair programs that may be able to repair the file fully or in part. Finally, files that have no viewer can be converted to file formats that do have a viewer.

18 Magic File – Man Page

19 Magic File – Man Page

20 Magic File – Man Page

21 Extensions of File Command and Magic File
• Magic for individual file formats
• Output of file command/magic file is File Format ID
• Rewriting file command code for identifying Characteristics of Text files and Document Types
• Defined approx. 750 file format signatures
• Collected examples of approx. 500 of the file format types
• Created File Signature Database
• Verified that magic file correctly identifies approx. 500 File Types




25 GUI for File Type Identifier

26 File signatures for about 200 file formats that are currently defined in the DROID File Signature file only by file name extensions. Examples: Microsoft Outlook Personal folders ( ), AIFF (Compressed), AutoCAD Design Web Format, Adobe Framemaker Document, Applixware Spreadsheet, Chiwriter 3 Document
File signatures for about 300 file formats that probably should be included in the PRONOM Registry and DROID Signature File. Examples: MHTML Web Page Archive, Outlook Express E-mail Folder, Autodesk Revit Project, CATIA Model File V4, CATIA Drawing V5, ClarisWorks 3 Document, MacWrite 4.x Document, PDF/X1a

27 DROID – File Signature Expressions
In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature byte sequence is modelled by describing its starting position within a bitstream and its value. The starting position can be one of two basic types:
• Absolute: the byte sequence starts at a fixed position within the bitstream. This position is described as an offset from either the beginning or the end of the bitstream.
• Variable: the byte sequence can start at any offset within the bitstream. The byte sequence can be located by examining the entire bitstream.

28 The value of the byte sequence is defined as a sequence of hexadecimal values, optionally incorporating any of the following regular expressions:
• ??: wildcard matching any pair of hexadecimal values (i.e. a single byte).
• *: wildcard matching any number of bytes (0 or more).
• {n}: wildcard matching n bytes, where n is an integer.
• {m-n}: wildcard matching between m-n bytes inclusive, where m and n are integers or ‘*’.
• (a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a hexadecimal byte sequence of arbitrary length containing no wildcards.
• [a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b is the same as the endian-ness of the signature as a whole.
• [!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards).
• [!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically between a and b, inclusive (where a and b are both byte sequences of the same length, containing no wildcards, and where a is less than b).
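To make the operators above concrete, here is a minimal Python sketch that translates a subset of this syntax into a regular expression over bytes. It handles plain hex pairs, `??`, `*`, `{n}`, `{m-n}` (including `{m-*}`) and `(a|b)`; the bracketed range and negation forms are deliberately left unsupported and raise an error. The function name `compile_signature` and the `anchored` flag (standing in for an absolute offset of 0 versus a variable position) are assumptions for illustration; this is not DROID's own matching code.

```python
import re

# One token of the signature syntax per alternative.
TOKEN = re.compile(
    r"(?P<any>\?\?)"                              # ?? : exactly one byte
    r"|(?P<star>\*)"                              # *  : zero or more bytes
    r"|\{(?P<lo>\d+)(?:-(?P<hi>\d+|\*))?\}"       # {n}, {m-n}, {m-*}
    r"|\((?P<alts>[0-9A-Fa-f|]+)\)"               # (a|b) : alternatives
    r"|(?P<hex>[0-9A-Fa-f]{2})"                   # literal hex pair
)


def _literal(hex_string: str) -> bytes:
    """Regex-escaped literal bytes for a run of hex pairs."""
    return re.escape(bytes.fromhex(hex_string))


def compile_signature(sig: str, anchored: bool = True):
    """Compile a signature string into a compiled bytes regex.

    anchored=True requires the sequence to start at offset 0;
    anchored=False lets callers use .search() for a variable position.
    """
    sig = sig.replace(" ", "")
    parts = [b"\\A"] if anchored else []
    pos = 0
    while pos < len(sig):
        m = TOKEN.match(sig, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}: {sig[pos:]!r}")
        if m.group("any"):
            parts.append(b".")
        elif m.group("star"):
            parts.append(b".*?")
        elif m.group("lo") is not None:
            lo, hi = m.group("lo").encode(), m.group("hi")
            if hi is None:
                parts.append(b".{" + lo + b"}")
            elif hi == "*":
                parts.append(b".{" + lo + b",}?")
            else:
                parts.append(b".{" + lo + b"," + hi.encode() + b"}?")
        elif m.group("alts"):
            options = [_literal(a) for a in m.group("alts").split("|")]
            parts.append(b"(?:" + b"|".join(options) + b")")
        else:
            parts.append(_literal(m.group("hex")))
        pos = m.end()
    # DOTALL so that "." also matches the newline byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)
```

For example, `compile_signature("255044462D312E??")` matches data beginning with `%PDF-1.` followed by any single byte, and `compile_signature("FFD8FF(E0|E1)")` accepts either of two fourth bytes after the fixed prefix.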


30 DROID Applied to Sample Files

31 Comparison of DROID and GTRI File Type Identifier Technologies
DROID
• Matches sequences of hex values at offsets
• Regular expressions on hex values
• Efficient substring search
• Identifies all possible signatures and then selects the one of highest priority
• Includes offsets from EOF
GTRI File Type Identifier
• Matches a variety of data types at offsets
• Regular expressions on strings in lines
• Less efficient substring search, but more indirect offsets increase efficiency
• Preorders signatures and stops search when pattern matches
• Lacks offsets from EOF
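The two selection strategies in this comparison can be sketched in Python as follows. The `Signature` class, its integer `priority` field, and the sample signatures are hypothetical; neither tool's real data model is shown. For brevity both functions share one signature type that allows end-of-file offsets, although the comparison lists those only for DROID.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Signature:
    name: str
    priority: int    # hypothetical ranking: larger means more specific
    offset: int      # >= 0 counts from start of file, < 0 from end of file
    pattern: bytes

    def matches(self, data: bytes) -> bool:
        start = self.offset if self.offset >= 0 else len(data) + self.offset
        if start < 0:
            return False  # file is shorter than the EOF-relative offset
        return data[start:start + len(self.pattern)] == self.pattern


def identify_all_then_rank(data: bytes, sigs: List[Signature]) -> Optional[str]:
    """Test every signature, then keep the highest-priority hit."""
    hits = [s for s in sigs if s.matches(data)]
    return max(hits, key=lambda s: s.priority).name if hits else None


def identify_first_match(data: bytes, ordered: List[Signature]) -> Optional[str]:
    """Assume `ordered` is already sorted most-specific first; stop at first hit."""
    for s in ordered:
        if s.matches(data):
            return s.name
    return None


SIGS = [
    Signature("ZIP archive", 1, 0, b"PK\x03\x04"),
    Signature("Hypothetical packaged format", 2, -4, b"END!"),
]
ORDERED = sorted(SIGS, key=lambda s: s.priority, reverse=True)
```

With a consistent ordering both functions return the same answer; the difference is that the first evaluates every signature before deciding, while the second pays the sorting cost once up front and can stop early on each file.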

32 Summary
PERPOS File Format Resources
• File Format Signatures
• File Format Specifications/Reverse Engineering Documents
• Software: Viewers/players, Archive Extractors, Converters, Password Recovery & Decryption, Repairers
• Sample Files
Research Issues
• File Signature Representation Languages
• Metadata Extraction Languages
• File Format Description Languages
