Presentation is loading. Please wait.

Presentation is loading. Please wait.

Grammar-based Specification and Parsing for Binary File Formats

Similar presentations


Presentation on theme: "Grammar-based Specification and Parsing for Binary File Formats"— Presentation transcript:

1 Grammar-based Specification and Parsing for Binary File Formats
William Underwood Georgia Tech Research Institute Atlanta, Georgia, USA PASIG Lightning Talk, May 23, 2013 Thank reviewers for their suggestions for not only improving the paper, but extending the research. Thank organizers for this venue to report and discuss my digital curation research. This presentation reports on an investigation of potentiallly sustainable methods for the long-term preservation of digital data and records. Specifically, it reports progress in addressing the research question: Is it possible to extend the context-free grammars used to specify the syntax of programming languages to the specification of binary file formats and to use these grammars with parsers for validating the file formats of binary files? Filename - 1 1

2 Motivation: Digital Curation Tools
Digital Curators need automated tools for: Validation of file formats, with pertinent error messages Extraction of metadata Viewing/playing/reading file formats Conversion of legacy formats to current/standard formats 1 Identification is required, to know which validator, metadata extractor, renderer, or converter to use. 2. Validation is required because If a file has been damaged, it may be possible to obtain an undamaged copy, or repair the file. File might need to comply with a standard format, e.g., PDF/A. Good error messages are needed in order to correct problems with records. 3. Metadata is needed for identification or characterization of records 4. Can’t render or read data without viewers/ players and readers 5. Conversion to current or standard formats is necessitated by obsolescence

3 Motivation: Digital Curation Tools
Most all of these tools, especially those for binary file formats, are manually programmed from file format specifications. The tools become obsolete when the hardware/software platform on which the programs run becomes obsolete. The tools either need to be reprogrammed, or become unavailable. Need for a sustainable, more cost effective, digital preservation strategy for binary file formats. With sustainable methods for validation, metadata extraction, viewing/playing /reading, might not need to convert legacy to current or standard file formats. [Don’t say this here]

4 Research Questions Is it possible to extend the concept of context-free grammars from textual languages to binary file formats? Is it possible to specify binary file formats using these extended context-free binary array attribute grammars? Is it possible to develop a parser generator that takes a binary file grammar for a binary file format and generates a parser that can validate the file format?

5 Generating Parsers for Binary Formats
Goal is a parser generator for binary file formats. ANTLR (Another Tool for language Recognition) is a parser generator Input: Attribute LL(k) grammar for a string language Output: Source code (e.g. Java for a recognizer of that language) Wrote functions for each data type that converts character (byte) tokens to binary data types. Uses ANTLR to demonstrate the capability of generating parsers from file format grammars

6 5x7plate.tif

7 Pretty Printed (Partial) Parse Tree

8 Results It is possible to extend context-free grammars for textual languages to the specification of some chunk-based and directory-based binary file formats. It is possible to create parsers from these grammars for these binary file formats. ANTLR, a parser generator for LL(k) grammars, has been successfully used to generate parsers for two chunk-based file formats and two directory-based file formats. Another family of binary file formats is directory-based. A directory-based file format consists of one or more directory tables which contain one or more directory entries. A directory entry specifies where the actual data for a type of information is located. This is the scheme used in TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument and Microsoft Open Office files. The significance of success in this research task is that if binary file formats can be specified with binary file grammars, then only one parsing generator is needed to generate the parsers for verifying that a file conforms to a particular format. Similarly, the same parsing generator can be used for conversion of legacy file formats to current or standard formats. Finally, the same parser generator can be used for generating viewer/players for most file formats. This would increase the likelihood of preserving and making available into the indefinite future that digital data and those digital records encoded in binary file formats.

9 Further Research Questions
Can semantics be incorporated into the grammars for binary file formats to enable the generation of Validators? viewers/players? file format converters?

10 Further Information http://perpos.gtri.gatech.edu
W. Underwood. Grammar-based Recognition of Documentary Forms and Extraction of Metadata. The International Journal of Digital Curation, Vol 5, Issue 1, W. Underwood. Grammar-based Specification and Parsing of Binary File formats. International Digital Curation Journal, 2012.


Download ppt "Grammar-based Specification and Parsing for Binary File Formats"

Similar presentations


Ads by Google