Presentation on theme: "Reversing Data Formats: What Data Can Reveal by Anton Dorfman ZeroNights 0x03, Moscow, 2013."— Presentation transcript:
Reversing Data Formats: What Data Can Reveal by Anton Dorfman ZeroNights 0x03, Moscow, 2013
About me Fan of & Fun with Assembly language Researcher Scientist Teach Reverse Engineering since 2001 Candidate of technical science
Observed topics Samples Basics Thoughts Fields Related works Applications
What is it?
What is it? Hint 1
What is it? Hint 2
What is it?
What is it? Hint 1
What is it? Hint 2
Standard way Hex editor Researcher Brain (equally important as the Hex editor) Basic knowledge how data can be organized (in brain) Analysis of the executable file that manipulates with data format
3 ways of researching Dynamic analysis – use reverse engineering of application that manipulate with data format Static analysis – try to extract header, structures, fields and try to find relationship between data Statistic analysis – when have some amount of samples and use statistics of changes and ranges of values
Breaking the rules There is the rule RTFM (Read The F**king Manual) Nobody likes it Im not exception First of all I start my research, and second – try to find related works and generalize ideas from them
Definitions Field – some value used to describe Data Format Structure – way for organizing various fields Data – information for representing which Data Format is developed Header – common structure before data, may contain substructures
Main Idea Data format is developed by human (not pets or aliens) Sometimes looks like that no human works on it… Format developers use common data organization concepts and similar thoughts when creating new data formats If we find regularities in data format organization rules we can automate searching of them
Three types of data format File formats – multimedia formats, database formats, internal formats for exchanging between program components and etc. Protocols – network protocols, hardware device interaction protocols, protocols of interaction between driver and user space application and etc. Structures in memory – OS structures, application structures and etc.
Levels of abstraction Bit Order Byte Order Fields Size Field Basic Type Field Type Structure Field Semantics
Tasks to solve Extract header, separate it from data Find field boundaries Find structures and substructures Find types of fields Detect bit and byte ordering Determine semantics of fields Interpret goal of field – its semantics
Bit order There are most significant bit – MSB and least significant bit - LSB MSB is a left-most bit – commonly used – value is b – 95h LSB is a left-most bit - value is b – A9h - 169
Byte order Little-endian – Intel byte order Big-endian – network byte order Middle-endian or mixed-endian – mixed byte order – when length of value more then default processor machine word
Types of fields Service fields – for describing Data Format (size of structure and etc.) Common fields – fields from life (time, date and etc.) Specific fields – we can find range for that type (bit flags and etc.) Application specific – can be interpret only by application, we should use dynamic analysis
Field Size and Levels of field interpretation Commonly field is a byte sequence Sometimes field is a bit sequence Fixed size in bytes (1, 2, 3, 4, …) Fixed size in bits (1, 2, 3, 4, …) Variable size in bytes Variable size in bits
Basic field types Hex value – by default Decimal value Character value (up to 4 symbols) String (ASCII or Unicode) Float value Bits value
Types of field Text Offset Size Quantity ID Flag Counter DateTime Protect
Text Usually fixed size, remaining space after text filled with 0 Variable sized text ended with 0, or there are size of this text field Usually text string contain their meanings
Offset Offset to another field Offset to data Pointer to another structure (may be child or parent) Pointer to another instance of such structure (next or previous in linked list) Can be relative and absolute
Export in PE-EXE
Size Size of data Size of structure Size of substructure Size of field Can be not only in bytes, but also in bits, words (2 byte), double words, paragraphs (16 byte) and etc.
Quantity Quantity of substructures Quantity of elements IMAGE_NT_HEADERS.IMAGE_FILE_HEADER. NumberOfSections IMAGE_EXPORT_DIRECTORY. NumberOfFunctions
ID Identifier of data format, often called magic number ID of structure ID of substructure Often ID is value consist of char symbols Often ID looks like magic - BE BA FE CA
Flag Bit Flags Enum values as a flags Special case – Bool value
Counter Increment (possibly decrement) Starts with 0 / begins with another value Changes by 1 or another value
DateTime Storing format Resolution Moment of beginning Range
Protect CRC value of data CRC value of substructure Various Hash functions and etc.
Projects, articles and authors Protocol Informatics Project – based on Network Protocol Analysis using Bioinformatics Algorithms by Marshall A. Beddoe Discoverer - Discoverer: Automatic Protocol Reverse Engineering from Network Traces by Weidong Cui, Jayanthkumar Kannan, Helen J. Wang Laika – Digging For Data Structures by Anthony Cozzie, Frank Stratton, Hui Xue, and Samuel T. King RolePlayer – Protocol-Independent Adaptive Replay of Application Dialog by Weidong Cuiy, Vern Paxsonz, Nicholas C. Weaverz, Randy H. Katzy Tupni – Tupni: Automatic Reverse Engineering of Input Formats by Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, Luiz Irun-Briz
Protocol Informatics Project Global and local sequence alignment - Needleman Wunsch and Smith Waterman algorithms
Laika Using Bayesian unsupervised learning For fixed size structures only
Tupni Based on taint tracking engine
Applications RE any program RE undocumented/proprietary file formats RE undocumented/proprietary network protocols RE undocumented structures in memory Fuzzing Examination of protocol implementation Replay network interaction Zero-day vulnerability signature generation