The design and implementation of the Neurophysiology Data translation Format (NDF). Developed by Bojian Liang, Martyn Fletcher, Jim Austin, Advanced Computer Architectures Group, Dept. of Computer Science, University of York.


1 The design and implementation of the Neurophysiology Data translation Format (NDF). Developed by Bojian Liang, Martyn Fletcher, Jim Austin. Advanced Computer Architectures Group, Dept. of Computer Science, University of York, York, YO10 5DD, UK. {bojian, martyn.fletcher, austin} Presented by Leslie Smith, University of Stirling.

2 Slide 2: Overview
- Data problems and issues.
- Our solution: the Neurophysiology Data translation Format (NDF).
- What is NDF and what does it provide?
- Future work.

3 Slide 3: The CARMEN Project
The CARMEN (Code Analysis, Repository and Modelling for e-Neuroscience) project provides an environment for sharing neurophysiological experimental data and algorithms using Grid technology. It is a consortium effort to create a virtual laboratory for neurophysiology, led by 11 UK universities in collaboration with other academic and commercial partners, for the benefit of the neuroscience community.

4 Slide 4: The data interchangeability problem
The CARMEN system has to handle a wide range of incoming data types as well as derived data. Such data are often unreadable without vendor-specific software or knowledge of the encoding format. Data may be used by human users or by services, and in a processing chain the output of one service may be the input of other services. It is impractical for services to accept arbitrary input and output data formats, particularly in workflows. There is therefore a need for data translation to a standard data format, so that data can be processed in a consistently interpretable way by both human users and machines.

5 Slide 5: Remote data issues
Remote data: to avoid unnecessary data downloading, moving, and processing:
a. A user needs to know as much as possible about the data before it is downloaded or processed.
b. A service needs to verify that the data is a valid input type before processing it.
c. A workflow editor needs information to pre-verify the type of an input data set, whether it comes from a remote data repository or from the output of another service, when constructing a workflow script.
Questions:
1. How do we interrogate and understand remote data without downloading or accessing the whole binary data set?
2. A file extension is not enough to pre-verify a workflow input or output file, so where does a workflow editor get the information to perform this verification?
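The idea behind NDF's answer to these questions is that a small, separate configuration file can be fetched and interrogated on its own. As a minimal sketch (in Python for brevity; the real NDF API is a C library), the element and attribute names below are assumptions for illustration, not the actual NDF schema:

```python
# Sketch: interrogating a remote data set via its small XML header alone.
# The <ndf>/<channel> element and attribute names are ASSUMPTIONS for
# illustration; the real NDF configuration schema is defined by CARMEN.
import xml.etree.ElementTree as ET

EXAMPLE_HEADER = """\
<ndf version="1.0">
  <channel id="ch01" type="TIMESERIES" samples="1000000" rate="25000"/>
  <channel id="ch02" type="NEURALEVENT" events="5400"/>
</ndf>"""

def summarise_header(xml_text):
    """Return (channel id, data type) pairs from an NDF-style header,
    without ever opening the (possibly huge) binary host file."""
    root = ET.fromstring(xml_text)
    return [(c.get("id"), c.get("type")) for c in root.findall("channel")]

print(summarise_header(EXAMPLE_HEADER))
```

A workflow editor could run exactly this kind of check on a downloaded header to pre-verify that a data set offers, say, a TIMESERIES channel before any binary data moves across the network.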

6 Slide 6: Partial data access issues
Sub-dataset selection and partial data extraction / downloading:
a. Neurophysiological experimental data are complex data sets, and most CARMEN services are designed to process only one of the data types within a data set.
b. Raw data contains multiple channels from the acquisition equipment, but only some of these channels may be wanted.
c. The volume of data in a single channel may be very large, while only some channels and time intervals are of interest.
d. Processed data and raw data may be mixed in the same data set.
Questions:
1. Can we tell a service exactly which data portions we need to process?
2. Can we download (or use) only the channels, or parts of channels, of interest?
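The mechanics of reading only a channel/time window can be sketched with a seek-based read. The flat, channel-major int16 layout below is an assumption purely for illustration; NDF actually hosts its numeric data in MAT files:

```python
# Sketch of partial data access: read only one channel's time window by
# seeking, instead of loading the whole file. The flat channel-major
# int16 layout is an ASSUMPTION for illustration; NDF hosts numeric
# data in MAT files, where partial reads need the MAT element offsets.
import io
import struct

N_CHANNELS, SAMPLES_PER_CHANNEL, SAMPLE_SIZE = 4, 100, 2  # int16 samples

def read_window(f, channel, start, count):
    """Seek to the requested channel/offset and read `count` samples."""
    offset = (channel * SAMPLES_PER_CHANNEL + start) * SAMPLE_SIZE
    f.seek(offset)
    return list(struct.unpack("<%dh" % count, f.read(count * SAMPLE_SIZE)))

# Fake "file": channel c holds the values c*1000, c*1000+1, ...
raw = b"".join(
    struct.pack("<h", c * 1000 + i)
    for c in range(N_CHANNELS)
    for i in range(SAMPLES_PER_CHANNEL)
)
fake_file = io.BytesIO(raw)
print(read_window(fake_file, channel=2, start=10, count=3))  # [2010, 2011, 2012]
```

The point is that the cost of the read is proportional to the window requested, not to the file size, which is what makes region-of-interest downloads worthwhile.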

7 Slide 7: Evolving data type issues
In a research environment, new data types and formats are created whenever new scientific instruments, services, or algorithms are introduced. It is difficult or impossible to specify these precisely in advance.
Questions:
1. Can we create services that accept new data types as input?
2. Can we create services that produce new data types as output?
3. Can all this be done in a consistent manner, using the predefined data types?
4. How can a service that uses new data types perform pre-verification, as is done for the predefined data types?
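NDF's answer (described on later slides) is a semi-defined data type carrying an application-specified ID field. A minimal sketch of how a service could pre-verify such a type, with the dictionary layout and all names being assumptions for illustration:

```python
# Sketch: pre-verifying an evolving, application-defined data type.
# The "apptype" ID field is modelled on NDF's application-specified
# type ID; the dict layout and key names are ASSUMPTIONS.
def can_accept(dataset, service_accepts):
    """A service checks the declared type IDs before touching any data."""
    return (dataset.get("type") in service_accepts
            or dataset.get("apptype") in service_accepts)

# A hypothetical new data product, wrapped in a semi-defined type:
new_data = {"type": "APPLICATION", "apptype": "myLab.burstTrain.v2"}
print(can_accept(new_data, {"TIMESERIES"}))           # False
print(can_accept(new_data, {"myLab.burstTrain.v2"}))  # True
```

Because the check only consults declared IDs, the same verification logic works unchanged for predefined and newly invented types.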

8 Slide 8: Can a well designed metadata system solve the problems?
Use of a generic metadata system: most users are specialists and will not appreciate many of the generic metadata specifications.
- When manually completing a metadata upload form, a user does not know which fields are required for the data set, so the uploaded metadata may be incomplete and unusable.
- When uploading a data set, the metadata may not be directly available to the user; a special tool for the particular data format may be required.
- It is impractical to upload metadata manually for a huge number of data files. Automatically uploading metadata is equivalent to having a data standard: it implies that the metadata is already included in the data set and that a data standard is in use.
- Metadata for temporary data sets, such as the output of a service (which may be the input of other services), is not available from the metadata system.
- Separating the metadata from a data set harms the data set's portability.
Our conclusion: the metadata used for the above purposes should be integrated with the data set.

9 Slide 9: Basic data types
The primary data types are:
- TIMESERIES: continuous time series data.
- NEURALEVENT: events such as spike times.
- EVENT: other event data (e.g. stimuli).
- SEGMENT: sections of TIMESERIES data.
- GMATRIX: generic, user-defined matrix data.
- IMAGE: image data.
Since the content is described using XML, additional data types can be added to cope with new developments in electrophysiology.
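In code, the fixed part of this type system is naturally an enumeration. Only the six type names come from the slide above; the enumeration itself is an illustrative sketch, not part of the NDF API:

```python
# Sketch: the primary NDF data types as an enumeration. Only the type
# names come from the NDF slides; this enum is ILLUSTRATIVE.
from enum import Enum

class NdfType(Enum):
    TIMESERIES = "continuous time series"
    NEURALEVENT = "events such as spike times"
    EVENT = "other event data (e.g. stimuli)"
    SEGMENT = "sections of TIMESERIES data"
    GMATRIX = "generic, user-defined matrix data"
    IMAGE = "image data"

print([t.name for t in NdfType])
```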

10 Slide 10: The NDF data format (1)
NDF wraps metadata and binary data together with a configuration file.
1. A separate NDF configuration file in XML format minimizes the work needed to extract metadata from a data set, obviating the need to look inside the associated binary data file. It is only necessary to download the NDF configuration file, and the metadata can then be easily viewed using a web browser.
2. Two semi-defined data types are extendable on a per-application basis, and conventional vendor data files may also be "wrapped" as an NDF data set. A dedicated ID field allows these application-specified data to be identified.
3. NDF supports the most commonly used numerical data types, from 8-bit integer to double-precision floating point. Using the most efficient type reduces the data size as well as the network traffic when downloading or uploading NDF data sets.
4. NDF permits the download of "regions of interest" (partial data access) rather than the whole data set, reducing network traffic. Partial access to a zipped MAT-file stream is supported.
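Point 3 is easy to quantify. As a back-of-the-envelope sketch, assuming a single channel sampled at 25 kHz for one hour (an assumed, plausible acquisition rate, not a figure from the slides):

```python
# Sketch: why supporting narrow numeric types matters for data size.
# One hour of one channel at an ASSUMED 25 kHz sampling rate:
SAMPLES = 25_000 * 3600  # 90 million samples

for name, nbytes in [("int8", 1), ("int16", 2), ("float64", 8)]:
    print(f"{name}: {SAMPLES * nbytes / 1e6:.0f} MB")
# Choosing int16 over float64 here cuts storage and transfer cost 4x.
```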

11 Slide 11: The NDF data format (2)
5. For a data processing chain, a history or "foot-print" of each previous process can be included in the output data. This information is useful (and may be required) for later processing or reference; in particular, other researchers can easily repeat the work by referring to the data processing history records.
6. NDF supports image data and image sequence data.
7. A separate XML file can be used to store experimental event data, annotations, and additional third-party data objects.
8. NDF minimizes the need to re-implement research tools currently used by neuroscientists and researchers: a MAT file, a publicly described data format, is used as the main numerical data file format.
9. NDF supports multiple data files for one data channel. This allows the data size of either a single channel or the full data set to exceed 2 GB on both 32-bit and 64-bit operating systems.
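The processing "foot-print" of point 5 amounts to an append-only record accumulated along the service chain. A minimal sketch, in which the record field names are assumptions (NDF stores such records in its XML configuration):

```python
# Sketch: carrying a processing "foot-print" with the data. The field
# names ("service", "version", ...) are ASSUMPTIONS for illustration.
def append_history(history, service, version, params):
    """Return a new history list with one more processing record."""
    history = list(history)  # do not mutate the caller's record
    history.append({"service": service, "version": version,
                    "params": params, "step": len(history) + 1})
    return history

h = append_history([], "spikeDetect", "1.3", {"threshold": 4.5})
h = append_history(h, "spikeSort", "2.0", {"method": "kmeans"})
print([r["service"] for r in h])  # ['spikeDetect', 'spikeSort']
```

Because each service appends rather than overwrites, a downstream researcher can replay the whole chain from the final data set alone.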

12 Slide 12: The CARMEN Portal NDF Data Channel & Time Selector (screenshot).

13 Slide 13: The NDF data I/O API (1)
The NDF API:
- Is implemented as a C library.
- Provides a low-level I/O interface for accessing the NDF data set, including the XML-format header file, the MAT-format host data files, and the XML-format annotation files.
- Translates the XML tree/nodes to C-style data structures.
- Insulates clients from the MAT data format (and from image data formats).
- Provides a standard way to manage data structure memory.

14 Slide 14: The NDF data I/O API (2)
The NDF API also:
- Supports multiple-run data writing modes for large data sets with a known total data length.
- Supports multiple-run data writing modes for data streams with an unknown total data length.
- Supports zipped data streams for MAT files.
- Supports partial data reading of both compressed and uncompressed data in a MAT file.
- Automatically manages data file splitting for large data sets.
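The automatic file splitting interacts with the multi-file channel support of slide 11: one logical channel is spread over several host files so that no single file breaks the 2 GB limit. A sketch of the size arithmetic (the 2 GB threshold comes from the slides; the even-split scheme is an assumption):

```python
# Sketch: splitting one logical channel across several host files so no
# single file exceeds a size limit. The 2 GB figure comes from the NDF
# slides; this greedy splitting scheme is an ASSUMPTION.
LIMIT = 2 * 1024**3  # 2 GB in bytes

def plan_files(total_bytes, limit=LIMIT):
    """Return the size of each part file for a channel of total_bytes."""
    sizes = []
    while total_bytes > 0:
        part = min(total_bytes, limit)
        sizes.append(part)
        total_bytes -= part
    return sizes

print(len(plan_files(5 * 1024**3)))  # a 5 GB channel needs 3 part files
```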

15 Slide 15: The NDF MATLAB Toolbox
The NDF MATLAB Toolbox has been implemented on top of the NDF C library API. It consists of a set of object-oriented MATLAB classes and functions that provide high-level support for NDF data I/O. A converter from multiple data formats to NDF is embedded in the toolbox as a data input module, and parameter structures are fully protected, with auto-correction of misused data types. The toolbox has been used within CARMEN service code, and it also serves as a set of convenient tools on a researcher's desktop for NDF data I/O and data conversion.

16 Slide 16: Future work
- Expand the specification to improve compatibility with data sets from fields other than neuroscience.
- Provide services for partial downloading of remote data sets.
- Provide services for data preview of remote data sets.
- Extend the data converter to support conversion from additional appropriate formats.
- ... and enable future-proofing!
Detailed information is available at the CARMEN portal.
