Standardization and Data Formats

A common standard for data formats is a prerequisite for efficient data handling and evaluation. The aim here is twofold: to give researchers with “ownership” of the data access that is independent of the specifics of the instrument or facility, and to facilitate evaluation and analysis by standardising this access. We require the data format to:

  • provide a complete set of parameters describing the experimental setup together with all the measured data; the format should thus be self-descriptive,
  • be structured, flexible, extensible, and platform independent,
  • be highly efficient in terms of access speed,
  • implement suitable compression mechanisms, and
  • be readily editable with journaling of changes.

The creation of a standard data format involves the specification of a data model, the definition of the contents, and their implementation. The data model is an abstract description of the structure of the data and specifies its organisation. It is expected that the model will feature a hierarchical organisation of the data, i.e. the data will be represented in a tree structure with named nodes, named datasets, and attributes of the datasets. The data model must be capable of accommodating all setups and experimental strategies. Flexibility, extensibility, self-documentation, and access speed are key design considerations.

The content definition specifies the labels for nodes and elements, together with a detailed description of how these are to be used and interpreted. Catalogues of content definitions for particular experimental applications and techniques (scattering, imaging, spectroscopy, crystallography, …) will be made available and can be straightforwardly tailored to specific experiments. The aim here is to avoid the use of proprietary formats and to provide enough information to fully determine the experimental setup from the data file alone. In our opinion, this will significantly facilitate the creation of new data evaluation software as well as the creation of interfaces between existing software and the data files.
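As an illustration of such a hierarchical organisation, the short sketch below builds a tree of named nodes, datasets, and attributes with the Python HDF5 binding h5py. The group, dataset, and attribute names are placeholders chosen for this example only; the actual labels would come from the content-definition catalogues.

```python
import numpy as np
import h5py

# Illustrative tree: named nodes (groups), named datasets, and attributes.
# The labels below are placeholders, not prescribed content definitions.
with h5py.File("example_scan.h5", "w") as f:
    entry = f.create_group("entry_1")
    entry.attrs["definition"] = "example_scattering"   # self-description
    entry.attrs["start_time"] = "2011-06-01T10:00:00"

    instrument = entry.create_group("instrument")
    detector = instrument.create_group("detector")

    # Measured data: chunking and gzip compression address the
    # efficiency and size requirements listed above.
    frames = detector.create_dataset(
        "data",
        data=np.zeros((10, 512, 512), dtype=np.uint16),
        chunks=(1, 512, 512),
        compression="gzip",
    )
    frames.attrs["units"] = "counts"                    # dataset attribute

    sample = entry.create_group("sample")
    temperature = sample.create_dataset("temperature", data=293.15)
    temperature.attrs["units"] = "K"
```

Generic tools such as h5dump or HDFView can browse this tree without any experiment-specific software, which is one sense in which the format is self-descriptive.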

The implementation of the data model should be based on existing and well-established formats, such as the Hierarchical Data Format HDF5. NeXus is an attempt to create a more specific data format for neutron and SR X-ray data, and it supports the use of HDF5. Both formats are also endorsed by other international laboratories (e.g. ILL, ISIS, ESRF, ANSTO, SOLEIL, DIAMOND, and PSI) and are possible candidates to serve as a basis for the HDRI data format. If necessary for the implementation, the NeXus structures can be extended. For the direct implementation, a set of Application Programming Interfaces (APIs) is necessary to perform the basic tasks of building the data structures and retrieving the data. Utility programs are required for basic browsing, visualization, and inspection of the data. Generic utilities for these purposes exist for both HDF5 and NeXus, so the development of APIs and utilities can build on the software already provided by these projects.

Special attention must be paid to the implementation of the data format on the data acquisition side. Since the contents of the data structures may differ from experiment to experiment, a means is required to define the specific data structure being used. The acquisition system then has to construct the data structures on the basis of this definition or template. The various data sources involved in the experiment deliver their data to the data acquisition software via a well-defined interface of functions; the acquisition software takes the data provided by these functions and constructs the data structure according to the template, as sketched at the end of this section. Interfacing a new component to the acquisition system thus amounts to providing a corresponding function for the data source. In the best case, the complete set of data describing the setup and the experiment is generated fully automatically by the acquisition software, which prevents important information from being missing from the data. However, some metadata will certainly need editing after the acquisition has finished (e.g. comments on the sample, but also some fixed parameters of the setup); a corresponding editor must therefore be provided among the utilities. Every change is tracked and documented in the file (e.g. in an attribute of the corresponding data element), indicating the author and the date of the change.

The specification of the data model and the collection and specification of the contents in the catalogues of data elements may be completed within half a year after the start of the project. In any case, the catalogues can successively be extended if additional data elements are identified. For each fundamental experimental area this can be done with the help of users and of the experts contributing the evaluation software systems, using a moderated web-based tool (e.g. a wiki). The final decision on including a new data element in the catalogue is made by a group of experts in the corresponding field. After the definition phase, the implementation phase will start: APIs and utilities have to be constructed, and the implementation of the elements for a data acquisition system shall start with case studies at selected instruments.
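To make the template mechanism concrete, the following sketch assumes a small in-memory template and hypothetical data-source functions (read_detector_frame, read_sample_temperature); a real acquisition system would load the template from a configuration file and query actual hardware, but the construction of the data structure from the template would follow the same pattern.

```python
import numpy as np
import h5py

# Hypothetical data sources: each component of the setup is interfaced to
# the acquisition software through one function that returns its data.
def read_detector_frame():
    return np.zeros((512, 512), dtype=np.uint16)    # placeholder frame

def read_sample_temperature():
    return 293.15                                    # placeholder reading

# Template defining the data structure used by this particular experiment:
# each HDF5 path is mapped to the data source that provides its content.
TEMPLATE = {
    "entry_1/instrument/detector/data": read_detector_frame,
    "entry_1/sample/temperature": read_sample_temperature,
}

def acquire(filename, template):
    """Construct the data structure purely from the template, so that no
    experiment-specific layout is hard-coded in the acquisition software."""
    with h5py.File(filename, "w") as f:
        for path, source in template.items():
            # create_dataset also creates the intermediate groups of the tree.
            f.create_dataset(path, data=source())

acquire("scan_0001.h5", TEMPLATE)
```

Adding a new component to the setup then only requires a new data-source function and a corresponding entry in the template.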
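The post-acquisition editor with change journaling could be realised, for instance, by appending each change to a "history" attribute of the modified element; the convention below (JSON entries with author, date, old and new value) is only one possible choice for such a sketch, not part of the specification.

```python
import json
from datetime import datetime, timezone

import h5py

def edit_metadata(filename, path, new_value, author, reason=""):
    """Edit a scalar metadata element in place and journal the change in a
    'history' attribute of that element (illustrative convention only)."""
    with h5py.File(filename, "r+") as f:
        element = f[path]
        old_value = element[()]
        element[...] = new_value                # in-place edit of the value

        # Record who changed what and when, directly next to the data.
        record = json.dumps({
            "author": author,
            "date": datetime.now(timezone.utc).isoformat(),
            "old_value": float(old_value),
            "new_value": float(new_value),
            "reason": reason,
        })
        history = [str(h) for h in element.attrs.get("history", [])]
        history.append(record)
        element.attrs["history"] = history

# Example: correct a sample parameter after the experiment has finished.
edit_metadata("scan_0001.h5", "entry_1/sample/temperature",
              295.0, author="A. User", reason="thermometer recalibrated")
```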