SWMR and HDF5 Translation
Recently (Jan 29, 2015) I (davidsch) took a look at the beta work on the SWMR model (single writer, multiple readers) that the HDF5 group is developing. As it is planned now, there are a few limitations that would necessitate changes in how the DAQ records data, as well as in the Translator. Below is an email exchange with the HDF5 group that covers this. The summary is that SWMR will only support writing to datasets; all the groups and datasets have to be known and created before SWMR access can begin.
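To make that constraint concrete, here is a minimal sketch of a SWMR writer, assuming the API as it later shipped in HDF5 1.10 (H5Fstart_swmr_write, extensible chunked datasets); the file, group, and dataset names are hypothetical. Everything is created up front, SWMR mode is then enabled, and from that point on the writer can only append to the existing datasets.

```c
#include "hdf5.h"

int main(void)
{
    /* SWMR requires the latest file format. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("run.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Every group and dataset must exist before SWMR starts.
       Hypothetical layout: one calib-cycle group, one extensible dataset. */
    hid_t grp = H5Gcreate2(file, "/CalibCycle:0000",
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset = H5Dcreate2(grp, "time", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Switch to SWMR mode; no new groups or datasets after this call. */
    H5Fstart_swmr_write(file);

    /* Append one element and flush so concurrent readers can see it. */
    hsize_t newdims[1] = {1}, start[1] = {0}, count[1] = {1};
    double value = 42.0;
    H5Dset_extent(dset, newdims);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, &value);
    H5Dflush(dset);

    /* Cleanup of intermediate ids omitted for brevity. */
    H5Fclose(file);
    return 0;
}
```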
Another limitation, which we gathered after meeting with Quincy at SLAC sometime in late 2014, is that you cannot start moving HDF5 files as soon as they are written: when the file is closed, the library still needs to do some work on the file's header. This would affect our data management model were we to write natively to HDF5.
email transcript:
Parallel I/O and HDF5 Translation
The HDF5 library currently supports parallel I/O using MPI underneath. In the meeting with Quincy at SLAC in late 2014 we got a good sense of how we would have to change things to use this model. Here are the main points and consequences:
- In this model, you create all your groups and datasets collectively, and then different MPI ranks fill in the datasets (see the sketch after this list).
- No dataset can contain variable length data nor can a dataset contain compressed data.
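A minimal sketch of that collective pattern, assuming the standard parallel HDF5 C API (H5Pset_fapl_mpio / H5Pset_dxpl_mpio); the file name, dataset name, and fixed 1000-events-per-rank layout are hypothetical. All ranks create the file and the full-size dataset collectively, then each rank writes its own hyperslab.

```c
#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Open the file collectively with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("translated.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* All ranks participate in creating the dataset; its total size
       (here, 1000 events per rank) must be known up front. */
    const hsize_t per_rank = 1000;
    hsize_t dims[1] = {per_rank * (hsize_t)nranks};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "events", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank fills in its own slice with a collective write. */
    hsize_t start[1] = {per_rank * (hsize_t)rank}, count[1] = {per_rank};
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    double buf[1000];
    for (hsize_t i = 0; i < per_rank; i++) buf[i] = (double)(start[0] + i);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    H5Dclose(dset);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```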
This would be a big change for the Translator. It seems we would have to wait for the xtc to be written so that we know what datasets to create and how long each one is (how many events we took, etc.). So we would not be able to do 'live' translation and let users read parts of the translation (like a translated calib cycle) while the whole run is being translated (see The XTC-to-HDF5 Translator - the MPI split scan section). Also, the lack of compression is a big limitation for users moving files offsite. However, I would expect faster translation.
It may be that the hdf5 group can lift these limitations and is working on them - I believe I saw a reference to parallel compressed data in a white paper Amedeo sent around once, but no details.
Virtual View and HDF5 Translation
The hdf5 group has announced (around Jan 2015) their plans for a virtual view layer. I believe this is coming within 1-2 years. With it, we could create a master hdf5 file that contains links to hdf5 groups or whole datasets in other hdf5 files. My understanding of the virtual view layer is that, within a dataset in the master file, we can link each dataset entry to a different dataset entry in another hdf5 file. This would allow us to translate each of the DAQ streams in parallel into separate hdf5 files, while maintaining an easy-to-navigate master file that orders the events appropriately from the different streams.
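As an illustration of the idea, here is a minimal sketch using the virtual dataset API that eventually shipped with this feature in HDF5 1.10 (H5Pset_virtual); the file names, dataset names, and two-stream interleaving are hypothetical. The master file's "events" dataset maps alternating entries to two per-stream files without copying any data.

```c
#include "hdf5.h"

int main(void)
{
    const hsize_t n_per_stream = 1000;      /* hypothetical event count per stream */
    hsize_t vdims[1] = {2 * n_per_stream};  /* master dataset covers both streams  */
    hsize_t sdims[1] = {n_per_stream};

    hid_t vspace   = H5Screate_simple(1, vdims, NULL);
    hid_t srcspace = H5Screate_simple(1, sdims, NULL);
    hid_t dcpl     = H5Pcreate(H5P_DATASET_CREATE);

    /* Map every other element of the master dataset to one of the two
       per-stream files: stream 0 fills the even entries, stream 1 the odd. */
    const char *srcfiles[2] = {"stream0.h5", "stream1.h5"};
    for (int s = 0; s < 2; s++) {
        hsize_t start[1]  = {(hsize_t)s};
        hsize_t stride[1] = {2};
        hsize_t count[1]  = {n_per_stream};
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, stride, count, NULL);
        H5Pset_virtual(dcpl, vspace, srcfiles[s], "/events", srcspace);
    }

    /* Create the master file; its "events" dataset is purely virtual. */
    hid_t file = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "events", H5T_NATIVE_DOUBLE, vspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```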
Compression and HDF5 Translation
Presently when we compress, we limit ourselves to the standard compression algorithms that are available with the HDF5 library. When users export their data offsite, their offsite installation of HDF5 will be able to decompress the data. Although we could use a custom compression algorithm and get much better performance, we would then be responsible for maintaining builds of the custom compression filter for all of our users' target platforms, or a mechanism for users to build the library themselves. In a recent technical talk that Quincy gave (January 2015) I asked if they had thought about including the decompression filter in the hdf5 file in some way, perhaps as Java byte code. He said they had, but one of the goals of hdf5 is longevity of the data - on the order of 40 years. Including Java byte code, or even a Python script, adds a dependency on Java or Python that is not desirable. He then contrasted this with the current dependency that one gets by using the standard compression algorithms, such as on an external gzip library - but since those algorithms are so common, it is not as problematic a dependency.
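For reference, enabling one of the standard filters is just a dataset-creation property; below is a minimal sketch using the built-in gzip (deflate) filter, with a hypothetical file and dataset name. Compression requires a chunked layout, and any stock HDF5 installation offsite can decompress the result.

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("compressed.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* Compression is applied per chunk, so a chunked layout is required. */
    hsize_t dims[1] = {100000}, chunk[1] = {4096};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);          /* standard gzip filter, level 6 */
    /* H5Pset_shuffle(dcpl); */       /* optional byte-shuffle, often improves ratio */

    hid_t dset = H5Dcreate2(file, "camera", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```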