Fixed chunk size

The test scripts are stored here.

...

I evaluated the performance of HDF5 read/write with many small appends. The test scripts write a 32 GB file by appending a single 1 MB element at a time. The file is then read back, again one element at a time. I tested both HDF5's fixed-size integer and variable length datatypes. For comparison, I repeated the test with analogous fread()/fwrite() calls on a plain binary file. All reads and writes are done sequentially on a single core.
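To make the write pattern concrete, the sketch below grows a chunked, extendable dataset one element at a time with the HDF5 C++ API. It is a minimal illustration under assumed values (file name, element count, chunk size, and the use of a 1 MB unsigned char array as the element type), not the actual test scripts.

#include "H5Cpp.h"
#include <vector>

int main()
{
    // Assumed parameters for illustration (scaled down from the real test).
    const hsize_t element_bytes = 1024 * 1024;   // 1 MB per element
    const hsize_t n_elements    = 1024;          // number of appends
    const hsize_t chunk_elems   = 16;            // 16-element chunks

    H5::H5File file("append_test.h5", H5F_ACC_TRUNC);

    // One-dimensional dataset with unlimited extent so it can be appended to.
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    H5::DataSpace filespace(1, dims, maxdims);

    // Chunking is required for extendable datasets.
    H5::DSetCreatPropList cparms;
    hsize_t chunk_dims[1] = {chunk_elems};
    cparms.setChunk(1, chunk_dims);

    // Fixed-size element: a 1 MB array of unsigned chars. (The variable length
    // case would instead use H5::VarLenType and write hvl_t handles.)
    hsize_t type_dims[1] = {element_bytes};
    H5::ArrayType elem_type(H5::PredType::NATIVE_UCHAR, 1, type_dims);

    H5::DataSet dset = file.createDataSet("data", elem_type, filespace, cparms);

    std::vector<unsigned char> buf(element_bytes, 0);
    hsize_t one[1] = {1};
    H5::DataSpace memspace(1, one);

    for (hsize_t i = 0; i < n_elements; ++i) {
        // Grow the dataset by one element and write into the new slot.
        hsize_t new_size[1] = {i + 1};
        dset.extend(new_size);

        H5::DataSpace fspace = dset.getSpace();
        hsize_t offset[1] = {i};
        fspace.selectHyperslab(H5S_SELECT_SET, one, offset);
        dset.write(buf.data(), elem_type, memspace, fspace);
    }
    return 0;
}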

Below I plot histograms of the measured speeds for 1024-element and 16-element chunking. In both cases the chunk cache was set to twice the size of a chunk.

The HDF5 group indicated that variable length performance would be poor. However, I see it outperform fixed size in most cases: variable length writing is significantly faster for both chunking cases, as is reading 1024-element chunked data. Fixed-size reading for 16-element chunks is about 10% faster than variable length.

 

 

 

[Figures: histograms of read/write speeds for 16-element and 1024-element chunking]

Chunk caching

The read performance is 20-50% lower than what we see from dd (~600 MB/s), so there is room for improvement.

The HDF5 group notes that chunk caching can affect performance. For our usage pattern (sequential read/write), however, it may not help:

It is important to remember that chunk caching will only give a benefit when reading or writing the same chunk more than once. If, for example, an application is reading an entire dataset, with only whole chunks selected for each operation, then chunk caching will not help performance, and it may be preferable to completely disable the chunk cache in order to save memory. It may also be advantageous to disable the chunk cache when writing small amounts to many different chunks, if memory is not large enough to hold all those chunks in cache at once.

https://support.hdfgroup.org/HDF5/doc/_topic/Chunking/

 

The HDF5 C++ API sets the chunk cache with the following property list method:

void H5::FileAccPropList::setCache(int mdc_nelmts,
                                   size_t rdcc_nelmts,
                                   size_t rdcc_nbytes,
                                   double rdcc_w0) const

https://support.hdfgroup.org/HDF5/doc/cpplus_RM/class_h5_1_1_file_acc_prop_list.html#a0a8c753e6d36ea936a0095b9d935d35

 

where the defaults are (0, 521, 1048576 bytes, 0.75). I changed those parameters to (0, 2 × chunk_size, 2 × chunk_size bytes, 1). These plots compare the read speed with the default settings (top) to the read speed with the increased cache size. In some cases the read speed increased (16-element fixed, 16-element variable, 1024-element variable) while it decreased for 1024-element fixed.
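As a concrete illustration, the snippet below applies settings of that form through a file access property list before opening the file for reading. It is a sketch under assumed values (a 16-element chunk of 1 MB elements and a hypothetical file name), not the exact configuration used in the tests.

#include "H5Cpp.h"

int main()
{
    // Assumed chunk geometry: 16 elements of 1 MB each.
    const size_t chunk_elems = 16;
    const size_t chunk_bytes = chunk_elems * 1024 * 1024;

    // Raw data chunk cache set to twice the chunk size; rdcc_w0 = 1.0 tells
    // HDF5 to evict fully read/written chunks first. mdc_nelmts is ignored
    // in HDF5 1.8 and later.
    H5::FileAccPropList fapl;
    fapl.setCache(0, 2 * chunk_elems, 2 * chunk_bytes, 1.0);

    // The access property list takes effect when the file is opened.
    H5::H5File file("append_test.h5", H5F_ACC_RDONLY,
                    H5::FileCreatPropList::DEFAULT, fapl);

    // ... read the dataset back one element at a time ...
    return 0;
}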

[Figures: read speed with default cache settings (top) vs. increased cache size]

 

Overall, I find that variable length data is slightly faster (~5%) for writing and significantly slower (~20%) for reading compared to fixed length. HDF5 is slower than the binary files in all cases, by 20-50%.
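For reference, the binary-file baseline amounts to a plain sequential append and read-back with buffered stdio. A minimal sketch of such a loop is shown below; the file name and sizes are assumptions, not taken from the actual test scripts.

#include <cstdio>
#include <vector>

int main()
{
    // Assumed parameters matching the HDF5 sketch above.
    const size_t element_bytes = 1024 * 1024;   // 1 MB per element
    const size_t n_elements    = 1024;

    std::vector<unsigned char> buf(element_bytes, 0);

    // Write: append one element per fwrite() call.
    std::FILE* out = std::fopen("append_test.bin", "wb");
    for (size_t i = 0; i < n_elements; ++i)
        std::fwrite(buf.data(), 1, element_bytes, out);
    std::fclose(out);

    // Read back: one element per fread() call.
    std::FILE* in = std::fopen("append_test.bin", "rb");
    while (std::fread(buf.data(), 1, element_bytes, in) == element_bytes) {
        // ... process the element ...
    }
    std::fclose(in);
    return 0;
}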
